Re: make MaxBackends available in _PG_init
On 1/25/22, 12:01 AM, "Michael Paquier" wrote:
> So, where are we on this patch?  It looks like there is an agreement
> that MaxBackends is used widely enough that it justifies the use of a
> separate function to set and get a better value computed.  There may
> be other parameters that could use a brush up, but most known cases
> would be addressed here.  v4 looks rather straight-forward, at quick
> glance.

I think the patch is in decent shape.  There may be a few remaining
places where GetMaxBackends() is called repeatedly in the same
function, but IIRC v4 already clears up the obvious ones.  I don't
know if this is worth worrying about too much, but I can create a new
version if you think it is important.

Nathan
Re: Correct error message for end-of-recovery record TLI
On 1/24/22, 8:42 PM, "Amul Sul" wrote:
> On Tue, Jan 25, 2022 at 10:08 AM Michael Paquier wrote:
>>
>> On Wed, Dec 01, 2021 at 07:09:34PM +, Bossart, Nathan wrote:
>> > The patch no longer applied, so I went ahead and rebased it.
>>
>> This was on the CF stack for some time, so applied.  I have also
>> changed the messages produced for the shutdown and online checkpoint
>> records as they used the same messages, so one can get more context
>> depending on the record types.
>
> A ton of thanks for the improvement & the commit.

+1, thanks!

Nathan
Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas
On 1/22/22, 4:43 PM, "Tomas Vondra" wrote:
> There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea,
> which may cause failures when starting a replica, making it unusable.
> The commit message for 8431e296ea is not very clear about what exactly
> is being done and why, but the root cause is that while processing
> RUNNING_XACTS, the XIDs are sorted like this:
>
>     /*
>      * Sort the array so that we can add them safely into
>      * KnownAssignedXids.
>      */
>     qsort(xids, nxids, sizeof(TransactionId), xidComparator);
>
> where "safely" likely means "not violating the ordering expected by
> KnownAssignedXidsAdd".  Unfortunately, xidComparator compares the values
> as plain uint32 values, while KnownAssignedXidsAdd actually calls
> TransactionIdFollowsOrEquals() and compares the logical XIDs :-(

Wow, nice find.

> This likely explains why we never got any reports about this - most
> systems probably don't leave transactions running for this long, so the
> probability is much lower.  And replica restarts are generally not that
> common events either.

I'm aware of one report with the same message [0], but I haven't read
closely enough to determine whether it is the same issue.  It looks
like that particular report was attributed to backup_label being
removed.

> Attached patch is fixing this by just sorting the XIDs logically.  The
> xidComparator is meant for places that can't do logical ordering.  But
> these XIDs come from RUNNING_XACTS, so they actually come from the same
> wraparound epoch (so sorting logically seems perfectly fine).

The patch looks reasonable to me.

Nathan

[0] https://postgr.es/m/1476795473014.15979.2188%40webmail4
Re: Catalog version access
Thanks for taking a look!

On 1/23/22, 7:31 PM, "Michael Paquier" wrote:
> On Mon, Aug 16, 2021 at 06:12:54PM +0000, Bossart, Nathan wrote:
>> I was looking at the --check switch for the postgres binary recently
>> [0], and this sounds like something that might fit in nicely there.
>> In the attached patch, --check will also check the control file if one
>> exists.
>
> This would not work on a running postmaster as CreateDataDirLockFile()
> is called beforehand, but we want this capability, no?

I was not under the impression this was a requirement, based on the
use-case discussed upthread [0].

> Abusing a bootstrap option for this purpose does not look like a good
> idea, to be honest, especially for something only used internally now
> and undocumented, but we want to use something aimed at an external
> use with some documentation.  Using a separate switch would be more
> adapted IMO.

This is the opposite of what Magnus proposed earlier [1].  Do we need
a new pg_ctl option for this?  It seems like it is really only
intended for use by PostgreSQL developers, but perhaps there are other
use-cases I am not thinking of.  In any case, the pg_ctl option would
probably end up using --check (or another similar mode) behind the
scenes.

> Also, I think that we should be careful with the read of
> the control file to avoid false positives?  We can rely on an atomic
> read/write thanks to its maximum size of 512 bytes, but this looks
> like a lot what has been done recently with postgres -C for runtime
> GUCs, that *require* a read of the control file before grabbing the
> values we are interested in.

Sorry, I'm not following this one.  In my proposed patch, the control
file (if one exists) is read after CreateDataDirLockFile(), just like
PostmasterMain().

Nathan

[0] https://postgr.es/m/20210222022407.ecaygvx2ise6uwyz%40alap3.anarazel.de
[1] https://postgr.es/m/CABUevEySovaEDci7c0DXOrV6c7JzWqYzfVwOiRUJxMao%3D9seEw%40mail.gmail.com
Re: do only critical work during single-user vacuum?
On 1/21/22, 2:43 PM, "John Naylor" wrote:
> - to have a simple, easy to type, command

AFAICT the disagreement is really just about the grammar.  Sawada-san's
idea would look something like

    VACUUM (FREEZE, INDEX_CLEANUP OFF, MIN_XID_AGE 16, MIN_MXID_AGE 16);

while your proposal looks more like

    VACUUM (WRAPAROUND);

The former is highly configurable, but it is probably annoying to type
at 3 AM, and the interaction between the two *_AGE options is not
exactly intuitive (although I expect MIN_XID_AGE to be sufficient in
most cases).  The latter is not as configurable, but it is much easier
to type at 3 AM.

I think simplicity is a good goal, but I don't know if the difference
between the two approaches outweighs the benefits of configurability.
If you are in an emergency situation, you already will have to take
down the server, connect in single-user mode to the database(s) that
need vacuuming, and actually do the vacuuming.  The wraparound
WARNING/ERROR already has a HINT that describes the next steps
required.  Perhaps it would be enough to also emit an example VACUUM
command to use.

I think folks will find the configurability useful, too.  With
MIN_XID_AGE, it's super easy to have pg_cron vacuum everything over
500M on the weekend (and also do index cleanup), which may allow you
to use more relaxed autovacuum settings during the week.  The docs
already have suggestions for manually vacuuming when the load is low
[0], so I think it is reasonable to build additional support for this
use-case.

Nathan

[0] https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Thanks for your feedback.

On 1/20/22, 11:47 AM, "Andres Freund" wrote:
> It seems odd to change a bunch of things to not be errors anymore, but then
> add new sources of errors.  If we try to deal with concurrent deletions or
> permission issues - otherwise what's the point of making unlink() not an error
> anymore - why do we expect to be able to lstat()?

My reasoning for making lstat() an ERROR was that there's a chance we
need to fsync() the file, and if we can't fsync() a file for whatever
reason, we definitely want to ERROR.  I suppose we could conditionally
ERROR based on the file name, but I don't know if that's really worth
the complexity.

> I'd be more on board accepting some selective errors.  E.g. not erroring on
> ENOENT, but continuing to error on others (most likely ENOACCESS).  I think we
> should *not* change the

I think this approach would work for the use-case Bharath mentioned
upthread.  In any case, if deleting a file fails because the file was
already deleted, there's no point in ERROR-ing.  I think filtering
errors is a bit trickier for lstat().  If we would've fsync'd the file
but lstat() gives us ENOENT, we may have a problem.  (However, there's
also a good chance we wouldn't notice such problems if the race didn't
occur.)  I'll play around with it.

>> +    /*
>> +     * We just log a message if a file doesn't fit the pattern, it's
>> +     * probably some editor's lock/state file or similar...
>> +     */
>
> An editor's lock file that starts with map- would presumably be the whole
> filename plus an additional file-ending.  But this check won't catch those.

Right, it will either fsync() or unlink() those.  I'll work on the
comment.  Or do you think it's worth validating that there are no
trailing characters?  I looked into that a bit earlier, and the code
felt excessive to me, but I don't have a strong opinion here.

>> +     * Unlike failures to unlink() old logical rewrite files, we must
>> +     * ERROR if we're unable to sync transient state down to disk.  This
>> +     * allows replay to assume that everything written out before
>> +     * checkpoint start is persisted.
>> +     */
>
> Hm, not quite happy with the second bit.  Checkpointed state being durable
> isn't about replay directly, it's about the basic property of a checkpoint
> being fulfilled?

I'll work on this.  FWIW I modeled this based on the comment above the
function.  Do you think that should be adjusted as well?

>> +    /* persist directory entries to disk */
>> +    fsync_fname("pg_logical/mappings", true);
>
> This is probably worth backpatching, if you split it out

Sure thing, will do shortly.

Nathan
Re: Document atthasmissing default optimization avoids verification table scan
On 1/19/22, 5:15 PM, "James Coleman" wrote:
> I'm open to the idea of wordsmithing here, of course, but I strongly
> disagree that this is irrelevant data.  There are plenty of
> optimizations Postgres could theoretically implement but doesn't, so
> measuring what should happen by what you think is obvious ("told it to
> populate with default values - why do you need to check") is clearly
> not valid.
>
> This patch actually came out of our specifically needing to verify
> that this is true before an op precisely because docs aren't actually
> clear and because we can't risk a large table scan under an exclusive
> lock.  We're clearly not the only ones with that question; it came up
> in a comment on this blog post announcing the newly committed feature
> [1].

My initial reaction was similar to David's.  It seems silly to
document that we don't do something that seems obviously unnecessary.
However, I think you make a convincing argument for including it.  I
agree with David's feedback on where this information should go.

Nathan
Re: Add checkpoint and redo LSN to LogCheckpointEnd log message
On 1/3/22, 5:52 PM, "Kyotaro Horiguchi" wrote:
> It seems to me "LSN" or just "location" is more confusing or
> mysterious than "REDO LSN" for the average user.  If we want to avoid
> being technically too detailed, we would use just "start LSN=%X/%X,
> end LSN=%X/%X".  And it is equivalent to "WAL range=[%X/%X, %X/%X]"..

My first instinct was that this should stay aligned with
pg_controldata, but that would mean using "location=%X/%X, REDO
location=%X/%X," which doesn't seem terribly descriptive.  IIUC the
"checkpoint location" is the LSN of the WAL record for the checkpoint,
and the "checkpoint's REDO location" is the LSN where checkpoint
creation began (i.e., what you must retain for crash recovery).  My
vote is for "start=%X/%X, end=%X/%X."

Nathan
Re: Document atthasmissing default optimization avoids verification table scan
On 9/24/21, 7:30 AM, "James Coleman" wrote:
> When PG11 added the ability for ALTER TABLE ADD COLUMN to set a constant
> default value without rewriting the table the doc changes did not note
> how the new feature interplayed with ADD COLUMN DEFAULT NOT NULL.
> Previously such a new column required a verification table scan to
> ensure no values were null.  That scan happens under an exclusive lock on
> the table, so it can have a meaningful impact on database "accessible
> uptime".

I'm likely misunderstanding, but are you saying that adding a new
column with a default value and a NOT NULL constraint used to require
a verification scan?

+     Additionally adding a column with a constant default value avoids a
+     a table scan to verify no NULL values are present.

Should this clarify that it's referring to NOT NULL constraints?

Nathan
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On 1/19/22, 11:08 AM, "Andres Freund" wrote:
> On 2022-01-19 13:34:21 -0500, Tom Lane wrote:
>> As far as the patch itself goes, I agree that failure to unlink
>> is noncritical, because such a file would have no further effect
>> and we can just ignore it.
>
> I don't agree.  We iterate through the directory regularly on systems with
> catalog changes + logical decoding.  An ever increasing list of gunk will make
> that more and more expensive.  And I haven't heard a meaningful reason why we
> would have map-* files that we can't remove.

I think the other side of this is that we don't want checkpointing to
continually fail because of a noncritical failure.  That could also
lead to problems down the road.

> Ignoring failures like this just makes problems much harder to debug and they
> tend to bite harder for it.

If such noncritical failures happened regularly, the server logs would
likely become filled with messages about it.  Perhaps users may not
notice for a while, but I don't think the proposed patch would make
debugging excessively difficult.

Nathan
Re: do only critical work during single-user vacuum?
On 1/19/22, 11:15 AM, "John Naylor" wrote:
> This seems to be the motivating reason for wanting new configurability
> on the server side.  In any case, new knobs are out of scope for this
> thread.  If the use case is compelling enough, may I suggest starting a
> new thread?

Sure.  Perhaps the new top-level command will use these new options
someday.

> Regarding the thread subject, I've been playing with the grammar, and
> found it's quite easy to have
>
>     VACUUM FOR WRAPAROUND
> or
>     VACUUM FOR EMERGENCY
>
> since FOR is a reserved word (and following that can be an IDENT plus
> a strcmp check) and cannot conflict with table names.  This sounds a
> bit more natural than VACUUM LIMIT.  Opinions?

I personally think VACUUM FOR WRAPAROUND is the best of the options
provided thus far.

Nathan
Re: do only critical work during single-user vacuum?
On 1/18/22, 9:47 PM, "Masahiko Sawada" wrote:
> IIUC what we want to do here are two things: (1) select only old
> tables and (2) set INDEX_CLEANUP = off, TRUNCATE = off, and FREEZE =
> on.  VACUUM LIMIT statement does both things at the same time.  Although
> I’m concerned a bit about its flexibility, it’s a reasonable solution.
>
> On the other hand, it’s probably also useful to do either one thing in
> some cases.  For instance, having a selector for (1) would be useful,
> and having a new option like FAST_FREEZE for (2) would also be useful.
> Given there is already a way for (2) (it does not default though), I
> think it might also be a good start inventing something for (1).  For
> instance, a selector for VACUUM statement I came up with is:
>
>     VACUUM (verbose on) TABLES WITH (min_xid_age = 16);
> or
>     VACUUM (verbose on) TABLES WITH (min_age = failsafe_limit);
>
> We can expand it in the future to select tables by, for example, dead
> tuple ratio, size, etc.
>
> It's a random thought but maybe worth considering.

That's an interesting idea.  A separate selector clause could also
allow users to choose how the selectors interact (e.g., should the
options be OR'd or AND'd).

Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 1/14/22, 11:26 PM, "Bharath Rupireddy" wrote:
> On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan wrote:
>> I'd personally like to avoid creating two code paths for the same
>> thing.  Are there really cases when this one extra auxiliary process
>> would be too many?  And if so, how would a user know when to adjust
>> this GUC?  I understand the point that we should introduce new
>> processes sparingly to avoid burdening low-end systems, but I don't
>> think we should be afraid to add new ones when it is needed.
>
> IMO, having a GUC for enabling/disabling this new worker and its
> related code would be a better idea.  The reason is that if the
> postgres has no replication slots at all (which is quite possible in
> real stand-alone production environments) or if the file enumeration
> (directory traversal and file removal) is fast enough on the servers,
> there's no point having this new worker; the checkpointer itself can
> take care of the work as it is doing today.

IMO introducing a GUC wouldn't be doing users many favors.  Their
cluster might work just fine for a long time before they begin
encountering problems during startups/checkpoints.  Once the user
discovers the underlying reason, they have to then find a GUC for
enabling a special background worker that makes this problem go away.
Why not just fix the problem for everybody by default?

I've been thinking about what other approaches we could take besides
creating more processes.  The root of the problem seems to be that
there are a number of tasks that are performed synchronously that can
take a long time.  The process approach essentially makes these tasks
asynchronous so that they do not block startup and checkpointing.  But
perhaps this can be done in an existing process, possibly even the
checkpointer.  Like the current WAL pre-allocation patch, we could do
this work when the checkpointer isn't checkpointing, and we could also
do small amounts of work in CheckpointWriteDelay() (or a new function
called in a similar way).  In theory, this would help avoid delaying
checkpoints too long while doing cleanup at every opportunity to lower
the chances it falls far behind.

Nathan
Re: docs: pg_replication_origin_oid() description does not match behaviour
On 1/17/22, 5:24 PM, "Michael Paquier" wrote:
> On Tue, Jan 18, 2022 at 10:19:41AM +0900, Ian Lawrence Barwick wrote:
>> Given that the code has remained unchanged since the function was
>> introduced in 9.5, it seems reasonable to change the documentation
>> to match the function behaviour rather than the other way round.
>
> Obviously.  I'll go fix that as you suggest, if there are no
> objections.

+1

Nathan
Re: Add sub-transaction overflow status in pg_stat_activity
On 1/14/22, 8:26 AM, "Tom Lane" wrote:
> Julien Rouhaud writes:
>> Like many I previously had to investigate a slowdown due to sub-transaction
>> overflow, and even with the information available in a monitoring view (I had
>> to rely on a quick hackish extension as I couldn't patch postgres) it's not
>> terribly fun to do this way.  On top of that log analyzers like pgBadger
>> could help to highlight such a problem.
>
> It feels to me like far too much effort is being invested in fundamentally
> the wrong direction here.  If the subxact overflow business is causing
> real-world performance problems, let's find a way to fix that, not put
> effort into monitoring tools that do little to actually alleviate anyone's
> pain.

+1

An easy first step might be to increase PGPROC_MAX_CACHED_SUBXIDS and
NUM_SUBTRANS_BUFFERS.  This wouldn't be a long-term solution to all
such performance problems, and we'd still probably want the proposed
monitoring tools, but maybe it'd kick the can down the road a bit.
Perhaps another improvement could be to store the topmost transaction
along with the parent transaction in the subtransaction log to avoid
the loop in SubTransGetTopmostTransaction().

Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 1/14/22, 3:43 AM, "Maxim Orlov" wrote:
> The code seems to be in good condition.  All the tests are running ok
> with no errors.

Thanks for your review.

> I like the whole idea of shifting additional checkpointer jobs as much
> as possible to another worker.  In my view, it is more appropriate to
> call this worker "bg cleaner" or "bg file cleaner" or smth.
>
> It could be useful for systems with high load, which may deal with
> deleting many files at once, but I'm not sure about "small"
> installations.  Extra bg worker need more resources to do occasional
> deletion of small amounts of files.  I really do not know how to do it
> better, maybe to have two different code paths switched by GUC?

I'd personally like to avoid creating two code paths for the same
thing.  Are there really cases when this one extra auxiliary process
would be too many?  And if so, how would a user know when to adjust
this GUC?  I understand the point that we should introduce new
processes sparingly to avoid burdening low-end systems, but I don't
think we should be afraid to add new ones when it is needed.  That
being said, if making the extra worker optional addresses the concerns
about resource usage, maybe we should consider it.  Justin suggested
using something like max_parallel_maintenance_workers upthread [0].

> Should we also think about adding WAL preallocation into custodian
> worker from the patch "Pre-alocationg WAL files" [1] ?

This was brought up in the pre-allocation thread [1].  I don't think
the custodian process would be the right place for it, and I'm also
not as concerned about it because it will generally be a small, fixed,
and configurable amount of work.  In any case, I don't sense a ton of
support for a new auxiliary process in this thread, so I'm hesitant to
go down the same path for pre-allocation.

Nathan

[0] https://postgr.es/m/20211213171935.GX17618%40telsasoft.com
[1] https://postgr.es/m/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254%40amazon.com
Re: Add sub-transaction overflow status in pg_stat_activity
Thanks for the new patch!

+
+Returns a record of information about the backend's subtransactions.
+The fields returned are subxact_count identifies
+number of active subtransaction and subxact_overflow
+ shows whether the backend's subtransaction cache is
+overflowed or not.
+

nitpick: There is an extra "" here.

Would it be more accurate to say that subxact_count is the number of
subxids that are cached?  It can only ever go up to
PGPROC_MAX_CACHED_SUBXIDS.

Nathan
Re: do only critical work during single-user vacuum?
On 1/13/22, 4:58 AM, "John Naylor" wrote:
> On Wed, Jan 12, 2022 at 12:26 PM Bossart, Nathan wrote:
>> As I've stated upthread, Sawada-san's suggested approach was my
>> initial reaction to this thread.  I'm not wedded to the idea of adding
>> new options, but I think there are a couple of advantages.  For both
>> single-user mode and normal operation (which may be in imminent
>> wraparound danger), you could use the same command:
>>
>>     VACUUM (MIN_XID_AGE 16, ...);
>
> My proposed top-level statement can also be used in normal operation,
> so the only possible advantage is configurability.  But I don't really
> see any advantage in that -- I don't think we should be moving in the
> direction of adding more-intricate ways to paper over the deficiencies
> in autovacuum scheduling.  (It could be argued that I'm doing exactly
> that in this whole thread, but [imminent] shutdown situations have
> other causes besides deficient scheduling.)

The new top-level command would be configurable, right?  Your patch
uses autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age, so
the behavior of this new command now depends on the values of
parameters that won't obviously be related to it.  If these parameters
are set very low (e.g., the default values), then this command will
end up doing far more work than is probably necessary.  If we did go
the route of using a parameter to determine which tables to vacuum, I
think vacuum_failsafe_age is a much better candidate, as it defaults
to a much higher value that is more likely to prevent doing extra
work.  That being said, I don't know if overloading parameters is the
right way to go.

>> (As an aside, we'd need to figure out how XID and MXID options would
>> work together.  Presumably most users would want to OR them.)
>>
>> This doesn't really tie in super nicely with the failsafe mechanism,
>> but adding something like a FAILSAFE option doesn't seem right to me,
>
> I agree -- it would be awkward and messy as an option.  However, I see
> the same problem with xid/mxid -- I would actually argue they are not
> even proper options; they are "selectors".  Your comments above about
> 1) needing to OR them and 2) emitting a message when a VACUUM command
> doesn't actually do anything are evidence of that fact.

That's a fair point.  But I don't think these problems are totally
intractable.  We already emit "skipping" messages from VACUUM
sometimes, and interactions between VACUUM options exist today, too.
For example, FREEZE is redundant when FULL is specified, and
INDEX_CLEANUP is totally ignored when FULL is used.

>> The other advantage I see with age-related options is that it can be
>> useful for non-imminent-wraparound situations as well.  For example,
>> maybe a user just wants to manually vacuum everything (including
>> indexes) with an age above 500M on the weekends.
>
> There is already vacuumdb for that, and I think its method of
> selecting tables is sound -- I'm not convinced that pushing table
> selection to the server command as "options" is an improvement.

I guess I'm ultimately imagining the new options as replacing the
vacuumdb implementation.  IOW vacuumdb would just use MIN_(M)XID_AGE
behind the scenes (as would a new top-level command).

Nathan
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On 1/12/22, 10:03 PM, "Bharath Rupireddy" wrote:
> On Thu, Jan 13, 2022 at 3:47 AM Bossart, Nathan wrote:
>> The only feedback I have for the patch is that I don't think the new
>> comments are necessary.
>
> I borrowed the comments as-is from the CheckPointSnapBuild introduced
> by the commit b89e15105.  IMO, let the comments be there as they
> explain why we are not emitting ERRORs, however I will leave it to the
> committer to decide on that.

Huh, somehow I missed that when I looked at it yesterday.  I'm going
to bump this one to ready-for-committer, then.

Nathan
Re: parse/analyze API refactoring
On 12/28/21, 8:25 AM, "Peter Eisentraut" wrote:
> (The "withcb" naming maybe isn't great; better ideas welcome.)

FWIW I immediately understood that this meant "with callback," so it
might be okay.

> Not included in this patch set, but food for further thought: The
> pg_analyze_and_rewrite_*() functions aren't all that useful (anymore).
> One might as well write
>
>     pg_rewrite_query(parse_analyze_xxx(...))

I had a similar thought while reading through the patches.  If further
deduplication isn't too much trouble, I'd vote for that.

Nathan
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On 12/31/21, 4:44 AM, "Bharath Rupireddy" wrote:
> Currently the server is erroring out when unable to remove/parse a
> logical rewrite file in CheckPointLogicalRewriteHeap, wasting the
> amount of work the checkpoint has done and preventing the checkpoint
> from finishing.  This is unlike what CheckPointSnapBuild does for
> snapshot files, i.e. it just emits a message at LOG level and
> continues if it is unable to parse or remove the file.  Attaching a
> small patch applying the same idea to the mapping files.

This seems reasonable to me.  AFAICT moving on to other files after an
error shouldn't cause any problems.  In fact, it's probably beneficial
to try to clean up as much as possible so that the files do not
continue to build up.  The only feedback I have for the patch is that
I don't think the new comments are necessary.

Nathan
Re: Add index scan progress to pg_stat_progress_vacuum
On 1/11/22, 11:46 PM, "Masahiko Sawada" wrote:
> Regarding the new pg_stat_progress_vacuum_index view, why do we need
> to have a separate view?  Users will have to check two views.  If this
> view is expected to be used together with and joined to
> pg_stat_progress_vacuum, why don't we provide one view that has full
> information from the beginning?  Especially, I think it's not useful
> that the total number of indexes to vacuum (num_indexes_to_vacuum
> column) and the current number of indexes that have been vacuumed
> (index_ordinal_position column) are shown in separate views.

I suppose we could add all of the new columns to
pg_stat_progress_vacuum and just set columns to NULL as appropriate.
But is that really better than having a separate view?

> Also, I’m not sure how useful index_tuples_removed is; what can we
> infer from this value (without a total number)?

I think the idea was that you can compare it against max_dead_tuples
and num_dead_tuples to get an estimate of the current cycle progress.
Otherwise, it just shows that progress is being made.

Nathan
Re: do only critical work during single-user vacuum?
On 1/12/22, 7:43 AM, "John Naylor" wrote:
> On Wed, Jan 12, 2022 at 1:49 AM Masahiko Sawada wrote:
>> As another idea, we might be able to add a new option that takes an
>> optional integer value, like VACUUM (MIN_XID), VACUUM (MIN_MXID), and
>> VACUUM (MIN_XID 50).  We vacuum only tables whose age is older than
>> the given value.  If the value is omitted, we vacuum only tables whose
>> age exceeds a threshold (say autovacuum_freeze_max_age * 0.95), which
>> can be used in an emergency case and output in GetNewTransactionID()
>> WARNINGs output.  vacuumdb’s --min-xid-age and --min-mxid-age can use
>> this option instead of fetching the list of tables from the server.
>
> That could work, and maybe also have general purpose, but I see two
> problems if I understand you correctly:
>
> - If we have a default threshold when the values are omitted, that
> implies we need to special-case single-user mode with non-obvious
> behavior, which is not ideal, as Andres mentioned upthread.  (Or, now
> manual VACUUM by default would not do anything, except in extreme
> cases, which is worse.)

I agree, I don't think such options should have a default value.

> - In the single-user case, the admin would still need to add
> INDEX_CLEANUP = off for minimum downtime, and it should be really
> simple.
>
> - For the general case, we would now have the ability to vacuum a
> table, and possibly have no effect at all.  That seems out of place
> with the other options.

Perhaps a message would be emitted when tables are specified but
skipped due to the min-xid-age option.

As I've stated upthread, Sawada-san's suggested approach was my
initial reaction to this thread.  I'm not wedded to the idea of adding
new options, but I think there are a couple of advantages.  For both
single-user mode and normal operation (which may be in imminent
wraparound danger), you could use the same command:

    VACUUM (MIN_XID_AGE 16, ...);

(As an aside, we'd need to figure out how XID and MXID options would
work together.  Presumably most users would want to OR them.)

This doesn't really tie in super nicely with the failsafe mechanism,
but adding something like a FAILSAFE option doesn't seem right to me,
as it's basically just an alias for a bunch of other options.  In my
mind, even a new top-level command would just be an alias for the
aforementioned command.

Of course, providing a new option is not quite as simple as opening up
single-user mode and typing "BAIL OUT," but I don't know if it is
prohibitively complicated for end users.  They'll already have had to
figure out how to start single-user mode in the first place, and we
can have nice ERROR/WARNING messages that provide a suggested VACUUM
command.

The other advantage I see with age-related options is that it can be
useful for non-imminent-wraparound situations as well.  For example,
maybe a user just wants to manually vacuum everything (including
indexes) with an age above 500M on the weekends.

Another idea is to do both.  We could add age-related options, and we
could also add a "BAIL OUT" command that is just an alias for a
special VACUUM command that we feel will help get things under control
as quickly as possible.

Nathan
Re: Add index scan progress to pg_stat_progress_vacuum
On 1/11/22, 12:33 PM, "Imseih (AWS), Sami" wrote: > What about something like "The number of indexes that are eligible for > vacuuming". > This covers the cases where either an individual index is skipped or the > entire "index vacuuming" phase is skipped. Hm. I don't know if "eligible" is the right word. An index can be eligible for vacuuming but skipped because we set INDEX_CLEANUP to false. Maybe we should just stick with "The number of indexes that will be vacuumed." The only thing we may want to clarify is whether this value will change in some cases (e.g., vacuum failsafe takes effect). Nathan
Re: Add jsonlog log_destination for JSON server logs
On 1/10/22, 4:51 AM, "Michael Paquier" wrote: > The issue comes from an incorrect change in syslogger_parseArgs() > where I missed that the incrementation of argv by 3 has no need to be > changed. A build with -DEXEC_BACKEND is enough to show the failure, > which caused a crash when starting up the syslogger because of a NULL > pointer dereference. The attached v9 should be enough to switch the > CF bot to green. I've been looking at the latest patch set intermittently and playing around with jsonlog a little. It seems to work well, and I don't have any significant comments about the code. 0001 and 0002 seem straightforward and uncontroversial. IIUC 0003 simply introduces jsonlog using the existing framework. I wonder if we should consider tracking each log destination as a set of function pointers. The main logging code would just loop through the enabled log destinations and use these functions, and it otherwise would be completely detached (i.e., no "if jsonlog" blocks). This might open up the ability to define custom log destinations via modules, too. However, I don't know if there's any real demand for something like this, and it should probably be done separately from introducing jsonlog, anyway. Nathan
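For illustration, a toy sketch of the function-pointer idea (standalone C, every name invented; this is not the actual elog infrastructure): each destination supplies an emit callback, and the logging path loops over whichever destinations are enabled instead of branching on each one.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Invented structure: one entry per log destination. */
typedef struct LogDestination
{
	const char *name;
	bool		enabled;
	void		(*emit) (const char *msg, char *buf, size_t buflen);
} LogDestination;

static void
emit_plain(const char *msg, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "LOG:  %s", msg);
}

static void
emit_json(const char *msg, char *buf, size_t buflen)
{
	snprintf(buf, buflen, "{\"message\":\"%s\"}", msg);
}

static LogDestination destinations[] = {
	{"stderr", true, emit_plain},
	{"jsonlog", true, emit_json},
};

/* The main logging path just walks the table; no "if jsonlog" blocks. */
static int
dispatch_log(const char *msg, char out[][128], int maxout)
{
	int			n = 0;

	for (size_t i = 0; i < sizeof(destinations) / sizeof(destinations[0]); i++)
	{
		if (destinations[i].enabled && n < maxout)
			destinations[i].emit(msg, out[n++], 128);
	}
	return n;
}
```

Whether there is real demand for pluggable destinations is a separate question, but a table-driven shape like this is what would make loadable ones possible.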
Re: improve CREATE EXTENSION error message
On 1/11/22, 11:23 AM, "Tom Lane" wrote: > "Bossart, Nathan" writes: >> Okay, the message looks like this in v5: > >> postgres=# CREATE EXTENSION does_not_exist; >> ERROR: extension "does_not_exist" is not available >> DETAIL: Could not open extension control file >> "/usr/local/pgsql/share/extension/does_not_exist.control": No such file or >> directory. >> HINT: The extension must first be installed on the system where >> PostgreSQL is running. > > Nobody complained about that wording, so pushed. Thanks! Nathan
Re: Add index scan progress to pg_stat_progress_vacuum
On 1/10/22, 5:01 PM, "Imseih (AWS), Sami" wrote:
> I have attached the 3rd revision of the patch which also includes the
> documentation changes. Also attached is a rendered html of the docs for
> review.
>
> "max_index_vacuum_cycle_time" has been removed.
> "index_rows_vacuumed" renamed to "index_tuples_removed". "tuples" is
> more consistent with the terminology used.
> "vacuum_cycle_ordinal_position" renamed to "index_ordinal_position".

Thanks for the new version of the patch!

nitpick: I get one whitespace error when applying the patch.

    Applying: Expose progress for the "vacuuming indexes" phase of a VACUUM operation.
    .git/rebase-apply/patch:44: tab in indent.
            Whenever is triggered, index
    warning: 1 line adds whitespace errors.

+
+      num_indexes_to_vacuum bigint
+
+      The number of indexes that will be vacuumed. Only indexes with
+      pg_index.indisready set to "true" will be vacuumed.
+      Whenever is triggered, index
+      vacuuming will be bypassed.
+

We may want to avoid exhaustively listing the cases when this value
will be zero. I would suggest saying, "When index cleanup is skipped,
this value will be zero" instead.

+
+      relid oid
+
+      OID of the table being vacuumed.
+

Do we need to include this field? I would expect indexrelid to go here.

+
+      leader_pid bigint
+
+      Process ID of the parallel group leader. This field is NULL
+      if this process is a parallel group leader or the
+      vacuuming indexes phase is not performed in parallel.
+

Are there cases where the parallel group leader will have an entry in
this view when parallelism is enabled?

+
+      index_ordinal_position bigint
+
+      The order in which the index is being vacuumed. Indexes are
+      vacuumed by OID in ascending order.
+

Should we include the bit about the OID ordering? I suppose that is
unlikely to change in the near future, but I don't know if it is
relevant information. Also, do we need to include the "index_" prefix?
This view is specific for indexes.
(I have the same question for index_tuples_removed.)

Should this new table go after the "VACUUM phases" table? It might make
sense to keep the phases table closer to where it is referenced.

+/* Advertise the number of indexes to vacuum if we are not in failsafe mode */
+if (!lazy_check_wraparound_failsafe(vacrel))
+    pgstat_progress_update_param(PROGRESS_VACUUM_TOTAL_INDEX_VACUUM, vacrel->nindexes);

Shouldn't this be 0 when INDEX_CLEANUP is off, too?

+#define PROGRESS_VACUUM_CURRENT_INDRELID        7
+#define PROGRESS_VACUUM_LEADER_PID              8
+#define PROGRESS_VACUUM_INDEX_ORDINAL           9
+#define PROGRESS_VACUUM_TOTAL_INDEX_VACUUM      10
+#define PROGRESS_VACUUM_DEAD_TUPLES_VACUUMED    11

nitpick: I would suggest the following names to match the existing style:

    PROGRESS_VACUUM_NUM_INDEXES_TO_VACUUM
    PROGRESS_VACUUM_INDEX_LEADER_PID
    PROGRESS_VACUUM_INDEX_INDEXRELID
    PROGRESS_VACUUM_INDEX_ORDINAL_POSITION
    PROGRESS_VACUUM_INDEX_TUPLES_REMOVED

Nathan
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On 1/11/22, 10:06 AM, "John Naylor" wrote: > I pushed this with one small change -- I felt the comment didn't need > to explain the warning message, since it now simply matches the coding > more exactly. Also, v5 was a big enough change from v4 that I put > Nathan as the first author. Thanks! Nathan
Re: Throttling WAL inserts when the standby falls behind more than the configured replica_lag_in_bytes
I noticed this thread and thought I'd share my experiences building something similar for Multi-AZ DB clusters [0]. It's not a strict RPO mechanism, but it does throttle backends in an effort to keep the replay lag below a configured maximum. I can share the code if there is interest. I wrote it as a new extension, and except for one piece that I'll go into later, I was able to avoid changes to core PostgreSQL code. The extension manages a background worker that periodically assesses the state of the designated standbys and updates an atomic in shared memory that indicates how long to delay. A transaction callback checks this value and sleeps as necessary. Delay can be injected for write-enabled transactions on the primary, read-only transactions on the standbys, or both. The extension is heavily configurable so that it can meet the needs of a variety of workloads. One interesting challenge I encountered was accurately determining the amount of replay lag. The problem was twofold. First, if there is no activity on the primary, there will be nothing to replay on the standbys, so the replay lag will appear to grow unbounded. To work around this, the extension's background worker periodically creates an empty COMMIT record. Second, if a standby reconnects after a long time, the replay lag won't be accurate for some time. Instead, the replay lag will slowly increase until it reaches the correct value. Since the delay calculation looks at the trend of the replay lag, this apparent unbounded growth causes it to inject far more delay than is necessary. My guess is that this is related to 9ea3c64, and maybe it is worth rethinking that logic. For now, the extension just periodically reports the value of GetLatestXTime() from the standbys to the primary to get an accurate reading. This is done via a new replication callback mechanism (which requires core PostgreSQL changes). I can share this patch along with the extension, as I bet there are other applications for it. 
I should also note that the extension only considers "active" standbys and primaries. That is, ones with an active WAL sender or WAL receiver. This avoids the need to guess what should be done during a network partition, but it also means that we must gracefully handle standbys reconnecting with massive amounts of lag. The extension is designed to slowly ramp up the amount of injected delay until the standby's apply lag is trending down at a sufficient rate. I see that an approach was suggested upthread for throttling based on WAL distance instead of per-transaction. While the transaction approach works decently well for certain workloads (e.g., many small transactions like those from pgbench), it might require further tuning for very large transactions or workloads with a variety of transaction sizes. For that reason, I would definitely support building a way to throttle based on WAL generation. It might be a good idea to avoid throttling critical activity such as anti-wraparound vacuuming, too. Nathan [0] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/multi-az-db-clusters-concepts.html
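To sketch the division of labor described above (standalone C with invented names; the real extension uses PostgreSQL shared memory and a transaction callback, not plain C11 atomics): the background worker publishes a delay value, ramping it up while lag is above target and still growing, and backends simply read it.

```c
#include <stdatomic.h>

/* Delay published by the worker and read by every backend. */
static _Atomic long throttle_delay_ms;

/*
 * Worker side: ramp the delay up while apply lag exceeds the target and
 * is not trending down, and back it off otherwise.  The doubling/halving
 * policy here is purely illustrative.
 */
static void
update_throttle(long lag_ms, long prev_lag_ms, long target_lag_ms)
{
	long		cur = atomic_load(&throttle_delay_ms);

	if (lag_ms > target_lag_ms && lag_ms >= prev_lag_ms)
		atomic_store(&throttle_delay_ms, cur > 0 ? cur * 2 : 10);
	else
		atomic_store(&throttle_delay_ms, cur / 2);
}

/* Backend side: how long the transaction callback would sleep. */
static long
get_throttle_delay(void)
{
	return atomic_load(&throttle_delay_ms);
}
```

A production policy needs much more care (trend smoothing, caps, per-standby state), but the split between a worker that assesses lag and backends that only read one value is the important part.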
Re: make MaxBackends available in _PG_init
On 1/7/22, 12:27 PM, "Robert Haas" wrote: > On Fri, Jan 7, 2022 at 1:09 PM Bossart, Nathan wrote: >> While that approach would provide a way to safely retrieve the value, >> I think it would do little to prevent the issue in practice. If the >> size of the patch is a concern, we could also convert MaxBackends into >> a macro for calling GetMaxBackends(). This could also be a nice way >> to avoid breaking extensions that are using it correctly while >> triggering ERRORs for extensions that are using it incorrectly. I've >> attached a new version of the patch that does it this way. > > That's too magical for my taste. Fair point. v4 [0] is the less magical version. Nathan [0] https://postgr.es/m/attachment/125445/v4-0001-Disallow-external-access-to-MaxBackends.patch
Re: Add index scan progress to pg_stat_progress_vacuum
On 1/6/22, 6:14 PM, "Imseih (AWS), Sami" wrote: > I am hesitant to make column name changes for obvious reasons, as it breaks > existing tooling. However, I think there is a really good case to change > "index_vacuum_count" as the name is confusing. > "index_vacuum_cycles_completed" is the name I suggest if we agree to rename. > > For the new column, "num_indexes_to_vacuum" is good with me. Yeah, I think we can skip renaming index_vacuum_count for now. In any case, it would probably be good to discuss that in a separate thread. Nathan
Re: Worth using personality(ADDR_NO_RANDOMIZE) for EXEC_BACKEND on linux?
On 11/23/21, 11:29 AM, "Thomas Munro" wrote: > Here's a patch for Linux and also FreeBSD. The latter OS decided to > turn on ASLR by default recently, causing my workstation to fail like > this quite reliably, which reminded me to follow up with this. It > disables ASLR in pg_ctl and pg_regress, which is enough for check and > check-world, but doesn't help you if you run the server directly > (unlike the different hack done for macOS). > > For whatever random reason the failures are rarer on Linux (could be > my imagination, but I think they might be clustered, I didn't look > into the recipe for the randomness), but even without reproducing a > failure it's clear to see using pmap that this has the right effect. > I didn't bother with a check for the existence of ADDR_NO_RANDOMIZE > because it's since 2.6.12 which is definitely ancient enough. FWIW I just found this patch very useful for testing some EXEC_BACKEND stuff on Linux. Nathan
Re: Disallow quorum uncommitted (with synchronous standbys) txns in logical replication subscribers
On 1/6/22, 11:25 PM, "Jeff Davis" wrote: > On Wed, 2022-01-05 at 23:59 -0800, SATYANARAYANA NARLAPURAM wrote: >> I would like to propose a GUC send_Wal_after_quorum_committed which >> when set to ON, walsenders corresponds to async standbys and logical >> replication workers wait until the LSN is quorum committed on the >> primary before sending it to the standby. This not only simplifies >> the post failover steps but avoids unnecessary downtime for the async >> replicas. Thoughts? > > Do we need a GUC? Or should we just always require that sync rep is > satisfied before sending to async replicas? > > It feels like the sync quorum should always be ahead of the async > replicas. Unless I'm missing a use case, or there is some kind of > performance gotcha. I don't have a strong opinion on whether there needs to be a GUC, but +1 for the ability to enforce sync quorum before sending WAL to async standbys. I think that would be a reasonable default behavior. Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 1/7/22, 5:52 AM, "David Steele" wrote: > On 1/6/22 20:20, Euler Taveira wrote: >> On Thu, Jan 6, 2022, at 9:48 PM, Bossart, Nathan wrote: >>> After a quick glance, I didn't see an easy way to hold a session open >>> while the test does other things. If there isn't one, modifying >>> backup_fs_hot() to work with non-exclusive mode might be more trouble >>> than it is worth. >> >> You can use IPC::Run to start psql in background. See examples in >> src/test/recovery. > > I don't think updating backup_fs_hot() is worth it here. > > backup_fs_cold() works just fine for this case and if there is a need > for backup_fs_hot() in the future it can be implemented as needed. Thanks for the pointer on IPC::Run. I had a feeling I was missing something obvious! I think I agree with David that it still isn't worth it for just this one test. Of course, it would be great to test the non-exclusive backup logic as much as possible, but I'm not sure that this particular test will provide any sort of meaningful coverage. Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 12/2/21, 1:34 PM, "Bossart, Nathan" wrote: > On 12/2/21, 9:50 AM, "David Steele" wrote: >> On 12/2/21 12:38, David Steele wrote: >>> On 12/2/21 11:00, Andrew Dunstan wrote: >>>> >>>> Should we really be getting rid of >>>> PostgreSQL::Test::Cluster::backup_fs_hot() ? >>> >>> Agreed, it would be better to update backup_fs_hot() to use exclusive >>> mode and save out backup_label instead. >> >> Oops, of course I meant non-exclusive mode. > > +1. I'll fix that in the next revision. I finally got around to looking into this, and I think I found why it was done this way in 2018. backup_fs_hot() runs pg_start_backup(), closes the session, copies the data, and then runs pg_stop_backup() in a different session. This doesn't work with non-exclusive mode because the backup will be aborted when the session that runs pg_start_backup() is closed. pg_stop_backup() will fail with a "backup is not in progress" error. Furthermore, 010_logical_decoding_timelines.pl seems to be the only test that uses backup_fs_hot(). After a quick glance, I didn't see an easy way to hold a session open while the test does other things. If there isn't one, modifying backup_fs_hot() to work with non-exclusive mode might be more trouble than it is worth. Nathan
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
On 12/21/21, 11:42 AM, "Mark Dilger" wrote:
> +			/* pre-v9.3 lock-only bit pattern */
> +			ereport(ERROR,
> +					(errcode(ERRCODE_DATA_CORRUPTED),
> +					 errmsg_internal("found tuple with HEAP_XMAX_COMMITTED and "
> +									 "HEAP_XMAX_EXCL_LOCK set and "
> +									 "HEAP_XMAX_IS_MULTI unset")));
> +	}
> +
>
> I find this bit hard to understand. Does the comment mean to suggest that
> the *upgrade* process should have eliminated all pre-v9.3 bit patterns, and
> therefore any such existing patterns are certainly corruption, or does it
> mean that data written by pre-v9.3 servers (and not subsequently updated) is
> defined as corrupt, or ?
>
> I am not complaining that the logic is wrong, just trying to wrap my head
> around what the comment means.

This is just another way that a tuple may be marked locked-only, and we
want to explicitly disallow locked-only + xmax-committed. This bit
pattern may be present on servers that were pg_upgraded from pre-v9.3
versions. See commits 0ac5ad5 and 74ebba8 for more detail.

Nathan
Re: Add index scan progress to pg_stat_progress_vacuum
On 12/29/21, 8:44 AM, "Imseih (AWS), Sami" wrote: > In "pg_stat_progress_vacuum", introduce 2 columns: > > * total_index_vacuum : This is the # of indexes that will be vacuumed. Keep > in mind that if failsafe mode kicks in mid-flight to the vacuum, Postgres may > choose to forgo index scans. This value will be adjusted accordingly. > * max_index_vacuum_cycle_time : The total elapsed time for a index vacuum > cycle is calculated and this value will be updated to reflect the longest > vacuum cycle. Until the first cycle completes, this value will be 0. The > purpose of this column is to give the user an idea of how long an index > vacuum cycle takes to complete. I think that total_index_vacuum is a good thing to have. I would expect this to usually just be the number of indexes on the table, but as you pointed out, this can be different when we are skipping indexes. My only concern with this new column is the potential for confusion when compared with the index_vacuum_count value. index_vacuum_count indicates the number of vacuum cycles completed, but total_index_vacuum indicates the number of indexes that will be vacuumed. However, the names sound like they could refer to the same thing to me. Perhaps we should rename index_vacuum_count to index_vacuum_cycles/index_vacuum_cycle_count, and the new column should be something like num_indexes_to_vacuum or index_vacuum_total. I don't think we need the max_index_vacuum_cycle_time column. While the idea is to give users a rough estimate for how long an index cycle will take, I don't think it will help generate any meaningful estimates for how much longer the vacuum operation will take. IIUC we won't have any idea how many total index vacuum cycles will be needed. Even if we did, the current cycle could take much more or much less time. Also, none of the other progress views seem to provide any timing information, which I suspect is by design to avoid inaccurate estimates. 
> Introduce a new view called "pg_stat_progress_vacuum_index". This view will
> track the progress of a worker ( or leader PID ) while it's vacuuming an
> index. It will expose some key columns:
>
> * pid: The PID of the worker process
>
> * leader_pid: The PID of the leader process. This is the column that can be
> joined with "pg_stat_progress_vacuum". leader_pid and pid can have the same
> value as a leader can also perform an index vacuum.
>
> * indrelid: The relid of the index currently being vacuumed
>
> * vacuum_cycle_ordinal_position: The processing position of the index being
> vacuumed. This can be useful to determine how many indexes out of the total
> indexes ( pg_stat_progress_vacuum.total_index_vacuum ) have been vacuumed
>
> * index_tuples_vacuumed: This is the number of index tuples vacuumed for the
> index overall. This is useful to show that the vacuum is actually doing work,
> as the # of tuples keeps increasing.

Should we also provide some information for determining the progress of
the current cycle? Perhaps there should be an
index_tuples_vacuumed_current_cycle column that users can compare with
the num_dead_tuples value in pg_stat_progress_vacuum. However, perhaps
the number of tuples vacuumed in the current cycle can already be
discovered via index_tuples_vacuumed % max_dead_tuples.

+void
+rusage_adjust(const PGRUsage *ru0, PGRUsage *ru1)
+{
+	if (ru1->tv.tv_usec < ru0->tv.tv_usec)
+	{
+		ru1->tv.tv_sec--;
+		ru1->tv.tv_usec += 1000000;
+	}
+	if (ru1->ru.ru_stime.tv_usec < ru0->ru.ru_stime.tv_usec)
+	{
+		ru1->ru.ru_stime.tv_sec--;
+		ru1->ru.ru_stime.tv_usec += 1000000;
+	}
+	if (ru1->ru.ru_utime.tv_usec < ru0->ru.ru_utime.tv_usec)
+	{
+		ru1->ru.ru_utime.tv_sec--;
+		ru1->ru.ru_utime.tv_usec += 1000000;
+	}
+}

I think this function could benefit from a comment. Without going
through it line by line, it is not clear to me exactly what it is doing.
I know we're still working on what exactly this stuff should look like, but I would suggest adding the documentation changes in the near future. Nathan
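For what it's worth, my reading of the function (a guess until that comment exists) is that it normalizes the later reading before a field-by-field subtraction. A standalone sketch of the same borrow step, reduced to a single timeval pair:

```c
#include <sys/time.h>

/*
 * Illustrative sketch, not the patch's code: before computing t1 - t0
 * field by field, "borrow" one second into t1's microseconds whenever
 * t1->tv_usec < t0->tv_usec, so that the per-field subtraction never
 * produces a negative microsecond value.
 */
static void
timeval_borrow(const struct timeval *t0, struct timeval *t1)
{
	if (t1->tv_usec < t0->tv_usec)
	{
		t1->tv_sec--;
		t1->tv_usec += 1000000;
	}
}
```

Something along those lines, stated as a comment on rusage_adjust(), would make the intent clear at a glance.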
Re: skip replication slot snapshot/map file removal during end-of-recovery checkpoint
On 12/23/21, 3:17 AM, "Bharath Rupireddy" wrote: > Currently the end-of-recovery checkpoint can be much slower, impacting > the server availability, if there are many replication slot files > .snap or map- to be enumerated and deleted. How about skipping > the .snap and map- file handling during the end-of-recovery > checkpoint? It makes the server available faster and the next regular > checkpoint can deal with these files. If required, we can have a GUC > (skip_replication_slot_file_handling or some other better name) to > control this default being the existing behavior. I suggested something similar as a possibility in the other thread where these tasks are being discussed [0]. I think it is worth considering, but IMO it is not a complete solution to the problem. If there are frequently many such files to delete and regular checkpoints are taking longer, the shutdown/end-of-recovery checkpoint could still take a while. I think it would be better to separate these tasks from checkpointing instead. Nathan [0] https://postgr.es/m/A285A823-0AF2-4376-838E-847FA4710F9A%40amazon.com
Re: Pre-allocating WAL files
On 12/30/21, 3:52 AM, "Maxim Orlov" wrote: > I did check the patch too and found it to be ok. Check and check-world are > passed. > Overall idea seems to be good in my opinion, but I'm not sure where is the > optimal place to put the pre-allocation. > > On Thu, Dec 30, 2021 at 2:46 PM Pavel Borisov wrote: >> I've checked the patch v7. It applies cleanly, code is good, check-world >> tests passed without problems. >> I think it's ok to use checkpointer for this and the overall patch can be >> committed. But the seen performance gain makes me think again before adding >> this feature. I did tests myself a couple of months ago and got similar >> results. >> Really don't know whether is it worth the effort. Thank you both for your review. Nathan
Re: Strange path from pgarch_readyXlog()
On 12/29/21, 3:11 PM, "Tom Lane" wrote: > "Bossart, Nathan" writes: >> This crossed my mind, too. I also think one of the arrays can be >> eliminated in favor of just using the heap (after rebuilding with a >> reversed comparator). Here is a minimally-tested patch that >> demonstrates what I'm thinking. > > I already pushed a patch that de-static-izes those arrays, so this > needs rebased at least. However, now that you mention it it does > seem like maybe the intermediate arch_files[] array could be dropped > in favor of just pulling the next file from the heap. > > The need to reverse the heap's sort order seems like a problem > though. I really dislike the kluge you used here with a static flag > that inverts the comparator's sort order behind the back of the > binary-heap mechanism. It seems quite accidental that that doesn't > fall foul of asserts or optimizations in binaryheap.c. For > instance, if binaryheap_build decided it needn't do anything when > bh_has_heap_property is already true, this code would fail. In any > case, we'd need to spend O(n) time inverting the heap's sort order, > so this'd likely be slower than the current code. > > On the whole I'm inclined not to bother trying to optimize this > further. The main thing that concerned me is that somebody would > bump up NUM_FILES_PER_DIRECTORY_SCAN to the point where the static > space consumption becomes really problematic, and we've fixed that. Your assessment seems reasonable to me. If there was a better way to adjust the comparator for the heap, maybe there would be a stronger case for this approach. I certainly don't think it's worth inventing something for just this use-case. Thanks for fixing this! Nathan
Re: Strange path from pgarch_readyXlog()
On 12/29/21, 12:22 PM, "Thomas Munro" wrote:
> Isn't this a corrupted pathname?
>
> 2021-12-29 03:39:55.708 CST [79851:1] WARNING: removal of orphan
> archive status file
> "pg_wal/archive_status/00010003.0028.backup00010004.ready"
> failed too many times, will try again later

I bet this was a simple mistake in beb4e9b.

Nathan

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 434939be9b..b5b0d4e12f 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -113,7 +113,7 @@ static PgArchData *PgArch = NULL;
  * is empty, at which point another directory scan must be performed.
  */
 static binaryheap *arch_heap = NULL;
-static char arch_filenames[NUM_FILES_PER_DIRECTORY_SCAN][MAX_XFN_CHARS];
+static char arch_filenames[NUM_FILES_PER_DIRECTORY_SCAN][MAX_XFN_CHARS + 1];
 static char *arch_files[NUM_FILES_PER_DIRECTORY_SCAN];
 static int	arch_files_size = 0;
Re: Add index scan progress to pg_stat_progress_vacuum
On 12/1/21, 3:02 PM, "Imseih (AWS), Sami" wrote:
> The current implementation of pg_stat_progress_vacuum does not
> provide progress on which index is being vacuumed making it
> difficult for a user to determine if the "vacuuming indexes" phase
> is making progress. By exposing which index is being scanned as well
> as the total progress the scan has made for the current cycle, a
> user can make better estimations on when the vacuum will complete.

+1

> The proposed patch adds 4 new columns to pg_stat_progress_vacuum:
>
> 1. indrelid - the relid of the index being vacuumed
> 2. index_blks_total - total number of blocks to be scanned in the
> current cycle
> 3. index_blks_scanned - number of blocks scanned in the current
> cycle
> 4. leader_pid - if the pid for the pg_stat_progress_vacuum entry is
> a leader or a vacuum worker. This patch places an entry for every
> worker pid ( if parallel ) as well as the leader pid

nitpick: Shouldn't index_blks_scanned be index_blks_vacuumed? IMO it is
more analogous to heap_blks_vacuumed.

This will tell us which indexes are currently being vacuumed and the
current progress of those operations, but it doesn't tell us which
indexes have already been vacuumed or which ones are pending vacuum. I
think such information is necessary to truly understand the current
progress of vacuuming indexes, and I can think of a couple of ways we
might provide it:

1. Make the new columns you've proposed return arrays. This isn't very
clean, but it would keep all the information for a given vacuum
operation in a single row. The indrelids column would be populated with
all the indexes that have been vacuumed, need to be vacuumed, or are
presently being vacuumed. The other index-related columns would then
have the associated stats and the worker PID (which might be the same
as the pid column depending on whether parallel index vacuum was being
done). Alternatively, the index column could have an array of records,
each containing all the information for a given index.

2. Create a new view for just index vacuum progress information. This
would have similar information as 1. There would be an entry for each
index that has been vacuumed, needs to be vacuumed, or is currently
being vacuumed. And there would be an easy way to join with
pg_stat_progress_vacuum (e.g., leader_pid, which again might be the
same as our index vacuum PID depending on whether we were doing
parallel index vacuum). Note that it would be possible for the PID of
these entries to be null before and after we process the index.

3. Instead of adding columns to pg_stat_progress_vacuum, adjust the
current ones to be more general, and then add new entries for each of
the indexes that have been, need to be, or currently are being
vacuumed. This is the most similar option to your current proposal, but
instead of introducing a column like index_blks_total, we'd rename
heap_blks_total to blks_total and use that for both the heap and
indexes. I think we'd still want to add a leader_pid column. Again, we
have to be prepared for the PID to be null in this case. Or we could
just make the pid column always refer to the leader, and we could
introduce a worker_pid column. That might create confusion, though.

I wish option #1 was cleaner, because I think it would be really nice
to have all this information in a single row. However, I don't expect
much support for a 3-dimensional view, so I suspect option #2 (creating
a separate view for index vacuum progress) is the way to go. The other
benefit of option #2 versus option #3 or your original proposal is that
it cleanly separates the top-level vacuum operations and the index
vacuum operations, which are related at the moment, but which might not
always be tied so closely together.

Nathan
Re: archive modules
On 11/2/21, 8:07 AM, "Bossart, Nathan" wrote: > The main motivation is provide a way to archive without shelling out. > This reduces the amount of overhead, which can improve archival rate > significantly. It should also make it easier to archive more safely. > For example, many of the common shell commands used for archiving > won't fsync the data, but it isn't too hard to do so via C. The > current proposal doesn't introduce any extra infrastructure for > batching or parallelism, but it is probably still possible. I would > like to eventually add batching, but for now I'm only focused on > introducing basic archive module support. As noted above, the latest patch set (v11) doesn't add any batching or parallelism. Now that beb4e9b is committed (which causes the archiver to gather multiple files to archive in each scan of archive_status), it seems like a good time to discuss this a bit further. I think there are some interesting design considerations. As is, the current archive module infrastructure in the v11 patch set should help reduce the amount of overhead per-file quite a bit, and I observed a noticeable speedup with a basic file-copying archive strategy (although this is likely not representative of real-world workloads). I believe it would be possible for archive module authors to implement batching/parallelism, but AFAICT it would still require hacks similar to what folks do today with archive_command. For example, you could look ahead in archive_status, archive a bunch of files in a batch or in parallel with background workers, and then quickly return true when the archive_library is called for later files in the batch. Alternatively, we could offer some kind of built-in batching support in the archive module infrastructure. One simple approach would be to just have pgarch_readyXlog() optionally return the entire list of files gathered from the directory scan of archive_status (presently up to 64 files). 
Or we could provide a GUC like archive_batch_size that would allow users to limit how many files are sent to the archive_library each time. This list would be given to pgarch_archiveXlog(), which would return which files were successfully archived and which failed. I think this could be done for archive_command as well, although it might be tricky to determine which files were archived successfully. To handle that, we might just need to fail the whole batch if the archive_command return value indicates failure. Another interesting change is that the special timeline file handling added in beb4e9b becomes less useful. Presently, if a timeline history file is marked ready for archival, we force pgarch_readyXlog() to do a new scan of archive_status the next time it is called in order to pick it up as soon as possible (ordinarily it just returns the files gathered in a previous scan until it runs out). If we are sending a list of files to the archive module, it will be more difficult to ensure timeline history files are picked up so quickly. Perhaps this is a reasonable tradeoff to make when archive batching is enabled. I think the retry logic can stay roughly the same. If any files in a batch cannot be archived, wait a second before retrying. If that happens a few times in a row, stop archiving for a bit. It wouldn't be quite as precise as what's there today because the failures could be for different files each time, but I don't know if that is terribly important. Finally, I wonder if batching support is something we should bother with at all for the first round of archive module support. I believe it is something that could be easily added later, although it might require archive modules to adjust the archiving callback to accept and return a list of files. IMO the archive modules infrastructure is still an improvement even without batching, and it seems to fit nicely into the existing behavior of the archiver process. I'm curious what others think about all this. 
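To make the interface question concrete, here is one possible shape for a batch-aware archiving callback (standalone C; every name here is invented and this is not the proposed API): the callback receives the list gathered by the directory scan and reports per-file success, so the archiver knows exactly which files to mark done.

```c
#include <stdbool.h>
#include <string.h>

/* Invented callback type: archive a batch, report per-file success. */
typedef int (*ArchiveBatchCB) (const char *const *files, int nfiles,
							   bool *succeeded);

/* Toy implementation: "archive" everything except .partial files. */
static int
copy_batch(const char *const *files, int nfiles, bool *succeeded)
{
	int			narchived = 0;

	for (int i = 0; i < nfiles; i++)
	{
		const char *suffix = strrchr(files[i], '.');

		succeeded[i] = (suffix == NULL || strcmp(suffix, ".partial") != 0);
		if (succeeded[i])
			narchived++;
	}
	return narchived;
}
```

With per-file results like this, only the files that actually failed would need to be retried, rather than failing the whole batch the way a single boolean return would force.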
Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/14/21, 12:09 PM, "Bossart, Nathan" wrote: > On 12/14/21, 9:00 AM, "Bruce Momjian" wrote: >> Have we changed temporary file handling in any recent major releases, >> meaning is this a current problem or one already improved in PG 14. > > I haven't noticed any recent improvements while working in this area. On second thought, the addition of the remove_temp_files_after_crash GUC is arguably an improvement since it could prevent files from accumulating after repeated crashes. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/14/21, 9:00 AM, "Bruce Momjian" wrote: > Have we changed temporary file handling in any recent major releases, > meaning is this a current problem or one already improved in PG 14. I haven't noticed any recent improvements while working in this area. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/13/21, 12:37 PM, "Robert Haas" wrote: > On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan wrote: >> > But against all that, if these tasks are slowing down checkpoints and >> > that's avoidable, that seems pretty important too. Interestingly, I >> > can't say that I've ever seen any of these things be a problem for >> > checkpoint or startup speed. I wonder why you've had a different >> > experience. >> >> Yeah, it's difficult for me to justify why users should suffer long >> periods of downtime because startup or checkpointing is taking a very >> long time doing things that are arguably unrelated to startup and >> checkpointing. > > Well sure. But I've never actually seen that happen. I'll admit that surprises me. As noted elsewhere [0], we were seeing this enough with pgsql_tmp that we started moving the directory aside before starting the server. Discussions about handling this usually prompt questions about why there are so many temporary files in the first place (which is fair). FWIW all four functions noted in my original message [1] are things I've seen firsthand affecting users. Nathan [0] https://postgr.es/m/E7573D54-A8C9-40A8-89D7-0596A36ED124%40amazon.com [1] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com
Re: Add sub-transaction overflow status in pg_stat_activity
On 12/13/21, 6:30 AM, "Dilip Kumar" wrote: > On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby wrote: >> Since I think this field is usually not interesting to most users of >> pg_stat_activity, maybe this should instead be implemented as a function like >> pg_backend_get_subxact_status(pid). >> >> People who want to could use it like: >> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub; > > I have provided two functions, one for the subtransaction count and the > other for whether the subtransaction cache is overflowed or not. We > can use them like this; if we think this is a better way to do it then we can also add > another function for the lastOverflowedXid The general approach looks good to me. I think we could have just one function for all three values, though. Nathan
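To make the one-function suggestion concrete, usage might look something like the sketch below. Note that the function name (borrowed from Justin's quoted suggestion) and the result columns are hypothetical; they do not come from the posted patch.

```sql
-- Hypothetical combined function returning all three values at once;
-- neither this signature nor these column names exist in the patch.
SELECT psa.pid,
       sub.subxact_count,
       sub.subxact_overflowed,
       sub.last_overflowed_xid
FROM pg_stat_activity psa,
     pg_backend_get_subxact_status(psa.pid) AS sub;
```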
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/13/21, 9:20 AM, "Justin Pryzby" wrote: > On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote: >> Another issue is that we don't want to increase the number of >> processes without bound. Processes use memory and CPU resources and if >> we run too many of them it becomes a burden on the system. Low-end >> systems may not have too many resources in total, and high-end systems >> can struggle to fit demanding workloads within the resources that they >> have. Maybe it would be cheaper to do more things at once if we were >> using threads rather than processes, but that day still seems fairly >> far off. > > Maybe that's an argument that this should be a dynamic background worker > instead of an auxiliary process. Then maybe it would be controlled by > max_parallel_maintenance_workers (or something similar). The checkpointer > would need to do these tasks itself if parallel workers were disabled or > couldn't be launched. I think this is an interesting idea. I dislike the prospect of having two code paths for all this stuff, but if it addresses the concerns about resource usage, maybe it's worth it. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/13/21, 5:54 AM, "Robert Haas" wrote: > I don't know whether this kind of idea is good or not. Thanks for chiming in. I have an almost-complete patch set that I'm hoping to post to the lists in the next couple of days. > One thing we've seen a number of times now is that entrusting the same > process with multiple responsibilities often ends poorly. Sometimes > it's busy with one thing when another thing really needs to be done > RIGHT NOW. Perhaps that won't be an issue here since all of these > things are related to checkpointing, but then the process name should > reflect that rather than making it sound like we can just keep piling > more responsibilities onto this process indefinitely. At some point > that seems bound to become an issue. Two of the tasks are cleanup tasks that checkpointing handles at the moment, and two are cleanup tasks that are done at startup. For now, all of these tasks are somewhat nonessential. There's no requirement that any of these tasks complete in order to finish startup or checkpointing. In fact, outside of preventing the server from running out of disk space, I don't think there's any requirement that these tasks run at all. IMO this would have to be a core tenet of a new auxiliary process like this. That being said, I totally understand your point. If there were a dozen such tasks handled by a single auxiliary process, that could cause a new set of problems. Your checkpointing and startup might be fast, but you might run out of disk space because our cleanup process can't handle it all. So a new worker could end up becoming an availability risk as well. > Another issue is that we don't want to increase the number of > processes without bound. Processes use memory and CPU resources and if > we run too many of them it becomes a burden on the system. Low-end > systems may not have too many resources in total, and high-end systems > can struggle to fit demanding workloads within the resources that they > have. 
> Maybe it would be cheaper to do more things at once if we were > using threads rather than processes, but that day still seems fairly > far off. I do agree that it is important to be very careful about adding new processes, and if a better idea for how to handle these tasks emerges, I will readily abandon my current approach. Upthread, Andres mentioned optimizing unnecessary snapshot files, and I mentioned possibly limiting how much time startup and checkpoints spend on these tasks. I don't have too many details for the former, and for the latter, I'm worried about not being able to keep up. But if the prospect of adding a new auxiliary process for this stuff is a non-starter, perhaps I should explore that approach some more. > But against all that, if these tasks are slowing down checkpoints and > that's avoidable, that seems pretty important too. Interestingly, I > can't say that I've ever seen any of these things be a problem for > checkpoint or startup speed. I wonder why you've had a different > experience. Yeah, it's difficult for me to justify why users should suffer long periods of downtime because startup or checkpointing is taking a very long time doing things that are arguably unrelated to startup and checkpointing. Nathan
Re: do only critical work during single-user vacuum?
On 12/9/21, 5:27 PM, "Peter Geoghegan" wrote: > I imagine that this new function (to handle maintenance tasks in the > event of a wraparound emergency) would output information about its > progress. For example, it would make an up-front decision about which > tables needed to be vacuumed in order for the current DB's > datfrozenxid to be sufficiently new, before it started anything (with > handling for edge-cases with many tables, perhaps). It might also show > the size of each table, and show another line for each table that has > been processed so far, as a rudimentary progress indicator. I like the idea of having a built-in function that does the bare minimum to resolve wraparound emergencies, and I think providing some sort of simple progress indicator (even if rudimentary) would be very useful. I imagine the decision logic could be pretty simple. If we're only interested in getting the cluster out of a wraparound emergency, we can probably just look for all tables with an age over ~2B. Nathan
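The "age over ~2B" check sketched above can already be expressed with ordinary catalog SQL; something along these lines would identify the tables such a function might target (the 1.6 billion threshold here is purely illustrative, not a number from the proposal):

```sql
-- Tables, materialized views, and TOAST tables whose relfrozenxid age
-- is approaching wraparound territory; threshold is illustrative only.
SELECT oid::regclass AS relation,
       age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind IN ('r', 'm', 't')
  AND age(relfrozenxid) > 1600000000
ORDER BY age(relfrozenxid) DESC;
```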
Re: do only critical work during single-user vacuum?
On 12/9/21, 5:06 PM, "Bossart, Nathan" wrote: > On 12/9/21, 4:36 PM, "Peter Geoghegan" wrote: >> We could then apply this criteria in new code that implements this >> "big red button" (maybe this is a new option for the postgres >> executable, a little like --single?). Something that's reasonably >> targeted, and dead simple to use. > > +1 As Andres noted, such a feature might be useful during normal operation, too. Perhaps the vacuumdb --min-xid-age stuff should be moved to a new VACUUM option. Nathan
Re: do only critical work during single-user vacuum?
On 12/9/21, 4:36 PM, "Peter Geoghegan" wrote: > We could then apply this criteria in new code that implements this > "big red button" (maybe this is a new option for the postgres > executable, a little like --single?). Something that's reasonably > targeted, and dead simple to use. +1 Nathan
Re: do only critical work during single-user vacuum?
On 12/9/21, 12:33 PM, "Bossart, Nathan" wrote: > On 12/9/21, 11:34 AM, "John Naylor" wrote: >> Now that we have a concept of a fail-safe vacuum, maybe it would be >> beneficial to skip a vacuum in single-user mode if the fail-safe >> criteria were not met at the beginning of vacuuming a relation. This >> is not without risk, of course, but it should be much faster than >> today and once up and running the admin would have a chance to get a >> handle on things. Thoughts? > > Would the --min-xid-age and --no-index-cleanup vacuumdb options help > with this? Sorry, I'm not sure what I was thinking. Of course you cannot use vacuumdb in single-user mode. But I think something like --min-xid-age in VACUUM is what you are looking for. Nathan
Re: do only critical work during single-user vacuum?
On 12/9/21, 11:34 AM, "John Naylor" wrote: > When a user must shut down and restart in single-user mode to run > vacuum on an entire database, that does a lot of work that's > unnecessary for getting the system online again, even without > index_cleanup. We had a recent case where a single-user vacuum took > around 3 days to complete. > > Now that we have a concept of a fail-safe vacuum, maybe it would be > beneficial to skip a vacuum in single-user mode if the fail-safe > criteria were not met at the beginning of vacuuming a relation. This > is not without risk, of course, but it should be much faster than > today and once up and running the admin would have a chance to get a > handle on things. Thoughts? Would the --min-xid-age and --no-index-cleanup vacuumdb options help with this? Nathan
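For reference, the minimal-work vacuum being discussed roughly corresponds to options that already exist at the SQL level (although there is no --min-xid-age equivalent in the VACUUM command itself); the table name below is a placeholder:

```sql
-- Do the bare minimum to advance relfrozenxid: skip index cleanup and
-- relation truncation, and force freezing. Table name is a placeholder.
VACUUM (FREEZE, INDEX_CLEANUP FALSE, TRUNCATE FALSE, VERBOSE) some_bloated_table;
```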
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/8/21, 3:29 AM, "Bharath Rupireddy" wrote: > Thanks for your thoughts. I'm fine either way, hence attaching two > patches here, and I will leave it to the committer's choice. > 1) v1-0001-Add-DB_IN_END_OF_RECOVERY_CHECKPOINT-state-for-co.patch -- > adds new db state DB_IN_END_OF_RECOVERY_CHECKPOINT for control file. > 2) v1-0001-Skip-control-file-db-state-updation-during-end-of.patch -- > just skips setting db state to DB_SHUTDOWNING and DB_SHUTDOWNED in > case of end-of-recovery checkpoint so that the state will be > DB_IN_CRASH_RECOVERY which then changes to DB_IN_PRODUCTION. I've bumped this one to ready-for-committer. For the record, my preference is the second patch (for the reasons discussed upthread). Both patches might benefit from a small comment or two, too. Nathan
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/7/21, 8:42 PM, "Bharath Rupireddy" wrote: > On Wed, Dec 8, 2021 at 9:49 AM Bossart, Nathan wrote: >> I think that's alright. The only other small suggestion I have would >> be to say "during end-of-recovery checkpoint" instead of "while in >> end-of-recovery checkpoint." > > "while in" is being used by DB_IN_CRASH_RECOVERY and > DB_IN_ARCHIVE_RECOVERY messages. I don't think it's a good idea to > deviate from that and use "during". Fair enough. I don't have a strong opinion about this. >> Another option we might want to consider is to just skip updating the >> state entirely for end-of-recovery checkpoints. The state would >> instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION. I >> don't know if it's crucial to have a dedicated control file state for >> end-of-recovery checkpoints. > > Please note that end-of-recovery can take a while in production > systems (we have observed such things working with our customers) and > anything can happen during that period of time. The end-of-recovery > checkpoint is not something that gets finished momentarily. Therefore, > having a separate db state in the control file is useful. Is there some useful distinction between the states for users? ISTM that users will be waiting either way, and I don't know that an extra control file state will help all that much. The main reason I bring up this option is that the list of states is pretty short and appears to be intended to indicate the high-level status of the server. Most of the states are over 20 years old, and the newest one is over 10 years old, so I don't think new states can be added willy-nilly. Of course, I could be off-base and others might agree that this new state would be nice to have. Nathan
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/7/21, 5:21 PM, "Bharath Rupireddy" wrote: > On Wed, Dec 8, 2021 at 2:50 AM Bossart, Nathan wrote: >> I noticed that some (but not all) of the surrounding messages say >> "last known up at" the control file time. I'm curious why you chose >> not to use that phrasing in this case. > > If state is DB_IN_END_OF_RECOVERY_CHECKPOINT that means the db was > interrupted while in end-of-recovery checkpoint, so I used the > phrasing similar to DB_IN_CRASH_RECOVERY and DB_IN_ARCHIVE_RECOVERY > cases. I would like to keep it as-is (in the v1 patch) unless anyone > has other thoughts here? > (errmsg("database system was interrupted while in recovery at %s", > (errmsg("database system was interrupted while in recovery at log time %s", I think that's alright. The only other small suggestion I have would be to say "during end-of-recovery checkpoint" instead of "while in end-of-recovery checkpoint." Another option we might want to consider is to just skip updating the state entirely for end-of-recovery checkpoints. The state would instead go straight from DB_IN_CRASH_RECOVERY to DB_IN_PRODUCTION. I don't know if it's crucial to have a dedicated control file state for end-of-recovery checkpoints. Nathan
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On 12/7/21, 5:21 PM, "Bharath Rupireddy" wrote: > On Wed, Dec 8, 2021 at 4:17 AM Bossart, Nathan wrote: >> I agree with Tom. I would just s/server/backend/ (as per the >> attached) and call it a day. > > Thanks. v5 patch looks good to me. I've marked the commitfest entry as ready-for-committer. Nathan
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/7/21, 12:10 AM, "Bharath Rupireddy" wrote: > Here's a patch that I've come up with. Please see if this looks okay > and let me know if we want to take it forward so that I can add a CF > entry. Overall, the patch seems reasonable to me. + case DB_IN_END_OF_RECOVERY_CHECKPOINT: + ereport(LOG, + (errmsg("database system was interrupted while in end-of-recovery checkpoint at %s", + str_time(ControlFile->time)))); + break; I noticed that some (but not all) of the surrounding messages say "last known up at" the control file time. I'm curious why you chose not to use that phrasing in this case. Nathan
Re: Pre-allocating WAL files
On 12/7/21, 9:35 AM, "Bossart, Nathan" wrote: > On 12/7/21, 12:29 AM, "Bharath Rupireddy" > wrote: >> Why can't the walwriter pre-allocate some of the WAL segments instead >> of a new background process? Of course, it might delay the main >> functionality of the walwriter i.e. flush and sync the WAL files, but >> having checkpointer do the pre-allocation makes it do another extra >> task. Here the amount of walwriter work vs checkpointer work, I'm not >> sure which one does more work compared to the other. > > The argument against adding it to the WAL writer is that we want it to > run with low latency to flush asynchronous commits. If we added WAL > pre-allocation to the WAL writer, there could periodically be large > delays. To your point on trying to avoid giving the checkpointer extra tasks (basically what we are talking about on the other thread [0]), WAL pre-allocation might not be of much concern because it will generally be a small, fixed (and configurable) amount of work, and it can be performed concurrently with the checkpoint. Plus, WAL pre-allocation should ordinarily be phased out as WAL segments become eligible for recycling. IMO it's not comparable to tasks like CheckPointSnapBuild() that can delay checkpointing for a long time. Nathan [0] https://www.postgresql.org/message-id/flat/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com
Re: Alter all tables in schema owner fix
On 12/7/21, 2:47 AM, "Greg Nancarrow" wrote: > On Tue, Dec 7, 2021 at 9:01 PM Amit Kapila wrote: >> >> Thanks, the patch looks mostly good to me. I have slightly modified it >> to incorporate one of Michael's suggestions, ran pgindent, and >> modified the commit message. >> > > LGTM, except in the patch commit message I'd change "Create > Publication" to "CREATE PUBLICATION". LGTM, too. Nathan
Re: Pre-allocating WAL files
On 12/7/21, 12:29 AM, "Bharath Rupireddy" wrote: > Why can't the walwriter pre-allocate some of the WAL segments instead > of a new background process? Of course, it might delay the main > functionality of the walwriter i.e. flush and sync the WAL files, but > having checkpointer do the pre-allocation makes it do another extra > task. Here the amount of walwriter work vs checkpointer work, I'm not > sure which one does more work compared to the other. The argument against adding it to the WAL writer is that we want it to run with low latency to flush asynchronous commits. If we added WAL pre-allocation to the WAL writer, there could periodically be large delays. > Another idea could be to let walwriter or checkpointer pre-allocate > the WAL files whichever seems free as-of-the-moment when the WAL > segment pre-allocation request comes. We can go further to let the > user choose which process i.e. checkpointer or walwriter do the > pre-allocation with a GUC maybe? My latest patch set [0] adds WAL pre-allocation to the checkpointer. In that patch set, WAL pre-allocation is done both outside of checkpoints as well as during checkpoints (via CheckPointWriteDelay()). Nathan [0] https://www.postgresql.org/message-id/CB15BEBD-98FC-4E72-86AE-513D34014176%40amazon.com
Re: Add sub-transaction overflow status in pg_stat_activity
On 12/6/21, 8:19 PM, "Dilip Kumar" wrote: > If the subtransaction cache is overflowed in some of the transactions > then it will affect all the concurrent queries as they need to access > the SLRU for checking the visibility of each tuple. But currently > there is no way to identify whether any backend's subtransaction cache is > overflowed or what the current active subtransaction count is. > Attached patch adds subtransaction count and subtransaction overflow > status in pg_stat_activity. I have implemented this because of the > recent complaint about the same [1] I'd like to give a general +1 to this effort. Thanks for doing this! I've actually created a function to provide this information in the past, so I will help review. Nathan
Re: Do we need pre-allocate WAL files during end-of-recovery checkpoint?
On 12/6/21, 4:54 AM, "Bharath Rupireddy" wrote: > The function PreallocXlogFiles doesn't get called during > end-of-recovery checkpoint in CreateCheckPoint, see [1]. The server > becomes operational after the end-of-recovery checkpoint and may need > WAL files. However, I'm not sure how beneficial it is going to be if > the WAL is pre-allocated (as PreallocXlogFiles just allocates only 1 > extra WAL file). There is another thread for adding more effective WAL pre-allocation [0] that you might be interested in. Nathan [0] https://www.postgresql.org/message-id/flat/20201225200953.jjkrytlrzojbndh5%40alap3.anarazel.de
Re: Is it correct to update db state in control file as "shutting down" during end-of-recovery checkpoint?
On 12/6/21, 4:34 AM, "Bharath Rupireddy" wrote: > While the database is performing end-of-recovery checkpoint, the > control file gets updated with db state as "shutting down" in > CreateCheckPoint (see the code snippet at [1]) and at the end it sets > it back to "shut down" for a brief moment and then finally to "in > production". If the end-of-recovery checkpoint takes a lot of time or > the db goes down during the end-of-recovery checkpoint for whatever > reasons, the control file ends up having the wrong db state. > > Should we add a new db state something like > DB_IN_END_OF_RECOVERY_CHECKPOINT/"in end-of-recovery checkpoint" or > something else to represent the correct state? This seems like a reasonable change to me. From a quick glance, it looks like it should be a simple fix that wouldn't add too much divergence between the shutdown and end-of-recovery checkpoint code paths. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/6/21, 3:44 AM, "Bharath Rupireddy" wrote: > On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan wrote: >> I might hack something together for the separate worker approach, if >> for no other reason than to make sure I really understand how these >> functions work. If/when a better idea emerges, we can alter course. > > Thanks. As I said upthread we've been discussing the approach of > offloading some of the checkpoint tasks like (deleting snapshot files) > internally for quite some time and I would like to share a patch that > adds a new background cleaner process (currently able to delete the > logical replication snapshot files, if required can be extended to do > other tasks as well). I don't mind if it gets rejected. Please have a > look. Thanks for sharing! I've also spent some time on a patch set, which I intend to share once I have handling for all four tasks (so far I have handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll take a look at your patch as well. Nathan
Re: pg_replslotdata - a tool for displaying replication slot information
On 12/5/21, 11:10 PM, "Michael Paquier" wrote: > On Thu, Dec 02, 2021 at 08:32:08AM +0530, Bharath Rupireddy wrote: >> On Thu, Dec 2, 2021 at 4:22 AM Andres Freund wrote: >>>> I don't have any other compelling use-cases at the moment, but I will say that it is typically nice from an administrative standpoint to be able to inspect things like this without logging into a running server. >>> >>> Weighed against the cost of maintaining (including documentation) a separate >>> tool this doesn't seem sufficient reason. >> >> IMHO, this shouldn't be a reason to not get something useful (useful >> IMO and few others in this thread) into the core. The maintenance of >> the tools generally is low compared to the core server features once >> they get reviewed and committed. > Well, a bit less maintenance is always better than more maintenance. > An extra cost that you may be missing is related to the translation of > the documentation, as well as the translation of any new strings this > would require. FWIW, I don't directly see a use for this tool that > could not be solved with an online server. Bharath, perhaps you should maintain this outside of core PostgreSQL for now. If some compelling use-cases ever surface that make it seem worth the added maintenance burden, this thread could probably be revisited. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/3/21, 5:57 AM, "Bharath Rupireddy" wrote: > On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan wrote: >> >> On 12/1/21, 6:48 PM, "Bharath Rupireddy" >> wrote: >> > +1 for the overall idea of making the checkpoint faster. In fact, we >> > here at our team have been thinking about this problem for a while. If >> > there are a lot of files that checkpoint has to loop over and remove, >> > IMO, that task can be delegated to someone else (maybe a background >> > worker called background cleaner or bg cleaner, of course, we can have >> > a GUC to enable or disable it). The checkpoint can just write some >> >> Right. IMO it isn't optimal to have critical things like startup and >> checkpointing depend on somewhat-unrelated tasks. I understand the >> desire to avoid adding additional processes, and maybe it is a bigger >> hammer than what is necessary to reduce the impact, but it seemed like >> a natural solution for this problem. That being said, I'm all for >> exploring other ways to handle this. > > Having a generic background cleaner process (controllable via a few > GUCs), which can delete a bunch of files (snapshot, mapping, old WAL, > temp files etc.) or some other task on behalf of the checkpointer, > seems to be the easiest solution. > > I'm too open for other ideas. I might hack something together for the separate worker approach, if for no other reason than to make sure I really understand how these functions work. If/when a better idea emerges, we can alter course. Nathan
Re: should we document an example to set multiple libraries in shared_preload_libraries?
On 12/3/21, 6:21 AM, "Bharath Rupireddy" wrote: > +1 to add here in the "Parameter Names and Values section", but do we > want to backlink every string parameter to this section? I think it > needs more effort. IMO, we can just backlink for > shared_preload_libraries alone. Thoughts? IMO this is most important for GUC_LIST_QUOTE parameters, of which there are only a handful. I don't think adding a link to every string parameter is necessary. Nathan
Re: Skip vacuum log report code in lazy_scan_heap() if possible
On 12/3/21, 7:40 AM, "Peter Geoghegan" wrote: > If my patch to unite vacuum verbose and the autovacuum logging gets > in, then this issue also goes away. Perhaps this patch should be marked Rejected in favor of that one, then. Nathan
Re: Alter all tables in schema owner fix
On 12/2/21, 11:57 PM, "tanghy.f...@fujitsu.com" wrote: > Thanks for your patch. > I tested it and it fixed this problem as expected. It also passed "make > check-world". +1, the patch looks good to me, too. My only other suggestion would be to move IsSchemaPublication() to pg_publication.c Nathan
Re: Alter all tables in schema owner fix
On 12/2/21, 7:07 PM, "vignesh C" wrote: > Currently while changing the owner of ALL TABLES IN SCHEMA > publication, it is not checked if the new owner has superuser > permission or not. Added a check to throw an error if the new owner > does not have superuser permission. > Attached patch has the changes for the same. Thoughts? Yeah, the documentation clearly states that "the new owner of a FOR ALL TABLES or FOR ALL TABLES IN SCHEMA publication must be a superuser" [0]. +/* + * Check if any schema is associated with the publication. + */ +static bool +CheckSchemaPublication(Oid pubid) I don't think the name CheckSchemaPublication() accurately describes what this function is doing. I would suggest something like PublicationHasSchema() or PublicationContainsSchema(). Also, much of this new function appears to be copied from GetPublicationSchemas(). Should we just use that instead? +CREATE ROLE regress_publication_user3 LOGIN SUPERUSER; +GRANT regress_publication_user2 TO regress_publication_user3; +SET ROLE regress_publication_user3; +SET client_min_messages = 'ERROR'; +CREATE PUBLICATION testpub4 FOR ALL TABLES IN SCHEMA pub_test; +RESET client_min_messages; +SET ROLE regress_publication_user; +ALTER ROLE regress_publication_user3 NOSUPERUSER; +SET ROLE regress_publication_user3; I think this test setup can be simplified a bit: CREATE ROLE regress_publication_user3 LOGIN; GRANT regress_publication_user2 TO regress_publication_user3; SET client_min_messages = 'ERROR'; CREATE PUBLICATION testpub4 FOR ALL TABLES IN SCHEMA pub_test; RESET client_min_messages; ALTER PUBLICATION testpub4 OWNER TO regress_publication_user3; SET ROLE regress_publication_user3; Nathan [0] https://www.postgresql.org/docs/devel/sql-alterpublication.html
Re: Temporary tables versus wraparound... again
On 10/12/21, 3:07 PM, "Greg Stark" wrote: > Here's an updated patch. I added some warning messages to autovacuum. I think this is a useful optimization, and I intend to review the patch more closely soon. It looks reasonable to me after a quick glance. > One thing I learned trying to debug this situation in production is > that it's nigh impossible to find the pid of the session using a > temporary schema. The number in the schema refers to the backendId in > the sinval stuff for which there's no external way to look up the > corresponding pid. It would have been very helpful if autovacuum had > just told me which backend pid to kill. I certainly think it would be good to have autovacuum log the PID, but IIUC a query like this would help you map the backend ID to the PID: SELECT bid, pg_stat_get_backend_pid(bid) AS pid FROM pg_stat_get_backend_idset() bid; Nathan
Re: should we document an example to set multiple libraries in shared_preload_libraries?
On 12/1/21, 5:59 AM, "Bharath Rupireddy" wrote: > On Wed, Dec 1, 2021 at 6:45 PM Tom Lane wrote: >> Considering the vanishingly small number of actual complaints we've >> seen about this, that sounds ridiculously over-engineered. >> A documentation example should be sufficient. > > Thanks. Here's the v1 patch adding examples in the documentation. I think the problems you noted upthread are shared for all GUCs with type GUC_LIST_QUOTE (e.g., search_path, temp_tablespaces). Perhaps the documentation for each of these GUCs should contain a short blurb about how to properly SET a list of values. Also upthread, I see that you gave the following example for an incorrect way to set shared_preload_libraries: ALTER SYSTEM SET shared_preload_libraries = auth_delay,pg_stat_statements,sepgsql; --> wrong Why is this wrong? It seems to work okay for me. Nathan
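As a concrete illustration of the quoting distinction being discussed (using the same libraries from the thread), the unambiguous way to set a list parameter via ALTER SYSTEM is to quote each element separately; if I read the GUC_LIST_QUOTE handling correctly, a single quoted string is instead treated as one (invalid) library name:

```sql
-- Each element quoted separately; ALTER SYSTEM writes a properly
-- quoted list into postgresql.auto.conf.
ALTER SYSTEM SET shared_preload_libraries = 'auth_delay', 'pg_stat_statements';

-- By contrast, one quoted string is (I believe) kept as a single
-- element for GUC_LIST_QUOTE parameters and fails at server start:
-- ALTER SYSTEM SET shared_preload_libraries = 'auth_delay,pg_stat_statements';
```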
Re: Skip vacuum log report code in lazy_scan_heap() if possible
On 10/29/21, 10:49 AM, "Bossart, Nathan" wrote: > On 10/29/21, 3:54 AM, "Greg Nancarrow" wrote: >> When recently debugging a vacuum-related issue, I noticed that there >> is some log-report processing/formatting code at the end of >> lazy_scan_heap() that, unless the vacuum VERBOSE option is specified, >> typically results in no log output (as the log-level is then DEBUG2). >> I think it is worth skipping this code if ultimately nothing is going >> to be logged (and I found that even for a tiny database, a VACUUM may >> execute this code hundreds of times). >> I have attached a simple patch for this. > > I think this logging only happens once per table, so I'm not sure it's > really worth it to short-circuit here. If it was per-page, IMO there > would be a much stronger case for it. That being said, I don't think > the proposed patch would hurt anything. Since I have no further comments, I went ahead and marked this one as ready-for-committer. Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 12/2/21, 9:50 AM, "David Steele" wrote: > On 12/2/21 12:38, David Steele wrote: >> On 12/2/21 11:00, Andrew Dunstan wrote: >>> >>> Should we really be getting rid of >>> PostgreSQL::Test::Cluster::backup_fs_hot() ? >> >> Agreed, it would be better to update backup_fs_hot() to use exclusive >> mode and save out backup_label instead. > > Oops, of course I meant non-exclusive mode. +1. I'll fix that in the next revision. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/1/21, 6:48 PM, "Bharath Rupireddy" wrote: > +1 for the overall idea of making the checkpoint faster. In fact, we > here at our team have been thinking about this problem for a while. If > there are a lot of files that checkpoint has to loop over and remove, > IMO, that task can be delegated to someone else (maybe a background > worker called background cleaner or bg cleaner, of course, we can have > a GUC to enable or disable it). The checkpoint can just write some Right. IMO it isn't optimal to have critical things like startup and checkpointing depend on somewhat-unrelated tasks. I understand the desire to avoid adding additional processes, and maybe it is a bigger hammer than what is necessary to reduce the impact, but it seemed like a natural solution for this problem. That being said, I'm all for exploring other ways to handle this. > Another idea could be to parallelize the checkpoint i.e. IIUC, the > tasks that checkpoint do in CheckPointGuts are independent and if we > have some counters like (how many snapshot/mapping files that the > server generated) Could you elaborate on this? Is your idea that the checkpointer would create worker processes like autovacuum does? Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/1/21, 6:06 PM, "Euler Taveira" wrote: > Saying that a certain task is O(n) doesn't mean it needs a separate process to > handle it. Did you have a use case or even better numbers (% of checkpoint / > startup time) that makes your proposal worthwhile? I don't have specific numbers on hand, but each of the four functions I listed is something I routinely see impacting customers. > For (3), there is already a GUC that would avoid the slowdown during startup. > Use it if you think the startup time is more important than disk space occupied > by useless files. Setting remove_temp_files_after_crash to false only prevents temp file cleanup during restart after a backend crash. It is always called for other startups. > For (4), you are forgetting that the on-disk state of replication slots is > stored in the pg_replslot/SLOTNAME/state. It seems you cannot just rename the > replication slot directory and copy the state file. What happens if there is a > crash before copying the state file? Good point. I think it's possible to deal with this, though. Perhaps the files that should be deleted on startup should go in a separate directory, or maybe we could devise a way to ensure the state file is copied even if there is a crash at an inconvenient time. Nathan
Re: O(n) tasks cause lengthy startups and checkpoints
On 12/1/21, 2:56 PM, "Andres Freund" wrote: > On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote: >> I realize adding a new maintenance worker might be a bit heavy-handed, >> but I think it would be nice to have somewhere to offload tasks that >> really shouldn't impact startup and checkpointing. I imagine such a >> process would come in handy down the road, too. WDYT? > > -1. I think the overhead of an additional worker is disproportional here. And > there's simplicity benefits in having a predictable cleanup interlock as well. Another idea I had was to put some upper limit on how much time is spent on such tasks. For example, a checkpoint would only spend X minutes on CheckPointSnapBuild() before giving up until the next one. I think the main downside of that approach is that it could lead to unbounded growth, so perhaps we would limit (or even skip) such tasks only for end-of-recovery and shutdown checkpoints. Perhaps the startup tasks could be limited in a similar fashion. > I think particularly for the snapshot stuff it'd be better to optimize away > unnecessary snapshot files, rather than making the cleanup more asynchronous. I can look into this. Any pointers would be much appreciated. Nathan
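The time-bounded alternative floated above can be sketched as a loop that checks a deadline between items. This is only an illustration under assumed names — cleanup_with_budget and budget_secs are invented, not PostgreSQL APIs — and in CheckPointSnapBuild() the per-item work would be unlinking one obsolete snapshot file:

```c
#include <time.h>

/*
 * Hypothetical sketch: bound the time a checkpoint spends on a cleanup
 * task.  The function name and budget_secs parameter are invented for
 * illustration; they do not exist in PostgreSQL.
 */
static int
cleanup_with_budget(int nitems, int budget_secs)
{
    time_t deadline = time(NULL) + budget_secs;
    int done = 0;

    for (int i = 0; i < nitems; i++)
    {
        /* Give up once the budget is spent; a later pass resumes. */
        if (time(NULL) >= deadline)
            break;
        /* ... unlink one obsolete file here ... */
        done++;
    }
    return done;
}
```

The downside noted above falls out directly: if the budget is routinely exhausted, leftover files accumulate without bound, which is why one might apply the limit only to end-of-recovery and shutdown checkpoints, where finishing quickly matters most.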
Re: Fix inappropriate uses of PG_GETARG_UINT32()
On 12/1/21, 10:29 AM, "Peter Eisentraut" wrote: > The attached patch fixes this by accepting the argument using > PG_GETARG_INT32(), doing some checks, and then casting it to unsigned > for the rest of the code. > > The patch also fixes another inappropriate use in an example in the > documentation. These two were the only inappropriate uses I found, > after we had fixed a few recently. LGTM Nathan
O(n) tasks cause lengthy startups and checkpoints
Hi hackers, Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to avoid individually syncing all database files after a crash. However, as noted earlier this year [0], there are still a number of O(n) tasks that affect startup and checkpointing that I'd like to improve. Below, I've attempted to summarize each task and to offer ideas for improving matters. I'll likely split each of these into its own thread, if there is community interest in such changes. 1) CheckPointSnapBuild(): This function loops through pg_logical/snapshots to remove all snapshots that are no longer needed. If there are many entries in this directory, this can take a long time. The note above this function indicates that this is done during checkpoints simply because it is convenient. IIUC there is no requirement that this function actually completes for a given checkpoint. My current idea is to move this to a new maintenance worker. 2) CheckPointLogicalRewriteHeap(): This function loops through pg_logical/mappings to remove old mappings and flush all remaining ones. IIUC there is no requirement that the "remove old mappings" part must complete for a given checkpoint, but the "flush all remaining" portion allows replay after a checkpoint to only "deal with the parts of a mapping that have been written out after the checkpoint started." Therefore, I think we should move the "remove old mappings" part to a new maintenance worker (probably the same one as for 1), and we should consider using syncfs() for the "flush all remaining" part. (I suspect the main argument against the latter will be that it could cause IO spikes.) 3) RemovePgTempFiles(): This step can delay startup if there are many temporary files to individually remove. This step is already optionally done after a crash via the remove_temp_files_after_crash GUC.
I propose that we have startup move the temporary file directories aside and create new ones, and then a separate worker (probably the same one from 1 and 2) could clean up the old files. 4) StartupReorderBuffer(): This step deletes logical slot data that has been spilled to disk. This code appears to be written to avoid deleting different types of files in these directories, but AFAICT there shouldn't be any other files. Therefore, I think we could do something similar to 3 (i.e., move the directories aside during startup and clean them up via a new maintenance worker). I realize adding a new maintenance worker might be a bit heavy-handed, but I think it would be nice to have somewhere to offload tasks that really shouldn't impact startup and checkpointing. I imagine such a process would come in handy down the road, too. WDYT? Nathan [0] https://postgr.es/m/32B59582-AA6C-4609-B08F-2256A271F7A5%40amazon.com
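The "move aside" trick proposed for (3) and (4) can be sketched with plain POSIX calls. The function and the destination naming here are assumptions for illustration, not actual PostgreSQL code; the point is that rename() of a directory is a single fast operation no matter how many files it contains, so startup pays O(1) and the background worker pays the O(n):

```c
#include <stdio.h>      /* rename() */
#include <sys/stat.h>   /* mkdir() */

/*
 * Hypothetical sketch of the startup-time "move aside" step: rename the
 * directory out of the way and recreate it empty.  Deleting the
 * contents of the moved-aside directory is left to a background worker.
 */
static int
move_dir_aside(const char *dir, const char *aside)
{
    if (rename(dir, aside) != 0)
        return -1;              /* nothing to move, or a real error */
    return mkdir(dir, 0700);    /* fresh, empty replacement */
}
```

A crash between the rename() and the mkdir() just leaves the moved-aside directory behind for the worker to sweep up; the replication-slot state-file concern is a separate problem this sketch does not address.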
Re: SKIP LOCKED assert triggered
On 11/12/21, 8:56 AM, "Simon Riggs" wrote: > The combination of these two statements in a transaction hits an > Assert in heapam.c at line 4770 on REL_14_STABLE I've been unable to reproduce this. Do you have any tips for how to do so? Does there need to be some sort of concurrent workload? Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 12/1/21, 8:27 AM, "David Steele" wrote: > On 11/30/21 18:31, Bossart, Nathan wrote: >> Do you think it's still worth trying to make it safe, or do you think >> we should just remove exclusive mode completely? > > My preference would be to remove it completely, but I haven't gotten a > lot of traction so far. In this thread, I count 6 people who seem alright with removing it, and 2 who might be opposed, although I don't think anyone has explicitly stated they are against it. Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 11/30/21, 2:58 PM, "David Steele" wrote: > I did figure out how to keep the safe part of exclusive backup (not > having to maintain a connection) while removing the dangerous part > (writing backup_label into PGDATA), but it was a substantial amount of > work and I felt that it had little chance of being committed. Do you think it's still worth trying to make it safe, or do you think we should just remove exclusive mode completely? > Attaching the thread [1] that I started with a patch to remove exclusive > backup for reference. Ah, good, some light reading. :) Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 11/30/21, 2:27 PM, "Tom Lane" wrote: > If we're willing to outright remove it, I don't have any great objection. > My original two cents was that we shouldn't put effort into improving it; > but removing it isn't that. I might try to put a patch together for the January commitfest, given there is enough support. Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 11/30/21, 9:51 AM, "Stephen Frost" wrote: > I disagree that that’s a satisfactory approach. It certainly wasn’t > intended or documented as part of the original feature and therefore > to call it satisfactory strikes me quite strongly as revisionist > history. It looks like the exclusive way has been marked deprecated in all supported versions along with a note that it will eventually be removed. If it's not going to be removed out of fear of breaking backward compatibility, I think the documentation should be updated to say that. However, unless there is something that is preventing users from switching to the non-exclusive approach, I think it is reasonable to begin thinking about removing it. Nathan
Re: pg_replslotdata - a tool for displaying replication slot information
On 11/30/21, 6:14 AM, "Peter Eisentraut" wrote: > On 23.11.21 06:09, Bharath Rupireddy wrote: >> The replication slots data is stored in binary format on the disk under >> the pg_replslot/<> directory which isn't human readable. If >> the server is crashed/down (for whatever reasons) and unable to come up, >> currently there's no way for the user/admin/developer to know what were >> all the replication slots available at the time of server crash/down to >> figure out what's the restart lsn, xid, two phase info or types of slots >> etc. > > What do you need that for? You can't do anything with a replication > slot while the server is down. One use-case might be to discover the value you need to set for max_replication_slots, although it's pretty trivial to discover the number of replication slots by looking at the folder directly. However, you also need to know how many replication origins there are, and AFAIK there isn't an easy way to read the replorigin_checkpoint file at the moment. IMO a utility like this should also show details for the replication origins. I don't have any other compelling use-cases at the moment, but I will say that it is typically nice from an administrative standpoint to be able to inspect things like this without logging into a running server. Nathan
Re: improve CREATE EXTENSION error message
On 11/29/21, 2:32 PM, "Chapman Flack" wrote: > If it were me, I would combine that DETAIL and HINT as one larger HINT, > and use DETAIL for specific details about what actually happened (such > as the exact filename sought and the %m). > > The need for those details doesn't go away; they're still what you need > when what went wrong is some other freak occurrence the hint doesn't > explain. How's this? postgres=# CREATE EXTENSION does_not_exist; ERROR: extension "does_not_exist" is not available DETAIL: Extension control file "/usr/local/pgsql/share/extension/does_not_exist.control" does not exist. HINT: The extension must first be installed on the system where PostgreSQL is running. The pg_available_extensions view lists the available extensions. Nathan
Re: improve CREATE EXTENSION error message
On 11/29/21, 2:13 PM, "Bossart, Nathan" wrote: > Alright, here's v3. In this version, I actually removed the message > about the control file entirely, so now the error message looks like > this: > > postgres=# CREATE EXTENSION does_not_exist; > ERROR: extension "does_not_exist" is not available > DETAIL: The extension must first be installed on the system where > PostgreSQL is running. > HINT: The pg_available_extensions view lists the extensions that are > available for installation. > > I can add the control file part back if we think it's necessary. Hm. I should probably adjust the hint to avoid confusion from "installed on the system" and "available for installation." Maybe something like The pg_available_extensions view lists the available extensions. Nathan
Re: improve CREATE EXTENSION error message
On 11/29/21, 1:38 PM, "Chapman Flack" wrote: > On 11/29/21 16:31, Daniel Gustafsson wrote: >> That's a good point, the hint is targeting users who might not even know that >> an extension needs to be physically and separately installed on the machine >> before it can be installed in their database; so maybe using "installed" here >> isn't entirely helpful at all. That being said I'm at a loss for a more >> suitable word, "available" perhaps? > > Maybe a larger break with the "This means the extension something something" > formulation, and more on the lines of > > HINT: an extension must first be present (for example, installed with a > package manager) on the system where PostgreSQL is running. I like this idea. I can do it this way in the next revision if others agree. Nathan
Re: improve CREATE EXTENSION error message
On 11/29/21, 1:03 PM, "Tom Lane" wrote: > If we issue the hint only for errno == ENOENT, I think we could be > less wishy-washy (and if it's not ENOENT, the hint is likely > inappropriate anyway). I'm thinking something more like > > HINT: This means the extension is not installed on the system. Good idea. > I'm not quite satisfied with the "on the system" wording, but I'm > not sure of what would be better. I agree that we can't just say > "is not installed", because people will confuse that with whether > it is installed within the database. Right. The only other idea I have at the moment is to say something like This means the extension is not available[ on the system]. I don't know whether that is actually any less confusing, though. Nathan
Re: improve CREATE EXTENSION error message
On 11/29/21, 12:33 PM, "Daniel Gustafsson" wrote: > I haven't given the suggested wording too much thought, but in general that > sounds like a good idea. Thanks. I'm flexible with the wording. Nathan
Re: Postgres restart in the middle of exclusive backup and the presence of backup_label file
On 11/26/21, 7:33 AM, "Tom Lane" wrote: > Michael Paquier writes: >> On Thu, Nov 25, 2021 at 06:19:03PM -0800, SATYANARAYANA NARLAPURAM wrote: >>> If we are keeping it then why not make it better? > >> Well, non-exclusive backups are better by design in many aspects, so I >> don't quite see the point in spending time on something that has more >> limitations than what's already in place. > > IMO the main reason for keeping it is backwards compatibility for users > who have a satisfactory backup arrangement using it. That same argument > implies that we shouldn't change how it works (at least, not very much). The issues with exclusive backups seem to be fairly well-documented (e.g., c900c15), but perhaps there should also be a note in the "Backup Control Functions" table [0]. Nathan [0] https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
On 11/25/21, 9:16 AM, "Mark Dilger" wrote: >> On Nov 24, 2021, at 12:53 PM, Bossart, Nathan wrote: >> >> Another option we might consider is only checking for the >> HEAP_XMAX_LOCK_ONLY bit instead of everything in >> HEAP_XMAX_IS_LOCKED_ONLY. IIUC everything else is only expected to >> happen for upgrades from v9.2 or earlier, so it might be pretty rare >> at this point. Otherwise, I'll extract the exact bit pattern for the >> error message as you suggest. > > I would prefer to detect and report any "can't happen" bit patterns without > regard for how likely the pattern may be. The difficulty is in proving that a > bit pattern is disallowed. Just because you can't find a code path in the > current code base that would create a pattern doesn't mean it won't have > legitimately been created by some past release or upgrade path. As such, any > prohibitions explicitly in the backend, such as Asserts around a condition, > are really valuable. You can know that the pattern is disallowed, since the > server would Assert on it if encountered. > > Aside from that, I don't really buy the argument that databases upgraded from > v9.2 or earlier are rare. Even if servers *running* v9.2 or earlier are (or > become) rare, servers initialized that far back which have been upgraded one > or more times since then may be common. Okay, I'll do it that way in the next revision. Nathan
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
On 11/23/21, 4:59 PM, "Mark Dilger" wrote: >> On Nov 23, 2021, at 4:51 PM, Bossart, Nathan wrote: >> >> This is a good point. Right now, you'd have to manually inspect the >> infomask field to determine the exact form of the invalid combination. >> My only worry with this is that we'd want to make sure these checks >> stayed consistent with the definition of HEAP_XMAX_IS_LOCKED_ONLY in >> htup_details.h. I'm guessing HEAP_XMAX_IS_LOCKED_ONLY is unlikely to >> change all that often, though. > > I expect that your patch will contain some addition to the amcheck (or > pg_amcheck) tests, so if we ever allow some currently disallowed bit > combination, we'd be reminded of the need to update amcheck. So I'm not too > worried about that aspect of this. > > I prefer not to have to get a page (or full file) from a customer when the > check reports corruption. Even assuming they are comfortable giving that, > which they may not be, it is an extra round trip to the customer asking for > stuff. Another option we might consider is only checking for the HEAP_XMAX_LOCK_ONLY bit instead of everything in HEAP_XMAX_IS_LOCKED_ONLY. IIUC everything else is only expected to happen for upgrades from v9.2 or earlier, so it might be pretty rare at this point. Otherwise, I'll extract the exact bit pattern for the error message as you suggest. Nathan
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
On 11/23/21, 4:36 PM, "Mark Dilger" wrote: > I have to wonder if, when corruption is reported for conditions like this: > > + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && > + HEAP_XMAX_IS_LOCKED_ONLY(ctx->tuphdr->t_infomask)) > > if the first thing we're going to want to know is which branch of the > HEAP_XMAX_IS_LOCKED_ONLY macro evaluated true? Should we split this check to > do each branch of the macro separately, such as: > > if (ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) > { > if (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) > ... report something ... > else if ((ctx->tuphdr->t_infomask & (HEAP_XMAX_IS_MULTI | > HEAP_LOCK_MASK)) == HEAP_XMAX_EXCL_LOCK) > ... report a different thing ... > } This is a good point. Right now, you'd have to manually inspect the infomask field to determine the exact form of the invalid combination. My only worry with this is that we'd want to make sure these checks stayed consistent with the definition of HEAP_XMAX_IS_LOCKED_ONLY in htup_details.h. I'm guessing HEAP_XMAX_IS_LOCKED_ONLY is unlikely to change all that often, though. Nathan
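The split Mark suggests can be sketched as a standalone classifier that reports which branch of the macro fired. The bit values below are transcribed from htup_details.h for illustration (verify them against the actual header), and classify_xmax_corruption is an invented name, not the amcheck function:

```c
#include <stdint.h>

/* Bit values as in src/include/access/htup_details.h (illustrative copy;
 * check against the tree you build against). */
#define HEAP_XMAX_KEYSHR_LOCK  0x0010
#define HEAP_XMAX_EXCL_LOCK    0x0040
#define HEAP_XMAX_LOCK_ONLY    0x0080
#define HEAP_XMAX_COMMITTED    0x0400
#define HEAP_XMAX_IS_MULTI     0x1000
#define HEAP_XMAX_SHR_LOCK     (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)
#define HEAP_LOCK_MASK         (HEAP_XMAX_SHR_LOCK | HEAP_XMAX_EXCL_LOCK | \
                                HEAP_XMAX_KEYSHR_LOCK)

/*
 * Hypothetical sketch: 0 = no problem; 1 = committed + LOCK_ONLY bit;
 * 2 = committed + the bare-exclusive-lock pattern the macro's second
 * branch covers.  Distinguishing the branches in the report means the
 * exact invalid combination is known without inspecting the page.
 */
static int
classify_xmax_corruption(uint16_t infomask)
{
    if ((infomask & HEAP_XMAX_COMMITTED) == 0)
        return 0;
    if (infomask & HEAP_XMAX_LOCK_ONLY)
        return 1;
    if ((infomask & (HEAP_XMAX_IS_MULTI | HEAP_LOCK_MASK)) ==
        HEAP_XMAX_EXCL_LOCK)
        return 2;
    return 0;
}
```

Keeping the two tests textually parallel to the macro's two arms is what makes the consistency concern raised above manageable: if the macro's definition changes, the classifier visibly diverges.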
Re: XMAX_LOCK_ONLY and XMAX_COMMITTED (fk/multixact code)
The archives seem unhappy with my attempt to revive this old thread, so here is a link to it in case anyone is looking for more context: https://www.postgresql.org/message-id/flat/3476708e-7919-20cb-ca45-6603470565f7%40amazon.com Nathan
Re: Sequence's value can be rollback after a crashed recovery.
On 11/23/21, 1:41 PM, "Tom Lane" wrote: > I wrote: >> I wonder though if we shouldn't try to improve the existing text. >> The phrasing "never rolled back" seems like it's too easily >> misinterpreted. Maybe rewrite the block like >> ... > > A bit of polishing later, maybe like the attached. The doc updates look good to me. Yesterday I suggested possibly adding a way to ensure that nextval() called in an uncommitted transaction was persistent, but I think we'd also have to ensure that synchronous replication waits for those records. Anyway, I don't think it is unreasonable to require the transaction to be committed to avoid duplicates from nextval(). Nathan
Re: Sequence's value can be rollback after a crashed recovery.
On 11/22/21, 5:10 AM, "Laurenz Albe" wrote: > On Mon, 2021-11-22 at 15:43 +0800, Andy Fan wrote: >> The performance argument was expected before this writing. If we look at >> nextval_internal more carefully, we can find it would not flush the xlog >> every time even if the sequence's cachesize is 1. Currently it happens every >> 32 times in nextval_internal at the worst case. > > Right, I didn't think of that. Still, I'm -1 on this performance regression. I periodically hear rumblings about this behavior as well. At the very least, it certainly ought to be documented if it isn't yet. I wouldn't mind trying my hand at that. Perhaps we could also add a new configuration parameter if users really want to take the performance hit. Nathan
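The batching Andy describes — one WAL record covering a run of values rather than a record per nextval() call — can be modeled with a toy counter. Everything below is a simplification invented for illustration; the real logic in sequence.c also handles caching, cycles, and the crash-recovery details:

```c
#define SEQ_LOG_VALS 32         /* batch size, as discussed above */

typedef struct
{
    long last_value;            /* last value handed out */
    long logged_upto;           /* highest value covered by WAL so far */
} toy_seq;

/*
 * Toy model: hand out the next value, "emitting WAL" only when the
 * pre-logged range is exhausted.  Because the record covers values
 * ahead of the one returned, a crash can skip values forward, but an
 * unflushed record is what lets an uncommitted nextval() roll back.
 */
static long
toy_nextval(toy_seq *s, int *wal_records)
{
    s->last_value++;
    if (s->last_value > s->logged_upto)
    {
        s->logged_upto = s->last_value + SEQ_LOG_VALS;
        (*wal_records)++;       /* one record per batch, not per call */
    }
    return s->last_value;
}
```

In this model roughly one record is emitted per 33 calls, which is the amortization that would be lost by forcing a flush on every nextval() — the performance regression being objected to.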
Re: Improving psql's \password command
On 11/21/21, 11:15 AM, "Tom Lane" wrote: > "Bossart, Nathan" writes: >> Yeah, I was looking for a way to simplify this, too. Your approach >> seems reasonable enough to me. > > Hearing no contrary opinions, pushed that way. Thanks! Nathan