Using operators to do query hints
I've been playing with an idea I had a while back. Basically, that it would be useful to have some "noop" operators that are used purely to influence the planner.

For context, I've suggested in the past that there are two categories of hints:

1. Hints that override the planner's decisions with explicit planning decisions. These we don't like for a variety of reasons, notably because users don't have very good knowledge of all the plan types available and new plan types come out in the future.

2. Hints that just influence the estimates that the planner uses to make its decisions. These seem more acceptable as it amounts to admitting the users know the distribution of their data and the behaviour of their functions better than the statistics do. Also, there are plenty of cases where the statistics are really just wild guesses. They still allow the planner to make decisions about what join orders or types and so on given the updated data.

So I thought it would be useful to add some noop operators that take a boolean on one side and a float on the other, simply return the boolean at run-time, but use the float (which would have to be a constant) on the lhs as the selectivity.

In practice it's usable but not nearly as widely useful as I expected.

1) The boolean being passed through has to be a real clause with real fields. Otherwise the planner is too clever and turns it into a one-time filter which doesn't affect the plan :/

2) If it's a clause that is potentially going to be an index condition then you need to repeat the clause outside the selectivity noop operator. Otherwise the selectivity operator hides it from the planner.

3) It doesn't work on joins at all. Joins are done using the join selectivity function on the join clause's operator. There's no way to construct a simple noop wrapper that would still work with that, as far as I can see.

Nonetheless what I have does seem to be somewhat handy for simple cases. I added some simple SQL wrapper functions such as pg_unlikely():

postgres=# explain select * from test where i<100;
                             QUERY PLAN
---------------------------------------------------------------------
 Index Only Scan using i on test  (cost=0.28..10.01 rows=99 width=4)
   Index Cond: (i < 100)
(2 rows)

postgres=# explain select * from test where pg_unlikely(i<100);
                             QUERY PLAN
---------------------------------------------------------------------
 Index Only Scan using i on test  (cost=0.28..10.50 rows=1 width=4)
   Index Cond: (i < 100)
   Filter: ('1e-06'::double precision %%% (i < 100))
(3 rows)

However this doesn't really do what you might be hoping for. Specifically, it doesn't actually affect the planner's choice of the index scan in that query. It affects the estimate of the result of the scan, and that might be useful for subsequent nodes of the plan, but not for the index scan itself:

postgres=# explain select * from test where pg_unlikely(i<500);
                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..22.50 rows=1 width=4)
   Filter: ((i < 500) AND ('1e-06'::double precision %%% (i < 500)))
(2 rows)

So. I dunno. This feels like it's something that could be quite handy but it's not as good as I had hoped and I think I'm out of ideas for making it more powerful.

-- greg
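For anyone curious what the SQL-level wrapper looks like, here is a minimal sketch of roughly how pg_unlikely() could be written. It assumes the %%% noop operator already exists; the C support function names below (pg_selnoop, pg_selnoop_restrict) are made up for illustration and are not the actual patch code.

    -- Hypothetical operator definition: the underlying function just returns
    -- its boolean argument, and the RESTRICT estimator reads the constant on
    -- the lhs as the clause selectivity.
    CREATE OPERATOR %%% (
        LEFTARG  = float8,
        RIGHTARG = bool,
        FUNCTION = pg_selnoop,
        RESTRICT = pg_selnoop_restrict
    );

    -- The SQL wrapper itself is trivial; being a simple SQL function it gets
    -- inlined, which is why the %%% expression shows up directly in the plan.
    CREATE FUNCTION pg_unlikely(cond bool) RETURNS bool AS $$
        SELECT 1e-06::float8 %%% cond
    $$ LANGUAGE sql IMMUTABLE;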
Re: Design of pg_stat_subscription_workers vs pgstats
On Tue, 25 Jan 2022 at 01:32, Andres Freund wrote: > > Hi, > > I was looking the shared memory stats patch again. Can you point me to this thread? I looked for it but couldn't find it. -- greg
Re: Commitfest manager for 2022-03
I would like to volunteer. On Fri, 25 Feb 2022 at 05:31, Julien Rouhaud wrote: > > Hi, > > The final commitfest for pg15 will start in a few days, and I didn't see any > discussion on it or anyone volunteering to be a CFM. > > I thought it would be a good idea to send this reminder now and avoid the same > situation as the last commitfest, to avoid unnecessary pain for the CFM(s). > > Is there any volunteer? > > For the record there are already 246 active patches registered. > > -- greg
Re: Commitfest manager for 2022-03
On Sat, 26 Feb 2022 at 01:33, Julien Rouhaud wrote: > > On Sat, Feb 26, 2022 at 02:42:33PM +0900, Michael Paquier wrote: > > On Fri, Feb 25, 2022 at 01:58:55PM -0600, David Steele wrote: > > > On 2/25/22 12:39, Greg Stark wrote: > > >> I would like to volunteer. > > > > > I've been hoping somebody would volunteer, so I'm all in favor of you > > > being > > > CF. > > > > Greg as CFM would be fine. It's nice to see someone volunteer. > > +1, thanks a lot Greg! > > Note that the last CF of a major version is probably even more work than > "regular" CF, so if other people are interested there's probably room for > multiple CFMs. I do have the time available. What I don't necessarily have is insight into everything that needs to be done, especially behind the scenes. So if I could have someone to give me tips about things I'm missing or review the things I am doing to see that I've covered everything that would be great. -- greg
CommitFest begins tomorrow... Get your patches in
28 days has February ... So the March commitfest begins tomorrow. Meet your friendly neighbourhood Commitfest Manager for March. Greetings!

If you have a patch you're hoping to get feedback on or you're expecting to get committed this month make sure it's in the commitfest at https://commitfest.postgresql.org/37/

Currently there are 199 patches marked Needs Review and 25 Ready for Committer. This is the last commitfest of the cycle so we'll be concentrating on patches that are likely to be committed this commitfest. I'll move any patches that are unlikely to be committed to target Postgres 16 and the July commitfest shortly.

If you think your patch is really viable for Postgres 15 despite needing feedback it can be helpful to post describing exactly what you're blocked on and what remains to be done. It also sometimes helps get attention if you make clear that you have the time available this month to work on it in response to feedback.

The "round robin reviewers" system is no longer. Generally people can just grab any unassigned patch from the commitfest and review it, but if for some reason you're reluctant feel free to email me -- please put "commitfest" in the subject line.

If you've submitted a patch and are feeling blocked waiting on feedback please update the thread on the list. If you're still not getting responses please let me know -- again, please put "commitfest" in the subject.

-- greg
Re: Removing unneeded self joins
On Thu, 1 Jul 2021 at 02:38, Ronan Dunklau wrote:
>
> Well in some cases they can't, when the query is not emitting redundant
> predicates by itself but they are added by something else like a view or a RLS
> policy.
> Maybe it would be worth it to allow spending a bit more time planning for
> those cases ?

Yeah, I'm generally in favour of doing more work in the optimizer to save query authors work writing queries.

My question is whether it handles cases like:

    select b.x, c.y
      from t
      join t2 as b on (b.id = t.id)
      join t2 as c on (c.id = t.id)

That is, joining against the same table twice on the same qual. Does the EC mechanism turn this into a qual on b.id = c.id and then turn that into a self-join that can be removed?

That's the usual pattern where I've seen this arise. It's not so much that people write self-joins explicitly, but that they add a join to check some column in some isolated piece of code that doesn't know the join is already in the query. You can easily end up with a lot of joins against the same table this way.

It's not far different from the old chestnut

    select (select x from t2 where id = t.id) as x,
           (select y from t2 where id = t.id) as y
      from t

which is actually pretty hard to avoid sometimes.

-- greg
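To spell out what removal would buy in the example above: assuming t2.id is unique, b and c must refer to the same t2 row, so the query could in principle be planned as if it were written as follows. This is purely an illustration of the intended end state, not output from the patch:

    -- Illustration only: the two joins collapse to one once b and c are
    -- recognized as the same row of t2.
    select b.x, b.y
      from t
      join t2 as b on (b.id = t.id);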
Re: Removing unneeded self joins
I don't think the benchmarking that's needed is to check whether pruning unnecessary joins is helpful. Obviously it's going to be hard to measure on simple queries and small tables, but the resulting plan is unambiguously superior and in more complex cases could save extra i/o. The benchmarking people were looking for in the past was testing the impact of the extra planning work in cases where it doesn't end up being applied. I'm not sure what the worst case is; perhaps a many-way self-join where the join clauses are not suitable for pruning?
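For what it's worth, such a worst case might look something like the following: a pile of self-joins on a non-unique column, so the detection logic has to examine every pair of relations but never gets to remove anything. Table and column names are invented for illustration.

    -- Hypothetical planning-overhead stress case: many self-joins, none
    -- removable because the join clauses are not on a unique key.
    SELECT count(*)
      FROM big_table a
      JOIN big_table b ON a.val = b.val
      JOIN big_table c ON a.val = c.val
      JOIN big_table d ON a.val = d.val
      JOIN big_table e ON a.val = e.val;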
Commitfest 2022-03 Patch Triage Part 1a.i
Last November Daniel Gustafsson did a patch triage. It took him three emails to get through the patches in the commitfest back then. Since then we've had the November and the January commitfests, so I was interested to see how many of these patches had advanced. I'm only part way through the first email but so far only two patches have changed status -- and both to "Returned with feedback" :(

So I'm going to post updates, but I'm going to break it up into smaller batches because otherwise it'll take me a month before I post anything.

> 1608: schema variables, LET command
> ===
> After 18 CF's and two very long threads it seems to be nearing completion
> judging by Tomas' review. There is an ask in that review for a second pass
> over the docs by a native speaker, any takers?

The patch has a new name, "session variables, LET command". There's been a *lot* of work on this patch so I'm loath to bump it. The last review was from Julien Rouhaud, which had mostly code style comments but one particular concern about using an xact callback in core and about EOX cleanup. Pavel, do you have a plan to improve this or are you looking for suggestions from someone about how you should solve this problem?

> 1741: Index Skip Scan
> =
> An often requested feature which has proven hard to reach consensus on an
> implementation for. The thread(s) have stalled since May, is there any hope of
> taking this further? Where do we go from here with this patch?

Often requested indeed! I would love to be able to stop explaining to people that Postgres can't handle this case well. It seems there are multiple patch variants around and suggestions from Heikki and Peter G about fundamental interface choices. It would be nice to have a good summary from someone who is involved about what's actually left unresolved.

> 1712: Remove self join on a unique column
> =
> This has moved from "don't write bad queries" to "maybe we should do something
> about this". It seems like there is concensus that it can be worth paying the
> added planner cost for this, especially if gated by a GUC to keep it out of
> installations where it doesn't apply. The regression on large join queries
> hasn't been benchmarked it seems (or I missed it), but the patch seems to have
> matured and be quite ready. Any takers on getting it across the finish line?

There hasn't been any review since the v29 patch was posted in July. That said, to my eye it seemed like pretty basic functionality errors were being corrected quite late. All the bugs got patch updates quickly, but does this have enough tests to be sure it's working right?

The only real objection was about whether the planning time justified the gains, since the gains are small. But I think that was resolved by making the optimization optional. Do we have consensus that that resolved the issue, or do we still need benchmarks showing the planning time hit is reasonable?

> 2161: standby recovery fails when re-replaying due to missing directory which
> was removed in previous replay.
> =
> Tom and Robert seem to be in agreement that parts of this patchset are good,
> and that some parts are not. The thread has stalled and the patch no longer
> apply, so unless someone feels like picking up the good pieces this seems like
> a contender to close for now.

Tom's feedback seems to have been acted on last November, and the last update in January was that it was passing CI. Is this ready to commit now?
> 2138: Incremental Materialized View Maintenance
> ===
> The size of the patchset and the length of the thread make it hard to gauge
> just far away it is, maybe the author or a reviewer can summarize the current
> state and outline what is left for it to be committable.

There is an updated patch set as of February, but I have the same difficulty wrapping my head around the amount of info here. Is this one really likely to be committable in 15? If not, I think we should move this to 16 now and concentrate on patches that will be committable in this release.

> 2218: Implement INSERT SET syntax
> =
> The author has kept this patch updated, and has seemingly updated according to
> the review comments. Tom: do you have any opinions on whether the updated
> version addresses your concerns wrt the SELECT rewrite?

I don't see any discussion implying that Tom's concerns were met. I'm not exactly clear why Tom's concerns are real problems though -- wouldn't it be a *good* thing if we have a more expressive syntax? But that's definitely what needs to be resolved before it can move ahead. So unless there are objections I'm going to update this to "waiting on author".

-- greg
Commitfest 2022-03 Starts Now
The final commitfest of this release begins now. Whereas most commitfests are about getting feedback to authors so they can advance the patch -- this one is about actually committing patches to wrap up the release. Please when reviewing patches try to push yourself to make the difficult call about whether a patch will actually be feasible to commit in this commitfest. If not be up front about it and say so because we need to focus energy in this commitfest on patches that are committable in Postgres 15. There are a lot of great patches in the commitfest that would really be great to get committed! -- greg
Re: Commitfest 2022-03 Patch Triage Part 1a.i
As Justin suggested I CC the authors from these patches I'm adding them here. Some of the patches have multiple "Authors" listed in the commitfest which may just be people who posted updated patches so I may have added more people than necessary. [If you received two copies of this do not reply to the first one or you'll get bounces. Hopefully I've cleaned the list now but we'll see...]
Re: Commitfest 2022-03 Patch Triage Part 1a.i
On Tue, 1 Mar 2022 at 14:59, Greg Stark wrote: > > As Justin suggested I CC the authors from these patches I'm adding > them here. Some of the patches have multiple "Authors" listed in the > commitfest which may just be people who posted updated patches so I > may have added more people than necessary. > [If you received two copies of this do not reply to the first one or > you'll get bounces. Hopefully I've cleaned the list now but we'll > see...] Nope. One more address to remove from the CC. -- greg
Commitfest 2022-03 Patch Triage Part 1b
> 2096: psql - add SHOW_ALL_RESULTS option
>
> Peter posted an updated version of Fabiens patch about a month ago (which at
> this point no longer applies) fixing a few issues, but also point at old
> review comments still unaddressed. Since this was pushed, but had to be
> reverted, I assume there is a desire for the feature but it seems to need
> more work still.

It looks like Peter and Fabien were debating the merits of a libpq change, and probably that won't happen this release cycle. Is there a kernel of psql functionality that can be extracted from this without the libpq change in this release cycle, or should it wait until we add the functionality to libpq? If it's the latter then perhaps we should move this to 16?

> 1651: GROUP BY optimization
> ===
> This is IMO a desired optimization, and after all the heavy lifting by Tomas
> the patch seems to be in pretty good shape.

This is two patches and it sounds like the first one is ready for committer whereas the second one is less clear. Or is the second one meant to be an alternative to the first one?

> 2377: pg_ls_* functions for showing metadata and recurse (pg_ls_tmpdir to show
> shared filesets)
> ==
> The question of what to do with lstat() on Windows is still left unanswered,
> but the patchset has been split to up to be able to avoid it. Stephen and Tom,
> having done prior reviews do you have any thoughts on this?

Is this still blocked on lstat for Windows? I couldn't tell -- is there consensus on a behaviour for Windows, even if that just means failing or returning partial results there? Other than that it seems like there's a lot of this patch that has positive reviews and is ready for committing.

> 2349: global temporary table
>
> GTT has been up for discussion numerous times in tbe past, and I can't judge
> whether this proposal has a better chance than previous ones. I do note the
> patch has a number of crashes reported lately, and no reviews from senior
> contributors in a while, making it seem unlikely to be committed in this CF.
> Since the patch is very big, can it be teased apart and committed separately
> for easier review?

I think Andres's review decisively makes it clear this is in an uncommittable state:

https://www.postgresql.org/message-id/20220225074500.sfizxbmlrj2s6hp5%40alap3.anarazel.de
https://www.postgresql.org/message-id/20220227041304.mnimeqkhwktrjyht%40alap3.anarazel.de

It's definitely not going to make it this release and will probably need a significant amount of time next release cycle. IMHO dividing it up into smaller features does seem like it would be more effective at getting things committed. Should we mark this returned with feedback or just move it to the next commitfest as waiting on author?

> 2433: Erase the distinctClause if the result is unique by definition
>
> (parts of) The approach taken in this patch has been objected against in favor
> of work that Tom has proposed. Until that work materialize this patch is
> blocked, and thus I think we are better of closing it and re-opening it when
> it gets unstuck. Unless Tom has plans to hack on this shortly?

Ugh. This is a problematic dynamic. Tom has a different idea of what direction to take this but hasn't had a chance to work on it. So what's Andy Fan supposed to do here? He can't read Tom's mind and nobody else can really help him. Ultimately we all have limited time so this is a thing that will happen, but is there anything we can do to resolve it in this case?
We definitely shouldn't spend lots of time on this patch unless we're going to be ok going ahead without Tom's version of it. Is this something we can do using Andy's data structure for now and change in the future?

It looks like the Skip Scan patch was related to this work in some way? Is it blocked on it?

-- greg
Re: Commitfest 2022-03 Patch Triage Part 1a.i
On Wed, 2 Mar 2022 at 07:12, Daniel Gustafsson wrote:
> Thanks for picking it up and continuing with recent developments. Let me know
> if you want a hand in triaging patchsets.

While I have the time, there may be patches where I need help coming to the right conclusion about what action to take. I think the main thing I can do to help is to take patches that have no chance of making this release and take them off our collective plates -- hopefully after they've received feedback, but as this is the last commitfest of the release that's a secondary concern.

But I'm unclear exactly what the consequences in the commitfest app are of specific state changes. As I understand it there are basically two alternatives:

1) Returned with feedback -- does this make it harder for an author to resume work in the next release? Can they simply reactivate the CF entry or do they need to start a new one and then lose history in the app?

2) Moved to next commitfest -- this seems to just drag the pain on. Then it has to get triaged again next commitfest, and if it's actually stalled (or progressing fine without needing feedback) that's just extra work for nothing.

Do I have this right? What is the right state to put a patch in that means "this patch doesn't need to be triaged again unless the author actually feels progress has been made and needs new feedback or thinks it's committable"?

-- greg
Re: Commitfest 2022-03 Patch Triage Part 1b
Just FYI. Better to follow up to the thread for the patch that's already in the CF. Otherwise your patch will be missed by someone who looks at the CF entry to see the latest patch.
Re: [Proposal] Global temporary tables
It doesn't look like this is going to get committed this release cycle. I understand more feedback could be valuable, especially on the overall design, but as this is the last commitfest of the release we should focus on other patches for now and spend that time in the next release cycle. I'm going to bump this one now as Waiting on Author for the design documentation Robert asks for and probably a plan for how to separate that design into multiple separable features as Andres suggested. I'm still hopeful we get to advance this early in 16 because I think everyone agrees the feature would be great.
Re: [PATCH] minor reloption regression tests improvement
I don't think this is worth spending time adding tests for. I get what you're saying that this is at least semi-intentional behaviour and desirable behaviour so it should have tests ensuring that it continues to work. But it's not documented behaviour and the test is basically testing that the implementation is this specific implementation. I don't think the test is really a bad idea but neither is it really very useful and I think it's not worth having people spend time reviewing and discussing. I'm leaning to saying this patch is rejected.
Re: Commitfest 2022-03 One Week in. 3 Commits 213 Patches Remaining
So it's 8 days into the commitfest. So far 3 patches have been committed:

* parse/analyze API refactoring by Peter Eisentraut
* FUNCAPI tuplestore helper function by Melanie Plageman, committed by Michael Paquier
* Typo in pgbench messages by Kawamoto Masay, committed by Tatsuo Ishii

(There was also a flurry of five commits from Tom on Feb 28, including one that needed to be backpatched subsequently.)

In addition 5 patches have been moved to a future commitfest, withdrawn, or rejected. So that has removed 8 patches from the commitfest. There are now 189 "Needs Review" and 24 "Ready for Committer" patches.

Of those 24 "Ready for Committer" patches only 3 actually have committers assigned and only 6 have had any emails since the beginning of the commitfest. Is there anything I can do to get committers assigned to these patches? Should I do a round-robin style assignment for any of these?
Re: Commitfest 2022-03 One Week in. 3 Commits 213 Patches Remaining
On Wed, 9 Mar 2022 at 15:44, David Steele wrote: > > On 3/9/22 13:38, Greg Stark wrote: > Should I do a round-robin style assignment for any of these? > > I don't think this is a good idea. Committers pick the patches they are > going to commit. > > What prefer to do is bump any committers that have been involved in a > patch thread to see if they are willing to commit it. Well yes, I suppose that's probably what I had in mind despite calling it that. But I've been skimming the set of "Ready for Committer" patches and I'm a bit down on them. Many of them seem to mostly have gotten feedback from committers already and the type of feedback that leads me to think it's ready for commit. I suppose people mark patches "Ready for Committer" when the level of feedback they require is more in depth or more design feedback that they think requires a committer even if it's not ready for commit. So I'm going to go through the patches and ask the committers who have already commented if they think the patch is on track to be committed this release or should be pushed to the next commitfest. -- greg
Re: Commitfest 2022-03 One Week in. 3 Commits 213 Patches Remaining
On Wed, 9 Mar 2022 at 16:46, Greg Stark wrote: > Many of them seem to mostly have gotten > feedback from committers already and the type of feedback that leads > me to think it's ready for commit. Er. I meant *not* the type of feedback that leads me to think it's ready for commit. I mostly see patches with early design feedback or even feedback questioning whether they're good ideas at all. -- greg
Re: Tablesync early exit
This patch has been through five CFs without any review. It's a very short patch and I'm assuming the only issue is that it's not clear whether it's a good idea or not and there are few developers familiar with this area of code? Amit, it looks like you were the one who asked for it to be split off from the logical decoding of 2PC patch in [1]. Can you summarize what questions remain here? Should we just commit this or is there any issue that needs to be debated? [1] https://www.postgresql.org/message-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB+k9_-yT5=9gdew84t...@mail.gmail.com On Sat, 6 Mar 2021 at 20:56, Peter Smith wrote: > > Hi hackers. > > I propose a small optimization can be added to the tablesync replication code. > > This proposal (and simple patch) was first discussed here [1]. > > Basic idea is the tablesync could/should detect if there is anything > to do *before* it enters the apply main loop. Calling > process_sync_tables() before the apply main loop offers a quick way > out, so the message handling will not be unnecessarily between > workers. This will be a small optimization. > > But also, IMO this is a more natural separation of work. E.g tablesync > worker will finish when the table is synced - not go one extra step... > > ~~ > > This patch was already successfully used for several versions > (v43-v50) of another 2PC patch [2], but it was eventually removed from > there because, although it has its own independent value, it was not > required for that patch series [3]. > > > [1] > https://www.postgresql.org/message-id/CAHut%2BPtjk-Qgd3R1a1_tr62CmiswcYphuv0pLmVA-%2B2s8r0Bkw%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/flat/CAHut%2BPsd5nyg-HG6rGO2_5jzXuSA1Eq5%2BB5J2VJo0Q2QWi-1HQ%40mail.gmail.com#1c2683756b32e267d96b7177ba95 > [3] > https://www.postgresql.org/message-id/CAA4eK1Jxu-3qxtkfA_dKoquQgGZVcB%2Bk9_-yT5%3D9GDEW84TF%2BA%40mail.gmail.com > > Kind Regards, > Peter Smith. > Fujitsu Australia > > -- greg
Re: [PATCH] pgbench: add multiconnect option
Hi guys, It looks like David sent a patch and Fabien sent a followup patch. But there hasn't been a whole lot of discussion or further patches. It sounds like there are some basic questions about what the right interface should be. Are there specific questions that would be helpful for moving forward?
Re: Window Function "Run Conditions"
This looks like an awesome addition. I have one technical question... Is it possible to actually transform the row_number case into a LIMIT clause, or to make the planner's support for this case equivalent to it (in which case we could replace the LIMIT clause planning by transforming it into a window function)? The reason I ask is because the Limit plan node is actually quite a bit more optimized than the general window function plan node. It calculates cost estimates based on the limit and can support Top-N sort nodes.

But the bigger question is whether this patch is ready for a committer to look at. Were you able to resolve Andy Fan's bug report? Did you resolve the two questions in the original email?
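For anyone not following the thread, the case in question is the classic row_number() pagination idiom. Table and column names below are invented; the second query is the hand-written Top-N equivalent that the Limit node already handles well, which is what prompts the question above.

    -- The row_number() shape the run-condition work targets:
    SELECT *
      FROM (SELECT *,
                   row_number() OVER (ORDER BY created_at DESC) AS rn
              FROM events) AS sub
     WHERE rn <= 10;

    -- Apart from the extra rn column, this returns the same rows as a
    -- plain Top-N query:
    SELECT *
      FROM events
     ORDER BY created_at DESC
     LIMIT 10;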
Re: Skip partition tuple routing with constant partition key
There are a whole lot of different patches in this thread. However this last one, https://commitfest.postgresql.org/37/3270/ created by Amit, seems like a fairly straightforward optimization that can be evaluated on its own separately from the others, and it seems quite mature. I'm actually inclined to set it to "Ready for Committer".

Incidentally, I did a quick read-through of the patch myself and the only question I have is how the parameters of the adaptive algorithm were chosen. They seem ludicrously conservative to me, and a simple argument about how expensive an extra check is versus the time saved in the binary search should be easy enough to come up with to justify whatever values make sense.
Commitfest Update
So far this commitfest these 10 patches have been marked committed. That leaves us with 175 "Needs Review" and 28 "Ready for Committer", so quite a ways to go ...

* FUNCAPI tuplestore helper function
* parse/analyze API refactoring
* Add wal_compression=zstd
* Add id's to various elements in protocol.sgml
* Fix alter data type of clustered/replica identity columns
* Fix flaky tests when synchronous_commit = off
* Support of time zone patterns - of, tzh and tzm
* Optionally automatically disable subscription on error
* fix race condition between DROP TABLESPACE and checkpointing
* ICU for global collation

In case anyone's looking for inspiration... Here is the list of patches marked Ready for Committer:

* document the need to analyze partitioned tables
* ExecTypeSetColNames is fundamentally broken
* pg_dump - read data for some options from external file
* Add comment about startup process getting a free procState array slot always
* Doc patch for retryable xacts
* Function to log backtrace of postgres processes
* Allow multiple recursive self-references
* Expose get_query_def()
* Add checkpoint and redo LSN to LogCheckpointEnd log message
* Fast COPY FROM command for the foreign tables
* Consider parallel for LATERAL subqueries having LIMIT/OFFSET
* Faster pglz compression
* Full support for index LP_DEAD hint bits on standby
* KnownAssignedXidsGetAndSetXmin performance
* Parameter for planner estimates of recursive queries
* enhancing plpgsql API for debugging and tracing
* Make message at end-of-recovery less scary
* Identify missing publications from publisher while create/alter subscription.
* Allow providing restore_command as a command line option to pg_rewind
* Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
* Allow batched insert during cross-partition updates
* Add callback table access method to reset filenode when dropping relation
* use has_privs_for_role for predefined roles
* range_agg with multirange inputs
* Add new reloption to views for enabling row level security
* Fix pg_rewind race condition just after promotion
* Mitigate pg_rewind race condition, if config file is enlarged concurrently.
* remove exclusive backup mode

-- greg
Re: Proposal for internal Numeric to Uint64 conversion function.
On Fri, 11 Mar 2022 at 15:17, Tom Lane wrote: > > Amul Sul writes: > > Yeah, that's true, I am too not sure if we really need to refactor > > all those; If we want, I can give it a try. > > The patch as-presented isn't very compelling for > lack of callers of the new function Tom, are you saying you think we're not interested in just adding this function unless it's part of this refactoring? Amul, do you think if we did numeric_to_int64/numeric_to_uint64 as a refactored API and a second patch that made numeric_pg_lsn and other consumers use it it would clean up the code significantly? -- greg
Re: proposal: enhancing plpgsql debug API - returns text value of variable content
It looks like this is -- like a lot of plpgsql patches -- having difficulty catching the attention of reviewers and committers. Aleksander asked for a test and Pavel put quite a bit of work into adding a good test case. I actually like that there's a test because it shows the API can be used effectively.

From my quick skim of this code it does indeed look like it's ready to commit. It's mostly pretty mechanical code to expose a couple fields so that a debugger can see them. Pavel, are you planning to add a debugger to contrib using this? The test example code looks like it would already be kind of useful even in this form.
Re: jsonpath syntax extensions
This patch seems to be getting ignored. Like David I'm a bit puzzled because it doesn't seem like an especially obscure or difficult patch to review. Yet it's been multiple years without even a superficial "does it meet the coding requirements" review let alone a design review. Can we get a volunteer to at least give it a quick once-over? I don't think it's ideal to be doing this in the last CF but neither is it very appetizing to just shift it to the next CF without a review after two years... On Thu, 27 Feb 2020 at 10:58, Nikita Glukhov wrote: > > Hi, hackers! > > Attached patches implement several useful jsonpath syntax extensions. > I already published them two years ago in the original SQL/JSON thread, > but then after creation of separate threads for SQL/JSON functions and > JSON_TABLE I forgot about them. > > A brief description of the patches: > > 1. Introduced new jsonpath modifier 'pg' which is used for enabling > PostgreSQL-specific extensions. This feature was already proposed in the > discussion of jsonpath's like_regex implementation. > > 2. Added support for raw jbvObject and jbvArray JsonbValues inside jsonpath > engine. Now, jsonpath can operate with JSON arrays and objects only in > jbvBinary form. But with introduction of array and object constructors in > patches #4 and #5 raw in-memory jsonb containers can appear in jsonpath > engine. > In some places we can iterate through jbvArrays, in others we need to encode > jbvArrays and jbvObjects into jbvBinay. > > 3. SQL/JSON sequence construction syntax. A simple comma-separated list can be > used to concatenate single values or sequences into a single resulting > sequence. > > SELECT jsonb_path_query('[1, 2, 3]', 'pg $[*], 4, 2 + 3'); > jsonb_path_query > -- > 1 > 2 > 3 > 4 > 5 > > SELECT jsonb_path_query('{ "a": [1, 2, 3], "b": [4, 5] }', > 'pg ($.a[*], $.b[*]) ? (@ % 2 == 1)'); >jsonb_path_query > -- > 1 > 3 > 5 > > > Patches #4-#6 implement ECMAScript-like syntax constructors and accessors: > > 4. Array construction syntax. > This can also be considered as enclosing a sequence constructor into brackets. > > SELECT jsonb_path_query('[1, 2, 3]', 'pg [$[*], 4, 2 + 3]'); > jsonb_path_query > -- > [1, 2, 3, 4, 5] > > Having this feature, jsonb_path_query_array() becomes somewhat redundant. > > > 5. Object construction syntax. It is useful for constructing derived objects > from the interesting parts of the original object. (But this is not > sufficient > to "project" each object in array, item method like '.map()' is needed here.) > > SELECT jsonb_path_query('{"b": 2}', 'pg { a : 1, b : $.b, "x y" : $.b + 3 > }'); > jsonb_path_query > --- > { "a" : 1, "b": 3, "x y": 5 } > > Fields with empty values are simply skipped regardless of lax/strict mode: > > SELECT jsonb_path_query('{"a": 1}', 'pg { b : $.b, a : $.a ? (@ > 1) }'); > jsonb_path_query > -- > {} > > > 6. Object subscription syntax. This gives us ability to specify what key to > extract on runtime. The syntax is the same as ordinary array subscription > syntax. > > -- non-existent $.x is simply skipped in lax mode > SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$.a, "x", "a"]'); > jsonb_path_query > -- > "c" > "b" > > SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$fld]', '{"fld": > "b"}'); > jsonb_path_query > -- > "c" > > -- > Nikita Glukhov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company -- greg
Re: jsonpath syntax extensions
Hm. Actually... These changes were split off from the JSON_TABLE patches? Are they still separate or have they been merged into those other patches since? I see the JSON_TABLE thread is getting more comments do those reviews include these patches? On Mon, 21 Mar 2022 at 16:09, Greg Stark wrote: > > This patch seems to be getting ignored. Like David I'm a bit puzzled > because it doesn't seem like an especially obscure or difficult patch > to review. Yet it's been multiple years without even a superficial > "does it meet the coding requirements" review let alone a design > review. > > Can we get a volunteer to at least give it a quick once-over? I don't > think it's ideal to be doing this in the last CF but neither is it > very appetizing to just shift it to the next CF without a review after > two years... > > On Thu, 27 Feb 2020 at 10:58, Nikita Glukhov wrote: > > > > Hi, hackers! > > > > Attached patches implement several useful jsonpath syntax extensions. > > I already published them two years ago in the original SQL/JSON thread, > > but then after creation of separate threads for SQL/JSON functions and > > JSON_TABLE I forgot about them. > > > > A brief description of the patches: > > > > 1. Introduced new jsonpath modifier 'pg' which is used for enabling > > PostgreSQL-specific extensions. This feature was already proposed in the > > discussion of jsonpath's like_regex implementation. > > > > 2. Added support for raw jbvObject and jbvArray JsonbValues inside jsonpath > > engine. Now, jsonpath can operate with JSON arrays and objects only in > > jbvBinary form. But with introduction of array and object constructors in > > patches #4 and #5 raw in-memory jsonb containers can appear in jsonpath > > engine. > > In some places we can iterate through jbvArrays, in others we need to encode > > jbvArrays and jbvObjects into jbvBinay. > > > > 3. SQL/JSON sequence construction syntax. A simple comma-separated list can > > be > > used to concatenate single values or sequences into a single resulting > > sequence. > > > > SELECT jsonb_path_query('[1, 2, 3]', 'pg $[*], 4, 2 + 3'); > > jsonb_path_query > > -- > > 1 > > 2 > > 3 > > 4 > > 5 > > > > SELECT jsonb_path_query('{ "a": [1, 2, 3], "b": [4, 5] }', > > 'pg ($.a[*], $.b[*]) ? (@ % 2 == 1)'); > >jsonb_path_query > > -- > > 1 > > 3 > > 5 > > > > > > Patches #4-#6 implement ECMAScript-like syntax constructors and accessors: > > > > 4. Array construction syntax. > > This can also be considered as enclosing a sequence constructor into > > brackets. > > > > SELECT jsonb_path_query('[1, 2, 3]', 'pg [$[*], 4, 2 + 3]'); > > jsonb_path_query > > -- > > [1, 2, 3, 4, 5] > > > > Having this feature, jsonb_path_query_array() becomes somewhat redundant. > > > > > > 5. Object construction syntax. It is useful for constructing derived > > objects > > from the interesting parts of the original object. (But this is not > > sufficient > > to "project" each object in array, item method like '.map()' is needed > > here.) > > > > SELECT jsonb_path_query('{"b": 2}', 'pg { a : 1, b : $.b, "x y" : $.b + 3 > > }'); > > jsonb_path_query > > --- > > { "a" : 1, "b": 3, "x y": 5 } > > > > Fields with empty values are simply skipped regardless of lax/strict mode: > > > > SELECT jsonb_path_query('{"a": 1}', 'pg { b : $.b, a : $.a ? (@ > 1) }'); > > jsonb_path_query > > -- > > {} > > > > > > 6. Object subscription syntax. This gives us ability to specify what key to > > extract on runtime. The syntax is the same as ordinary array subscription > > syntax. 
> > > > -- non-existent $.x is simply skipped in lax mode > > SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$.a, "x", "a"]'); > > jsonb_path_query > > -- > > "c" > > "b" > > > > SELECT jsonb_path_query('{"a": "b", "b": "c"}', 'pg $[$fld]', '{"fld": > > "b"}'); > > jsonb_path_query > > -- > > "c" > > > > -- > > Nikita Glukhov > > Postgres Professional: http://www.postgrespro.com > > The Russian Postgres Company > > > > -- > greg -- greg
Re: Temporary tables versus wraparound... again
No problem, I can update the patch and check on the fuzz. But the actual conflict is just in the test and I'm not sure it's really worth having a test at all. It's testing a pretty low level detail. So I'm leaning toward fixing the conflict by just ripping the test out. Nathan also pointed out there was a simpler way to get the pid. I don't think the way I was doing it was wrong but I'll double check that.
Re: Commitfest manager for 2022-03
On Mon, 21 Mar 2022 at 21:48, Andres Freund wrote:
>
> Hi,
>
> On 2022-02-26 16:12:27 -0500, Greg Stark wrote:
> > I do have the time available. What I don't necessarily have is insight
> > into everything that needs to be done, especially behind the scenes.
>
> One thing that desperately needs doing, particularly during the last
> commitfest, is looking through CF entries and pruning stuff that shouldn't be
> there anymore or that are in the wrong state.

Thanks

> One can't just automatically mark all failing runs as "waiting on author"
> because sometimes there are issues with somebody else posting an incremental
> diff confusing cfbot or spurious test failures...

What I'm seeing is patches that are failing with either the 027_stream_regress.pl failure -- which I see is being actively worked on in another thread -- or simply a timeout, which I'm not sure about but may be the same issue. But I'll do a pass and then do another pass later in the week when those problems may have been ironed out.

-- greg
Re: Temporary tables versus wraparound... again
Here's a rebased patch. I split the test into a separate patch that I would lean to dropping. But at least it applies now. I did look into pg_stat_get_backend_pid() and I guess it would work but going through the stats mechanism does seem like going the long way around since we're already looking at the backendId info directly here, we just weren't grabbing the pid. I did make a small change, I renamed the checkTempNamespaceStatus() function to checkTempNamespaceStatusAndPid(). It seems unlikely there are any external consumers of this function (the only internal consumer is autovacuum.c). But just in case I renamed it to protect against any external modules failing from the added parameter. From eb6ec2edfcb10aafc3874262276638932a97add7 Mon Sep 17 00:00:00 2001 From: Greg Stark Date: Tue, 22 Mar 2022 15:54:59 -0400 Subject: [PATCH v3 2/2] Add test for truncating temp tables advancing relfrozenxid This test depends on other transactions not running at the same time so that relfrozenxid can advance so it has to be moved to its own schedule. --- src/test/regress/expected/temp.out | 21 + src/test/regress/parallel_schedule | 10 +++--- src/test/regress/sql/temp.sql | 19 +++ 3 files changed, 47 insertions(+), 3 deletions(-) diff --git a/src/test/regress/expected/temp.out b/src/test/regress/expected/temp.out index a5b3ed34a3..1fee5521af 100644 --- a/src/test/regress/expected/temp.out +++ b/src/test/regress/expected/temp.out @@ -82,6 +82,27 @@ SELECT * FROM temptest; - (0 rows) +DROP TABLE temptest; +-- Test that ON COMMIT DELETE ROWS resets the relfrozenxid when the +-- table is truncated. This requires this test not be run in parallel +-- with other tests as concurrent transactions will hold back the +-- globalxmin +CREATE TEMP TABLE temptest(col int) ON COMMIT DELETE ROWS; +SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset old_ +BEGIN; +INSERT INTO temptest (select generate_series(1,1000)); +ANALYZE temptest; -- update relpages, reltuples +COMMIT; +SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset new_ +SELECT :old_relpages = :new_relpages AS pages_reset, + :old_reltuples = :new_reltuples AS tuples_reset, + :old_relallvisible = :new_relallvisible AS allvisible_reset, + :old_relfrozenxid <> :new_relfrozenxid AS frozenxid_advanced; + pages_reset | tuples_reset | allvisible_reset | frozenxid_advanced +-+--+--+ + t | t| t| t +(1 row) + DROP TABLE temptest; -- Test ON COMMIT DROP BEGIN; diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule index 6d8f524ae9..f919c2f978 100644 --- a/src/test/regress/parallel_schedule +++ b/src/test/regress/parallel_schedule @@ -116,10 +116,14 @@ test: json jsonb json_encoding jsonpath jsonpath_encoding jsonb_jsonpath # -- # Another group of parallel tests # with depends on create_misc -# NB: temp.sql does a reconnect which transiently uses 2 connections, -# so keep this parallel group to at most 19 tests # -- -test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml +test: plancache limit plpgsql copy2 domain rangefuncs prepare conversion truncate alter_table sequence polymorphism rowtypes returning largeobject with xml + +# -- +# Run this alone because it transiently uses 2 connections and also +# tests relfrozenxid advances when truncating temp tables +# -- +test: temp # -- # Another group of parallel tests diff --git 
a/src/test/regress/sql/temp.sql b/src/test/regress/sql/temp.sql index 424d12b283..5f0c39b5e7 100644 --- a/src/test/regress/sql/temp.sql +++ b/src/test/regress/sql/temp.sql @@ -79,6 +79,25 @@ SELECT * FROM temptest; DROP TABLE temptest; +-- Test that ON COMMIT DELETE ROWS resets the relfrozenxid when the +-- table is truncated. This requires this test not be run in parallel +-- with other tests as concurrent transactions will hold back the +-- globalxmin +CREATE TEMP TABLE temptest(col int) ON COMMIT DELETE ROWS; + +SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset old_ +BEGIN; +INSERT INTO temptest (select generate_series(1,1000)); +ANALYZE temptest; -- update relpages, reltuples +COMMIT; +SELECT relpages, reltuples, relallvisible, relfrozenxid FROM pg_class where oid = 'temptest'::regclass \gset new_ +SELECT :old_relpages = :new_relpages AS pages_reset, + :old_reltuples = :new_reltuples AS tuples_reset, + :old_relallvisible = :new_relallvisible AS allvisible_reset, + :old_relfrozenxid <> :new_relfrozenxid
Re: pg13.2: invalid memory alloc request size NNNN
I think to get a size of -4 you would be trying to read a varlena pointer pointing to four nul bytes. I bet if you run dd on the corresponding block you'll find a chunk of nuls in the page. That perhaps makes sense with ZFS where if a new page was linked to the tree but never written it would be an uninitialized page rather than the old data. I'm becoming increasingly convinced that there are a lot of storage systems out there that just lose data whenever they crash or lose power. Systems that are supposed to be better than that.
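To spell out the arithmetic behind the -4 guess: a 4-byte varlena header that is entirely zero decodes as a total length of 0, so a length-minus-header computation comes out to -4. A standalone toy illustration (this mimics the little-endian 4-byte header decoding; it is not PostgreSQL source):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t header = 0;        /* four NUL bytes where a varlena header should be */
        int32_t total_len = (header >> 2) & 0x3FFFFFFF; /* decoded total length: 0 */
        int32_t varhdrsz = 4;

        /* this is the kind of value that ends up in the allocation request */
        printf("data length = %d\n", total_len - varhdrsz); /* prints -4 */
        return 0;
    }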
Re: Asynchronous and "direct" IO support for PostgreSQL.
On Tue, 23 Feb 2021 at 05:04, Andres Freund wrote: > > ## Callbacks > > In the core AIO pieces there are two different types of callbacks at the > moment: > > Shared callbacks, which can be invoked by any backend (normally the issuing > backend / the AIO workers, but can be other backends if they are waiting for > the IO to complete). For operations on shared resources (e.g. shared buffer > reads/writes, or WAL writes) these shared callback needs to transition the > state of the object the IO is being done for to completion. E.g. for a shared > buffer read that means setting BM_VALID / unsetting BM_IO_IN_PROGRESS. > > The main reason these callbacks exist is that they make it safe for a backend > to issue non-blocking IO on buffers (see the deadlock section above). As any > blocked backend can cause the IO to complete, the deadlock danger is gone. So firstly this is all just awesome work and I have questions but I don't want them to come across in any way as criticism or as a demand for more work. This is really great stuff, thank you so much! The callbacks make me curious about two questions: 1) Is there a chance that a backend issues i/o, the i/o completes in some other backend and by the time this backend gets around to looking at the buffer it's already been overwritten again? Do we have to initiate I/O again or have you found a way to arrange that this backend has the buffer pinned from the time the i/o starts even though it doesn't handle the comletion? 2) Have you made (or considered making) things like sequential scans (or more likely bitmap index scans) asynchronous at a higher level. That is, issue a bunch of asynchronous i/o and then handle the pages and return the tuples as the pages arrive. Since sequential scans and bitmap scans don't guarantee to read the pages in order they're generally free to return tuples from any page in any order. I'm not sure how much of a win that would actually be since all the same i/o would be getting executed and the savings in shared buffers would be small but if there are mostly hot pages you could imagine interleaving a lot of in-memory pages with the few i/os instead of sitting idle waiting for the async i/o to return. > ## Stats > > There are two new views: pg_stat_aios showing AIOs that are currently > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO. This is impressive. How easy is it to correlate with system aio stats? -- greg
Re: Asynchronous and "direct" IO support for PostgreSQL.
I guess what I would be looking for in stats would be a way to tell what the bandwidth, latency, and queue depth is. Ideally one day broken down by relation/index and pg_stat_statement record. I think seeing the actual in flight async requests in a connection is probably not going to be very useful in production. It's very low level and in production the user is just going to find that overwhelming detail. It is kind of cool to see the progress in sequential operations but I think that should be solved in a higher level way than this anyways. What we need to calculate these values would be the kinds of per-op stats nfsiostat uses from /proc/self/mountstats: https://utcc.utoronto.ca/~cks/space/blog/linux/NFSMountstatsNFSOps So number of async reads we've initiated, how many callbacks have been called, total cumulative elapsed time between i/o issued and i/o completed, total bytes of i/o initiated, total bytes of i/o completed. As well a counter of requests which returned errors (eof? i/o error?) If there are other locks or queues internally to postgres total time spent in those states. I have some vague idea that we should have a generic infrastructure for stats that automatically counts things associated with plan nodes and automatically bubbles that data up to the per-transaction, per-backend, per-relation, and pg_stat_statements stats. But that's a whole other ball of wax :)
Re: SSL SNI
Hate to be that guy, but this still doesn't seem like it is IPv6-ready. Is there any harm in having SNI with an IPv6 address there if it gets through?
Re: SSL SNI
> Do you mean the IPv6 detection code is not correct? What is the problem?

This bit will recognize IPv4 addresses but not IPv6 addresses:

+	/*
+	 * Set Server Name Indication (SNI), but not if it's a literal IP address.
+	 * (RFC 6066)
+	 */
+	if (!(strspn(conn->pghost, "0123456789.") == strlen(conn->pghost) ||
+		  strchr(conn->pghost, ':')))
+	{
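If the goal is to make the literal-address check explicit for both address families, one option would be something along these lines. This is only a sketch of the idea, not the code libpq actually uses, and it ignores details like bracketed IPv6 literals:

    #include <stdbool.h>
    #include <arpa/inet.h>

    /* Return true if "host" parses as an IPv4 or IPv6 literal, in which
     * case SNI should be skipped per RFC 6066. */
    static bool
    host_is_ip_literal(const char *host)
    {
        struct in_addr  a4;
        struct in6_addr a6;

        return inet_pton(AF_INET, host, &a4) == 1 ||
               inet_pton(AF_INET6, host, &a6) == 1;
    }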
Temporary tables versus wraparound... again
We had an outage caused by transaction wraparound. And yes, one of the first things I did on this site was check that we didn't have any databases that were in danger of wraparound. However since then we added a monitoring job that used a temporary table with ON COMMIT DELETE ROWS. Since it was a simple monitoring job it stayed connected to the database and used this small temporary table for a very long period of time. The temporary table never got vacuumed by autovacuum and never by the monitoring job (since it was being truncated on every transaction why would it need to be vacuumed...). We've been around this bush before. Tom added orphaned table protection to autovacuum precisely because temporary tables can cause the datfrozenxid to get held back indefinitely. Then Michael Paquier and Tsunakawa Takayuki both found it worth making this more aggressive. But none of that helped as the temporary schema was still in use so they were not considered "orphaned" temp tables at all. I think we need to add some warnings to autovacuum when it detects *non* orphaned temporary tables that are older than the freeze threshold. However in the case of ON COMMIT DELETE ROWS we can do better. Why not just reset the relfrozenxid and other stats as if the table was freshly created when it's truncated? I put together this quick patch to check the idea and it seems to integrate fine in the code. I'm not sure about a few points but I don't think they're showstoppers. 1) Should we update relpages and reltuples. I think so but an argument could be made that people might be depending on running analyze once when the data is loaded and then not having to run analyze on every data load. 2) adding the dependency on heapam.h to heap.c makes sense because of heap_inplace_update bt it may be a bit annoying because I suspect that's a useful sanity check that the tableam stuff hasn't been bypassed 3) I added a check to the regression tests but I'm not sure it's a good idea to actually commit this. It could fail if there's a parallel transaction going on and even moving the test to the serial schedule might not guarantee that never happens due to autovacuum running analyze? I didn't actually add the warning to autovacuum yet. -- greg From 76eb00c43fba2a293dc4a079307e675e0eeaff06 Mon Sep 17 00:00:00 2001 From: Greg Stark Date: Sun, 8 Nov 2020 11:54:50 -0500 Subject: [PATCH] update relfrozenxmin when truncating temp tables --- src/backend/catalog/heap.c | 45 ++ src/test/regress/expected/temp.out | 21 ++ src/test/regress/parallel_schedule | 9 +--- src/test/regress/sql/temp.sql | 19 4 files changed, 91 insertions(+), 3 deletions(-) diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 4cd7d76..ffe36bb 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -30,6 +30,7 @@ #include "postgres.h" #include "access/genam.h" +#include "access/heapam.h" #include "access/htup_details.h" #include "access/multixact.h" #include "access/relation.h" @@ -3277,6 +3278,48 @@ RelationTruncateIndexes(Relation heapRelation) } /* + * Reset the relfrozenxid and other stats to the same values used when + * creating tables. This is used after non-transactional truncation. + * + * This reduces the need for long-running programs to vacuum their own + * temporary tables (since they're not covered by autovacuum) at least in the + * case where they're ON COMMIT DELETE ROWS. 
+ * + * see also src/backend/commands/vacuum.c vac_update_relstats() + * also see AddNewRelationTuple() above + */ + +static void +ResetVacStats(Relation rel) +{ + HeapTuple ctup; + Form_pg_class pgcform; + Relation classRel; + + /* Fetch a copy of the tuple to scribble on */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + ctup = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(RelationGetRelid(rel))); + if (!HeapTupleIsValid(ctup)) + elog(ERROR, "pg_class entry for relid %u vanished during truncation", + RelationGetRelid(rel)); + pgcform = (Form_pg_class) GETSTRUCT(ctup); + + /* + * Update relfrozenxid + */ + + pgcform->relpages = 0; + pgcform->reltuples = 0; + pgcform->relallvisible = 0; + pgcform->relfrozenxid = RecentXmin; + pgcform->relminmxid = GetOldestMultiXactId(); + + heap_inplace_update(classRel, ctup); + + table_close(classRel, RowExclusiveLock); +} + +/* * heap_truncate * * This routine deletes all data within all the specified relations. @@ -3340,6 +3383,7 @@ heap_truncate_one_rel(Relation rel) /* Truncate the underlying relation */ table_relation_nontransactional_truncate(rel); + ResetVacStats(rel); /* If the relation has indexes, truncate the indexes too */ RelationTruncateIndexes(rel); @@ -3351,6 +3395,7 @@ heap_truncate_on
Re: Temporary tables versus wraparound... again
On Mon, 9 Nov 2020 at 00:17, Noah Misch wrote: > > > 2) adding the dependency on heapam.h to heap.c makes sense because of > > heap_inplace_update bt it may be a bit annoying because I suspect > > that's a useful sanity check that the tableam stuff hasn't been > > bypassed > > That is not terrible. How plausible would it be to call vac_update_relstats() > for this, instead of reimplementing part of it? It didn't seem worth it to change its API to add boolean flags to skip setting some of the variables (I was originally only doing relfrozenxid and minmmxid). Now that I'm doing most of the variables maybe it makes a bit more sense. > > @@ -3340,6 +3383,7 @@ heap_truncate_one_rel(Relation rel) > > > > /* Truncate the underlying relation */ > > table_relation_nontransactional_truncate(rel); > > + ResetVacStats(rel); > > I didn't test, but I expect this will cause a stats reset for the second > TRUNCATE here: > > CREATE TABLE t (); > ... > BEGIN; > TRUNCATE t; > TRUNCATE t; -- inplace relfrozenxid reset > ROLLBACK; -- inplace reset survives > > Does that indeed happen? Apparently no, see below. I have to say I was pretty puzzled by the actual behaviour which is that the rollback actually does roll back the inplace update. But I *think* what is happening is that the first truncate does an MVCC update so the inplace update happens only to the newly created tuple which is never commited. Thinking about things a bit this does worry me a bit. I wonder if inplace update is really safe outside of vacuum where we know we're not in a transaction that can be rolled back. But IIRC doing a non-inplace update on pg_class for these columns breaks other things. I don't know if that's still true. Also, in checking this question I realized I had missed 3d351d91. I should be initializing reltuples to -1 not 0. postgres=# vacuum t; VACUUM postgres=# select relname,relpages,reltuples,relallvisible,relfrozenxid from pg_class where oid='t'::regclass; relname | relpages | reltuples | relallvisible | relfrozenxid -+--+---+---+-- t |9 | 2000 | 9 |15557 (1 row) postgres=# begin; BEGIN postgres=*# truncate t; TRUNCATE TABLE postgres=*# truncate t; TRUNCATE TABLE postgres=*# select relname,relpages,reltuples,relallvisible,relfrozenxid from pg_class where oid='t'::regclass; relname | relpages | reltuples | relallvisible | relfrozenxid -+--+---+---+-- t |0 | 0 | 0 |15562 (1 row) postgres=*# abort; ROLLBACK postgres=# select relname,relpages,reltuples,relallvisible,relfrozenxid from pg_class where oid='t'::regclass; relname | relpages | reltuples | relallvisible | relfrozenxid -+--+---+---+-- t |9 | 2000 | 9 |15557 (1 row) -- greg
Re: don't allocate HashAgg hash tables when running explain only
On Wed, 18 Nov 2020 at 05:40, Heikki Linnakangas wrote: > > On 13/11/2020 18:10, Alexey Bashtanov wrote: >> > I would appreciate if someone could have a look at the patch attached, > > which makes executor skip initializing hash tables when doing explain only. > > Makes sense. Committed, thanks for the patch! Egads. That seems like a backpatchable bug fix to me. Have we been doing this all along?! -- greg
Re: Is postgres ready for 2038?
On Wed, 18 Nov 2020 at 12:22, Tom Lane wrote: > > Andrew Dunstan writes: > > On 11/18/20 9:44 AM, Tom Lane wrote: > >> Hmm. Digging around, I see that Mkvcbuild.pm intentionally absorbs > >> _USE_32BIT_TIME_T when building with a Perl that defines that. > >> I don't know what the state of play is in terms of Windows Perl > >> distributions getting off of that, but maybe we should press people > >> to not be using such Perl builds. > > > I think there's a good argument to ban it if we're doing a 64 bit build > > (and why would we do anything else?) > > I'm not really eager to ban it. If somebody is building with an old > Perl distribution, and doesn't particularly care that the installation > will break in 2038, why should we force them to care? Wait, is configuring with a Perl that has 32-bit time_t driving the rest of Postgres to use 32-bit timestamps? That seems like the tail wagging the dog. It seems like a sensible compromise would be to have Postgres's configure default to 64-bit time_t and have a flag to choose 32-bit time_t and then have a configure check that errors out if the time_t in Perl doesn't match with a hint to either find a newer Perl distribution or configure with the flag to choose 32-bit. Ie, don't silently assume users want 32-bit time_t but leave them the option to choose it explicitly. -- greg
Re: Proposal: Save user's original authenticated identity for logging
On Fri, 29 Jan 2021 at 18:41, Tom Lane wrote: > > Ah. So basically, this comes into play when you consider that some > outside-the-database entity is your "real" authenticated identity. > That seems reasonable when using Kerberos or the like, though it's > not real meaningful for traditional password-type authentication. > I'd misunderstood your point before. I wonder if there isn't room to handle this the other way around. To configure Postgres to not need a CREATE ROLE for every role but delegate the user management to the external authentication service. So Postgres would consider the actual role to be the one kerberos said it was even if that role didn't exist in pg_role. Presumably you would want to delegate to a corresponding authorization system as well so if the role was absent from pg_role (or more likely fit some pattern) Postgres would ignore pg_role and consult the authorization system configured like AD or whatever people use with Kerberos these days. -- greg
Re: [PATCH] Document heuristics for parameterized paths
On Mon, 6 Dec 2021 at 13:01, Steinar H. Gunderson wrote: > > +one that must cannot be delayed right away (because of outer join must cannot? -- greg
Re: [PATCH] Document heuristics for parameterized paths
On Wed, 15 Dec 2021 at 23:22, Greg Stark wrote: > > On Mon, 6 Dec 2021 at 13:01, Steinar H. Gunderson > wrote: > > > > +one that must cannot be delayed right away (because of outer join > > must cannot? Actually on further reading... "delayed right away"? -- greg
Re: Transparent column encryption
> In the server, the encrypted datums are stored in types called > encryptedr and encryptedd (for randomized and deterministic > encryption). These are essentially cousins of bytea. Does that mean someone could go in with psql and select out the data without any keys and just get a raw bytea-like representation? That seems like a natural and useful thing to be able to do. For example to allow dumping a table and loading it elsewhere and transferring keys through some other channel (perhaps only as needed).
Re: Documenting when to retry on serialization failure
Fwiw I think the real problem with automatic retries is that the SQL interface doesn't lend itself to it because the server never really knows if the command is going to be followed by a commit or more commands. I actually think if that problem were tackled it would very likely be a highly appreciated option. Because I think there's a big overlap between the set of users interested in higher isolation levels and the set of users writing stored procedures defining their business logic. They're both kind of "traditional" SQL engine approaches and both lend themselves to the environment where you have a lot of programmers working on a system and you're not able to do things like define strict locking and update orderings. So a lot of users are probably looking at something like "BEGIN; SELECT create_customer_order(); COMMIT" and wondering why the server can't handle automatically retrying the query if they get an isolation failure. There are actually other reasons why providing the whole logic for the transaction up front with a promise that it'll be the whole transaction is attractive. E.g. vacuum could ignore a transaction if it knows the transaction will never look at the table it's processing... Or automatic deadlock testing tools could extract the list of tables being accessed and suggest "lock table" commands to put at the head of the transaction sorted in a canonical order. These things may not be easy but they're currently impossible for the same reasons automatically retrying is. The executor doesn't know what subsequent commands will be coming after the current one and doesn't know whether it has the whole transaction.
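To make the status quo concrete, this is roughly what such a user runs into today (a sketch only; create_customer_order() is the hypothetical business-logic function above, and the error shown is the usual serialization_failure, SQLSTATE 40001):

BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT create_customer_order();   -- hypothetical stored procedure holding the business logic
COMMIT;
-- ERROR:  could not serialize access due to read/write dependencies among transactions
-- At this point the transaction is already aborted, so only the client can
-- reissue the whole BEGIN/SELECT/COMMIT sequence. The server never had the
-- complete transaction in hand, which is exactly why it can't retry it itself.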
Re: Add index scan progress to pg_stat_progress_vacuum
I had a similar question. And I'm still not clear from the response what exactly index_blks_total is and whether it addresses it. I think I agree that a user is likely to want to see the progress in a way they can understand which means for a single index at a time. I think what you're describing is that index_blks_total and index_blks_scanned are the totals across all the indexes? That isn't clear from the definitions but if that's what you intend then I think that would work. (For what it's worth what I was imagining was having a pair of counters for blocks scanned and max blocks in this index and a second counter for number of indexes processed and max number of indexes. But I don't think that's necessarily any better than what you have)
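To spell out what I'd want to be able to do with those totals (just a sketch -- index_blks_total and index_blks_scanned here are the columns this patch proposes, not existing ones; the other columns are already in pg_stat_progress_vacuum):

SELECT pid,
       relid::regclass AS table_name,
       phase,
       round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 1)   AS heap_pct,
       round(100.0 * index_blks_scanned / nullif(index_blks_total, 0), 1) AS index_pct
FROM pg_stat_progress_vacuum;

If those two columns really are totals across all of the table's indexes then a query like that gives a single meaningful percentage, which I think is what users actually want.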
Re: WIP: WAL prefetch (another approach)
On Fri, 26 Nov 2021 at 21:47, Tom Lane wrote:
>
> Yeah ... on the one hand, that machine has shown signs of
> hard-to-reproduce flakiness, so it's easy to write off the failures
> I saw as hardware issues. On the other hand, the flakiness I've
> seen has otherwise manifested as kernel crashes, which is nothing
> like the consistent test failures I was seeing with the patch.

Hm. I asked around and found a machine I can use that can run PPC
binaries, but it's actually, well, confusing. I think this is an x86
machine running Leopard which uses JIT to transparently run PPC
binaries. I'm not sure this is really a good test. But if you're
interested and can explain the tests to run I can try to get the
tests running on this machine:

IBUILD:~ gsstark$ uname -a
Darwin IBUILD.MIT.EDU 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
IBUILD:~ gsstark$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.5.8
BuildVersion:   9L31a
Re: WIP: WAL prefetch (another approach)
The actual hardware of this machine is a Mac Mini Core 2 Duo. I'm not
really clear how the emulation is done and whether it makes a
reasonable test environment or not.

Hardware Overview:

  Model Name: Mac mini
  Model Identifier: Macmini2,1
  Processor Name: Intel Core 2 Duo
  Processor Speed: 2 GHz
  Number Of Processors: 1
  Total Number Of Cores: 2
  L2 Cache: 4 MB
  Memory: 2 GB
  Bus Speed: 667 MHz
  Boot ROM Version: MM21.009A.B00
Re: Allow DELETE to use ORDER BY and LIMIT/OFFSET
On Thu, 16 Dec 2021 at 22:18, Tom Lane wrote:
>
> * If the sort order is underspecified, or you omit ORDER BY
> entirely, then it's not clear which rows will be operated on.
> The LIMIT might stop after just some of the rows in a peer
> group, and you can't predict which ones.

Meh, that never seemed very compelling to me. I think that's on the
user and there are *lots* of cases where the user can easily know
enough extra context to know that's what she wants. In particular I've
often wanted to delete one of two identical records and it would have
been really easy to just do a DELETE LIMIT 1. I know there are ways to
do it but it's always seemed like unnecessary hassle when there was a
natural way to express it.

> * UPDATE/DELETE necessarily involve the equivalent of SELECT
> FOR UPDATE, which may cause the rows to be ordered more
> surprisingly than you expected, ie the sort happens *before*
> rows are replaced by their latest versions, which might have
> different sort keys.

This... is a real issue. Or is it? Just to be clear I think what
you're referring to is a case like:

INSERT INTO t values (1),(2)

In another session:
BEGIN;
UPDATE t set c=0 where c=2

DELETE FROM t ORDER BY c ASC LIMIT 1

In other session:
COMMIT

Which row was deleted? In this case it was the row where c=1. Even
though the UPDATE reported success (i.e. 1 row updated) so presumably
it happened "before" the delete. The delete saw the ordering from
before it was blocked and saw the ordering with c=1, c=2 so apparently
it happened "before" the update.

There are plenty of other ways to see the same surprising
serialization failure. If instead of a DELETE you did an UPDATE there
are any number of functions or other features that have been added
which can expose the order in which the updates happened.

The way to solve those cases would be to use serializable isolation
(or even just repeatable read in this case). Wouldn't that also solve
the DELETE serialization failure too?

-- greg
Re: pg_dump versus ancient server versions
On Fri, 22 Oct 2021 at 19:27, Robert Haas wrote:
>
> Another thing to think about in that regard: how likely is that
> PostgreSQL 7.4 and PostgreSQL 15 both compile and run on the same
> operating system? I suspect the answer is "not very." I seem to recall
> Greg Stark trying to compile really old versions of PostgreSQL for a
> conference talk some years ago, and he got back to a point where it
> just became impossible to make work on modern toolchains even with a
> decent amount of hackery.

That was when I compared sorting performance over time. I was able to
get Postgres to build back to the point where 64-bit architecture
support was added. From Andrew Dunstan's comment later in this thread
I'm guessing that was 7.2, which was basically the point where 64-bit
architecture support went in. It looks like the earliest date on the
graphs in the talk is 2002-11-27, which matches the 7.3 release date.

I think building earlier versions would have been doable if I had
built them in 32-bit mode.

-- greg
Re: WIP: WAL prefetch (another approach)
What tools and tool versions are you using to build? Is it just GCC
for PPC? There aren't any special build processes to make a fat binary
involved?

On Thu, 16 Dec 2021 at 23:11, Tom Lane wrote:
>
> Greg Stark writes:
> > But if you're interested and can explain the tests to run I can try to
> > get the tests running on this machine:
>
> I'm not sure that machine is close enough to prove much, but by all
> means give it a go if you wish. My test setup was explained in [1]:
>
> >> To recap, the test lashup is:
> >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
> >> * Standard debug build (--enable-debug --enable-cassert)
> >> * Out-of-the-box configuration, except add wal_consistency_checking = all
> >> and configure a wal-streaming standby on the same machine
> >> * Repeatedly run "make installcheck-parallel", but skip the tablespace
> >> test to avoid issues with the standby trying to use the same directory
> >> * Delay long enough after each installcheck-parallel to let the
> >> standby catch up (the run proper is ~24 min, plus 2 min for catchup)
>
> Remember also that the code in question is not in HEAD; you'd
> need to apply Munro's patches, or check out some commit from
> around 2021-04-22.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us

-- greg
Re: WIP: WAL prefetch (another approach)
I have:

IBUILD:postgresql gsstark$ ls /usr/bin/*gcc*
/usr/bin/gcc
/usr/bin/gcc-4.0
/usr/bin/gcc-4.2
/usr/bin/i686-apple-darwin9-gcc-4.0.1
/usr/bin/i686-apple-darwin9-gcc-4.2.1
/usr/bin/powerpc-apple-darwin9-gcc-4.0.1
/usr/bin/powerpc-apple-darwin9-gcc-4.2.1

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1
or maybe 4.0.1. What version is on your G4?
Re: WIP: WAL prefetch (another approach)
Hm. I seem to have picked a bad checkout. I took the last one before
the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some
incompatibility with the emulation and the IPC stuff parallel workers
use.

2021-12-17 17:51:51.688 EST [50955] LOG:  background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL:  Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8) FROM (SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1 UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG:  terminating any other active server processes
2021-12-17 17:51:51.748 EST [54078] FATAL:  the database system is in recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG:  all server processes terminated; reinitializing
Re: WIP: WAL prefetch (another approach)
On Fri, 17 Dec 2021 at 18:40, Tom Lane wrote:
>
> Greg Stark writes:
> > Hm. I seem to have picked a bad checkout. I took the last one before
> > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc).
>
> FWIW, I think that's the first one *after* the revert.

Doh.

But the bigger question is: are we really concerned about this flaky
problem? Is it worth investing time and money in? I can get money to
go buy a G4 or G5 and spend some time on it. It just seems a bit...
niche.

But if it's a real bug that represents something broken on other
architectures that just happens to be easier to trigger here it might
be worthwhile.

-- greg
Re: Getting rid of regression test input/ and output/ files
On Sun, 19 Dec 2021 at 18:41, Corey Huinker wrote: > > Which brings up a tangential question, is there value in having something > that brings in one or more env vars as psql vars directly. I'm thinking > something like: > > \importenv pattern [prefix] Oof. That gives me the security heebie jeebies. Off the top of my head PHP, CGI, SSH have all dealt with vulnerabilities caused by accidentally importing variables they didn't intend to. -- greg
Re: Unifying VACUUM VERBOSE and log_autovacuum_min_duration output
I haven't read the patch yet. But some thoughts based on the posted output:

1) At first I was quite skeptical about losing the progress reporting.
I've often found it quite useful. But looking at the examples I'm
convinced. Or rather, I think a better way to look at it is that the
progress output for the operator should be separated from the metrics
logged. As an operator what I want to see is some progress indicator
like "starting table scan", "overflow at x% of table scanned, starting
index scan", "processing index 1", "index 2"... so I can have some
idea of how much longer the vacuum will take and see whether I need to
raise maintenance_work_mem and by how much. I don't need to see all
the metrics while it's running.

2) I don't much like the format. I want to be able to parse the output
with awk or mtail or even just grep for relevant lines. Things like
"index scan not needed" make it hard to parse since you don't know
what it will look like if they are needed. I would have expected
something like "index scans: 0", which is actually already there up
above. I'm not clear how this line is meant to be read. Is it just
explaining *why* the index scan was skipped? Would it just be missing
entirely if it wasn't skipped?

Fwiw, having it be parsable is why I wouldn't want it to be multiple
ereports. That would mean it could get interleaved with other errors
from other backends. That would be a disaster.
Re: new release pspg
This is really cool. Now I just need to figure out how to integrate it
with using Emacs for my terminal. I still want to use Emacs to enter
and edit my queries, but it would be cool to be able to hit a key and
launch an xterm and send the query output to pspg.
Re: fdatasync performance problem with large number of DB files
On Wed, 10 Mar 2021 at 20:25, Tom Lane wrote: > > So this means that in less-than-bleeding-edge kernels, syncfs can > only be regarded as a dangerous toy. If we expose an option to use > it, there had better be large blinking warnings in the docs. Isn't that true for fsync and everything else related on less-than-bleeding-edge kernels anyways? -- greg
Re: New IndexAM API controlling index vacuum strategies
On Thu, 18 Mar 2021 at 14:37, Peter Geoghegan wrote: > They usually involve some *combination* of Postgres problems, > application code problems, and DBA error. Not any one thing. I've seen > problems with application code that runs DDL at scheduled intervals, > which interacts badly with vacuum -- but only really on the rare > occasions when freezing is required! What I've seen is an application that regularly ran ANALYZE on a table. This worked fine as long as vacuums took less than the interval between analyzes (in this case 1h) but once vacuum started taking longer than that interval autovacuum would cancel it every time due to the conflicting lock. That would have just continued until the wraparound vacuum which wouldn't self-cancel except that there was also a demon running which would look for sessions stuck on a lock and kill the blocker -- which included killing the wraparound vacuum. And yes, this demon is obviously a terrible idea but of course it was meant for killing buggy user queries. It wasn't expecting to find autovacuum jobs blocking things. The real surprise for that user was that VACUUM could be blocked by things that someone would reasonably want to run regularly like ANALYZE. -- greg
Re: shared memory stats: high level design decisions: consistency, dropping
On Sun, 21 Mar 2021 at 18:16, Stephen Frost wrote:
>
> Greetings,
>
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them. Only the most trivial sorts of analysis don't need
> > that.
> >
> > In short, what you are proposing sounds absolutely disastrous for
> > usability of the stats views, and I for one will not sign off on it
> > being acceptable.
> >
> > I do think we could relax the consistency guarantees a little bit,
> > perhaps along the lines of only caching view rows that have already
> > been read, rather than grabbing everything up front. But we can't
> > just toss the snapshot concept out the window. It'd be like deciding
> > that nobody needs MVCC, or even any sort of repeatable read.
>
> This isn't the same use-case as traditional tables or relational
> concepts in general- there aren't any foreign keys for the fields that
> would actually be changing across these accesses to the shared memory
> stats- we're talking about gross stats numbers like the number of
> inserts into a table, not an employee_id column. In short, I don't
> agree that this is a fair comparison.

I use these stats quite a bit and do lots of slicing and dicing with
them. I don't think it's as bad as Tom says but I also don't think we
can be quite as loosey-goosey as I think Andres or Stephen might be
proposing either (though I note that they haven't said they don't want
any consistency at all).

The cases where the consistency really matters for me are when I'm
doing math involving more than one statistic. Typically that's ratios.
E.g. with pg_stat_*_tables I routinely divide seq_tup_read by seq_scan
or idx_tup_* by idx_scans. I also often look at the ratio between
n_tup_upd and n_tup_hot_upd.

And no, it doesn't help that these are often large numbers after a
long time, because I'm actually working with the first derivative of
these numbers using snapshots or a time series database. So if you
have the seq_tup_read incremented but not seq_scan incremented you
could get a wildly incorrect calculation of "tup read per seq scan"
which actually matters.

I don't think I've ever done math across stats for different objects.
I mean, I've plotted them together and looked at which was higher but
I don't think that's affected by some plots having peaks slightly out
of sync with the others. I suppose you could look at the ratio of
access patterns between two tables and know that they're only ever
accessed by a single code path at the same time and therefore the
ratios would be meaningful. But I don't think users would be surprised
to find they're not consistent that way either.

So I think we need to ensure that at least all the values for a single
row representing a single object are consistent. Or at least that
there's *some* correct way to retrieve a consistent row and that the
standard views use that. I don't think we need to guarantee that every
possible plan will always be consistent even if you access the row
multiple times in a self-join or use the lookup function on individual
columns separately.

-- greg
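To be concrete, the kind of per-row arithmetic I mean is just this (a sketch; all of these columns exist in pg_stat_user_tables, and in practice I run it against deltas between snapshots rather than the raw counters):

SELECT relname,
       seq_tup_read::float8  / nullif(seq_scan, 0)  AS tup_per_seq_scan,
       idx_tup_fetch::float8 / nullif(idx_scan, 0)  AS tup_per_idx_scan,
       n_tup_hot_upd::float8 / nullif(n_tup_upd, 0) AS hot_update_ratio
FROM pg_stat_user_tables;

If seq_tup_read and seq_scan for the same table can come from different moments in time then those ratios silently turn into garbage, which is why I care about per-row consistency but much less about cross-object consistency.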
Re: Temporary tables versus wraparound... again
Here's an updated patch. I added some warning messages to autovacuum.

One thing I learned trying to debug this situation in production is
that it's nigh impossible to find the pid of the session using a
temporary schema. The number in the schema refers to the backendId in
the sinval stuff, for which there's no external way to look up the
corresponding pid. It would have been very helpful if autovacuum had
just told me which backend pid to kill.

I still have the regression test in the patch and as before I think
it's probably not worth committing. I'm leaning to reverting that
section of the patch before committing.

Incidentally there's still a hole here where a new session can attach
to an existing temporary schema where a table was never truncated due
to a session dying abnormally. That new session could be a long-lived
session but never use the temporary schema, causing the old table to
just sit there. Autovacuum has no way to tell it's not actually in
use.

I tend to think the optimization to defer cleaning the temporary
schema until it's used might not really be an optimization. It still
needs to be cleaned someday so it's just moving the work around. Just
removing that optimization might be the easiest way to close this
hole. The only alternative I see is adding a flag to PROC or somewhere
where autovacuum can see if the backend has actually initialized the
temporary schema yet or not.

From 1315daf80e668b03cf2aab04106fe53ad606c9d0 Mon Sep 17 00:00:00 2001
From: Greg Stark
Date: Tue, 12 Oct 2021 17:17:51 -0400
Subject: [PATCH v2] Update relfrozenxmin when truncating temp tables

Make ON COMMIT DELETE ROWS reset relfrozenxmin and other table stats
like normal truncate. Otherwise even typical short-lived transactions
using temporary tables can easily cause them to reach relfrozenxid.

Also add warnings when old temporary tables are found to still be in
use during autovacuum. Long lived sessions using temporary tables are
required to vacuum them themselves.

For the warning to be useful modify checkTempNamespaceStatus to return
the backend pid using it so that we can inform the super-user which
pid to terminate. Otherwise it's quite tricky to determine as a user.
---
 src/backend/access/transam/varsup.c | 12 ++---
 src/backend/catalog/heap.c          | 53 +
 src/backend/catalog/namespace.c     |  9 +--
 src/backend/postmaster/autovacuum.c | 48 -
 src/include/catalog/namespace.h     |  2 +-
 src/test/regress/expected/temp.out  | 21 +++
 src/test/regress/parallel_schedule  |  9 ---
 src/test/regress/sql/temp.sql       | 19 +
 8 files changed, 151 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index a6e98e7..84ccd2f 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -129,14 +129,16 @@ GetNewTransactionId(bool isSubXact)
 					 errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
 							oldest_datname),
 					 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
-							 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+							 "You might also need to commit or roll back old prepared transactions,\n"
+							 "drop temporary tables, or drop stale replication slots.")));
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
 					 errmsg("database is not accepting commands to avoid wraparound data loss in database with OID %u",
 							oldest_datoid),
 					 errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
-							 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+							 "You might also need to commit or roll back old prepared transactions,\n"
+							 "drop temporary tables, or drop stale replication slots.")));
 	}
 	else if (TransactionIdFollowsOrEquals(xid, xidWarnLimit))
 	{
@@ -149,14 +151,16 @@ GetNewTransactionId(bool isSubXact)
 							oldest_datname,
 							xidWrapLimit - xid),
 					 errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
-							 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
+							 "You might also need to commit or roll back old prepared transactions,\n"
+							 "drop temporary tables, or drop stale replication slots.")));
 		else
 			ereport(WARNING,
 					(errmsg("database with OID %u must be vacuumed within %u transactions",
 							oldest_datoid,
 							xidWrapLimit - xid),
 					 errhint("
Re: Extending amcheck to check toast size and compression
Right so here's a review. I think the patch is committable as is. It's
an improvement and it does the job as promised. I do have some
comments but I don't think they're serious issues and would actually
be pretty happy committing it as is. Fwiw I didn't realize how short
the patch was at first and it probably doesn't need yet another
review.

    /* Toasted attributes too large to be untoasted should never be stored */
    if (toast_pointer.va_rawsize > VARLENA_SIZE_LIMIT)

1) I know this used to say MaxAlloc -- personally I would probably go
with that but either is fine. But if you're going to have a separate
constant there could be more of a comment explaining why that's the
maximum -- probably with a pointer to MaxAlloc and postgres.h's
VARSIZE macros.

The switch statement at line 1443 seems a bit ... baroque. Is it
clearer than a simple "if (cmid != TOAST_PGLZ_COMPRESSION_ID && cmid
!= TOAST_LZ4_COMPRESSION_ID)"? I mean, I see this is easier to add
more cases to, but I found dealing with a case that falls through and
no default to be a lot of cognitive overhead to understand what is in
the end just effectively a simple branch.

Fwiw compilers aren't always the best at optimizing switch statements.
It's entirely possible a compiler may end up building a whole lookup
table of jumps for this thing. Not that it's performance critical
but... But all that's more words than necessary for a minor style
comment.

Fwiw I spent a few minutes thinking about and writing up this
suggestion and then only afterwards realized the code in question
wasn't from this patch. I'll mention it anyways but it's not relevant
to this patch review I guess :)

I found the whole expected_chunk_seq parameter thing a bit confusing
and less useful than possible. I would instead suggest: allocate an
array of the expected number of chunk numbers before calling
check_toast_tuple and then just gather the chunk_seq that are
returned. When it's finished you can do things like:

a) Check if they're all ascending and report index corruption if not.
b) Check if any numbers are missing and report which ones are missing and/or how many.
c) Check if there are duplicates and report that.

These would all be easier for a user to interpret than "index scan
returned chunk 5 when expecting chunk 9".

On Tue, 4 May 2021 at 12:20, Mark Dilger wrote:
>
> Hackers,
>
> During the version 14 development period, a few checks of toasted attributes
> were written but never committed. For the version 15 development cycle, I'd
> like to consider extending the checks of toasted attributes. First, no
> toasted attribute should ever have a rawsize larger than the 1GB varlena
> limit. Second, no compressed toasted attribute should have an extsize
> indicating that the toast expanded during toasting. Such a extsize could
> mean the compression code is malfunctioning, or that the extsize or rawsize
> fields are corrupt. Third, any compressed attribute should have a valid
> compression method ID.
>
> These checks are cheap. Actually retrieving the compressed toasted data and
> checking that it uncompresses correctly would have very different performance
> implications, but that is not included in this patch.
>
> --
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- greg

On Wed, 14 Jul 2021 at 10:58, Mark Dilger wrote:
>
> > On Jul 14, 2021, at 3:33 AM, Heikki Linnakangas wrote:
> >
> >> +/* The largest valid toast va_rawsize */
> >> +#define VARLENA_SIZE_LIMIT 0x3FFFFFFF
> >> +
> >
> > Hmm, a toasted datum cannot be larger than MaxAllocSize, because it's
> > reconstituted in a palloc'd datum, right?
>
> No datum size exceeds MaxAllocSize, and no datum expands when compressed
> (because for those that do expand under any particular compression algorithm,
> we opt to instead store the datum uncompressed), so no valid toast pointer
> should contain a va_rawsize field greater than MaxAllocSize. Any toast
> pointers that have larger va_rawsize fields are therefore corrupt.
>
> VARLENA_SIZE_LIMIT is defined here equal to MaxAllocSize:
>
> src/include/utils/memutils.h:#define MaxAllocSize ((Size)
> 0x3fffffff) /* 1 gigabyte - 1 */
>
> Earlier versions of the patch used MaxAllocSize rather than defining
> VARLENA_SIZE_LIMIT, but review comments suggested that was less clear.
>
> --
> Mark Dilger
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

-- greg
Re: Extending amcheck to check toast size and compression
On Wed., Oct. 20, 2021, 12:41 Mark Dilger wrote:
>
> I used a switch statement to trigger a compiler warning in such an event.

Catching better compiler diagnostics is an excellent reason to choose
this structure. I guess all I could ask is that the comment explaining
why there's no default branch mention that this is the motivation.
Thinking about ANALYZE stats and autovacuum and large non-uniform tables
One problem I've seen in multiple databases and is when a table has a mixture of data sets within it. E.g. A queue table where 99% of the entries are "done" but most queries are working with the 1% that are "new" or in other states. Often the statistics are skewed by the "done" entries and give bad estimates for query planning when the query is actually looking at the other rows. We've always talked about this as a "skewed distribution" or "intercolumn correlation" problem. And we've developed some tools for dealing with those issues. But I've been thinking that's not the only problem with these cases. The problem I'm finding is that the distribution of these small subsets can swing quickly. And understanding intercolumn correlations even if we could do it perfectly would be no help at all. Consider a table with millions of rows that are "done" but none that are "pending". Inserting just a few hundred or thousand new pending rows makes any estimates based on the existing statistics entirely incorrect. Even if we had perfect statistics capable of making perfect estimates they would be entirely wrong once a few inserts of pending rows are done. Worse, this is kind of true for even n_dead_tup, n_mod_since_analyze, etc are kind of affected by this. It's easy (at least on older versions, maybe Peter's work has fixed this for btree) to get severe index bloat because vacuum doesn't run for a long time relative to the size of the busy portion of a table. I'm imagining to really tackle this we should be doing something like noticing when inserts, updates, deletes are affecting key values that are "rare" according to the statistics and triggering autovacuum ANALYZE statements that use indexes to only update the statistics for the relevant key ranges. Obviously this could get complex quickly. Perhaps it should be something users could declare. Some kind of "partitioned statistics" where you declare a where clause and we generate statistics for the table where that where clause is true. Then we could fairly easily also count things like n_mod_since_analyze for that where clause too. And yes, partitioning the table could be a solution to this in some cases. I think there are reasons why it isn't always going to work for these issues though, not least that users will likely have other ways they want to partition the data already. -- greg
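To make the "declared" version concrete, I'm imagining something along these lines -- strawman syntax only, the WHERE clause on CREATE STATISTICS does not exist today and the table and column names are made up purely for illustration:

-- Hypothetical syntax: keep a separate set of statistics (and a separate
-- mod-since-analyze counter) just for the rows matching the predicate.
CREATE STATISTICS queue_pending_stats
    ON state, created_at
    FROM work_queue
    WHERE state IN ('new', 'pending');

Autovacuum could then notice that the "pending" subset has churned completely even though the table as a whole has barely changed, and re-gather just those partitioned statistics.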
Re: How to retain lesser paths at add_path()?
On Wed, 31 Jul 2019 at 11:45, Robert Haas wrote: > > On Wed, Jul 31, 2019 at 11:07 AM Tom Lane wrote: > > What you'd want to do for something like the above, I think, is to > > have some kind of figure of merit or other special marking for paths > > that will have some possible special advantage in later planning > > steps. Then you can teach add_path that that's another dimension it > > should consider, in the same way that paths with different sort orders > > or parallizability attributes don't dominate each other. > > Yeah, but I have to admit that this whole design makes me kinda > uncomfortable. Every time somebody comes up with a new figure of > merit, it increases not only the number of paths retained but also the > cost of comparing two paths to possibly reject one of them. But this is a fundamental problem with having lots of possible reasons a path might be good. Not a problem with the algorithm. I'm imagining that you're both right. If we had some sort of way to look at the shape of the query and make decisions early on about what figures of merit might be relevant then we might be able to pick just a few. Sort of like how we currently only check paths that match some join or other query feature. -- greg
hot_standby_feedback vs excludeVacuum and snapshots
I'm poking around to debug a vacuuming problem and wondering if I've
found something more serious.

As far as I can tell the snapshots on hot standby are built using a
list of running xids that the primary builds and puts in the WAL, and
that seems to include all xids from transactions running in all
databases. The hot standby would then build a snapshot and eventually
send the xmin of that snapshot back to the primary in the hot standby
feedback, and that would block vacuuming tuples that might be visible
to the standby.

Many ages ago Alvaro sweated blood to ensure vacuums could run for
long periods of time without holding back the xmin horizon and
blocking other vacuums from cleaning up tuples. That's the purpose of
the excludeVacuum flag in GetCurrentVirtualXIDs(). That's possible
because we know vacuums won't insert any tuples that queries might try
to view and also vacuums won't try to perform any sql queries on other
tables.

I can't find anywhere that the standby snapshot building mechanism
gets this same information about which xids are actually vacuums that
can be ignored when building a snapshot. So I'm concerned that the hot
standby sending back its xmin would be effectively undermining this
mechanism and forcing vacuum xids to be included in the xmin horizon,
preventing vacuuming of tuples.

Am I missing something obvious? Is this a known problem?

-- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 3 April 2018 at 11:35, Anthony Iliopoulos wrote:
> Hi Robert,
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*. But I think one would have a
> hard time defending a modification to the kernel where this is further
> extended to cover cases where:
>
> process A does write() on some file offset which fails writeback,
> fsync() gets EIO and exit()s.
>
> process B does write() on some other offset which succeeds writeback,
> but fsync() gets EIO due to (uncleared) failures of earlier process.

Surely that's exactly what process B would want? If it calls fsync and
gets a success and later finds out that the file is corrupt and didn't
match what was in memory it's not going to be happy.

This seems like an attempt to co-opt fsync for a new and different
purpose for which it's poorly designed. It's not an async error
reporting mechanism for writes. It would be useless for that, as any
process could come along and open your file and eat the errors for
writes you performed. An async error reporting mechanism would have to
document which writes it was giving errors for and give you ways to
control that.

The semantics described here are useless for everyone. For a program
needing to know the error status of the writes it executed, it doesn't
know which writes are included in which fsync call. For a program
using fsync for its original intended purpose of guaranteeing that all
writes are synced to disk it no longer has any guarantee at all.

> This would be a highly user-visible change of semantics from edge-
> triggered to level-triggered behavior.

It was always documented as level-triggered. This edge-triggered
concept is a complete surprise to application writers.

-- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 3 April 2018 at 14:36, Anthony Iliopoulos wrote: > If EIO persists between invocations until explicitly cleared, a process > cannot possibly make any decision as to if it should clear the error I still don't understand what "clear the error" means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By "clear the error" you mean throw away the dirty pages and revert part of the file to some old data? Why would anyone ever want that? > But instead of deconstructing and debating the semantics of the > current mechanism, why not come up with the ideal desired form of > error reporting/tracking granularity etc., and see how this may be > fitted into kernels as a new interface. Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error reporting system anyways. It just needs to know when all writes have been synced to disk. -- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 8 April 2018 at 04:27, Craig Ringer wrote: > On 8 April 2018 at 10:16, Thomas Munro > wrote: > > If the kernel does writeback in the middle, how on earth is it supposed to > know we expect to reopen the file and check back later? > > Should it just remember "this file had an error" forever, and tell every > caller? In that case how could we recover? We'd need some new API to say > "yeah, ok already, I'm redoing all my work since the last good fsync() so > you can clear the error flag now". Otherwise it'd keep reporting an error > after we did redo to recover, too. There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past. The problem is not that errors aren't been tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak". As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface. Going to DIRECTIO is basically recognizing this. That the kernel filesystem buffer provides no reliable interface so we need to reimplement it ourselves in user space. It's rather disheartening. Aside from having to do all that work we have the added barrier that we don't have as much information about the hardware as the kernel has. We don't know where raid stripes begin and end, how big the memory controller buffers are or how to tell when they're full or empty or how to flush them. etc etc. We also don't know what else is going on on the machine. -- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 8 April 2018 at 22:47, Anthony Iliopoulos wrote: > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote: >> On 8 April 2018 at 04:27, Craig Ringer wrote: >> > On 8 April 2018 at 10:16, Thomas Munro > > The question is, what should the kernel and application do in cases > where this is simply not possible (according to freebsd that keeps > dirty pages around after failure, for example, -EIO from the block > layer is a contract for unrecoverable errors so it is pointless to > keep them dirty). You'd need a specialized interface to clear-out > the errors (and drop the dirty pages), or potentially just remount > the filesystem. Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future. But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device? But just to underline the point. "pointless to keep them dirty" is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the *really* last thing user space expects is for this to happen and return no error.) -- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 9 April 2018 at 15:22, Anthony Iliopoulos wrote: > On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote: >> > Sure, there could be knobs for limiting how much memory such "zombie" > pages may occupy. Not sure how helpful it would be in the long run > since this tends to be highly application-specific, and for something > with a large data footprint one would end up tuning this accordingly > in a system-wide manner. Surely this is exactly what the kernel is there to manage. It has to control how much memory is allowed to be full of dirty buffers in the first place to ensure that the system won't get memory starved if it can't clean them fast enough. That isn't even about persistent hardware errors. Even when the hardware is working perfectly it can only flush buffers so fast. The whole point of the kernel is to abstract away shared resources. It's not like user space has any better view of the situation here. If Postgres implemented all this in DIRECT_IO it would have exactly the same problem only with less visibility into what the rest of the system is doing. If every application implemented its own buffer cache we would be back in the same boat only with a fragmented memory allocation. > This has the potential to leave other > applications running in the same system with very little memory, in > cases where for example original application crashes and never clears > the error. I still think we're speaking two different languages. There's no application anywhere that's going to "clear the error". The application has done the writes and if it's calling fsync it wants to wait until the filesystem can arrange for the write to be persisted. If the application could manage without the persistence then it wouldn't have called fsync. The only way to "clear out" the error would be by having the writes succeed. There's no reason to think that wouldn't be possible sometime. The filesystem could remap blocks or an administrator could replace degraded raid device components. The only thing Postgres could do to recover would be create a new file and move the data (reading from the dirty buffer in memory!) to a new file anyways so we would "clear the error" by just no longer calling fsync on the old file. We always read fsync as a simple write barrier. That's what the documentation promised and it's what Postgres always expected. It sounds like the kernel implementors looked at it as some kind of communication channel to communicate status report for specific writes back to user-space. That's a much more complex problem and would have entirely different interface. I think this is why we're having so much difficulty communicating. > It is reasonable, but even FreeBSD has a big fat comment right > there (since 2017), mentioning that there can be no recovery from > EIO at the block layer and this needs to be done differently. No > idea how an application running on top of either FreeBSD or Illumos > would actually recover from this error (and clear it out), other > than remounting the fs in order to force dropping of relevant pages. > It does provide though indeed a persistent error indication that > would allow Pg to simply reliably panic. But again this does not > necessarily play well with other applications that may be using > the filesystem reliably at the same time, and are now faced with > EIO while their own writes succeed to be persisted. 
Well if they're writing to the same file that had a previous error I doubt there are many applications that would be happy to consider their writes "persisted" when the file was corrupt. Ironically the earlier discussion quoted talked about how applications that wanted more granular communication would be using O_DIRECT -- but what we have is fsync trying to be *too* granular such that it's impossible to get any strong guarantees about anything with it. >> One has to wonder how many applications actually use this correctly, >> considering PostgreSQL cares about data durability/consistency so much >> and yet we've been misunderstanding how it works for 20+ years. > > I would expect it would be very few, potentially those that have > a very simple process model (e.g. embedded DBs that can abort a > txn on fsync() EIO). Honestly I don't think there's *any* way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can be having silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?). -- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 9 April 2018 at 11:50, Anthony Iliopoulos wrote: > On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote: >> On 8 April 2018 at 22:47, Anthony Iliopoulos wrote: > To make things a bit simpler, let us focus on EIO for the moment. > The contract between the block layer and the filesystem layer is > assumed to be that of, when an EIO is propagated up to the fs, > then you may assume that all possibilities for recovering have > been exhausted in lower layers of the stack. Well Postgres is using the filesystem. The interface between the block layer and the filesystem may indeed need to be more complex, I wouldn't know. But I don't think "all possibilities" is a very useful concept. Neither layer here is going to be perfect. They can only promise that all possibilities that have actually been implemented have been exhausted. And even among those only to the degree they can be done automatically within the engineering tradeoffs and constraints. There will always be cases like thin provisioned devices that an operator can expand, or degraded raid arrays that can be repaired after a long operation and so on. A network device can't be sure whether a remote server may eventually come back or not and have to be reconfigured by a human or system automation tool to point to the new server or new network configuration. > Right. This implies though that apart from the kernel having > to keep around the dirtied-but-unrecoverable pages for an > unbounded time, that there's further an interface for obtaining > the exact failed pages so that you can read them back. No, the interface we have is fsync which gives us that information with the granularity of a single file. The database could in theory recognize that fsync is not completing on a file and read that file back and write it to a new file. More likely we would implement a feature Oracle has of writing key files to multiple devices. But currently in practice that's not what would happen, what would happen would be a human would recognize that the database has stopped being able to commit and there are hardware errors in the log and would stop the database, take a backup, and restore onto a new working device. The current interface is that there's one error and then Postgres would pretty much have to say, "sorry, your database is corrupt and the data is gone, restore from your backups". Which is pretty dismal. > There is a clear responsibility of the application to keep > its buffers around until a successful fsync(). The kernels > do report the error (albeit with all the complexities of > dealing with the interface), at which point the application > may not assume that the write()s where ever even buffered > in the kernel page cache in the first place. Postgres cannot just store the entire database in RAM. It writes things to the filesystem all the time. It calls fsync only when it needs a write barrier to ensure consistency. That's only frequent on the transaction log to ensure it's flushed before data modifications and then periodically to checkpoint the data files. The amount of data written between checkpoints can be arbitrarily large and Postgres has no idea how much memory is available as filesystem buffers or how much i/o bandwidth is available or other memory pressure there is. What you're suggesting is that the application should have to babysit the filesystem buffer cache and reimplement all of it in user-space because the filesystem is free to throw away any data any time it chooses? 
The current interface to throw away filesystem buffer cache is unmount. It sounds like the kernel would like a more granular way to discard just part of a device which makes a lot of sense in the age of large network block devices. But I don't think just saying that the filesystem buffer cache is now something every application needs to re-implement in user-space really helps with that, they're going to have the same problems to solve. -- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 10 April 2018 at 02:59, Craig Ringer wrote:
> Nitpick: In most cases the kernel reserves disk space immediately,
> before returning from write(). NFS seems to be the main exception
> here.

I'm kind of puzzled by this. Surely NFS servers store the data in the
filesystem using write(2) or the in-kernel equivalent? So if the
server is backed by a filesystem where write(2) preallocates space,
surely the NFS server must behave as if it's preallocating as well? I
would expect NFS to provide basically the same set of possible
failures as the underlying filesystem (as long as you don't enable
nosync of course).

-- greg
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 10 April 2018 at 19:58, Joshua D. Drake wrote: > You can't unmount the file system --- that requires writing out all of the > pages > such that the dirty bit is turned off. I always wondered why Linux didn't implement umount -f. It's been in BSD since forever and it's a major annoyance that it's missing in Linux. Even without leaking memory it still leaks other resources, causes confusion and awkward workarounds in UI and automation software. -- greg
Partitioning versus autovacuum
So we now support `ANALYZE partitioned_table`, which will gather
statistics for the main table by gathering stats from all the
partitions. However, as far as I can tell autovacuum will never
actually trigger this analyze, because we never generate any update
records for the parent table in the statistics. Have I missed
something?

I didn't find any discussion of this in the threads from when
partitioning was committed, but there were a lot of discussions and I
could easily have missed it.

Is there a story for this? Some way to configure things so that
autovacuum will analyze partitioned tables?

Or should we look at doing something? Maybe whenever we analyze a
child we should also update the parent -- and if there's no stats yet
run analyze on it?

This may be a serious enough problem for users that it may warrant
backpatching. Not having any stats is resulting in some pretty weird
plans for us.
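In the meantime the only workaround I see is to schedule the parent-level ANALYZE manually and to watch for partitioned tables that autovacuum has never touched, roughly like this (a sketch; the pg_stat_get_* functions are the existing per-table stats accessors, and some_partitioned_table is a placeholder):

SELECT c.oid::regclass                          AS parent,
       pg_stat_get_last_analyze_time(c.oid)     AS last_analyze,
       pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze,
       pg_stat_get_mod_since_analyze(c.oid)     AS mods_since_analyze
FROM pg_class c
WHERE c.relkind = 'p';

-- and then periodically, for each parent that shows up as never analyzed:
ANALYZE some_partitioned_table;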
Re: Partitioning versus autovacuum
Actually I did just find it in the To-do wiki: Have autoanalyze of parent tables occur when child tables are modified - http://archives.postgresql.org/pgsql-performance/2010-06/msg00137.php On Mon., Sep. 30, 2019, 1:48 p.m. Greg Stark, wrote: > So we now support `ANALYZE partitioned_table` which will gather statistics > for the main table by gathering stats from all the partitions. > > However as far as I can tell autovacuum will never actually trigger this > analyze. Because we never generate any update records for the parent table > in the statistics. Have I missed something? > > I didn't find any discussion of this in the threads from when partitioning > was committed but there were a lot of discussions and I could easily have > missed it. > > Is there a story for this? Some way to configure things so that autovacuum > will analyze partitioned tables? > > Or should we look at doing something? Maybe whether we analyze a child we > should also update the parent -- and if there's no stats yet run analyze on > it? > > This may be a serious enough problem for users that it may warrant > backpatching. Not having any stats is resulting in some pretty weird plans > for us. >
Re: Partitioning versus autovacuum
Actually -- I'm sorry to follow up to myself (twice) -- but that's
wrong. That Todo item predates the modern partitioning code. It came
from when the partitioned statistics were added for inheritance trees.
The resulting comment almost doesn't make sense any more since it
talks about updates to the parent table and treats them as distinct
from updates to the children. In any case it's actually not true any
more, as updates to the parent table aren't even tracked any more --
see below.

My modest proposal is that we should count any updates that arrive
through the parent table as mods for both the parent and child. A more
ambitious proposal would have updates to the children also count
against the parent somehow, but I'm not sure exactly how. And I'm not
sure we shouldn't be updating the parent statistics whenever we run
analyze on a child anyways, but again I'm not sure how.

postgres=# create table p (i integer primary key, t text) partition by range (i);
CREATE TABLE
postgres=# create table p0 partition of p for values from (0) to (10);
CREATE TABLE
postgres=# analyze p;
ANALYZE
postgres=# analyze p0;
ANALYZE
postgres=# select pg_stat_get_mod_since_analyze('p'::regclass) as p, pg_stat_get_mod_since_analyze('p0'::regclass) as p0;
 p | p0
---+----
 0 |  0
(1 row)

postgres=# insert into p values (2);
INSERT 0 1
postgres=# select pg_stat_get_mod_since_analyze('p'::regclass) as p, pg_stat_get_mod_since_analyze('p0'::regclass) as p0;
 p | p0
---+----
 0 |  1
(1 row)
Re: maintenance_work_mem used by Vacuum
It's a bit unfortunate that we're doing the pending list flush while the vacuum memory is allocated at all. Is there any reason other than the way the callbacks are defined that gin doesn't do the pending list flush before vacuum does the heap scan and before it allocates any memory using maintenance_work_mem? (I'm guessing doing it after vacuum is finished would have different problems with tuples in the pending queue not getting vacuumed?)
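The manual workaround, at least, is to flush the pending list explicitly before vacuuming, so the flush doesn't happen while vacuum is sitting on its maintenance_work_mem allocation -- a sketch, where some_gin_index and some_table are placeholders:

-- gin_clean_pending_list() flushes the index's pending list and returns
-- the number of pending-list pages it removed.
SELECT gin_clean_pending_list('some_gin_index');
VACUUM some_table;

That obviously doesn't answer the design question of why the callbacks force the flush to happen inside vacuum in the first place.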
Re: Partitioning versus autovacuum
At the risk of forking this thread... I think there's actually a planner estimation bug here too. Consider this test case of a simple partitioned table and a simple join. The cardinality estimates for each partition and the Append node are all perfectly accurate. But the estimate for the join is way off. The corresponding test case without partitioning produces a perfect cardinality estimate for the join.

I've never completely wrapped my head around the planner selectivity estimations. IIRC join restrictions are treated differently from single-relation restrictions. Perhaps what's happening here is that the single-relation restrictions are being correctly estimated based on the child partitions but the join restriction code hasn't been taught the same tricks?

stark=# create table p (i integer, j integer) partition by list (i);
CREATE TABLE
stark=# create table p0 partition of p for values in (0);
CREATE TABLE
stark=# create table p1 partition of p for values in (1);
CREATE TABLE
stark=# insert into p select 0,generate_series(1,1000);
INSERT 0 1000
stark=# insert into p select 1,generate_series(1,1000);
INSERT 0 1000
stark=# analyze p0;
ANALYZE
stark=# analyze p1;
ANALYZE
stark=# create table q (i integer);
CREATE TABLE
stark=# insert into q values (0);
INSERT 0 1
stark=# analyze q;
ANALYZE

-- Query partitioned table, get wildly off row estimates for join
stark=# explain analyze select * from q join p using (i) where j between 1 and 500;
                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.02..44.82 rows=5 width=8) (actual time=0.060..1.614 rows=500 loops=1)
   Hash Cond: (p0.i = q.i)
   ->  Append  (cost=0.00..40.00 rows=1000 width=8) (actual time=0.030..1.127 rows=1000 loops=1)
         ->  Seq Scan on p0  (cost=0.00..20.00 rows=500 width=8) (actual time=0.029..0.440 rows=500 loops=1)
               Filter: ((j >= 1) AND (j <= 500))
               Rows Removed by Filter: 500
         ->  Seq Scan on p1  (cost=0.00..20.00 rows=500 width=8) (actual time=0.018..0.461 rows=500 loops=1)
               Filter: ((j >= 1) AND (j <= 500))
               Rows Removed by Filter: 500
   ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.011..0.012 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4) (actual time=0.005..0.006 rows=1 loops=1)
 Planning time: 0.713 ms
 Execution time: 1.743 ms
(14 rows)

-- Query non-partitioned table, get accurate row estimates for join
stark=# create table pp as (Select * from p);
SELECT 2000
stark=# analyze pp;
ANALYZE
stark=# explain analyze select * from q join pp using (i) where j between 1 and 500;
                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1.02..48.77 rows=500 width=8) (actual time=0.027..0.412 rows=500 loops=1)
   Hash Cond: (pp.i = q.i)
   ->  Seq Scan on pp  (cost=0.00..39.00 rows=1000 width=8) (actual time=0.014..0.243 rows=1000 loops=1)
         Filter: ((j >= 1) AND (j <= 500))
         Rows Removed by Filter: 1000
   ->  Hash  (cost=1.01..1.01 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on q  (cost=0.00..1.01 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1)
 Planning time: 0.160 ms
 Execution time: 0.456 ms
(10 rows)
Can't we give better table bloat stats easily?
Everywhere I've worked I've seen people struggle with table bloat. It's hard to even measure how much of it you have or where, let alone actually fix it. If you search online you'll find dozens of different queries estimating how much empty space is in your tables and indexes, based on pg_stats statistics and suppositions about header lengths and padding, plugged into formulas of varying credibility.

But isn't this all just silliness these days? We could actually sum up the space recorded in the FSM and get a much more trustworthy number in milliseconds. I rigged up a quick proof of concept and the code seems super simple and quick. There are one or two tables where the number is a bit suspect, and there's no FSM if vacuum hasn't run, but that seems pretty small potatoes for such a huge help in reducing user pain.
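For what it's worth, something in this spirit is already reachable from SQL via the pg_freespacemap extension, just slower since it returns one row per page instead of summing server-side. A rough sketch -- the table name is a placeholder and the numbers are only as fresh as the last vacuum:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Sum the FSM's per-page free space and compare it to the relation size
SELECT sum(avail) AS free_bytes,
       pg_relation_size('some_table') AS total_bytes
FROM pg_freespace('some_table');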
LIKE foo% optimization easily defeated by OR?
Our database has a query that looks like this -- note the OR between a simple equality qual and a LIKE qual:

=> explain SELECT 1 AS one FROM "redirect_routes" WHERE redirect_routes.path = 'foobar' OR redirect_routes.path LIKE 'foobar/%';
                                   QUERY PLAN
-----------------------------------------------------------------------------------
 Seq Scan on redirect_routes  (cost=0.00..1776.23 rows=5 width=4)
   Filter: (((path)::text = 'foobar'::text) OR ((path)::text ~~ 'foobar/%'::text))
(2 rows)

The database uses a sequential scan even though both sides of that OR have valid indexes that can satisfy them (and for much lower costs):

=> explain SELECT 1 AS one FROM "redirect_routes" WHERE redirect_routes.path = 'foobar';
                                   QUERY PLAN
-----------------------------------------------------------------------------------
 Index Only Scan using index_redirect_routes_on_path_text_pattern_ops on redirect_routes  (cost=0.41..4.43 rows=1 width=4)
   Index Cond: (path = 'foobar'::text)
(2 rows)

=> explain SELECT 1 AS one FROM "redirect_routes" WHERE redirect_routes.path LIKE 'foobar/%';
                                   QUERY PLAN
-----------------------------------------------------------------------------------
 Index Only Scan using index_redirect_routes_on_path_text_pattern_ops on redirect_routes  (cost=0.41..4.44 rows=4 width=4)
   Index Cond: ((path ~>=~ 'foobar/'::text) AND (path ~<~ 'foobar0'::text))
   Filter: ((path)::text ~~ 'foobar/%'::text)
(3 rows)

I'm guessing the LIKE optimization isn't clever enough to kick in when it's buried under an OR? Does it only kick in at the top level of the quals?

-- greg
Re: LIKE foo% optimization easily defeated by OR?
On 3 January 2018 at 22:34, Alexander Korotkov wrote: > I've checked similar case on database with PostgreSQL mailing lists. It > works for me. Wow that's fascinating. I wonder why it's not kicking in for me. I have checked with enable_seqscan=off but I'll have to do some more investigations. I'll try it on different instances of the database. -- greg
Re: LIKE foo% optimization easily defeated by OR?
I think I found the bug: it's about 18" from the monitor. I'll just be over here with the paper bag over my head, mumbling about running RESET ALL before running tests...
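For the archives, the sanity check I should have started with -- just a generic sketch, nothing specific to this schema: list anything that's been SET in the current session, then clear it before testing plans.

SELECT name, setting
FROM pg_settings
WHERE source = 'session';

RESET ALL;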
Re: [HACKERS] Restricting maximum keep segments by repslots
On 11 January 2018 at 09:55, Sergei Kornilov wrote: > if (active_pid != 0) > status = "streaming"; > else > status = "keeping"; Perhaps "idle" by analogy to a pg_stat_activity entry for a backend that's connected but not doing anything. > status = "may_lost"; Perhaps "stale" or "expired"? Is this patch in bike-shed territory? Are there any questions about whether we want the basic shape to look like this? Fwiw I think there's a real need for this feature so I would like to get it in for Postgres 11. -- greg
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On 11 January 2018 at 19:41, Peter Eisentraut wrote: > Two, what to do when the memory limit is reached. With the old > accounting, this was easy, because we'd decide for each subtransaction > independently whether to spill it to disk, when it has reached its 4096 > limit. Now, we are looking at a global limit, so we have to find a > transaction to spill in some other way. The proposed patch searches > through the entire list of transactions to find the largest one. But as > the patch says: > > "XXX With many subtransactions this might be quite slow, because we'll > have to walk through all of them. There are some options how we could > improve that: (a) maintain some secondary structure with transactions > sorted by amount of changes, (b) not looking for the entirely largest > transaction, but e.g. for transaction using at least some fraction of > the memory limit, and (c) evicting multiple transactions at once, e.g. > to free a given portion of the memory limit (e.g. 50%)."

AIUI spilling to disk doesn't affect absorbing future updates; we would just keep accumulating them in memory, right? We won't need to unspill until it comes time to commit.

Is there any actual advantage to picking the largest transaction? It means fewer spills and fewer unspills at commit time, but that's just a bigger spike of I/O and more of a chance of spilling more than necessary to get by. In the end it'll be more or less the same amount of data read back, just all in one big spike when spilling and one big spike when committing. If you spilled smaller transactions you would have a small amount of I/O more frequently and have to read back small amounts for many commits. But it would add up to the same amount of I/O (or less, if you avoid spilling more than necessary).

The real aim should be to try to pick the transaction that will be committed furthest in the future. That gives you the most memory to use for live transactions for the longest time and could let you process the maximum number of transactions without spilling them. So either the oldest transaction (in the expectation that it's been open a while and appears to be a long-lived batch job that will stay open for a long time) or the youngest transaction (in the expectation that all transactions are more or less equally long-lived) might make sense.

-- greg
Re: WIP: a way forward on bootstrap data
I'm 1000% on board with replacing oid constants with symbolic names that get substituted programmatically. However I wonder why we're bothering to invent a new syntax that doesn't actually do much more than present static tabular data. If things like the magic proname->prosrc behaviour are not valuable then we're not getting much out of this perl-friendly syntax that a simpler, more standard format wouldn't get us.

So, just as a straw-man proposal: what if we just replaced the data file with a CSV file that could be maintained in a spreadsheet? It could easily be parsed by Perl and we could even have Perl scripts that load the records into memory and modify them. You could even imagine writing a Postgres script that loaded the CSV file into a temporary table, did complex SQL updates or other DML, then wrote it back out to a CSV file.
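Something like this is all I have in mind -- the file path and column list below are made up and heavily trimmed, it's just a sketch of the round trip:

-- Load the catalog data into a scratch table
CREATE TEMP TABLE proc_data (oid oid, proname name, prorettype oid, prosrc text);
COPY proc_data FROM '/path/to/pg_proc.csv' WITH (FORMAT csv, HEADER true);

-- Any bulk edit you like, done in SQL instead of a hand-rolled script,
-- e.g. defaulting prosrc to the function name:
UPDATE proc_data SET prosrc = proname WHERE prosrc IS NULL;

-- Write it back out for the build system to consume
COPY proc_data TO '/path/to/pg_proc.csv' WITH (FORMAT csv, HEADER true);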
Re: Bug in Physical Replication Slots (at least 9.5)?
On 19 January 2017 at 09:37, Kyotaro HORIGUCHI wrote: > > Though I haven't look closer to how a modification is splitted > into WAL records. A tuple cannot be so long. As a simple test, I > observed rechder->xl_tot_len at the end of XLogRecordAssemble > inserting an about 400KB not-so-compressable string into a text > column, but I saw a series of many records with shorter than > several thousand bytes. I think the case to check is a commit record with many thousands of subtransactions. I'm not sure you can fill a whole segment though. -- greg
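PS: a rough, untested sketch of one way to generate such a commit record, in case it's useful for testing -- the table name and the loop count are arbitrary. Each EXCEPTION block in plpgsql starts a subtransaction and the INSERT forces each one to be assigned an XID, so I believe the final commit record has to carry all of those subxids:

CREATE TABLE subxact_test (i int);

DO $$
BEGIN
  FOR n IN 1..100000 LOOP
    BEGIN
      INSERT INTO subxact_test VALUES (n);
    EXCEPTION WHEN others THEN
      NULL;  -- handler never reached; the block exists only to start a subxact
    END;
  END LOOP;
END;
$$;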
Re: log bind parameter values on error
On Mon, 9 Dec 2019 at 15:17, Tom Lane wrote: > > Meh ... people will inevitably complain that they needed to see the > whole value, and we'll end up having to add another configuration > variable. Let's not go there just yet. I haven't been following this whole thread, but my initial reaction is that this particular configuration parameter would actually carry its weight. While some people have applications where they know the expected size of the data and can safely log the entire data to the logs, other people deal with user-supplied data that can be very large. It would suck to have something like Gitlab dump entire merge diffs to the log, and those aren't even very large -- probably under a megabyte. Some users will have individual values that are up to 1GB. -- greg
Re: VACUUM memory management
On Mon, 9 Dec 2019 at 14:03, Ibrar Ahmed wrote: > I'd > actually argue that the segment size should be substantially smaller > than 1 GB, like say 64MB; there are still some people running systems > which are small enough that allocating 1 GB when we may need only 6 > bytes can drive the system into OOM." I don't even see why you would allocate as much as 64MB. I would think something around 1MB would be more sensible. So you might need an array of segment pointers a few thousand entries long -- big deal. We can handle repalloc on 8kB arrays pretty easily. -- greg
Re: verbose cost estimate
On Mon, 9 Dec 2019 at 17:14, Tomas Vondra wrote: > > On Sat, Dec 07, 2019 at 11:34:12AM -0500, Tom Lane wrote: > >Justin Pryzby writes: > >> Jeff said: > >>> |What would I find very useful is a verbosity option to get the cost > >>> |estimates expressed as a multiplier of each *_cost parameter, rather than > >>> |just as a scalar. > > > >> It seems to me that's "just" a matter of redefining Cost and fixing > >> everything that breaks: > > > >> struct Cost { > >> double seq, rand; > >> double cpu_tuple, cpu_index_tuple, cpu_oper; > >> double parallel_setup; // This is probably always in startup_cost > >> and never in run_cost > >> double parallel_tuple; // This is probably always in run_cost and > >> never in startup_cost > >> double disable; > >> }; > > > >> I'm perhaps 50% done with that - is there some agreement that's a desirable > >> goal and a good way to do it ? > > > >No, I think this will get rejected out of hand. The implications for > >the planner's speed and memory consumption seem quite unacceptable > >for the size of the benefit. What you're showing above probably > >doubles the size of most Paths, and the added cycles in hot-spots > >like add_path seem pretty daunting. > > > > Yeah, that's an issue. But I have to admit my main issue with this > proposal is that I have no idea how I'd interpret this Cost. I mean, > what do the fields express for different types of paths? How do they > contribute to the actual cost of that path?

What I think users would be able to do with this info is understand which parameter to tweak to raise the estimated cost of the node. Everyone knows that if you see an index scan being used but taking longer than a sequential scan, then you might try raising random_page_cost. But I rarely see people tweaking the more "exotic" parameters like cpu_operator_cost or cpu_index_tuple_cost, and when they do they aren't really sure which nodes they're affecting... I remember planning to do a very similar thing back in the 8.3 era and never getting around to it.

You could even imagine storing these for the overall plan in the logs and building a large matrix of actual execution values versus these broken-out individual costs. Then it becomes a standard linear optimization problem to find the optimal values for each parameter to minimize inaccurate plan estimates (and to identify cases where there are outliers).

-- greg
Re: On disable_cost
On Wed, 11 Dec 2019 at 01:24, Laurenz Albe wrote: > > On Tue, 2019-12-10 at 15:50 -0700, Jim Finnerty wrote: > > As a proof of concept, I hacked around a bit today to re-purpose one of the > > bits of the Cost structure to mean "is_disabled" so that we can distinguish > > 'disabled' from 'non-disabled' paths without making the Cost structure any > > bigger. In fact, it's still a valid double. The obvious choice would have > > been to re-purpose the sign bit, but I've had occasion to exploit negative > > costs before so for this POC I used the high-order bit of the fractional > > bits of the double. (see Wikipedia for double precision floating point for > > the layout). > > > > The idea is to set a special bit when disable_cost is added to a cost. > > Dedicating multiple bits instead of just 1 would be easily done, but as it > > is we can accumulate many disable_costs without overflowing, so just > > comparing the cost suffices. > > Doesn't that rely on a specific implementation of double precision (IEEE)? > I thought that we don't want to limit ourselves to platforms with IEEE floats.

We could always implement it again for another format.

However, I wouldn't have expected to be bit twiddling. I would have expected to use standard functions like ldexp to do this. In fact I think if you use the high bit of the exponent you could do it entirely using ldexp and regular double comparisons (with fabs). I.e., to set the bit you set cost = ldexp(cost, __DBL_MAX_EXP__/2), and to check for the bit being set you compare ilogb(cost) against __DBL_MAX_EXP__/2. Hm, that doesn't handle the case where the cost is already < 1, in which case I guess you would have to set it to 1 first. Or reserve the two high bits of the cost so you can represent disabled values that had negative exponents before being disabled.

I wonder if it wouldn't be a lot cleaner and more flexible to just go with a plain float for Cost and use the other 32 bits for counters and bitmasks, and still be ahead of the game. A double can store up to 2^1024 but a float up to 2^128, which still feels like more than enough to store the kinds of costs plans have without the disabled costs. 2^128 milliseconds is still about 10^28 years, which is an awfully expensive query.

-- greg
Re: On disable_cost
I think this would be ready to abstract away behind a few functions that could always be replaced by something else later... However on further thought I really think just using a 32-bit float and 32 bits of other bitmaps or counters would be a better approach. On Sun., Dec. 15, 2019, 14:54 Tom Lane, wrote: > Thomas Munro writes: > > On Wed, Dec 11, 2019 at 7:24 PM Laurenz Albe > wrote: > >> Doesn't that rely on a specific implementation of double precision > (IEEE)? > >> I thought that we don't want to limit ourselves to platforms with IEEE > floats. > > > Just by the way, you might want to read the second last paragraph of > > the commit message for 02ddd499. The dream is over, we're never going > > to run on Vax. > > Still, the proposed hack is doubling down on IEEE dependency in a way > that I quite dislike, in that (a) it doesn't just read float values > but generates new ones (and assumes that the hardware/libc will react in > a predictable way to them), (b) in a part of the code that has no damn > business having close dependencies on float format, and (c) for a gain > far smaller than what we got from the Ryu code. > > We have had prior discussions about whether 02ddd499 justifies adding > more IEEE dependencies elsewhere. I don't think it does. IEEE 754 > is not the last word that will ever be said on floating-point arithmetic, > any more than x86_64 is the last CPU architecture that anyone will ever > care about. We should keep our dependencies on it well circumscribed. > > regards, tom lane > > >
Re: Implementing Incremental View Maintenance
On Sun, 31 Mar 2019 at 23:22, Yugo Nagata wrote: > > Firstly, this will handle simple definition views which includes only > selection, projection, and join. Standard aggregations (count, sum, avg, > min, max) are not planned to be implemented in the first patch, but these > are commonly used in materialized views, so I'll implement them later on.

It's fine to not have all the features from day 1, of course. But I just picked up this comment and the followup talking about splitting AVG into SUM and COUNT, and I had a comment.

When you do look at tackling aggregates I don't think you should restrict yourself to these specific standard aggregations. We have all the necessary abstractions to handle all the aggregations that are feasible, see https://www.postgresql.org/docs/devel/xaggr.html#XAGGR-MOVING-AGGREGATES

What you need to do -- I think -- is store the "moving aggregate state" from before the final function is applied. Then whenever a row is inserted, deleted, or updated (or whenever another column is updated which causes the row to enter or leave the aggregation), apply either aggmtransfn or aggminvtransfn to the state. I'm not sure whether you want to apply the final function on every update or only lazily; either may be better depending on usage.
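Incidentally, the catalogs already tell you which built-in aggregates provide an inverse transition function; a quick illustrative query (nothing to do with the patch itself):

SELECT aggfnoid, aggmtransfn, aggminvtransfn
FROM pg_aggregate
WHERE aggminvtransfn::oid <> 0;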
Re: Statistical aggregate functions are not working with PARTIAL aggregation
Don't we have a build farm animal that runs under valgrind that would have caught this?
Re: Create TOAST table only if AM needs
Just throwing this out there: perhaps we should just disable toasting for non-heap tables entirely for now? That way at least people can use it, and storage plugins just have to deal with large datums on their own (or throw errors). On Fri., May 17, 2019, 5:56 p.m. Ashwin Agrawal wrote: > On Fri, May 17, 2019 at 2:42 PM Robert Haas wrote: > >> In any case, it bears saying that >> tableam is a remarkable accomplishment regardless of whatever >> shortcomings it has in this area or elsewhere. >> > > Big +1 to this. >
Re: New vacuum option to do only freezing
On Mon 5 Nov 2018, 12:49 Robert Haas wrote: > That seems non-orthogonal. We have an existing flag to force freezing > (FREEZE); we don't need a second option that does the same thing. > Skipping the index scans (and thus necessarily the second heap pass) > is a separate behavior which we don't currently have a way to control. It sounds like it might be better to name this "VACUUM (FAST)" and document that it skips some of the normal (and necessary) work that vacuum does, and is only suitable for avoiding wraparound and not sufficient for avoiding bloat.
Query pattern that Postgres doesn't handle well
At my day job I've been doing a fair amount of routine query and schema optimization, and I've noticed one particular query shape that has repeatedly caused problems, and it's one we've talked about before:

select * from table where simple-restriction OR (complex-subquery)

For example something like:

SELECT * FROM projects
 WHERE ispublic
    OR project_id IN (SELECT project_id FROM project_members WHERE userid = ?)

Either half of this clause can easily be executed using indexes, but the combination forces Postgres to do a full sequential table scan. The solution has been to rewrite each of these cases into:

SELECT * FROM projects
 WHERE ispublic
UNION ALL
SELECT * FROM projects
 WHERE NOT ispublic
   AND project_id IN (SELECT project_id FROM project_members WHERE userid = ?)

But there are several problems with this.

1) It's often difficult to get the conditions to be exactly disjoint such that you can use UNION ALL, and if you can't then UNION can be slow and possibly even incorrect.

2) The resulting query is difficult to combine with other clauses. They must either be copied into both sides of the UNION or wrapped around the outside, in the hope that Postgres pushes them down into the union branches where necessary. More complex conditions can't be pushed down, and in particular combining two conditions that have each been rewritten to use UNION is very difficult.

3) It's extremely difficult to generate this kind of query in an ORM such as ActiveRecord without breaking the abstractions and causing surprising interactions.

I'm not sure I have much to offer here. I definitely don't know where to start implementing it. But I know it's come up before on the list and just thought I would mention that I've noticed it being a recurring problem for at least this user.

-- greg
Re: Online enabling of checksums
> The change of the checksum state is WAL logged with a new xlog record. All > the buffers written by the background worker are forcibly enabled full page > writes to make sure the checksum is fully updated on the standby even if no > actual contents of the buffer changed. Hm. That doesn't sound necessary to me. If you generate a checkpoint (or just wait until a new checkpoint has started) then go through and do a normal xlog record for every page (any xlog record, a noop record even) then the normal logic for full page writes ought to be sufficient. If the noop record doesn't need a full page write it's because someone else has already come in and done one and that one will set the checksum. In fact if any page has an lsn > the checkpoint start lsn for the checkpoint after the flag was flipped then you wouldn't need to issue any record at all.
jsonlog logging only some messages?
I tried loading the jsonlog module from https://github.com/michaelpq/pg_plugins into Postgres 9.6. However, it seems to have resulted in logs only for session log messages but not any background worker log messages. We have log_checkpoints set, but there were no log messages in the JSON log about checkpoints. Nor were there any from autovacuum.

Also, I have log_destination set to stderr,csvlog, and as I understand it the jsonlog module effectively overrides the "stderr" output, which goes to the file named *.log (which I find super confusing, incidentally). But I was expecting to get the CSV file as well. We didn't; we only got the *.log file with the (partial) JSON logs. Is there something I'm missing here or is this unexpected?

-- greg
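PS: for completeness, this is roughly the configuration in play, paraphrased from memory rather than copied off the server, so treat the exact values as approximate:

SHOW shared_preload_libraries;  -- 'jsonlog'
SHOW logging_collector;         -- on
SHOW log_destination;           -- 'stderr,csvlog'
SHOW log_checkpoints;           -- on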
Re: jsonlog logging only some messages?
On 27 February 2018 at 02:04, Michael Paquier wrote: > On Mon, Feb 26, 2018 at 05:38:56PM +0000, Greg Stark wrote: >> I tried loading the jsonlog module from >> https://github.com/michaelpq/pg_plugins into Postgres 9.6. >> >> However it seems it resulted in logs only for session log messages but >> not any background worker log messages. We have log_checkpoints set >> but there were no log messages in the json log about checkpoints. Nor >> were there any from autovacuum. > > Hm. I have just loaded jsonlog on a 9.6 and 10 instance where > log_checkpoints is enabled with this background worker which logs a > simple string every 10s: > https://github.com/michaelpq/pg_plugins/tree/master/hello_world > > Both checkpoint logs and the logs of the bgworker are showing up for me. Weird. I guess I have some more debugging with gdb to do. >> Also I have log_destination set to stderr,cvslog and as I understand >> it the jsonlog module effectively overrides the "stderr" output which >> goes to the file named *.log (which I find super confusing, >> incidentally). But I was expecting to get the csv file as well. We >> didn't, we only got the *.log file with the (partial) json logs. Is >> there something I'm missing here or is this unexpected? > > Yeah, that's unfortunately expected... jsonlog enforces > edata->output_to_server to false, which has the advantage of disabling > extra log outputs, but also has the advantage of preventing duplicate > entries into stderr. One way that I can see to solve things would be to > patch the backend and get output_to_server replaced by a filter which > allows a plugin to choose what are the extra output types allowed. In > this case you don't want stderr later on, but would still want csvlog to > happen, instead of having an all-or-nothing switch. I haven't tested, > but it could be possible to have as well jsonlog enforce Log_destination > to only use csvlog after it generates its entry so as stderr is not > duplicated by csvlog gets though. Not sure how you would reset the > parameter though, so you may perhaps want to invoke an extra plugin > which outputs to csvlog as jsonlog cascades through things. I would actually lean another way. If jsonlog opened *.json and wrote there then it could leave the output_to_server field unchanged. It does look like this might be a bit awkward with yet more of the static functions needing to be duplicated. -- greg