Re: logical decoding and replication of sequences, take 2
On Fri, Jul 28, 2023 at 6:12 PM Tomas Vondra wrote:
>
> On 7/28/23 11:42, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra
> > wrote:
> >>
> >> On 7/26/23 09:27, Amit Kapila wrote:
> >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila
> >>> wrote:
> >>
> >> Anyway, I was thinking about this a bit more, and it seems it's not as
> >> difficult to use the page LSN to ensure sequences don't go backwards.
> >>
> >
> > While studying the changes for this proposal and related areas, I have
> > a few comments:
> > 1. I think you need to advance the origin if it is changed due to
> > copy_sequence(), otherwise, if the sync worker restarts after
> > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
> > value.
> >
>
> True, we want to restart at the new origin_startpos.
>
> > 2. Between the time of SYNCDONE and READY state, the patch can skip
> > applying non-transactional sequence changes even if it should apply
> > it. The reason is that during that state change
> > should_apply_changes_for_rel() decides whether to apply change based
> > on the value of remote_final_lsn which won't be set for
> > non-transactional change. I think we need to send the start LSN of a
> > non-transactional record and then use that as remote_final_lsn for
> > such a change.
>
> Good catch. remote_final_lsn is set in apply_handle_begin, but that
> won't happen for sequences. We're already sending the LSN, but
> logicalrep_read_sequence ignores it - it should be enough to add it to
> LogicalRepSequence and then set it in apply_handle_sequence().
>

As per my understanding, the LSN sent is EndRecPtr of record which is
the beginning of the next record (means current_record_end + 1). For
comparing the current record, we use the start_position of the record
as we do when we use the remote_final_lsn via apply_handle_begin().

> > 3. For non-transactional sequence change apply, we don't set
> > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as
> > we are doing in apply_handle_commit_internal() before calling
> > CommitTransactionCommand(). So, that can lead to the origin moving
> > backwards after restart which will lead to requesting and applying the
> > same changes again and for that period of time sequence can go
> > backwards. This needs some more thought as to what is the correct
> > behaviour/solution for this.
> >
>
> I think saying "origin moves backwards" is a bit misleading. AFAICS the
> origin position is not actually moving backwards, it's more that we
> don't (and can't) move it forwards for each non-transactional change. So
> yeah, we may re-apply those, and IMHO that's expected - the sequence is
> allowed to be "ahead" on the subscriber.

But, if this happens then for a period of time the sequence will go
backwards relative to what one would have observed before restart.

--
With Regards,
Amit Kapila.
Re: Row pattern recognition
>>> - PATTERN variables do not have to exist in the DEFINE clause. They are
>>> - considered TRUE if not present.
>>
>> Do you think we really need this? I found a criticism regarding this.
>> https://link.springer.com/article/10.1007/s13222-022-00404-3
>> "3.2 Explicit Definition of All Row Pattern Variables"
>> What do you think?
>
> I think that a large part of obeying the standard is to allow queries
> from other engines to run the same on ours. The standard does not
> require the pattern variables to be defined and so there are certainly
> queries out there without them, and that hurts migrating to
> PostgreSQL.

Yeah, migration is a good point. I agree we should have the feature.

>>> When we get to adding count in the MEASURES clause, there will be a
>>> difference between no match and empty match, but that does not apply
>>> here.
>>
>> Can you elaborate more? I understand that "no match" and "empty match"
>> are different things. But I do not understand how the difference
>> affects the result of count.
>
> This query:
>
> SELECT v.a, wcnt OVER w, count(*) OVER w
> FROM (VALUES ('A')) AS v (a)
> WINDOW w AS (
>   ORDER BY v.a
>   MEASURES count(*) AS wcnt
>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>   PATTERN (B)
>   DEFINE B AS B.a = 'B'
> )
>
> produces this result:
>
>  a | wcnt | count
> ---+------+-------
>  A |      |     0
> (1 row)
>
> Inside the window specification, *no match* was found and so all of
> the MEASURES are null. The count(*) in the target list however, still
> exists and operates over zero rows.
>
> This very similar query:
>
> SELECT v.a, wcnt OVER w, count(*) OVER w
> FROM (VALUES ('A')) AS v (a)
> WINDOW w AS (
>   ORDER BY v.a
>   MEASURES count(*) AS wcnt
>   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>   PATTERN (B?)
>   DEFINE B AS B.a = 'B'
> )
>
> produces this result:
>
>  a | wcnt | count
> ---+------+-------
>  A |    0 |     0
> (1 row)
>
> In this case, the pattern is B? instead of just B, which produces an
> *empty match* for the MEASURES to be applied over.
Thank you for the detailed explanation. I think I understand now.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
Re: Avoid possible memory leak (src/common/rmtree.c)
On Tue, Jul 25, 2023 at 04:45:22PM +0200, Daniel Gustafsson wrote:
> Skimming the tree there doesn't seem to be any callers which aren't exiting or
> ereporting on failure so the real-world impact seems low.  That being said,
> silencing static analyzers could be reason enough to delay allocation.

A different reason would be out-of-core code that uses rmtree() in a
memory context where the leak would be an issue if facing a failure
continuously?  Delaying the allocation until after the opendir() seems
like a good practice anyway.
--
Michael
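The ordering Michael describes can be sketched in plain C. This is an illustrative stand-in, not the actual src/common/rmtree.c code (the function and variable names here are made up): open the directory first, and only allocate scratch space once opendir() has succeeded, so the early failure path cannot leak.

```c
#include <dirent.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical sketch of the "delay allocation until after opendir()"
 * pattern discussed above.  If opendir() fails we return before any
 * allocation has happened, so there is nothing to leak.
 */
bool
remove_tree_sketch(const char *path)
{
	DIR		   *dir;
	char	  **filenames;

	dir = opendir(path);
	if (dir == NULL)
		return false;			/* nothing allocated yet: no leak possible */

	/* allocate only after the directory is known to be readable */
	filenames = malloc(16 * sizeof(char *));
	if (filenames == NULL)
	{
		closedir(dir);
		return false;
	}

	/* ... read entries, recurse into subdirectories, unlink ... (elided) */

	free(filenames);
	closedir(dir);
	return true;
}
```

The point is purely about ordering: callers that ereport and bail out on an opendir() failure never see the allocation, so static analyzers (and long-lived memory contexts) stay happy.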
Re: Support worker_spi to execute the function dynamically.
On Fri, Jul 28, 2023 at 12:06:33PM +0200, Alvaro Herrera wrote:
> Hmm, I think having all the workers doing their work in the same table is
> better -- if nothing else, because it gives us the opportunity to show
> how to use some other coding technique (but also because we are forced
> to write the SQL code in a way that's correct for potentially multiple
> concurrent workers, which sounds useful to demonstrate).  Can't we
> instead solve the race condition by having some shared resource that
> blocks the other workers from proceeding until the schema has been
> created?  Perhaps an LWLock, or a condition variable, or an advisory
> lock.

That's an interesting idea that you have here.  So basically, you would
have all the workers use the same schema and do their counting work on
the same base table?  Or should each worker use the same schema, perhaps
defined by a GUC, but different tables?

One thing that has been itching me a bit with this module was to be able
to pass down to the main worker routine more arguments than just an int
ID, but I could not bring myself to do that just for the wait event
patch, like:
- The database to connect to.
- The table to create.
- The schema to use.

If any of these are NULL, just use as default what we have now, with
perhaps the bgworker PID as ID instead of a user-specified one.

Having a shared memory state is a second thing I was planning to add,
and that can be useful as a point of reference in a template.  The other
patch about custom wait events introduces that, FWIW, to track the
custom wait events added.
--
Michael
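The serialization Alvaro suggests — one shared lock so that exactly one worker creates the schema while the others wait — can be illustrated with a toy C program. Pthreads stand in here for the LWLock/advisory lock a real worker_spi bgworker would use, and all names (create_schema_once, run_workers) are hypothetical; this is a sketch of the idea, not worker_spi code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Shared state: in worker_spi this would live in shared memory. */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static bool schema_created = false;
static int	create_calls = 0;

/*
 * Every worker calls this on startup; the lock guarantees only the
 * first one actually runs the (simulated) CREATE SCHEMA step.
 */
static void
create_schema_once(void)
{
	pthread_mutex_lock(&init_lock);
	if (!schema_created)
	{
		/* in a real worker: SPI "CREATE SCHEMA ..." / "CREATE TABLE ..." */
		create_calls++;
		schema_created = true;
	}
	pthread_mutex_unlock(&init_lock);
}

static void *
worker_main(void *arg)
{
	(void) arg;
	create_schema_once();		/* safe no matter how many workers race here */
	return NULL;
}

/* Launch n "workers" and report how many times the schema was created. */
int
run_workers(int n)
{
	pthread_t	threads[16];

	if (n > 16)
		n = 16;
	for (int i = 0; i < n; i++)
		pthread_create(&threads[i], NULL, worker_main, NULL);
	for (int i = 0; i < n; i++)
		pthread_join(threads[i], NULL);
	return create_calls;		/* should be exactly 1 */
}
```

The same shape works with an LWLock or a session advisory lock: acquire, check whether initialization already happened, initialize if not, release.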
Re: Support worker_spi to execute the function dynamically.
On Fri, Jul 28, 2023 at 01:34:15PM -0700, Andres Freund wrote:
> On 2023-07-28 13:45:29 +0900, Michael Paquier wrote:
>> Having each bgworker on its own schema would be enough to prevent
>> conflicts, but I'd like to add a second thing: a check on
>> pg_stat_activity.wait_event after starting the workers.  I have added
>> something like that in the patch I have posted today for the custom
>> wait events at [1] and it enforces the startup sequences of the
>> workers in a stricter way.
>
> Is that very meaningful? ISTM the interesting thing to check for would be that
> the state is idle?

That's interesting for the sake of the other patch, to check that the
custom events are reported.  Anyway, I am a bit short on time, so I have
applied the simplest fix, where the dynamic workers just use a different
base ID to get out of your way.
--
Michael
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 5:03 PM Melanie Plageman wrote: > When you were working on this, what were the downsides of only having the > criteria that 1) page would be all_visible/all_frozen and 2) we did prune > something (i.e. no consideration of FPIs)? Conditioning freezing on "all_visible && all_frozen" seems likely to always be a good idea when it comes to any sort of speculative trigger criteria. Most of the wins in this area will come from avoiding future FPIs, and you can't hope to do that unless you freeze everything on the page. The cost of freezing when "all_visible && all_frozen" does not hold may be quite low, but the benefits will also be very low. Also, the way that recovery conflicts are generated for freeze records somewhat depends on this right now -- though you could probably fix that if you had to. Note that we *don't* really limit ourselves to cases where an FPI was generated from pruning, exactly -- even on 16. With page-level checksums we can also generate an FPI just to set a hint bit. That triggers freezing too, even on insert-only tables (assuming checksums are enabled). I expect that that will be an important condition for triggering freezing in practice. Your question about 2 seems equivalent to "why not just always freeze?". I don't think that that's a bad question -- quite the opposite. Even trying to give an answer to this question would amount to getting involved in new work on VACUUM, though. So I don't think I can help you here. -- Peter Geoghegan
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
Hello

I had a look at the patch and tested it on the CI bot; it compiles and
tests fine with both autoconf and meson.  I noticed that the v2 patch
contains the v1 patch file as well.  Not sure if intended, but probably
put there by accident.

> I don't think "invalidating" is the right terminology. Note that we already
> have InvalidateBuffer() - but it's something we can't allow users to do, as it
> throws away dirty buffer contents (it's used for things like dropping a
> table).
>
> How about using "discarding" for this functionality?

I think "invalidating" is the right terminology here; it is exactly what
the feature is doing.  It tries to invalidate a buffer ID by calling the
InvalidateBuffer() routine inside the buffer manager, and calls
FlushBuffer() before invalidating if the buffer is marked dirty.  The
problem here is that InvalidateBuffer() could be dangerous, because it
allows a user to invalidate a buffer that may have data in other tables
not owned by the current user.  I think it all comes down to the purpose
of this feature.

Based on the description in this email thread, I feel like this feature
should be categorized as a developer-only feature, to be used by PG
developers to experiment and observe some development works by
invalidating one or more specific buffers.  If this is the case, it may
be helpful to add a "DEVELOPER_OPTIONS" GUC that allows or disallows
TryInvalidateBuffer() from running, or makes it return an error if the
user does not have this developer option enabled.

If the purpose of this feature is for general users, then it would make
sense to have something like pg_unwarm (exactly the opposite of
pg_prewarm) that takes a table name (instead of a buffer ID) and drops
all buffers associated with that table.  There would be permission
checks as well, so a user cannot pg_unwarm a table owned by someone
else.  A user in this case won't be able to invalidate a particular
buffer, but he/she should not have to as a regular user anyway.

thanks!

Cary Huang - HighGo Software Inc.
(Canada) cary.hu...@highgo.ca www.highgo.ca
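The flush-then-invalidate behaviour Cary describes for TryInvalidateBuffer() — write the buffer out first if it is dirty, then mark it invalid — can be modelled as a tiny state machine. This is a toy illustration with made-up names (ToyBuffer, toy_try_invalidate), not the pg_buffercache patch itself:

```c
#include <stdbool.h>

/* A buffer reduced to the two flags that matter for this discussion. */
typedef struct ToyBuffer
{
	bool		valid;
	bool		dirty;
	int			flush_count;	/* how many times we had to write it out */
} ToyBuffer;

/*
 * Returns true if the buffer was invalidated.  Mirrors the order of
 * operations described above: FlushBuffer() first when dirty, so no
 * modified contents are thrown away, then InvalidateBuffer().
 */
bool
toy_try_invalidate(ToyBuffer *buf)
{
	if (!buf->valid)
		return false;			/* nothing to do */

	if (buf->dirty)
	{
		/* stand-in for FlushBuffer(): persist contents before dropping them */
		buf->flush_count++;
		buf->dirty = false;
	}

	/* stand-in for InvalidateBuffer() */
	buf->valid = false;
	return true;
}
```

The safety question in the thread is orthogonal to this mechanism: it is about *who* may call it (a developer option vs. a permission-checked, table-level pg_unwarm), not about the flush-then-invalidate sequence itself.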
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:49 PM Peter Geoghegan wrote: > > On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > > > Put differently, I can't see any reason to care whether pruning > > > emitted an FPI or not. Either way, it's very unlikely that freezing > > > needs to do so. > > > > +many > > > > Using FPIs having been generated during pruning as part of the logic to > > decide > > whether to freeze makes no sense to me. We likely need some heuristic here, > > but I suspect it has to look quite different than the FPI condition. > > I think that it depends on how you frame it. It makes zero sense that > that's the only thing that can do something like this at all. It makes > some sense as an incremental improvement, since there is no way that > we can lose by too much, even if you make arbitrarily pessimistic > assumptions. In other words, you won't be able to come up with an > adversarial benchmark where this loses by too much. When you were working on this, what were the downsides of only having the criteria that 1) page would be all_visible/all_frozen and 2) we did prune something (i.e. no consideration of FPIs)? > As I said, I'm no longer interested in working on VACUUM (though I do > hope that Melanie or somebody else picks up where I left off). I have > nothing to say about any new work in this area. If you want me to do > something in the scope of the work on 16, as a release management > task, please be clear about what that is. I'm only talking about this in the context of future work for 17. - Melanie
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote:
> Using FPIs having been generated during pruning as part of the logic to decide
> whether to freeze makes no sense to me. We likely need some heuristic here,
> but I suspect it has to look quite different than the FPI condition.

Why do we need a heuristic at all?  What we're talking about here, AIUI,
is under what circumstance we should eagerly freeze a page that can be
fully frozen, leaving no tuples that vacuum ever needs to worry about
again except if the page is further modified.  Freezing in such
circumstances seems highly likely to work out in our favor.

The only argument against it is that the page might soon be modified
again, in which case the effort of freezing would be mostly wasted.  But
I'm not sure whether that's really a valid rationale, because the win
from not having to vacuum the page again if it isn't modified seems
pretty big, and emitting a freeze record for a page we've just pruned
doesn't seem that expensive.

If it is a valid rationale, my first thought would be to make the
heuristic based on vacuum_freeze_min_age.  XID age isn't a particularly
great proxy for whether the page is still actively being modified, but
at least it has tradition going for it.
--
Robert Haas
EDB: http://www.enterprisedb.com
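Robert's vacuum_freeze_min_age idea could be sketched as a simple predicate: eagerly freeze a fully-freezable page only once its oldest XID has aged past the threshold. This is purely illustrative — real XID age computation must handle wraparound (PostgreSQL compares XIDs with TransactionIdPrecedes(), not plain subtraction), and the parameter names here are made up:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical age-based trigger for eager freezing, in the spirit of
 * vacuum_freeze_min_age: an old page is assumed unlikely to be modified
 * again soon, so freezing it now is less likely to be wasted work.
 * Simplified: assumes no XID wraparound between the two values.
 */
bool
should_freeze_eagerly(TransactionId oldest_xid_on_page,
					  TransactionId next_xid,
					  uint32_t freeze_min_age)
{
	uint32_t	age = next_xid - oldest_xid_on_page;

	return age >= freeze_min_age;
}
```

As Robert notes, XID age is only a rough proxy for write activity; its main virtue is that it is the threshold VACUUM already uses for lazy freezing.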
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 4:30 PM Andres Freund wrote: > > Put differently, I can't see any reason to care whether pruning > > emitted an FPI or not. Either way, it's very unlikely that freezing > > needs to do so. > > +many > > Using FPIs having been generated during pruning as part of the logic to decide > whether to freeze makes no sense to me. We likely need some heuristic here, > but I suspect it has to look quite different than the FPI condition. I think that it depends on how you frame it. It makes zero sense that that's the only thing that can do something like this at all. It makes some sense as an incremental improvement, since there is no way that we can lose by too much, even if you make arbitrarily pessimistic assumptions. In other words, you won't be able to come up with an adversarial benchmark where this loses by too much. As I said, I'm no longer interested in working on VACUUM (though I do hope that Melanie or somebody else picks up where I left off). I have nothing to say about any new work in this area. If you want me to do something in the scope of the work on 16, as a release management task, please be clear about what that is. -- Peter Geoghegan
Re: Support worker_spi to execute the function dynamically.
Hi, On 2023-07-28 13:45:29 +0900, Michael Paquier wrote: > Having each bgworker on its own schema would be enough to prevent > conflicts, but I'd like to add a second thing: a check on > pg_stat_activity.wait_event after starting the workers. I have added > something like that in the patch I have posted today for the custom > wait events at [1] and it enforces the startup sequences of the > workers in a stricter way. Is that very meaningful? ISTM the interesting thing to check for would be that the state is idle? Greetings, Andres Freund
Re: Eager page freeze criteria clarification
Hi, On 2023-07-28 16:19:47 -0400, Robert Haas wrote: > On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > > Or are we trying to determine how likely > > > the freeze record is to emit an FPI and avoid eager freezing when it > > > isn't worth the cost? > > > > No, that's not something that we're doing right now (we just talked > > about doing something like that). In 16 VACUUM just "makes the best > > out of a bad situation" when an FPI was already required during > > pruning. We have already "paid for the privilege" of writing some WAL > > for the page at that point, so it's reasonable to not squander a > > window of opportunity to avoid future FPIs in future VACUUM > > operations, by freezing early. > > This doesn't make sense to me. It's true that if the pruning record > emitted an FPI, then the freezing record probably won't need to do so > either, unless by considerable ill-fortune the redo pointer was moved > in the tiny window between pruning and freezing. But isn't that also > true that if the pruning record *didn't* emit an FPI? In that case, > the pruning record wasn't the first WAL-logged modification to the > page during the then-current checkpoint cycle, and some earlier > modification already did it. But in that case also the freezing won't > need to emit a new FPI, except for the same identical case where the > redo pointer moves at just the wrong time. > > Put differently, I can't see any reason to care whether pruning > emitted an FPI or not. Either way, it's very unlikely that freezing > needs to do so. +many Using FPIs having been generated during pruning as part of the logic to decide whether to freeze makes no sense to me. We likely need some heuristic here, but I suspect it has to look quite different than the FPI condition. Greetings, Andres
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote: > > Or are we trying to determine how likely > > the freeze record is to emit an FPI and avoid eager freezing when it > > isn't worth the cost? > > No, that's not something that we're doing right now (we just talked > about doing something like that). In 16 VACUUM just "makes the best > out of a bad situation" when an FPI was already required during > pruning. We have already "paid for the privilege" of writing some WAL > for the page at that point, so it's reasonable to not squander a > window of opportunity to avoid future FPIs in future VACUUM > operations, by freezing early. This doesn't make sense to me. It's true that if the pruning record emitted an FPI, then the freezing record probably won't need to do so either, unless by considerable ill-fortune the redo pointer was moved in the tiny window between pruning and freezing. But isn't that also true that if the pruning record *didn't* emit an FPI? In that case, the pruning record wasn't the first WAL-logged modification to the page during the then-current checkpoint cycle, and some earlier modification already did it. But in that case also the freezing won't need to emit a new FPI, except for the same identical case where the redo pointer moves at just the wrong time. Put differently, I can't see any reason to care whether pruning emitted an FPI or not. Either way, it's very unlikely that freezing needs to do so. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:27 PM Melanie Plageman wrote: > I see. I don't have an opinion on the "best of a bad situation" > argument. Though, I think it is worth amending the comment in the code > to include this explanation. I think that the user-facing docs should be completely overhauled to make that clear, and reasonably accessible. It's hard to talk about in code comments because it's something that can only be effective over time, across multiple VACUUM operations. > But, ISTM that there should also be some independent heuristic to > determine whether or not it makes sense to freeze the page. That could > be related to whether or not it will be cheap to do so (in which case > we can check if we will have to emit an FPI as part of the freeze > record) or it could be related to whether or not the freezing is > likely to be pointless (we are likely to update the page again soon). It sounds like you're interested in adding additional criteria to trigger page-level freezing. That's something that the current structure anticipates. To give a slightly contrived example: it would be very easy to add another condition, where (say) we call random() to determine whether or not to freeze the page. You'd very likely want to gate this new trigger criteria in the same way as the FPI criteria (i.e. only do it when "prunestate->all_visible && prunestate->all_frozen" hold). We've decoupled the decision to freeze from what it means to execute freezing itself. Having additional trigger criteria makes a lot of sense. I'm sure that it makes sense to add at least one more, and it seems possible that adding several more makes sense. Obviously, that will have to be new work targeting 17, though. I made a decision to stop working on VACUUM, though, so I'm afraid I won't be able to offer much help with any of this. (Happy to give more background information, though.) -- Peter Geoghegan
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 3:00 PM Peter Geoghegan wrote:
>
> On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman
> wrote:
> > if (pagefrz.freeze_required || tuples_frozen == 0 ||
> >     (prunestate->all_visible && prunestate->all_frozen &&
> >      fpi_before != pgWalUsage.wal_fpi))
> > {
> >
> > I'm trying to understand the condition fpi_before !=
> > pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.
>
> You mean "prunestate->all_visible && prunestate->all_frozen", which is
> a condition of applying FPI-based eager freezing, but not traditional
> lazy freezing?

Ah no, a very key word in my sentence was off: I meant "don't eager
freeze *unless* pruning emitted an FPI".

I was asking about the FPI-trigger optimization specifically.  I'm
clear on the all_frozen/all_visible criteria.

> > Is this test meant to guard against unnecessary freezing or to avoid
> > freezing when the cost is too high? That is, are we trying to
> > determine how likely it is that the page has been recently modified
> > and avoid eager freezing when it would be pointless (because the page
> > will soon be modified again)?
>
> Sort of. This cost of freezing over time is weirdly nonlinear, so it's
> hard to give a simple answer.
>
> The justification for the FPI trigger optimization is that FPIs are
> overwhelmingly the cost that really matters when it comes to freezing
> (and vacuuming in general) -- so we might as well make the best out of
> a bad situation when pruning happens to get an FPI. There can easily
> be a 10x or more cost difference (measured in total WAL volume)
> between freezing without an FPI and freezing with an FPI.

...

> In 16 VACUUM just "makes the best
> out of a bad situation" when an FPI was already required during
> pruning. We have already "paid for the privilege" of writing some WAL
> for the page at that point, so it's reasonable to not squander a
> window of opportunity to avoid future FPIs in future VACUUM
> operations, by freezing early.
>
> We're "taking a chance" on being able to get freezing out of the way
> early when an FPI triggers freezing. It's not guaranteed to work out
> in each individual case, of course, but even if we assume it's fairly
> unlikely to work out (which is very pessimistic) it's still very
> likely a good deal.
>
> This strategy (the 16 strategy of freezing eagerly because we already
> got an FPI) seems safer than a strategy involving freezing eagerly
> because we won't get an FPI as a result. If for no other reason than
> this: with the approach in 16 we already know for sure that we'll have
> written an FPI anyway. It's hard to imagine somebody being okay with
> the FPIs, but not being okay with the other extra WAL.

I see.  I don't have an opinion on the "best of a bad situation"
argument.  Though, I think it is worth amending the comment in the code
to include this explanation.

But, ISTM that there should also be some independent heuristic to
determine whether or not it makes sense to freeze the page.  That could
be related to whether or not it will be cheap to do so (in which case
we can check if we will have to emit an FPI as part of the freeze
record) or it could be related to whether or not the freezing is likely
to be pointless (we are likely to update the page again soon).

It sounds like it was discussed before, but I'd be interested in
revisiting it and happy to test out various ideas.

- Melanie
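The lazy_scan_prune condition being discussed in this thread can be restated as a standalone predicate, which makes its three trigger paths easier to see. The field names follow the quoted snippet; the boolean pruning_emitted_fpi replaces the fpi_before != pgWalUsage.wal_fpi comparison (this is a paraphrase for readability, not the actual PostgreSQL code):

```c
#include <stdbool.h>

/*
 * Will VACUUM freeze this page?  Three independent triggers:
 *   1. traditional lazy criteria force it (freeze_required),
 *   2. there is nothing to freeze anyway (tuples_frozen == 0), or
 *   3. eager/FPI-based freezing: the page would become fully
 *      all-visible + all-frozen AND pruning already emitted an FPI.
 */
bool
will_freeze_page(bool freeze_required,	/* lazy criteria met */
				 int tuples_frozen,		/* freeze plans built for page */
				 bool all_visible,
				 bool all_frozen,
				 bool pruning_emitted_fpi)
{
	return freeze_required ||
		tuples_frozen == 0 ||
		(all_visible && all_frozen && pruning_emitted_fpi);
}
```

Note how the third arm encodes both points from the exchange above: eager freezing only fires when the page can be marked fully frozen in the visibility map, and only when pruning has already "paid for" an FPI.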
Re: Eager page freeze criteria clarification
On Fri, Jul 28, 2023 at 11:13 AM Melanie Plageman wrote:
> if (pagefrz.freeze_required || tuples_frozen == 0 ||
>     (prunestate->all_visible && prunestate->all_frozen &&
>      fpi_before != pgWalUsage.wal_fpi))
> {
>
> I'm trying to understand the condition fpi_before !=
> pgWalUsage.wal_fpi -- don't eager freeze if pruning emitted an FPI.

You mean "prunestate->all_visible && prunestate->all_frozen", which is
a condition of applying FPI-based eager freezing, but not traditional
lazy freezing?

Obviously, the immediate impact of that is that the FPI trigger
condition is not met unless we know for sure that the page will be
marked all-visible and all-frozen in the visibility map afterwards.  A
page that's eligible to become all-visible will also be seen as
eligible to become all-frozen in the vast majority of cases, but there
are some rare and obscure cases involving MultiXacts that must be
considered here.

There is little point in freezing early unless we have at least some
chance of not having to freeze the same page again in the future
(ideally forever).  There is generally no point in freezing most of the
tuples on a page when there'll still be one or more not-yet-eligible
unfrozen tuples that get left behind -- we might as well not bother
with freezing at all when we see a page where that'll be the outcome
from freezing.  However, there is (once again) a need to account for
rare and extreme cases -- it still needs to be possible to do that.

Specifically, we may be forced to freeze a page that's left with some
remaining unfrozen tuples when VACUUM is fundamentally incapable of
freezing them due to its OldestXmin/removable cutoff being too old.
That can happen when VACUUM needs to freeze according to the
traditional age-based settings, and yet the OldestXmin/removable cutoff
gets held back (by a leaked replication slot or whatever).

(Actually, VACUUM FREEZE will freeze only a subset of the tuples from
some heap pages far more often.  VACUUM FREEZE seems like a bad design
to me, though -- it uses the most aggressive possible XID cutoff for
freezing when it should probably hold off on freezing those individual
pages where we determine that it makes little sense.  We need to focus
more on physical pages and their costs, and less on XID cutoffs.)

> Is this test meant to guard against unnecessary freezing or to avoid
> freezing when the cost is too high? That is, are we trying to
> determine how likely it is that the page has been recently modified
> and avoid eager freezing when it would be pointless (because the page
> will soon be modified again)?

Sort of.  This cost of freezing over time is weirdly nonlinear, so it's
hard to give a simple answer.

The justification for the FPI trigger optimization is that FPIs are
overwhelmingly the cost that really matters when it comes to freezing
(and vacuuming in general) -- so we might as well make the best out of
a bad situation when pruning happens to get an FPI.  There can easily
be a 10x or more cost difference (measured in total WAL volume) between
freezing without an FPI and freezing with an FPI.

> Or are we trying to determine how likely
> the freeze record is to emit an FPI and avoid eager freezing when it
> isn't worth the cost?

No, that's not something that we're doing right now (we just talked
about doing something like that).  In 16 VACUUM just "makes the best
out of a bad situation" when an FPI was already required during
pruning.  We have already "paid for the privilege" of writing some WAL
for the page at that point, so it's reasonable to not squander a window
of opportunity to avoid future FPIs in future VACUUM operations, by
freezing early.

We're "taking a chance" on being able to get freezing out of the way
early when an FPI triggers freezing.  It's not guaranteed to work out
in each individual case, of course, but even if we assume it's fairly
unlikely to work out (which is very pessimistic) it's still very likely
a good deal.

This strategy (the 16 strategy of freezing eagerly because we already
got an FPI) seems safer than a strategy involving freezing eagerly
because we won't get an FPI as a result.  If for no other reason than
this: with the approach in 16 we already know for sure that we'll have
written an FPI anyway.  It's hard to imagine somebody being okay with
the FPIs, but not being okay with the other extra WAL.

> But, the final rationale is still not clear to me. Could we add a
> comment above the if condition specifying both:
> a) what the test is a proxy for
> b) the intended outcome (when do we expect to eager freeze)
> And perhaps we could even describe a scenario where this heuristic is
> effective?

There are lots of scenarios where it'll be effective.  I agree that
there is a need to document this stuff a lot better.  I have a pending
doc patch that overhauls the user-facing docs in this area.  My latest
draft is here:

https://postgr.es/m/CAH2-Wz=UUJz+MMb1AxFzz-HDA=1t1fx_kmrdovopzxkpa-t...@mail.gmail.com
Re: add timing information to pg_upgrade
On Fri, Jul 28, 2023 at 10:38:14AM -0700, Nathan Bossart wrote: > I'm okay with simply adding the time. However, I wonder if we want to > switch to seconds, minutes, hours, etc. if the step takes longer. This > would add a bit of complexity, but it would improve human readability. > Alternatively, we could keep the code simple and machine readable by always > using milliseconds. ... or maybe we show both like psql does. Attached іs a new version of the patch in which I've made use of the INSTR_TIME_* macros and enhanced the output formatting (largely borrowed from PrintTiming() in src/bin/psql/common.c). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 6a4b836b33c3e18a57438044a10da95ec89dcb65 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 27 Jul 2023 16:16:45 -0700 Subject: [PATCH v2 1/1] add timing information to pg_upgrade --- src/bin/pg_upgrade/util.c | 55 ++- 1 file changed, 54 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c index 21ba4c8f12..25e43a593a 100644 --- a/src/bin/pg_upgrade/util.c +++ b/src/bin/pg_upgrade/util.c @@ -9,14 +9,19 @@ #include "postgres_fe.h" +#include #include #include "common/username.h" #include "pg_upgrade.h" +#include "portability/instr_time.h" LogOpts log_opts; +static instr_time step_start; + static void pg_log_v(eLogType type, const char *fmt, va_list ap) pg_attribute_printf(2, 0); +static char *get_time_str(double time_ms); /* @@ -137,6 +142,8 @@ prep_status(const char *fmt,...) /* trim strings */ pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message); + + INSTR_TIME_SET_CURRENT(step_start); } /* @@ -170,6 +177,8 @@ prep_status_progress(const char *fmt,...) pg_log(PG_REPORT, "%-*s", MESSAGE_WIDTH, message); else pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message); + + INSTR_TIME_SET_CURRENT(step_start); } static void @@ -280,11 +289,55 @@ pg_fatal(const char *fmt,...) 
 }
 
+static char *
+get_time_str(double time_ms)
+{
+	double		seconds;
+	double		minutes;
+	double		hours;
+	double		days;
+
+	if (time_ms < 1000.0)
+		return psprintf("%.3f ms", time_ms);
+
+	seconds = time_ms / 1000.0;
+	minutes = floor(seconds / 60.0);
+	seconds -= 60.0 * minutes;
+	if (minutes < 60.0)
+		return psprintf("%.3f ms (%02d:%06.3f)",
+						time_ms, (int) minutes, seconds);
+
+	hours = floor(minutes / 60.0);
+	minutes -= 60.0 * hours;
+	if (hours < 24.0)
+		return psprintf("%.3f ms (%02d:%02d:%06.3f)",
+						time_ms, (int) hours, (int) minutes, seconds);
+
+	days = floor(hours / 24.0);
+	hours -= 24.0 * days;
+	return psprintf("%.3f ms (%.0f d %02d:%02d:%06.3f)",
+					time_ms, days, (int) hours, (int) minutes, seconds);
+}
+
+
 void
 check_ok(void)
 {
+	instr_time	step_end;
+	char	   *step_time;
+
+	Assert(!INSTR_TIME_IS_ZERO(step_start));
+
+	INSTR_TIME_SET_CURRENT(step_end);
+	INSTR_TIME_SUBTRACT(step_end, step_start);
+	step_time = get_time_str(INSTR_TIME_GET_MILLISEC(step_end));
+
+	/* reset start time */
+	INSTR_TIME_SET_ZERO(step_start);
+
 	/* all seems well */
-	report_status(PG_REPORT, "ok");
+	report_status(PG_REPORT, "ok\t%s", step_time);
+	pfree(step_time);
 }
-- 
2.25.1
Re: In Postgres 16 BETA, should the ParseNamespaceItem have the same index as it's RangeTableEntry?
Hello,

Thank you Amit, changing the 3rd argument to 0 fixes some of the errors (there are 6 out of 24 errors still failing) but it throws a new one: "ERROR: bad buffer ID: 0". We will need to take a more in-depth look at why this is occurring, but thank you so much for the help.

Thank you,
Matheus Farias

On Wed, Jul 26, 2023 at 8:30 AM, Amit Langote wrote: > Hello, > > On Fri, Jul 21, 2023 at 5:05 AM Farias de Oliveira > wrote: > > > > Hello, > > > > Thank you for the help guys and I'm so sorry for my late response. > Indeed, the error relies on the ResultRelInfo. In > GetResultRTEPermissionInfo() function, it does a checking on the > relinfo->ri_RootResultRelInfo variable. I believe that it should go inside > this scope: > > > > > > if (relinfo->ri_RootResultRelInfo) > > { > > /* > > * For inheritance child result relations (a partition routing target > > * of an INSERT or a child UPDATE target), this returns the root > > * parent's RTE to fetch the RTEPermissionInfo because that's the only > > * one that has one assigned. 
> > */ > > rti = relinfo->ri_RootResultRelInfo->ri_RangeTableIndex; > > } > > > > The relinfo contains: > > > > {type = T_ResultRelInfo, ri_RangeTableIndex = 5, ri_RelationDesc = > 0x7f44e3308cc8, ri_NumIndices = 0, ri_IndexRelationDescs = 0x0, > ri_IndexRelationInfo = 0x0, ri_RowIdAttNo = 0, > > ri_extraUpdatedCols = 0x0, ri_projectNew = 0x0, ri_newTupleSlot = 0x0, > ri_oldTupleSlot = 0x0, ri_projectNewInfoValid = false, ri_TrigDesc = 0x0, > ri_TrigFunctions = 0x0, > > ri_TrigWhenExprs = 0x0, ri_TrigInstrument = 0x0, ri_ReturningSlot = > 0x0, ri_TrigOldSlot = 0x0, ri_TrigNewSlot = 0x0, ri_FdwRoutine = 0x0, > ri_FdwState = 0x0, > > ri_usesFdwDirectModify = false, ri_NumSlots = 0, > ri_NumSlotsInitialized = 0, ri_BatchSize = 0, ri_Slots = 0x0, ri_PlanSlots > = 0x0, ri_WithCheckOptions = 0x0, > > ri_WithCheckOptionExprs = 0x0, ri_ConstraintExprs = 0x0, > ri_GeneratedExprsI = 0x0, ri_GeneratedExprsU = 0x0, ri_NumGeneratedNeededI > = 0, ri_NumGeneratedNeededU = 0, > > ri_returningList = 0x0, ri_projectReturning = 0x0, > ri_onConflictArbiterIndexes = 0x0, ri_onConflict = 0x0, > ri_matchedMergeAction = 0x0, ri_notMatchedMergeAction = 0x0, > > ri_PartitionCheckExpr = 0x0, ri_ChildToRootMap = 0x0, > ri_ChildToRootMapValid = false, ri_RootToChildMap = 0x0, > ri_RootToChildMapValid = false, ri_RootResultRelInfo = 0x0, > > ri_PartitionTupleSlot = 0x0, ri_CopyMultiInsertBuffer = 0x0, > ri_ancestorResultRels = 0x0} > > > > Since relinfo->ri_RootResultRelInfo = 0x0, the rti will have no value > and Postgres will interpret that the ResultRelInfo must've been created > only for filtering triggers and the relation is not being inserted into. 
> > The relinfo variable is created with the create_entity_result_rel_info() > function: > > > > ResultRelInfo *create_entity_result_rel_info(EState *estate, char > *graph_name, > > char *label_name) > > { > > RangeVar *rv; > > Relation label_relation; > > ResultRelInfo *resultRelInfo; > > > > ParseState *pstate = make_parsestate(NULL); > > > > resultRelInfo = palloc(sizeof(ResultRelInfo)); > > > > if (strlen(label_name) == 0) > > { > > rv = makeRangeVar(graph_name, AG_DEFAULT_LABEL_VERTEX, -1); > > } > > else > > { > > rv = makeRangeVar(graph_name, label_name, -1); > > } > > > > label_relation = parserOpenTable(pstate, rv, RowExclusiveLock); > > > > // initialize the resultRelInfo > > InitResultRelInfo(resultRelInfo, label_relation, > > list_length(estate->es_range_table), NULL, > > estate->es_instrument); > > > > // open the parse state > > ExecOpenIndices(resultRelInfo, false); > > > > free_parsestate(pstate); > > > > return resultRelInfo; > > } > > > > In this case, how can we get the relinfo->ri_RootResultRelInfo to store > the appropriate data? > > Your function doesn't seem to have access to the ModifyTableState > node, so setting ri_RootResultRelInfo to the correct ResultRelInfo > node does not seem doable. > > As I suggested in my previous reply, please check if passing 0 (not > list_length(estate->es_range_table)) for the 3rd argument > InitResultRelInfo() fixes the problem and gives the correct result. > > -- > Thanks, Amit Langote > EDB: http://www.enterprisedb.com >
Re: Let's make PostgreSQL multi-threaded
On Thu, 15 Jun 2023 at 11:07, Hannu Krosing wrote: > > One more unexpected benefit of having shared caches would be easing > access to other databases. > > If the system caches are there for all databases anyway, then it > becomes much easier to make queries using objects from multiple > databases. We have several optimizations in our visibility code that allow us to remove dead tuples from this database when another database still has a connection that has an old snapshot in which the deleting transaction of this database has not yet committed. This is allowed because we can say with confidence that other database's connections will never be able to see this database's tables. If we were to allow cross-database data access, that would require cross-database snapshot visibility checks, and that would severely hinder these optimizations. As an example, it would increase the work we need to do for snapshots: For the snapshot data of tables that aren't shared catalogs, we only need to consider our own database's backends for visibility. With cross-database visibility, we would need to consider all active backends for all snapshots, and this can be significantly more work. Kind regards, Matthias van de Meent Neon (https://neon.tech/)
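The per-database visibility optimization described above can be sketched with a toy model. All names here are illustrative, not PostgreSQL's real PGPROC/snapshot structures: the point is only that restricting the oldest-xmin computation to backends of the same database lets dead tuples be removed even while another database holds an older snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of the backend array. */
typedef struct MiniBackend
{
	uint32_t	dboid;			/* database this backend is connected to */
	uint32_t	xmin;			/* oldest transaction it might still see */
} MiniBackend;

/*
 * Oldest xmin a snapshot must respect.  With per-database visibility
 * (the status quo), only backends connected to our database count; with
 * cross-database access, every backend would count, holding back
 * dead-tuple removal.
 */
static uint32_t
oldest_xmin(const MiniBackend *backends, size_t n,
			uint32_t our_db, bool cross_db_visibility)
{
	uint32_t	result = UINT32_MAX;

	for (size_t i = 0; i < n; i++)
	{
		if (!cross_db_visibility && backends[i].dboid != our_db)
			continue;			/* other database: cannot see our tuples */
		if (backends[i].xmin < result)
			result = backends[i].xmin;
	}
	return result;
}

static uint32_t
demo_oldest_xmin(bool cross_db_visibility)
{
	static const MiniBackend backends[] = {
		{1, 100},				/* our database */
		{2, 50},				/* other database, older snapshot */
		{1, 200},
	};

	return oldest_xmin(backends, 3, 1, cross_db_visibility);
}
```

With cross-database visibility the older snapshot in database 2 would drag the horizon from 100 back to 50, which is exactly the extra work the message warns about.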
Re: New PostgreSQL Contributors
On Fri, Jul 28, 2023 at 07:19:21PM +0200, Vik Fearing wrote: > On 7/28/23 17:29, Christoph Berg wrote: > >> New PostgreSQL Contributors: >> >>Ajin Cherian >>Alexander Kukushkin >>Alexander Lakhin >>Dmitry Dolgov >>Hou Zhijie >>Ilya Kosmodemiansky >>Melanie Plageman >>Michael Banck >>Michael Brewer >>Paul Jungwirth >>Peter Smith >>Vigneshwaran C >> >> New PostgreSQL Major Contributors: >> >>Stacey Haysler >>Steve Singer > > Congratulations to all, and well deserved! +1, congrats! -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: add timing information to pg_upgrade
On Fri, Jul 28, 2023 at 01:10:14PM +0530, Bharath Rupireddy wrote:
> How about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and
> INSTR_TIME_GET_MILLISEC macros instead of gettimeofday and explicit
> calculations?

That seems like a good idea.

> +report_status(PG_REPORT, "ok (took %ld ms)", elapsed_ms);
>
> I think it's okay to just leave it with "ok \t %ld ms", elapsed_ms);
> without "took". FWIW, pg_regress does that way, see below:

I'm okay with simply adding the time. However, I wonder if we want to switch to seconds, minutes, hours, etc. if the step takes longer. This would add a bit of complexity, but it would improve human readability. Alternatively, we could keep the code simple and machine readable by always using milliseconds.

> Just curious, out of all the reported pg_upgrade prep_status()-es,
> which ones are taking more time?

I haven't done any testing on meaningful workloads yet, but the following show up on an empty cluster: "creating dump of database schemas", "analyzing all rows in the new cluster", "setting next transaction ID and epoch for new cluster", "restoring database schemas in the new cluster", and "sync data directory to disk". I imagine the dump, restore, and file-copying steps will stand out once you start adding data.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Re: New PostgreSQL Contributors
On 7/28/23 17:29, Christoph Berg wrote: New PostgreSQL Contributors: Ajin Cherian Alexander Kukushkin Alexander Lakhin Dmitry Dolgov Hou Zhijie Ilya Kosmodemiansky Melanie Plageman Michael Banck Michael Brewer Paul Jungwirth Peter Smith Vigneshwaran C New PostgreSQL Major Contributors: Stacey Haysler Steve Singer Congratulations to all, and well deserved! -- Vik Fearing
Re: harmonize password reuse in vacuumdb, clusterdb, and reindexdb
On Wed, Jul 19, 2023 at 10:43:11AM -0700, Nathan Bossart wrote: > On Wed, Jun 28, 2023 at 10:24:09PM -0700, Nathan Bossart wrote: >> The same commit that added this comment (ff402ae) also set the >> allow_password_reuse parameter to true in vacuumdb's connectDatabase() >> calls. I found a message from the corresponding thread that provides some >> additional detail [0]. I wonder if this comment should instead recommend >> against using the allow_password_reuse flag unless reconnecting to the same >> host/port/user target. Connecting to different databases with the same >> host/port/user information seems okay. Maybe I am missing something... > > I added Tom here since it looks like he was the original author of this > comment. Tom, do you have any concerns with updating the comment for > connectDatabase() in src/fe_utils/connect_utils.c like this? I went ahead and committed this. I'm happy to revisit if there are concerns. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
New PostgreSQL Contributors
The PostgreSQL contributors team has been looking over the community activity and, over the first half of this year, has been recognizing new contributors to be listed on https://www.postgresql.org/community/contributors/ New PostgreSQL Contributors: Ajin Cherian Alexander Kukushkin Alexander Lakhin Dmitry Dolgov Hou Zhijie Ilya Kosmodemiansky Melanie Plageman Michael Banck Michael Brewer Paul Jungwirth Peter Smith Vigneshwaran C New PostgreSQL Major Contributors: Julien Rouhaud Stacey Haysler Steve Singer Thank you all for contributing to PostgreSQL to make it such a great project! For the contributors team, Christoph
Re: Synchronizing slots from primary to standby
On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila wrote: > > I wonder if we anyway need some sort of design like this because we > shouldn't allow to spawn as many workers as the number of databases. > There has to be some existing or new GUC like max_sync_slot_workers > which decides the number of workers. It seems reasonable to not have one slot sync worker for each database. IMV, the slot sync workers must be generic and independently manageable - generic in the sense that, given a database and primary conninfo, each worker must sync all the slots related to the given database; independently manageable in the sense that there is a separate GUC for the number of sync workers, each launchable dynamically by the logical replication launcher. The work division amongst the sync workers can be simple: the logical replication launcher builds a shared memory structure based on the number of slots to sync and starts the sync workers dynamically, and each sync worker picks {dboid, slot name, conninfo} from the shared memory, syncs it and proceeds with other slots. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
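The work division proposed above could look roughly like this toy sketch. All names are hypothetical; a real implementation would place the array in shared memory and serialize claims with a lock or atomics, while this sketch is single-threaded.

```c
#include <assert.h>

/* Hypothetical entry built by the logical replication launcher. */
typedef struct SlotSyncEntry
{
	unsigned	dboid;			/* database the slot belongs to */
	char		slot_name[64];
	int			claimed_by;		/* worker id, or -1 if unclaimed */
} SlotSyncEntry;

/*
 * Each sync worker repeatedly claims the next unclaimed entry, syncs
 * that slot, and moves on.
 */
static int
claim_next_entry(SlotSyncEntry *entries, int n, int worker_id)
{
	for (int i = 0; i < n; i++)
	{
		if (entries[i].claimed_by == -1)
		{
			entries[i].claimed_by = worker_id;
			return i;
		}
	}
	return -1;					/* nothing left to sync */
}

static int
demo_claims(void)
{
	SlotSyncEntry entries[3] = {
		{16384, "slot_a", -1},
		{16384, "slot_b", -1},
		{16385, "slot_c", -1},
	};

	/* two workers interleave; each entry is handed out exactly once */
	if (claim_next_entry(entries, 3, 0) != 0)
		return 0;
	if (claim_next_entry(entries, 3, 1) != 1)
		return 0;
	if (claim_next_entry(entries, 3, 0) != 2)
		return 0;
	if (claim_next_entry(entries, 3, 1) != -1)
		return 0;
	return 1;
}
```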
Eager page freeze criteria clarification
Hi,

While hacking on the pruning and page-level freezing code, I became a bit confused by the test guarding eager freezing [1]:

	/*
	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present. Also
	 * freeze when pruning generated an FPI, if doing so means that we set the
	 * page all-frozen afterwards (might not happen until final heap pass).
	 */
	if (pagefrz.freeze_required || tuples_frozen == 0 ||
		(prunestate->all_visible && prunestate->all_frozen &&
		 fpi_before != pgWalUsage.wal_fpi))
	{

I'm trying to understand the condition fpi_before != pgWalUsage.wal_fpi -- eager freeze only if pruning emitted an FPI. Is this test meant to guard against unnecessary freezing or to avoid freezing when the cost is too high? That is, are we trying to determine how likely it is that the page has been recently modified and avoid eager freezing when it would be pointless (because the page will soon be modified again)? Or are we trying to determine how likely the freeze record is to emit an FPI and avoid eager freezing when it isn't worth the cost? Or something else?

The commit message says:

> Also teach VACUUM to trigger page-level freezing whenever it detects
> that heap pruning generated an FPI. We'll have already written a large
> amount of WAL just to do that much, so it's very likely a good idea to
> get freezing out of the way for the page early.

And I found the thread where it was discussed [2]. Several possible explanations are mentioned in the thread. But the final rationale is still not clear to me. Could we add a comment above the if condition specifying both:
a) what the test is a proxy for
b) the intended outcome (when do we expect to eager freeze)
And perhaps we could even describe a scenario where this heuristic is effective? 
- Melanie [1] https://github.com/postgres/postgres/blob/master/src/backend/access/heap/vacuumlazy.c#L1802 [2] https://www.postgresql.org/message-id/flat/CAH2-Wzm_%3DmrWO%2BkUAJbR_gM_6RzpwVA8n8e4nh3dJGHdw_urew%40mail.gmail.com#c5aae6ea65e07de92a24a35445473448
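For reference, the condition quoted in the message above boils down to the following boolean shape. This is a self-contained paraphrase for illustration, not the actual vacuumlazy.c code; in the real code, "pruning emitted an FPI" is detected by comparing pgWalUsage.wal_fpi before and after pruning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Paraphrase of the eager-freeze test discussed in the thread. */
static bool
should_freeze_page(bool freeze_required,
				   int tuples_frozen,
				   bool all_visible,
				   bool all_frozen,
				   uint64_t fpi_before,
				   uint64_t fpi_after_prune)
{
	bool		prune_emitted_fpi = (fpi_after_prune != fpi_before);

	return freeze_required ||
		tuples_frozen == 0 ||	/* nothing to freeze: trivially cheap */
		(all_visible && all_frozen && prune_emitted_fpi);
}
```

So eager freezing is triggered only when the page would come out all-visible and all-frozen *and* pruning already cost an FPI -- the question in the message is which of the possible rationales that last conjunct is really standing in for.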
Re: Synchronizing slots from primary to standby
On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila wrote: > > > 2. All candidate standbys will start one slot sync worker per logical > > slot which might not be scalable. > > Yeah, that doesn't sound like a good idea but IIRC, the proposed patch > is using one worker per database (for all slots corresponding to a > database). Right. It's based on one worker for each database. > > Is having one (or a few more - not > > necessarily one for each logical slot) worker for all logical slots > > enough? > > I guess for a large number of slots there is a possibility of a large > gap in syncing the slots which probably means we need to retain > corresponding WAL for a much longer time on the primary. If we can > prove that the gap won't be large enough to matter then this would be > probably worth considering otherwise, I think we should find a way to > scale the number of workers to avoid the large gap. I think the gap is largely determined by the time taken to advance each slot and the amount of WAL that each logical slot moves ahead on primary. I've measured the time it takes for pg_logical_replication_slot_advance with different amounts of WAL on my system. It took 2595ms/5091ms/31238ms to advance the slot by 3.7GB/7.3GB/13GB respectively. To put things into perspective here, imagine there are 3 logical slots to sync for a single slot sync worker and each of them is in need of advancing the slot by 3.7GB/7.3GB/13GB of WAL. The slot sync worker gets to slot 1 again after 2595ms+5091ms+31238ms (~40sec), gets to slot 2 again after the advance time of slot 1 plus the amount of WAL that the slot has moved ahead on primary during those 40sec, gets to slot 3 again after the advance times of slot 1 and slot 2 plus the amount of WAL that the slot has moved ahead on primary, and so on. If WAL generation on the primary is pretty fast, and if the logical slot moves pretty fast on the primary, the time it takes for a single sync worker to sync a slot can increase. 
Now, let's think what happens if there's a large gap, IOW, a logical slot on standby is behind X amount of WAL from that of the logical slot on primary. The standby needs to retain more WAL for sure. IIUC, primary doesn't need to retain the WAL required for a logical slot on standby, no? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
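The "~40sec" round-trip figure above is just the sum of the three measured advance times -- a bit under 39 seconds before the worker revisits slot 1, and that is a lower bound, since more WAL accumulates meanwhile:

```c
#include <assert.h>

/* Measured pg_logical_replication_slot_advance times from the message. */
static int
full_cycle_ms(void)
{
	int			slot1 = 2595;	/* advance by 3.7 GB of WAL */
	int			slot2 = 5091;	/* advance by 7.3 GB */
	int			slot3 = 31238;	/* advance by 13 GB */

	/* time before the worker gets back to slot 1, in milliseconds */
	return slot1 + slot2 + slot3;
}
```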
Re: POC: Extension for adding distributed tracing - pg_tracing
> What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and INSTR_TIME_GET_MILLISEC > macros for timing calculations? If you're talking of the two instances where I'm modifying the instr_time's ticks, it's because I can't use the macros there. The first case is for the parse span. I only have the start timestamp using GetCurrentStatementStartTimestamp and don't have access to the start instr_time so I need to build the duration from 2 timestamps. The second case is when building node spans from the planstate. I directly have the duration from Instrumentation. I guess one fix would be to use an int64 for the span duration to directly store nanoseconds instead of an instr_time but I do use the instrumentation macros outside of those two cases to get the duration of other spans. > Also, have you thought about a way to trace existing (running) queries without directly instrumenting them? That's a good point. I was focusing on leaving the sampling decision to the caller through the sampled flag and only recently added the pg_tracing_sample_rate parameter to give more control. It should be straightforward to add an option to create standalone traces based on sample rate alone. This way, setting the sample rate to 1 would force the queries running in the session to be traced. On Fri, Jul 28, 2023 at 3:02 PM Nikita Malakhov wrote: > Hi! > > What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT > and INSTR_TIME_GET_MILLISEC > macros for timing calculations? > > Also, have you thought about a way to trace existing (running) queries > without directly instrumenting them? It would > be much more usable for maintenance and support personnel, because in > production environments you rarely could > change query text directly. For the current state the most simple solution > is switch tracing on and off by calling SQL > function, and possibly switch tracing for prepared statement the same way, > but not for any random query. 
> > I'll check the patch for the race conditions. > > -- > Regards, > Nikita Malakhov > Postgres Professional > The Russian Postgres Company > https://postgrespro.ru/ >
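The parse-span case described in the reply above (only a start timestamp from GetCurrentStatementStartTimestamp, no start instr_time) amounts to deriving a duration from two timestamps. A sketch of the int64-nanoseconds idea mentioned there -- illustrative only, not the extension's actual code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * TimestampTz values are microseconds since the PostgreSQL epoch, so a
 * duration derived from two timestamps has microsecond resolution,
 * while instr_time deltas can be read out in nanoseconds.  Storing span
 * durations as int64 nanoseconds lets both sources feed the same field.
 */
static int64_t
duration_ns_from_timestamps(int64_t start_us, int64_t end_us)
{
	return (end_us - start_us) * 1000;
}
```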
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
Hi Richard, On Fri, Jul 28, 2023 at 2:03 PM Richard Guo wrote: > > > It doesn't seem too expensive to translate SpecialJoinInfos, so I think > it's OK to construct and free the SpecialJoinInfo for a child join on > the fly. So the approach in 0002 looks reasonable to me. But there is > something that needs to be fixed in 0002. Thanks for the review. I am fine if going ahead with 0002 is enough. > > + bms_free(child_sjinfo->commute_above_l); > + bms_free(child_sjinfo->commute_above_r); > + bms_free(child_sjinfo->commute_below_l); > + bms_free(child_sjinfo->commute_below_r); > > These four members in SpecialJoinInfo only contain outer join relids. > They do not need to be translated. So they would reference the same > memory areas in child_sjinfo as in parent_sjinfo. We should not free > them, otherwise they would become dangling pointers in parent sjinfo. > Good catch. Saved my time. I would have caught this the rather hard way when running regression. Thanks a lot. I think we should 1. add an assert to make sure that commute_above_* do not require any translations i.e. they do not contain any parent relids of the child rels. 2. Do not free those. 3. Add a comment about keeping the build and free functions in sync. I will work on those next. -- Best Wishes, Ashutosh Bapat
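The hazard Richard points out -- freeing a field that is shared with the parent rather than translated -- can be modeled in miniature. These are toy structs, not the real SpecialJoinInfo/Bitmapset:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a Bitmapset: a single heap-allocated int. */
typedef int *ToySet;

/*
 * Toy stand-in for SpecialJoinInfo: one field that must be translated
 * (deep-copied) per child join, and one field that -- like
 * commute_above_l, which holds only outer-join relids -- needs no
 * translation and is simply shared with the parent.
 */
typedef struct ToySJInfo
{
	ToySet		min_lefthand;		/* translated per child */
	ToySet		commute_above_l;	/* shared with the parent */
} ToySJInfo;

static ToySet
toy_set(int v)
{
	ToySet		s = malloc(sizeof(int));

	*s = v;
	return s;
}

/* Build a child copy on the fly: deep-copy the translated field only. */
static ToySJInfo
build_child_sjinfo(const ToySJInfo *parent, int child_offset)
{
	ToySJInfo	child;

	child.min_lefthand = toy_set(*parent->min_lefthand + child_offset);
	child.commute_above_l = parent->commute_above_l;	/* no copy */
	return child;
}

/*
 * Free only what build_child_sjinfo() allocated; freeing
 * commute_above_l here would leave a dangling pointer in the parent.
 */
static void
free_child_sjinfo(ToySJInfo *child)
{
	free(child->min_lefthand);
}

static int
demo_child_sjinfo(void)
{
	ToySJInfo	parent = {toy_set(1), toy_set(42)};
	ToySJInfo	child = build_child_sjinfo(&parent, 10);
	int			ok = (*child.min_lefthand == 11);

	free_child_sjinfo(&child);
	ok = ok && (*parent.commute_above_l == 42);	/* parent still valid */
	free(parent.min_lefthand);
	free(parent.commute_above_l);
	return ok;
}
```

This is also why keeping the build and free functions in sync matters: each function must agree on exactly which fields were deep-copied.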
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi! What do you think about using INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and INSTR_TIME_GET_MILLISEC macros for timing calculations? Also, have you thought about a way to trace existing (running) queries without directly instrumenting them? It would be much more usable for maintenance and support personnel, because in production environments you rarely could change query text directly. For the current state the most simple solution is switch tracing on and off by calling SQL function, and possibly switch tracing for prepared statement the same way, but not for any random query. I'll check the patch for the race conditions. -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: Postgres v15 windows bincheck regression test failures
28.07.2023 14:42, Noah Misch wrote: That was about a bug that appears when using TCP sockets. ... Yes, and according to the failed test output, TCP sockets were used: # 'pg_amcheck: error: connection to server at "127.0.0.1", port 49393 failed: server closed the connection unexpectedly # This probably means the server terminated abnormally # before or while processing the request. Best regards, Alexander
Re: logical decoding and replication of sequences, take 2
On 7/28/23 14:35, Ashutosh Bapat wrote: > On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra > wrote: >> >> On 7/24/23 14:57, Ashutosh Bapat wrote: >>> ... >>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global. I was thinking maybe we should have it in the transaction (because we need to do cleanup at the end). It seem a bit inconvenient, because then we'd need to either search htabs in all subxacts, or transfer the entries to the top-level xact (otoh, we already do that with snapshots), and cleanup on abort. What do you think? >>> >>> Hash table per transaction seems saner design. Adding it to the top >>> level transaction should be fine. The entry will contain an XID >>> anyway. If we add it to every subtransaction we will need to search >>> hash table in each of the subtransactions when deciding whether a >>> sequence change is transactional or not. Top transaction is a >>> reasonable trade off. >>> >> >> It's not clear to me what design you're proposing, exactly. >> >> If we track it in top-level transactions, then we'd need copy the data >> whenever a transaction is assigned as a child, and perhaps also remove >> it when there's a subxact abort. > > I thought, esp. with your changes to assign xid, we will always know > the top level transaction when a sequence is assigned a relfilenode. > So the refilenodes will always get added to the correct hash directly. > I didn't imagine a case where we will need to copy the hash table from > sub-transaction to top transaction. If that's true, yes it's > inconvenient. > Well, it's a matter of efficiency. To check if a sequence change is transactional, we need to check if it's for a relfilenode created in the current transaction (it can't be for relfilenode created in a concurrent top-level transaction, due to MVCC). If you don't copy the entries into the top-level xact, you have to walk all subxacts and search all of those, for each sequence change. 
And there may be quite a few of both subxacts and sequence changes ... I wonder if we need to search the other top-level xacts, but we probably need to do that. Because it might be a subxact without an assignment, or something like that. > As to the abort, don't we already remove entries on subtxn abort? > Having per transaction hash table doesn't seem to change anything > much. > What entries are we removing? My point is that if we copy the entries to the top-level xact, we probably need to remove them on abort. Or we could leave them in the top-level xact hash. >> >> And we'd need to still search the hashes in all toplevel transactions on >> every sequence increment - in principle we can't have increment for a >> sequence created in another in-progress transaction, but maybe it's just >> not assigned yet. > > We hold a strong lock on sequence when changing its relfilenode. The > sequence whose relfilenode is being changed can not be accessed by any > concurrent transaction. So I am not able to understand what you are > trying to say. > How do you know the subxact has already been recognized as such? It may be treated as top-level transaction for a while, until the assignment. > I think per (top level) transaction hash table is cleaner design. It > puts the hash table where it should be. But if that makes code > difficult, current design works too. > regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
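The trade-off being discussed -- keeping the created-relfilenode entries in the top-level transaction so the transactional check is a single lookup instead of a walk over all subtransactions -- can be sketched with a toy model (illustrative names only, and a plain array standing in for the hash table):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model: the top-level transaction records every sequence
 * relfilenode created in it (by itself or by any subtransaction).  A
 * sequence change is transactional iff its relfilenode is found here.
 */
#define MAX_RELFILENODES 8

typedef struct ToyTopXact
{
	uint32_t	xid;
	uint32_t	new_relfilenodes[MAX_RELFILENODES];
	int			nfilenodes;
} ToyTopXact;

static void
record_new_relfilenode(ToyTopXact *xact, uint32_t relfilenode)
{
	xact->new_relfilenodes[xact->nfilenodes++] = relfilenode;
}

static bool
sequence_change_is_transactional(const ToyTopXact *xact,
								 uint32_t relfilenode)
{
	/* single lookup: no walk over every subtransaction needed */
	for (int i = 0; i < xact->nfilenodes; i++)
	{
		if (xact->new_relfilenodes[i] == relfilenode)
			return true;
	}
	return false;
}

static int
demo_transactional_check(void)
{
	ToyTopXact	xact = {700, {0}, 0};

	record_new_relfilenode(&xact, 5001);	/* e.g. created in a subxact */
	return sequence_change_is_transactional(&xact, 5001) &&
		!sequence_change_is_transactional(&xact, 5002);
}
```

The open questions in the thread are about who pays the bookkeeping cost: entries copied up on assignment (and cleaned on subxact abort) make this lookup cheap, while per-subxact tables make each lookup walk many structures.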
Re: logical decoding and replication of sequences, take 2
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra wrote: > > Anyway, I was thinking about this a bit more, and it seems it's not as > difficult to use the page LSN to ensure sequences don't go backwards. > The 0005 change does that, by: > > 1) adding pg_sequence_state, that returns both the sequence state and >the page LSN > > 2) copy_sequence returns the page LSN > > 3) tablesync then sets this LSN as origin_startpos (which for tables is >just the LSN of the replication slot) > > AFAICS this makes it work - we start decoding at the page LSN, so that > we skip the increments that could lead to the sequence going backwards. > I like this design very much. It makes things simpler than complex. Thanks for doing this. I am wondering whether we could reuse pg_sequence_last_value() instead of adding a new function. But the name of the function doesn't leave much space for expanding its functionality. So we are good with a new one. Probably some code deduplication. -- Best Wishes, Ashutosh Bapat
Re: logical decoding and replication of sequences, take 2
On 7/28/23 11:42, Amit Kapila wrote: > On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra > wrote: >> >> On 7/26/23 09:27, Amit Kapila wrote: >>> On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila wrote: >> >> Anyway, I was thinking about this a bit more, and it seems it's not as >> difficult to use the page LSN to ensure sequences don't go backwards. >> > > While studying the changes for this proposal and related areas, I have > a few comments: > 1. I think you need to advance the origin if it is changed due to > copy_sequence(), otherwise, if the sync worker restarts after > SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN > value. > True, we want to restart at the new origin_startpos. > 2. Between the time of SYNCDONE and READY state, the patch can skip > applying non-transactional sequence changes even if it should apply > it. The reason is that during that state change > should_apply_changes_for_rel() decides whether to apply change based > on the value of remote_final_lsn which won't be set for > non-transactional change. I think we need to send the start LSN of a > non-transactional record and then use that as remote_final_lsn for > such a change. Good catch. remote_final_lsn is set in apply_handle_begin, but that won't happen for sequences. We're already sending the LSN, but logicalrep_read_sequence ignores it - it should be enough to add it to LogicalRepSequence and then set it in apply_handle_sequence(). > > 3. For non-transactional sequence change apply, we don't set > replorigin_session_origin_lsn/replorigin_session_origin_timestamp as > we are doing in apply_handle_commit_internal() before calling > CommitTransactionCommand(). So, that can lead to the origin moving > backwards after restart which will lead to requesting and applying the > same changes again and for that period of time sequence can go > backwards. This needs some more thought as to what is the correct > behaviour/solution for this. 
> I think saying "origin moves backwards" is a bit misleading. AFAICS the origin position is not actually moving backwards, it's more that we don't (and can't) move it forwards for each non-transactional change. So yeah, we may re-apply those, and IMHO that's expected - the sequence is allowed to be "ahead" on the subscriber. I don't see a way to improve this, except maybe having a separate LSN for non-transactional changes (for each origin). > 4. BTW, while checking this behaviour, I noticed that the initial sync > worker for sequence mentions the table in the LOG message: "LOG: > logical replication table synchronization worker for subscription > "mysub", table "s" has finished". Won't it be better here to refer to > it as a sequence? > Thanks, I'll fix that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
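The behavior being debated -- re-applied non-transactional changes converge to the right value, but the sequence can transiently go backwards while the replay happens -- can be illustrated with a toy replay of setval-style (absolute-value) changes. This is illustrative only, not the patch's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy replay: each change overwrites the sequence value, so replaying a
 * stream twice converges to the same final state -- but while the
 * replayed (older) changes are applied, the observed value transiently
 * drops below where it already was.
 */
static int64_t
replay_changes(int64_t start, const int64_t *changes, size_t n,
			   int64_t *lowest_seen)
{
	int64_t		value = start;

	*lowest_seen = start;
	for (size_t i = 0; i < n; i++)
	{
		value = changes[i];		/* setval-style: overwrite */
		if (value < *lowest_seen)
			*lowest_seen = value;
	}
	return value;
}

static int
demo_replay(void)
{
	const int64_t stream[] = {100, 105};
	int64_t		low;
	int64_t		end;

	/* first apply, before the restart */
	end = replay_changes(0, stream, 2, &low);
	/* after restart, the same changes are requested and applied again */
	end = replay_changes(end, stream, 2, &low);

	/* converged to 105, but the value dipped back to 100 meanwhile */
	return end == 105 && low == 100;
}
```

This matches both positions in the exchange: the end state is correct (the subscriber is merely "ahead" until replay finishes), yet an observer during replay sees the sequence go backwards for a period of time.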
Re: Row pattern recognition
On 7/28/23 13:02, Tatsuo Ishii wrote:

Attached is the v3 patch. In this patch the following changes are made.

- PATTERN variables do not have to exist in the DEFINE clause. They are considered TRUE if not present.

Do you think we really need this? I found a criticism regarding this.

https://link.springer.com/article/10.1007/s13222-022-00404-3
"3.2 Explicit Definition of All Row Pattern Variables"

What do you think?

I think that a large part of obeying the standard is to allow queries from other engines to run the same on ours. The standard does not require the pattern variables to be defined and so there are certainly queries out there without them, and that hurts migrating to PostgreSQL.

- I am working on making window aggregates RPR aware now. The implementation is in progress and far from complete. An example is below. I think row 2, 3, 4 of the "count" column should be NULL instead of 3, 2, 0, 0. The same can be said of other rows. Probably this is an effect of moving aggregates, but I am still studying the window aggregation code.

This tells me again that RPR is not being run in the right place. All windowed aggregates and frame-level window functions should Just Work with no modification.

I am not touching each aggregate function. I am modifying eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate function. Do you think it's not the right place to make window aggregates RPR aware?

Oh, okay. 
SELECT company, tdate, first_value(price) OVER w, count(*) OVER w
  FROM stock
 WINDOW w AS (
   PARTITION BY company
   ORDER BY tdate
   ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
   AFTER MATCH SKIP PAST LAST ROW
   INITIAL
   PATTERN (START UP+ DOWN+)
   DEFINE
     START AS TRUE,
     UP AS price > PREV(price),
     DOWN AS price < PREV(price)
 );

  company  |   tdate    | first_value | count
-----------+------------+-------------+-------
 company1  | 2023-07-01 |         100 |     4
 company1  | 2023-07-02 |             |     3
 company1  | 2023-07-03 |             |     2
 company1  | 2023-07-04 |             |     0
 company1  | 2023-07-05 |             |     0
 company1  | 2023-07-06 |          90 |     4
 company1  | 2023-07-07 |             |     3
 company1  | 2023-07-08 |             |     2
 company1  | 2023-07-09 |             |     0
 company1  | 2023-07-10 |             |     0
 company2  | 2023-07-01 |          50 |     4
 company2  | 2023-07-02 |             |     3
 company2  | 2023-07-03 |             |     2
 company2  | 2023-07-04 |             |     0
 company2  | 2023-07-05 |             |     0
 company2  | 2023-07-06 |          60 |     4
 company2  | 2023-07-07 |             |     3
 company2  | 2023-07-08 |             |     2
 company2  | 2023-07-09 |             |     0
 company2  | 2023-07-10 |             |     0

In this scenario, row 1's frame is the first 5 rows and specified SKIP PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are skipped) and the result of the outer count should be 0 for all of them because there are no rows in the frame.

Ok. I just want to make sure: if it's other aggregates like sum or avg, the result of the outer aggregates should be NULL.

They all behave the same way as in a normal query when they receive no rows as input.

When we get to adding count in the MEASURES clause, there will be a difference between no match and empty match, but that does not apply here.

Can you elaborate more? I understand that "no match" and "empty match" are different things. But I do not understand how the difference affects the result of count. 
This query:

SELECT v.a, wcnt OVER w, count(*) OVER w
FROM (VALUES ('A')) AS v (a)
WINDOW w AS (
  ORDER BY v.a
  MEASURES count(*) AS wcnt
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  PATTERN (B)
  DEFINE B AS B.a = 'B'
)

produces this result:

 a | wcnt | count
---+------+-------
 A |      |     0
(1 row)

Inside the window specification, *no match* was found and so all of the
MEASURES are null. The count(*) in the target list however, still
exists and operates over zero rows.

This very similar query:

SELECT v.a, wcnt OVER w, count(*) OVER w
FROM (VALUES ('A')) AS v (a)
WINDOW w AS (
  ORDER BY v.a
  MEASURES count(*) AS wcnt
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  PATTERN (B?)
  DEFINE B AS B.a = 'B'
)

produces this result:

 a | wcnt | count
---+------+-------
 A |    0 |     0
(1 row)

In this case, the pattern is B? instead of just B, which produces an
*empty match* for the MEASURES to be applied over.
--
Vik Fearing
Re: logical decoding and replication of sequences, take 2
On Tue, Jul 25, 2023 at 10:02 PM Tomas Vondra wrote:
>
> On 7/24/23 14:57, Ashutosh Bapat wrote:
> > ...
> >
> >>
> >> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
> >> I was thinking maybe we should have it in the transaction (because we
> >> need to do cleanup at the end). It seem a bit inconvenient, because then
> >> we'd need to either search htabs in all subxacts, or transfer the
> >> entries to the top-level xact (otoh, we already do that with snapshots),
> >> and cleanup on abort.
> >>
> >> What do you think?
> >
> > Hash table per transaction seems saner design. Adding it to the top
> > level transaction should be fine. The entry will contain an XID
> > anyway. If we add it to every subtransaction we will need to search
> > hash table in each of the subtransactions when deciding whether a
> > sequence change is transactional or not. Top transaction is a
> > reasonable trade off.
> >
>
> It's not clear to me what design you're proposing, exactly.
>
> If we track it in top-level transactions, then we'd need copy the data
> whenever a transaction is assigned as a child, and perhaps also remove
> it when there's a subxact abort.

I thought, especially with your changes to assign the xid, we will
always know the top-level transaction when a sequence is assigned a
relfilenode, so the relfilenodes will always get added to the correct
hash directly. I didn't imagine a case where we will need to copy the
hash table from sub-transaction to top transaction. If that's true, yes
it's inconvenient. As to the abort, don't we already remove entries on
subtxn abort? Having a per-transaction hash table doesn't seem to
change anything much.

> And we'd need to still search the hashes in all toplevel transactions on
> every sequence increment - in principle we can't have increment for a
> sequence created in another in-progress transaction, but maybe it's just
> not assigned yet.

We hold a strong lock on the sequence when changing its relfilenode.
The sequence whose relfilenode is being changed cannot be accessed by
any concurrent transaction, so I am not able to understand what you are
trying to say.

I think a per (top level) transaction hash table is a cleaner design.
It puts the hash table where it should be. But if that makes the code
difficult, the current design works too.

--
Best Wishes,
Ashutosh Bapat
Re: [PoC] pg_upgrade: allow to upgrade publisher node
On Fri, 21 Jul 2023 at 13:00, Hayato Kuroda (Fujitsu) wrote:
>
> Dear hackers,
>
> > Based on the above, we are considering that we delay the timing of
> > shutdown for logical walsenders. The preliminary workflow is:
> >
> > 1. When a logical walsender receives the signal from the checkpointer,
> >    it consumes all of the WAL records, changes its state into
> >    WALSNDSTATE_STOPPING, and stops doing anything.
> > 2. Then the checkpointer does the shutdown checkpoint.
> > 3. After that the postmaster sends a signal to walsenders, same as the
> >    current implementation.
> > 4. Finally logical walsenders process the shutdown checkpoint record
> >    and update the confirmed_lsn after the acknowledgement from the
> >    subscriber. Note that logical walsenders don't have to send the
> >    shutdown checkpoint record to the subscriber, but the following
> >    keep_alive will help us to increment the confirmed_lsn.
> > 5. All tasks are done, they exit.
> >
> > This mechanism ensures that the confirmed_lsn of active slots is the
> > same as the current WAL location of the old publisher, so that the
> > 0003 patch would become simpler. We would not have to calculate the
> > acceptable difference anymore.
> >
> > One thing we must consider is that no WAL must be generated while
> > decoding the shutdown checkpoint record; that causes a PANIC. IIUC
> > the record leads to SnapBuildSerializationPoint(), which just
> > serializes snapbuild or restores from it, so the change may be
> > acceptable. Thoughts?
>
> I've implemented the ideas from my previous proposal, PSA another
> patch set. Patch 0001 introduces the state WALSNDSTATE_STOPPING to
> logical walsenders. The workflow remains largely the same as described
> in my previous post, with the following additions:
>
> * A flag has been added to track whether all the WALs have been
>   flushed. The logical walsender can only exit after the flag is set.
>   This ensures that all WALs are flushed before the termination of the
>   walsender.
> * Cumulative statistics are now forcibly written before changing the
>   state. While the previous approach involved reporting stats upon
>   process exit, the current approach must report earlier due to the
>   checkpointer's termination timing. See comments in
>   CheckpointerMain() and atop pgstat_before_server_shutdown().
> * At the end of the process, slots are now saved to disk.
>
> Patch 0002 adds the --include-logical-replication-slots option to
> pg_upgrade, not changed from the previous set.
>
> Patch 0003 adds a check function, which becomes simpler.
> The previous version calculated the "acceptable" difference between
> confirmed_lsn and the current WAL position. This was necessary because
> shutdown records could not be sent to subscribers, creating a
> disparity in these values. However, this approach had drawbacks, such
> as needing adjustments if record sizes changed.
>
> Now, the record can be sent to subscribers, so the hack is not needed
> anymore, at least in the context of logical replication. The
> consistency is now maintained by the logical walsenders, so slots
> created by the backend could not be. We must consider what should
> be...
>
> What do you think?

Here is a patch which checks that there are no WAL records other than
the CHECKPOINT_SHUTDOWN record to be consumed, based on the discussion
from [1]. Patches 0001 and 0002 are the same as those posted by
Kuroda-san. Patch 0003 exposes pg_get_wal_records_content to get the
WAL records along with the WAL record type between a start and end LSN.
The pg_walinspect contrib module already exposes a function for this
requirement; I have moved this functionality to be exposed from the
backend. Patch 0004 has a slight change in the check function to verify
that there are no records other than CHECKPOINT_SHUTDOWN to be
consumed. The attached patch has the changes for the same. Thoughts?
[1] - https://www.postgresql.org/message-id/CAA4eK1Kem-J5NM7GJCgyKP84pEN6RsG6JWo%3D6pSn1E%2BiexL1Fw%40mail.gmail.com

Regards,
Vignesh

From 49360788fea01756aa9578da3dad34786bd1281a Mon Sep 17 00:00:00 2001
From: Hayato Kuroda
Date: Fri, 14 Apr 2023 08:59:03 +
Subject: [PATCH 4/4] pg_upgrade: Add check function for
 --include-logical-replication-slots option

XXX: Actually, this commit disallows supporting slots which are created
by user backends. In the checking function we ensure that all the
active slots have a confirmed_flush_lsn which is the same as the
current WAL position, and for such slots they would not be the same.
For slots which are used by logical replication, logical walsenders
guarantee that at shutdown. Individual slots, however, cannot be
handled by walsenders, so their confirmed_flush_lsn stays behind the
shutdown checkpoint record.

Author: Hayato Kuroda
Reviewed-by: Wang Wei, Vignesh C
---
 src/bin/pg_upgrade/check.c                    |  59
 .../t/003_logical_replication_slots.pl        | 134 +++---
 2 files changed, 140 insertions(+), 53 deletions(-)

diff --git
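As a rough illustration of what the 0004 check amounts to, here is how one could eyeball the same condition with pg_walinspect (sketch only: the start LSN below is a placeholder for the slot's confirmed_flush_lsn, and the column names follow the PG 15 pg_walinspect interface, which the patch moves into the backend):

```sql
-- Placeholder LSN: substitute the slot's confirmed_flush_lsn.
SELECT count(*) AS unexpected_records
FROM pg_get_wal_records_info('0/1548918', pg_current_wal_insert_lsn())
WHERE description NOT LIKE 'CHECKPOINT_SHUTDOWN%';
-- The old cluster's slot is upgradable only when this returns 0,
-- i.e. nothing but the shutdown checkpoint remains to be consumed
-- after the slot's confirmed_flush_lsn.
```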
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi,

> I'd keep Autotools build up to date

Definitely. The patch is still in a rough state and updating Autotools
fell through the cracks.

> I'm currently playing with the patch and
> reviewing sources, if you need any cooperation - please let us know.

The goal for me was to get validation on the design and to check if
there was anything that would be a blocker before committing more time
on the project. There are already several things I was planning on
improving:

- The parse span is something I've added recently. I've realised that I
  could rely on the stmtStartTimestamp to get the time spent in parsing
  in the post parse hook, so the code around this is still rough.
- There are probably race conditions on the reads and writes of the
  query file that need to be fixed.
- I'm using the same Span structure for everything while Executor spans
  only need a small subset, which is a bit wasteful. I've tried to use
  two different shared buffers (base spans and spans with counters) but
  it was making things more confusing.
- Finish commenting code and start writing the doc.

So any comments, missing features and reviews on the current state of
the project are welcome.

Regards,
Anthonin

On Fri, Jul 28, 2023 at 9:35 AM Nikita Malakhov wrote:
> Hi,
>
> I'd keep Autotools build up to date, because Meson is very limited in
> terms of not very up-to-date OS versions. Anyway, it's OK now. I'm
> currently playing with the patch and reviewing sources, if you need
> any cooperation - please let us know. I'm +1 for committing this patch
> after review.
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> The Russian Postgres Company
> https://postgrespro.ru/
Re: Postgres v15 windows bincheck regression test failures
On Fri, Jul 28, 2023 at 07:00:01AM +0300, Alexander Lakhin wrote: > 28.07.2023 05:17, Noah Misch wrote: > >On Tue, Jun 20, 2023 at 07:49:52AM -0400, Russell Foster wrote: > >>/* > >>* We don't use Unix-domain sockets on Windows by default, even if the > >>* build supports them. (See comment at remove_temp() for a reason.) > >>* Override at your own risk. > >>*/ > >> > >>Is there some sort of race condition in the SSPI code that sometimes > >>doesn't gracefully finish/close the connection when the backend > >>decides to exit due to error? > >No. remove_temp() is part of test driver "pg_regress". Non-test usage is > >unaffected. Even for test usage, folks have reported no failures from the > >cause mentioned in the remove_temp() comment. > > It seems to me that it's just another manifestation of bug #16678 ([1]). > See also commits 6051857fc and 29992a6a5. > > [1] > https://www.postgresql.org/message-id/flat/16678-253e48d34dc0c376%40postgresql.org That was about a bug that appears when using TCP sockets. The remove_temp() comment involves code that doesn't run when using TCP sockets. I don't think they can be manifestations of the same phenomenon.
Re: WAL Insertion Lock Improvements
On Wed, Jul 26, 2023 at 1:27 AM Andres Freund wrote:
>
> > 0001 has been now applied. I have done more tests while looking at
> > this patch since yesterday and was surprised to see higher TPS
> > numbers on HEAD with the same tests as previously, and the patch was
> > still shining with more than 256 clients.
>
> Just a small heads up:
>
> I just rebased my aio tree over the commit and promptly, on the first
> run, saw a hang. I did some debugging on that. Unfortunately repeated
> runs haven't repeated that hang, despite quite a bit of trying.

Hm. Please share the workload details, test scripts, system info and
any special settings, so that I can try to reproduce it in my setup.

> The symptom I was seeing is that all running backends were stuck in
> LWLockWaitForVar(), even though the value they're waiting for had
> changed. Which obviously "shouldn't be possible".

Were the backends stuck there indefinitely? IOW, did they get into a
deadlock?

> It's of course possible that this is AIO specific, but I didn't see
> anything in stacks to suggest that.
>
> I do wonder if this possibly exposed an undocumented prior dependency
> on the value update always happening under the list lock.

I'm going through the other thread mentioned by Michael Paquier. I'm
wondering if the deadlock issue illustrated here
https://www.postgresql.org/message-id/55BB50D3.9000702%40iki.fi
is showing up again, because 71e4cc6b8e reduced the contention on the
waitlist lock and made things *a bit* faster.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
On Thu, 27 Jul 2023 at 16:01, Peter Geoghegan wrote: > > On Thu, Jul 27, 2023 at 7:59 AM Matthias van de Meent > wrote: > > > Basically, the patch that added that feature had to revise the index > > > AM API, in order to support a mode of operation where scans return > > > groupings rather than tuples. Whereas this patch requires none of > > > that. It makes affected index scans as similar as possible to > > > conventional index scans. > > > > Hmm, yes. I see now where my confusion started. You called it out in > > your first paragraph of the original mail, too, but that didn't help > > me then: > > > > The wiki does not distinguish "Index Skip Scans" and "Loose Index > > Scans", but these are not the same. > > A lot of people (myself included) were confused on this point for > quite a while. I've taken the liberty to update the "Loose indexscan" wiki page [0], adding detail that Loose indexscans are distinct from Skip scans, and showing some high-level distinguishing properties. I also split the TODO entry for `` "loose" or "skip" scan `` into two, and added links to the relevant recent threads so that it's clear these are different (and that some previous efforts may have had a confusing name). I hope this will reduce the chance of future confusion between the two different approaches to improving index scan performance. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0]: https://wiki.postgresql.org/wiki/Loose_indexscan
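For readers hitting this distinction for the first time, the two techniques target different query shapes. An illustrative sketch (table and index names are made up):

```sql
CREATE TABLE sales (region int, item int, amount numeric);
CREATE INDEX sales_region_item_idx ON sales (region, item);

-- "Loose" index scan: jump from one distinct value of the *leading*
-- index column to the next; useful for DISTINCT / GROUP BY queries.
SELECT DISTINCT region FROM sales;

-- "Skip scan": use the same index for a qual on a *non-leading*
-- column, by internally iterating over every region value.
SELECT * FROM sales WHERE item = 42;
```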
Re: Row pattern recognition
>> Attached is the v3 patch. In this patch following changes are made.
>
> Excellent. Thanks!

You are welcome!

> A few quick comments:
>
> - PERMUTE is still misspelled as PREMUTE

Oops. Will fix.

> - PATTERN variables do not have to exist in the DEFINE clause. They are
>   considered TRUE if not present.

Do you think we really need this? I found a criticism regarding this.

https://link.springer.com/article/10.1007/s13222-022-00404-3
"3.2 Explicit Definition of All Row Pattern Variables"

What do you think?

>> - I am working on making window aggregates RPR aware now. The
>>   implementation is in progress and far from completeness. An example
>>   is below. I think row 2, 3, 4 of "count" column should be NULL
>>   instead of 3, 2, 0, 0. Same thing can be said to other rows.
>>   Probably this is an effect of moving aggregate but I am still
>>   studying the window aggregation code.
>
> This tells me again that RPR is not being run in the right place. All
> windowed aggregates and frame-level window functions should Just Work
> with no modification.

I am not touching each aggregate function. I am modifying
eval_windowaggregates() in nodeWindowAgg.c, which calls each aggregate
function. Do you think it's not the right place to make window
aggregates RPR aware?
>> SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM
>> stock
>> WINDOW w AS (
>> PARTITION BY company
>> ORDER BY tdate
>> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
>> AFTER MATCH SKIP PAST LAST ROW
>> INITIAL
>> PATTERN (START UP+ DOWN+)
>> DEFINE
>>   START AS TRUE,
>>   UP AS price > PREV(price),
>>   DOWN AS price < PREV(price)
>> );
>>  company  |   tdate    | first_value | count
>> ----------+------------+-------------+-------
>>  company1 | 2023-07-01 |         100 |     4
>>  company1 | 2023-07-02 |             |     3
>>  company1 | 2023-07-03 |             |     2
>>  company1 | 2023-07-04 |             |     0
>>  company1 | 2023-07-05 |             |     0
>>  company1 | 2023-07-06 |          90 |     4
>>  company1 | 2023-07-07 |             |     3
>>  company1 | 2023-07-08 |             |     2
>>  company1 | 2023-07-09 |             |     0
>>  company1 | 2023-07-10 |             |     0
>>  company2 | 2023-07-01 |          50 |     4
>>  company2 | 2023-07-02 |             |     3
>>  company2 | 2023-07-03 |             |     2
>>  company2 | 2023-07-04 |             |     0
>>  company2 | 2023-07-05 |             |     0
>>  company2 | 2023-07-06 |          60 |     4
>>  company2 | 2023-07-07 |             |     3
>>  company2 | 2023-07-08 |             |     2
>>  company2 | 2023-07-09 |             |     0
>>  company2 | 2023-07-10 |             |     0
>
> In this scenario, row 1's frame is the first 5 rows and specified SKIP
> PAST LAST ROW, so rows 2-5 don't have *any* frame (because they are
> skipped) and the result of the outer count should be 0 for all of them
> because there are no rows in the frame.

Ok. Just I want to make sure. If it's other aggregates like sum or avg,
the result of the outer aggregates should be NULL.

> When we get to adding count in the MEASURES clause, there will be a
> difference between no match and empty match, but that does not apply
> here.

Can you elaborate more? I understand that "no match" and "empty match"
are different things. But I do not understand how the difference
affects the result of count.
Your comments are really helpful.
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp
Re: cataloguing NOT NULL constraints
> > Given the following sequence:
> >
> > drop table if exists p,c;
> > create table p(a int primary key);
> > create table c() inherits (p);
> > alter table p drop constraint p_pkey;
> >
> > However, c.a remains non-nullable, with a NOT NULL constraint that
> > claims to be inherited:
> >
> > \d+ c
> >                          Table "public.c"
> >  Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
> >  a      | integer |           | not null |         | plain   |             |              |
> > Not null constraints:
> >     "c_a_not_null" NOT NULL "a" (inherited)
> > Inherits: p
> > Access method: heap
> >
> > That's a problem, because now the NOT NULL constraint on c cannot be
> > dropped (attempting to drop it on c errors out because it thinks it's
> > inherited, but it can't be dropped via p, because p.a is already
> > nullable).

So I implemented a fix for this (namely: fix the inhcount to be 0
initially), and it works well, but it does cause a definitional
problem: any time we create a child table that inherits from another
table that has a primary key, all the columns in the child table will
get normal, visible, droppable NOT NULL constraints. Thus, pg_dump for
example will output that constraint exactly as if the user had
specified it in the child's CREATE TABLE command.

By itself this doesn't bother me, though I admit it seems a little odd.
When you restore such a setup from pg_dump, things work perfectly -- I
mean, you don't get a second constraint. But if you do drop the
constraint, then it will be reinstated by the next pg_dump as if you
hadn't dropped it, by way of it springing to life from the PK.

To avoid that, one option would be to make this NN constraint
undroppable ... but I don't see how. One option might be to add a
pg_depend row that links the NOT NULL constraint to its PK constraint.
But this would be a strange case that occurs nowhere else, since other
NOT NULL constraints don't have such pg_depend rows.
Also, I won't know how pg_dump likes this until I implement it. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Support worker_spi to execute the function dynamically.
On 2023-Jul-28, Michael Paquier wrote:

> So you have faced a race condition where the commit of the transaction
> doing the schema creation for the static workers is delayed long
> enough that the dynamic workers don't see it, and bumped on a catalog
> conflict when they try to create the same schemas.
>
> Having each bgworker on its own schema would be enough to prevent
> conflicts, but I'd like to add a second thing: a check on
> pg_stat_activity.wait_event after starting the workers. I have added
> something like that in the patch I have posted today for the custom
> wait events at [1] and it enforces the startup sequences of the
> workers in a stricter way.

Hmm, I think having all the workers doing their work in the same table
is better -- if nothing else, because it gives us the opportunity to
show how to use some other coding technique (but also because we are
forced to write the SQL code in a way that's correct for potentially
multiple concurrent workers, which sounds useful to demonstrate).

Can't we instead solve the race condition by having some shared
resource that blocks the other workers from proceeding until the schema
has been created? Perhaps an LWLock, or a condition variable, or an
advisory lock.

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
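A sketch of the advisory-lock variant of that idea in SQL (the schema, table, and lock-key label are illustrative, not the actual worker_spi code):

```sql
-- Serialize schema creation across concurrently started workers: the
-- first worker in creates the objects; the rest block on the advisory
-- lock and then find them already present.  The xact-scoped lock is
-- released automatically at COMMIT.
BEGIN;
SELECT pg_advisory_xact_lock(hashtext('worker_spi_schema'));
CREATE SCHEMA IF NOT EXISTS counted;
CREATE TABLE IF NOT EXISTS counted.counted (type text, value int);
COMMIT;
```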
Re: [PoC] Reducing planning time when tables have many partitions
Hi Yuya, Andrey,

On Fri, Jul 28, 2023 at 9:58 AM Andrey Lepikhov wrote:
>
> > > Discovering quality of partition pruning at the stage of execution
> > > initialization and using your set of patches I have found some
> > > dubious results with performance degradation. Look into the test
> > > case in attachment.
> > > Here is three queries. Execution times:
> > > 1 - 8s; 2 - 30s; 3 - 131s (with your patch set).
> > > 1 - 5s; 2 - 10s; 3 - 33s (current master).
> > >
> > > Maybe it is a false alarm, but on my laptop I see this degradation
> > > at every launch.
>
> Sorry for this. It was definitely a false alarm. In this patch,
> assertion checking adds much overhead. After switching it off, I found
> out that this feature solves my problem with a quick pass through the
> members of an equivalence class. Planning time results for the queries
> from the previous letter:
> 1 - 0.4s, 2 - 1.3s, 3 - 1.3s; (with the patches applied)
> 1 - 5s; 2 - 8.7s; 3 - 22s; (current master).

I measured planning time using my scripts setup.sql and queries.sql
attached to [1], with and without an assert build, using your patch.
The timings are recorded in the attached spreadsheet. I have the
following observations:

1. The patchset improves the planning time of queries involving
   partitioned tables by an integral factor, both with partitionwise
   join and without it. The speedup is 5x to 21x in my experiment.
   That's huge.
2. There's a slight degradation in planning time of queries involving
   unpartitioned tables. But I have seen that much variance usually.
3. An assert- and debug-enabled build shows degradation in planning
   time in all the cases.
4. There is a substantial memory increase in all the cases. Percentage
   wise it is predominant when partitionwise join is not used.

Given that most of the developers run assert enabled builds, it would
be good to bring down the degradation there while keeping the excellent
speedup in non-assert builds.
Queries on partitioned tables eat a lot of memory anyway; increasing
that further should be avoided.

I have not studied the patches. But I think the memory increase has to
do with our Bitmapset structure. It's space inefficient when there are
thousands of partitions involved. See my comment at [2].

[1] https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com
[2] https://www.postgresql.org/message-id/CAExHW5s4EqY43oB%3Dne6B2%3D-xLgrs9ZGeTr1NXwkGFt2j-OmaQQ%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

planning time measurement.ods
Description: application/vnd.oasis.opendocument.spreadsheet
Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)
Hi Tom, Richard,

On Mon, Jul 24, 2023 at 8:17 AM Richard Guo wrote:
>
> Thanks for pushing it!

With this fix, I saw a noticeable increase in the memory consumption of
the planner. I was running the experiments mentioned in [1]. The reason
is that the Bitmapsets created by bms_union() are not freed during
planning, and when there are thousands of child joins involved, the
bitmapsets take up a large amount of memory and there are a large
number of them.

The attached 0002 patch fixes the memory consumption by calculating
appinfos only once and using them twice. The numbers look like below:

 Number of tables joined | without patch | with patch |
-------------------------+---------------+------------+
                       2 |      40.8 MiB |   40.3 MiB |
                       3 |     151.6 MiB |  146.9 MiB |
                       4 |     463.9 MiB |  445.5 MiB |
                       5 |    1663.9 MiB | 1563.3 MiB |

The memory consumption is prominent at a higher number of joins as that
exponentially increases the number of child joins. The attached
setup.sql and queries.sql and patch 0001 were used to measure memory,
similar to [1].

I don't think it's a problem with the patch itself. We should be able
to use Bitmapset APIs similar to what the patch is doing. But there's a
problem with our Bitmapset implementation: it's not space efficient for
thousands of partitions. A quick calculation reveals this. If the
number of partitions is 1000, the matching partitions will usually be
1000, 2000, 3000 and so on. Thus the bitmapset representing the relids
will be {b 1000, 2000, 3000, ...}. To represent just the 1000th bit,
the current Bitmapset implementation will allocate 1000/8 = 125 bytes
of memory. A 5-way join will require 125 * 5 = 625 bytes of memory.
This is even true for lower relid numbers, since they will be 1000 bits
away, e.g. (1, 1001, 2001, 3001, ...). So 1000 such child joins require
625 KiB of memory. Doing this as many times as the number of joins we
get 120 * 625 KiB = 75 MiB, which is close to the memory difference we
see above. Even if we allocate a list to hold 5 integers, it will not
take 625 bytes.
I think we need to improve our Bitmapset representation to be efficient
in such cases. Of course whatever representation we choose has to be
space efficient for a small number of tables as well and should gel
well with our planner logic. So I guess some kind of dynamic
representation which changes the underlying layout based on the
contents is required. I have looked up past hacker threads to see if
this has been discussed previously. I don't think this is the thread to
discuss it and also I am not planning to work on it in the immediate
future. But I thought I would mention it where I found it.

[1] https://www.postgresql.org/message-id/caexhw5stmouobe55pmt83r8uxvfcph+pvo5dnpdrvcsbgxe...@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat

From 0f00fc48b9ebf73b54724ba82cec4a69d5f63846 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Wed, 12 Jul 2023 14:34:14 +0530
Subject: [PATCH 1/2] Report memory used for planning a query in EXPLAIN
 ANALYZE

The memory used in the CurrentMemoryContext and its children is sampled
before and after calling pg_plan_query() from ExplainOneQuery(). The
difference in the two samples is reported as the memory consumed while
planning the query. This may not account for the memory allocated in
memory contexts which are not children of CurrentMemoryContext. These
contexts are usually other long lived contexts, e.g.
CacheMemoryContext, which are shared by all the queries run in a
session. The consumption in those can not be attributed only to a given
query and hence should not be reported anyway.

The memory consumption is reported as the "Planning Memory" property in
EXPLAIN ANALYZE output.
Ashutosh Bapat
---
 src/backend/commands/explain.c | 12 ++--
 src/backend/commands/prepare.c |  7 ++-
 src/backend/utils/mmgr/mcxt.c  | 19 +++
 src/include/commands/explain.h |  3 ++-
 src/include/utils/memutils.h   |  1 +
 5 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f62..9f859949f0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -397,16 +397,20 @@ ExplainOneQuery(Query *query, int cursorOptions,
 				planduration;
 	BufferUsage bufusage_start,
 				bufusage;
+	Size		mem_consumed;

 	if (es->buffers)
 		bufusage_start = pgBufferUsage;
 	INSTR_TIME_SET_CURRENT(planstart);
+	mem_consumed = MemoryContextMemUsed(CurrentMemoryContext);

 	/* plan the query */
 	plan = pg_plan_query(query, queryString, cursorOptions, params);

 	INSTR_TIME_SET_CURRENT(planduration);
 	INSTR_TIME_SUBTRACT(planduration, planstart);
+	mem_consumed = MemoryContextMemUsed(CurrentMemoryContext)
+				   - mem_consumed;

 	/* calc differences of buffer counters. */
 	if (es->buffers)
@@ -417,7 +421,7 @@ ExplainOneQuery(Query *query, int cursorOptions,
 	/* run it (if needed) and produce
Re: logical decoding and replication of sequences, take 2
On Wed, Jul 26, 2023 at 8:48 PM Tomas Vondra wrote:
>
> On 7/26/23 09:27, Amit Kapila wrote:
> > On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila wrote:
>
> Anyway, I was thinking about this a bit more, and it seems it's not as
> difficult to use the page LSN to ensure sequences don't go backwards.
>

While studying the changes for this proposal and related areas, I have
a few comments:

1. I think you need to advance the origin if it is changed due to
copy_sequence(), otherwise, if the sync worker restarts after
SUBREL_STATE_FINISHEDCOPY, then it will restart from the slot's LSN
value.

2. Between the time of the SYNCDONE and READY states, the patch can
skip applying non-transactional sequence changes even if it should
apply them. The reason is that during that state change
should_apply_changes_for_rel() decides whether to apply a change based
on the value of remote_final_lsn, which won't be set for a
non-transactional change. I think we need to send the start LSN of a
non-transactional record and then use that as remote_final_lsn for such
a change.

3. For non-transactional sequence change apply, we don't set
replorigin_session_origin_lsn/replorigin_session_origin_timestamp as we
are doing in apply_handle_commit_internal() before calling
CommitTransactionCommand(). So, that can lead to the origin moving
backwards after restart, which will lead to requesting and applying the
same changes again, and for that period of time the sequence can go
backwards. This needs some more thought as to what is the correct
behaviour/solution.

4. BTW, while checking this behaviour, I noticed that the initial sync
worker for a sequence mentions the table in the LOG message: "LOG:
logical replication table synchronization worker for subscription
"mysub", table "s" has finished". Won't it be better here to refer to
it as a sequence?

--
With Regards,
Amit Kapila.
Re: Row pattern recognition
On 7/26/23 14:21, Tatsuo Ishii wrote:
> Attached is the v3 patch. In this patch following changes are made.

Excellent. Thanks!

A few quick comments:

- PERMUTE is still misspelled as PREMUTE

- PATTERN variables do not have to exist in the DEFINE clause. They are
  considered TRUE if not present.

> (1) I completely changed the pattern matching engine so that it
> performs backtracking. Now the engine evaluates all pattern elements
> defined in PATTERN against each row, saving matched pattern variables
> in a string per row. For example if the pattern elements A and B
> evaluated to true, a string "AB" is created for the current row. This
> continues until all pattern matching fails or encounters the end of
> the full window frame/partition. After that, the pattern matching
> engine creates all possible "pattern strings" and applies the regular
> expression matching to each. For example if we have row 0 = "AB" and
> row 1 = "C", possible pattern strings are: "AC" and "BC". If it
> matches, the length of the matching substring is saved. After all
> possible trials are done, the longest matching substring is chosen
> and it becomes the width (number of rows) of the reduced window
> frame. See row_is_in_reduced_frame, search_str_set and
> search_str_set_recurse in nodeWindowAggs.c for more details. For now
> I use a naive depth first search and probably there is a lot of room
> for optimization (for example rewriting it without using recursion).
> Suggestions/patches are welcome.

My own reviews will only focus on correctness for now. Once we get a
good set of regression tests all passing, I will focus more on
optimization. Of course, others might want to review the performance
now.

> Vik Fearing wrote:
>> I strongly disagree with this. Window functions do not need to know
>> how the frame is defined, and indeed they should not.
>> WinGetFuncArgInFrame should answer yes or no and the window function
>> just works on that. Otherwise we will get extension (and possibly
>> even core) functions that don't handle the frame properly.
So I moved row_is_in_reduced_frame into WinGetFuncArgInFrame so that those window functions do not need to be changed.

(3) The window function rpr was removed. We can use first_value instead. Excellent.

(4) Remaining tasks/issues.

- For now I disable WinSetMarkPosition because RPR often needs to access a row before the mark is set. We need to fix this in the future. Noted, and agreed.

- I am working on making window aggregates RPR aware now. The implementation is in progress and far from complete. An example is below. I think rows 2, 3, 4, 5 of the "count" column should be NULL instead of 3, 2, 0, 0. The same can be said of other rows. Probably this is an effect of moving aggregates, but I am still studying the window aggregation code. This tells me again that RPR is not being run in the right place. All windowed aggregates and frame-level window functions should Just Work with no modification.

SELECT company, tdate, first_value(price) OVER w, count(*) OVER w
FROM stock
WINDOW w AS (
  PARTITION BY company
  ORDER BY tdate
  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  AFTER MATCH SKIP PAST LAST ROW
  INITIAL
  PATTERN (START UP+ DOWN+)
  DEFINE
    START AS TRUE,
    UP AS price > PREV(price),
    DOWN AS price < PREV(price)
);

 company  |   tdate    | first_value | count
----------+------------+-------------+-------
 company1 | 2023-07-01 |         100 |     4
 company1 | 2023-07-02 |             |     3
 company1 | 2023-07-03 |             |     2
 company1 | 2023-07-04 |             |     0
 company1 | 2023-07-05 |             |     0
 company1 | 2023-07-06 |          90 |     4
 company1 | 2023-07-07 |             |     3
 company1 | 2023-07-08 |             |     2
 company1 | 2023-07-09 |             |     0
 company1 | 2023-07-10 |             |     0
 company2 | 2023-07-01 |          50 |     4
 company2 | 2023-07-02 |             |     3
 company2 | 2023-07-03 |             |     2
 company2 | 2023-07-04 |             |     0
 company2 | 2023-07-05 |             |     0
 company2 | 2023-07-06 |          60 |     4
 company2 | 2023-07-07 |             |     3
 company2 | 2023-07-08 |             |     2
 company2 | 2023-07-09 |             |     0
 company2 | 2023-07-10 |             |     0

In this scenario, row 1's frame is the first 5 rows and SKIP PAST LAST ROW is specified, so rows 2-5 don't have *any* frame (because they are skipped) and the
result of the outer count should be 0 for all of them because there are no rows in the frame. When we get to adding count in the MEASURES clause, there will be a difference between no match and empty match, but that does not apply here. I am going to add this thread to CommitFest and plan to add both of you as reviewers. Thanks in advance. My pleasure. Thank you for working on this difficult feature. -- Vik Fearing
Re: Support worker_spi to execute the function dynamically.
On Fri, Jul 28, 2023 at 02:11:48PM +0530, Bharath Rupireddy wrote: > I don't think something like [1] is complex. It makes worker_spi > foolproof. Rather, the other approach proposed, that is to provide > non-conflicting worker IDs to worker_spi_launch in the TAP test file, > looks complicated to me. And it's easy for someone to come, add a test > case with conflicting IDs input to worker_spi_launch and end up in the > same state that we're in now. Sure, but that's not really something that worries me for a template such as this one, for the sake of these tests. So I'd leave things as they are, which is slightly simpler. That's a minor point, for sure :) -- Michael signature.asc Description: PGP signature
Re: Row pattern recognition
On 7/28/23 09:09, Tatsuo Ishii wrote: We already recalculate a frame each time a row is processed even without RPR. See ExecWindowAgg. Yes, after each row. Not for each function. Ok, I understand now. Looking closer at the code, I realized that each window function calls update_frameheadpos, which computes the frame head position. But it actually checks winstate->framehead_valid, and if that is already true (probably set by another window function), it does nothing. Also, RPR always requires the frame option ROWS BETWEEN CURRENT ROW, which means the frame head changes each time the current row position changes.

Off topic for now: I wonder why this restriction is in place and whether we should respect or ignore it. That is a discussion for another time, though. My guess is that it is because frame options other than ROWS BETWEEN CURRENT ROW have little or no meaning. Consider the following example: Yes, that makes sense.

I strongly disagree with this. Window functions do not need to know how the frame is defined, and indeed they should not. We already break the rule by defining *support functions. See windowfuncs.c. The support functions don't know anything about the frame, they just know when a window function is monotonically increasing and execution can either stop or be "passed through". I see the following code in window_row_number_support:

/*
 * The frame options can always become "ROWS BETWEEN UNBOUNDED
 * PRECEDING AND CURRENT ROW".  row_number() always just increments by
 * 1 with each row in the partition.  Using ROWS instead of RANGE
 * saves effort checking peer rows during execution.
 */
req->frameOptions = (FRAMEOPTION_NONDEFAULT |
                     FRAMEOPTION_ROWS |
                     FRAMEOPTION_START_UNBOUNDED_PRECEDING |
                     FRAMEOPTION_END_CURRENT_ROW);

I think it not only knows about the frame, it even changes the frame options. This seems far from "don't know anything about the frame", no? That's the planner support function. The row_number() function itself is not even allowed to *have* a frame, per spec.
We allow it, but as you can see from that support function, we completely replace it. So all of the partition-level window functions are not affected by RPR anyway. I have two comments about this: It isn't just for convenience, it is for correctness. The window functions do not need to know which rows they are *not* operating on. There is no such thing as a "full" or "reduced" frame. The standard uses those terms to explain the difference between before and after RPR is applied, but window functions do not get to choose which frame they apply over. They only ever apply over the reduced window frame. I agree that the "full window frame" and "reduced window frame" do not exist at the same time, and that in the end (after computation of the reduced frame), only the "reduced" frame is visible to window functions/aggregates. But I still think that "full window frame" and "reduced window frame" are important concepts for explaining/understanding how RPR works. If we are just using those terms for documentation, then okay. -- Vik Fearing
Re: postgres_fdw: wrong results with self join + enable_nestloop off
Hi Richard, On Mon, Jul 24, 2023 at 11:45 AM Richard Guo wrote: > On Fri, Jul 21, 2023 at 8:51 PM Etsuro Fujita wrote: >> * In this bit I changed the last argument to NIL, which would be >> nitpicking, though. >> >> @@ -1038,7 +1038,7 @@ postgresGetForeignPaths(PlannerInfo *root, >> add_path(baserel, (Path *) path); >> >> /* Add paths with pathkeys */ >> - add_paths_with_pathkeys_for_rel(root, baserel, NULL); >> + add_paths_with_pathkeys_for_rel(root, baserel, NULL, NULL); > This was my oversight. No. IIUC, I think that would work well as proposed, but I changed it as such for readability. > So the two patches both look good to me now. Cool! I pushed the first patch after polishing it a little bit, so here is a rebased version of the second patch, in which I modified the ForeignPath and CustomPath cases in reparameterize_path_by_child() to reflect the new members fdw_restrictinfo and custom_restrictinfo, for safety, and tweaked a comment a bit. Thanks for looking! Best regards, Etsuro Fujita 0002-Allow-join-pushdown-even-if-pseudoconstant-quals-v3.patch Description: Binary data
Re: Avoid unused value (src/fe_utils/print.c)
On Tue, Jul 25, 2023 at 1:04 AM Ranier Vilela wrote: > Checked today, Coverity does not complain about need_recordsep. > Great! Thanks. So the v2 patch makes Coverity happy and, as for me, doesn't make the code less readable. Does anyone have any objections? Best regards, Karina Litskevich Postgres Professional: http://postgrespro.com/
Re: [PoC] Reducing planning time when tables have many partitions
Hello, On Fri, Jul 28, 2023 at 1:27 PM Andrey Lepikhov wrote: > Sorry for this. It was definitely a false alarm. In this patch, > assertion checking adds much overhead. After switching it off, I found > out that this feature solves my problem with a quick pass through the > members of an equivalence class. Planning time results for the queries > from the previous letter: > 1 - 0.4s, 2 - 1.3s, 3 - 1.3s; (with the patches applied) > 1 - 5s; 2 - 8.7s; 3 - 22s; (current master). > > I have attached a flamegraph that shows the query 2 planning process after > applying this set of patches. As you can see, the overhead in the > equivalence class routines is gone. I really appreciate your testing the patches and sharing your results. The results are interesting because they show that our optimization effectively reduces planning time for your workload, which contains different queries than the ones I used in my benchmarks. Thank you again for reviewing this. -- Best regards, Yuya Watari
Re: Support worker_spi to execute the function dynamically.
On Fri, Jul 28, 2023 at 1:26 PM Michael Paquier wrote: > > On Fri, Jul 28, 2023 at 10:47:39AM +0530, Bharath Rupireddy wrote: > > +# check their existence. Use IDs that do not overlap with the schemas created > > +# by the previous workers. > > > > While using different IDs in tests is a simple fix, -1 for it. I'd > > prefer if worker_spi uses different schema prefixes for static and > > dynamic bg workers to avoid conflicts. We can either look at > > MyBgworkerEntry->bgw_type in worker_spi_main and have the schema name as > > '{static, dynamic}_worker_schema_%d', id or pass the schema name in > > bgw_extra. > > For the sake of a test module, I am not really convinced that there is > any need to go down to such complexity with the names of the schemas > created.

I don't think something like [1] is complex. It makes worker_spi foolproof. Rather, the other approach proposed, that is to provide non-conflicting worker IDs to worker_spi_launch in the TAP test file, looks complicated to me. And it's easy for someone to come along, add a test case with conflicting IDs passed to worker_spi_launch, and end up in the same state that we're in now.

[1]
diff --git a/src/test/modules/worker_spi/t/001_worker_spi.pl b/src/test/modules/worker_spi/t/001_worker_spi.pl
index c293871313..700530afc7 100644
--- a/src/test/modules/worker_spi/t/001_worker_spi.pl
+++ b/src/test/modules/worker_spi/t/001_worker_spi.pl
@@ -27,16 +27,16 @@ is($result, 't', "dynamic bgworker launched");
 $node->poll_query_until(
 	'postgres',
 	qq[SELECT count(*) > 0 FROM information_schema.tables
-	WHERE table_schema = 'schema4' AND table_name = 'counted';]);
+	WHERE table_schema = 'dynamic_worker_schema4' AND table_name = 'counted';]);
 $node->safe_psql('postgres',
-	"INSERT INTO schema4.counted VALUES ('total', 0), ('delta', 1);");
+	"INSERT INTO dynamic_worker_schema4.counted VALUES ('total', 0), ('delta', 1);");
 
 # Issue a SIGHUP on the node to force the worker to loop once, accelerating
 # this test.
 $node->reload;
 
 # Wait until the worker has processed the tuple that has just been inserted.
 $node->poll_query_until('postgres',
-	qq[SELECT count(*) FROM schema4.counted WHERE type = 'delta';], '0');
-$result = $node->safe_psql('postgres', 'SELECT * FROM schema4.counted;');
+	qq[SELECT count(*) FROM dynamic_worker_schema4.counted WHERE type = 'delta';], '0');
+$result = $node->safe_psql('postgres', 'SELECT * FROM dynamic_worker_schema4.counted;');
 is($result, qq(total|1), 'dynamic bgworker correctly consumed tuple data');
 
 note "testing bgworkers loaded with shared_preload_libraries";
diff --git a/src/test/modules/worker_spi/worker_spi.c b/src/test/modules/worker_spi/worker_spi.c
index 903dcddef9..02b4204aa2 100644
--- a/src/test/modules/worker_spi/worker_spi.c
+++ b/src/test/modules/worker_spi/worker_spi.c
@@ -135,10 +135,19 @@ worker_spi_main(Datum main_arg)
 	int			index = DatumGetInt32(main_arg);
 	worktable  *table;
 	StringInfoData buf;
-	char		name[20];
+	char		name[NAMEDATALEN];
 
 	table = palloc(sizeof(worktable));
-	sprintf(name, "schema%d", index);
+
+	/*
+	 * Use different schema names for static and dynamic bg workers to avoid
+	 * name conflicts.
+	 */
+	if (strcmp(MyBgworkerEntry->bgw_type, "worker_spi") == 0)
+		sprintf(name, "worker_schema%d", index);
+	else if (strcmp(MyBgworkerEntry->bgw_type, "worker_spi dynamic") == 0)
+		sprintf(name, "dynamic_worker_schema%d", index);
+
 	table->schema = pstrdup(name);
 	table->name = pstrdup("counted");

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning
On Thu, Jul 27, 2023 at 10:10 PM Ashutosh Bapat < ashutosh.bapat@gmail.com> wrote: > The attached patch (0002) fixes this issue by > 1. declaring child SpecialJoinInfo as a local variable, thus > allocating memory on the stack instead of CurrentMemoryContext. The > memory on the stack is freed automatically. > 2. Freeing the Relids in SpecialJoinInfo explicitly after > SpecialJoinInfo has been used. It doesn't seem too expensive to translate SpecialJoinInfos, so I think it's OK to construct and free the SpecialJoinInfo for a child join on the fly. So the approach in 0002 looks reasonable to me. But there is something that needs to be fixed in 0002.

+	bms_free(child_sjinfo->commute_above_l);
+	bms_free(child_sjinfo->commute_above_r);
+	bms_free(child_sjinfo->commute_below_l);
+	bms_free(child_sjinfo->commute_below_r);

These four members in SpecialJoinInfo contain only outer-join relids. They do not need to be translated, so they reference the same memory areas in child_sjinfo as in parent_sjinfo. We should not free them; otherwise they would become dangling pointers in the parent sjinfo. Thanks Richard
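The hazard Richard points out is aliasing: some child members are fresh translated copies, while others are shared with the parent, and only the former may be freed. A toy model of that distinction (invented names and a simplified shape, not the planner's C code):

```python
class SpecialJoinInfo:
    """Toy stand-in: one translated member, one shared member."""
    def __init__(self, min_lefthand, commute_above_l):
        self.min_lefthand = min_lefthand        # baserel ids, per-child copy
        self.commute_above_l = commute_above_l  # outer-join ids, shared

def translate_for_child(parent, mapping):
    # min_lefthand contains baserel ids, so the child gets a fresh,
    # translated set; commute_above_l contains only outer-join ids,
    # so the child simply shares the parent's object.
    return SpecialJoinInfo(
        min_lefthand={mapping.get(r, r) for r in parent.min_lefthand},
        commute_above_l=parent.commute_above_l,
    )
```

Because `child.commute_above_l` is the very same object as the parent's, "freeing" it on behalf of the child would corrupt the parent, which mirrors why the quoted bms_free() calls must be dropped while freeing the translated members stays fine.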
Re: Support worker_spi to execute the function dynamically.
On Fri, Jul 28, 2023 at 10:47:39AM +0530, Bharath Rupireddy wrote: > +# check their existence. Use IDs that do not overlap with the schemas created > +# by the previous workers. > > While using different IDs in tests is a simple fix, -1 for it. I'd > prefer if worker_spi uses different schema prefixes for static and > dynamic bg workers to avoid conflicts. We can either look at > MyBgworkerEntry->bgw_type in worker_spi_main and have the schema name as > '{static, dynamic}_worker_schema_%d', id or pass the schema name in > bgw_extra. For the sake of a test module, I am not really convinced that there is any need to go down to such complexity with the names of the schemas created. -- Michael signature.asc Description: PGP signature
Re: pg_rewind fails with in-place tablespace
On Tue, Jul 25, 2023 at 04:36:42PM +0900, Michael Paquier wrote: > On Wed, Jul 19, 2023 at 09:31:35PM +0800, 赵锐(惜元) wrote: >> To help reproduce the failure, I have attached a tap test. And I am >> pleased to say that I have also identified a solution for this >> problem, which I have included in the patch. >> Thank you for your attention to this matter. > > Issue reproduced here, and agreed that we'd better do something about > that. I am not sure if your patch is right for the job though, but > I'll try to study that a bit more. It took me some time to remember that for the case of a local source we'd finish by using recurse_dir() and consider the in-place tablespace as a regular directory, so a fix located in libpq_traverse_files() sounds good to me.

+	if (strncmp(link_target, "pg_tblspc/", strlen("pg_tblspc/")) == 0)
+		type = FILE_TYPE_DIRECTORY;
+	else
+		type = FILE_TYPE_SYMLINK;

However, this is not consistent with the other places where we detect whether an in-place tablespace is used, like pg_basebackup.c, where we rely on the fact that the tablespace path is a relative path, using is_absolute_path() to make the difference between a normal and an in-place tablespace. I would choose consistency and do the same here, checking whether we have an absolute or relative path, depending on the result of pg_tablespace_location(). Testing only for the creation of the tablespace is fine for the sake of the report, but I would do slightly more here and create a table on this tablespace with some data, and add a check_query() once pg_rewind is done. I am finishing with the attached. Thoughts?
-- Michael

From 728bd093959f0d51c234f88729798165176fc0f6 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Fri, 28 Jul 2023 16:53:21 +0900
Subject: [PATCH] Fix pg_rewind with in-place tablespaces when source is remote

---
 src/bin/pg_rewind/libpq_source.c  | 11 ++-
 src/bin/pg_rewind/t/001_basic.pl  | 20
 src/bin/pg_rewind/t/RewindTest.pm |  1 +
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_rewind/libpq_source.c b/src/bin/pg_rewind/libpq_source.c
index 0d8e9ee2d1..417c74cfef 100644
--- a/src/bin/pg_rewind/libpq_source.c
+++ b/src/bin/pg_rewind/libpq_source.c
@@ -298,7 +298,16 @@ libpq_traverse_files(rewind_source *source, process_file_callback_t callback)
 		link_target = PQgetvalue(res, i, 3);
 
 		if (link_target[0])
-			type = FILE_TYPE_SYMLINK;
+		{
+			/*
+			 * In-place tablespaces are directories located in pg_tblspc/ with
+			 * relative paths.
+			 */
+			if (is_absolute_path(link_target))
+				type = FILE_TYPE_SYMLINK;
+			else
+				type = FILE_TYPE_DIRECTORY;
+		}
 		else if (isdir)
 			type = FILE_TYPE_DIRECTORY;
 		else
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index 031594e14e..c7b48255a7 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -18,6 +18,12 @@ sub run_test
 	RewindTest::setup_cluster($test_mode);
 	RewindTest::start_primary();
 
+	# Create an in-place tablespace with some data on it.
+	primary_psql("CREATE TABLESPACE space_test LOCATION ''");
+	primary_psql("CREATE TABLE space_tbl (d text) TABLESPACE space_test");
+	primary_psql(
+		"INSERT INTO space_tbl VALUES ('in primary, before promotion')");
+
 	# Create a test table and insert a row in primary.
 	primary_psql("CREATE TABLE tbl1 (d text)");
 	primary_psql("INSERT INTO tbl1 VALUES ('in primary')");
@@ -78,6 +84,13 @@ sub run_test
 		"insert into drop_tbl values ('in primary, after promotion')");
 	primary_psql("DROP TABLE drop_tbl");
 
+	# Insert some data in the in-place tablespace for the old primary and
+	# the standby.
+	primary_psql(
+		"INSERT INTO space_tbl VALUES ('in primary, after promotion')");
+	standby_psql(
+		"INSERT INTO space_tbl VALUES ('in standby, after promotion')");
+
 	# Before running pg_rewind, do a couple of extra tests with several
 	# option combinations. As the code paths taken by those tests
 	# do not change for the "local" and "remote" modes, just run them
@@ -145,6 +158,13 @@ sub run_test
 	RewindTest::run_pg_rewind($test_mode);
 
+	check_query(
+		'SELECT * FROM space_tbl ORDER BY d',
+		qq(in primary, before promotion
+in standby, after promotion
+),
+		'table content');
+
 	check_query(
 		'SELECT * FROM tbl1',
 		qq(in primary
diff --git a/src/bin/pg_rewind/t/RewindTest.pm b/src/bin/pg_rewind/t/RewindTest.pm
index 4957791e94..8fbbd521cb 100644
--- a/src/bin/pg_rewind/t/RewindTest.pm
+++ b/src/bin/pg_rewind/t/RewindTest.pm
@@ -131,6 +131,7 @@ sub setup_cluster
 	$node_primary->append_conf(
 		'postgresql.conf', qq(
 wal_keep_size = 320MB
+allow_in_place_tablespaces = on
 ));
 	return;
 }
-- 
2.40.1

signature.asc Description: PGP signature
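The classification rule the patch applies (an empty link target is a plain directory, an absolute target is a real tablespace symlink, and a relative target is an in-place tablespace to be traversed as a directory) can be mimicked in a few lines. This is a toy sketch of the rule only, not the pg_rewind code:

```python
import posixpath

def classify_tblspc_entry(link_target):
    """Mirror the patched libpq_traverse_files() decision: a non-empty,
    absolute link target is a normal tablespace symlink; a relative one
    is an in-place tablespace and must be treated as a directory."""
    if not link_target:
        return "FILE_TYPE_DIRECTORY"  # ordinary directory entry
    return ("FILE_TYPE_SYMLINK" if posixpath.isabs(link_target)
            else "FILE_TYPE_DIRECTORY")
```

This matches the consistency argument in the mail: pg_basebackup distinguishes the two cases by whether pg_tablespace_location() returns an absolute or a relative path, rather than by a pg_tblspc/ prefix.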
Re: add timing information to pg_upgrade
On Fri, Jul 28, 2023 at 5:21 AM Nathan Bossart wrote: > > This information can be used to better understand where the time is going > and to validate future improvements. I'm open to suggestions on formatting > the timing information, assuming folks are interested in this idea. > > Thoughts? +1 for adding the time taken. Some comments on the patch:

1.
+	gettimeofday(&step_start, NULL);
+	gettimeofday(&step_end, NULL);
+	start_ms = (step_start.tv_sec * 1000L) + (step_start.tv_usec / 1000L);
+	end_ms = (step_end.tv_sec * 1000L) + (step_end.tv_usec / 1000L);
+	elapsed_ms = end_ms - start_ms;

How about using the INSTR_TIME_SET_CURRENT, INSTR_TIME_SUBTRACT and INSTR_TIME_GET_MILLISEC macros instead of gettimeofday and explicit calculations?

2.
> Checking database user is the install user ok (took 3 ms)
> Checking database connection settings ok (took 4 ms)

+	report_status(PG_REPORT, "ok (took %ld ms)", elapsed_ms);

I think it's okay to just leave it as "ok \t %ld ms", elapsed_ms); without "took". FWIW, pg_regress does it that way, see below:

ok 2 + boolean    50 ms
ok 3 + char       32 ms
ok 4 + name       33 ms

> Performing Consistency Checks
> -
> Checking cluster versions ok (took 0 ms)
> Checking database user is the install user ok (took 3 ms)
> Checking database connection settings ok (took 4 ms)
> Checking for prepared transactions ok (took 2 ms)
> Checking for system-defined composite types in user tables ok (took 82 ms)
> Checking for reg* data types in user tables ok (took 55 ms)

Just curious, out of all the reported pg_upgrade prep_status()-es, which ones are taking more time? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
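The INSTR_TIME suggestion above amounts to wrapping each step in a start/stop measurement with a monotonic clock instead of hand-rolled wall-clock arithmetic. A rough analogue, purely illustrative and not the pg_upgrade code:

```python
import time

def timed(step):
    """Run a step and return (result, elapsed_ms), analogous to
    bracketing a pg_upgrade check with INSTR_TIME_SET_CURRENT /
    INSTR_TIME_SUBTRACT / INSTR_TIME_GET_MILLISEC."""
    start = time.monotonic()       # monotonic: immune to clock jumps
    result = step()
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return result, elapsed_ms
```

A monotonic source matters here: gettimeofday() can jump backwards under NTP adjustments, which would make a "took N ms" report negative, while the instr_time abstraction picks a suitable monotonic source per platform.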
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi, I'd keep the Autotools build up to date, because Meson is very limited on OS versions that are not very up to date. Anyway, it's OK now. I'm currently playing with the patch and reviewing the sources; if you need any cooperation, please let us know. I'm +1 for committing this patch after review. -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: Support to define custom wait events for extensions
On Fri, Jul 28, 2023 at 6:36 AM Michael Paquier wrote: > > I have spent more time polishing the docs and the comments. This v9 > looks in a rather committable shape now with docs, tests and core > routines in place. Thanks. Here are some comments on the v9 patch:

1.
-    so an LWLock wait event might be reported as
-    just extension rather than the
-    extension-assigned name.
+    if the extension's library is not loaded; so a custom wait event might
+    be reported as just ???
+    rather than the custom name assigned.

I'm trying to understand why '???' is any better than 'extension' for a registered custom wait event of an unloaded extension. PS: I looked at other instances where '???' is used to represent an unknown "thing".

2. Have an example of how a custom wait event is displayed in the example in the docs "Here is an example of how wait events can be viewed:". We can use the worker_spi wait event type there.

3.
-    so an LWLock wait event might be reported as
-    just extension rather than the
-    extension-assigned name.
+    . In some cases, the name
+    assigned by an extension will not be available in all server processes
+    if the extension's library is not loaded; so a custom wait event might
+    be reported as just ???

Aren't we missing an explicit statement of what wait event will be reported for an LWLock when the extension's library is not loaded?

4.
+    Add-ins can define custom wait events under the wait event type

I see a few instances of Add-ins/add-in in xfunc.sgml. Isn't it better to use the word extension, given that the glossary defines what an extension is: https://www.postgresql.org/docs/current/glossary.html#GLOSSARY-EXTENSION?

5.
+} WaitEventExtensionCounter;
+
+/* pointer to the shared memory */
+static WaitEventExtensionCounter *waitEventExtensionCounter;

How about naming the structure WaitEventExtensionCounterData and the pointer WaitEventExtensionCounter?
This keeps all the static variable names consistent: WaitEventExtensionNames, WaitEventExtensionNamesAllocated and WaitEventExtensionCounter.

6.
+    /* Check the wait event class. */
+    Assert((wait_event_info & 0xFF000000) == PG_WAIT_EXTENSION);
+
+    /* This should only be called for user-defined wait event. */
+    Assert(eventId >= NUM_BUILTIN_WAIT_EVENT_EXTENSION);

Maybe we should turn the above asserts into ereport(ERROR) to protect against an extension sending in an unregistered wait_event_info? Especially the first Assert((wait_event_info & 0xFF000000) == PG_WAIT_EXTENSION); which checks that the passed-in wait_event_info was previously returned by WaitEventExtensionNew. IMO, these assertions are a better fit as errors.

7.
+ * Extensions can define their own wait events in this categiry. First,

Typo - s/categiry/category

8.
+ First,
+ * they should call WaitEventExtensionNew() to get one or more wait event
+ * IDs that are allocated from a shared counter.

Can WaitEventExtensionNew() be WaitEventExtensionNew(int num_ids, int *result) to get the required number of wait event IDs in one call, similar to RequestNamedLWLockTranche? Currently, an extension needs to call WaitEventExtensionNew() N times to get N wait event IDs. Maybe the existing WaitEventExtensionNew() is good, but just a thought.

9.
# The expected result is a special pattern here with a newline coming from the
# first query where the shared memory state is set.
$result = $node->poll_query_until(
	'postgres',
	qq[SELECT worker_spi_init();
	   SELECT wait_event FROM pg_stat_activity WHERE backend_type ~ 'worker_spi';],
	qq[
worker_spi_main]);

This test doesn't have to be this complex, with the result being a special pattern; SELECT worker_spi_init(); can just be within a separate safe_psql.

10.
+	wsstate = ShmemInitStruct("custom_wait_event",

Name the shared memory just "worker_spi" to make it generic and extensible. Essentially, it is a worker_spi shared memory area, part of which is for the custom wait event ID.
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
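Comment 8 above suggests letting one call hand out several IDs from the shared counter, the way RequestNamedLWLockTranche takes a count. A toy registry modeling that idea (all names and the class-bit constant here are invented for illustration; this is not the v9 API):

```python
PG_WAIT_EXTENSION = 0x07000000  # wait-event class bits (illustrative value)
NUM_BUILTIN_WAIT_EVENT_EXTENSION = 1

class WaitEventRegistry:
    """Toy model of a shared counter handing out custom wait event IDs."""
    def __init__(self):
        self.next_id = NUM_BUILTIN_WAIT_EVENT_EXTENSION
        self.names = {}  # event id -> display name

    def new(self, name, count=1):
        """Allocate `count` consecutive event IDs for `name` in one call,
        mirroring the batch-allocation suggestion."""
        ids = []
        for i in range(count):
            event_id = self.next_id
            self.next_id += 1
            self.names[event_id] = name if count == 1 else f"{name}_{i}"
            # Callers wait on the class bits OR'ed with the event id.
            ids.append(PG_WAIT_EXTENSION | event_id)
        return ids
```

The point of the batch form is that an extension needing N events does one registration under the lock instead of N, and the IDs come back consecutive, which makes table-driven name lookup trivial.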
Re: remaining sql/json patches
Hello, On Thu, Jul 27, 2023 at 6:36 PM Shinoda, Noriyoshi (HPE Services Japan - FSIP) wrote: > Hi, > Thank you for developing such a great feature. The attached patch formats the > documentation like any other function definition: > - Added right parenthesis to json function calls. > - Added to json functions. > - Added a space to the 'expression' part of the json_scalar function. > - Added a space to the 'expression' part of the json_serialize function. Thanks for checking and the patch. Will push shortly. > It seems that the three functions added this time do not have tuples in the > pg_proc catalog. Is it unnecessary? Yes. These are not functions that get pg_proc entries, but SQL constructs that *look like* functions, similar to XMLEXISTS(), etc. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Re: unrecognized node type while displaying a Path due to dangling pointer
On Tue, Jul 11, 2023 at 4:30 PM Tom Lane wrote: > > Jeevan Chalke writes: > > Attached patch. > > I would be astonished if this fixes anything. The code still doesn't > know which paths are referenced by which other ones, and so the place > where we free a previously-added path can't know what to do. > > I've speculated about adding some form of reference counting to paths > (maybe just a "pin" flag rather than a full refcount) so that we could > be smarter about this. The existing kluge for "don't free IndexPaths" > could be replaced by setting the pin mark on any IndexPath that we > make a bitmap path from. Up to now it hasn't seemed necessary to > generalize that hack, but maybe it's time. Can you show a concrete > case where we are freeing a still-referenced path? The set of patches in [1] adds infrastructure to reference, link and unlink paths. The patches are raw and have some TODOs there. But I think that infrastructure will solve this problem as a side effect. Please take a look and let me know if this is as per your speculation. It's more than just pinning, though. The patch set uses references to free memory consumed by paths which remain unused. The memory consumed is substantial when partitionwise join is used and there are thousands of partitions. [1] https://www.postgresql.org/message-id/CAExHW5tUcVsBkq9qT%3DL5vYz4e-cwQNw%3DKAGJrtSyzOp3F%3DXacA%40mail.gmail.com -- Best Wishes, Ashutosh Bapat
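Tom's "pin flag" sketch, where building a bitmap path marks its component index path so that freeing logic knows not to discard it, can be illustrated with a toy model (hypothetical names; the real Path structures and add_path freeing are C planner internals):

```python
class Path:
    def __init__(self, name):
        self.name = name
        self.pinned = False  # set when another path references this one

def make_bitmap_path(index_path):
    # A bitmap path keeps a reference to its component index path,
    # so pin the index path before reusing it.
    index_path.pinned = True
    return Path(f"Bitmap({index_path.name})")

def free_unpinned(paths):
    """Discard paths nothing references; keep pinned ones alive."""
    return [p for p in paths if p.pinned]
```

A full refcount (as in the linked patch set) generalizes this: instead of a one-way pin, each link increments a counter and each unlink decrements it, so memory for a path can be reclaimed as soon as its count drops to zero.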
Re: POC: Extension for adding distributed tracing - pg_tracing
> Agree, something goes wrong when using Autotools (but not Meson) on > both Linux and MacOS. I didn't investigate the issue though. I was only using Meson and forgot to keep the Automake files up to date when I split pg_tracing.c into multiple files (span.c, explain.c...). On Fri, Jul 28, 2023 at 8:10 AM Nikita Malakhov wrote: > Hi, > > I've fixed the Autotools build, please check patch below (v2). > > On Thu, Jul 27, 2023 at 6:39 PM Aleksander Alekseev < > aleksan...@timescale.com> wrote: > >> Hi, >> >> > Also FYI, there are build warnings because functions >> > const char * get_span_name(const Span * span, const char *qbuffer) >> > and >> > const char * get_operation_name(const Span * span, const char *qbuffer) >> > do not have default inside switch and no return outside of switch. >> >> You are right, there are a few warnings: >> >> ``` >> [1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o >> ../contrib/pg_tracing/span.c: In function ‘get_span_name’: >> ../contrib/pg_tracing/span.c:210:1: warning: control reaches end of >> non-void function [-Wreturn-type] >> 210 | } >> | ^ >> ../contrib/pg_tracing/span.c: In function ‘get_operation_name’: >> ../contrib/pg_tracing/span.c:249:1: warning: control reaches end of >> non-void function [-Wreturn-type] >> 249 | } >> | ^ >> ``` >> >> Here is the patch v2 with a quick fix. >> >> > but got errors calling make check and cannot install the extension >> >> Agree, something goes wrong when using Autotools (but not Meson) on >> both Linux and MacOS. I didn't investigate the issue though. >> >> -- >> Best regards, >> Aleksander Alekseev >> > > > -- > Regards, > > -- > Nikita Malakhov > Postgres Professional > The Russian Postgres Company > https://postgrespro.ru/ >
Re: POC: Extension for adding distributed tracing - pg_tracing
Hi, I've fixed the Autotools build, please check patch below (v2). On Thu, Jul 27, 2023 at 6:39 PM Aleksander Alekseev < aleksan...@timescale.com> wrote: > Hi, > > > Also FYI, there are build warnings because functions > > const char * get_span_name(const Span * span, const char *qbuffer) > > and > > const char * get_operation_name(const Span * span, const char *qbuffer) > > do not have default inside switch and no return outside of switch. > > You are right, there are a few warnings: > > ``` > [1566/1887] Compiling C object contrib/pg_tracing/pg_tracing.so.p/span.c.o > ../contrib/pg_tracing/span.c: In function ‘get_span_name’: > ../contrib/pg_tracing/span.c:210:1: warning: control reaches end of > non-void function [-Wreturn-type] > 210 | } > | ^ > ../contrib/pg_tracing/span.c: In function ‘get_operation_name’: > ../contrib/pg_tracing/span.c:249:1: warning: control reaches end of > non-void function [-Wreturn-type] > 249 | } > | ^ > ``` > > Here is the patch v2 with a quick fix. > > > but got errors calling make check and cannot install the extension > > Agree, something goes wrong when using Autotools (but not Meson) on > both Linux and MacOS. I didn't investigate the issue though. > > -- > Best regards, > Aleksander Alekseev > -- Regards, -- Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/ pg-tracing-v2.patch Description: Binary data