Re: Use of additional index columns in rows filtering
On Wed, Aug 2, 2023 at 6:32 PM Peter Geoghegan wrote:
> I don't dispute the fact that this can only happen when the planner
> believes (with good reason) that the expected cost will be lower. But
> I maintain that there is a novel risk to be concerned about, which is
> meaningfully distinct from the general risk of regressions that comes
> from making just about any change to the planner. The important
> principle here is that we should have a strong bias in the direction
> of making quals into true "Access Predicates" whenever practical.
>
> Yeah, technically the patch doesn't directly disrupt how existing
> index paths get generated. But it does have the potential to disrupt
> it indirectly, by providing an alternative very-similar index path
> that's likely to outcompete the existing one in these cases. I think
> that we should have just one index path that does everything well
> instead.

You can see this for yourself, quite easily. Start by running the relevant query from the regression tests, which is:

SELECT * FROM tenk1
WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);

EXPLAIN (ANALYZE, BUFFERS) confirms that the patch makes the query slightly faster, as expected. I see 7 buffer hits for the bitmap index scan plan on master, versus only 4 buffer hits for the patch's index scan. Obviously, this is because we go from multiple index scans (multiple bitmap index scans) to only one.

But if I run this insert statement and try the same thing again, things look very different:

insert into tenk1 (thousand, tenthous)
select 42, i from generate_series(43, 1000) i;

(Bear in mind that we've inserted rows that don't actually need to be selected by the query in question.)

Now the master branch's plan works in just the same way as before -- it has exactly the same overhead (7 buffer hits). Whereas the patch still gets the same risky plan -- which now blows up. The plan now loses by far more than it could ever really hope to win by: 336 buffer hits. (It could be a lot higher than this, even, but you get the picture.)

Sure, it's difficult to imagine a very general model that captures this sort of risk in the general case. But you don't need a degree in actuarial science to understand that it's inherently a bad idea to juggle chainsaws -- no matter what your general risk tolerance happens to be. Some things you just don't do.

--
Peter Geoghegan
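For comparison, the OR qual in the regression-test query can also be expressed as a single scalar array operation, which is the form that allows both index columns to act as access predicates in one scan. A sketch, assuming the standard regression-test tenk1 table and its index on (thousand, tenthous):

```sql
-- The regression-test query from above:
SELECT * FROM tenk1
WHERE thousand = 42
  AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42);

-- Equivalent SAOP formulation: with both columns usable as true access
-- predicates, a single scan of the (thousand, tenthous) index suffices,
-- no matter how many extra thousand = 42 rows are later inserted.
SELECT * FROM tenk1
WHERE thousand = 42
  AND tenthous = ANY (ARRAY[1, 3, 42]);
```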
Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
On Wed, Aug 2, 2023 at 9:16 PM jian he wrote:
> On Wed, Aug 2, 2023 at 6:36 PM Amul Sul wrote:
> >
> > Hi,
> >
> > Currently, we have an option to drop the expression of stored generated columns as:
> >
> > ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ]
> >
> > But we don't have support to update that expression. The attached patch provides that as:
> >
> > ALTER [ COLUMN ] column_name SET EXPRESSION expression
> >
> > Note that this form of ALTER is meant to work for a column which is already generated. It then changes the generation expression in the catalog and rewrites the table, using the existing table rewrite facilities for ALTER TABLE. Otherwise, an error will be reported.
> >
> > To keep the code flow simple, I have renamed the existing function that was in use for DROP EXPRESSION so that it can be used for SET EXPRESSION as well, which is a similar design to SET/DROP DEFAULT. I kept these renaming code changes in a separate patch to minimize the diff in the main patch.
> >
> > Demo:
> > -- Create table
> > CREATE TABLE t1 (x int, y int GENERATED ALWAYS AS (x * 2) STORED);
> > INSERT INTO t1 VALUES(generate_series(1,3));
> >
> > -- Check the generated data
> > SELECT * FROM t1;
> >  x | y
> > ---+---
> >  1 | 2
> >  2 | 4
> >  3 | 6
> > (3 rows)
> >
> > -- Alter the expression
> > ALTER TABLE t1 ALTER COLUMN y SET EXPRESSION (x * 4);
> >
> > -- Check the new data
> > SELECT * FROM t1;
> >  x | y
> > ---+----
> >  1 | 4
> >  2 | 8
> >  3 | 12
> > (3 rows)
> >
> > Thank you.
> >
> > Regards,
> > Amul Sul
> > EDB: http://www.enterprisedb.com
>
> setup:
> BEGIN;
> set search_path = test;
> DROP TABLE if exists gtest_parent, gtest_child;
>
> CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 bigint GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);
>
> CREATE TABLE gtest_child PARTITION OF gtest_parent
>   FOR VALUES FROM ('2016-07-01') TO ('2016-08-01');  -- inherits gen expr
>
> CREATE TABLE gtest_child2 PARTITION OF gtest_parent (
>   f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 22) STORED  -- overrides gen expr
> ) FOR VALUES FROM ('2016-08-01') TO ('2016-09-01');
>
> CREATE TABLE gtest_child3 (f1 date NOT NULL, f2 bigint, f3 bigint GENERATED ALWAYS AS (f2 * 33) STORED);
> ALTER TABLE gtest_parent ATTACH PARTITION gtest_child3 FOR VALUES FROM ('2016-09-01') TO ('2016-10-01');
>
> INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 1);
> INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 2);
> INSERT INTO gtest_parent (f1, f2) VALUES ('2016-08-15', 3);
> UPDATE gtest_parent SET f1 = f1 + 60 WHERE f2 = 1;
>
> ALTER TABLE ONLY gtest_parent ALTER COLUMN f3 SET EXPRESSION (f2 * 4);
> ALTER TABLE gtest_child ALTER COLUMN f3 SET EXPRESSION (f2 * 10);
> COMMIT;
>
> set search_path = test;
> SELECT table_name, column_name, is_generated, generation_expression
> FROM information_schema.columns
> WHERE table_name in ('gtest_child','gtest_child1', 'gtest_child2','gtest_child3')
> order by 1,2;
>
> result:
>   table_name  | column_name | is_generated | generation_expression
> --------------+-------------+--------------+-----------------------
>  gtest_child  | f1          | NEVER        |
>  gtest_child  | f1          | NEVER        |
>  gtest_child  | f2          | NEVER        |
>  gtest_child  | f2          | NEVER        |
>  gtest_child  | f3          | ALWAYS       | (f2 * 2)
>  gtest_child  | f3          | ALWAYS       | (f2 * 10)
>  gtest_child2 | f1          | NEVER        |
>  gtest_child2 | f1          | NEVER        |
>  gtest_child2 | f2          | NEVER        |
>  gtest_child2 | f2          | NEVER        |
>  gtest_child2 | f3          | ALWAYS       | (f2 * 22)
>  gtest_child2 | f3          | ALWAYS       | (f2 * 2)
>  gtest_child3 | f1          | NEVER        |
>  gtest_child3 | f1          | NEVER        |
>  gtest_child3 | f2          | NEVER        |
>  gtest_child3 | f2          | NEVER        |
>  gtest_child3 | f3          | ALWAYS       | (f2 * 2)
>  gtest_child3 | f3          | ALWAYS       | (f2 * 33)
> (18 rows)
>
> one partition, one column, 2 generated expressions. Is this the expected behavior?

That is not expected, nor acceptable. But, somehow, I am not able to reproduce this behavior. Could you please retry this experiment after adding "table_schema" to your output query?

Thank you.

Regards,
Amul
Re: Extract numeric filed in JSONB more effectively
Hi

On Thu, Aug 3, 2023 at 2:51, Andy Fan wrote:
> Hi Jian:
>
> >> return PointerGetDatum(v->val.numeric);
> >> should be something like
> >> PG_RETURN_NUMERIC(v->val.numeric);
> >> ?
>
> Thanks for this reminder; a new patch is attached, and a commitfest entry has been added as well [1]. For the record, I compared the new operator with all the existing operators:
>
> select 1 from tb where (a->'a')::numeric = 2;   30.56ms
> select 1 from tb where (a->>'a')::numeric = 2;  29.43ms
> select 1 from tb where (a@->'a') = 2;           14.80ms
>
> [1] https://commitfest.postgresql.org/44/4476/

I don't like this solution, because it bloats the set of operators and is not very readable. For completeness you would have to implement casts for date, int, boolean too. And the same problem exists for the xml and hstore types (probably for any type that is a container).

It is strange that plain casting alone is 2x slower. I don't like the idea of a special operator being 2x faster than the common syntax for casting. It is a signal that there is room for optimization. Black magic with special operators is not user friendly for relatively common problems.

Maybe we can introduce some *internal operator* "extract to type", and in the rewrite stage transform the pattern (x->'field')::type to OP(x, 'field', typid).

Regards

Pavel

> Best Regards
> Andy Fan
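For anyone wanting to reproduce the comparison: the thread only shows the queries, not the setup. A minimal sketch of a plausible benchmark table follows; the table name tb and key 'a' come from the quoted queries, while the row count and data distribution are assumptions:

```sql
-- Hypothetical setup matching the quoted queries; the real benchmark
-- table and row counts were not posted in the thread.
CREATE TABLE tb (a jsonb);
INSERT INTO tb
SELECT jsonb_build_object('a', i % 100)
FROM generate_series(1, 100000) AS i;

-- The two existing forms being compared (the @-> operator only exists
-- with the proposed patch applied):
SELECT 1 FROM tb WHERE (a->'a')::numeric = 2;
SELECT 1 FROM tb WHERE (a->>'a')::numeric = 2;
```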
Re: Fix incorrect start up costs for WindowAgg paths (bug #17862)
On Wed, 31 May 2023 at 12:59, David Rowley wrote:
>
> On Wed, 12 Apr 2023 at 21:03, David Rowley wrote:
> > I'll add this to the "Older bugs affecting stable branches" section of
> > the PG 16 open items list
>
> When I wrote the above, it was very soon after the feature freeze for PG16. Since we tend not to do cost changes as part of bug fixes, due to not wanting to destabilise plans between minor versions, I wondered if we could instead just fix it in PG16, given the freeze had *just* started. That's no longer the case, so I'm just going to move this out from where I added it in the PG16 Open items "Live issues" section and just add a July CF entry for it instead.

I'm keen to move this patch along. It's not a particularly interesting patch and I don't expect much interest in it, but I feel it's pretty important to have the planner not accidentally choose a cheap-startup plan when a WindowAgg is going to fetch the entire subplan's tuples.

I've made another pass over the patch and made a bunch of cosmetic changes. As far as mechanical changes go, I only changed the EXCLUDE TIES and EXCLUDE GROUP behaviour for when there is no ORDER BY clause in the WindowClause. If there's no ORDER BY then subtracting 1.0 rows seems like the right thing to do, rather than what the previous patch did.

I've (temporarily) left the DEBUG1 elog in there in case anyone wants to test for themselves (it saves debugger use). In the absence of that, I'm planning on just pushing it to master only tomorrow. It seems fairly low risk and unlikely to attract too much interest since it only affects startup costs of WindowAgg nodes.

I'm currently thinking it's a bad idea to backpatch this, but I'd consider it more if someone else thought it was a good idea, or if more people came along complaining about poor plan choices in plans containing WindowAggs. Currently, it seems better not to destabilise plans in the back branches.
(CC'd Tim, who reported #17862, as he may have an opinion on this)

The only thought I had while looking at this again, aside from what I changed, was whether get_windowclause_startup_tuples() should go in selfuncs.c. I wondered if it would be neater to use convert_numeric_to_scalar() instead of the code I had to add to convert the (SMALL|BIG)INT Consts in FOLLOWING to double. Aside from that reason, it seems we don't have many usages of DEFAULT_INEQ_SEL outside of selfuncs.c. I didn't feel strongly enough about this to actually move the function.

The updated patch is attached. Here are the results of my testing (note the DEBUG message matches the COUNT(*) result in all cases apart from one case where COUNT(*) returns 0 and the estimated tuples is 1.0).

create table ab (a int, b int);
insert into ab select a,b from generate_series(1,100) a, generate_series(1,100) b;
analyze ab;
set client_min_messages=debug1;

# select count(*) over () from ab limit 1;
DEBUG: startup_tuples = 1
 count
-------
     1
(1 row)

# select count(*) over (partition by a) from ab limit 1;
DEBUG: startup_tuples = 100
 count
-------
   100
(1 row)

# select count(*) over (partition by a order by b) from ab limit 1;
DEBUG: startup_tuples = 1
 count
-------
     1
(1 row)

# select count(*) over (partition by a order by b rows between current row and unbounded following) from ab limit 1;
DEBUG: startup_tuples = 100
 count
-------
   100
(1 row)

# select count(*) over (partition by a order by b rows between current row and 10 following) from ab limit 1;
DEBUG: startup_tuples = 11
 count
-------
    11
(1 row)

# select count(*) over (partition by a order by b rows between current row and 10 following exclude current row) from ab limit 1;
DEBUG: startup_tuples = 10
 count
-------
    10
(1 row)

# select count(*) over (partition by a order by b rows between current row and 10 following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 11
 count
-------
    11
(1 row)

# select count(*) over (partition by a order by b range between current row and 10 following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 11
 count
-------
    11
(1 row)

# select count(*) over (partition by a order by b range between current row and unbounded following exclude ties) from ab limit 1;
DEBUG: startup_tuples = 100
 count
-------
   100
(1 row)

# select count(*) over (partition by a order by b range between current row and unbounded following exclude group) from ab limit 1;
DEBUG: startup_tuples = 99
 count
-------
    99
(1 row)

# select count(*) over (partition by a order by b groups between current row and unbounded following exclude group) from ab limit 1;
DEBUG: startup_tuples = 99
 count
-------
    99
(1 row)

# select count(*) over (partition by a rows between current row and unbounded following exclude group) from ab limit 1;
DEBUG: startup_tuples = 1
 count
-------
     0
(1 row)

# select count(*) over (partition by a rows between current row and unbounded following exclude ties) from ab limit 1;
DEBUG: startup_tuples =
Re: Oversight in reparameterize_path_by_child leading to executor crash
On Tue, Aug 1, 2023 at 6:50 PM Tom Lane wrote:
>
> Alternatively, could we postpone the reparameterization until createplan.c? Having to build reparameterized expressions we might not use seems expensive, and it would also contribute to the memory bloat being complained of in nearby threads.

As long as the translation happens only once, it should be fine. It's the repeated translations which should be avoided.

--
Best Wishes,
Ashutosh Bapat
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Wed, Aug 2, 2023 at 11:19 PM Amit Kapila wrote:
>
> On Wed, Aug 2, 2023 at 4:09 PM Melih Mutlu wrote:
> >
> > PFA an updated version with some of the earlier reviews addressed. Forgot to include them in the previous email.
>
> It is always better to explicitly tell which reviews are addressed, but anyway, I have done some minor cleanup in the 0001 patch, including removing includes which didn't seem necessary, modified a few comments, and ran pgindent. I also thought of modifying some variable names based on suggestions by Peter Smith in an email [1], but didn't find many of them any better than the current ones, so modified just a few of those. If you guys are okay with this then let's commit it, and then we can focus more on the remaining patches.

I checked the latest patch v25-0001. LGTM.

~~

BTW, I have re-tested many cases of HEAD versus HEAD+v25-0001 (using the current test scripts previously mentioned in this thread). Because v25-0001 is only a refactoring patch, we expect the results to be the same as for HEAD, and that is what I observed.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Fix bogus Asserts in calc_non_nestloop_required_outer
As stated in [1], all paths arriving here are parameterized by top parents, so we should check against top_parent_relids, if it exists, in the two Asserts. Attached is a patch fixing that.

[1] https://www.postgresql.org/message-id/CAMbWs4_UoVcCwkVMfi9TjSC%3Do5U6BRHUNZiVhrvSbDfU2HaeDA%40mail.gmail.com

Thanks
Richard

v1-0001-Fix-bogus-Asserts-in-calc_non_nestloop_required_outer.patch
Description: Binary data
Re: Use of additional index columns in rows filtering
On Wed, Aug 2, 2023 at 6:48 AM Tomas Vondra wrote:
> How come we don't know that until the execution time? Surely when building the paths/plans, we match the clauses to the index keys, no? Or are you saying that just having a scan key is not enough for it to be an "access predicate"?

In principle we can, and probably should, recognize the difference between "Access Predicates" and "Index Filter Predicates" much earlier on, during planning -- probably when index paths are first built. It seems quite likely that there will turn out to be at least 2 or 3 reasons to do this. The EXPLAIN usability issue might be enough of a reason on its own, though.

> Anyway, this patch is mostly about "Index Cond" mixing two types of predicates. But the patch is really about "Filter" predicates - moving some of them from table to index. So quite similar to the "index filter predicates" except that those are handled by the index AM.

I understand that that's how the patch is structured. It is nevertheless possible (as things stand) that the patch will make the planner shift from a plan that uses "Access Predicates" to the maximum extent possible when scanning a composite index, to a similar plan that has a similar index scan, for the same index, but with fewer "Access Predicates" in total. In effect, the patched planner will swap one type of predicate for another type because doing so enables the executor to scan an index once instead of scanning it several times.

I don't dispute the fact that this can only happen when the planner believes (with good reason) that the expected cost will be lower. But I maintain that there is a novel risk to be concerned about, which is meaningfully distinct from the general risk of regressions that comes from making just about any change to the planner. The important principle here is that we should have a strong bias in the direction of making quals into true "Access Predicates" whenever practical.

Yeah, technically the patch doesn't directly disrupt how existing index paths get generated. But it does have the potential to disrupt it indirectly, by providing an alternative, very similar index path that's likely to outcompete the existing one in these cases. I think that we should have just one index path that does everything well instead.

> But differentiating between access and filter predicates (at the index AM level) seems rather independent of what this patch aims to do.

My concern is directly related to the question of "access predicates versus filter predicates", and the planner's current ignorance of which is which. The difference may not matter too much right now, but ISTM that your patch makes it matter a lot more. And so in that indirect sense it does seem relevant.

The planner has always had a strong bias in the direction of making clauses that can be index quals into index quals, rather than filter predicates. It makes sense to do that even when they aren't very selective. This is a similar, though distinct, principle.

It's awkward to discuss the issue, since we don't really have official names for these things just yet (I'm just going with what Markus calls them for now), and because there is more than one type of "index filter predicate" in play with the patch (namely those in the index AM, and those in the index scan executor node). But my concern boils down to access predicates being strictly better than equivalent index filter predicates.

> I'm not following. Why couldn't there be some second-order effects? Maybe it's obvious / implied by something said earlier, but I think every time we decide between multiple choices, there's a risk of picking wrong.

Naturally, I agree that some amount of risk is inherent. I believe that the risk I have identified is qualitatively different from the standard, inherent kind of risk.

> Anyway, is this still about this patch or rather about your SAOP patch?

It's mostly not about my SAOP patch. It's just that my SAOP patch will do exactly the right thing in this particular case (at least once combined with Alena Rybakina's OR-to-SAOP transformation patch). It is therefore quite clear that a better plan is possible in principle. Clearly we *can* have the benefit of only one single index scan (i.e. no BitmapOr node combining multiple similar index scans), without accepting any novel new risk to get that benefit. We should just have one index path that does it all, while avoiding duplicative index paths that embody a distinction that really shouldn't exist -- it should be dealt with at runtime instead.

> True. IMHO the danger of "index filter" predicates is that people assume index conditions eliminate large parts of the index - which is not necessarily the case here. If we can improve this, cool.

I think that we'd find a way to use the same information in the planner, too. It's not just users that should care about the difference. And I don't think that it's particularly likely to be limited to SAOP/MDAM stuff. As
Re: Extract numeric filed in JSONB more effectively
Hi Jian:

> return PointerGetDatum(v->val.numeric);
> should be something like
> PG_RETURN_NUMERIC(v->val.numeric);
> ?

Thanks for this reminder; a new patch is attached, and a commitfest entry has been added as well [1]. For the record, I compared the new operator with all the existing operators:

select 1 from tb where (a->'a')::numeric = 2;   30.56ms
select 1 from tb where (a->>'a')::numeric = 2;  29.43ms
select 1 from tb where (a@->'a') = 2;           14.80ms

[1] https://commitfest.postgresql.org/44/4476/

--
Best Regards
Andy Fan

v2-0001-Add-jsonb-operator-to-return-a-numeric-directly.patch
Description: Binary data
Re: Configurable FP_LOCK_SLOTS_PER_BACKEND
I thought it might be helpful to share some more details from one of the case studies behind Nik's suggestion.

Bursty contention on lock_manager lwlocks recently became a recurring cause of query throughput drops for GitLab.com, and we got to study the behavior via USDT and uprobe instrumentation along with more conventional observations (see https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301). This turned up some interesting finds, and I thought sharing some of that research might be helpful.

Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would have a much larger positive impact than any other mitigation strategy we have evaluated. Rather than reducing hold duration or collision rate, adding fastpath slots reduces the frequency of even having to acquire those lock_manager lwlocks. I suspect this would be helpful for many other workloads, particularly those having high-frequency queries whose tables collectively have more than about 16 or so indexes.

Lowering the lock_manager lwlock acquisition rate means lowering its contention rate (and probably also its contention duration, since exclusive mode forces concurrent lockers to queue). I'm confident this would help our workload, and I strongly suspect it would be generally helpful by letting queries use fastpath locking more often.

> However, the lmgr/README says this is meant to alleviate contention on
> the lmgr partition locks. Wouldn't it be better to increase the number
> of those locks, without touching the PGPROC stuff?

That was my first thought too, but growing the lock_manager lwlock tranche isn't nearly as helpful. On the slowpath, each relation's lock tag deterministically hashes onto a specific lock_manager lwlock, so growing the number of lock_manager lwlocks just makes it less likely for two or more frequently locked relations to hash onto the same lock_manager lwlock. In contrast, growing the number of fastpath slots completely avoids calls to the slowpath (i.e. no need to acquire a lock_manager lwlock at all).

The saturation condition we'd like to solve is heavy contention on one or more of the lock_manager lwlocks. Since that is driven by the slowpath acquisition rate of heavyweight locks, avoiding the slowpath is better than just moderately reducing the contention on the slowpath.

To be fair, increasing the number of lock_manager locks definitely can help to a certain extent, but it doesn't cover an important general case. As a thought experiment, suppose we increase the lock_manager tranche to some arbitrarily large size, larger than the number of relations in the db. This unrealistically large size means we have the best case for avoiding collisions -- each relation maps uniquely onto its own lock_manager lwlock. That helps a lot in the case where the workload is spread among many non-overlapping sets of relations. But it doesn't help a workload where any one table is accessed frequently via slowpath locking.

Continuing the thought experiment, if that frequently queried table has 16 or more indexes, or if it is joined to other tables that collectively add up to over 16 relations, then each of those queries is guaranteed to have to use the slowpath and acquire the deterministically associated lock_manager lwlocks. So growing the tranche of lock_manager lwlocks would help for some workloads, while other workloads would not be helped much at all. (As a concrete example, a workload at GitLab has several frequently queried tables with over 16 indexes that consequently always use at least some slowpath locks.)

For additional context:

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate
Summarizes the pathology and its current mitigations.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678
Documents the supporting research methodology.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510
What code paths require an exclusive mode lwlock for lock_manager?

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142
Comparison of fastpath vs. slowpath locking, including quantifying the rate difference.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726
Confirms the acquisition rate of lock_manager locks is not uniform. The sampled workload has a 3x difference in the most vs. least frequently acquired lock_manager lock, corresponding to the workload's most frequently accessed relations.

> Well, that has a cost too, as it makes PGPROC larger, right? At the
> moment that struct is already ~880B / 14 cachelines, adding 48 XIDs
> would make it +192B / +3 cachelines. I doubt that won't impact other
> common workloads ...

That's true; growing the data structure may affect L2/L3 cache hit rates when touching PGPROC. Is that cost worth the benefit of using fastpath for a higher percentage of table locks? The answer may be workload- and
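As a rough way to see how much of a live workload is already benefiting from fastpath locking, the pg_locks view reports whether each currently-held lock was taken via the fastpath. A sketch (run against a busy system; results obviously vary by workload):

```sql
-- Share of currently-held relation locks acquired via the fastpath.
-- Locks with fastpath = false went through the shared lock manager,
-- and therefore through its partitioned lock_manager lwlocks.
SELECT fastpath, count(*) AS held_locks
FROM pg_locks
WHERE locktype = 'relation'
GROUP BY fastpath;
```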
First draft of back-branch release notes is done
... at https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c6344d7686f3e3c8243c2c6771996cfc63e71eae This is a bit earlier than usual because I will be tied up with $LIFE for the next couple of days. I will catch up with any subsequent back-branch commits over the weekend. As usual, please send comments/corrections by Sunday. regards, tom lane
Re: [PATCH] Allow Postgres to pick an unused port to listen
> On 10 Jul 2023, at 14:27, Daniel Gustafsson wrote:
>
> This patch is Waiting on Author in the current commitfest with no new patch presented following the discussion here. Is there an update ready, or should we close it in this CF in favour of a future one?

Since the thread stalled here with the patch waiting on author since May, I will go ahead and mark it Returned with Feedback in this CF. Please feel free to re-open a new entry in a future CF.

--
Daniel Gustafsson
Re: An attempt to avoid locally-committed-but-not-replicated-to-standby-transactions in synchronous replication
> On 30 Jan 2023, at 06:55, Bharath Rupireddy wrote:
>
> I started to spend time on this feature again. Thanks all for your comments so far.

Since there haven't been any updates for the past six months, and the patch hasn't applied for a few months, I am marking this Returned with Feedback for now. Please feel free to open a new entry in a future CF for this patch when there is a new version.

--
Daniel Gustafsson
Re: POC, WIP: OR-clause support for indexes
On Wed, Aug 2, 2023 at 8:58 AM Alena Rybakina wrote:
> No, I haven't thought about it yet. I studied the example and it would really be nice to add optimization here. I didn't notice any problems with its implementation. I also have an obvious example with the "or" operator, for example:
>
> select * from multi_test where (a, b) = (1, 1) or (a, b) = (2, 1) ...;
>
> Although I think such a case will be used less often.

Right. As I said, I don't particularly care about the row constructor syntax -- it's not essential.

In my experience, patches like this one that ultimately don't succeed usually *don't* have specific problems that cannot be fixed. The real problem tends to be ambiguity about the general high-level design. So, more than anything else, ambiguity is the thing that you need to minimize to be successful here. This is the #1 practical problem, by far. This may be the only thing about your patch that I feel 100% sure of.

In my experience it can actually be easier to expand the scope of a project, and to come up with a more general solution:

https://en.wikipedia.org/wiki/Inventor%27s_paradox

I'm not trying to make your work more difficult by expanding its scope. I'm actually trying to make your work *easier* by expanding its scope. I don't claim to know what the specific scope of your patch should be at all. Just that it might be possible to get a much clearer picture of what the ideal scope really is by *trying* to generalize it further -- that understanding is what we lack right now. Even if this exercise fails in some way, it won't really have been a failure. The reasons why it fails will still be interesting and practically relevant.

> It seems to me the most difficult thing is to notice problematic cases where the transformations are incorrect, but I think it can be implemented.

Right. To be clear, I am sure that it won't be practical to come up with a 100% theoretically pure approach. If for no other reason than this: normalizing to CNF in all cases will run into problems with very complex predicates. It might even be computationally intractable (or could just be very slow). So there is a clear need to keep theoretical purity in balance with practical considerations -- there is a need for some kind of negotiation between those two things. Probably some set of heuristics will ultimately be required to keep costs and benefits in balance.

> I agree with your position, but I still don't understand how to consider transformations to generalized cases without relying on special cases.

Me neither. I wish I could say something a bit less vague here. I don't expect you to determine what set of heuristics will ultimately be required to decide when and how to perform CNF conversions, in the general case. But having at least some vague idea seems like it might increase confidence in your design.

> As I understand it, you assume that it is possible to apply transformations at the index creation stage, but there I came across the selectivity overestimation problem. I still haven't found a solution for this problem.

Do you think that this problem is just an accidental side-effect? It isn't necessarily the responsibility of your patch to fix such things. If it's even possible for selectivity estimates to change, then it's already certain that sometimes they'll be worse than before -- if only because of chance interactions. The optimizer is often right for the wrong reasons, and wrong for the right reasons -- we cannot really expect any patch to completely avoid problems like that.

> To be honest, I think that in your examples I understand better what you mean by normalization to conjunctive normal form, because I only had a theoretical idea from the logic course. Hence, yes, normalization/security checks - now I understand why they are necessary.

As I explained to Jim, I am trying to put things in this area on a more rigorous footing. For example, I have said that the way that the nbtree code executes SAOP quals is equivalent to DNF. That is basically true, but it's also my own slightly optimistic interpretation of history and of the design. That's a good start, but it's not enough on its own. My interpretation might still be wrong in some subtle way that I have yet to discover.

That's really what I'm concerned about with your patch, too. I'm currently trying to solve a problem that I don't yet fully understand, so for me "getting a properly working flow of information" seems like a good practical exercise. I'm trying to generalize the design of my own patch as far as I can, to see what breaks, and why it breaks. My intuition is that this will help me with my own patch by forcing me to gain a truly rigorous understanding of the problem.

My suggestion about generalizing your approach to cover RowCompareExpr cases is what I would do if I were you and this were my patch. That's almost exactly what I'm doing with my own patch already, in fact.

--
Peter Geoghegan
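For what it's worth, Alena's row-constructor example has a direct SQL-level analogue of the kind of transformation being discussed. A sketch, assuming a multi_test table with integer columns a and b:

```sql
-- The duplicated row-constructor ORs from the example:
SELECT * FROM multi_test
WHERE (a, b) = (1, 1) OR (a, b) = (2, 1);

-- An equivalent form that expresses the same predicate as one construct,
-- analogous to what an OR-to-SAOP style transformation would target:
SELECT * FROM multi_test
WHERE (a, b) IN ((1, 1), (2, 1));
```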
Re: Synchronizing slots from primary to standby
On Tue, Aug 1, 2023 at 5:01 PM shveta malik wrote: > > > The work division amongst the sync workers can > > be simple, the logical replication launcher builds a shared memory > > structure based on number of slots to sync and starts the sync workers > > dynamically, and each sync worker picks {dboid, slot name, conninfo} > > from the shared memory, syncs it and proceeds with other slots. > > Do you mean the logical replication launcher builds a shared memory > structure based > on the number of 'dbs' to sync as I understood from your initial comment? Yes. I haven't looked at the 0003 patch posted upthread. However, the standby must do the following at a minimum: - Make GUCs synchronize_slot_names and max_slot_sync_workers of PGC_POSTMASTER type needing postmaster restart when changed as they affect the number of slot sync workers. - LR (logical replication) launcher connects to primary to fetch the logical slots specified in synchronize_slot_names. This is a one-time task. - LR launcher prepares a dynamic shared memory (created via dsm_create) with some state like locks for IPC and an array of {slot_name, dboid_associated_with_slot, is_sync_in_progress} - maximum number of elements in the array is the number of slots specified in synchronize_slot_names. This is a one-time task. - LR launcher decides the *best* number of slot sync workers - (based on some perf numbers) it can just launch, say, one worker per 2 or 4 or 8 etc. slots. - Each slot sync worker then picks up a slot from the DSM, connects to primary using primary conn info, syncs it, and moves to another slot. Not having the capability of on-demand stop/launch of slot sync workers makes the above design simple IMO. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
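[Editor's illustration] The work-division policy sketched above ("one worker per 2 or 4 or 8 etc. slots") can be modeled in a few lines. This is a toy Python sketch, not the proposed C code; the dict fields mimic the proposed DSM array entries {slot_name, dboid, is_sync_in_progress}:

```python
def assign_slots(slots, slots_per_worker):
    """Toy launcher policy: compute the number of sync workers from the
    slot count (one worker per N slots) and hand each worker a
    contiguous chunk of the shared array."""
    return [slots[i:i + slots_per_worker]
            for i in range(0, len(slots), slots_per_worker)]

# entries mimic the proposed DSM array elements
slots = [{"slot_name": "slot%d" % i, "dboid": 16384 + i, "in_progress": False}
         for i in range(10)]
per_worker = assign_slots(slots, 4)  # 10 slots, N=4 -> 3 workers
```

A real implementation would have workers claim slots dynamically under a lock rather than taking fixed chunks, but the static split shows the sizing arithmetic.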
Re: add timing information to pg_upgrade
On Wed, Aug 02, 2023 at 09:09:14AM -0700, Nathan Bossart wrote: > On Wed, Aug 02, 2023 at 01:02:53PM +0530, Bharath Rupireddy wrote: >> On Wed, Aug 2, 2023 at 12:45 PM Peter Eisentraut >> wrote: >>> I think we should change the output format to be more like initdb, like >>> >>> Doing something ... ok >>> >>> without horizontally aligning all the "ok"s. >> >> While this looks simple, we might end up with a lot of diff and >> changes after removing MESSAGE_WIDTH. There's a significant part of >> pg_upgrade code that deals with MESSAGE_WIDTH. I don't think it's >> worth the effort. Therefore, I'd prefer the simplest possible fix - >> change the message to '"Checking for \"aclitem\" data type in user >> tables". It may be an overkill, but we can consider adding >> Assert(sizeof(message) < MESSAGE_WIDTH) in progress report functions >> to not encourage new messages to end up in the same formatting issue. > > I don't think it's that difficult. IMO the bigger question is whether we > want to back-patch such a change to v16 at this point. Here is a work-in-progress patch that seems to get things pretty close to what Peter is suggesting. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 1a8aab9c9c86c8fa140ccdb53d3479a77de0e762 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 2 Aug 2023 10:20:11 -0700 Subject: [PATCH 1/1] work-in-progress: fix pg_upgrade output --- src/bin/pg_upgrade/util.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/src/bin/pg_upgrade/util.c b/src/bin/pg_upgrade/util.c index 21ba4c8f12..bea6d7289d 100644 --- a/src/bin/pg_upgrade/util.c +++ b/src/bin/pg_upgrade/util.c @@ -50,10 +50,10 @@ end_progress_output(void) if (log_opts.isatty) { printf("\r"); - pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, ""); + pg_log(PG_REPORT_NONL, "%-*s\r ", MESSAGE_WIDTH, ""); } else if (log_opts.verbose) - pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, ""); + pg_log(PG_REPORT_NONL, " "); } /* @@ -114,9 +114,6 @@ cleanup_output_dirs(void) * prep_status * * Displays a message that describes an operation we are about to begin. - * We pad the message out to MESSAGE_WIDTH characters so that all of the - * "ok" and "failed" indicators line up nicely. (Overlength messages - * will be truncated, so don't get too verbose.) * * A typical sequence would look like this: * prep_status("about to flarb the next %d files", fileCount); @@ -135,8 +132,7 @@ prep_status(const char *fmt,...) vsnprintf(message, sizeof(message), fmt, args); va_end(args); - /* trim strings */ - pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message); + pg_log(PG_REPORT_NONL, "%s ... ", message); } /* @@ -167,9 +163,9 @@ prep_status_progress(const char *fmt,...) * put the individual progress items onto the next line. */ if (log_opts.isatty || log_opts.verbose) - pg_log(PG_REPORT, "%-*s", MESSAGE_WIDTH, message); + pg_log(PG_REPORT, "%s ...", message); else - pg_log(PG_REPORT_NONL, "%-*s", MESSAGE_WIDTH, message); + pg_log(PG_REPORT_NONL, "%s ... ", message); } static void -- 2.25.1
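[Editor's illustration] For readers skimming the diff above: the change boils down to dropping the fixed-width "%-*s" padding in favor of an initdb-style trailing " ... ". A quick sketch of the two string formats (Python here just to show the formatting; the real code is C printf-style pg_log calls):

```python
MESSAGE_WIDTH = 60  # illustrative width; the real constant is in pg_upgrade

def aligned_style(message):
    # old behavior: pad to a fixed width so every "ok"/"failed" lines up;
    # an overlength message overflows the column and breaks the alignment
    return "%-*s" % (MESSAGE_WIDTH, message) + "ok"

def initdb_style(message):
    # proposed behavior: no column alignment, just "message ... ok"
    return "%s ... ok" % message
```

The initdb style sidesteps the overlength-message problem entirely, which is what started the thread.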
Re: SIGQUIT handling, redux
Hi, On 2023-08-02 12:35:19 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2020-09-11 11:52:55 -0400, Tom Lane wrote: > >> It's simple enough that maybe we could back-patch it, once it's > >> aged awhile in HEAD. OTOH, given the lack of field reports of > >> trouble here, I'm not sure back-patching is worth the risk. > > > FWIW, looking at collected stack traces in azure, there's a slow but steady > > stream of crashes below StartupPacketTimeoutHandler. ... > > Unsurprisingly just in versions before 14, where this change went in. > > I think that might be enough evidence for backpatching the commit? I've not > > heard of issues due to the checks in check_on_shmem_exit_lists_are_empty(). > > I'd be willing to take a look at this in a few weeks when $real_life > is a bit less demanding. Cool. > Right before minor releases would likely be a bad idea anyway. Agreed. I had just waded through the stacks, so I thought it'd be worth bringing up, didn't intend to imply we should backpatch immediately. Aside: Analyzing this kind of thing at scale is made considerably more painful by "expected" ereport(PANIC)s (say out of disk space during WAL writes) calling abort() and dumping core... While there are other PANICs you really want cores of. Greetings, Andres Freund
Re: SIGQUIT handling, redux
Andres Freund writes: > On 2020-09-11 11:52:55 -0400, Tom Lane wrote: >> It's simple enough that maybe we could back-patch it, once it's >> aged awhile in HEAD. OTOH, given the lack of field reports of >> trouble here, I'm not sure back-patching is worth the risk. > FWIW, looking at collected stack traces in azure, there's a slow but steady > stream of crashes below StartupPacketTimeoutHandler. ... > Unsurprisingly just in versions before 14, where this change went in. > I think that might be enough evidence for backpatching the commit? I've not > heard of issues due to the checks in check_on_shmem_exit_lists_are_empty(). I'd be willing to take a look at this in a few weeks when $real_life is a bit less demanding. Right before minor releases would likely be a bad idea anyway. regards, tom lane
Re: SIGQUIT handling, redux
Hi, On 2020-09-11 11:52:55 -0400, Tom Lane wrote: > It's simple enough that maybe we could back-patch it, once it's > aged awhile in HEAD. OTOH, given the lack of field reports of > trouble here, I'm not sure back-patching is worth the risk. FWIW, looking at collected stack traces in azure, there's a slow but steady stream of crashes below StartupPacketTimeoutHandler. Most seem to be things like libcrypto->malloc->StartupPacketTimeoutHandler->proc_exit->socket_close->free->crash there's a few other variants, some where the stack apparently was not decipherable for the relevant tooling. Note that this wouldn't even include cases where this caused hangs - which is quite common IME. Unsurprisingly just in versions before 14, where this change went in. I think that might be enough evidence for backpatching the commit? I've not heard of issues due to the checks in check_on_shmem_exit_lists_are_empty(). Greetings, Andres Freund
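[Editor's illustration] The crash stack quoted above (malloc -> handler -> proc_exit -> free) is the classic async-signal-safety problem: free() is not safe to call from a handler that interrupted malloc(). The conventional safe pattern, shown here in Python for brevity (in C it would use a volatile sig_atomic_t flag), is for the handler to only record that the signal arrived and let the main loop act on it at a safe point:

```python
import signal

shutdown_requested = False

def timeout_handler(signum, frame):
    # the only safe thing a handler should do: set a flag; doing real
    # work (free/exit) here is what produces crashes like those above
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGUSR1, timeout_handler)
signal.raise_signal(signal.SIGUSR1)  # simulate the timeout firing

# the "main loop" then performs cleanup at a safe point
handled_at_safe_point = shutdown_requested
```

This is only the general pattern; the actual fix discussed in the thread reworks how the startup packet timeout exits, it does not literally look like this.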
Re: pg_upgrade fails with in-place tablespace
> I don't think that your solution is the correct move. Having > pg_tablespace_location() return the physical location of the > tablespace is very useful because that's the location where the > physical files are, and client applications don't need to know the > logic behind the way a path is built. I understand your concern. I suggest defining the standard behavior of the in-place tablespace to match that of the pg_default and pg_global tablespaces. The location of the pg_default and pg_global tablespaces is not a concern because they have default paths. Similarly, the location of the in-place tablespace is not a concern because they are all pg_tblspc/. Another reason is that, up until now, we have been creating in-place tablespaces using the following command: CREATE TABLESPACE space_test LOCATION '' However, when using pg_dump to back up this in-place tablespace, it is dumped with a different path: CREATE TABLESPACE space_test LOCATION 'pg_tblspc/' This can be confusing because it does not match the initial path that we created. Additionally, pg_restore cannot restore this in-place tablespace because the CREATE TABLESPACE command only supports an empty path for creating in-place tablespaces. > That may not work on Windows where the drive letter is appended at > the beginning of the path, no? There is is_absolute_path() to do this > job in a more portable way. You are correct. I have made the necessary modifications to ensure compatibility with in-place tablespaces. Please find the updated code attached. -- Best regards, Rui Zhao -- From:Michael Paquier Sent At:2023 Aug. 1 (Tue.) 08:57 To:Mark Cc:pgsql-hackers Subject:Re: pg_upgrade fails with in-place tablespace On Sat, Jul 29, 2023 at 11:10:22PM +0800, Rui Zhao wrote: > 2) Only check the tablespace with an absolute path in pg_upgrade. > There are also other solutions, such as supporting the creation of > relative-path tablespace in function CreateTableSpace. But do we > really need relative-path tablespace? 
I think in-place tablespace > is enough by now. My solution may be more lightweight and > harmless. + /* The path of the in-place tablespace is always pg_tblspc/. */ if (!S_ISLNK(st.st_mode)) - PG_RETURN_TEXT_P(cstring_to_text(sourcepath)); + PG_RETURN_TEXT_P(cstring_to_text("")); I don't think that your solution is the correct move. Having pg_tablespace_location() return the physical location of the tablespace is very useful because that's the location where the physical files are, and client applications don't need to know the logic behind the way a path is built. - " spcname != 'pg_global'"); + " spcname != 'pg_global' AND " + " pg_catalog.pg_tablespace_location(oid) ~ '^/'"); That may not work on Windows where the drive letter is appended at the beginning of the path, no? There is is_absolute_path() to do this job in a more portable way. -- Michael 0001-Fix-pg_upgrade-with-in-place-tablespaces.patch Description: Binary data
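[Editor's illustration] Michael's portability point can be made concrete with a simplified stand-in for is_absolute_path() (Python sketch; the real thing is a C macro in PostgreSQL's headers): on Windows an absolute path may begin with a drive letter rather than a slash, so a regex match on '^/' misses it.

```python
def is_absolute_path_portable(path):
    """Simplified sketch of the is_absolute_path() idea: a leading
    directory separator, or a Windows drive letter like 'C:/'."""
    if not path:
        return False
    if path[0] in "/\\":
        return True
    # Windows drive-letter form, e.g. "C:/pgdata" or "C:\\pgdata"
    return (len(path) >= 3 and path[0].isalpha()
            and path[1] == ":" and path[2] in "/\\")
```

An in-place tablespace path such as "pg_tblspc/16385" is relative under this test, which is exactly the distinction the pg_upgrade query needs.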
Re: add timing information to pg_upgrade
On Wed, Aug 02, 2023 at 01:02:53PM +0530, Bharath Rupireddy wrote: > On Wed, Aug 2, 2023 at 12:45 PM Peter Eisentraut wrote: >> I think we should change the output format to be more like initdb, like >> >> Doing something ... ok >> >> without horizontally aligning all the "ok"s. > > While this looks simple, we might end up with a lot of diff and > changes after removing MESSAGE_WIDTH. There's a significant part of > pg_upgrade code that deals with MESSAGE_WIDTH. I don't think it's > worth the effort. Therefore, I'd prefer the simplest possible fix - > change the message to '"Checking for \"aclitem\" data type in user > tables". It may be an overkill, but we can consider adding > Assert(sizeof(message) < MESSAGE_WIDTH) in progress report functions > to not encourage new messages to end up in the same formatting issue. I don't think it's that difficult. IMO the bigger question is whether we want to back-patch such a change to v16 at this point. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: PATCH: Using BRIN indexes for sorted output
On 8/2/23 17:25, Sergey Dudoladov wrote: > Hello, > >> Parallel version is not supported, but I think it should be possible. > > @Tomas are you working on this ? If not, I would like to give it a try. > Feel free to try. Just keep it in a separate part/patch, to make it easier to combine the work later. >> static void >> AssertCheckRanges(BrinSortState *node) >> { >> #ifdef USE_ASSERT_CHECKING >> >> #endif >> } > > I guess it should not be empty at the ongoing development stage. > > Attached a small modification of the patch with a draft of the docs. > Thanks. FWIW it's generally better to always post the whole patch series, otherwise the cfbot gets confused as it's unable to combine stuff from different messages. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: add timing information to pg_upgrade
On Wed, Aug 02, 2023 at 09:14:06AM +0200, Peter Eisentraut wrote: > On 01.08.23 18:00, Nathan Bossart wrote: >> One of the main purposes of this thread is to gauge interest. I'm hoping >> there are other developers who are interested in reducing >> pg_upgrade-related downtime, and it seemed like it'd be nice to have >> built-in functionality for measuring the step times instead of requiring >> folks to apply this patch every time. And I think users might also be >> interested in this information, if for no other reason than to help us >> pinpoint which steps are taking longer for various workloads. > > If it's just for developers and expert users, perhaps existing shell-level > functionality like "pg_upgrade | ts" would suffice? Sure, we could just leave it at that unless anyone sees a reason to bake it in. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
On Wed, Aug 2, 2023 at 6:36 PM Amul Sul wrote: > > Hi, > > Currently, we have an option to drop the expression of stored generated > columns > as: > > ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ] > > But don't have support to update that expression. The attached patch provides > that as: > > ALTER [ COLUMN ] column_name SET EXPRESSION expression > > Note that this form of ALTER is meant to work for the column which is already > generated. It then changes the generation expression in the catalog and > rewrite > the table, using the existing table rewrite facilities for ALTER TABLE. > Otherwise, an error will be reported. > > To keep the code flow simple, I have renamed the existing function that was in > use for DROP EXPRESSION so that it can be used for SET EXPRESSION as well, > which is a similar design as SET/DROP DEFAULT. I kept this renaming code > changes in a separate patch to minimize the diff in the main patch. > > Demo: > -- Create table > CREATE TABLE t1 (x int, y int GENERATED ALWAYS AS (x * 2) STORED); > INSERT INTO t1 VALUES(generate_series(1,3)); > > -- Check the generated data > SELECT * FROM t1; > x | y > ---+--- > 1 | 2 > 2 | 4 > 3 | 6 > (3 rows) > > -- Alter the expression > ALTER TABLE t1 ALTER COLUMN y SET EXPRESSION (x * 4); > > -- Check the new data > SELECT * FROM t1; > x | y > ---+ > 1 | 4 > 2 | 8 > 3 | 12 > (3 rows) > > Thank you. > -- > Regards, > Amul Sul > EDB: http://www.enterprisedb.com - setup. 
BEGIN;
set search_path = test;
DROP TABLE if exists gtest_parent, gtest_child;
CREATE TABLE gtest_parent (f1 date NOT NULL, f2 bigint, f3 bigint GENERATED ALWAYS AS (f2 * 2) STORED) PARTITION BY RANGE (f1);
CREATE TABLE gtest_child PARTITION OF gtest_parent FOR VALUES FROM ('2016-07-01') TO ('2016-08-01'); -- inherits gen expr
CREATE TABLE gtest_child2 PARTITION OF gtest_parent (
    f3 WITH OPTIONS GENERATED ALWAYS AS (f2 * 22) STORED -- overrides gen expr
) FOR VALUES FROM ('2016-08-01') TO ('2016-09-01');
CREATE TABLE gtest_child3 (f1 date NOT NULL, f2 bigint, f3 bigint GENERATED ALWAYS AS (f2 * 33) STORED);
ALTER TABLE gtest_parent ATTACH PARTITION gtest_child3 FOR VALUES FROM ('2016-09-01') TO ('2016-10-01');
INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 1);
INSERT INTO gtest_parent (f1, f2) VALUES ('2016-07-15', 2);
INSERT INTO gtest_parent (f1, f2) VALUES ('2016-08-15', 3);
UPDATE gtest_parent SET f1 = f1 + 60 WHERE f2 = 1;
ALTER TABLE ONLY gtest_parent ALTER COLUMN f3 SET EXPRESSION (f2 * 4);
ALTER TABLE gtest_child ALTER COLUMN f3 SET EXPRESSION (f2 * 10);
COMMIT;

set search_path = test;
SELECT table_name, column_name, is_generated, generation_expression
FROM information_schema.columns
WHERE table_name in ('gtest_child','gtest_child1', 'gtest_child2','gtest_child3')
order by 1,2;

result:
  table_name  | column_name | is_generated | generation_expression
--------------+-------------+--------------+-----------------------
 gtest_child  | f1          | NEVER        |
 gtest_child  | f1          | NEVER        |
 gtest_child  | f2          | NEVER        |
 gtest_child  | f2          | NEVER        |
 gtest_child  | f3          | ALWAYS       | (f2 * 2)
 gtest_child  | f3          | ALWAYS       | (f2 * 10)
 gtest_child2 | f1          | NEVER        |
 gtest_child2 | f1          | NEVER        |
 gtest_child2 | f2          | NEVER        |
 gtest_child2 | f2          | NEVER        |
 gtest_child2 | f3          | ALWAYS       | (f2 * 22)
 gtest_child2 | f3          | ALWAYS       | (f2 * 2)
 gtest_child3 | f1          | NEVER        |
 gtest_child3 | f1          | NEVER        |
 gtest_child3 | f2          | NEVER        |
 gtest_child3 | f2          | NEVER        |
 gtest_child3 | f3          | ALWAYS       | (f2 * 2)
 gtest_child3 | f3          | ALWAYS       | (f2 * 33)
(18 rows)

One partition ends up with 2 generated expressions for a single column. Is this the expected behavior? In the regression tests, you could replace \d table_name with an SQL query (similar to the one above) to get the generated-expression metadata, since here you want the metadata to be correct, and then add one SELECT query to verify that the generated expression behaves sanely.
Re: PATCH: Using BRIN indexes for sorted output
Hello, > Parallel version is not supported, but I think it should be possible. @Tomas are you working on this ? If not, I would like to give it a try. > static void > AssertCheckRanges(BrinSortState *node) > { > #ifdef USE_ASSERT_CHECKING > > #endif > } I guess it should not be empty at the ongoing development stage. Attached a small modification of the patch with a draft of the docs. Regards, Sergey Dudoladov From d4050be4bfd0a518eba0ff0a7b561f0420be9861 Mon Sep 17 00:00:00 2001 From: Sergey Dudoladov Date: Wed, 2 Aug 2023 16:47:35 +0200 Subject: [PATCH] BRIN sort: add docs / fix typos --- doc/src/sgml/brin.sgml | 48 + src/backend/executor/nodeBrinSort.c | 9 +++--- 2 files changed, 53 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index 9c5ffcddf8..a76f26e032 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -810,6 +810,54 @@ LOG: request for BRIN range summarization for index "brin_wi_idx" page 128 was + + BRIN Sort + + + This section provides an overview of BRIN sort that may be + useful to advanced users. See src/backend/executor/nodeBrinSort.c + in the source distribution for the implementation. + + + + Overview + + The information about ranges in a BRIN index can be used + to split a table into parts and sort each part separately. A BRIN + sort is an index scan that uses the idea to incrementally sort the data. On + large tables BRIN index scan bridges the performance gap + between having a large B-tree index that supports ordered retrieval, and having + no index at all. + + + + + Implementation + + + BRIN sort fetches a list of page ranges from the BRIN + index and sorts them by their summary info depending on the operator class and + sort direction. It then determines the watermark: the next summary value from + the list. Tuples with summary values less than the watermark are sorted and + outputted directly; tuples with values greater than the watermark are spilled + into the next iteration. 
The process continues until either all ranges are + processed or the desired number of tuples has been retrieved. + + + + + Limitations + + The performance of the BRIN sort depends on the degree of + correlation between the values of the indexed column and their physical location + on disk: in the case of high correlation, there will be few overlapping ranges, + and performance will be best. Another limitation is that BRIN + sort only works for indexes built on a single column. + + + + + Extensibility diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c index d1b6b4e1ed..c640aea2d5 100644 --- a/src/backend/executor/nodeBrinSort.c +++ b/src/backend/executor/nodeBrinSort.c @@ -731,8 +731,9 @@ brinsort_load_spill_tuples(BrinSortState *node, bool check_watermark) static bool brinsort_next_range(BrinSortState *node, bool asc) { - /* FIXME free the current bs_range, if any */ - node->bs_range = NULL; + /* free the current bs_range, if any */ + if (node->bs_range != NULL) + pfree(node->bs_range); /* * Mark the position, so that we can restore it in case we reach the @@ -1227,7 +1228,7 @@ IndexNext(BrinSortState *node) case BRINSORT_PROCESS_RANGE: -elog(DEBUG1, "phase BRINSORT_PROCESS_RANGE"); +elog(DEBUG1, "phase = BRINSORT_PROCESS_RANGE"); slot = node->ss.ps.ps_ResultTupleSlot; @@ -1263,7 +1264,7 @@ IndexNext(BrinSortState *node) } else { - /* updte the watermark and try reading more ranges */ + /* update the watermark and try reading more ranges */ node->bs_phase = BRINSORT_LOAD_RANGE; brinsort_update_watermark(node, false, asc); } -- 2.34.1
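[Editor's illustration] The watermark mechanism described in the docs draft above can be simulated in a few lines. This is a toy Python model, not the executor code: each range carries its minimum summary value, ranges are processed in summary order, and tuples at or above the next range's minimum (the watermark) spill into the next iteration, which is why overlapping ranges degrade performance:

```python
def brin_sort(ranges):
    """Toy ascending BRIN sort. Each range is (min_summary, tuples).
    At each step the watermark is the next range's minimum; tuples
    below it are sorted and emitted, the rest spill onward."""
    ranges = sorted(ranges, key=lambda r: r[0])
    output, spill = [], []
    for i, (_, tuples) in enumerate(ranges):
        pending = spill + list(tuples)
        watermark = ranges[i + 1][0] if i + 1 < len(ranges) else float("inf")
        output.extend(sorted(t for t in pending if t < watermark))
        spill = [t for t in pending if t >= watermark]
    output.extend(sorted(spill))
    return output
```

With highly correlated data almost nothing spills and each per-step sort is small; with overlapping ranges most tuples keep spilling forward, matching the limitation noted in the draft.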
Re: Improve const use in zlib-using code
Peter, I like the idea. Though the way you have it implemented at the moment seems like a trap in that any time zlib.h is included someone also has to remember to add this define. I would recommend adding the define to the build systems instead. Since you put in the work to find the version of zlib that added this, you might also want to add `required: '>= 1.2.5.2'` to the `dependency('zlib')` call in the main meson.build. I am wondering if we could remove the `z_streamp` check too. The version in which it was added isn't obvious. -- Tristan Partin Neon (https://neon.tech)
Re: Use of additional index columns in rows filtering
On 8/2/23 02:50, Peter Geoghegan wrote: > On Mon, Jul 24, 2023 at 11:59 AM Peter Geoghegan wrote: >>> That might be true but I'm not sure what to do about that unless we >>> incorporate some "robustness" measure into the costing. If every >>> measure we have says one plan is better, don't we have to choose it? >> >> I'm mostly concerned about the possibility itself -- it's not a matter >> of tuning the cost. I agree that that approach would probably be >> hopeless. > > This seems related to the fact that EXPLAIN doesn't expose the > difference between what Markus Winand calls "Access Predicates" and > "Index Filter Predicates", as explained here: > > https://use-the-index-luke.com/sql/explain-plan/postgresql/filter-predicates > > That is, both "Access Predicates" and "Index Filter Predicates" are > shown after an "Index Cond: " entry in Postgres EXPLAIN output, in > general. Even though these are two very different things. I believe > that the underlying problem for the implementation (the reason why we > can't easily break this out further in EXPLAIN output) is that we > don't actually know what kind of predicate it is ourselves -- at least > not until execution time. We wait until then to do nbtree > preprocessing/scan setup. Though perhaps we should do more of this > during planning instead [1], for several reasons (fixing this is just > one of those reasons). > How come we don't know that until the execution time? Surely when building the paths/plans, we match the clauses to the index keys, no? Or are you saying that just having a scan key is not enough for it to be "access predicate"? Anyway, this patch is mostly about "Index Cond" mixing two types of predicates. But the patch is really about "Filter" predicates - moving some of them from table to index. So quite similar to the "index filter predicates" except that those are handled by the index AM. 
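[Editor's illustration] The access-vs-filter distinction can be made concrete with a toy model (Python, purely illustrative, nothing like the real nbtree code): an access predicate bounds which index entries are read at all, while a filter predicate is evaluated against every entry the access predicate lets through, so its cost scales with the size of the accessed range.

```python
def index_scan(entries, access_bounds, filter_pred):
    """Toy model: 'access_bounds' (lo, hi) determines which sorted
    index entries are read at all (access predicate); 'filter_pred'
    is then evaluated per entry read (filter predicate).
    Returns (matching_keys, entries_examined)."""
    lo, hi = access_bounds
    scanned = [k for k in entries if lo <= k <= hi]
    matches = [k for k in scanned if filter_pred(k)]
    return matches, len(scanned)

# same matches either way, but the examined-entry count (a proxy for
# buffer hits) blows up when the selective clause is only a filter
entries = list(range(1000))
narrow = index_scan(entries, (42, 45), lambda k: k in (43, 44))
wide = index_scan(entries, (0, 999), lambda k: k in (43, 44))
```

This mirrors, in miniature, the tenk1 example from earlier in the thread: the filter predicate stays cheap only as long as the accessed range stays small.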
> The risk to "robustness" for cases like the one I drew attention to on > this thread would probably have been obvious all along if EXPLAIN > output were more like what Markus would have us do -- he certainly has > a good point here, in general. > > Breaking things out in EXPLAIN output along these lines might also > give us a better general sense of when a similar plan shift like this > was actually okay -- even according to something like my > non-traditional "query robustness" criteria. It's much harder for me > to argue that a shift in plans from what Markus calls an "Index Filter > Predicate" to what the patch will show under "Index Filter:" is a > novel new risk. That would be a much less consequential difference, > because those two things are fairly similar anyway. > But differentiating between access and filter predicates (at the index AM level) seems rather independent of what this patch aims to do. FWIW I agree we should make the differences visible in the explain. That seems fairly useful for non-trivial index access paths, and it does not change the execution at all. I think it'd be fine to do that only for VERBOSE mode, and only for EXPLAIN ANALYZE (if we only do this at execution time for now). > Besides, such a shift in plan would have to "legitimately win" for > things to work out like this. If we're essentially picking between two > different subtypes of "Index Filter Predicate", then there can't be > the same weird second order effects that we see when an "Access > Predicate" is out-competed by an "Index Filter Predicate". It's > possible that expression evaluation of a small-ish conjunctive > predicate like "Index Filter: ((tenthous = 1) OR (tenthous = 3) OR > (tenthous = 42))" will be faster than a natively executed SAOP. You > can't do that kind of expression evaluation in the index AM itself > (assuming that there is an opclass for nbtree to use in the first > place, which there might not be in the case of any non-key INCLUDE > columns). 
With the patch, you can do all this. And I think that you > can derisk it without resorting to the overly conservative approach of > limiting ourselves to non-key columns from INCLUDE indexes. > I'm not following. Why couldn't there be some second-order effects? Maybe it's obvious / implied by something said earlier, but I think every time we decide between multiple choices, there's a risk of picking wrong. Anyway, is this still about this patch or rather about your SAOP patch? > To summarize: As Markus says on the same page. "Index filter > predicates give a false sense of safety; even though an index is used, > the performance degrades rapidly on a growing data volume or system > load". That's essentially what I want to avoid here. I'm much less > concerned about competition between what are really "Index Filter > Predicate" subtypes. Allowing that competition to take place is not > entirely risk-free, of course, but it seems about as risky as anything > else in this area. > True. IMHO the danger of "index filter" predicates is that people assume index conditions eliminate
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Wed, Aug 2, 2023 at 4:09 PM Melih Mutlu wrote: > > PFA an updated version with some of the earlier reviews addressed. > Forgot to include them in the previous email. > It is always better to explicitly tell which reviews are addressed but anyway, I have done some minor cleanup in the 0001 patch including removing includes which didn't seem necessary, modified a few comments, and ran pgindent. I also thought of modifying some variable names based on suggestions by Peter Smith in an email [1] but didn't find many of them any better than the current ones so modified just a few of those. If you guys are okay with this then let's commit it and then we can focus more on the remaining patches. [1] - https://www.postgresql.org/message-id/CAHut%2BPs3Du9JFmhecWY8%2BVFD11VLOkSmB36t_xWHHQJNMpdA-A%40mail.gmail.com -- With Regards, Amit Kapila. v25-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch Description: Binary data
Re: Adding a LogicalRepWorker type field
Hi, Peter Smith wrote on Wed, 2 Aug 2023 at 09:27: > On Wed, Aug 2, 2023 at 3:35 PM Masahiko Sawada > wrote: > > > > I can see the problem you stated and it's true that the worker's type > > never changes during its lifetime. But I'm not sure we really need to > > add a new variable to LogicalRepWorker since we already have enough > > information there to distinguish the worker's type. > > > > Currently, we deduce the worker's type by checking some fields that > > never change such as relid and leader_pid. It's fine to me as long as > > we encapsulate the deduction of worker's type by macros (or inline > > functions). Even with the proposed patch (0001 patch), we still need a > > similar logic to determine the worker's type. > > Thanks for the feedback. > > I agree that the calling code will not look very different > with/without this patch 0001, because everything is hidden by the > macros/functions. But those HEAD macros/functions are also hiding the > inefficiencies of the type deduction -- e.g. IMO it is quite strange > that we can only recognize the worker is an "apply worker" by > eliminating the other 2 possibilities. > Isn't the apply worker type still decided by eliminating the other choices even with the patch? /* Prepare the worker slot. */ + if (OidIsValid(relid)) + worker->type = TYPE_TABLESYNC_WORKER; + else if (is_parallel_apply_worker) + worker->type = TYPE_PARALLEL_APPLY_WORKER; + else + worker->type = TYPE_APPLY_WORKER; > 3. > +#define IsLeaderApplyWorker() is_worker_type(MyLogicalRepWorker, > TYPE_APPLY_WORKER) > +#define IsParallelApplyWorker() is_worker_type(MyLogicalRepWorker, > TYPE_PARALLEL_APPLY_WORKER) > +#define IsTablesyncWorker() is_worker_type(MyLogicalRepWorker, > TYPE_TABLESYNC_WORKER) > > My thoughts were around removing am_XXX_worker() and > IsXXXWorker() functions and just have IsXXXWorker() > alone with using IsXXXWorker(MyLogicalRepWorker) in places where > am_XXX_worker() is used. 
If implementing this leads to something like > the above with is_worker_type, -1 from me. > I agree that having both is_worker_type and IsXXXWorker makes the code more confusing. Especially both type check ways are used in the same file/function as follows: logicalrep_worker_detach(void) > { > /* Stop the parallel apply workers. */ > - if (am_leader_apply_worker()) > + if (IsLeaderApplyWorker()) > { > List *workers; > ListCell *lc; > @@ -749,7 +756,7 @@ logicalrep_worker_detach(void) > { > LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc); > > - if (isParallelApplyWorker(w)) > + if (is_worker_type(w, TYPE_PARALLEL_APPLY_WORKER)) > logicalrep_worker_stop_internal(w, SIGTERM); >} Regards, -- Melih Mutlu Microsoft
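[Editor's illustration] The trade-off being debated, re-deducing the worker's type on every check versus storing it once when the worker slot is prepared, can be sketched as follows (a Python mock-up; the names only loosely follow the patch):

```python
TABLESYNC, PARALLEL_APPLY, APPLY = "tablesync", "parallel_apply", "apply"
InvalidOid = 0

def deduce_type(relid, is_parallel_apply):
    """Mock of the deduction both sides describe: an apply worker is
    only recognized by eliminating the other two possibilities."""
    if relid != InvalidOid:
        return TABLESYNC
    if is_parallel_apply:
        return PARALLEL_APPLY
    return APPLY

class LogicalRepWorker:
    def __init__(self, relid, is_parallel_apply):
        # patch-style: run the deduction once at slot preparation; the
        # stored type never changes afterwards, so later type checks
        # become a plain comparison instead of repeated deduction
        self.type = deduce_type(relid, is_parallel_apply)
```

So Melih's observation holds, the elimination logic still exists, but under the patch it runs exactly once per worker lifetime rather than at every check site.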
Re: Potential memory leak in contrib/intarray's g_intbig_compress
On Fri, 14 Jul 2023 at 07:57, Michael Paquier wrote: > > On Thu, Jul 13, 2023 at 06:28:39PM +0200, Matthias van de Meent wrote: > > There are similar pfree calls in the _int_gist.c file's g_int_compress > > function, which made me think we do need to clean up after use, but > > indeed these pfrees are useless (or even harmful if bug #17888 can be > > trusted) > > Indeed, all these are in a GiST temporary context. So you'd mean > something like the attached perhaps, for both the decompress and > compress paths? Yes, looks good to me. Thanks! Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Oversight in reparameterize_path_by_child leading to executor crash
On Tue, Aug 1, 2023 at 9:20 PM Tom Lane wrote: > Richard Guo writes: > > So what I'm thinking is that maybe we can add a new type of path, named > > SampleScanPath, to have the TableSampleClause per path. Then we can > > safely reparameterize the TableSampleClause as needed for each > > SampleScanPath. That's what the attached patch does. > > Alternatively, could we postpone the reparameterization until > createplan.c? Having to build reparameterized expressions we might > not use seems expensive, and it would also contribute to the memory > bloat being complained of in nearby threads. I did think about this option but figured out that it seems beyond the scope of just fixing SampleScan. But if we want to optimize the reparameterization mechanism besides fixing this crash, I think this option is much better. I drafted a patch as attached. Thanks Richard v2-0001-Postpone-reparameterization-of-paths-until-when-creating-plans.patch Description: Binary data
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi, > PFA an updated version with some of the earlier reviews addressed. Forgot to include them in the previous email. Thanks, -- Melih Mutlu Microsoft v24-0003-Reuse-connection-when-tablesync-workers-change-t.patch Description: Binary data v24-0002-Reuse-Tablesync-Workers.patch Description: Binary data v24-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch Description: Binary data
ALTER COLUMN ... SET EXPRESSION to alter stored generated column's expression
Hi, Currently, we have an option to drop the expression of stored generated columns as: ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ] But we don't have support for updating that expression. The attached patch provides that as: ALTER [ COLUMN ] column_name SET EXPRESSION expression Note that this form of ALTER is meant to work for a column which is already generated; otherwise, an error will be reported. It changes the generation expression in the catalog and rewrites the table, using the existing table rewrite facilities for ALTER TABLE. To keep the code flow simple, I have renamed the existing function that was in use for DROP EXPRESSION so that it can be used for SET EXPRESSION as well, which is a design similar to SET/DROP DEFAULT. I kept these renaming code changes in a separate patch to minimize the diff in the main patch. Demo: -- Create table CREATE TABLE t1 (x int, y int GENERATED ALWAYS AS (x * 2) STORED); INSERT INTO t1 VALUES(generate_series(1,3)); -- Check the generated data SELECT * FROM t1; x | y ---+--- 1 | 2 2 | 4 3 | 6 (3 rows) -- Alter the expression ALTER TABLE t1 ALTER COLUMN y SET EXPRESSION (x * 4); -- Check the new data SELECT * FROM t1; x | y ---+---- 1 | 4 2 | 8 3 | 12 (3 rows) Thank you. 
-- Regards, Amul Sul EDB: http://www.enterprisedb.com From ef1448f7852000d5b701f9e3c7fe88737670740a Mon Sep 17 00:00:00 2001 From: Amul Sul Date: Mon, 31 Jul 2023 15:43:51 +0530 Subject: [PATCH v1 2/2] Allow to change generated column expression --- doc/src/sgml/ref/alter_table.sgml | 14 +- src/backend/commands/tablecmds.c| 88 + src/backend/parser/gram.y | 10 ++ src/bin/psql/tab-complete.c | 2 +- src/test/regress/expected/generated.out | 167 src/test/regress/sql/generated.sql | 36 - 6 files changed, 262 insertions(+), 55 deletions(-) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index d4d93eeb7c6..1b68dea8d9b 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -46,6 +46,7 @@ ALTER TABLE [ IF EXISTS ] name ALTER [ COLUMN ] column_name SET DEFAULT expression ALTER [ COLUMN ] column_name DROP DEFAULT ALTER [ COLUMN ] column_name { SET | DROP } NOT NULL +ALTER [ COLUMN ] column_name SET EXPRESSION expression ALTER [ COLUMN ] column_name DROP EXPRESSION [ IF EXISTS ] ALTER [ COLUMN ] column_name ADD GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] ALTER [ COLUMN ] column_name { SET GENERATED { ALWAYS | BY DEFAULT } | SET sequence_option | RESTART [ [ WITH ] restart ] } [...] @@ -255,13 +256,18 @@ WITH ( MODULUS numeric_literal, REM - + DROP EXPRESSION [ IF EXISTS ] - This form turns a stored generated column into a normal base column. - Existing data in the columns is retained, but future changes will no - longer apply the generation expression. + The SET form replaces stored generated value for a + column. Existing data in the columns is rewritten and all the future + changes will apply the new generation expression. + + + The DROP form turns a stored generated column into a + normal base column. Existing data in the columns is retained, but future + changes will no longer apply the generation expression. 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 3b499fc0d8e..df26afe16cb 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -456,7 +456,8 @@ static ObjectAddress ATExecSetIdentity(Relation rel, const char *colName, static ObjectAddress ATExecDropIdentity(Relation rel, const char *colName, bool missing_ok, LOCKMODE lockmode); static void ATPrepColumnExpression(Relation rel, AlterTableCmd *cmd, bool recurse, bool recursing, LOCKMODE lockmode); -static ObjectAddress ATExecColumnExpression(Relation rel, const char *colName, +static ObjectAddress ATExecColumnExpression(AlteredTableInfo *tab, Relation rel, + const char *colName, Node *newDefault, bool missing_ok, LOCKMODE lockmode); static ObjectAddress ATExecSetStatistics(Relation rel, const char *colName, int16 colNum, Node *newValue, LOCKMODE lockmode); @@ -5056,7 +5057,8 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab, ATExecCheckNotNull(tab, rel, cmd->name, lockmode); break; case AT_ColumnExpression: - address = ATExecColumnExpression(rel, cmd->name, cmd->missing_ok, lockmode); + address = ATExecColumnExpression(tab, rel, cmd->name, cmd->def, + cmd->missing_ok, lockmode); break; case AT_SetStatistics: /* ALTER COLUMN SET STATISTICS */ address = ATExecSetStatistics(rel, cmd->name, cmd->num, cmd->def, lockmode); @@ -8015,16 +8017,21 @@ static void ATPrepColumnExpression(Relation rel, AlterTableCmd *cmd, bool recurse,
Re: pgbnech: allow to cancel queries during benchmark
On Wed, 2 Aug 2023 16:37:53 +0900 Yugo NAGATA wrote: > Hello Fabien, > > On Fri, 14 Jul 2023 20:32:01 +0900 > Yugo NAGATA wrote: > > I attached the updated patch. I'm sorry. I forgot to attach the patch. Regards, Yugo Nagata > > > Hello Fabien, > > > > Thank you for your review! > > > > On Mon, 3 Jul 2023 20:39:23 +0200 (CEST) > > Fabien COELHO wrote: > > > > > > > > Yugo-san, > > > > > > Some feedback about v1 of this patch. > > > > > > Patch applies cleanly, compiles. > > > > > > There are no tests, could there be one? ISTM that one could be done with > > > a > > > "SELECT pg_sleep(...)" script?? > > > > Agreed. I will add the test. > > I added a TAP test. > > > > > > The global name "all_state" is quite ambiguous, what about > > > "client_states" > > > instead? Or maybe it could be avoided, see below. > > > > > > Instead of renaming "state" to "all_state" (or client_states as suggested > > > above), I'd suggest to minimize the patch by letting "state" inside the > > > main and adding a "client_states = state;" just after the allocation, or > > > another approach, see below. > > > > Ok, I'll fix to add a global variable "client_states" and make this point to > > "state" instead of changing "state" to global. > > Done. > > > > > > Should PQfreeCancel be called on deconnections, in finishCon? I think > > > that > > > there may be a memory leak with the current implementation?? > > > > Agreed. I'll fix. > > Done. > > Regards, > Yugo Nagata > > > > > > Maybe it should check that cancel is not NULL before calling PQcancel? > > > > I think this is already checked as below, but am I missing something? > > > > + if (all_state[i].cancel != NULL) > > + (void) PQcancel(all_state[i].cancel, errbuf, sizeof(errbuf)); > > > > > After looking at the code, I'm very unclear whether they may be some > > > underlying race conditions, or not, depending on when the cancel is > > > triggered. 
I think that some race conditions are still likely given the > > > current thread 0 implementation, and dealing with them with a barrier or > > > whatever is not desirable at all. > > > > > > In order to work around this issue, ISTM that we should go away from the > > > simple and straightforward thread 0 approach, and the only clean way is > > > that the cancelation should be managed by each thread for its own client. > > > > > > I'd suggest to have the advanceState to call PQcancel when > > > CancelRequested > > > is set and switch to CSTATE_ABORTED to end properly. This means that > > > there > > > would be no need for the global client states pointer, so the patch > > > should > > > be smaller and simpler. Possibly there would be some shortcuts added here > > > and there to avoid lingering after the control-C, in threadRun. > > > > I am not sure this approach is simpler than mine. > > > > In a multi-threaded program, only one thread catches the signal while other threads > > continue to run. Therefore, if Ctrl+C is pressed while threads are waiting for > > responses from the backend in wait_on_socket_set, only one thread can be > > interrupted and return, but other threads will continue to wait and cannot > > check CancelRequested. So, for implementing your suggestion, we would need some > > hack > > to make all threads return from wait_on_socket_set when the event occurs, > > but > > I don't have an idea for doing this in a simpler way. > > > > In my patch, all threads can return from wait_on_socket_set at Ctrl+C > > because when thread #0 cancels all connections, the following error is > > sent to all sessions: > > > > ERROR: canceling statement due to user request > > > > and all threads will receive the response from the backend. 
> > > > Regards, > > Yugo Nagata > > > > -- > > Yugo NAGATA > > > > > > > -- > Yugo NAGATA > > -- Yugo NAGATA diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 539c2795e2..68278d2b18 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -596,6 +596,7 @@ typedef enum typedef struct { PGconn *con; /* connection handle to DB */ + PGcancel *cancel; /* query cancel */ int id; /* client No. */ ConnectionStateEnum state; /* state machine's current state. */ ConditionalStack cstack; /* enclosing conditionals state */ @@ -638,6 +639,8 @@ typedef struct * here */ } CState; +CState *client_states; /* status of all clients */ + /* * Thread state */ @@ -3631,6 +3634,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg) st->state = CSTATE_ABORTED; break; } + st->cancel = PQgetCancel(st->con); /* reset now after connection */ now = pg_time_now(); @@ -4662,6 +4666,18 @@ disconnect_all(CState *state, int length) finishCon(&state[i]); } +/* send cancel requests to all connections */ +static void +cancel_all() +{ + for (int i = 0; i < nclients; i++) + { + char errbuf[1]; + if (client_states[i].cancel != NULL)
Improve const use in zlib-using code
Now that we have effectively de-supported CentOS 6, by removing support for its OpenSSL version, I think we could also modernize the use of some other libraries, such as zlib. If we define ZLIB_CONST before including zlib.h, zlib augments some interfaces with const decorations. By doing that we can keep our own interfaces cleaner and can remove some unconstify calls. ZLIB_CONST was introduced in zlib 1.2.5.2 (17 Dec 2011); CentOS 6 has zlib-1.2.3-29.el6.x86_64. Note that if you use this patch and compile on CentOS 6, it still works, you just get a few compiler warnings about discarding qualifiers. Old environments tend to produce more compiler warnings anyway, so this doesn't seem so bad. From 324e37adff4c51f4f10807386c0c098cb8d21608 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 2 Aug 2023 11:01:27 +0200 Subject: [PATCH] Improve const use in zlib-using code If we define ZLIB_CONST before including zlib.h, zlib augments some interfaces with const decorations. By doing that we can keep our own interfaces cleaner and can remove some unconstify calls. ZLIB_CONST was introduced in zlib 1.2.5.2 (17 Dec 2011). 
XXX CentOS 6 has zlib-1.2.3-29.el6.x86_64 --- contrib/pgcrypto/pgp-compress.c | 3 ++- src/backend/backup/basebackup_gzip.c| 1 + src/bin/pg_basebackup/bbstreamer_gzip.c | 3 ++- src/bin/pg_basebackup/pg_basebackup.c | 1 + src/bin/pg_basebackup/pg_receivewal.c | 1 + src/bin/pg_basebackup/walmethods.c | 6 +++--- src/bin/pg_dump/compress_gzip.c | 1 + src/common/compression.c| 1 + 8 files changed, 12 insertions(+), 5 deletions(-) diff --git a/contrib/pgcrypto/pgp-compress.c b/contrib/pgcrypto/pgp-compress.c index 086bec31ae..5615f1acce 100644 --- a/contrib/pgcrypto/pgp-compress.c +++ b/contrib/pgcrypto/pgp-compress.c @@ -40,6 +40,7 @@ #ifdef HAVE_LIBZ +#define ZLIB_CONST #include <zlib.h> #define ZIP_OUT_BUF 8192 @@ -113,7 +114,7 @@ compress_process(PushFilter *next, void *priv, const uint8 *data, int len) /* * process data */ - st->stream.next_in = unconstify(uint8 *, data); + st->stream.next_in = data; st->stream.avail_in = len; while (st->stream.avail_in > 0) { diff --git a/src/backend/backup/basebackup_gzip.c b/src/backend/backup/basebackup_gzip.c index b2d5e19ad9..45210a2ff0 100644 --- a/src/backend/backup/basebackup_gzip.c +++ b/src/backend/backup/basebackup_gzip.c @@ -13,6 +13,7 @@ #include "postgres.h" #ifdef HAVE_LIBZ +#define ZLIB_CONST #include <zlib.h> #endif diff --git a/src/bin/pg_basebackup/bbstreamer_gzip.c b/src/bin/pg_basebackup/bbstreamer_gzip.c index 3bdbfa0bc4..fa52392a48 100644 --- a/src/bin/pg_basebackup/bbstreamer_gzip.c +++ b/src/bin/pg_basebackup/bbstreamer_gzip.c @@ -14,6 +14,7 @@ #include #ifdef HAVE_LIBZ +#define ZLIB_CONST #include <zlib.h> #endif @@ -269,7 +270,7 @@ bbstreamer_gzip_decompressor_content(bbstreamer *streamer, mystreamer = (bbstreamer_gzip_decompressor *) streamer; zs = &mystreamer->zstream; - zs->next_in = (uint8 *) data; + zs->next_in = (const uint8 *) data; zs->avail_in = len; /* Process the current chunk */ diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index 1dc8efe0cb..90fd044ec3 100644 ---
a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -22,6 +22,7 @@ #include #include #ifdef HAVE_LIBZ +#define ZLIB_CONST #include <zlib.h> #endif diff --git a/src/bin/pg_basebackup/pg_receivewal.c b/src/bin/pg_basebackup/pg_receivewal.c index d0a4079d50..c0de77b5aa 100644 --- a/src/bin/pg_basebackup/pg_receivewal.c +++ b/src/bin/pg_basebackup/pg_receivewal.c @@ -24,6 +24,7 @@ #include #endif #ifdef HAVE_LIBZ +#define ZLIB_CONST #include <zlib.h> #endif diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c index 376ddf72b7..b663a3d978 100644 --- a/src/bin/pg_basebackup/walmethods.c +++ b/src/bin/pg_basebackup/walmethods.c @@ -19,6 +19,7 @@ #include #endif #ifdef HAVE_LIBZ +#define ZLIB_CONST #include <zlib.h> #endif @@ -705,7 +706,7 @@ typedef struct TarMethodData #ifdef HAVE_LIBZ static bool -tar_write_compressed_data(TarMethodData *tar_data, void *buf, size_t count, +tar_write_compressed_data(TarMethodData *tar_data, const void *buf, size_t count, bool flush) { tar_data->zp->next_in = buf; @@ -782,8 +783,7 @@ tar_write(Walfile *f, const void *buf, size_t count) #ifdef HAVE_LIBZ else if (f->wwmethod->compression_algorithm == PG_COMPRESSION_GZIP) { - if (!tar_write_compressed_data(tar_data, unconstify(void *, buf), - count, false)) + if (!tar_write_compressed_data(tar_data, buf, count, false)) return -1; f->currpos += count; return count; diff
Re: [PoC] Reducing planning time when tables have many partitions
On 2/8/2023 13:40, Yuya Watari wrote: As seen from the above, verifying iteration results was the cause of the performance degradation. I agree that we should avoid such degradation because it negatively affects the development of PostgreSQL. Removing the verification when committing this patch is one possible option. You introduced list_ptr_cmp as an extern List function, but use it only under a USE_ASSERT_CHECKING ifdef. Maybe you should hide its declaration under USE_ASSERT_CHECKING as well, or remove it entirely? -- regards, Andrey Lepikhov Postgres Professional
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi, Amit Kapila wrote on Wed, 2 Aug 2023 at 12:01: > I think we are getting the error (ERROR: could not find logical > decoding starting point) because we wouldn't have waited for WAL to > become available before reading it. It could happen due to the > following code: > WalSndWaitForWal() > { > ... > if (streamingDoneReceiving && streamingDoneSending && > !pq_is_send_pending()) > break; > .. > } > > Now, it seems that in 0003 patch, instead of resetting flags > streamingDoneSending, and streamingDoneReceiving before start > replication, we should reset before create logical slots because we > need to read the WAL during that time as well to find the consistent > point. > Thanks for the suggestion, Amit. I've been looking into this recently and couldn't figure out the cause until now. I quickly made the fix in 0003. It seems to have resolved the "could not find logical decoding starting point" errors. vignesh C wrote on Tue, 1 Aug 2023 at 09:32: > I agree that "no copy in progress issue" issue has nothing to do with > 0001 patch. This issue is present with the 0002 patch. > In the case when the tablesync worker has to apply the transactions > after the table is synced, the tablesync worker sends the feedback of > writepos, applypos and flushpos which results in "No copy in progress" > error as the stream has ended already. Fixed it by exiting the > streaming loop if the tablesync worker is done with the > synchronization. The attached 0004 patch has the changes for the same. > The rest of v22 patches are the same patch that were posted by Melih > in the earlier mail. Thanks for the fix. I placed it into 0002 with a slight change as follows: - send_feedback(last_received, false, false); > + if (!MyLogicalRepWorker->relsync_completed) > + send_feedback(last_received, false, false); IMHO relsync_completed means essentially the same thing as streaming_done, which is why I wanted to check that flag instead of adding a goto statement. 
Does it make sense to you as well? Thanks, -- Melih Mutlu Microsoft v23-0002-Reuse-Tablesync-Workers.patch Description: Binary data v23-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch Description: Binary data v23-0003-Reuse-connection-when-tablesync-workers-change-t.patch Description: Binary data
Re: Support to define custom wait events for extensions
On 2023-08-01 12:23, Andres Freund wrote: Hi, On 2023-08-01 12:14:49 +0900, Michael Paquier wrote: On Tue, Aug 01, 2023 at 11:51:35AM +0900, Masahiro Ikeda wrote: > Thanks for committing the main patch. > > In my understanding, the remaining work items are > * to support WaitEventExtensionMultiple() > * to replace WAIT_EVENT_EXTENSION with custom wait events > > Is someone already working on them? If not, I'll consider > how to realize them. Note that postgres_fdw and dblink use WAIT_EVENT_EXTENSION, but have no dependency on shared_preload_libraries. Perhaps these could just use a dynamic handling but that deserves a separate discussion because of the fact that they'd need shared memory without being able to request it. autoprewarm.c is much simpler. This is why the scheme as implemented doesn't really make sense to me. It'd be much easier to use if we had a shared hashtable with the identifiers than what's been merged now. In plenty of cases it's not realistic for an extension library to run in each backend, while still needing to wait for things. OK, I'll try to make a PoC patch. Regards, -- Masahiro Ikeda NTT DATA CORPORATION
Re: multiple membership grants and information_schema.applicable_roles
On 24.07.23 08:42, Pavel Luzanov wrote: I do see what seems like a different issue: the standard appears to expect that indirect role grants should also be shown (via the recursive CTE), and we are not doing that. I noticed this, but the view has stayed unchanged for such a long time that I thought it was done intentionally. The implementation of the information_schema.applicable_roles view predates both indirect role grants and recursive query support. So some updates might be in order.
Re: Adding a LogicalRepWorker type field
On Wed, Aug 2, 2023 at 2:46 PM Bharath Rupireddy wrote: > > On Wed, Aug 2, 2023 at 12:14 PM Peter Smith wrote: > > > > We can't use the same names for both with/without-parameter functions > > because there is no function overloading in C. In patch v3-0001 I've > > replaced the "dual set of macros", with a single inline function of a > > different name, and one set of space-saving macros. > > > > PSA v3 > > Quick thoughts: > > 1. > +typedef enum LogicalRepWorkerType > +{ > +TYPE_UNKNOWN = 0, > +TYPE_TABLESYNC_WORKER, > +TYPE_APPLY_WORKER, > +TYPE_PARALLEL_APPLY_WORKER > +} LogicalRepWorkerType; > > -1 for these names starting with prefix TYPE_, in fact LR_ looked better. > > 2. > +is_worker_type(LogicalRepWorker *worker, LogicalRepWorkerType type) > > This function name looks too generic, an element of logical > replication is better in the name. > > 3. > +#define IsLeaderApplyWorker() is_worker_type(MyLogicalRepWorker, > TYPE_APPLY_WORKER) > +#define IsParallelApplyWorker() is_worker_type(MyLogicalRepWorker, > TYPE_PARALLEL_APPLY_WORKER) > +#define IsTablesyncWorker() is_worker_type(MyLogicalRepWorker, > TYPE_TABLESYNC_WORKER) > > My thoughts were around removing am_XXX_worker() and > IsXXXWorker() functions and just have IsXXXWorker() > alone with using IsXXXWorker(MyLogicalRepWorker) in places where > am_XXX_worker() is used. If implementing this leads to something like > the above with is_worker_type, -1 from me. What I was suggesting to Peter was to have something like: static inline bool am_tablesync_worker(void) { return (MyLogicalRepWorker->type == TYPE_TABLESYNC_WORKER); } -- With Regards, Amit Kapila.
Re: add timing information to pg_upgrade
On 02.08.23 10:30, Bharath Rupireddy wrote: Moreover, the ts command gives me the timestamps for each of the messages printed, so an extra step is required to calculate the time taken for an operation. There is "ts -i" for that.
Re: Adding a LogicalRepWorker type field
On Wed, Aug 2, 2023 at 12:14 PM Peter Smith wrote: > > We can't use the same names for both with/without-parameter functions > because there is no function overloading in C. In patch v3-0001 I've > replaced the "dual set of macros", with a single inline function of a > different name, and one set of space-saving macros. > > PSA v3 Quick thoughts: 1. +typedef enum LogicalRepWorkerType +{ +TYPE_UNKNOWN = 0, +TYPE_TABLESYNC_WORKER, +TYPE_APPLY_WORKER, +TYPE_PARALLEL_APPLY_WORKER +} LogicalRepWorkerType; -1 for these names starting with prefix TYPE_, in fact LR_ looked better. 2. +is_worker_type(LogicalRepWorker *worker, LogicalRepWorkerType type) This function name looks too generic, an element of logical replication is better in the name. 3. +#define IsLeaderApplyWorker() is_worker_type(MyLogicalRepWorker, TYPE_APPLY_WORKER) +#define IsParallelApplyWorker() is_worker_type(MyLogicalRepWorker, TYPE_PARALLEL_APPLY_WORKER) +#define IsTablesyncWorker() is_worker_type(MyLogicalRepWorker, TYPE_TABLESYNC_WORKER) My thoughts were around removing am_XXX_worker() and IsXXXWorker() functions and just have IsXXXWorker() alone with using IsXXXWorker(MyLogicalRepWorker) in places where am_XXX_worker() is used. If implementing this leads to something like the above with is_worker_type, -1 from me. > Done. I converged everything to the macro-naming style as suggested, > but I chose not to pass MyLogicalRepWorker as a parameter for the > normal (am_xxx case) otherwise I felt it would make the calling code > unnecessarily verbose. With is_worker_type, the calling code now looks even more verbose. I don't think IsLeaderApplyWorker(MyLogicalRepWorker) is a bad idea. This unifies multiple worker type functions into one and makes code simple. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: PG_CATCH used without PG_RETHROW
Hi, I think this is a bug because this function's CATCH clause does not restore the memory context. Thus, when leaving the function, the CurrentMemoryContext will be ErrorContext. I see this part of the code has been refactored in the master branch, but in REL_16_STABLE the code is still there. The following SQL might lead to a PANIC (a reference to memory that has just been reset): zlyu=# select version(); version - PostgreSQL 15.3 on aarch64-unknown-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit (1 row) zlyu=# select xml_is_well_formed('') or array_length(array_append(array[random(), random()], case when xml_is_well_formed('') then random() else random() end), 1) > 1; ERROR: cache lookup failed for type 2139062143 zlyu=# A simple fix is to backport the change from the master branch, or to restore the context. Tom Lane wrote on Wed, 2 Aug 2023 at 16:32: > Greg Stark writes: > > My understanding is that PG_TRY/PG_CATCH doesn't save enough state to > > avoid rethrowing errors and if you want to actually continue the > > transaction you must use a subtransaction. As a result I was under the > > impression it was mandatory to PG_RETHROW as a result. > > > If that's the case then I think I just came across a bug in > > utils/adt/xml.c where there's no PG_RETHROW: > > The reason we think that's OK is that we assume libxml2 does not call back > into the general backend code, so there is no PG state we'd have to undo. > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hack...@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Tue, Aug 1, 2023 at 9:44 AM Peter Smith wrote: > > > FYI, here is some more information about ERRORs seen. > > The patches were re-tested -- applied in stages (and also against the > different scripts) to identify where the problem was introduced. Below > are the observations: > > ~~~ > > Using original test scripts > > 1. Using only patch v21-0001 > - no errors > > 2. Using only patch v21-0001+0002 > - no errors > > 3. Using patch v21-0001+0002+0003 > - no errors > > ~~~ > > Using the "busy loop" test scripts for long transactions > > 1. Using only patch v21-0001 > - no errors > > 2. Using only patch v21-0001+0002 > - gives errors for "no copy in progress issue" > e.g. ERROR: could not send data to WAL stream: no COPY in progress > > 3. Using patch v21-0001+0002+0003 > - gives the same "no copy in progress issue" errors as above > e.g. ERROR: could not send data to WAL stream: no COPY in progress > - and also gives slot consistency point errors > e.g. ERROR: could not create replication slot > "pg_16700_sync_16514_7261998170966054867": ERROR: could not find > logical decoding starting point > e.g. LOG: could not drop replication slot > "pg_16700_sync_16454_7261998170966054867" on publisher: ERROR: > replication slot "pg_16700_sync_16454_7261998170966054867" does not > exist > I think we are getting the error (ERROR: could not find logical decoding starting point) because we wouldn't have waited for WAL to become available before reading it. It could happen due to the following code: WalSndWaitForWal() { ... if (streamingDoneReceiving && streamingDoneSending && !pq_is_send_pending()) break; .. } Now, it seems that in 0003 patch, instead of resetting flags streamingDoneSending, and streamingDoneReceiving before start replication, we should reset before create logical slots because we need to read the WAL during that time as well to find the consistent point. -- With Regards, Amit Kapila.
Re: add timing information to pg_upgrade
On Wed, Aug 2, 2023 at 12:44 PM Peter Eisentraut wrote: > > On 01.08.23 18:00, Nathan Bossart wrote: > > One of the main purposes of this thread is to gauge interest. I'm hoping > > there are other developers who are interested in reducing > > pg_upgrade-related downtime, and it seemed like it'd be nice to have > > built-in functionality for measuring the step times instead of requiring > > folks to apply this patch every time. And I think users might also be > > interested in this information, if for no other reason than to help us > > pinpoint which steps are taking longer for various workloads. > > If it's just for developers and expert users, perhaps existing > shell-level functionality like "pg_upgrade | ts" would suffice? Interesting. I had to install the moreutils package to get ts, and a command like ts may or may not be available on all platforms. Moreover, the ts command gives me the timestamps for each of the messages printed, so an extra step is required to calculate the time taken for an operation. I think it'd be better if pg_upgrade could calculate the time taken for each operation, and I'm okay if it is an opt-in feature with the --verbose option. [1] Aug 02 07:44:17 Sync data directory to disk ok Aug 02 07:44:17 Creating script to delete old cluster ok Aug 02 07:44:17 Checking for extension updates ok -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: cataloguing NOT NULL constraints
On 24.07.23 12:32, Alvaro Herrera wrote: However, 11.16 (as part of 11.12) says that DROP NOT NULL causes the indication of the column as NOT NULL to be removed. This, to me, says that if you do have multiple such constraints, you'd better remove them all with that command. All in all, I lean towards allowing just one as best as we can. Another clue is in 11.15, which says 1) Let C be the column identified by the CN in the containing . If the column descriptor of C does not contain an indication that C is defined as NOT NULL, then: [do things] Otherwise it does nothing. So there can only be one such constraint per table.
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Amit, Thank you for the comments! PSA a new version of the patchset. > 1. Do we really need 0001 patch after the latest change proposed by > Vignesh in the 0004 patch? I removed the 0001 patch and revived an old patch which serializes slots at shutdown. This is because the problem that slots are not serialized to disk still remains [1], and confirmed_flush then falls behind, even if we implement the approach. > 2. > + if (dopt.logical_slots_only) > + { > + if (!dopt.binary_upgrade) > + pg_fatal("options --logical-replication-slots-only requires option > --binary-upgrade"); > + > + if (dopt.dataOnly) > + pg_fatal("options --logical-replication-slots-only and > -a/--data-only cannot be used together"); > + > + if (dopt.schemaOnly) > + pg_fatal("options --logical-replication-slots-only and > -s/--schema-only cannot be used together"); > > Can you please explain why the patch imposes these restrictions? I > guess the binary_upgrade is because you want this option to be used > for the upgrade. Do we want to avoid giving any other option with > logical_slots, if so, are the above checks sufficient and why? Regarding --binary-upgrade, the motivation is as you expected. I hid the --logical-replication-slots-only option from users, so it should not be used for anything other than an upgrade. Additionally, this option is not shown in the help output or the documentation. As for the -{data|schema}-only options, I removed the restrictions. Initially I marked them as excluded because the combination might be confusing - as discussed at [2], slots must be dumped after all of the pg_resetwal work is done, and at that time all the definitions have already been dumped. To avoid duplicated definitions, we must ensure that only slots are written to the output file. I thought this requirement contradicted the descriptions of these options (dump only A, not B). But after considering it more, I decided this restriction might not be needed because the option is not exposed to users - no one would be confused by using both of them. (The restriction for -c is also removed for the same reason.) > 3. 
> + /* > + * Get replication slots. > + * > + * XXX: Which information must be extracted from old node? Currently three > + * attributes are extracted because they are used by > + * pg_create_logical_replication_slot(). > + */ > + appendPQExpBufferStr(query, > + "SELECT slot_name, plugin, two_phase " > + "FROM pg_catalog.pg_replication_slots " > + "WHERE database = current_database() AND temporary = false " > + "AND wal_status IN ('reserved', 'extended');"); > > Why are we ignoring the slots that have wal status as WALAVAIL_REMOVED > or WALAVAIL_UNRESERVED? I think the slots where wal status is > WALAVAIL_REMOVED, the corresponding slots are invalidated at some > point. I think such slots can't be used for decoding but these will be > dropped along with the subscription or when a user does it manually. > So, if we don't copy such slots after the upgrade then there could be > a problem in dropping the corresponding subscription. If we don't want > to copy over such slots then we need to provide instructions on what > users should do in such cases. OTOH, if we want to copy over such > slots then we need to find a way to invalidate such slots after copy. > Either way, this needs more analysis. I reconsidered this. At least WALAVAIL_UNRESERVED should be supported because the slot is still usable; it can become reserved or extended again. As for WALAVAIL_REMOVED, I don't think it should be supported, so I added a description to the document. This feature re-creates slots which have the same name/plugin as the old ones; it does not replicate their state. So if we copied such slots as-is, they would become usable again, and if subscribers referred to the slot and reconnected at that time, changes made while the slot was WALAVAIL_REMOVED might be lost. Based on the above, such slots would have to be copied in the WALAVAIL_REMOVED state, but as you said, we do not have a way to control that: the status is calculated using restart_lsn, and there is no function to modify that directly. 
One approach is adding an SQL function which sets restart_lsn to an arbitrary value (or 0/0, i.e. invalid), but that seems dangerous. > 4. > + /* > + * Check that all logical replication slots have reached the current WAL > + * position. > + */ > + res = executeQueryOrDie(conn, > + "SELECT slot_name FROM pg_catalog.pg_replication_slots " > + "WHERE (SELECT count(record_type) " > + " FROM pg_catalog.pg_get_wal_records_content(confirmed_flush_lsn, > pg_catalog.pg_current_wal_insert_lsn()) " > + " WHERE record_type != 'CHECKPOINT_SHUTDOWN') <> 0 " > + "AND temporary = false AND wal_status IN ('reserved', 'extended');"); > > I think this can unnecessarily lead to reading a lot of WAL data if > the confirmed_flush_lsn for a slot is too much behind. Can we think of > improving this by passing the number of records to read which in this > case should be 1? I checked, and pg_wal_record_info() seemed usable for this purpose. I tried to move the functionality to core. But this function raises an ERROR when there is no valid record after the specified
RE: [PoC] pg_upgrade: allow to upgrade publisher node
Dear Vignesh, Thank you for making the PoC! > Here is a patch which checks that there are no WAL records other than > CHECKPOINT_SHUTDOWN WAL record to be consumed based on the discussion > from [1]. Basically I agree with your approach. Thanks! > Patch 0001 and 0002 is same as the patch posted by Kuroda-san, Patch > 0003 exposes pg_get_wal_records_content to get the WAL records along > with the WAL record type between start and end lsn. pg_walinspect > contrib module already exposes a function for this requirement, I have > moved this functionality to be exposed from the backend. Patch 0004 > has slight change in check function to check that there are no other > records other than CHECKPOINT_SHUTDOWN to be consumed. The attached > patch has the changes for the same. > Thoughts? > > [1] - > https://www.postgresql.org/message-id/CAA4eK1Kem-J5NM7GJCgyKP84pEN6 > RsG6JWo%3D6pSn1E%2BiexL1Fw%40mail.gmail.com Few comments: * Per the comment from Amit [1], I used pg_get_wal_record_info() instead of pg_get_wal_records_info(). This function extracts the next available WAL record, which avoids a huge scan when confirmed_flush is far behind. * According to cfbot and my analysis, 0001 cannot pass the test on macOS, so I revived Julien's patch [2] as 0002 for now. AFAICS, 0001 is not so welcome. The next patch will be available soon. [1]: https://www.postgresql.org/message-id/CAA4eK1LWKkoyy-p-SAT0JTWa%3D6kXiMd%3Da6ZcArY9eU4a3g4TZg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/20230414061248.vdsxz2febjo3re6h%40jrouhaud Best Regards, Hayato Kuroda FUJITSU LIMITED
Re: [PATCH] [zh_CN.po] fix a typo in simplified Chinese translation file
On 01.08.23 10:09, Junwang Zhao wrote: add the missing leading `l` for log_statement_sample_rate I have committed this fix to the translations repository.
Re: pgbnech: allow to cancel queries during benchmark
Hello Fabien, On Fri, 14 Jul 2023 20:32:01 +0900 Yugo NAGATA wrote: I attached the updated patch. > Hello Fabien, > > Thank you for your review! > > On Mon, 3 Jul 2023 20:39:23 +0200 (CEST) > Fabien COELHO wrote: > > > > > Yugo-san, > > > > Some feedback about v1 of this patch. > > > > Patch applies cleanly, compiles. > > > > There are no tests, could there be one? ISTM that one could be done with a > > "SELECT pg_sleep(...)" script?? > > Agreed. I will add the test. I added a TAP test. > > > The global name "all_state" is quite ambiguous, what about "client_states" > > instead? Or maybe it could be avoided, see below. > > > > Instead of renaming "state" to "all_state" (or client_states as suggested > > above), I'd suggest to minimize the patch by letting "state" inside the > > main and adding a "client_states = state;" just after the allocation, or > > another approach, see below. > > Ok, I'll fix to add a global variable "client_states" and make this point to > "state" instead of changing "state" to global. Done. > > > Should PQfreeCancel be called on deconnections, in finishCon? I think that > > there may be a memory leak with the current implementation?? > > Agreed. I'll fix. Done. Regards, Yugo Nagata > > > Maybe it should check that cancel is not NULL before calling PQcancel? > > I think this is already checked as below, but am I missing something? > > + if (all_state[i].cancel != NULL) > + (void) PQcancel(all_state[i].cancel, errbuf, sizeof(errbuf)); > > > After looking at the code, I'm very unclear whether they may be some > > underlying race conditions, or not, depending on when the cancel is > > triggered. I think that some race conditions are still likely given the > > current thread 0 implementation, and dealing with them with a barrier or > > whatever is not desirable at all. 
> > > > In order to work around this issue, ISTM that we should go away from the > > simple and straightforward thread 0 approach, and the only clean way is > > that the cancelation should be managed by each thread for its own client. > > > > I'd suggest to have the advanceState to call PQcancel when CancelRequested > > is set and switch to CSTATE_ABORTED to end properly. This means that there > > would be no need for the global client states pointer, so the patch should > > be smaller and simpler. Possibly there would be some shortcuts added here > > and there to avoid lingering after the control-C, in threadRun. > > I am not sure this approach is simpler than mine. > > In multi-threads, only one thread can catches the signal and other threads > continue to run. Therefore, if Ctrl+C is pressed while threads are waiting > responses from the backend in wait_on_socket_set, only one thread can be > interrupted and return, but other threads will continue to wait and cannot > check CancelRequested. So, for implementing your suggestion, we need any hack > to make all threads return from wait_on_socket_set when the event occurs, but > I don't have idea to do this in simpler way. > > In my patch, all threads can return from wait_on_socket_set at Ctrl+C > because when thread #0 cancels all connections, the following error is > sent to all sessions: > > ERROR: canceling statement due to user request > > and all threads will receive the response from the backend. > > Regards, > Yugo Nagata > > -- > Yugo NAGATA > > -- Yugo NAGATA
Re: add timing information to pg_upgrade
On Wed, Aug 2, 2023 at 12:45 PM Peter Eisentraut wrote: > > On 01.08.23 17:45, Nathan Bossart wrote: > > The message is too long, so there's no space between it and the "ok" > > message: > > > > Checking for incompatible "aclitem" data type in user tablesok > > > > Instead of altering the messages, we could bump MESSAGE_WIDTH from 60 to > > 62 or 64. Do you prefer that approach? > > I think we should change the output format to be more like initdb, like > > Doing something ... ok > > without horizontally aligning all the "ok"s. While this looks simple, we might end up with a lot of diff and changes after removing MESSAGE_WIDTH. There's a significant part of pg_upgrade code that deals with MESSAGE_WIDTH. I don't think it's worth the effort. Therefore, I'd prefer the simplest possible fix - change the message to "Checking for \"aclitem\" data type in user tables". It may be overkill, but we can consider adding Assert(sizeof(message) < MESSAGE_WIDTH) in the progress report functions so that new messages don't end up with the same formatting issue. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
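To make the alignment problem concrete, here is a small stand-alone C sketch. MESSAGE_WIDTH and the report() helper are simplified stand-ins, not pg_upgrade's actual code: the point is only that a left-justified fixed-width field loses its separating space the moment the message reaches the field width.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for pg_upgrade's MESSAGE_WIDTH. */
#define MESSAGE_WIDTH 60

/*
 * Mimic the aligned progress output: the message is left-justified in a
 * fixed-width field and the status ("ok") is printed right after it.
 * When strlen(msg) >= MESSAGE_WIDTH there is no padding left, so the
 * status is glued directly onto the message text.
 */
static void
report(char *buf, size_t buflen, const char *msg)
{
    snprintf(buf, buflen, "%-*s%s", MESSAGE_WIDTH, msg, "ok");
}
```

The "aclitem" message is exactly 60 characters, so with a width of 60 the output ends in "tablesok"; shortening the message (or bumping the width) restores the separating padding, which is why both fixes discussed above work.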
Re: add timing information to pg_upgrade
On 01.08.23 17:45, Nathan Bossart wrote: The message is too long, so there's no space between it and the "ok" message: Checking for incompatible "aclitem" data type in user tablesok Instead of altering the messages, we could bump MESSAGE_WIDTH from 60 to 62 or 64. Do you prefer that approach? I think we should change the output format to be more like initdb, like Doing something ... ok without horizontally aligning all the "ok"s.
Re: add timing information to pg_upgrade
On 01.08.23 18:00, Nathan Bossart wrote: One of the main purposes of this thread is to gauge interest. I'm hoping there are other developers who are interested in reducing pg_upgrade-related downtime, and it seemed like it'd be nice to have built-in functionality for measuring the step times instead of requiring folks to apply this patch every time. And I think users might also be interested in this information, if for no other reason than to help us pinpoint which steps are taking longer for various workloads. If it's just for developers and expert users, perhaps existing shell-level functionality like "pg_upgrade | ts" would suffice?
Re: Adding a LogicalRepWorker type field
On Wed, Aug 2, 2023 at 1:00 PM Amit Kapila wrote: > > On Wed, Aug 2, 2023 at 8:10 AM Peter Smith wrote: > > > > > > The am_xxx functions are removed now in the v2-0001 patch. See [1]. > > > > The replacement set of macros (the ones with no arg) are not strictly > > necessary, except I felt it would make the code unnecessarily verbose > > if we insist to pass MyLogicalRepWorker everywhere from the callers in > > worker.c / tablesync.c / applyparallelworker.c. > > > > Agreed but having a dual set of macros is also not clean. Can we > provide only a unique set of inline functions instead? > We can't use the same names for both with/without-parameter functions because there is no function overloading in C. In patch v3-0001 I've replaced the "dual set of macros", with a single inline function of a different name, and one set of space-saving macros. PSA v3 -- Kind Regards, Peter Smith. Fujitsu Australia v3-0001-Add-LogicalRepWorkerType-enum.patch Description: Binary data v3-0002-Switch-on-worker-type.patch Description: Binary data
Re: [PoC] Reducing planning time when tables have many partitions
Hello, I really appreciate sharing very useful scripts and benchmarking results. On Fri, Jul 28, 2023 at 6:51 PM Ashutosh Bapat wrote: > Given that most of the developers run assert enabled builds it would > be good to bring down the degradation there while keeping the > excellent speedup in non-assert builds. From my observation, this degradation in assert enabled build is caused by verifying the iteration results of EquivalenceMembers. My patch uses Bitmapset-based indexes to speed up the iteration. When assertions are enabled, we verify that the result of the iteration is the same with and without the indexes. This verification results in executing a similar loop three times, which causes the degradation. I measured planning time by using your script without this verification. The results are as follows: Master: 144.55 ms Patched (v19): 529.85 ms Patched (v19) without verification: 78.84 ms (*) All runs are with assertions. As seen from the above, verifying iteration results was the cause of the performance degradation. I agree that we should avoid such degradation because it negatively affects the development of PostgreSQL. Removing the verification when committing this patch is one possible option. > Queries on partitioned tables eat a lot of memory anyways, increasing > that further should be avoided. > > I have not studied the patches. But I think the memory increase has to > do with our Bitmapset structure. It's space inefficient when there are > thousands of partitions involved. See my comment at [2] Thank you for pointing this out. I have never considered the memory usage impact of this patch. As you say, the Bitmapset structure caused this increase. I will try to look into this further. -- Best regards, Yuya Watari
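As a toy illustration of that effect (plain C, not the actual patch or planner code), the sketch below counts predicate evaluations: the indexed path touches only the precomputed matches and never re-tests the predicate, while an assert-style verification pass re-runs the full linear scan on top of it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: NMEMBERS "members", of which some match a predicate. */
#define NMEMBERS 8

static int work_done;           /* counts predicate evaluations */

static bool
matches(int i)
{
    work_done++;
    return (i % 2) == 0;        /* toy predicate: even members match */
}

/* Fast path: iterate only over a precomputed index of matching members. */
static int
sum_indexed(const int *index, int nindex)
{
    int sum = 0;
    for (int k = 0; k < nindex; k++)
        sum += index[k];        /* no predicate evaluation needed */
    return sum;
}

/* Slow path: scan every member, testing the predicate each time. */
static int
sum_linear(void)
{
    int sum = 0;
    for (int i = 0; i < NMEMBERS; i++)
        if (matches(i))
            sum += i;
    return sum;
}
```

Cross-checking that sum_indexed() and sum_linear() agree — as the assert-enabled build does — costs the entire linear scan in addition to the fast path, which is why the verification, not the index itself, dominates the assert-enabled timings above.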
Re: Adding a LogicalRepWorker type field
On Wed, Aug 2, 2023 at 3:35 PM Masahiko Sawada wrote: > > On Mon, Jul 31, 2023 at 10:47 AM Peter Smith wrote: > > > > Hi hackers, > > > > BACKGROUND: > > > > The logical replication has different worker "types" (e.g. apply > > leader, apply parallel, tablesync). > > > > They all use a common structure called LogicalRepWorker, but at times > > it is necessary to know what "type" of worker a given LogicalRepWorker > > represents. > > > > Once, there were just apply workers and tablesync workers - these were > > easily distinguished because only tablesync workers had a valid > > 'relid' field. Next, parallel-apply workers were introduced - these > > are distinguishable from the apply leaders by the value of > > 'leader_pid' field. > > > > > > PROBLEM: > > > > IMO, deducing the worker's type by examining multiple different field > > values seems a dubious way to do it. This maybe was reasonable enough > > when there were only 2 types, but as more get added it becomes > > increasingly complicated. > > > > SOLUTION: > > > > AFAIK none of the complications is necessary anyway; the worker type > > is already known at launch time, and it never changes during the life > > of the process. > > > > The attached Patch 0001 introduces a new enum 'type' field, which is > > assigned during the launch. > > > > ~ > > > > This change not only simplifies the code, but it also permits other > > code optimizations, because now we can switch on the worker enum type, > > instead of using cascading if/else. (see Patch 0002). > > > > Thoughts? > > I can see the problem you stated and it's true that the worker's type > never changes during its lifetime. But I'm not sure we really need to > add a new variable to LogicalRepWorker since we already have enough > information there to distinguish the worker's type. > > Currently, we deduce the worker's type by checking some fields that > never change such as relid and leader_piid. 
It's fine to me as long as > we encapcel the deduction of worker's type by macros (or inline > functions). Even with the proposed patch (0001 patch), we still need a > similar logic to determine the worker's type. Thanks for the feedback. I agree that the calling code will not look very different with/without this patch 0001, because everything is hidden by the macros/functions. But those HEAD macros/functions are also hiding the inefficiencies of the type deduction -- e.g. IMO it is quite strange that we can only recognize the worker is an "apply worker" by eliminating the other 2 possibilities. As further background, the idea for this patch came from considering another thread to "re-use" workers [1]. For example, when thinking about re-using tablesync workers the HEAD am_tablesync_worker() function begins to fall apart -- ie. it does not let you sensibly disassociate a tablesync worker from its relid (which you might want to do if tablesync workers were "pooled") because in the HEAD code any tablesync worker without a relid is (by definition) NOT recognized as a tablesync worker anymore! Those above issues just disappear when a type field is added. > > If we want to change both process_syncing_tables() and > should_apply_changes_for_rel() to use the switch statement instead of > using cascading if/else, I think we can have a function, say > LogicalRepWorkerType(), that returns LogicalRepWorkerType of the given > worker. > The point of that "switch" in patch 0002 was to demonstrate that with a worker-type enum we can write more *efficient* code in some places than the current cascading if/else. I think making a "switch" just for the sake of it, by adding more functions that hide yet more work, is going in the opposite direction. -- [1] reuse workers - https://www.postgresql.org/message-id/flat/CAGPVpCTq%3DrUDd4JUdaRc1XUWf4BrH2gdSNf3rtOMUGj9rPpfzQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
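A minimal C sketch of that argument (the struct and enum here are illustrative stand-ins, not the actual LogicalRepWorker definition): deducing the type from relid/leader_pid misclassifies a tablesync worker the moment it is disassociated from its table, while a type stored at launch stays correct.

```c
#include <assert.h>

typedef enum
{
    WORKERTYPE_TABLESYNC,
    WORKERTYPE_APPLY,
    WORKERTYPE_PARALLEL_APPLY,
} LogicalRepWorkerType;

typedef struct
{
    LogicalRepWorkerType type;  /* assigned once, at launch */
    unsigned int relid;         /* set only while syncing a table */
    int          leader_pid;    /* set only for parallel apply workers */
} Worker;

/* HEAD-style deduction: "apply" is only recognized by elimination. */
static LogicalRepWorkerType
deduce_type(const Worker *w)
{
    if (w->relid != 0)
        return WORKERTYPE_TABLESYNC;
    if (w->leader_pid != 0)
        return WORKERTYPE_PARALLEL_APPLY;
    return WORKERTYPE_APPLY;
}
```

Once a pooled tablesync worker clears its relid, deduce_type() reports it as an apply worker, while the stored field still identifies it correctly — which is the scenario from the worker-reuse thread.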
Re: Faster "SET search_path"
On Wed, 2023-08-02 at 01:14 -0400, Isaac Morland wrote: > I don’t think the fact that an optimization might suddenly not work > in a certain situation is a reason not to optimize. What would our > query planner look like if we took that approach? ... > Instead, we should try to find ways of making the performance more > transparent. We already have some features for this, but maybe more > can be done. Exactly; for the planner we have EXPLAIN. We could try to expose GUC switching costs that way. Or maybe users can start thinking of search_path as a performance- related setting to be careful with. Or we could try harder to optimize the switching costs so that it's not so bad. I just started and I already saw some nice speedups; with a few of the easy fixes done, the remaining opportunities should be more visible in the profiler. Regards, Jeff Davis
Re: Inaccurate comments in ReorderBufferCheckMemoryLimit()
On Wed, Aug 2, 2023 at 11:21 AM Amit Kapila wrote: > > On Tue, Aug 1, 2023 at 2:06 PM Masahiko Sawada wrote: > > > > On Tue, Aug 1, 2023 at 11:33 AM Amit Kapila wrote: > > > > > > On Mon, Jul 31, 2023 at 8:46 PM Masahiko Sawada > > > wrote: > > > > > > > > While reading the code, I realized that the following code comments > > > > might not be accurate: > > > > > > > > /* > > > > * Pick the largest transaction (or subtransaction) and evict > > > > it from > > > > * memory by streaming, if possible. Otherwise, spill to disk. > > > > */ > > > > if (ReorderBufferCanStartStreaming(rb) && > > > > (txn = ReorderBufferLargestStreamableTopTXN(rb)) != NULL) > > > > { > > > > /* we know there has to be one, because the size is not > > > > zero */ > > > > Assert(txn && rbtxn_is_toptxn(txn)); > > > > Assert(txn->total_size > 0); > > > > Assert(rb->size >= txn->total_size); > > > > > > > > ReorderBufferStreamTXN(rb, txn); > > > > } > > > > > > > > AFAICS since ReorderBufferLargestStreamableTopTXN() returns only > > > > top-level transactions, the comment above the if statement is not > > > > right. It would not pick a subtransaction. > > > > > > > > > > I think the subtransaction case is for the spill-to-disk case as both > > > cases are explained in the same comment. > > > > > > > Also, I'm not sure that the second comment "we know there has to be > > > > one, because the size is not zero" is right since there might not be > > > > top-transactions that are streamable. > > > > > > > > > > I think this comment is probably referring to asserts related to the > > > size similar to spill to disk case. > > > > > > How about if we just remove (or subtransaction) from the following > > > comment: "Pick the largest transaction (or subtransaction) and evict > > > it from memory by streaming, if possible. Otherwise, spill to disk."? 
> > > Then by referring to streaming/spill-to-disk cases, one can understand > > > in which cases only top-level xacts are involved and in which cases > > > both are involved. > > > > Sounds good. I've updated the patch accordingly. > > > > LGTM. Thank you for reviewing the patch! Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Extract numeric filed in JSONB more effectively
On Tue, Aug 1, 2023 at 12:39 PM Andy Fan wrote: > > Hi: > > Currently if we want to extract a numeric field in jsonb, we need to use > the following expression: cast (a->>'a' as numeric). It will turn a numeric > to text first and then turn the text to numeric again. See > jsonb_object_field_text and JsonbValueAsText. However the binary format > of numeric in JSONB is compatible with the numeric in SQL, so I think we > can have an operator to extract the numeric directly. If the value of a given > field is not a numeric data type, an error will be raised, this can be > documented. > > In this patch, I added a new operator for this purpose, here is the > performance gain because of this. > > create table tb (a jsonb); > insert into tb select '{"a": 1}'::jsonb from generate_series(1, 10)i; > > current method: > select count(*) from tb where cast (a->>'a' as numeric) = 2; > 167ms. > > new method: > select count(*) from tb where a@->'a' = 2; > 65ms. > > Is this the right way to go? Testcase, document and catalog version are > updated. > > > -- > Best Regards > Andy Fan return PointerGetDatum(v->val.numeric); should be something like PG_RETURN_NUMERIC(v->val.numeric); ?
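The cost the patch removes can be seen in miniature with plain C doubles. This is only an analogy — PostgreSQL's numeric type is not a double, and this is not the patch's code — but it shows the shape of the two paths: the cast route serializes the stored binary value to text and parses it back, while the proposed operator hands back the binary form directly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Old path, analogous to cast(a->>'a' as numeric): binary -> text -> binary. */
static double
extract_via_text(double stored)
{
    char buf[64];
    snprintf(buf, sizeof buf, "%.17g", stored); /* format the value as text */
    return strtod(buf, NULL);                   /* ...and parse it right back */
}

/* New path, analogous to a@->'a': return the stored binary form as-is. */
static double
extract_direct(double stored)
{
    return stored;
}
```

Both paths produce the same value; the difference is purely the wasted format-and-reparse work, which is roughly what the 167 ms vs. 65 ms numbers above reflect.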