Re: Delay locking partitions during query execution
On Tue, Mar 5, 2019 at 8:04 PM David Rowley wrote: > Actually, I'm not sure it could work at all. It does not seem very > safe to lookup a partition's parent without actually holding a lock on > the partition and we can't lock the partition and then lock each > parent in turn as that's the exact opposite locking order that we do > when querying a partitioned table, so could result in deadlocking. > > Many moons ago in the 0002 patch in [1] I proposed to change the way > we store partition hierarchies which involved ditching bool > pg_class.relispartition in favour of Oid pg_class.relpartitionparent. > That would make parent lookups pretty fast, but... given the above it > seems like we couldn't do this at all. One thing that is both kinda nice and kinda strange about our heavyweight lock manager is that it has no idea what it is locking. If you say "please lock OID 12345" for me, it does, but it doesn't know anything about the relationship between that OID and any other thing you might want to lock. Compare that to what Gray and Reuter describe in "Transaction Processing: Concepts and Techniques", Section 7.8, Granular Locking. There, there is an idea that the set of possible locks forms a tree, and that locking a node of the tree is tantamount to locking all of its descendants. Such a concept would be useful here: you could take e.g. AccessShareLock on the root of the partitioning hierarchy and that would in effect give you that lock mode on every partition, but without needing to make a separate entry for each partition lock. Sadly, implementing such a thing for PostgreSQL seems extremely challenging. We'd somehow have to build up in shared memory an idea of what the locking hierarchy was, and then update it as DDL happens, and drop entries that aren't interesting any more so we don't run out of shared memory.
Performance and concurrency would be really hard problems, and assumptions about the current locking model are baked into code in MANY parts of the system, so finding everything that needed to be (or could be) changed would probably be extremely challenging. But if you could make it all work we might well end up in a better state than we are today. However, I'm leaving this problem for a time when I have six months or a year to do nothing else, because I'm pretty sure that's what it would take. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
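The granular-locking scheme described above can be sketched in a few lines. This is an illustrative toy, not PostgreSQL code, and every name in it is invented: intention locks (IS/IX) are placed on all ancestors so that a single S or X lock on a node covers its whole subtree without materializing a lock entry per descendant.

```python
class GranularLockManager:
    """Toy granular locking (after Gray & Reuter, sec. 7.8): S/X on a node
    implicitly covers its subtree; IS/IX intention locks on the ancestors
    make conflicting requests visible at the highest locked node."""

    # pairs of modes that cannot coexist on the same node
    CONFLICTS = {("IS", "X"), ("X", "IS"),
                 ("IX", "S"), ("S", "IX"),
                 ("IX", "X"), ("X", "IX"),
                 ("S", "X"), ("X", "S"),
                 ("X", "X")}

    def __init__(self, parent_of):
        self.parent_of = parent_of              # node -> parent (None = root)
        self.held = {n: [] for n in parent_of}  # node -> list of granted modes

    def _path(self, node):
        """Ancestors of node, root first, ending at node itself."""
        path = []
        while node is not None:
            path.append(node)
            node = self.parent_of[node]
        return path[::-1]

    def try_lock(self, node, mode):
        """Grant `mode` ("S" or "X") on node, with intention locks above."""
        intent = "IS" if mode == "S" else "IX"
        path = self._path(node)
        wanted = [(n, intent) for n in path[:-1]] + [(node, mode)]
        if any((m, h) in self.CONFLICTS
               for n, m in wanted for h in self.held[n]):
            return False
        for n, m in wanted:
            self.held[n].append(m)
        return True
```

With this scheme, a shared lock on the root of a partitioning hierarchy blocks an exclusive lock on any partition, yet the partitions themselves carry no lock entries at all.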
Re: Delay locking partitions during query execution
On Wed, 6 Mar 2019 at 04:46, Tomas Vondra wrote: > > On 3/5/19 6:55 AM, David Rowley wrote: > > The only way I can think to fix this is to just never lock partitions > > at all, and if a lock is to be obtained on a partition, it must be > > instead obtained on the top-level partitioned table. That's a rather > > large change that could have large consequences, and I'm not even sure > > it's possible since we'd need to find the top-level parent before > > obtaining the lock, by then the hierarchy might have changed and we'd > > need to recheck, which seems like quite a lot of effort just to obtain > > a lock... Apart from that, it's not this patch, so looks like I'll > > need to withdraw this one :-( > > > > So you're saying we could > > 1) lookup the parent and lock it > 2) repeat the lookup to verify it did not change > > I think that could still be a win, assuming that most hierarchies will > be rather shallow (I'd say 2-3 levels will cover like 95% of cases, and > 4 levels would be 100% in practice). And the second lookup should be > fairly cheap thanks to syscache and the fact that the hierarchies do not > change very often. Actually, I'm not sure it could work at all. It does not seem very safe to lookup a partition's parent without actually holding a lock on the partition and we can't lock the partition and then lock each parent in turn as that's the exact opposite locking order that we do when querying a partitioned table, so could result in deadlocking. Many moons ago in the 0002 patch in [1] I proposed to change the way we store partition hierarchies which involved ditching bool pg_class.relispartition in favour of Oid pg_class.relpartitionparent. That would make parent lookups pretty fast, but... given the above it seems like we couldn't do this at all. 
[1] https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 3/5/19 6:55 AM, David Rowley wrote: > On Sat, 2 Feb 2019 at 02:52, Robert Haas wrote: >> I think the key question here is whether or not you can cope with >> someone having done arbitrary AEL-requiring modifications to the >> delaylocked partitions. If you can, the fact that the plan might >> sometimes be out-of-date is an inevitable consequence of doing this at >> all, but I think we can argue that it's an acceptable consequence as >> long as the resulting behavior is not too bletcherous. If you can't, >> then this patch is dead. > > I spent some time looking at this patch again and thinking about the > locking. I ended up looking through each alter table subcommand by way > of AlterTableGetLockLevel() and going through scenarios with each of > them in my head to try to understand: > > a) If the subcommand can even be applied to a leaf partition; and > b) Can the subcommand cause a cached plan to become invalid? > > I did all the easy ones first then started on the harder ones. I came > to AT_DropConstraint and imagined the following scenario: > > -- ** session 1 > create table listp (a int) partition by list(a); > create table listp1 partition of listp for values in(1); > create table listp2 partition of listp for values in(2); > > create index listp1_a_idx on listp1 (a); > create index listp2_a_idx on listp2 (a); > > set enable_seqscan = off; > set plan_cache_mode = force_generic_plan; > prepare q1 (int) as select * from listp where a = $1; > execute q1 (1); > begin; > execute q1 (1); > > > -- ** session 2 > drop index listp2_a_idx; > > -- ** session 1 > execute q1 (2); > ERROR: could not open relation with OID 16401 > > The only way I can think to fix this is to just never lock partitions > at all, and if a lock is to be obtained on a partition, it must be > instead obtained on the top-level partitioned table. 
That's a rather > large change that could have large consequences, and I'm not even sure > it's possible since we'd need to find the top-level parent before > obtaining the lock, by then the hierarchy might have changed and we'd > need to recheck, which seems like quite a lot of effort just to obtain > a lock... Apart from that, it's not this patch, so looks like I'll > need to withdraw this one :-( > So you're saying we could 1) lookup the parent and lock it 2) repeat the lookup to verify it did not change I think that could still be a win, assuming that most hierarchies will be rather shallow (I'd say 2-3 levels will cover like 95% of cases, and 4 levels would be 100% in practice). And the second lookup should be fairly cheap thanks to syscache and the fact that the hierarchies do not change very often. I can't judge how invasive this patch would be, but I agree it's more complex than the originally proposed patch. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
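The lookup-then-verify approach sketched above (1: find the top-level parent and lock it, 2: repeat the lookup to check it did not change) amounts to an optimistic retry loop. A minimal sketch, with `get_parent` standing in for an unlocked syscache lookup and all names invented for illustration:

```python
def lock_topmost_parent(rel, get_parent, lock, unlock, max_retries=10):
    """Lock the root of rel's partition hierarchy, verifying afterwards
    that the hierarchy did not change while we held no lock."""
    for _ in range(max_retries):
        # 1) walk to the top-level parent without holding any lock
        root = rel
        while (p := get_parent(root)) is not None:
            root = p
        lock(root)
        # 2) repeat the lookup; if we reach the same root, the lock we
        #    now hold really does cover rel's current hierarchy
        check = rel
        while (p := get_parent(check)) is not None:
            check = p
        if check == root:
            return root
        unlock(root)  # hierarchy changed underneath us; try again
    raise RuntimeError("partition hierarchy kept changing; giving up")
```

Since hierarchies are shallow (2-3 levels in most cases) and rarely change, the second walk is cheap and the retry path should almost never be taken.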
Re: Delay locking partitions during query execution
On Sat, 2 Feb 2019 at 02:52, Robert Haas wrote: > I think the key question here is whether or not you can cope with > someone having done arbitrary AEL-requiring modifications to the > delaylocked partitions. If you can, the fact that the plan might > sometimes be out-of-date is an inevitable consequence of doing this at > all, but I think we can argue that it's an acceptable consequence as > long as the resulting behavior is not too bletcherous. If you can't, > then this patch is dead. I spent some time looking at this patch again and thinking about the locking. I ended up looking through each alter table subcommand by way of AlterTableGetLockLevel() and going through scenarios with each of them in my head to try to understand: a) If the subcommand can even be applied to a leaf partition; and b) Can the subcommand cause a cached plan to become invalid? I did all the easy ones first then started on the harder ones. I came to AT_DropConstraint and imagined the following scenario:

-- ** session 1
create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);

create index listp1_a_idx on listp1 (a);
create index listp2_a_idx on listp2 (a);

set enable_seqscan = off;
set plan_cache_mode = force_generic_plan;
prepare q1 (int) as select * from listp where a = $1;
execute q1 (1);
begin;
execute q1 (1);

-- ** session 2
drop index listp2_a_idx;

-- ** session 1
execute q1 (2);
ERROR:  could not open relation with OID 16401

The only way I can think to fix this is to just never lock partitions at all, and if a lock is to be obtained on a partition, it must be instead obtained on the top-level partitioned table.
That's a rather large change that could have large consequences, and I'm not even sure it's possible since we'd need to find the top-level parent before obtaining the lock, by then the hierarchy might have changed and we'd need to recheck, which seems like quite a lot of effort just to obtain a lock... Apart from that, it's not this patch, so looks like I'll need to withdraw this one :-( -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On Tue, 19 Feb 2019 at 12:50, Tom Lane wrote: > > [ reposting some questions asked in the wrong thread ] > > What I'd like to understand about this patch is how it relates > to Amit L.'s work on making the planner faster for partitioned > UPDATE/DELETE cases (https://commitfest.postgresql.org/22/1778/). > I think that that might render this moot? Amit's already > getting rid of unnecessary partition locking. (I'm still not certain of this patch myself and it does need me to do some more analysis to check on what was mentioned recently upthread, but...) As for the overlap with Amit's patch, it is true he's reducing the locking down to just non-pruned partitions, so for this patch, that only leaves us able to skip locking partitions that are about to be run-time pruned. That happens to be quite useful when using generic plans as it's quite possible that the planner is unable to prune any partitions and the executor would otherwise have to lock all of them, but perhaps only end up scanning 1 of them. It's true that when plan_cache_mode = auto a generic plan doing this is unlikely to ever be chosen since the planner does not cost in the possibility of run-time pruning, but there's also plan_cache_mode = force_generic_plan. > In any case, I'd counsel against putting planner outputs into > RangeTblEntry. Someday we ought to make parse trees read-only > to the planner, as plan trees are to the executor; defining them > to carry planner results kneecaps any such effort before it starts. > Not to mention the unpleasant consequences for the footprint > of this patch. I agree that one day it would be nice to move towards having the planner never touch the parse tree, but I didn't really see it as anything much beyond what rellockmode is today. If we one day have a solution for that, then likely the same solution will support fixing the delaylock flag too. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
[ reposting some questions asked in the wrong thread ] What I'd like to understand about this patch is how it relates to Amit L.'s work on making the planner faster for partitioned UPDATE/DELETE cases (https://commitfest.postgresql.org/22/1778/). I think that that might render this moot? Amit's already getting rid of unnecessary partition locking. In any case, I'd counsel against putting planner outputs into RangeTblEntry. Someday we ought to make parse trees read-only to the planner, as plan trees are to the executor; defining them to carry planner results kneecaps any such effort before it starts. Not to mention the unpleasant consequences for the footprint of this patch. regards, tom lane
Re: Delay locking partitions during query execution
On Sat, 2 Feb 2019 at 13:43, Tomas Vondra wrote: > > On 2/1/19 2:51 PM, Robert Haas wrote: > >> (I admit to not having the best grasp on how all this works, so feel > >> free to shoot this down) > >> > > I think the key question here is whether or not you can cope with > > someone having done arbitrary AEL-requiring modifications to the > > delaylocked partitions. If you can, the fact that the plan might > > sometimes be out-of-date is an inevitable consequence of doing this at > > all, but I think we can argue that it's an acceptable consequence as > > long as the resulting behavior is not too bletcherous. If you can't, > > then this patch is dead. > > I'm a bit confused - does the patch actually change anything here? As > demonstrated earlier in this thread, it actually behaves just like > master. No? I guess Robert meant if it, say, failed to execute a cached plan with "unable to open relation ..." after someone concurrently did something like ALTER TABLE ... SET TABLESPACE on one of the partitions. This particular case can't happen with the patch since we always accept invalidation messages after we obtain a new lock, but I'm working through various scenarios just in case there is something that could invalidate the plan, rather than just the relcache entry. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 2/1/19 2:51 PM, Robert Haas wrote: >> (I admit to not having the best grasp on how all this works, so feel >> free to shoot this down) >> > I think the key question here is whether or not you can cope with > someone having done arbitrary AEL-requiring modifications to the > delaylocked partitions. If you can, the fact that the plan might > sometimes be out-of-date is an inevitable consequence of doing this at > all, but I think we can argue that it's an acceptable consequence as > long as the resulting behavior is not too bletcherous. If you can't, > then this patch is dead. I'm a bit confused - does the patch actually change anything here? As demonstrated earlier in this thread, it actually behaves just like master. No? -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Delay locking partitions during query execution
On Thu, Jan 31, 2019 at 8:29 PM David Rowley wrote: > I think perhaps accepting invalidations at the start of the statement > might appear to fix the problem in master, but I think there's still a > race condition around CheckCachedPlan() since we'll ignore > invalidation messages when we perform LockRelationOid() in > AcquireExecutorLocks() due to the lock already being held. So there's > likely still a window where the invalidation message could come in and > we miss it. That said, if a lock is already taken in one session, and > some other session alters some relation we have a lock on, then that > alteration can't have been done with an AEL, as that would have > blocked on our lock, so it must be some ShareUpdateExclusiveLock > change, and if so that change must not be important enough to block > concurrently executing queries, so likely shouldn't actually break > anything. > > So in summary, I think that checking for invalidation messages at the > start of the statement allows us to replan within a transaction, but > does not guarantee we get the exact correct plan for the current > settings. Right, but that's also an inevitable consequence of postponing locking. If you start running the statement without locking every object involved in the query, then somebody could be holding an AccessExclusiveLock on that object and changing it in any way that they're allowed to do without holding a lock that does conflict with one you're holding. Nothing we do with invalidation messages can prevent that from happening, though it can make the race narrower. We *must* be able to cope with having the out-of-date plan or this whole idea is sunk. > I also don't have a full grasp on why we must ignore invalidation > messages when we already have a lock on the relation. I assume it's > not a huge issue since obtaining a new lock on any other relation will > process the invalidation queue anyway.
I know that f868a8143a9 > recently made some changes to fix some recursive issues around these > invalidations. It's not necessary. It's a performance optimization. The idea is: we need our caches to be up to date. Anything that changes an object should take AccessExclusiveLock, send invalidation messages, and only then release the AccessExclusiveLock. Anything that looks at an object should acquire at least AccessShareLock and then process invalidation messages. This protocol guarantees that any changes that are made under AEL will be seen in a race-free way, since the invalidation messages are sent before releasing the lock and processed after taking one. But it doesn't require processing those messages when we haven't taken a lock, so we don't. > (I admit to not having the best grasp on how all this works, so feel > free to shoot this down) I think the key question here is whether or not you can cope with someone having done arbitrary AEL-requiring modifications to the delaylocked partitions. If you can, the fact that the plan might sometimes be out-of-date is an inevitable consequence of doing this at all, but I think we can argue that it's an acceptable consequence as long as the resulting behavior is not too bletcherous. If you can't, then this patch is dead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
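The protocol described above (writer: take AccessExclusiveLock, send invalidation messages, only then release; reader: take a lock, then process pending invalidations) can be modeled in miniature. This is an illustration of the ordering argument, not the real sinval machinery, and all names are invented:

```python
import threading

class Catalog:
    """Shared state: a heavyweight-lock stand-in, the authoritative
    catalog contents, and a queue of invalidation messages."""
    def __init__(self):
        self.lock = threading.Lock()  # stand-in for AccessExclusiveLock
        self.version = 0              # authoritative object state
        self.inval_queue = []         # messages sent by writers

class Backend:
    """One session with a private relcache-style cache."""
    def __init__(self, catalog):
        self.catalog = catalog
        self.cache = {"version": 0}   # may go stale between locks
        self.consumed = 0             # position in the shared queue

    def writer_alter(self):
        c = self.catalog
        with c.lock:                  # change + send inval BEFORE release
            c.version += 1
            c.inval_queue.append(c.version)

    def reader_open(self):
        c = self.catalog
        with c.lock:                  # take the lock, THEN process invals
            while self.consumed < len(c.inval_queue):
                self.cache["version"] = c.inval_queue[self.consumed]
                self.consumed += 1
            return self.cache["version"]
```

Because the writer's message is queued before its lock is released and the reader drains the queue after acquiring its own lock, any reader that locks the object after the writer released is guaranteed to see the change; no processing is needed while no lock is being taken, which is exactly the optimization being discussed.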
Re: Delay locking partitions during query execution
On Fri, 1 Feb 2019 at 10:32, Robert Haas wrote: > > On Sun, Jan 27, 2019 at 8:26 PM David Rowley > wrote: > > One way around this would be to always perform an invalidation on the > > partition's parent when performing a relcache invalidation on the > > partition. We could perhaps invalidate all the way up to the top > > level partitioned table, that way we could just obtain a lock on the > > target partitioned table during AcquireExecutorLocks(). I'm currently > > only setting the delaylock flag to false for leaf partitions only. > > Would this problem go away if we adopted the proposal discussed in > http://postgr.es/m/24823.1544220...@sss.pgh.pa.us and, if so, is that > a good fix? > > I don't quite understand why this is happening. It seems like as long > as you take at least one new lock, you'll process *every* pending > invalidation message, and that should trigger replanning as long as > the dependencies are correct. But maybe the issue is that you hold > all the other locks you need already, and the lock on the partition at > issue is only acquired during execution, at which point it's too late > to replan. If so, then I think something along the lines of the above > might make a lot of sense. I think perhaps accepting invalidations at the start of the statement might appear to fix the problem in master, but I think there's still a race condition around CheckCachedPlan() since we'll ignore invalidation messages when we perform LockRelationOid() in AcquireExecutorLocks() due to the lock already being held. 
So there's likely still a window where the invalidation message could come in and we miss it. That said, if a lock is already taken in one session, and some other session alters some relation we have a lock on, then that alteration can't have been done with an AEL, as that would have blocked on our lock, so it must be some ShareUpdateExclusiveLock change, and if so that change must not be important enough to block concurrently executing queries, so likely shouldn't actually break anything. So in summary, I think that checking for invalidation messages at the start of the statement allows us to replan within a transaction, but does not guarantee we get the exact correct plan for the current settings. I also don't have a full grasp on why we must ignore invalidation messages when we already have a lock on the relation. I assume it's not a huge issue since obtaining a new lock on any other relation will process the invalidation queue anyway. I know that f868a8143a9 recently made some changes to fix some recursive issues around these invalidations. (I admit to not having the best grasp on how all this works, so feel free to shoot this down) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 1/31/19 10:31 PM, Robert Haas wrote: > On Sun, Jan 27, 2019 at 8:26 PM David Rowley > wrote: >> One way around this would be to always perform an invalidation on the >> partition's parent when performing a relcache invalidation on the >> partition. We could perhaps invalidate all the way up to the top >> level partitioned table, that way we could just obtain a lock on the >> target partitioned table during AcquireExecutorLocks(). I'm currently >> only setting the delaylock flag to false for leaf partitions only. > > Would this problem go away if we adopted the proposal discussed in > http://postgr.es/m/24823.1544220...@sss.pgh.pa.us and, if so, is that > a good fix? > > I don't quite understand why this is happening. It seems like as long > as you take at least one new lock, you'll process *every* pending > invalidation message, and that should trigger replanning as long as > the dependencies are correct. But maybe the issue is that you hold > all the other locks you need already, and the lock on the partition at > issue is only acquired during execution, at which point it's too late > to replan. If so, then I think something along the lines of the above > might make a lot of sense. > It happens because ConditionalLockRelation does this:

if (res != LOCKACQUIRE_ALREADY_CLEAR)
{
    AcceptInvalidationMessages();
    MarkLockClear(locallock);
}

and with the prepared statement already planned, we end up skipping that because res == LOCKACQUIRE_ALREADY_CLEAR. Which happens because GrantLockLocal (called from LockAcquireExtended) finds the relation as already locked. I don't know if this is correct or not, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
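The skip described above, where an already-held lock short-circuits AcceptInvalidationMessages(), can be modeled in a few lines. Everything here is an invented stand-in mirroring the shape of the `if (res != LOCKACQUIRE_ALREADY_CLEAR)` test, not the real lock manager:

```python
LOCKACQUIRE_OK = "ok"
LOCKACQUIRE_ALREADY_CLEAR = "already_clear"

class Session:
    """One backend: holds locks, caches a plan, reads a shared inval queue."""
    def __init__(self, shared_queue):
        self.queue = shared_queue  # invalidation messages from all sessions
        self.held = set()          # OIDs this session has locked
        self.plan_valid = True     # stand-in for a cached generic plan
        self.consumed = 0          # position in the shared queue

    def lock_relation(self, oid):
        res = (LOCKACQUIRE_ALREADY_CLEAR if oid in self.held
               else LOCKACQUIRE_OK)
        self.held.add(oid)
        # Mirrors the behaviour discussed: invalidations are only
        # processed when the lock wasn't already held and marked clear.
        if res != LOCKACQUIRE_ALREADY_CLEAR:
            self.accept_invalidation_messages()
        return res

    def accept_invalidation_messages(self):
        while self.consumed < len(self.queue):
            self.plan_valid = False  # any message resets the cached plan
            self.consumed += 1
```

Re-locking a relation that is already held leaves a concurrently queued invalidation unconsumed, while taking any genuinely new lock drains the queue, which matches the behaviour seen with the prepared statement in the thread.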
Re: Delay locking partitions during query execution
On Sun, Jan 27, 2019 at 8:26 PM David Rowley wrote: > One way around this would be to always perform an invalidation on the > partition's parent when performing a relcache invalidation on the > partition. We could perhaps invalidate all the way up to the top > level partitioned table, that way we could just obtain a lock on the > target partitioned table during AcquireExecutorLocks(). I'm currently > only setting the delaylock flag to false for leaf partitions only. Would this problem go away if we adopted the proposal discussed in http://postgr.es/m/24823.1544220...@sss.pgh.pa.us and, if so, is that a good fix? I don't quite understand why this is happening. It seems like as long as you take at least one new lock, you'll process *every* pending invalidation message, and that should trigger replanning as long as the dependencies are correct. But maybe the issue is that you hold all the other locks you need already, and the lock on the partition at issue is only acquired during execution, at which point it's too late to replan. If so, then I think something along the lines of the above might make a lot of sense. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Delay locking partitions during query execution
On Tue, 29 Jan 2019 at 19:42, Amit Langote wrote: > However, I tried the example as you described and the plan *doesn't* > change due to concurrent update of reloptions with master (without the > patch) either. Well, I didn't think to try that. I just assumed I had broken it. It could well be related to lowering the lock levels on setting these reloptions. Once upon a time, they required an AEL but that was changed in e5550d5fec6. So previous to that commit you'd have been blocked from making the concurrent change while the other session was in a transaction. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 2019/01/28 20:27, David Rowley wrote: > On Mon, 28 Jan 2019 at 20:45, Amit Langote > wrote: >> It seems to me that plancache.c doesn't really need to perform >> AcquireExecutorLocks()/LockRelationOid() to learn that a partition's >> reloptions property has changed to discard a generic plan and build a new >> one. AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan >> by observing that an invalidation message that it received either from >> the same session or from another session belongs to one of the relations >> in PlannedStmt.relationOids. That list must already contain all >> partitions' OIDs. > > Really? So when you tried my case you properly got a plan with a > non-parallel Seq Scan on listp1? > > I imagine you didn't with yours since we check for relcache > invalidations at the start of a transaction. I performed both my > EXECUTEs in the same transaction. Yeah, I performed each EXECUTE in a new transaction, so not the same case as yours. Sorry about the noise. However, I tried the example as you described and the plan *doesn't* change due to concurrent update of reloptions with master (without the patch) either. 
session 1:

begin;
prepare q1 as select count(*) from listp;
explain (costs off, analyze, timing off, summary off) execute q1;
                              QUERY PLAN
──────────────────────────────────────────────────────────────────────
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=7 loops=3)
                     ->  Parallel Seq Scan on listp1 (actual rows=3 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=5 loops=2)
(8 rows)

session 2:

alter table listp1 set (parallel_workers=0);

session 1:

explain (costs off, analyze, timing off, summary off) execute q1;
                              QUERY PLAN
──────────────────────────────────────────────────────────────────────
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=7 loops=3)
                     ->  Parallel Seq Scan on listp1 (actual rows=3 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=5 loops=2)
(8 rows)

I then built master with -DCATCACHE_FORCE_RELEASE and the plan does change, but only because of syscache misses here and there resulting in opening the erring system catalog, which then does AcceptInvalidationMessages(). In the normal build, invalidation messages don't seem to be processed even by calling AcquireExecutorLocks(), which is perhaps not normal. It seems that the following condition in LockRelationOid is never true when called from AcquireExecutorLocks:

if (res != LOCKACQUIRE_ALREADY_CLEAR)
{
    AcceptInvalidationMessages();
    MarkLockClear(locallock);
}

Bug, maybe?

Thanks, Amit
Re: Delay locking partitions during query execution
On Mon, 28 Jan 2019 at 20:45, Amit Langote wrote: > It seems to me that plancache.c doesn't really need to perform > AcquireExecutorLocks()/LockRelationOid() to learn that a partition's > reloptions property has changed to discard a generic plan and build a new > one. AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan > by observing that an invalidation message that it received either from > the same session or from another session belongs to one of the relations > in PlannedStmt.relationOids. That list must already contain all > partitions' OIDs. Really? So when you tried my case you properly got a plan with a non-parallel Seq Scan on listp1? I imagine you didn't with yours since we check for relcache invalidations at the start of a transaction. I performed both my EXECUTEs in the same transaction. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 2019/01/28 10:26, David Rowley wrote: > On Tue, 4 Dec 2018 at 00:42, David Rowley > wrote: >> Over here and along similar lines to the above, but this time I'd like >> to take this even further and change things so we don't lock *any* >> partitions during AcquireExecutorLocks() and instead just lock them >> when we first access them with ExecGetRangeTableRelation(). This >> improves the situation when many partitions get run-time pruned, as >> we'll never bother locking those at all since we'll never call >> ExecGetRangeTableRelation() on them. We'll of course still lock the >> partitioned table so that plan invalidation works correctly. > > I started looking at this patch again and I do see a problem with it. > Let me explain: > > Currently during LockRelationOid() when we obtain a lock on a relation > we'll check for rel cache invalidation messages. This means that > during AcquireExecutorLocks(), as called from CheckCachedPlan(), we > could miss invalidation messages since we're no longer necessarily > locking the partitions. > > I think this only creates problems for partition's reloptions being > changed and cached plans possibly being not properly invalidated when > that happens, but I might be wrong about that, but if I'm not, there > still also might be more important reasons to ensure we invalidate the > plan properly in the future. It seems to me that plancache.c doesn't really need to perform AcquireExecutorLocks()/LockRelationOid() to learn that a partition's reloptions property has changed to discard a generic plan and build a new one. AFAICT, PlanCacheRelCallback() takes care of resetting a cached plan by observing that an invalidation message that it received either from the same session or from another session belongs to one of the relations in PlannedStmt.relationOids. That list must already contain all partitions' OIDs. 
Session 1:

prepare q as select count(*) from p;
explain (costs off) execute q(1);
                  QUERY PLAN
──────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p1
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)

-- same session (p1 can no longer use parallel scan)
alter table p1 set (parallel_workers=0);
explain (costs off) execute q(1);
                  QUERY PLAN
──────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Seq Scan on p1
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)

-- another session
alter table p1 reset (parallel_workers);

-- back to session 1 (p1 can now use parallel scan)
explain (costs off) execute q(1);
                  QUERY PLAN
──────────────────────────────────────────────
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Append
                     ->  Parallel Seq Scan on p4
                     ->  Parallel Seq Scan on p1
                     ->  Parallel Seq Scan on p2
                     ->  Parallel Seq Scan on p3
                     ->  Parallel Seq Scan on p_def
(10 rows)

Thanks, Amit
Re: Delay locking partitions during query execution
On Tue, 4 Dec 2018 at 00:42, David Rowley wrote: > Over here and along similar lines to the above, but this time I'd like > to take this even further and change things so we don't lock *any* > partitions during AcquireExecutorLocks() and instead just lock them > when we first access them with ExecGetRangeTableRelation(). This > improves the situation when many partitions get run-time pruned, as > we'll never bother locking those at all since we'll never call > ExecGetRangeTableRelation() on them. We'll of course still lock the > partitioned table so that plan invalidation works correctly. I started looking at this patch again and I do see a problem with it. Let me explain: Currently during LockRelationOid() when we obtain a lock on a relation we'll check for rel cache invalidation messages. This means that during AcquireExecutorLocks(), as called from CheckCachedPlan(), we could miss invalidation messages since we're no longer necessarily locking the partitions. I think this only creates problems for partition's reloptions being changed and cached plans possibly being not properly invalidated when that happens, but I might be wrong about that, but if I'm not, there still also might be more important reasons to ensure we invalidate the plan properly in the future. 
The following shows the problem:

create table listp (a int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
insert into listp select x from generate_series(1,2)x, generate_series(1,10);
analyze listp;

-- session 1
begin;
prepare q1 as select count(*) from listp;
explain (costs off, analyze, timing off, summary off) execute q1;

                              QUERY PLAN
----------------------------------------------------------------------
 Finalize Aggregate (actual rows=1 loops=1)
   ->  Gather (actual rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual rows=1 loops=3)
               ->  Parallel Append (actual rows=7 loops=3)
                     ->  Parallel Seq Scan on listp2 (actual rows=3 loops=3)
                     ->  Parallel Seq Scan on listp1 (actual rows=10 loops=1)
(8 rows)

-- session 2
alter table listp1 set (parallel_workers=0);

-- session 1
explain (costs off, analyze, timing off, summary off) execute q1;
-- same as previous plan, i.e. still does a parallel scan on listp1.

One way around this would be to always perform an invalidation on the partition's parent when performing a relcache invalidation on the partition. We could perhaps invalidate all the way up to the top-level partitioned table; that way we could just obtain a lock on the target partitioned table during AcquireExecutorLocks(). I'm currently setting the delaylock flag to false for leaf partitions only.

It's not particularly great to think of invalidating the partitioned table's relcache entry for this, but it's also not so great that we lock all partitions when run-time pruning might prune them away anyway. I don't see a way that we can have both, but I'm happy to hear people's thoughts about this.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
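The cascading-invalidation idea above can be sketched with a toy model (plain Python, not PostgreSQL code; the class and function names are invented for illustration). The point is that invalidating a leaf partition also marks every ancestor dirty, so a cached plan that only registered a dependency on the partitioned parent would still get discarded:

```python
# Toy model of cascading relcache invalidation up a partition tree.
# Not PostgreSQL code; all names here are invented for illustration.

class Relation:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def invalidate(rel, invalidated):
    """Invalidate rel and every ancestor up to the partition root."""
    while rel is not None:
        invalidated.add(rel.name)
        rel = rel.parent

# listp is the partitioned parent; listp1/listp2 are its leaf partitions.
listp = Relation("listp")
listp1 = Relation("listp1", parent=listp)
listp2 = Relation("listp2", parent=listp)

invalidated = set()
invalidate(listp1, invalidated)   # e.g. ALTER TABLE listp1 SET (...)

# A cached plan that only registered a dependency on "listp" would now
# be discarded, because the parent was invalidated along with the leaf.
assert invalidated == {"listp1", "listp"}
```

Under this scheme, AcquireExecutorLocks() would only need to lock (and thereby see invalidations for) the top-level partitioned table.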
Re: Delay locking partitions during query execution
On Sat, 12 Jan 2019 at 23:42, David Rowley wrote:
> I've attached a rebased version of this. The previous version
> conflicted with some changes made in b60c397599.

I've attached another rebased version. This one fixes up the conflict with e0c4ec07284.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

v3-0001-Allow-lock-acquisitions-for-partitions-to-be-dela.patch
Description: Binary data
Re: Delay locking partitions during query execution
On Thu, 17 Jan 2019 at 17:18, Amit Langote wrote:
> On 2019/01/04 9:53, David Rowley wrote:
> > Without PREPAREd statements, if the planner itself was unable to prune
> > the partitions it would already have obtained the lock during
> > planning, so AcquireExecutorLocks(), in this case, would bump into the
> > local lock hash table entry and forego trying to obtain the lock
> > itself. That's not free, but it's significantly faster than obtaining
> > a lock.
>
> Hmm, AcquireExecutorLocks is only called if prepared statements are used
> and that too if a generic cached plan is reused.
>
> GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

Ah, okay. Thanks for the correction. Either way, it's really only useful when run-time pruning prunes partitions before initialising the Append/MergeAppend node.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 2019/01/04 9:53, David Rowley wrote:
> Without PREPAREd statements, if the planner itself was unable to prune
> the partitions it would already have obtained the lock during
> planning, so AcquireExecutorLocks(), in this case, would bump into the
> local lock hash table entry and forego trying to obtain the lock
> itself. That's not free, but it's significantly faster than obtaining
> a lock.

Hmm, AcquireExecutorLocks is only called if prepared statements are used, and then only if a generic cached plan is reused:

GetCachedPlan -> CheckCachedPlan -> AcquireExecutorLocks

In GetCachedPlan:

    if (!customplan)
    {
        if (CheckCachedPlan(plansource))

Thanks,
Amit
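The call path above can be modelled with a small sketch (plain Python, not PostgreSQL code; the control flow is simplified and the names merely echo the functions mentioned above). It shows why custom plans never reach AcquireExecutorLocks, while reusing a generic cached plan does:

```python
# Toy model of the GetCachedPlan path described above.  Simplified:
# in the real code AcquireExecutorLocks is called from within
# CheckCachedPlan as part of revalidating the generic plan.

def get_cached_plan(customplan, plan_valid, trace):
    if not customplan:
        trace.append("CheckCachedPlan")
        if plan_valid:
            # Revalidating the generic plan is where executor locks
            # are taken on the plan's relations.
            trace.append("AcquireExecutorLocks")
            return "reuse generic plan"
        return "replan"
    return "build custom plan"   # never reaches AcquireExecutorLocks

trace = []
assert get_cached_plan(customplan=True, plan_valid=True, trace=trace) == "build custom plan"
assert trace == []               # custom plans skip executor lock acquisition

assert get_cached_plan(customplan=False, plan_valid=True, trace=trace) == "reuse generic plan"
assert trace == ["CheckCachedPlan", "AcquireExecutorLocks"]
```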
Re: Delay locking partitions during query execution
On Tue, 4 Dec 2018 at 00:42, David Rowley wrote:
> Over here and along similar lines to the above, but this time I'd like
> to take this even further and change things so we don't lock *any*
> partitions during AcquireExecutorLocks() and instead just lock them
> when we first access them with ExecGetRangeTableRelation().

I've attached a rebased version of this. The previous version conflicted with some changes made in b60c397599.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

v2-0001-Allow-lock-acquisitions-for-partitions-to-be-dela.patch
Description: Binary data
Re: Delay locking partitions during query execution
On Sat, 5 Jan 2019 at 03:12, Tomas Vondra wrote:
> >>
> >> partitions      0     100    1000   10000
> >>
> >> master         19    1590    2090     128
> >> patched        18    1780    6820    1130
> >>
> >> So, that's nice. I wonder why the throughput drops so fast between 1k
> >> and 10k partitions, but I'll look into that later.
> >
> > Those look strange. Why is it so slow with the non-partitioned case?
> > I'd have expected that to be the fastest result.
> >
> Because there are 1M rows in the table, and it's doing a seqscan.

Of course. My test did the same, but I didn't consider that because I had so few rows per partition. Likely just adding an index would have made more sense.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 1/4/19 1:53 AM, David Rowley wrote:
> On Fri, 4 Jan 2019 at 13:01, Tomas Vondra wrote:
>> On 1/3/19 11:57 PM, David Rowley wrote:
>>> You'll know you're getting a generic plan when you see "Filter (a =
>>> $1)" and see "Subplans Removed: " below the Append.
>>>
>>
>> Indeed, with prepared statements I now see some improvements:
>>
>> partitions      0     100    1000   10000
>>
>> master         19    1590    2090     128
>> patched        18    1780    6820    1130
>>
>> So, that's nice. I wonder why the throughput drops so fast between 1k
>> and 10k partitions, but I'll look into that later.
>
> Those look strange. Why is it so slow with the non-partitioned case?
> I'd have expected that to be the fastest result.
>

Because there are 1M rows in the table, and it's doing a seqscan.

That also makes the other cases difficult to compare, because with few partitions there will be multiple pages per partition, scanned sequentially. And with many partitions it's likely only a single page with a couple of rows on it. I'll think about constructing a better benchmark, to make it easier to compare - perhaps by using a single row per table and/or adding indexes. Or something ...

>> Does this mean this optimization can only ever work with prepared
>> statements, or can it be made to work with regular plans too?
>
> That's a good question. I can confirm it's only of any use when run-time
> pruning occurs during init plan. For this patch to be any use, you'll
> need to see a "Subplans Removed: " line in the query's EXPLAIN ANALYZE
> output. If you don't see this then all Append/MergeAppend subplans
> were initialised and the relation lock would have been obtained
> regardless of whether delaylock is set for the relation. The effect of the
> patch here would just have been to obtain the lock during the first
> call to ExecGetRangeTableRelation() for that relation instead of
> during AcquireExecutorLocks(). There may actually be a tiny overhead
> in this case since AcquireExecutorLocks() must skip the delaylock
> relations, but they'll get locked later anyway. I doubt you could
> measure that, though.
>
> When run-time pruning is able to prune partitions before execution
> starts, the optimisation is useful since AcquireExecutorLocks()
> won't obtain the lock and ExecGetRangeTableRelation() won't be called
> for the pruned partitions' rels, as we don't bother to init the
> Append/MergeAppend subplans for those.
>
> I'm a little unsure if there are any cases where this type of run-time
> pruning can occur when PREPAREd statements are not in use. Initplan
> parameters can't prune before executor run since we need to run the
> executor to obtain the values of those. Likewise for evaluation of
> volatile functions. So I think run-time pruning before init plan is
> only ever going to happen for PARAM_EXTERN type parameters, i.e. with
> PREPAREd statements (REF: analyze_partkey_exprs() in partprune.c).
> Without PREPAREd statements, if the planner itself was unable to prune
> the partitions it would already have obtained the lock during
> planning, so AcquireExecutorLocks(), in this case, would bump into the
> local lock hash table entry and forego trying to obtain the lock
> itself. That's not free, but it's significantly faster than obtaining
> a lock.
>
> Or in short... it's only good for prepared statements where the
> statement's parameters allow for run-time pruning. However, that's a
> pretty large case, since the planner is still very slow at planning for
> large numbers of partitions, meaning it's common (or at least it will
> be) for people to use PREPAREd statements and plan_cache_mode =
> force_generic_plan.
>

OK, thanks for the explanation. One more reason to use prepared statements in such cases ...

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
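The delaylock behaviour discussed in the exchange above can be summarised with a toy model (plain Python, not PostgreSQL code; the function names loosely mirror the ones mentioned, and the pruning decision is just handed in as a set). With the patch, executor startup locks only the non-delaylock relations, and a leaf partition is locked only if its subplan is actually initialised, so run-time-pruned partitions are never locked at all:

```python
# Toy model of delayed partition locking with run-time pruning.
# Plain Python sketch, not PostgreSQL code.

def acquire_executor_locks(rels, delaylock):
    """Lock everything except the delaylock relations up front."""
    return [r for r in rels if r not in delaylock]

def init_append(subplans, pruned, locked):
    """Initialise surviving subplans; locking happens on first access."""
    for rel in subplans:
        if rel in pruned:
            continue             # subplan never initialised, never locked
        locked.append(rel)       # ExecGetRangeTableRelation() locks here

rels = ["listp", "listp1", "listp2"]
delaylock = {"listp1", "listp2"}   # leaf partitions only

locked = acquire_executor_locks(rels, delaylock)
init_append(["listp1", "listp2"], pruned={"listp2"}, locked=locked)

# The pruned partition listp2 was never locked; the parent always is.
assert locked == ["listp", "listp1"]
```

This also reflects why the benefit only shows up when pruning happens before init plan: if nothing is pruned, every leaf still ends up locked, just slightly later.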
Re: Delay locking partitions during query execution
On Fri, 4 Jan 2019 at 13:01, Tomas Vondra wrote:
> On 1/3/19 11:57 PM, David Rowley wrote:
> > You'll know you're getting a generic plan when you see "Filter (a =
> > $1)" and see "Subplans Removed: " below the Append.
>
> Indeed, with prepared statements I now see some improvements:
>
> partitions      0     100    1000   10000
>
> master         19    1590    2090     128
> patched        18    1780    6820    1130
>
> So, that's nice. I wonder why the throughput drops so fast between 1k
> and 10k partitions, but I'll look into that later.

Those look strange. Why is it so slow with the non-partitioned case? I'd have expected that to be the fastest result.

> Does this mean this optimization can only ever work with prepared
> statements, or can it be made to work with regular plans too?

That's a good question. I can confirm it's only of any use when run-time pruning occurs during init plan. For this patch to be any use, you'll need to see a "Subplans Removed: " line in the query's EXPLAIN ANALYZE output. If you don't see this then all Append/MergeAppend subplans were initialised and the relation lock would have been obtained regardless of whether delaylock is set for the relation. The effect of the patch here would just have been to obtain the lock during the first call to ExecGetRangeTableRelation() for that relation instead of during AcquireExecutorLocks(). There may actually be a tiny overhead in this case since AcquireExecutorLocks() must skip the delaylock relations, but they'll get locked later anyway. I doubt you could measure that, though.

When run-time pruning is able to prune partitions before execution starts, the optimisation is useful since AcquireExecutorLocks() won't obtain the lock and ExecGetRangeTableRelation() won't be called for the pruned partitions' rels, as we don't bother to init the Append/MergeAppend subplans for those.

I'm a little unsure if there are any cases where this type of run-time pruning can occur when PREPAREd statements are not in use.
Initplan parameters can't prune before executor run since we need to run the executor to obtain the values of those. Likewise for evaluation of volatile functions. So I think run-time pruning before init plan is only ever going to happen for PARAM_EXTERN type parameters, i.e. with PREPAREd statements (REF: analyze_partkey_exprs() in partprune.c). Without PREPAREd statements, if the planner itself was unable to prune the partitions it would already have obtained the lock during planning, so AcquireExecutorLocks(), in this case, would bump into the local lock hash table entry and forego trying to obtain the lock itself. That's not free, but it's significantly faster than obtaining a lock.

Or in short... it's only good for prepared statements where the statement's parameters allow for run-time pruning. However, that's a pretty large case, since the planner is still very slow at planning for large numbers of partitions, meaning it's common (or at least it will be) for people to use PREPAREd statements and plan_cache_mode = force_generic_plan.

> >> Furthermore, I've repeatedly run into this issue:
> >>
> >> test=# \d hashp
> >> ERROR: unrecognized token: "false"
> >> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
> >>
> >> I have no idea why it breaks like this, and it's somewhat random (i.e.
> >> not readily reproducible). But I've only ever seen it with this patch
> >> applied.
> >
> > You'll probably need to initdb with the patch applied as there's a new
> > field in RangeTblEntry. If there's a serialised one of these stored in
> > the catalogue somewhere then the new read function will have
> > issues reading the old serialised format.
> >
> D'oh! That explains it, because switching from/to patched binaries might
> have easily been triggering the error. I've checked that there are no
> changes to catalogs, but it did not occur to me adding a new RTE field
> could have such consequences ...
Schema-wise, no changes, but data-wise, there are changes:

$ pg_dump --schema=pg_catalog --data-only postgres | grep ":rellockmode" | wc -l
121

All of which are inside the pg_rewrite table:

$ pg_dump --schema=pg_catalog --data-only --table=pg_rewrite postgres | grep ":rellockmode" | wc -l
121

I just used ":rellockmode" here as it's a field that exists in RangeTblEntry.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 1/3/19 11:57 PM, David Rowley wrote:
> On Fri, 4 Jan 2019 at 11:48, Tomas Vondra wrote:
>> Nope, that doesn't seem to make any difference :-( In all cases the
>> resulting plan (with 10k partitions) looks like this:
>>
>> test=# explain analyze select * from hashp where a = 13442;
>>
>>                          QUERY PLAN
>> ---------------------------------------------------------------
>>  Append (cost=0.00..41.94 rows=13 width=4)
>>         (actual time=0.018..0.018 rows=0 loops=1)
>>    ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
>>        (actual time=0.017..0.018 rows=0 loops=1)
>>          Filter: (a = 13442)
>>  Planning Time: 75.870 ms
>>  Execution Time: 0.471 ms
>> (5 rows)
>>
>> and it doesn't change (the timings or shape) no matter how I set any of
>> the GUCs.
>
> For this to work, run-time pruning needs to take place, so it must be
> a PREPAREd statement.
>
> With my test I used:
>
> bench.sql:
> \set p_a 13315
> select * from hashp where a = :p_a;
>
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
>
> You'll know you're getting a generic plan when you see "Filter (a =
> $1)" and see "Subplans Removed: " below the Append.
>

Indeed, with prepared statements I now see some improvements:

partitions      0     100    1000   10000

master         19    1590    2090     128
patched        18    1780    6820    1130

So, that's nice. I wonder why the throughput drops so fast between 1k and 10k partitions, but I'll look into that later.

Does this mean this optimization can only ever work with prepared statements, or can it be made to work with regular plans too?

>> Furthermore, I've repeatedly run into this issue:
>>
>> test=# \d hashp
>> ERROR: unrecognized token: "false"
>> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
>>
>> I have no idea why it breaks like this, and it's somewhat random (i.e.
>> not readily reproducible). But I've only ever seen it with this patch
>> applied.
>
> You'll probably need to initdb with the patch applied as there's a new
> field in RangeTblEntry.
> If there's a serialised one of these stored in
> the catalogue somewhere then the new read function will have
> issues reading the old serialised format.
>

D'oh! That explains it, because switching from/to patched binaries might have easily been triggering the error. I've checked that there are no changes to catalogs, but it did not occur to me that adding a new RTE field could have such consequences ...

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Delay locking partitions during query execution
On Fri, 4 Jan 2019 at 11:48, Tomas Vondra wrote:
> Nope, that doesn't seem to make any difference :-( In all cases the
> resulting plan (with 10k partitions) looks like this:
>
> test=# explain analyze select * from hashp where a = 13442;
>
>                          QUERY PLAN
> ---------------------------------------------------------------
>  Append (cost=0.00..41.94 rows=13 width=4)
>         (actual time=0.018..0.018 rows=0 loops=1)
>    ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
>        (actual time=0.017..0.018 rows=0 loops=1)
>          Filter: (a = 13442)
>  Planning Time: 75.870 ms
>  Execution Time: 0.471 ms
> (5 rows)
>
> and it doesn't change (the timings or shape) no matter how I set any of
> the GUCs.

For this to work, run-time pruning needs to take place, so it must be a PREPAREd statement.

With my test I used:

bench.sql:
\set p_a 13315
select * from hashp where a = :p_a;

$ pgbench -n -f bench.sql -M prepared -T 60 postgres

You'll know you're getting a generic plan when you see "Filter (a = $1)" and see "Subplans Removed: " below the Append.

> Furthermore, I've repeatedly run into this issue:
>
> test=# \d hashp
> ERROR: unrecognized token: "false"
> LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...
>
> I have no idea why it breaks like this, and it's somewhat random (i.e.
> not readily reproducible). But I've only ever seen it with this patch
> applied.

You'll probably need to initdb with the patch applied, as there's a new field in RangeTblEntry. If there's a serialised one of these stored in the catalogue somewhere then the new read function will have issues reading the old serialised format.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
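The serialisation mismatch David describes can be illustrated with a toy model (plain Python; this is not PostgreSQL's outfuncs/readfuncs code, and the field names are only borrowed for flavour). A reader that expects one set of `:field value` tokens chokes when it meets a serialisation written for a different set, producing exactly the "unrecognized token" symptom:

```python
# Toy model of why changing a serialised node's field set breaks
# readers built for the other format.  Not PostgreSQL code.

def write_node(fields):
    """Serialise a node as ':field value' token pairs."""
    return " ".join(f":{k} {v}" for k, v in fields.items())

def read_node(text, expected):
    """Parse token pairs, rejecting any field we don't know about."""
    tokens = text.split()
    fields = {}
    for i in range(0, len(tokens), 2):
        key = tokens[i].lstrip(":")
        if key not in expected:
            raise ValueError(f'unrecognized token: "{tokens[i + 1]}"')
        fields[key] = tokens[i + 1]
    return fields

old = write_node({"relid": 1234, "rellockmode": 1})
new = write_node({"relid": 1234, "rellockmode": 1, "delaylock": "false"})

# A reader built for the extended format can still parse the old entry
# in this toy (real readfuncs are positional and stricter)...
assert read_node(old, expected={"relid", "rellockmode", "delaylock"})

# ...but a reader that doesn't know the new field raises the familiar error.
try:
    read_node(new, expected={"relid", "rellockmode"})
except ValueError as e:
    assert "unrecognized token" in str(e)
```

Hence the advice: after adding a field to RangeTblEntry, re-initdb so that the serialised RTEs stored in pg_rewrite match the running binary.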
Re: Delay locking partitions during query execution
On 1/3/19 10:50 PM, David Rowley wrote:
> On Fri, 4 Jan 2019 at 02:40, Tomas Vondra wrote:
>> I'm a bit confused, because I can't reproduce any such speedup. I've
>> used the attached script that varies the number of partitions (which
>> worked quite nicely in the INSERT thread), but I'm getting results like
>> this:
>>
>> partitions      0     100    1000   10000
>>
>> master         49    1214     186      11
>> patched        53    1225     187      11
>>
>> So I don't see any significant speedup, for some reason :-(
>>
>> Before I start digging into this, is there something that needs to be
>> done to enable it?
>
> Thanks for looking at this.
>
> One thing I seem to quite often forget to mention is that I was running with:
>
> plan_cache_mode = force_generic_plan
> max_parallel_workers_per_gather = 0
>
> Without changing plan_cache_mode, the planner would likely never
> favour a generic plan, since it will not appear to be very efficient
> due to the lack of consideration of the costing of run-time partition
> pruning.
>
> Also, with a generic plan, the planner will likely want to build
> a parallel plan since it sees up to 10k partitions that need to be
> scanned. max_parallel_workers_per_gather = 0 puts that right.
>
> (Ideally, the planner would cost run-time pruning, but it's not quite
> so simple for RANGE partitions with non-equality operators. Likely
> we'll want to fix that one day, but that's not for here.)
>

Nope, that doesn't seem to make any difference :-( In all cases the resulting plan (with 10k partitions) looks like this:

test=# explain analyze select * from hashp where a = 13442;

                         QUERY PLAN
---------------------------------------------------------------
 Append (cost=0.00..41.94 rows=13 width=4)
        (actual time=0.018..0.018 rows=0 loops=1)
   ->  Seq Scan on hashp6784 (cost=0.00..41.88 rows=13 width=4)
       (actual time=0.017..0.018 rows=0 loops=1)
         Filter: (a = 13442)
 Planning Time: 75.870 ms
 Execution Time: 0.471 ms
(5 rows)

and it doesn't change (the timings or shape) no matter how I set any of the GUCs.
Furthermore, I've repeatedly run into this issue:

test=# \d hashp
ERROR: unrecognized token: "false"
LINE 2: ...catalog.array_to_string(array(select rolname from pg_catalog...

I have no idea why it breaks like this, and it's somewhat random (i.e. not readily reproducible). But I've only ever seen it with this patch applied.

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Delay locking partitions during query execution
On Fri, 4 Jan 2019 at 02:40, Tomas Vondra wrote:
> I'm a bit confused, because I can't reproduce any such speedup. I've
> used the attached script that varies the number of partitions (which
> worked quite nicely in the INSERT thread), but I'm getting results like
> this:
>
> partitions      0     100    1000   10000
>
> master         49    1214     186      11
> patched        53    1225     187      11
>
> So I don't see any significant speedup, for some reason :-(
>
> Before I start digging into this, is there something that needs to be
> done to enable it?

Thanks for looking at this.

One thing I seem to quite often forget to mention is that I was running with:

plan_cache_mode = force_generic_plan
max_parallel_workers_per_gather = 0

Without changing plan_cache_mode, the planner would likely never favour a generic plan, since it will not appear to be very efficient due to the lack of consideration of the costing of run-time partition pruning.

Also, with a generic plan, the planner will likely want to build a parallel plan since it sees up to 10k partitions that need to be scanned. max_parallel_workers_per_gather = 0 puts that right.

(Ideally, the planner would cost run-time pruning, but it's not quite so simple for RANGE partitions with non-equality operators. Likely we'll want to fix that one day, but that's not for here.)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Delay locking partitions during query execution
On 12/3/18 12:42 PM, David Rowley wrote:
> ...
>
> Master: 10000 parts
>
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
> tps = 108.882749 (excluding connections establishing)
> tps = 108.245437 (excluding connections establishing)
>
> delaylock: 10000 parts
>
> $ pgbench -n -f bench.sql -M prepared -T 60 postgres
> tps = 1068.289505 (excluding connections establishing)
> tps = 1092.797015 (excluding connections establishing)
>

I'm a bit confused, because I can't reproduce any such speedup. I've used the attached script that varies the number of partitions (which worked quite nicely in the INSERT thread), but I'm getting results like this:

partitions      0     100    1000   10000

master         49    1214     186      11
patched        53    1225     187      11

So I don't see any significant speedup, for some reason :-(

Before I start digging into this, is there something that needs to be done to enable it?

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

select.sql
Description: application/sql

run-select.sh
Description: application/shellscript