Re: [HACKERS] parallel mode and parallel contexts
On Tue, May 05, 2015 at 01:05:38PM -0400, Robert Haas wrote:
On Sat, May 2, 2015 at 12:35 PM, Noah Misch n...@leadboat.com wrote:

Perhaps, rather than model it as master M waiting on worker list W1|W2|W3, model it with queue-nonempty and queue-nonfull events, one pair per queue.

That comment of mine was junk; it suggested a data structure containing the same worker list it purported to remove. Oops.

Each individual queue has only a single reader and a single writer. In your example, there would be three queues, not just one. Of course, one could design a new shared memory data structure representing a collection of related queues, but I still don't see exactly how we get around the problem of that requiring O(MaxBackends^2) storage space.

If each backend waits on a uniformly-distributed 50% of other backends, tracking that wait graph indeed requires O(MaxBackends^2) space. Backends completing useful work in the field will have far sparser wait graphs, and that should inform the choice of data structure:

On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:

When I first started thinking about how to fix this, I said, well, that's no big deal, we can just advertise the whole list of processes that we're waiting for in shared memory, rather than just the one.

That boiled down to representing the wait graph with an adjacency list, which sounds like an efficient choice.

M subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M publishes to queue0-nonfull, and the workers subscribe. An edge forms in the deadlock detector's graph when all of an event's subscribers wait on it. (One can approximate that in userspace with a pair of advisory locks.)

An edge between which two processes?

I hadn't de-fuzzed my thinking that far yet. If you still need the answer after this message, let me know, and I'll work on that. As a guess, I think it's three edges M-W1, M-W2 and M-W3.

After thinking about it a bit more, I realized that even if we settle on some solution to that problem, there's another issue: the wait-edges created by this system don't really have the same semantics as regular lock waits. Suppose A is waiting on a lock held by B and B is waiting for a lock held by A; that's a deadlock. But if A is waiting for B to write to a tuple queue and B is waiting for A to read from a tuple queue, that's not a deadlock if the queues in question are the same.

I do see a deadlock. B wants to block until A reads from the queue, so it advertises that and sleeps. Waking up deadlock_timeout later, it notices that A is waiting for B to write something. B will not spontaneously suspend waiting to write to the queue, nor will A suspend waiting to read from the queue. Thus, the deadlock is valid. This assumes the deadlock detector reasons from an authoritative picture of pending waits and that we reliably wake up a process when the condition it sought has arrived. We have that for heavyweight lock acquisitions. The assumption may be incompatible with Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

I don't understand. What's typically going to happen here is that the queue will initially be empty, and A will block. Suppose B has a large message to write to the queue, let's say 100x the queue size. It will write the first 1% of the message into the queue, set A's latch, and go to sleep. A will now wake up, drain the queue, set B's latch, and go to sleep. B will then wake up, write the next chunk of the message, set A's latch again, and go to sleep.
They'll go back and forth like this until the entire message has been transmitted. There's no deadlock here: everything is working as designed. But there may be instants where both A and B are waiting, because (for example) A may set B's latch and finish going to sleep before B gets around to waking up.

I see what you had in mind. The problem showed up in the last sentence; namely, the hand-off from process to process is not atomic like it is for heavyweight locks. That's exactly what I was (not too clearly) getting at when I wrote, "This assumes ..." above.

For the sake of illustration, suppose you replace your queue in this algorithm with a heavyweight lock and a small buffer. Initially, B holds the lock and A waits for the lock. The following event sequence repeats until the transfer is done:

  B fills the buffer
  B releases the lock, granting it to A during LockRelease()
  B starts waiting for the lock
  A empties the buffer
  A releases the lock, granting it to B during LockRelease()
  A starts waiting for the lock

The deadlock detector will rightly never report a deadlock. To have the same safety in your example, it's not enough for one process to set the latch of another process. The process must update something in static shared memory to indicate that the waited-for condition (queue-nonempty for A, queue-nonfull for B) is now satisfied.
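Concretely, that unit of static shared state might look something like this (a sketch only; the names are made up, not from any patch): the writer publishes queue-nonempty before setting the reader's latch, so the deadlock detector can never observe a wait whose condition has already been satisfied.

    /* Hypothetical per-queue wait state, kept in the main shared memory
     * segment where the deadlock detector can inspect it. */
    typedef struct QueueWaitState
    {
        pg_atomic_uint32 nonempty;      /* reader's wait condition met? */
        pg_atomic_uint32 nonfull;       /* writer's wait condition met? */
    } QueueWaitState;

    /*
     * Writer side: mark the queue non-empty BEFORE waking the reader, so
     * that a concurrent deadlock check cannot conclude the reader is
     * still stuck waiting for data that has already arrived.
     */
    static void
    queue_advertise_nonempty(QueueWaitState *ws, PGPROC *reader)
    {
        pg_atomic_write_u32(&ws->nonempty, 1);
        pg_write_barrier();
        SetLatch(&reader->procLatch);
    }

The ordering (state update, barrier, latch) is what substitutes for the atomic hand-off that LockRelease() provides in the heavyweight-lock analogy above.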
Re: [HACKERS] parallel mode and parallel contexts
On Thu, May 7, 2015 at 3:09 AM, Noah Misch n...@leadboat.com wrote:

Each individual queue has only a single reader and a single writer. In your example, there would be three queues, not just one. Of course, one could design a new shared memory data structure representing a collection of related queues, but I still don't see exactly how we get around the problem of that requiring O(MaxBackends^2) storage space.

If each backend waits on a uniformly-distributed 50% of other backends, tracking that wait graph indeed requires O(MaxBackends^2) space. Backends completing useful work in the field will have far sparser wait graphs, and that should inform the choice of data structure:

We can, of course, underallocate and hope for the best.

M subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M publishes to queue0-nonfull, and the workers subscribe. An edge forms in the deadlock detector's graph when all of an event's subscribers wait on it. (One can approximate that in userspace with a pair of advisory locks.)

An edge between which two processes?

I hadn't de-fuzzed my thinking that far yet. If you still need the answer after this message, let me know, and I'll work on that. As a guess, I think it's three edges M-W1, M-W2 and M-W3.

I think it's an edge where one end is M and the other end is fuzzily diffused across W1, W2, and W3, because we only need one of them to wake up. There's no way to represent such a thing in the deadlock detector today. We could invent one, of course, but it sounds complicated. Also, I suspect deadlock checking in this world reduces to http://en.wikipedia.org/wiki/Boolean_satisfiability_problem and is therefore NP-complete.

I see what you had in mind. The problem showed up in the last sentence; namely, the hand-off from process to process is not atomic like it is for heavyweight locks. That's exactly what I was (not too clearly) getting at when I wrote, "This assumes ..." above. For the sake of illustration, suppose you replace your queue in this algorithm with a heavyweight lock and a small buffer. Initially, B holds the lock and A waits for the lock. The following event sequence repeats until the transfer is done:

  B fills the buffer
  B releases the lock, granting it to A during LockRelease()
  B starts waiting for the lock
  A empties the buffer
  A releases the lock, granting it to B during LockRelease()
  A starts waiting for the lock

The deadlock detector will rightly never report a deadlock. To have the same safety in your example, it's not enough for one process to set the latch of another process. The process must update something in static shared memory to indicate that the waited-for condition (queue-nonempty for A, queue-nonfull for B) is now satisfied.

You need such a unit of state anyway, don't you?

Yes. It's the number of bytes read and written, and it's stored in dynamic shared memory. You can move some of the state to the main shared memory segment, but I don't look forward to the performance consequences of a lock acquire/release cycle every time the queue fills or drains. Of course, there is also a loss of flexibility there: a big reason for developing this stuff in the first place was to avoid being constrained by the main shared memory segment.

The number of held LWLocks is always zero when entering user-defined code and again when exiting user-defined code. Therefore, the possible LWLock wait graphs are known at compile time, and we could prove that none contain a deadlock. Therefore, we scarcely miss having an LWLock deadlock detector.
That does not map to the tuple queue behavior at hand, because we hope to allow the queue's writer to run user-defined code while the queue's reader is waiting. My term "user-defined code" does come with some hand-waving. A superuser can cause an undetected deadlock via a C-language hook. We would not call that a PostgreSQL bug, though the hook is literally user-defined. Let's keep the deadlock detector able to identify every deadlock a non-superuser can cause. That includes deadlocks arising from heavyweight lock acquisition in parallel-safe functions.

I'm in agreement with that goal. I think where this project went off the rails is when we made the decision as to how to fix the problems in the original group locking approach. In the original concept, locks between processes in the same parallel group just didn't ever conflict; two otherwise-conflicting locks held by different backends within the same group were regarded as those processes sharing that lock. Andres and possibly others pointed out that for stuff like relation locks, that's

I think you mean relation extension locks.

Yes, thanks.

probably not the right behavior. I agree with that. It was also suggested that this would be less scary if we limited ourselves to sharing the locks held at the beginning of parallelism. That had the further advantage of
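For context, the "number of bytes read and written" state Robert mentions lives in a ring-buffer header roughly like the following (a simplified sketch, not the actual shm_mq layout): each counter is advanced by one side only, and full/empty are derived from their difference.

    /* Sketch of a single-reader/single-writer byte queue in a dynamic
     * shared memory segment. */
    typedef struct SimpleQueue
    {
        uint64      bytes_written;  /* advanced only by the writer */
        uint64      bytes_read;     /* advanced only by the reader */
        Size        ring_size;
        char        ring[FLEXIBLE_ARRAY_MEMBER];
    } SimpleQueue;

    #define QueueUsed(q)    ((q)->bytes_written - (q)->bytes_read)
    #define QueueIsFull(q)  (QueueUsed(q) == (q)->ring_size)
    #define QueueIsEmpty(q) (QueueUsed(q) == 0)

Because each counter has a single writer, no heavyweight lock is needed to maintain it, which is precisely the property the discussion above is reluctant to give up.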
Re: [HACKERS] parallel mode and parallel contexts
On Sat, May 2, 2015 at 12:35 PM, Noah Misch n...@leadboat.com wrote:

Perhaps, rather than model it as master M waiting on worker list W1|W2|W3, model it with queue-nonempty and queue-nonfull events, one pair per queue.

Each individual queue has only a single reader and a single writer. In your example, there would be three queues, not just one. Of course, one could design a new shared memory data structure representing a collection of related queues, but I still don't see exactly how we get around the problem of that requiring O(MaxBackends^2) storage space.

M subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M publishes to queue0-nonfull, and the workers subscribe. An edge forms in the deadlock detector's graph when all of an event's subscribers wait on it. (One can approximate that in userspace with a pair of advisory locks.)

An edge between which two processes?

After thinking about it a bit more, I realized that even if we settle on some solution to that problem, there's another issue: the wait-edges created by this system don't really have the same semantics as regular lock waits. Suppose A is waiting on a lock held by B and B is waiting for a lock held by A; that's a deadlock. But if A is waiting for B to write to a tuple queue and B is waiting for A to read from a tuple queue, that's not a deadlock if the queues in question are the same.

I do see a deadlock. B wants to block until A reads from the queue, so it advertises that and sleeps. Waking up deadlock_timeout later, it notices that A is waiting for B to write something. B will not spontaneously suspend waiting to write to the queue, nor will A suspend waiting to read from the queue. Thus, the deadlock is valid. This assumes the deadlock detector reasons from an authoritative picture of pending waits and that we reliably wake up a process when the condition it sought has arrived. We have that for heavyweight lock acquisitions. The assumption may be incompatible with Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

I don't understand. What's typically going to happen here is that the queue will initially be empty, and A will block. Suppose B has a large message to write to the queue, let's say 100x the queue size. It will write the first 1% of the message into the queue, set A's latch, and go to sleep. A will now wake up, drain the queue, set B's latch, and go to sleep. B will then wake up, write the next chunk of the message, set A's latch again, and go to sleep. They'll go back and forth like this until the entire message has been transmitted. There's no deadlock here: everything is working as designed. But there may be instants where both A and B are waiting, because (for example) A may set B's latch and finish going to sleep before B gets around to waking up.

That's probably something that we could patch around, but I think it's missing the larger point. When you're dealing with locks, there is one operation that causes blocking (acquiring a lock) and another operation that unblocks other processes (releasing a lock). With message queues, there are still two operations (reading and writing), but either of them can either cause you to block yourself, or to unblock another process. To put that another way, locks enforce mutual exclusion; message queues force mutual *inclusion*. Locks cause blocking when two processes try to operate on the same object at the same time; message queues cause blocking *unless* two processes operate on the same object at the same time.
That difference seems to me to be quite fundamental.

So then you have to wonder why we're not solving problem #1, because the deadlock was just as certain before we generated the maximum possible number of tuples locally as it was afterwards.

The 1. above reads like a benefit, not a problem. What might you solve about it?

Sorry, that wasn't very clear. Normally, it is a benefit, but in some cases, it could result in a long delay in reporting an inevitable deadlock. What I mean is: suppose we construct a working mechanism where the deadlock detector knows about tuple queue waits. Suppose further that we have a parallel leader and two workers cooperating to do a task that takes 300 s and can be parallelized with perfect efficiency, so that, working together, those three processes can get the job done in just 100 s. If the leader tries to take a lock held in a conflicting mode by one of the workers, the worker will fill up the tuple queue - probably quite quickly - and wait. We now detect a deadlock. Our detection is timely, and life is good. On the other hand, suppose that one of the *workers* tries to take a lock held in a conflicting mode by the other worker or by the leader. There is no immediate deadlock, because the leader is not waiting. Instead, we'll finish the computation with the remaining worker and the leader, which will take 150 seconds, since we only have two processes instead of three.
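The mutual-exclusion/mutual-inclusion contrast boils down to a pair of opposite blocking predicates (purely schematic, not real lock-manager code; conflicting_holder_exists is a made-up placeholder, and the queue predicates are the ones sketched earlier in the thread):

    /* Locks: block because someone else IS operating on the object. */
    bool lock_must_wait  = conflicting_holder_exists(lock);

    /* Queues: block because the peer is NOT operating on the object. */
    bool write_must_wait = QueueIsFull(queue);   /* reader not draining */
    bool read_must_wait  = QueueIsEmpty(queue);  /* writer not filling */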
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:

[proposal: teach deadlock detector about parallel master waiting on workers]

deadlock.c is far from simple, and at least I don't find the control flow to be particularly clear. So it's not easy. It'd be advantageous to tackle things at that level because it'd avoid the need to acquire locks on some lock's waitqueue when blocking; we're going to do that a lot. But it seems to me that it should be possible to succeed: In FindLockCycleRecurse(), in the case that we're not waiting for an actual lock (checkProc->links.next == NULL) we can add a case that considers the 'blocking parallelism' case. ISTM that that's just a FindLockCycleRecurse() call on the process that we're waiting for. We'd either have to find the other side's locktag for DEADLOCK_INFO or invent another field in there; but that seems like a solvable problem.

1. The master doesn't actually *wait* until the very end of the parallel phase.

2. When the master does wait, it waits for all of the parallel workers at once, not each one individually.

So, I don't think anything as simplistic as teaching a blocking shm_mq_receive() to tip off the deadlock detector that we're waiting for the process on the other end of that particular queue can ever work. Because of point #2, that never happens.

When I first started thinking about how to fix this, I said, well, that's no big deal, we can just advertise the whole list of processes that we're waiting for in shared memory, rather than just the one.

This is a bit tricky, though. Any general API for any backend to advertise that it's waiting for an arbitrary subset of the other backends would require O(n^2) shared memory state. That wouldn't be completely insane, but it's not great, either. For this particular case, we could optimize that down to O(n) by just chaining all of the children of a given parallel group leader in a linked list whose nodes are inlined in their PGPROCs, but that doesn't feel very general, because it forecloses the possibility of the children ever using that API, and I think they might need to. If nothing else, they might need to advertise that they're blocking on the master if they are trying to send data, the queue is full, and they have to wait for the master to drain some of it before they can proceed.

Perhaps, rather than model it as master M waiting on worker list W1|W2|W3, model it with queue-nonempty and queue-nonfull events, one pair per queue. M subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M publishes to queue0-nonfull, and the workers subscribe. An edge forms in the deadlock detector's graph when all of an event's subscribers wait on it. (One can approximate that in userspace with a pair of advisory locks.)

After thinking about it a bit more, I realized that even if we settle on some solution to that problem, there's another issue: the wait-edges created by this system don't really have the same semantics as regular lock waits. Suppose A is waiting on a lock held by B and B is waiting for a lock held by A; that's a deadlock. But if A is waiting for B to write to a tuple queue and B is waiting for A to read from a tuple queue, that's not a deadlock if the queues in question are the same.

I do see a deadlock. B wants to block until A reads from the queue, so it advertises that and sleeps. Waking up deadlock_timeout later, it notices that A is waiting for B to write something.
B will not spontaneously suspend waiting to write to the queue, nor will A suspend waiting to read from the queue. Thus, the deadlock is valid. This assumes the deadlock detector reasons from an authoritative picture of pending waits and that we reliably wake up a process when the condition it sought has arrived. We have that for heavyweight lock acquisitions. The assumption may be incompatible with Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

A further difference in the semantics of these wait-edges is that if process A is awaiting AccessExclusiveLock on a resource held by B, C, and D at AccessShareLock, it needs to wait for all of those processes to release their locks before it can do anything else. On the other hand, if process A is awaiting tuples from B, C, and D, it just needs ONE of those processes to emit tuples in order to make progress. Now

Right.

maybe that doesn't make any difference in practice, because even if two of those processes are making lively progress and A is receiving tuples from them and processing them like gangbusters, that's probably not going to help the third one get unstuck. If we adopt the approach of discounting that possibility, then as long as the parallel leader can generate tuples locally (that is, local_scan_done = false) we don't report the deadlock, but as soon as it can no longer do that (local_scan_done = true) then we do, even though we could still theoretically read more tuples from the non-stuck workers.
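For reference, Andres's FindLockCycleRecurse() idea quoted at the top of this message would amount to roughly the following (a sketch only; the parallel-wait field is hypothetical, not part of any posted patch):

    /* In FindLockCycleRecurse(), where checkProc->links.next == NULL
     * means the process is not waiting for a heavyweight lock: */
    if (checkProc->links.next == NULL)
    {
        /* Hypothetical: the backend this process is blocked on. */
        PGPROC *waitedFor = checkProc->parallelWaitProc;

        if (waitedFor != NULL)
            return FindLockCycleRecurse(waitedFor, depth + 1,
                                        softEdges, nSoftEdges);
    }

As the surrounding discussion shows, the hard part is not this recursion but deciding what such an edge means when the master is waiting on several workers at once and any one of them could unblock it.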
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Apr 29, 2015 at 12:23 PM, Robert Haas robertmh...@gmail.com wrote:

So, I think it makes sense to split up this patch in two. There's no real debate, AFAICS, about anything in the patch other than the heavyweight locking stuff. So I'd like to go ahead and commit the rest. That's attached here as parallel-mode-v10.patch.

Hearing no objections, done. Still hoping for some input on the rest.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Sun, Apr 26, 2015 at 2:03 PM, Euler Taveira eu...@timbira.com.br wrote:
On 19-03-2015 15:13, Robert Haas wrote:
On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund and...@2ndquadrant.com wrote:

Reading the README first, the rest later. So you can comment on my comments, while I actually look at the code. Parallelism, yay!

I'm also looking at this piece of code and found some low-hanging fruit.

Thanks. 0001 and 0003 look good, but 0002 is actually unrelated to this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Sun, Apr 26, 2015 at 9:57 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote:
On 2015-04-22 AM 04:14, Robert Haas wrote:

We should check IsParallelWorker() for operations that are allowed in the master during parallel mode, but not allowed in the workers - e.g. the master can scan its own temporary relations, but its workers can't. We should check IsInParallelMode() for operations that are completely off-limits in parallel mode - i.e. writes. We use ereport() where we expect that SQL could hit that check, and elog() where we expect that only (buggy?) C code could hit that check.

By the way, perhaps orthogonal to the basic issue Alvaro and you are discussing here - when playing around with (parallel-mode + parallel-safety + parallel heapscan + parallel seqscan), I'd observed (been a while since) that if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use a parallel scan, the statement as a whole fails because the top-level statement (CTAS) is not read-only. So, the only way to make CTAS succeed is to disable seqscan; which may require some consideration in how this is reported to the user. I guess it happens because the would-be parallel leader of the scan would also be the one doing the DefineRelation() DDL. Apologies if this has been addressed since.

That sounds like a bug in the assess-parallel-safety patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On 19-03-2015 15:13, Robert Haas wrote:
On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund and...@2ndquadrant.com wrote:

Reading the README first, the rest later. So you can comment on my comments, while I actually look at the code. Parallelism, yay!

I'm also looking at this piece of code and found some low-hanging fruit.

--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

From 3ce5376868f61a67540915b83a15c59a31fc895a Mon Sep 17 00:00:00 2001
From: Euler Taveira eu...@timbira.com
Date: Sun, 26 Apr 2015 13:49:37 -0300
Subject: [PATCH 1/3] fix some typos.

---
 src/backend/access/heap/heapam.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index da0b70e..16d8c59 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2250,7 +2250,7 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
 	 * Unlike heap_update() and heap_delete(), an insert should never create
-	 * a combo CID, so it might be possible to relax this restrction, but
+	 * a combo CID, so it might be possible to relax this restriction, but
 	 * not without more thought and testing.
 	 */
 	if (IsInParallelMode())
@@ -2666,7 +2666,7 @@ heap_delete(Relation relation, ItemPointer tid,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
-	 * Forbid this during a parallel operation, lest it allocate a combocid.
+	 * Forbid this during a parallel operation, lets it allocate a combocid.
 	 * Other workers might need that combocid for visibility checks, and we
 	 * have no provision for broadcasting it to them.
 	 */
@@ -3124,7 +3124,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	Assert(ItemPointerIsValid(otid));
 
 	/*
-	 * Forbid this during a parallel operation, lest it allocate a combocid.
+	 * Forbid this during a parallel operation, lets it allocate a combocid.
 	 * Other workers might need that combocid for visibility checks, and we
 	 * have no provision for broadcasting it to them.
 	 */
@@ -5435,7 +5435,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
 	 * Unlike a regular update, this should never create a combo CID, so it
-	 * might be possible to relax this restrction, but not without more
+	 * might be possible to relax this restriction, but not without more
 	 * thought and testing. It's not clear that it would be useful, anyway.
 	 */
 	if (IsInParallelMode())
-- 
2.1.4

From cf25445d9d21496b6927e9ef45e6c3815fef8ad5 Mon Sep 17 00:00:00 2001
From: Euler Taveira eu...@timbira.com
Date: Sun, 26 Apr 2015 14:24:26 -0300
Subject: [PATCH 2/3] Avoid compiler warnings about unused variables in
 assert-disabled builds.

---
 src/backend/utils/fmgr/funcapi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/fmgr/funcapi.c b/src/backend/utils/fmgr/funcapi.c
index ebd7ddd..b9f2b06 100644
--- a/src/backend/utils/fmgr/funcapi.c
+++ b/src/backend/utils/fmgr/funcapi.c
@@ -887,7 +887,7 @@ get_func_trftypes(HeapTuple procTup, Oid **p_trftypes)
 {
-	Form_pg_proc procStruct = (Form_pg_proc) GETSTRUCT(procTup);
+	Form_pg_proc procStruct PG_USED_FOR_ASSERTS_ONLY = (Form_pg_proc) GETSTRUCT(procTup);
 	Datum		protrftypes;
 	ArrayType  *arr;
 	int			nelems;
-- 
2.1.4

From 7d1716fdf84f24a1dddfa02db27d532e06c92c3d Mon Sep 17 00:00:00 2001
From: Euler Taveira eu...@timbira.com
Date: Sun, 26 Apr 2015 14:52:39 -0300
Subject: [PATCH 3/3] Fix some more typos.

---
 src/backend/access/transam/README.parallel | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/README.parallel b/src/backend/access/transam/README.parallel
index c066fff..2257e4c 100644
--- a/src/backend/access/transam/README.parallel
+++ b/src/backend/access/transam/README.parallel
@@ -76,7 +76,7 @@ Instead, we take a more pragmatic approach. First, we try to make as many of
 the operations that are safe outside of parallel mode work correctly in
 parallel mode as well. Second, we try to prohibit common unsafe operations
 via suitable error checks. These checks are intended to catch 100% of
-unsafe things that a user might do from the SQL interface, but code writen
+unsafe things that a user might do from the SQL interface, but code written
 in C can do unsafe things that won't trigger these checks. The error checks
 are engaged via EnterParallelMode(), which should be called before creating
 a parallel context, and disarmed via ExitParallelMode(), which should be
@@ -108,7 +108,7 @@ worker. This includes:
 
 - The combo CID mappings. This is needed to ensure consistent answers to
   tuple visibility checks. The need to synchronize this data structure is
  a major reason why
Re: [HACKERS] parallel mode and parallel contexts
On 2015-04-22 AM 04:14, Robert Haas wrote:

We should check IsParallelWorker() for operations that are allowed in the master during parallel mode, but not allowed in the workers - e.g. the master can scan its own temporary relations, but its workers can't. We should check IsInParallelMode() for operations that are completely off-limits in parallel mode - i.e. writes. We use ereport() where we expect that SQL could hit that check, and elog() where we expect that only (buggy?) C code could hit that check.

By the way, perhaps orthogonal to the basic issue Alvaro and you are discussing here - when playing around with (parallel-mode + parallel-safety + parallel heapscan + parallel seqscan), I'd observed (been a while since) that if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use a parallel scan, the statement as a whole fails because the top-level statement (CTAS) is not read-only. So, the only way to make CTAS succeed is to disable seqscan; which may require some consideration in how this is reported to the user. I guess it happens because the would-be parallel leader of the scan would also be the one doing the DefineRelation() DDL. Apologies if this has been addressed since.

Thanks,
Amit
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Apr 21, 2015 at 11:38 PM, Robert Haas robertmh...@gmail.com wrote:

After thinking about it a bit more, I realized that even if we settle on some solution to that problem, there's another issue: the wait-edges created by this system don't really have the same semantics as regular lock waits. Suppose A is waiting on a lock held by B and B is waiting for a lock held by A; that's a deadlock. But if A is waiting for B to write to a tuple queue and B is waiting for A to read from a tuple queue, that's not a deadlock if the queues in question are the same. If they are different queues, it might be a deadlock, but then again maybe not. It may be that A is prepared to accept B's message from one queue, and that upon fully receiving it, it will do some further work that will lead it to write a tuple into the other queue. If so, we're OK; if not, it's a deadlock.

I agree that the way the deadlock detector works for locks might not be the same as how it needs to work for communication buffers (tuple queues). Here I think we also need to devise a different way to remove resources from the wait/resource queue; it might not be a good idea to do it the way we do for locks, since locks are released at transaction end, whereas these new resources (communication buffers) don't operate at transaction boundaries.

I'm not sure whether you'll want to argue that that is an implausible scenario, but I'm not too sure it is. The worker could be saying hey, I need some additional piece of your backend-local state in order to finish this computation, and the master could then provide it. I don't have any plans like that, but it's been suggested previously by others, so it's not an obviously nonsensical thing to want to do.

If such a thing is not possible today, and it also seems like a questionable design for solving any problem, then why spend much effort trying to find ways to detect deadlocks for it?

A further difference in the semantics of these wait-edges is that if process A is awaiting AccessExclusiveLock on a resource held by B, C, and D at AccessShareLock, it needs to wait for all of those processes to release their locks before it can do anything else. On the other hand, if process A is awaiting tuples from B, C, and D, it just needs ONE of those processes to emit tuples in order to make progress. Now maybe that doesn't make any difference in practice, because even if two of those processes are making lively progress and A is receiving tuples from them and processing them like gangbusters, that's probably not going to help the third one get unstuck. If we adopt the approach of discounting that possibility, then as long as the parallel leader can generate tuples locally (that is, local_scan_done = false) we don't report the deadlock, but as soon as it can no longer do that (local_scan_done = true) then we do, even though we could still theoretically read more tuples from the non-stuck workers. So then you have to wonder why we're not solving problem #1, because the deadlock was just as certain before we generated the maximum possible number of tuples locally as it was afterwards.

I think the deadlock detection code should run any time we have to wait for more than deadlock_timeout. Now I think the important point here is when to actually start waiting: as per the current parallel_seqscan patch, we wait only after checking the tuples from all queues, but we could have done it the other way (wait if we can't fetch from one queue) as well.
Both ways seem to have pros and cons, so we can proceed assuming the current way is okay and consider changing it in the future once we find some reason to do so.

Having said that, I feel this is not a problem that we should try to solve at this point unless there is some scenario where we could hit a deadlock due to communication buffers. I think such a solution would be required for more advanced forms of parallelism (like intra-query parallelism). By the way, why are we trying to solve this problem? Is there any way we can hit it with parallel sequential scan?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
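The "check the tuples from all queues before waiting" behavior Amit describes corresponds to a non-blocking polling loop over the worker queues, roughly like this (schematic only; shm_mq_receive() with nowait = true is the real non-blocking interface, while process_tuple and the queues array are placeholders assumed to have been set up earlier):

    /* Assumes: shm_mq_handle *queues[nworkers], attached earlier. */
    for (;;)
    {
        bool    got_tuple = false;
        int     i;

        for (i = 0; i < nworkers; ++i)
        {
            Size            nbytes;
            void           *data;
            shm_mq_result   res;

            /* Poll this worker's queue without blocking. */
            res = shm_mq_receive(queues[i], &nbytes, &data, true);
            if (res == SHM_MQ_SUCCESS)
            {
                process_tuple(data, nbytes);    /* placeholder consumer */
                got_tuple = true;
            }
        }

        /* Wait only if every queue came up empty this round. */
        if (!got_tuple)
        {
            WaitLatch(MyLatch, WL_LATCH_SET, 0);
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
    }

The alternative Amit mentions - blocking as soon as one queue can't be read - would replace the nowait poll with a blocking receive on queues[i], which is exactly the trade-off he says has pros and cons.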
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:

You'd need some kind of API that says pretend I'm waiting for this lock, but don't really wait for it, and you'd need to be darn sure that you removed yourself from the wait queue again before doing any other heavyweight lock manipulation. Do you have specific thoughts on how to implement this?

I've thought some about this, and I think it's a bit easier to not do it on the actual lock waitqueues, but teach deadlock.c about that kind of blocking. deadlock.c is far from simple, and at least I don't find the control flow to be particularly clear. So it's not easy. It'd be advantageous to tackle things at that level because it'd avoid the need to acquire locks on some lock's waitqueue when blocking; we're going to do that a lot. But it seems to me that it should be possible to succeed: In FindLockCycleRecurse(), in the case that we're not waiting for an actual lock (checkProc->links.next == NULL) we can add a case that considers the 'blocking parallelism' case. ISTM that that's just a FindLockCycleRecurse() call on the process that we're waiting for. We'd either have to find the other side's locktag for DEADLOCK_INFO or invent another field in there; but that seems like a solvable problem.

I (finally) had some time to look at this today. Initially, it looked good. Unfortunately, the longer I looked at it, the less convinced I got that we could solve the problem this way. The core of the issue is that the Funnel node in the parallel group leader basically does this:

    while (!local_scan_done || !remote_scan_done)
    {
        attempt a read from each remaining worker's tuple queue,
            blocking only if local_scan_done;
        if (we got a tuple)
            return it;
        else if (there are no remaining workers)
            remote_scan_done = true;

        attempt to produce a tuple just as if we were a worker ourselves;
        if (we got a tuple)
            return it;
        else
            local_scan_done = true;
    }

Imagine that we're doing a parallel sequential scan; each worker claims one page but goes into the tank before it has returned all of the tuples on that page. The master reads all the remaining pages but must now wait for the workers to finish returning the tuples on the pages they claimed. So what this means is:

1. The master doesn't actually *wait* until the very end of the parallel phase.

2. When the master does wait, it waits for all of the parallel workers at once, not each one individually.

So, I don't think anything as simplistic as teaching a blocking shm_mq_receive() to tip off the deadlock detector that we're waiting for the process on the other end of that particular queue can ever work. Because of point #2, that never happens.

When I first started thinking about how to fix this, I said, well, that's no big deal, we can just advertise the whole list of processes that we're waiting for in shared memory, rather than just the one. This is a bit tricky, though. Any general API for any backend to advertise that it's waiting for an arbitrary subset of the other backends would require O(n^2) shared memory state. That wouldn't be completely insane, but it's not great, either. For this particular case, we could optimize that down to O(n) by just chaining all of the children of a given parallel group leader in a linked list whose nodes are inlined in their PGPROCs, but that doesn't feel very general, because it forecloses the possibility of the children ever using that API, and I think they might need to.
If nothing else, they might need to advertise that they're blocking on the master if they are trying to send data, the queue is full, and they have to wait for the master to drain some of it before they can proceed.

After thinking about it a bit more, I realized that even if we settle on some solution to that problem, there's another issue: the wait-edges created by this system don't really have the same semantics as regular lock waits. Suppose A is waiting on a lock held by B and B is waiting for a lock held by A; that's a deadlock. But if A is waiting for B to write to a tuple queue and B is waiting for A to read from a tuple queue, that's not a deadlock if the queues in question are the same. If they are different queues, it might be a deadlock, but then again maybe not. It may be that A is prepared to accept B's message from one queue, and that upon fully receiving it, it will do some further work that will lead it to write a tuple into the other queue. If so, we're OK; if not, it's a deadlock.

I'm not sure whether you'll want to argue that that is an implausible scenario, but I'm not too sure it is. The worker could be saying hey, I need some additional piece of your backend-local state in order to finish this computation, and the master could then provide it. I don't have any plans like that, but it's been suggested previously by others, so it's not an obviously nonsensical thing to want to do.

A further difference in the semantics of these wait-edges is that if process A is awaiting AccessExclusiveLock on a resource held by B, C, and D at AccessShareLock, it needs to wait for all of those processes to release their locks before it can do anything else. On the other hand, if process A is awaiting tuples from B, C, and D, it just needs ONE of those processes to emit tuples in order to make progress.
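The O(n) idea mentioned above - chaining a leader's workers through their PGPROCs instead of keeping per-pair state - would add fields along these lines (a sketch; the field names are illustrative, not from a committed patch):

    struct PGPROC
    {
        /* ... existing fields ... */

        /* Parallel group tracking, O(n) total across all backends: */
        PGPROC     *lockGroupLeader;   /* my group's leader, or NULL */
        dlist_head  lockGroupMembers;  /* used only by the leader */
        dlist_node  lockGroupLink;     /* my link in the leader's list */
    };

Each backend carries a constant amount of extra state, so the whole wait graph for a group costs O(n) rather than the O(n^2) a general any-subset-of-backends API would require.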
Re: [HACKERS] parallel mode and parallel contexts
On Mon, Apr 20, 2015 at 6:49 PM, Peter Geoghegan p...@heroku.com wrote:

I see that you're using git format-patch to generate this. But the patch is only patch 1/4. Is that intentional? Where are the other pieces? I think that the parallel seqscan patch, and the assessing parallel safety patch are intended to fit together with this patch, but I can't find a place where there is a high level overview explaining just how they fit together (I notice Amit's patch has an #include access/parallel.h, which is here, but that wasn't trivial to figure out). I haven't been paying too much attention to this patch series.

The intended order of application is:

- parallel mode/contexts
- assess parallel safety
- parallel heap scan
- parallel seq scan

The parallel heap scan patch should probably be merged into the parallel seq scan patch, which should probably then be split into several patches, one or more with general parallel query infrastructure and then another with stuff specific to parallel sequential scan. But we're not quite there yet. Until we can get agreement on how to finalize the parallel mode/contexts patch, I haven't been too worried about getting the other stuff sorted out exactly right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Mar 24, 2015 at 11:56 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

Well, it's not actually the same message. They're all a bit different. Or mostly all of them. And the variable part is not a command name, as in the PreventTransactionChain() case, so it would affect translatability if we did that.

Three things that vary are 1) the fact that some check IsParallelWorker() and others check IsInParallelMode(), 2) that some of them are ereports() while others are elog(), and 3) that some are ERROR and others are FATAL. Is there a reason for these differences?

The message text also varies, because it states the particular operation that is prohibited. We should check IsParallelWorker() for operations that are allowed in the master during parallel mode, but not allowed in the workers - e.g. the master can scan its own temporary relations, but its workers can't. We should check IsInParallelMode() for operations that are completely off-limits in parallel mode - i.e. writes. We use ereport() where we expect that SQL could hit that check, and elog() where we expect that only (buggy?) C code could hit that check. We use FATAL for some of the checks in xact.c, for parity with other, similar checks in xact.c that relate to checking the transaction state. I think this is because we assume that if the transaction state is hosed, it's necessary to terminate the backend to recover. In other cases, we use ERROR.

(Note typo "restrction" in quoted paragraph above.)

Thanks, will fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
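Putting the two conventions side by side, the pattern looks like this (the first check is taken from the heapam.c hunk quoted elsewhere in the thread; the second, worker-only restriction is schematic):

    /* Completely off-limits in parallel mode (a write), reachable from
     * SQL, hence ereport(): */
    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot insert tuples during a parallel operation")));

    /* Allowed in the master during parallel mode, but not in workers: */
    if (IsParallelWorker())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot access temporary tables during a parallel operation")));

A check reachable only from (buggy) C code would instead use elog(ERROR, ...), and the xact.c transaction-state checks escalate to FATAL, per the explanation above.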
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Mar 19, 2015 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:

Here is yet another version of this patch. In addition to the fixes mentioned above, this version includes some minor rebasing around recent commits, and also better handling of the case where we discover that we cannot launch workers after all. This can happen because (1) dynamic_shared_memory_type=none, (2) the maximum number of DSM segments supported by the system configuration is already in use, or (3) the user creates a parallel context with nworkers=0. In any of those cases, the system will now create a backend-private memory segment instead of a dynamic shared memory segment, and will skip steps that don't need to be done in that case. This is obviously an undesirable scenario. If we choose a parallel sequential scan, we want it to launch workers and really run in parallel. Hopefully, in case (1) or case (3), we will avoid choosing a parallel plan in the first place, but case (2) is pretty hard to avoid completely, as we have no idea what other processes may or may not be doing with dynamic shared memory segments ... and, in any case, degrading to non-parallel execution beats failing outright.

I see that you're using git format-patch to generate this. But the patch is only patch 1/4. Is that intentional? Where are the other pieces? I think that the parallel seqscan patch, and the assessing parallel safety patch are intended to fit together with this patch, but I can't find a place where there is a high level overview explaining just how they fit together (I notice Amit's patch has an #include access/parallel.h, which is here, but that wasn't trivial to figure out). I haven't been paying too much attention to this patch series.

--
Peter Geoghegan
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:

+ - The currently active user ID and security context. Note that this is
+   the fourth user ID we restore: the initial step of binding to the correct
+   database also involves restoring the authenticated user ID. When GUC
+   values are restored, this incidentally sets SessionUserId and OuterUserId
+   to the correct values. This final step restores CurrentUserId.

Ah. That's the answer for above. Could you just move it next to the other user bit?

Well, I think it's good to keep this in the same order it happens. That's almost true, with the exception of the libraries, which were out of order. (I've fixed that now.)

I don't find this convincing in the least. This should explain to a developer which state he can rely on being shared. For that, a sane order is helpful. In which precise order this happens doesn't seem to matter for that at all; it's also likely to get out of date.

I'm hoping we can agree to disagree on this point. I like my order better, you like your order better; we're arguing about the ordering of paragraphs in a README.

You'd need some kind of API that says pretend I'm waiting for this lock, but don't really wait for it, and you'd need to be darn sure that you removed yourself from the wait queue again before doing any other heavyweight lock manipulation. Do you have specific thoughts on how to implement this?

I've thought some about this, and I think it's a bit easier to not do it on the actual lock waitqueues, but teach deadlock.c about that kind of blocking. deadlock.c is far from simple, and at least I don't find the control flow to be particularly clear. So it's not easy. It'd be advantageous to tackle things at that level because it'd avoid the need to acquire locks on some lock's waitqueue when blocking; we're going to do that a lot. But it seems to me that it should be possible to succeed: In FindLockCycleRecurse(), in the case that we're not waiting for an actual lock (checkProc->links.next == NULL) we can add a case that considers the 'blocking parallelism' case. ISTM that that's just a FindLockCycleRecurse() call on the process that we're waiting for. We'd either have to find the other side's locktag for DEADLOCK_INFO or invent another field in there; but that seems like a solvable problem.

I'll take a look at this. Thanks for the pointers.

That's a fair point. I was worried that this is going to introduce additional hard edges between processes. I think it actually might, but only when a lock is upgraded; which really shouldn't happen for the copied locks.

Agreed. And if it does, and we get a deadlock, I think I'm sorta OK with that. It's a known fact that lock upgrades are a deadlock hazard, and the possible existence of obscure scenarios where that's more likely when parallelism is involved than without it doesn't bother me terribly. I mean, we have to look at the specifics, but I think it's far from obvious that

I'm not sure if I said that somewhere before: If we aborted parallelism if any of the to-be-copied locks would conflict with its copy, instead of duplicating them, I could live with this. Then this would amount to jumping the lock queue, which seems reasonable to me. Since not being able to do parallelism already needs to be handled gracefully, this doesn't seem to be too bad.

It's a tempting offer since it would shorten the path to getting this committed, but I think that's leaving a lot on the table.
In particular, if we ever want to have parallel CLUSTER, VACUUM FULL, or ALTER TABLE, that's not going to get us there. If we don't have the right infrastructure to support that in whatever the initial commit is, fine. But we've got to have a plan that gets us there, or we're just boxing ourselves into a corner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
Robert Haas wrote:
On Wed, Mar 18, 2015 at 7:10 PM, Andres Freund and...@2ndquadrant.com wrote:

+	/*
+	 * For now, parallel operations are required to be strictly read-only.
+	 * Unlike heap_update() and heap_delete(), an insert should never create
+	 * a combo CID, so it might be possible to relax this restrction, but
+	 * not without more thought and testing.
+	 */
+	if (IsInParallelMode())
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
+				 errmsg("cannot insert tuples during a parallel operation")));
+

Minor nitpick: Should we perhaps move the ereport to a separate function? Akin to PreventTransactionChain()? Seems a bit nicer to not have the same message everywhere.

Well, it's not actually the same message. They're all a bit different. Or mostly all of them. And the variable part is not a command name, as in the PreventTransactionChain() case, so it would affect translatability if we did that.

Three things that vary are 1) the fact that some check IsParallelWorker() and others check IsInParallelMode(), 2) that some of them are ereports() while others are elog(), and 3) that some are ERROR and others are FATAL. Is there a reason for these differences?

(Note typo "restrction" in quoted paragraph above.)

Maybe something similar to what commit f88d4cfc did: have a function where all possible messages are together, and one is selected with some enum. That way, it's easier to maintain consistency if more cases are added in the future. I don't actually think this is a big deal.

Yeah.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] parallel mode and parallel contexts
On 2015-03-19 14:13:59 -0400, Robert Haas wrote:
On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund and...@2ndquadrant.com wrote:

Reading the README first, the rest later. So you can comment on my comments, while I actually look at the code. Parallelism, yay!

Sorry, we don't support parallelism yet. :-)

And not even proper sequential work apparently...

+ - The currently active user ID and security context. Note that this is
+   the fourth user ID we restore: the initial step of binding to the correct
+   database also involves restoring the authenticated user ID. When GUC
+   values are restored, this incidentally sets SessionUserId and OuterUserId
+   to the correct values. This final step restores CurrentUserId.

Ah. That's the answer for above. Could you just move it next to the other user bit?

Well, I think it's good to keep this in the same order it happens. That's almost true, with the exception of the libraries, which were out of order. (I've fixed that now.)

I don't find this convincing in the least. This should explain to a developer which state he can rely on being shared. For that, a sane order is helpful. In which precise order this happens doesn't seem to matter for that at all; it's also likely to get out of date.

+At the end of a parallel operation, which can happen either because it
+completed successfully or because it was interrupted by an error, parallel
+workers associated with that operation exit. In the error case, transaction
+abort processing in the parallel leader kills off any remaining workers, and
+the parallel leader then waits for them to die.

Very slightly awkward because first you talk about successful *or* error and then about abort processing.

I don't understand what's awkward about that. I make a general statement about what happens at the end of a parallel operation, and then the next few sentences follow up by explaining what happens in the error case, and what happens in the success case.

I seem to have completely overread the "In the error case," part of the sentence. Forget what I wrote.

+Copying locks to workers is important for the avoidance of undetected
+deadlock between the initiating process and its parallel workers. If the
+initiating process holds a lock on an object at the start of parallelism,
+and the worker subsequently attempts to acquire a lock on that same object
+and blocks, this will result in an undetected deadlock, because the
+initiating process cannot finish the transaction (thus releasing the lock)
+until the worker terminates, and the worker cannot acquire the lock while
+the initiating process holds it. Locks which the processes involved acquire
+and then release during parallelism do not present this hazard; they simply
+force the processes involved to take turns accessing the protected resource.

I don't think this is a strong guarantee. There very well can be lack of forward progress if they're waiting on each other in some way. Say the master backend holds the lock and waits on output from a worker. The worker then will endlessly wait for the lock to become free. A deadlock. Or, as another scenario, consider cooperating backends that both try to send tuples to each other but the queue is full. A deadlock.

The idea is that both the master and the workers are restricted to locks which they take and release again. The idea is, specifically, to allow syscache lookups. Taking a lock and then waiting for the worker is no good, which is why WaitForParallelWorkersToFinish() calls CheckForRetainedParallelLocks(). That would catch your first scenario.
Your second scenario seems to me to be a different kind of problem. If that comes up, what you're going to want to do is rewrite the workers to avoid the deadlock by using non-blocking messaging.

I don't think that actually really solves the problem. You'll still end up blocking with full pipes at some point. Whether you do it inside the messaging code or outside it doesn't particularly matter.

In any case, I don't think it's really this patch's job to prevent deadlocks in code that doesn't exist today and in no way involves the lock manager.

Well, you're arguing that we need a solution in the parallelism infrastructure for the lock deadlocks problem. I don't think it's absurd to extend care to other cases. To me it seems the deadlock detector has to be enhanced to be able to see 'waiting for' edges. Independent of how we resolve our difference of opinion on the copying of locks.

It seems to me that this isn't all that hard: Whenever we block waiting for another backend we enter ourselves on the wait queue to that backend's virtual transaction. When finished we take the blocking backend off. That, afaics, should do it. Alternatively we can just publish what backend we're waiting for in PGPROC and make deadlock also look at that; but, while
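Andres's PGPROC alternative at the end there might look like this (a sketch; the field and function names are hypothetical, not from any posted patch):

    /* Hypothetical announcement of a latch-style wait, visible to
     * deadlock.c alongside the regular heavyweight-lock wait info. */
    void
    BeginWaitingForBackend(PGPROC *target)
    {
        Assert(MyProc->waitingForBackend == NULL);  /* hypothetical field */
        MyProc->waitingForBackend = target;
        pg_write_barrier();
    }

    void
    EndWaitingForBackend(void)
    {
        MyProc->waitingForBackend = NULL;
    }

The deadlock detector would then treat a non-NULL waitingForBackend as one more kind of outgoing edge when it walks the wait graph.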
Re: [HACKERS] parallel mode and parallel contexts
Andres Freund and...@2ndquadrant.com writes:

Also: Man, trying to understand the guts of deadlock.c only made me understand how *friggin* expensive deadlock checking is. I'm really rather surprised that it only infrequently causes problems.

The reason for that is that we only run deadlock checking if something's been waiting for at least one second, which pretty much takes it out of any performance-relevant code pathway. I think it would be a serious error to put any deadlock-checking-like behavior into mainline code. Which probably means that Andres is right that teaching deadlock.c about any new sources of deadlock is the way to approach this.

			regards, tom lane
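The one-second delay Tom describes is deadlock_timeout, armed only once a backend actually blocks; the flow in proc.c is roughly the following (a simplified sketch of the real mechanism):

    /* In ProcSleep(), after joining the lock's wait queue: */
    enable_timeout_after(DEADLOCK_CHECK_TIMEOUT, DeadlockTimeout);

    /*
     * The backend then sleeps on its semaphore.  Only if the timeout
     * fires does CheckDeadLock() run the expensive graph search in
     * deadlock.c; a promptly granted lock means the check never runs,
     * so the fast path pays essentially nothing.
     */

That is why any new wait-edge mechanism for parallelism would likewise want its bookkeeping to be cheap at block time and to defer all expensive analysis to the timeout handler.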
Re: [HACKERS] parallel mode and parallel contexts
On 2015-03-18 12:02:07 -0400, Robert Haas wrote: diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index cb6f8a3..173f0ba 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2234,6 +2234,17 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid, CommandId cid, int options) { + /* + * For now, parallel operations are required to be strictly read-only. + * Unlike heap_update() and heap_delete(), an insert should never create + * a combo CID, so it might be possible to relax this restrction, but + * not without more thought and testing. + */ + if (IsInParallelMode()) + ereport(ERROR, + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), + errmsg(cannot insert tuples during a parallel operation))); + Minor nitpick: Should we perhaps move the ereport to a separate function? Akin to PreventTransactionChain()? Seems a bit nicer to not have the same message everywhere. +void +DestroyParallelContext(ParallelContext *pcxt) +{ + int i; + + /* + * Be careful about order of operations here! We remove the parallel + * context from the list before we do anything else; otherwise, if an + * error occurs during a subsequent step, we might try to nuke it again + * from AtEOXact_Parallel or AtEOSubXact_Parallel. + */ + dlist_delete(pcxt-node); Hm. I'm wondering about this. What if we actually fail below? Like in dsm_detach() or it's callbacks? Then we'll enter abort again at some point (during proc_exit at the latest) but will not wait for the child workers. Right? +/* + * End-of-subtransaction cleanup for parallel contexts. + * + * Currently, it's forbidden to enter or leave a subtransaction while + * parallel mode is in effect, so we could just blow away everything. But + * we may want to relax that restriction in the future, so this code + * contemplates that there may be multiple subtransaction IDs in pcxt_list. + */ +void +AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId) +{ + while (!dlist_is_empty(pcxt_list)) + { + ParallelContext *pcxt; + + pcxt = dlist_head_element(ParallelContext, node, pcxt_list); + if (pcxt-subid != mySubId) + break; + if (isCommit) + elog(WARNING, leaked parallel context); + DestroyParallelContext(pcxt); + } +} +/* + * End-of-transaction cleanup for parallel contexts. + */ +void +AtEOXact_Parallel(bool isCommit) +{ + while (!dlist_is_empty(pcxt_list)) + { + ParallelContext *pcxt; + + pcxt = dlist_head_element(ParallelContext, node, pcxt_list); + if (isCommit) + elog(WARNING, leaked parallel context); + DestroyParallelContext(pcxt); + } +} Is there any reason to treat the isCommit case as a warning? To me that sounds like a pretty much guaranteed programming error. If your're going to argue that a couple resource leakage warnings do similarly: I don't think that counts for that much. For one e.g. a leaked tupdesc has much less consequences, for another IIRC those warnings were introduced after the fact. + * When running as a parallel worker, we place only a single + * TransactionStateData is placed on the parallel worker's + * state stack, 'we place .. 
is placed' /* + * EnterParallelMode + */ +void +EnterParallelMode(void) +{ + TransactionState s = CurrentTransactionState; + + Assert(s->parallelModeLevel >= 0); + + ++s->parallelModeLevel; +} + +/* + * ExitParallelMode + */ +void +ExitParallelMode(void) +{ + TransactionState s = CurrentTransactionState; + + Assert(s->parallelModeLevel > 0); + Assert(s->parallelModeLevel > 1 || !ParallelContextActive()); + + --s->parallelModeLevel; + + if (s->parallelModeLevel == 0) + CheckForRetainedParallelLocks(); +} Hm. Is it actually ok for nested parallel mode to retain locks on their own? Won't that pose a problem? Or did you do it that way just because we don't have more state? If so, that deserves a comment explaining that that's the case and why it's acceptable. @@ -1769,6 +1904,9 @@ CommitTransaction(void) { TransactionState s = CurrentTransactionState; TransactionId latestXid; + bool parallel; + + parallel = (s->blockState == TBLOCK_PARALLEL_INPROGRESS); + /* If we might have parallel workers, clean them up now. */ + if (IsInParallelMode()) + { + CheckForRetainedParallelLocks(); + AtEOXact_Parallel(true); + s->parallelModeLevel = 0; + } 'parallel' looks strange when we also, rightly so, do stuff like checking
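The helper Andres asks for does materialize later in this thread under the name PreventCommandIfParallelMode() (see the nextval() discussion below); a sketch of what such a function plausibly looks like, assuming the obvious signature:

void
PreventCommandIfParallelMode(const char *cmdname)
{
	if (IsInParallelMode())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 /* translator: %s is the name of a command */
				 errmsg("cannot execute %s during a parallel operation",
						cmdname)));
}

Callers then pass a command name, e.g. PreventCommandIfParallelMode("nextval()"), and the message stays consistent everywhere.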
Re: [HACKERS] parallel mode and parallel contexts
Hi, Reading the README first, the rest later. So you can comment on my comments, while I actually look at the code. Parallelism, yay! On 2015-03-18 12:02:07 -0400, Robert Haas wrote: +Instead, we take a more pragmatic approach: we try to make as many of the +operations that are safe outside of parallel mode work correctly in parallel +mode as well, and we try to prohibit the rest via suitable error checks. I'd say that we'd try to prohibit a bunch of common cases. Among them the ones that are triggerable from SQL. We don't really try to prohibit many kinds of traps, as you describe. + - The authenticated user ID and current database. Each parallel worker +will connect to the same database as the initiating backend, using the +same user ID. This sentence immediately makes me wonder why only the authenticated user, and not the currently active role, is mentioned. + - The values of all GUCs. Accordingly, permanent changes to the value of +any GUC are forbidden while in parallel mode; but temporary changes, +such as entering a function with non-NULL proconfig, are potentially OK. 'potentially OK' sounds odd to me. Which danger are you seeing that isn't relevant for norm + - The combo CID mappings. This is needed to ensure consistent answers to +tuple visibility checks. The need to synchronize this data structure is +a major reason why we can't support writes in parallel mode: such writes +might create new combo CIDs, and we have no way to let other workers +(or the initiating backend) know about them. Absolutely *not* in the initial version, but we really need to find a better solution for this. + - The currently active user ID and security context. Note that this is +the fourth user ID we restore: the initial step of binding to the correct +database also involves restoring the authenticated user ID. When GUC +values are restored, this incidentally sets SessionUserId and OuterUserId +to the correct values. This final step restores CurrentUserId. Ah. That's the answer for above. Could you just move it next to the other user bit? +We could copy the entire transaction state stack, +but most of it would be useless: for example, you can't roll back to a +savepoint from within a parallel worker, and there are no resources +associated with the memory contexts or resource owners of intermediate +subtransactions. I do wonder if we're not going to have to change this in the not too far away future. But then, this isn't externally visible at all, so whatever. +At the end of a parallel operation, which can happen either because it +completed successfully or because it was interrupted by an error, parallel +workers associated with that operation exit. In the error case, transaction +abort processing in the parallel leader kills off any remaining workers, and +the parallel leader then waits for them to die. Very slightly awkward, because first you talk about successful *or* error and then about abort processing. + - Cleanup of pg_temp namespaces is not done. The initiating backend is +responsible for this, too. How could a worker have its own pg_temp namespace? +Lock Management +=== + +Certain heavyweight locks that the initiating backend holds at the beginning +of parallelism are copied to each worker, which unconditionally acquires them. +The parallel backends acquire - without waiting - each such lock that the +leader holds, even if that lock is self-exclusive. 
This creates the unusual +situation that a lock which could normally only be held by a single backend +can be shared among several backends in a parallel group. + +Obviously, this presents significant hazards that would not be present in +normal execution. If, for example, a backend were to initiate parallelism +while ReindexIsProcessingIndex() were true for some index, the parallel +backends launched at that time would neither share this state nor be excluded +from accessing the index via the heavyweight lock mechanism. It is therefore +imperative that backends only initiate parallelism from places where it will +be safe for parallel workers to access the relations on which they hold locks. +It is also important that they not afterwards do anything which causes access +to those relations to become unsafe, or at least not until after parallelism +has concluded. The fact that parallelism is strictly read-only means that the +opportunities for such mishaps are few and far between; furthermore, most +operations which present these hazards are DDL operations, which will be +rejected by CheckTableNotInUse() if parallel mode is active. + +Only relation locks, locks on database or shared objects, and perhaps the lock +on our transaction ID are copied to workers. perhaps? Advisory locks are not copied, +but the leader may hold them at the start of parallelism; they cannot +subsequently be manipulated
Re: [HACKERS] parallel mode and parallel contexts
On Sun, Feb 15, 2015 at 11:48 AM, Robert Haas robertmh...@gmail.com wrote: On Fri, Feb 13, 2015 at 2:22 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas robertmh...@gmail.com wrote: We're not seeing eye to eye here yet, since I don't accept your example scenario and you don't accept mine. Let's keep discussing. Meanwhile, here's an updated patch. A lot of cool activity is showing up here, so moved the patch to CF 2015-02. Perhaps Andres you could add yourself as a reviewer? Here's a new version of the patch with (a) some additional restrictions on heavyweight locking both at the start of, and during, parallel mode and (b) a write-up in the README explaining the restrictions and my theory of why the handling of heavyweight locking is safe. Hopefully this goes some way towards addressing Andres's concerns. I've also replaced the specific (and wrong) messages about advisory locks with a more generic message, as previously proposed; and I've fixed at least one bug. Today, while testing the parallel_seqscan patch, I encountered one intermittent issue (it hangs in the function below) and I have a question related to it. +void +WaitForParallelWorkersToFinish(ParallelContext *pcxt) +{ .. + for (;;) + { .. + CHECK_FOR_INTERRUPTS(); + for (i = 0; i < pcxt->nworkers; ++i) + { + if (pcxt->worker[i].error_mqh != NULL) + { + anyone_alive = true; + break; + } + } + + if (!anyone_alive) + break; + + WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1); + ResetLatch(&MyProc->procLatch); + } Isn't there a race condition in this function such that after it finds that there is some alive worker and before it does WaitLatch(), the worker completes its work and exits; in such a case, who is going to wake the backend waiting on procLatch? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On Fri, Mar 6, 2015 at 7:01 AM, Amit Kapila amit.kapil...@gmail.com wrote: Today, while testing the parallel_seqscan patch, I encountered one intermittent issue (it hangs in the function below) and I have a question related to it. +void +WaitForParallelWorkersToFinish(ParallelContext *pcxt) +{ .. + for (;;) + { .. + CHECK_FOR_INTERRUPTS(); + for (i = 0; i < pcxt->nworkers; ++i) + { + if (pcxt->worker[i].error_mqh != NULL) + { + anyone_alive = true; + break; + } + } + + if (!anyone_alive) + break; + + WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1); + ResetLatch(&MyProc->procLatch); + } Isn't there a race condition in this function such that after it finds that there is some alive worker and before it does WaitLatch(), the worker completes its work and exits; in such a case, who is going to wake the backend waiting on procLatch? It doesn't matter whether some other backend sets the process latch before we reach WaitLatch() or after we begin waiting. Either way, it's fine. It would be bad if the process could exit without setting the latch at all, though. I hope that's not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
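Spelled out, the reason the ordering is harmless is the standard latch protocol; a condensed sketch (the re-check predicate is hypothetical):

for (;;)
{
	CHECK_FOR_INTERRUPTS();

	if (!any_worker_still_attached())	/* hypothetical re-check of state */
		break;

	/* Returns immediately if a worker has already set our latch. */
	WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);

	/*
	 * A SetLatch() arriving after this ResetLatch() leaves the latch set,
	 * so the next WaitLatch() cannot sleep through it; and we always
	 * re-check the condition before waiting again.
	 */
	ResetLatch(&MyProc->procLatch);
}

The only fatal failure mode is the one Robert names: a worker that exits without ever setting the latch.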
Re: [HACKERS] parallel mode and parallel contexts
On 2015-02-11 13:59:04 -0500, Robert Haas wrote: On Tue, Feb 10, 2015 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote: The only reason I'd like it to be active is because that'd *prohibit* doing the crazier stuff. There seems little reason to not do it under the additional protection against crazy things that'd give us? Trying to load additional libraries once parallel mode is in progress can provoke failures, because the load of the libraries causes the system to do GUC_ACTION_SET on the GUCs whose initialization was deferred, and that trips up the no-changing-things-while-in-parallel-mode checks. Oh, that's a good point. I'm not sure if there's anything we can, or should, do about that. Fine with me. Well, ExportSnapshot()/Import has quite a few more restrictions than what you're doing... Most importantly it cannot import in transactions unless using read committed isolation, forces xid assignment during export, forces the old snapshot to stick around for the whole transaction and only works on a primary. I'm not actually sure it makes a relevant difference, but it's certainly worth calling attention to. The fudging I was wondering about is certainly unnecessary though - GetSnapshotData() has already performed it... I'm not particularly awake right now and I think this needs a closer look either by someone awake... I'm not fully convinced this is safe. I'm not 100% comfortable with it either, but I just spent some time looking at it and can't see what's wrong with it. Basically, we're trying to get the parallel worker into a state that matches the master's state after doing GetTransactionSnapshot() - namely, CurrentSnapshot should point to the same snapshot on the master, and FirstSnapshotSet should be true, plus the same additional processing that GetTransactionSnapshot() would have done if we're in a higher transaction isolation level. It's possible we don't need to mimic that state, but it seems like a good idea. I plan to look at this soonish. Still, I wonder if we ought to be banning GetTransactionSnapshot() altogether. I'm not sure if there's ever a time when it's safe for a worker to call that. Why? Also, you don't seem to prohibit popping the active snapshot (should that be prohibited maybe?) which might bump the initiator's xmin horizon. I think as long as our transaction snapshot is installed correctly our xmin horizon can't advance; am I missing something? Maybe I'm missing something, but why would that be the case in a read committed session? ImportSnapshot() only ever calls SetTransactionSnapshot in such a case (why does it contain code to cope without it?), but your patch doesn't seem to guarantee that. But as you don't actually transport XactIsoLevel anywhere (an omission, probably?) that seems to mean that the SetTransactionSnapshot() won't do much, even if the source transaction is repeatable read. Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore the GUC state? Yes, but afaics the transaction start will just overwrite it again: static void StartTransaction(void) { ... XactDeferrable = DefaultXactDeferrable; XactIsoLevel = DefaultXactIsoLevel; ... 
For a client issued BEGIN it works because utility.c does: case TRANS_STMT_BEGIN: case TRANS_STMT_START: { ListCell *lc; BeginTransactionBlock(); foreach(lc, stmt->options) { DefElem *item = (DefElem *) lfirst(lc); if (strcmp(item->defname, "transaction_isolation") == 0) SetPGVariable("transaction_isolation", list_make1(item->arg), true); else if (strcmp(item->defname, "transaction_read_only") == 0) SetPGVariable("transaction_read_only", list_make1(item->arg), true); else if (strcmp(item->defname, "transaction_deferrable") == 0) SetPGVariable("transaction_deferrable", list_make1(item->arg), true); } Pretty, isn't it? Your manual ones don't either, that's what made me complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL level. Instead they use the C name + parens. On further reflection, I think I should probably just change all of those to use a general message about advisory locks, i.e.: ERROR:
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas robertmh...@gmail.com wrote: We're not seeing eye to eye here yet, since I don't accept your example scenario and you don't accept mine. Let's keep discussing. Meanwhile, here's an updated patch. A lot of cool activity is showing up here, so moved the patch to CF 2015-02. Perhaps Andres you could add yourself as a reviewer? -- Michael
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Feb 11, 2015 at 1:59 PM, Robert Haas robertmh...@gmail.com wrote: I'm not sure what you mean by the severity of the lock. How about marking the locks that the worker inherited from the parallel master and throwing an error if it tries to lock one of those objects in a mode that it does not already hold? That'd probably catch a sizeable number of programming errors without tripping up many legitimate use cases (and we can always relax or modify the prohibition if it later turns out to be problematic). Or maybe do that only when the new lock mode is stronger than one it already holds. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
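Neither variant exists in the patch at this point, but the check Robert floats might look roughly like this; the inherited flag, the stored maximum mode, the InParallelWorker() test, and treating a numerically larger LOCKMODE as "stronger" are all assumptions made for illustration:

static void
check_lock_inherited_from_leader(LOCALLOCK *locallock, LOCKMODE reqmode)
{
	if (!InParallelWorker())					/* hypothetical test */
		return;
	if (locallock->inherited_from_leader &&		/* hypothetical flag */
		reqmode > locallock->max_inherited_mode)	/* hypothetical field */
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot take a lock stronger than one inherited from the parallel leader")));
}

The strict version would error on any mode not already held; the relaxed version is just the reqmode comparison above.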
Re: [HACKERS] parallel mode and parallel contexts
On 2015-02-10 11:49:58 -0500, Robert Haas wrote: On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote: * Don't like CreateParallelContextForExtension much as a function name - I don't think we normally equate the fact that code is located in a loadable library with the extension mechanism. Suggestions for a better name? CreateParallelContextForLoadableFunction? *ExternalFunction maybe? That's what dfmgr.c calls it. * the plain shm calculations intentionally use mul_size/add_size to deal with overflows. On 32bit it doesn't seem impossible, but unlikely, to overflow size_t. Yes, I do that here too, though as with the plain shm calculations, only in the estimate functions. The functions that actually serialize stuff don't have to worry about overflow because it's already been checked. I thought I'd seen some estimation functions that didn't use it. But apparently I was wrong. * In +LaunchParallelWorkers, does it make sense trying to start workers if one failed? ISTM that's likely to not be helpful. I.e. it should just break; after the first failure. It can't just break, because clearing pcxt->worker[i].error_mqh is an essential step. I could add a flag variable that tracks whether any registrations have failed and change the if statement to if (!any_registrations_failed && RegisterDynamicBackgroundWorker(worker, &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that but decided it was very unlikely to affect the real-world performance of anything, so easier just to keep the code simple. But I'll change it if you want. I think it'd be better. * ParallelMain restores libraries before GUC state. Given that libraries can, and actually somewhat frequently do, inspect GUCs on load, it seems better to do it the other way round? You argue We want to do this before restoring GUCs, because the libraries might define custom variables., but I don't buy that. It's completely normal for namespaced GUCs to be present before a library is loaded? Especially as you then go and process_session_preload_libraries() after setting the GUCs. This is a good question, but the answer is not entirely clear to me. I'm thinking I should probably just remove process_session_preload_libraries() altogether at this point. That was added at a time when RestoreLibraryState() didn't exist yet, and I think RestoreLibraryState() is a far superior way of handling this, because surely the list of libraries that the parallel leader *actually had loaded* is more definitive than any GUC. That sounds like a good idea to me. Now, that doesn't answer the question about whether we should load libraries first or GUCs. I agree that the comment's reasoning is bogus, but I'm not sure I understand why you think the other order is better. It *is* normal for namespaced GUCs to be present before a library is loaded, but it's equally normal (and, I think, often more desirable in practice) to load the library first and then set the GUCs. Well, it's pretty much never the case that the library is loaded before postgresql.conf GUCs, right? A changed postgresql.conf is the only exception I can see. Neither is it the normal case for session|local_preload_libraries. Not even when GUCs are loaded via pg_db_role_setting or the startup packet... Generally, I think that libraries ought to be loaded as early as possible, because they may install hooks that change the way other stuff works, and should therefore be loaded before that other stuff happens. 
While that may be desirable I don't really see a reason for this to be treated differently for worker processes than the majority of cases otherwise. Anyway, I think this is a relatively minor issue. * Should ParallelMain maybe enter the parallel context before loading user defined libraries? It's far from absurd to execute code touching the database on library initialization... It's hard to judge without specific examples. What kinds of things do they do? I've seen code filling lookup caches and creating system objects (including tables and extensions). Are they counting on a transaction being active? I would have thought that was a no-no, since there are many instances in which it won't be true. Also, you might have just gotten loaded because a function stored in your library was called, so you could be in a transaction that's busy doing something else, or deep in a subtransaction stack, etc. It seems unsafe to do very much more than a few syscache lookups here, even if there does happen to be a transaction active. The only reason I'd like it to be active is because that'd *prohibit* doing the crazier stuff. There seems little reason to not do it under the additional protection against crazy things that'd give us? * I think restoring snapshots needs to fudge the worker's PGXACT->xmin to be the minimum of the top transaction id and the snapshots.
Re: [HACKERS] parallel mode and parallel contexts
On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote: Observations: * Some trailing whitespace in the readme. Very nice otherwise. Fixed. Thanks. * Don't like CreateParallelContextForExtension much as a function name - I don't think we normally equate the fact that code is located in a loadable library with the extension mechanism. Suggestions for a better name? CreateParallelContextForLoadableFunction? * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's that about? Gee, maybe I should have added a comment so I'd remember. If the buffer size isn't MAXALIGN'd, that would be really bad, because shm_mq_create() assumes that it's being given an aligned address. Maybe I should add an Assert() there. If it is MAXALIGN'd but not BUFFERALIGN'd, we might waste a few bytes of space, since shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I don't think anything will actually break. Not sure if that's worth an assert or not. * the plain shm calculations intentionally use mul_size/add_size to deal with overflows. On 32bit it doesn't seem impossible, but unlikely, to overflow size_t. Yes, I do that here too, though as with the plain shm calculations, only in the estimate functions. The functions that actually serialize stuff don't have to worry about overflow because it's already been checked. * I'd s/very // in We might be running in a very short-lived memory context.. Or replace it with too. Removed very. * In +LaunchParallelWorkers, does it make sense trying to start workers if one failed? ISTM that's likely to not be helpful. I.e. it should just break; after the first failure. It can't just break, because clearing pcxt->worker[i].error_mqh is an essential step. I could add a flag variable that tracks whether any registrations have failed and change the if statement to if (!any_registrations_failed && RegisterDynamicBackgroundWorker(worker, &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that but decided it was very unlikely to affect the real-world performance of anything, so easier just to keep the code simple. But I'll change it if you want. * +WaitForParallelWorkersToFinish says that it waits for workers to exit cleanly. To me that's ambiguous. How about fully? I've removed the word cleanly and added a comment to more fully explain the danger: + * Even if the parallel operation seems to have completed successfully, it's + * important to call this function afterwards. We must not miss any errors + * the workers may have thrown during the parallel operation, or any that they + * may yet throw while shutting down. * ParallelMain restores libraries before GUC state. Given that libraries can, and actually somewhat frequently do, inspect GUCs on load, it seems better to do it the other way round? You argue We want to do this before restoring GUCs, because the libraries might define custom variables., but I don't buy that. It's completely normal for namespaced GUCs to be present before a library is loaded? Especially as you then go and process_session_preload_libraries() after setting the GUCs. This is a good question, but the answer is not entirely clear to me. I'm thinking I should probably just remove process_session_preload_libraries() altogether at this point. That was added at a time when RestoreLibraryState() didn't exist yet, and I think RestoreLibraryState() is a far superior way of handling this, because surely the list of libraries that the parallel leader *actually had loaded* is more definitive than any GUC. 
Now, that doesn't answer the question about whether we should load libraries first or GUCs. I agree that the comment's reasoning is bogus, but I'm not sure I understand why you think the other order is better. It *is* normal for namespaced GUCs to be present before a library is loaded, but it's equally normal (and, I think, often more desirable in practice) to load the library first and then set the GUCs. Generally, I think that libraries ought to be loaded as early as possible, because they may install hooks that change the way other stuff works, and should therefore be loaded before that other stuff happens. * Should ParallelMain maybe enter the parallel context before loading user defined libraries? It's far from absurd to execute code touching the database on library initialization... It's hard to judge without specific examples. What kinds of things do they do? Are they counting on a transaction being active? I would have thought that was a no-no, since there are many instances in which it won't be true. Also, you might have just gotten loaded because a function stored in your library was called, so you could be in a transaction that's busy doing something else, or deep in a subtransaction stack, etc. It seems unsafe to do very much more than a few syscache lookups here, even if there does happen to be a transaction active. *
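For reference, the idiom under discussion: add_size() and mul_size() elog rather than silently wrapping on size_t overflow, so estimate functions can chain them safely. A sketch of estimating per-worker error-queue space (the layout is invented for illustration):

Size
estimate_error_queue_space(int nworkers)
{
	Size		size = 0;

	/* overflow-checked arithmetic; errors out instead of wrapping */
	size = add_size(size, mul_size((Size) nworkers,
								   PARALLEL_ERROR_QUEUE_SIZE));

	/* shm_toc_allocate() hands out BUFFERALIGN'd chunks anyway */
	return BUFFERALIGN(size);
}

As noted above, only the estimate side needs this; by the time the data is actually serialized, the sizes have already been validated.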
Re: [HACKERS] parallel mode and parallel contexts
Hi, On 2015-02-06 22:43:21 -0500, Robert Haas wrote: Here's v4, with that fixed and a few more tweaks. If you attached files generated with 'git format-patch' I could directly apply them with the commit message and such. All at once if it's multiple patches, as individual commits. On nontrivial patches it's nice to see the commit message(s) along the diff(s). Observations: * Some trailing whitespace in the readme. Very nice otherwise. * Don't like CreateParallelContextForExtension much as a function name - I don't think we normally equate the fact that code is located in a loadable library with the extension mechanism. * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's that about? * the plain shm calculations intentionally use mul_size/add_size to deal with overflows. On 32bit it doesn't seem impossible, but unlikely, to overflow size_t. * I'd s/very // in We might be running in a very short-lived memory context.. Or replace it with too. * In +LaunchParallelWorkers, does it make sense trying to start workers if one failed? ISTM that's likely to not be helpful. I.e. it should just break; after the first failure. * +WaitForParallelWorkersToFinish says that it waits for workers to exit cleanly. To me that's ambiguous. How about fully? * ParallelMain restores libraries before GUC state. Given that libraries can, and actually somewhat frequently do, inspect GUCs on load, it seems better to do it the other way round? You argue We want to do this before restoring GUCs, because the libraries might define custom variables., but I don't buy that. It's completely normal for namespaced GUCs to be present before a library is loaded? Especially as you then go and process_session_preload_libraries() after setting the GUCs. * Should ParallelMain maybe enter the parallel context before loading user defined libraries? It's far from absurd to execute code touching the database on library initialization... * rename ParallelMain to ParallelWorkerMain? * I think restoring snapshots needs to fudge the worker's PGXACT->xmin to be the minimum of the top transaction id and the snapshots. Otherwise if the initiating process dies badly (e.g. because postmaster died) the workers will continue to work, while other processes may remove things. Also, you don't seem to prohibit popping the active snapshot (should that be prohibited maybe?) which might bump the initiator's xmin horizon. * Comment about 'signal_handler' in +HandleParallelMessageInterrupt is outdated. * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally not sure it's wise to use this faux FE/BE protocol here... * HandlingParallelMessages looks dead. * ParallelErrorContext has the wrong comment. * ParallelErrorContext() provides the worker's pid in the context message. I guess that's so it's easier to interpret when sent to the initiator? It'll look odd when logged by the failing process. * We now have pretty much the same handle_sigterm in a bunch of places. Can't we get rid of those? You actually probably can just use die(). * The comments in xact.c above XactTopTransactionId pretty much assume that the reader knows that that is about parallel stuff. * I'm a bit confused by the fact that Start/CommitTransactionCommand() emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be pretty much impossible to hit a ReleaseSavepoint() with a parallel transaction in progress? 
We'd have had to end the previous transaction command while parallel stuff was in progress - right? (Internal/ReleaseCurrentSubTransaction is different, that's used in code) * Why are you deviating from the sorting order used in other places for ParallelCurrentXids? That seems really weird, especially as we use something else a couple lines down. Especially as you actually seem to send stuff in xidComparator order? * I don't think skipping AtEOXact_Namespace() entirely if parallel is a good idea. That function already does some other stuff than cleaning up temp tables. I think you should pass in parallel and do the skip in there. * Start/EndParallelWorkerTransaction assert the current state, whereas the rest of the file FATALs in that case. I think it'd actually be good to be conservative and do the same in this case. * You're manually passing function names to PreventCommandIfParallelMode() in a fair number of cases. I'd either try and keep the names consistent with what the functions are actually called at the SQL level (adding the types in the parens) or just use PG_FUNCNAME_MACRO to keep them correct. * Wait. You now copy all held relation locks as-is to the standby? I quite doubt that's a good idea, and it's not my reading of the conclusions made in the group locking thread. At the very least
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Feb 4, 2015 at 9:40 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Jan 30, 2015 at 12:08 PM, Robert Haas robertmh...@gmail.com wrote: Here's a new version. Andres mentioned previously that he thought it would be a good idea to commit the addition of BackgroundWorkerInitializeConnectionByOid() separately, as he's had cause to want it independently of the rest of this stuff. It would be useful for pg_background, too. So I've broken that out into a separate patch here (bgworker-by-oid.patch) and will commit that RSN unless somebody thinks it's a bad idea for some reason. AFAICS it should be uncontroversial. This is now done. The main patch needed some updating in light of Andres's recent assault on ImmediateInterruptOK (final result: Andres 1-0 IIOK) so here's a new version. I think we should expose the variable ParallelWorkerNumber (or if you don't want to expose it, then at least a GetParallel* function call is required to get its value), as that is needed by external applications wherever they want to allocate something for each worker. Some examples w.r.t. the parallel_seqscan patch: each worker should have a separate tuple queue, and we will probably need it for the implementation of the Explain statement as well. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
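What Amit is asking for, sketched: with a zero-based ParallelWorkerNumber exported, each worker can index into per-worker resources the leader laid out in the shared segment. The toc key and fixed queue size here are invented:

extern int	ParallelWorkerNumber;	/* assumed to be exported by parallel.c */

static shm_mq_handle *
attach_my_tuple_queue(shm_toc *toc, dsm_segment *seg)
{
	/* hypothetical key under which the leader stored the queue array */
	char	   *space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUES);
	shm_mq	   *mq = (shm_mq *) (space +
								 ParallelWorkerNumber * TUPLE_QUEUE_SIZE);

	shm_mq_set_sender(mq, MyProc);
	return shm_mq_attach(mq, seg, NULL);
}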
Re: [HACKERS] parallel mode and parallel contexts
On 1/30/15 11:08 AM, Robert Haas wrote: The final patch attached here (parallel-dummy-v2.patch) has been updated slightly to incorporate some prefetching logic. It's still just demo code and is not intended for commit. I'm not sure whether the prefetching logic can actually be made to improve performance, either; if anyone feels like playing with that and reporting results back, that would be swell. Wouldn't we want the prefetching to happen after ReadBuffer() instead of before? This way we risk pushing the buffer we actually want out, plus if it's not already available the OS probably won't try and read it until it's read all the prefetch blocks. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
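Concretely, the reordering Jim suggests, using the existing PrefetchBuffer() call; the scan bookkeeping names are invented:

/* Read the block we need right now... */
buf = ReadBuffer(scan->rel, blkno);

/*
 * ...and only then hint the OS about upcoming blocks, so the prefetch
 * I/O can't push out the buffer we're about to use, and the kernel
 * isn't tempted to service the prefetches before the read we need.
 */
for (i = 1; i <= prefetch_distance && blkno + i < scan->nblocks; i++)
	PrefetchBuffer(scan->rel, MAIN_FORKNUM, blkno + i);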
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com wrote: It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever. Assume one of the worker is not able to start (not able to attach to shared memory or some other reason), then status returned by GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED and after that it can wait forever in WaitLatch. I don't think that's right. The status only remains BGWH_NOT_YET_STARTED until the postmaster forks the child process. I think the control flow can reach the above location before postmaster could fork all the workers. Consider a case that we have launched all workers during ExecutorStart phase and then before postmaster could start all workers an error occurs in master backend and then it try to Abort the transaction and destroy the parallel context, at that moment it will get the above status and wait forever in above code. I am able to reproduce above scenario with debugger by using parallel_seqscan patch. Hmm. Well, if you can reproduce it, there clearly must be a bug. But I'm not quite sure where. What should happen in that case is that the process that started the worker has to wait for the postmaster to actually start it, and then after that for the new process to die, and then it should return. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Jan 21, 2015 at 6:35 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com wrote: It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever. Assume one of the worker is not able to start (not able to attach to shared memory or some other reason), then status returned by GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED and after that it can wait forever in WaitLatch. I don't think that's right. The status only remains BGWH_NOT_YET_STARTED until the postmaster forks the child process. I think the control flow can reach the above location before postmaster could fork all the workers. Consider a case that we have launched all workers during ExecutorStart phase and then before postmaster could start all workers an error occurs in master backend and then it try to Abort the transaction and destroy the parallel context, at that moment it will get the above status and wait forever in above code. I am able to reproduce above scenario with debugger by using parallel_seqscan patch. Hmm. Well, if you can reproduce it, there clearly must be a bug. But I'm not quite sure where. What should happen in that case is that the process that started the worker has to wait for the postmaster to actually start it, Okay, I think this should solve the issue, also it should be done before call of TerminateBackgroundWorker(). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
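That ordering, sketched with the existing bgworker API (error handling elided):

pid_t		pid;

/*
 * Wait for the postmaster to actually start the worker first; otherwise
 * its status can remain BGWH_NOT_YET_STARTED and the shutdown wait below
 * would never return.
 */
if (WaitForBackgroundWorkerStartup(handle, &pid) != BGWH_POSTMASTER_DIED)
{
	TerminateBackgroundWorker(handle);
	WaitForBackgroundWorkerShutdown(handle);
}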
Re: [HACKERS] parallel mode and parallel contexts
On Sat, Jan 17, 2015 at 3:40 AM, Robert Haas robertmh...@gmail.com wrote: New patch attached. I'm going to take the risk of calling this v1 (previous versions have been 0.x), since I've now done something about the heavyweight locking issue, as well as fixed the message-looping bug Amit pointed out. It doubtless needs more work, but it's starting to smell a bit more like a real patch. I need some clarification regarding the below code: +BgwHandleStatus +WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle) +{ + BgwHandleStatus status; + int rc; + bool save_set_latch_on_sigusr1; + + save_set_latch_on_sigusr1 = set_latch_on_sigusr1; + set_latch_on_sigusr1 = true; + + PG_TRY(); + { + for (;;) + { + pid_t pid; + + CHECK_FOR_INTERRUPTS(); + + status = GetBackgroundWorkerPid(handle, &pid); + if (status == BGWH_STOPPED) + return status; + + rc = WaitLatch(&MyProc->procLatch, + WL_LATCH_SET | WL_POSTMASTER_DEATH, 0); + + if (rc & WL_POSTMASTER_DEATH) + return BGWH_POSTMASTER_DIED; It seems this code has the possibility to wait forever. Assume one of the workers is not able to start (not able to attach to shared memory or some other reason); then the status returned by GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED and after that it can wait forever in WaitLatch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com wrote: It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever. Assume one of the worker is not able to start (not able to attach to shared memory or some other reason), then status returned by GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED and after that it can wait forever in WaitLatch. I don't think that's right. The status only remains BGWH_NOT_YET_STARTED until the postmaster forks the child process. At that point it immediately changes to BGWH_STARTED. If it starts up and then dies early on, for example because it can't attach to shared memory or somesuch, the status will change to BGWH_STOPPED. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com wrote: It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever. Assume one of the worker is not able to start (not able to attach to shared memory or some other reason), then status returned by GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED and after that it can wait forever in WaitLatch. I don't think that's right. The status only remains BGWH_NOT_YET_STARTED until the postmaster forks the child process. I think the control flow can reach the above location before postmaster could fork all the workers. Consider a case that we have launched all workers during ExecutorStart phase and then before postmaster could start all workers an error occurs in master backend and then it try to Abort the transaction and destroy the parallel context, at that moment it will get the above status and wait forever in above code. I am able to reproduce above scenario with debugger by using parallel_seqscan patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On 2015-01-05 11:27:57 -0500, Robert Haas wrote: On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote: * I wonder if parallel contexts shouldn't be tracked via resowners That is a good question. I confess that I'm somewhat fuzzy about which things should be tracked via the resowner mechanism vs. which things should have their own bespoke bookkeeping. However, the AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(), which makes merging them seem not particularly clean. I'm not sure either. But I think the current location is wrong anyway - during AtEOXact_Parallel() before running user defined queries via AfterTriggerFireDeferred() seems wrong. * combocid.c should error out in parallel mode, as insurance Eh, which function? HeapTupleHeaderGetCmin(), HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work in parallel mode. HeapTupleHeaderAdjustCmax() could Assert(!IsInParallelMode()) but the only calls are in heap_update() and heap_delete() which already have error checks, so putting yet another elog() there seems like overkill. To me it seems like a good idea, but whatever. * I doubt it's a good idea to allow heap_insert, heap_inplace_update, index_insert. I'm not convinced that those will be handled correctly, and relaxing restrictions is easier than adding them. I'm fine with adding checks to heap_insert() and heap_inplace_update(). I'm not sure we really need to add anything to index_insert(); how are we going to get there without hitting some other prohibited operation first? I'm not sure. But it's not that hard to imagine that somebody will start adding codepaths that insert into indexes first. Think upsert. * I'd much rather separate HandleParallelMessageInterrupt() into one variant that just tells the machinery there's an interrupt (called from signal handlers) and one that actually handles them. We shouldn't even consider adding any new code that does allocations, errors and such in signal handlers. That's just a *bad* idea. That's a nice idea, but I didn't invent the way this crap works today. ImmediateInterruptOK gets set to true while performing a lock wait, and we need to be able to respond to errors while in that state. I think there's got to be a better way to handle that than force every asynchronous operation to have to cope with the fact that ImmediateInterruptOK may be set or not set, but as of today that's what you have to do. I personally think it's absolutely unacceptable to add any more of that. That the current mess works is more luck than anything else - and I'm pretty sure there's several bugs in it. But since I realize I can't force you to develop an alternative solution, I tried to implement enough infrastructure to deal with it without too much work. As far as I can see this could relatively easily be implemented on top of the removal of ImmediateInterruptOK in the patch series I posted yesterday? Afaics you just need to remove most of +HandleParallelMessageInterrupt() and replace it by a single SetLatch(). * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd much rather place a struct there and be careful about not using pointers. That also would obviate the need to reserve some ids. I don't see how that's going to work with variable-size data structures, and a bunch of the things that we need to serialize are variable-size. Meh. Just appending the variable data to the end of the structure and calculating offsets works just fine. Moreover, I'm not really interested in rehashing this design again. 
I know it's not your favorite; you've said that before. Well, if I keep having to read code using it, I'll keep being annoyed by it. I guess we both have to live with that. Greetings, Andres Freund
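For contrast, the shape Andres is gesturing at: one fixed struct at a known address, with the variable-length pieces appended after it and addressed by offsets. Every name here is invented for illustration:

typedef struct ParallelSharedState
{
	int			nworkers;
	Size		guc_state_offset;	/* offsets from the start of this struct */
	Size		combocid_offset;
	Size		snapshot_offset;
	char		data[FLEXIBLE_ARRAY_MEMBER];
} ParallelSharedState;

#define PSS_PTR(pss, off)  ((char *) (pss) + (off))

A worker would then do something like RestoreGUCState(PSS_PTR(pss, pss->guc_state_offset)); the tradeoff against shm_toc is exactly the one argued here - no reserved keys or lookups, but a rigid, hand-maintained layout.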
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Jan 15, 2015 at 9:09 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila amit.kapil...@gmail.com wrote: +HandleParallelMessages(void) +{ .. .. + for (i = 0; i < pcxt->nworkers; ++i) + { + /* + * Read messages for as long as we have an error queue; if we + * have hit (or hit while reading) ReadyForQuery, this will go to + * NULL. + */ + while (pcxt->worker[i].error_mqh != NULL) + { + shm_mq_result res; + + CHECK_FOR_INTERRUPTS(); + + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, + &data, true); + if (res == SHM_MQ_SUCCESS) Here we are checking the error queue for all the workers, and this loop will continue until all have sent the ReadyForQuery() message ('Z'), which will make this loop continue till all workers have finished their work. Assume a situation where the first worker has completed its work and sent the 'Z' message and the second worker is still sending some tuples; now the above code will keep on waiting for the 'Z' message from the second worker and won't allow receiving tuples sent by the second worker till it sends the 'Z' message. As each worker sends its own 'Z' message after completion, ideally the above code should receive the message only for the worker which has sent it. I think for that it needs information about which worker has sent the message. Are you talking about HandleParallelMessages() or WaitForParallelWorkersToFinish()? The former doesn't wait for anything; it just handles any messages that are available now. I am talking about HandleParallelMessages(). It doesn't wait, but it is looping, which will make it run for a longer time as explained above. Just imagine a case where there are two workers and the first worker has sent the 'Z' message and the second worker is doing some work; in such a scenario the loop will not finish until the second worker also sends the 'Z' message or an error. Am I missing something? Blah. You're right. I intended to write this loop so that it only runs until shm_mq_receive() returns SHM_MQ_WOULD_BLOCK. But that's not what I did. Will fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
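The fix Robert describes presumably ends up shaped like this: drain each queue only until shm_mq_receive() reports SHM_MQ_WOULD_BLOCK, so one still-busy worker can't stall handling of the others (HandleParallelMessage() as the per-message dispatcher is an assumption):

for (i = 0; i < pcxt->nworkers; ++i)
{
	while (pcxt->worker[i].error_mqh != NULL)
	{
		Size		nbytes;
		void	   *data;
		shm_mq_result res;

		res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
							 &data, true);		/* nowait */
		if (res == SHM_MQ_WOULD_BLOCK)
			break;			/* nothing more from this worker for now */
		else if (res == SHM_MQ_SUCCESS)
			HandleParallelMessage(pcxt, i, data, nbytes);	/* hypothetical */
		else
			pcxt->worker[i].error_mqh = NULL;	/* worker detached */
	}
}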
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila amit.kapil...@gmail.com wrote: + seg = dsm_attach(DatumGetInt32(main_arg)); Here, I think DatumGetUInt32() needs to be used instead of DatumGetInt32() as the segment handle is uint32. OK, I'll change that in the next version. No issues, I have another question related to the below code: +HandleParallelMessages(void) +{ .. .. + for (i = 0; i < pcxt->nworkers; ++i) + { + /* + * Read messages for as long as we have an error queue; if we + * have hit (or hit while reading) ReadyForQuery, this will go to + * NULL. + */ + while (pcxt->worker[i].error_mqh != NULL) + { + shm_mq_result res; + + CHECK_FOR_INTERRUPTS(); + + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, + &data, true); + if (res == SHM_MQ_SUCCESS) Here we are checking the error queue for all the workers, and this loop will continue until all have sent the ReadyForQuery() message ('Z'), which will make this loop continue till all workers have finished their work. Assume a situation where the first worker has completed its work and sent the 'Z' message and the second worker is still sending some tuples; now the above code will keep on waiting for the 'Z' message from the second worker and won't allow receiving tuples sent by the second worker till it sends the 'Z' message. As each worker sends its own 'Z' message after completion, ideally the above code should receive the message only for the worker which has sent it. I think for that it needs information about which worker has sent the message. Are you talking about HandleParallelMessages() or WaitForParallelWorkersToFinish()? The former doesn't wait for anything; it just handles any messages that are available now. The latter does wait for all workers to finish, but the intention is that you only call it when you're ready to wind up the entire parallel operation, so that's OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila amit.kapil...@gmail.com wrote: + seg = dsm_attach(DatumGetInt32(main_arg)); Here, I think DatumGetUInt32() needs to be used instead of DatumGetInt32() as the segment handle is uint32. OK, I'll change that in the next version. No issues, I have another question related to the below code: +HandleParallelMessages(void) +{ .. .. + for (i = 0; i < pcxt->nworkers; ++i) + { + /* + * Read messages for as long as we have an error queue; if we + * have hit (or hit while reading) ReadyForQuery, this will go to + * NULL. + */ + while (pcxt->worker[i].error_mqh != NULL) + { + shm_mq_result res; + + CHECK_FOR_INTERRUPTS(); + + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, + &data, true); + if (res == SHM_MQ_SUCCESS) Here we are checking the error queue for all the workers, and this loop will continue until all have sent the ReadyForQuery() message ('Z'), which will make this loop continue till all workers have finished their work. Assume a situation where the first worker has completed its work and sent the 'Z' message and the second worker is still sending some tuples; now the above code will keep on waiting for the 'Z' message from the second worker and won't allow receiving tuples sent by the second worker till it sends the 'Z' message. As each worker sends its own 'Z' message after completion, ideally the above code should receive the message only for the worker which has sent it. I think for that it needs information about which worker has sent the message. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila amit.kapil...@gmail.com wrote: +HandleParallelMessages(void) +{ .. .. + for (i = 0; i < pcxt->nworkers; ++i) + { + /* + * Read messages for as long as we have an error queue; if we + * have hit (or hit while reading) ReadyForQuery, this will go to + * NULL. + */ + while (pcxt->worker[i].error_mqh != NULL) + { + shm_mq_result res; + + CHECK_FOR_INTERRUPTS(); + + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, + &data, true); + if (res == SHM_MQ_SUCCESS) Here we are checking the error queue for all the workers, and this loop will continue until all have sent the ReadyForQuery() message ('Z'), which will make this loop continue till all workers have finished their work. Assume a situation where the first worker has completed its work and sent the 'Z' message and the second worker is still sending some tuples; now the above code will keep on waiting for the 'Z' message from the second worker and won't allow receiving tuples sent by the second worker till it sends the 'Z' message. As each worker sends its own 'Z' message after completion, ideally the above code should receive the message only for the worker which has sent it. I think for that it needs information about which worker has sent the message. Are you talking about HandleParallelMessages() or WaitForParallelWorkersToFinish()? The former doesn't wait for anything; it just handles any messages that are available now. I am talking about HandleParallelMessages(). It doesn't wait, but it is looping, which will make it run for a longer time as explained above. Just imagine a case where there are two workers and the first worker has sent the 'Z' message and the second worker is doing some work; in such a scenario the loop will not finish until the second worker also sends the 'Z' message or an error. Am I missing something? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Jan 7, 2015 at 3:54 PM, Jim Nasby jim.na...@bluetreble.com wrote: Agreed, I was more concerned with calls to nextval(), which don't seem to be prevented in parallel mode? It looks prevented: /* * Forbid this during parallel operation because, to make it work, * the cooperating backends would need to share the backend-local cached * sequence information. Currently, we don't support that. */ PreventCommandIfParallelMode("nextval()"); I was more thinking about all the add-on pl's like pl/ruby. But yeah, I don't see that there's much we can do if they're not using SPI. Actually, there is: it's forbidden at the layer of heap_insert(), heap_update(), heap_delete(). It's hard to imagine anyone trying to modify the database at a lower level than that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila amit.kapil...@gmail.com wrote: + seg = dsm_attach(DatumGetInt32(main_arg)); Here, I think DatumGetUInt32() needs to be used instead of DatumGetInt32() as the segment handle is uint32. OK, I'll change that in the next version. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Jan 7, 2015 at 11:03 PM, Robert Haas robertmh...@gmail.com wrote: I have little doubt that this version is still afflicted with various bugs, and the heavyweight locking issue remains to be dealt with, but on the whole I think this is headed in the right direction. +ParallelMain(Datum main_arg) { .. + /* + * Now that we have a resource owner, we can attach to the dynamic + * shared memory segment and read the table of contents. + */ + seg = dsm_attach(DatumGetInt32(main_arg)); Here, I think DatumGetUInt32() needs to be used instead of DatumGetInt32() as the segment handle is uint32. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
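For reference, the matching pair: the dsm handle is a uint32, so it should travel through the unsigned Datum macros on both ends.

/* Leader, when filling in the BackgroundWorker struct: */
worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(seg));

/* Worker entrypoint, per Amit's correction: */
seg = dsm_attach(DatumGetUInt32(main_arg));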
Re: [HACKERS] parallel mode and parallel contexts
On 6 January 2015 at 21:37, Simon Riggs si...@2ndquadrant.com wrote: I get it now and agree Yes, very much. Should we copy both the top-level frame and the current subxid? Hot Standby links subxids directly to the top-level, so this would be similar. If we copied both, we wouldn't need to special case the Get functions. It would still be O(1). but please work some more on clarity of README, comments and variable naming! Something other than top sounds good. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs si...@2ndquadrant.com wrote: So when you say Only the top frame of the transaction state stack is copied you don't mean the top, you mean the bottom (the latest subxact)? Which then becomes the top in the parallel worker? OK... The item most recently added to the stack is properly called the top, but I guess it's confusing in this case because the item on the bottom of the stack is referred to as the TopTransaction. I'll see if I can rephrase that. Those comments really belong in a README, not the first visible comment in xact.c OK. You need to start with the explanation that parallel workers have a faked-up xact stack to make it easier to copy and manage. That is valid because we never change xact state during a worker operation. OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] parallel mode and parallel contexts
On 1/7/15, 9:39 AM, Robert Haas wrote: sequence.c: Is it safe to read a sequence value in a parallel worker if the cache_value is 1? No, because the sequence cache isn't synchronized between the workers. Maybe it would be safe if cache_value == 1, but there's not much use-case: how often are you going to have a read-only query that uses a sequence value. At some point we can look at making sequences parallel-safe, but worrying about it right now doesn't seem like a good use of time. Agreed, I was more concerned with calls to nextval(), which don't seem to be prevented in parallel mode? This may be a dumb question, but for functions do we know that all pl's besides C and SQL use SPI? If not I think they could end up writing in a worker. Well, the lower-level checks would catch that. But it is generally true that there's no way to prevent arbitrary C code from doing things that are unsafe in parallel mode and that we can't tell are unsafe. As I've said before, I think that we'll need to have a method of labeling functions as parallel-safe or not, but this patch isn't trying to solve that part of the problem. I was more thinking about all the add-on pl's like pl/ruby. But yeah, I don't see that there's much we can do if they're not using SPI. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Jan 6, 2015 at 9:37 PM, Jim Nasby jim.na...@bluetreble.com wrote: CreateParallelContext(): Does it actually make sense to have nworkers=0? ISTM that's a bogus case. I'm not sure whether we'll ever use the zero-worker case in production, but I've certainly found it useful for performance-testing. Also, since the number of workers will normally be determined dynamically by the planner, should that check be a regular conditional instead of just an Assert? I don't think that's really necessary. It shouldn't be too much of a stretch for the planner to come up with a non-negative value. In LaunchParallelWorkers() the Start Workers comment states that we give up registering more workers if one fails to register, but there's nothing in the if condition to do that, and I don't see RegisterDynamicBackgroundWorker() doing it either. Is the comment just incorrect? Woops, that got changed at some point and I forgot to update the comment. Will fix. SerializeTransactionState(): instead of looping through the transaction stack to calculate nxids, couldn't we just set it to maxsize - sizeof(TransactionId) * 3? If we're looping a second time for safety, a comment explaining that would be useful... Yeah, I guess we could do that. I don't think it really matters very much one way or the other. sequence.c: Is it safe to read a sequence value in a parallel worker if the cache_value is 1? No, because the sequence cache isn't synchronized between the workers. Maybe it would be safe if cache_value == 1, but there's not much use-case: how often are you going to have a read-only query that uses a sequence value? At some point we can look at making sequences parallel-safe, but worrying about it right now doesn't seem like a good use of time. This may be a dumb question, but for functions do we know that all pl's besides C and SQL use SPI? If not, I think they could end up writing in a worker. Well, the lower-level checks would catch that. But it is generally true that there's no way to prevent arbitrary C code from doing things that are unsafe in parallel mode and that we can't tell are unsafe. As I've said before, I think that we'll need to have a method of labeling functions as parallel-safe or not, but this patch isn't trying to solve that part of the problem.

@@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
 					 errmsg("canceling statement due to user request")));
 		}
 	}
-	/* If we get here, do nothing (probably, QueryCancelPending was reset) */
+	if (ParallelMessagePending)
+		HandleParallelMessageInterrupt(false);

ISTM it'd be good to leave that comment in place (after the if). RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be &&? AIUI both should always be either set or 0. Fixed. Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a parallel operation"); Fixed. The comment for RestoreSnapshot refers to set_transaction_snapshot, but that doesn't actually exist (or isn't referenced). Fixed. Will post a new version in a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
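Something like the following is presumably what Jim has in mind -- a sketch of the revised hunk, not the committed code:

	if (ParallelMessagePending)
		HandleParallelMessageInterrupt(false);

	/* If we get here, do nothing (probably, QueryCancelPending was reset) */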
Re: [HACKERS] parallel mode and parallel contexts
On 7 January 2015 at 13:11, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs si...@2ndquadrant.com wrote: So when you say "Only the top frame of the transaction state stack is copied" you don't mean the top, you mean the bottom (the latest subxact)? Which then becomes the top in the parallel worker? OK... The item most recently added to the stack is properly called the top, but I guess it's confusing in this case because the item on the bottom of the stack is referred to as the TopTransaction. I'll see if I can rephrase that. Yes, I didn't mean to suggest the confusion was introduced by you - it's just that you stepped into the mess by correctly using the word "top" in a place where its meaning would be opposite to the close-by usage of TopTransaction. Anyway, feeling good about this now. Thanks for your patience. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] parallel mode and parallel contexts
On 6 January 2015 at 16:33, Robert Haas robertmh...@gmail.com wrote: These comments don't have any explanation or justification OK, I rewrote them. Hopefully it's better now. Thanks for the new version and answers. * Transaction stuff strikes me as overcomplicated and error prone. Given there is no explanation or justification for most of it, I'd be inclined to question why it's there Gosh, I was pretty pleased with how simple the transaction integration turned out to be. Most of what's there right now is either (a) code to copy state from the master to the parallel workers or (b) code to throw errors if the workers try to do things that aren't safe. I suspect there are a few things missing, but I don't see anything there that looks unnecessary. If you can explain it in more detail in comments and README, I may agree. At present, I don't get it and it makes me nervous. The comment says "Only the top frame of the transaction state stack is copied to a parallel worker" but I'm not sure why. Top meaning the current subxact or the main xact? If main, why do we need XactTopTransactionId? This is very impressive, but it looks like we're trying to lift too much weight on the first lift. If we want to go anywhere near this, we need to have very clear documentation about how and why it's like that. I'm actually very sorry to say that because the review started very well, much much better than most. When I posted the group locking patch, it got criticized because it didn't actually do anything useful by itself; similarly, the pg_background patch was criticized for not being a large enough step toward parallelism. So, this time, I posted something more comprehensive. I don't think it's quite complete yet. I expect a committable version of this patch to be maybe another 500-1000 lines over what I have here right now -- I think it needs to do something about heavyweight locking, and I expect that there are some unsafe things that aren't quite prohibited yet. But the current patch is only 2300 lines, which is not astonishingly large for a feature of this magnitude; if anything, I'd say it's surprisingly small, due to a year and a half of effort laying the necessary groundwork via a long series of preliminary commits. I'm not unwilling to divide this up some more if we can agree on a way to do that that makes sense, but I think we're nearing the point where we need to take the plunge and say, look, this is version one of parallelism. Thunk. I want this also; the only debate is where to draw the line, and please don't see that as criticism. I'm very happy it's so short; I agree it could be longer. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] parallel mode and parallel contexts
On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs si...@2ndquadrant.com wrote: If you can explain it in more detail in comments and README, I may agree. At present, I don't get it and it makes me nervous. The comment says "Only the top frame of the transaction state stack is copied to a parallel worker" but I'm not sure why. Top meaning the current subxact or the main xact? If main, why do we need XactTopTransactionId? Current subxact. I initially thought of copying the entire TransactionStateData stack, but Heikki suggested (via IM) that I do it this way instead. I believe his concern was that it's never valid to commit or roll back to a subtransaction that is not at the top of the stack, and if you don't copy the stack, you avoid the risk of somehow ending up in that state. Also, you avoid having to invent resource owners for (sub)transactions that don't really exist in the current process. On the other hand, you do end up with a few special cases that wouldn't exist with the other approach. Still, I'm pretty happy to have taken Heikki's advice: it was certainly simple to implement this way, plus hopefully that way at least one person likes what I ended up with. :-) What else needs clarification? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
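To make the single-frame approach concrete, here is a sketch of roughly what the serialized transaction state might carry -- the struct and field names are assumptions for illustration, not taken from the patch:

	typedef struct SerializedTransactionState
	{
		int			xactIsoLevel;
		bool		xactDeferrable;
		TransactionId topTransactionId;			/* top-level XID, if any */
		TransactionId currentTransactionId;		/* XID of the one copied frame */
		CommandId	currentCommandId;
		int			nParallelCurrentXids;		/* extra XIDs to treat as "ours" */
		TransactionId parallelCurrentXids[FLEXIBLE_ARRAY_MEMBER];
	} SerializedTransactionState;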
Re: [HACKERS] parallel mode and parallel contexts
On 6 January 2015 at 21:01, Robert Haas robertmh...@gmail.com wrote: On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs si...@2ndquadrant.com wrote: If you can explain it in more detail in comments and README, I may agree. At present, I don't get it and it makes me nervous. The comment says "Only the top frame of the transaction state stack is copied to a parallel worker" but I'm not sure why. Top meaning the current subxact or the main xact? If main, why do we need XactTopTransactionId? Current subxact. TopTransactionStateData points to the top-level transaction data, but XactTopTransactionId points to the current subxact. So when you say "Only the top frame of the transaction state stack is copied" you don't mean the top, you mean the bottom (the latest subxact)? Which then becomes the top in the parallel worker? OK... I initially thought of copying the entire TransactionStateData stack, but Heikki suggested (via IM) that I do it this way instead. I believe his concern was that it's never valid to commit or roll back to a subtransaction that is not at the top of the stack, and if you don't copy the stack, you avoid the risk of somehow ending up in that state. Also, you avoid having to invent resource owners for (sub)transactions that don't really exist in the current process. On the other hand, you do end up with a few special cases that wouldn't exist with the other approach. Still, I'm pretty happy to have taken Heikki's advice: it was certainly simple to implement this way, plus hopefully that way at least one person likes what I ended up with. :-) What else needs clarification? Those comments really belong in a README, not the first visible comment in xact.c. You need to start with the explanation that parallel workers have a faked-up xact stack to make it easier to copy and manage. That is valid because we never change xact state during a worker operation. I get it now and agree, but please work some more on clarity of README, comments and variable naming! -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] parallel mode and parallel contexts
On 1/6/15, 10:33 AM, Robert Haas wrote: Entrypints? Already noted by Andres; fixed in the attached version. Perhaps we only parallelize while drinking beer... ;) CreateParallelContext(): Does it actually make sense to have nworkers=0? ISTM that's a bogus case. Also, since the number of workers will normally be determined dynamically by the planner, should that check be a regular conditional instead of just an Assert? In LaunchParallelWorkers() the Start Workers comment states that we give up registering more workers if one fails to register, but there's nothing in the if condition to do that, and I don't see RegisterDynamicBackgroundWorker() doing it either. Is the comment just incorrect? SerializeTransactionState(): instead of looping through the transaction stack to calculate nxids, couldn't we just set it to maxsize - sizeof(TransactionId) * 3? If we're looping a second time for safety, a comment explaining that would be useful... sequence.c: Is it safe to read a sequence value in a parallel worker if the cache_value is 1? This may be a dumb question, but for functions do we know that all pl's besides C and SQL use SPI? If not, I think they could end up writing in a worker.

@@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
 					 errmsg("canceling statement due to user request")));
 		}
 	}
-	/* If we get here, do nothing (probably, QueryCancelPending was reset) */
+	if (ParallelMessagePending)
+		HandleParallelMessageInterrupt(false);

ISTM it'd be good to leave that comment in place (after the if). RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be &&? AIUI both should always be either set or 0. Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a parallel operation"); The comment for RestoreSnapshot refers to set_transaction_snapshot, but that doesn't actually exist (or isn't referenced). -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] parallel mode and parallel contexts
On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote: A couple remarks: * Shouldn't this provide more infrastructure to deal with the fact that we might get fewer parallel workers than we had planned for? Maybe, but I'm not really sure what that should look like. My working theory is that individual parallel applications (executor nodes, or functions that use parallelism internally, or whatever) should be coded in such a way that they work correctly even if the number of workers that starts is smaller than what they requested, or even zero. It may turn out that this is impractical in some cases, but I don't know what those cases are yet, so I don't know what the common threads will be. I think parallel seq scan should work by establishing N tuple queues (where N is the number of workers we have). When asked to return a tuple, the master should poll all of those queues one after another until it finds one that contains a tuple. If none of them contain a tuple then it should claim a block to scan and return a tuple from that block. That way, if there are enough running workers to keep up with the master's demand for tuples, the master won't do any of the actual scan itself. But if the workers can't keep up -- e.g. suppose 90% of the CPU consumed by the query is in the filter qual for the scan -- then the master can pitch in along with everyone else. As a non-trivial fringe benefit, if the workers don't all start, or take a while to initialize, the user backend isn't stalled meanwhile. (See the sketch at the end of this message.) * I wonder if parallel contexts shouldn't be tracked via resowners That is a good question. I confess that I'm somewhat fuzzy about which things should be tracked via the resowner mechanism vs. which things should have their own bespoke bookkeeping. However, the AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(), which makes merging them seem not particularly clean. * combocid.c should error out in parallel mode, as insurance Eh, which function? HeapTupleHeaderGetCmin(), HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work in parallel mode. HeapTupleHeaderAdjustCmax() could Assert(!IsInParallelMode()), but the only calls are in heap_update() and heap_delete(), which already have error checks, so putting yet another elog() there seems like overkill. * I doubt it's a good idea to allow heap_insert, heap_inplace_update, index_insert. I'm not convinced that those will be handled correctly, and relaxing restrictions is easier than adding them. I'm fine with adding checks to heap_insert() and heap_inplace_update(). I'm not sure we really need to add anything to index_insert(); how are we going to get there without hitting some other prohibited operation first? * I'd much rather separate HandleParallelMessageInterrupt() into one variant that just tells the machinery there's an interrupt (called from signal handlers) and one that actually handles them. We shouldn't even consider adding any new code that does allocations, errors and such in signal handlers. That's just a *bad* idea. That's a nice idea, but I didn't invent the way this crap works today. ImmediateInterruptOK gets set to true while performing a lock wait, and we need to be able to respond to errors while in that state. I think there's got to be a better way to handle that than force every asynchronous operation to have to cope with the fact that ImmediateInterruptOK may be set or not set, but as of today that's what you have to do. * I'm not a fan of the shm_toc stuff... Verbose and hard to read.
I'd much rather place a struct there and be careful about not using pointers. That also would obviate the need to reserve some ids. I don't see how that's going to work with variable-size data structures, and a bunch of the things that we need to serialize are variable-size. Moreover, I'm not really interested in rehashing this design again. I know it's not your favorite; you've said that before. But it makes it possible to write code to do useful things in parallel, a capability that we do not otherwise have. And I like it fine. * Doesn't the restriction in GetSerializableTransactionSnapshotInt() apply for repeatable read just the same? Yeah. I'm not sure whether we really need that check at all, because there is a check in GetTransactionSnapshot() that probably checks the same thing. * I'm not a fan of relying on GetComboCommandId() to restore things in the same order as before. I thought that was a little wonky, but it's not likely to break, and there is an elog(ERROR) there to catch it if it does, so I'd rather not make it more complicated. * I'd say go ahead and commit the BgworkerByOid thing earlier/independently. I've seen need for that a couple times. Woohoo! I was hoping someone would say that. * s/entrypints/entrypoints/ Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
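A sketch of the polling scheme described above. The queue type and the helper functions here are hypothetical stand-ins, not the patch's API; the point is only the control flow: drain the worker queues first, and scan a block in the master only when no tuple is ready.

	typedef struct TupleQueue TupleQueue;	/* one per worker, single-reader */

	extern HeapTuple tq_try_read(TupleQueue *q);	/* non-blocking; NULL if empty */
	extern HeapTuple master_scan_next_block(void);	/* claim and scan one block */
	extern bool all_workers_finished(TupleQueue **qs, int n);

	static HeapTuple
	parallel_scan_next(TupleQueue **queues, int nqueues)
	{
		for (;;)
		{
			int			i;
			HeapTuple	tup;

			/* Prefer worker output, so full queues drain and workers stay busy. */
			for (i = 0; i < nqueues; i++)
			{
				tup = tq_try_read(queues[i]);
				if (tup != NULL)
					return tup;
			}

			/* Nothing queued: the master pitches in on the scan itself. */
			tup = master_scan_next_block();
			if (tup != NULL)
				return tup;

			if (all_workers_finished(queues, nqueues))
				return NULL;	/* scan complete */
		}
	}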
Re: [HACKERS] parallel mode and parallel contexts
Robert Haas robertmh...@gmail.com writes: On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote: * I wonder if parallel contexts shouldn't be tracked via resowners That is a good question. I confess that I'm somewhat fuzzy about which things should be tracked via the resowner mechanism vs. which things should have their own bespoke bookkeeping. However, the AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(), which makes merging them seem not particularly clean. FWIW, the resowner mechanism was never meant as a substitute for bespoke bookkeeping. What it is is a helper mechanism to reduce the need for PG_TRY blocks that guarantee that a resource-releasing function will be called even in error paths. I'm not sure whether that analogy applies well in parallel-execution cases. regards, tom lane
Re: [HACKERS] parallel mode and parallel contexts
On 22 December 2014 at 19:14, Robert Haas robertmh...@gmail.com wrote: Here is another new version, with lots of bugs fixed. An initial blind review, independent of other comments already made on the thread. OK src/backend/access/heap/heapam.c heapam.c prohibitions on update and delete look fine OK src/backend/access/transam/Makefile OK src/backend/access/transam/README.parallel README.parallel and all concepts look good PARTIAL src/backend/access/transam/parallel.c wait_patiently mentioned in comment but doesn't exist Why not make nworkers into a uint? Trampoline? Really? I think we should define what we mean by that, somewhere, rather than just use the term as if it was self-evident. Entrypints? .. OK src/backend/access/transam/varsup.c QUESTIONS src/backend/access/transam/xact.c These comments don't have any explanation or justification

+ * This stores XID of the toplevel transaction to which we belong.
+ * Normally, it's the same as TopTransactionState.transactionId, but in
+ * a parallel worker it may not be.

+ * If we're a parallel worker, there may be additional XIDs that we need
+ * to regard as part of our transaction even though they don't appear
+ * in the transaction state stack.

This comment is copied-and-pasted too many times for safety and elegance

+	/*
+	 * Workers synchronize transaction state at the beginning of each parallel
+	 * operation, so we can't account for transaction state change after that
+	 * point.  (Note that this check will certainly error out if s->blockState
+	 * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
+	 * below.)
+	 */

We need a single Check() that contains more detail and comments within. Major comments * Parallel stuff at start sounded OK * Transaction stuff strikes me as overcomplicated and error prone. Given there is no explanation or justification for most of it, I'd be inclined to question why it's there * If we have a trampoline for DSM, why don't we use a trampoline for snapshots? Then you wouldn't need to do all this serialize stuff. This is very impressive, but it looks like we're trying to lift too much weight on the first lift. If we want to go anywhere near this, we need to have very clear documentation about how and why it's like that. I'm actually very sorry to say that because the review started very well, much much better than most. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] parallel mode and parallel contexts
On Mon, Jan 5, 2015 at 9:57 PM, Robert Haas robertmh...@gmail.com wrote: On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote: * Doesn't the restriction in GetSerializableTransactionSnapshotInt() apply for repeatable read just the same? Yeah. I'm not sure whether we really need that check at all, because there is a check in GetTransactionSnapshot() that probably checks the same thing. The check in GetSerializableTransactionSnapshotInt() will also prohibit any user/caller of SetSerializableTransactionSnapshot() in parallel mode as that can't be prohibited by GetTransactionSnapshot(). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On Fri, Jan 2, 2015 at 9:04 AM, Amit Kapila amit.kapil...@gmail.com wrote: While working on parallel seq-scan patch to adapt this framework, I noticed few things and have questions regarding the same. 1. Currently parallel worker just attaches to error queue, for tuple queue do you expect it to be done in the same place or in the caller supplied function, if later then we need segment address as input to that function to attach queue to the segment (shm_mq_attach()). Another question I have in this regard is that if we have redirected messages to error queue by using pq_redirect_to_shm_mq, then how can we set tuple queue for the same purpose. Similarly I think more handling is needed for tuple queue in master backend and the answer to above will dictate what is the best way to do it. I've come to the conclusion that it's a bad idea for tuples to be sent through the same queue as errors. We want errors (or notices, but especially errors) to be processed promptly, but there may be a considerable delay in processing tuples. For example, imagine a plan that looks like this:

Nested Loop
  -> Parallel Seq Scan on p
  -> Index Scan on q
        Index Cond: q.x = p.x

The parallel workers should fill up the tuple queues used by the parallel seq scan so that the master doesn't have to do any of that work itself. Therefore, the normal situation will be that those tuple queues are all full. If an error occurs in a worker at that point, it can't add it to the tuple queue, because the tuple queue is full. But even if it could do that, the master then won't notice the error until it reads all of the queued-up tuple messages that are in the queue in front of the error, and maybe some messages from the other queues as well, since it probably round-robins between the queues or something like that. Basically, it could do a lot of extra work before noticing that error in there. Now we could avoid that by having the master read messages from the queue immediately and just save them off to local storage if they aren't error messages. But that's not very desirable either, because now we have no flow-control. The workers will just keep spamming tuples that the master isn't ready for into the queues, and the master will keep reading them and saving them to local storage, and eventually it will run out of memory and die. We could engineer some solution to this problem, of course, but it seems quite a bit simpler to just have two queues. The error queues don't need to be very big (I made them 16kB, which is trivial on any system on which you care about having working parallelism) and the tuple queues can be sized as needed to avoid pipeline stalls. 2. Currently there is no interface for wait_for_workers_to_become_ready() in your patch; don't you think it is important that before start of fetching tuples, we should make sure all workers are started? What if some worker fails to start? I think that, in general, getting the most benefit out of parallelism means *avoiding* situations where backends have to wait for each other. If the relation being scanned is not too large, the user backend might be able to finish the whole scan - or a significant fraction of it - before the workers initialize. Of course in that case it might have been a bad idea to parallelize in the first place, but we should still try to make the best of the situation. If some worker fails to start, then instead of having the full degree N parallelism we were hoping for, we have some degree K < N, so things will take a little longer, but everything should still work.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
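A sketch of the two-queue-per-worker layout described above, using the existing shm_mq API (storage/shm_mq.h). The 16kB error-queue figure comes from the message; the tuple-queue size, the function, and the *_space placement arguments are assumptions:

	#define PARALLEL_ERROR_QUEUE_SIZE	16384	/* errors are rare and small */
	#define PARALLEL_TUPLE_QUEUE_SIZE	65536	/* sized to avoid pipeline stalls */

	static void
	setup_worker_queues(char *error_space, char *tuple_space,
						shm_mq **error_mqp, shm_mq **tuple_mqp)
	{
		/* A small queue suffices for errors, which must be seen promptly. */
		*error_mqp = shm_mq_create(error_space, PARALLEL_ERROR_QUEUE_SIZE);

		/*
		 * The tuple queue is larger so workers rarely stall on a full queue,
		 * yet its bounded size still gives flow control when the master lags.
		 */
		*tuple_mqp = shm_mq_create(tuple_space, PARALLEL_TUPLE_QUEUE_SIZE);
	}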
Re: [HACKERS] parallel mode and parallel contexts
On 2014-12-22 14:14:31 -0500, Robert Haas wrote: On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas robertmh...@gmail.com wrote: And here is a new version. Here is another new version, with lots of bugs fixed. A couple remarks: * Shouldn't this provide more infrastructure to deal with the fact that we might get fewer parallel workers than we had planned for? * I wonder if parallel contexts shouldn't be tracked via resowners * combocid.c should error out in parallel mode, as insurance * I doubt it's a good idea to allow heap_insert, heap_inplace_update, index_insert. I'm not convinced that those will be handled correctly, and relaxing restrictions is easier than adding them. * I'd much rather separate HandleParallelMessageInterrupt() into one variant that just tells the machinery there's an interrupt (called from signal handlers) and one that actually handles them. We shouldn't even consider adding any new code that does allocations, errors and such in signal handlers. That's just a *bad* idea. * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd much rather place a struct there and be careful about not using pointers. That also would obviate the need to reserve some ids. * Doesn't the restriction in GetSerializableTransactionSnapshotInt() apply for repeatable read just the same? * I'm not a fan of relying on GetComboCommandId() to restore things in the same order as before. * I'd say go ahead and commit the BgworkerByOid thing earlier/independently. I've seen need for that a couple times. * s/entrypints/entrypoints/ Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
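The split being suggested would look roughly like this (a sketch with assumed names, not the patch's code): the signal handler only records that a message is pending and wakes the process, while parsing and error reporting happen later, outside signal context.

	static volatile sig_atomic_t ParallelMessagePending = false;

	static void
	handle_parallel_message(SIGNAL_ARGS)
	{
		ParallelMessagePending = true;
		SetLatch(&MyProc->procLatch);	/* wake up at the next safe point */
	}

	/* Called from a safe point such as ProcessInterrupts(), where
	 * palloc() and ereport() are legal. */
	void
	HandleParallelMessages(void)
	{
		if (!ParallelMessagePending)
			return;
		ParallelMessagePending = false;
		/* ... read and act on each worker's message queue here ... */
	}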
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Dec 18, 2014 at 1:23 AM, Robert Haas robertmh...@gmail.com wrote: In the meantime, I had a good chat with Heikki on IM yesterday which gave me some new ideas on how to fix up the transaction handling in here, which I am working on implementing. So hopefully I will have that by then. I am also hoping Amit will be adapting his parallel seq-scan patch to this framework. While working on parallel seq-scan patch to adapt this framework, I noticed few things and have questions regarding the same. 1. Currently parallel worker just attaches to error queue, for tuple queue do you expect it to be done in the same place or in the caller supplied function, if later then we need segment address as input to that function to attach queue to the segment (shm_mq_attach()). Another question I have in this regard is that if we have redirected messages to error queue by using pq_redirect_to_shm_mq, then how can we set tuple queue for the same purpose. Similarly I think more handling is needed for tuple queue in master backend and the answer to above will dictate what is the best way to do it. 2. Currently there is no interface for wait_for_workers_to_become_ready() in your patch; don't you think it is important that before start of fetching tuples, we should make sure all workers are started? What if some worker fails to start? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Dec 18, 2014 at 4:53 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs si...@2ndquadrant.com wrote: On 12 December 2014 at 22:52, Robert Haas robertmh...@gmail.com wrote: I would be remiss if I failed to mention that this patch includes work by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as well as my former colleague Noah Misch; and that it would not have been possible without the patient support of EnterpriseDB management. Noted and thanks to all. I will make this my priority for review, but regrettably not until New Year, about 2.5-3 weeks away. OK! In the meantime, I had a good chat with Heikki on IM yesterday which gave me some new ideas on how to fix up the transaction handling in here, which I am working on implementing. So hopefully I will have that by then. I am also hoping Amit will be adapting his parallel seq-scan patch to this framework. Simon, if you are planning to review this patch soon, could you add your name as a reviewer for this patch in the commit fest app? Thanks, -- Michael
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Dec 17, 2014 at 2:53 PM, Robert Haas robertmh...@gmail.com wrote: In the meantime, I had a good chat with Heikki on IM yesterday which gave me some new ideas on how to fix up the transaction handling in here, which I am working on implementing. So hopefully I will have that by then. And here is a new version. This version has some real integration with the transaction system, along the lines of what Heikki suggested to me, so hopefully tuple visibility calculations in a parallel worker will now produce the right answers, though I need to integrate this with code to actually do something-or-other in parallel in order to really test that. There are still some problems with parallel worker shutdown. As hard as I tried to fight it, I'm gradually resigning myself to the fact that we're probably going to have to set things up so that the master waits for all of its children to die during abort processing (and during commit processing, but that's not bothersome). Otherwise, to take just one example, chaos potentially ensues if you run a parallel query in a subtransaction and then roll back to a savepoint. But this version just kills the workers and doesn't actually wait for them to die; I'll see about fixing that, but wanted to send this around for comments first. (A sketch of what that wait might look like follows the patch below.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/contrib/parallel_dummy/Makefile b/contrib/parallel_dummy/Makefile
new file mode 100644
index 000..de00f50
--- /dev/null
+++ b/contrib/parallel_dummy/Makefile
@@ -0,0 +1,19 @@
+MODULE_big = parallel_dummy
+OBJS = parallel_dummy.o $(WIN32RES)
+PGFILEDESC = "parallel_dummy - dummy use of parallel infrastructure"
+
+EXTENSION = parallel_dummy
+DATA = parallel_dummy--1.0.sql
+
+REGRESS = parallel_dummy
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/parallel_dummy
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/parallel_dummy/parallel_dummy--1.0.sql b/contrib/parallel_dummy/parallel_dummy--1.0.sql
new file mode 100644
index 000..2a7251c
--- /dev/null
+++ b/contrib/parallel_dummy/parallel_dummy--1.0.sql
@@ -0,0 +1,7 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION parallel_dummy" to load this file. \quit
+
+CREATE FUNCTION parallel_dummy(sleep_time pg_catalog.int4,
+                               nworkers pg_catalog.int4)
+    RETURNS pg_catalog.void STRICT
+    AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/parallel_dummy/parallel_dummy.c b/contrib/parallel_dummy/parallel_dummy.c
new file mode 100644
index 000..99b830f
--- /dev/null
+++ b/contrib/parallel_dummy/parallel_dummy.c
@@ -0,0 +1,82 @@
+/*--------------------------------------------------------------------------
+ *
+ * parallel_dummy.c
+ *		Test harness code for parallel mode code.
+ *
+ * Copyright (C) 2013-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/parallel_dummy/parallel_dummy.c
+ *
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "access/xact.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(parallel_dummy);
+
+#define PARALLEL_DUMMY_KEY		1
+
+typedef struct {
+	int32		sleep_time;
+} ParallelDummyInfo;
+
+void		_PG_init(void);
+void		worker_main(shm_toc *toc);
+
+Datum
+parallel_dummy(PG_FUNCTION_ARGS)
+{
+	int32		sleep_time = PG_GETARG_INT32(0);
+	int32		nworkers = PG_GETARG_INT32(1);
+	bool		already_in_parallel_mode = IsInParallelMode();
+	ParallelContext *pcxt;
+	ParallelDummyInfo *info;
+
+	if (nworkers < 1)
+		ereport(ERROR,
+				(errmsg("number of parallel workers must be positive")));
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContextForExtension("parallel_dummy", "worker_main",
+											 nworkers);
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(ParallelDummyInfo));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	InitializeParallelDSM(pcxt);
+	info = shm_toc_allocate(pcxt->toc, sizeof(ParallelDummyInfo));
+	info->sleep_time = sleep_time;
+	shm_toc_insert(pcxt->toc, PARALLEL_DUMMY_KEY, info);
+	LaunchParallelWorkers(pcxt);
+
+	/* here's where we do the real work ... */
+	DirectFunctionCall1(pg_sleep, Float8GetDatum((float8) sleep_time));
+
+	DestroyParallelContext(pcxt);
+
+	if (!already_in_parallel_mode)
+		ExitParallelMode();
+
+	PG_RETURN_VOID();
+}
+
+void
+worker_main(shm_toc *toc)
+{
+	ParallelDummyInfo *info;
+
+	info = shm_toc_lookup(toc, PARALLEL_DUMMY_KEY);
+	Assert(info != NULL);
+
+	/* here's where we do the real work ... */
+	DirectFunctionCall1(pg_sleep, Float8GetDatum((float8) info->sleep_time));
+}
diff --git a/contrib/parallel_dummy/parallel_dummy.control b/contrib/parallel_dummy/parallel_dummy.control
new file
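On the wait-for-shutdown problem mentioned above, one possible shape for the missing logic -- a sketch only: the ParallelContext field names are assumed, while WaitForBackgroundWorkerShutdown() is the existing bgworker API:

	static void
	WaitForParallelWorkersToExit(ParallelContext *pcxt)
	{
		int			i;

		for (i = 0; i < pcxt->nworkers; ++i)
		{
			BgwHandleStatus status;

			if (pcxt->worker[i].bgwhandle == NULL)
				continue;		/* this worker never registered */

			status = WaitForBackgroundWorkerShutdown(pcxt->worker[i].bgwhandle);
			if (status == BGWH_POSTMASTER_DIED)
				ereport(FATAL,
						(errcode(ERRCODE_ADMIN_SHUTDOWN),
						 errmsg("postmaster exited during a parallel transaction")));
		}
	}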
Re: [HACKERS] parallel mode and parallel contexts
On 12 December 2014 at 22:52, Robert Haas robertmh...@gmail.com wrote: I would be remiss if I failed to mention that this patch includes work by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as well as my former colleague Noah Misch; and that it would not have been possible without the patient support of EnterpriseDB management. Noted and thanks to all. I will make this my priority for review, but regrettably not until New Year, about 2.5-3 weeks away. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] parallel mode and parallel contexts
On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs si...@2ndquadrant.com wrote: On 12 December 2014 at 22:52, Robert Haas robertmh...@gmail.com wrote: I would be remiss if I failed to mention that this patch includes work by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as well as my former colleague Noah Misch; and that it would not have been possible without the patient support of EnterpriseDB management. Noted and thanks to all. I will make this my priority for review, but regrettably not until New Year, about 2.5-3 weeks away. OK! In the meantime, I had a good chat with Heikki on IM yesterday which gave me some new ideas on how to fix up the transaction handling in here, which I am working on implementing. So hopefully I will have that by then. I am also hoping Amit will be adapting his parallel seq-scan patch to this framework. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company