Re: [HACKERS] parallel mode and parallel contexts

2015-05-07 Thread Noah Misch
On Tue, May 05, 2015 at 01:05:38PM -0400, Robert Haas wrote:
 On Sat, May 2, 2015 at 12:35 PM, Noah Misch n...@leadboat.com wrote:
  Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
  model it with queue-nonempty and queue-nonfull events, one pair per queue.

That comment of mine was junk; it suggested a data structure containing the
same worker list it purported to remove.  Oops.

 Each individual queue has only a single reader and a single writer.
 In your example, there would be three queues, not just one.  Of
 course, one could design a new shared memory data structure
 representing a collection of related queues, but I still don't see
 exactly how we get around the problem of that requiring
 O(MaxBackends^2) storage space.

If each backend waits on a uniformly-distributed 50% of other backends,
tracking that wait graph indeed requires O(MaxBackends^2) space.  Backends
completing useful work in the field will have far-sparser wait graphs, and
that should inform the choice of data structure:

On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:
 When I first started thinking about how to fix this, I said, well,
 that's no big deal, we can just advertise the whole list of processes
 that we're waiting for in shared memory, rather than just the one.

That boiled down to representing the wait graph with an adjacency list, which
sounds like an efficient choice.
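
As an illustration of that choice, here is a minimal standalone sketch (all
names invented; this is not code from the patch) of a wait graph kept as
adjacency lists, where space grows with the number of actual wait edges
rather than with MaxBackends^2:

/* Sparse wait-for graph as adjacency lists: O(V + E) space, so a backend
 * waiting on a handful of others costs a few list nodes, not a
 * MaxBackends-sized row. */
#include <stdio.h>
#include <stdlib.h>

typedef struct WaitEdge
{
    int             waits_for;      /* index of the awaited backend */
    struct WaitEdge *next;
} WaitEdge;

typedef struct WaitGraph
{
    int        nbackends;
    WaitEdge **out;                 /* out[i]: list of backends that i waits for */
} WaitGraph;

static WaitGraph *
wait_graph_create(int nbackends)
{
    WaitGraph *g = malloc(sizeof(WaitGraph));

    g->nbackends = nbackends;
    g->out = calloc(nbackends, sizeof(WaitEdge *));
    return g;
}

static void
wait_graph_add_edge(WaitGraph *g, int waiter, int holder)
{
    WaitEdge *e = malloc(sizeof(WaitEdge));

    e->waits_for = holder;
    e->next = g->out[waiter];
    g->out[waiter] = e;
}

int
main(void)
{
    /* Master (0) waits on workers 1..3; nobody else waits on anything. */
    WaitGraph *g = wait_graph_create(8);

    wait_graph_add_edge(g, 0, 1);
    wait_graph_add_edge(g, 0, 2);
    wait_graph_add_edge(g, 0, 3);

    for (int i = 0; i < g->nbackends; i++)
    {
        WaitEdge *e;

        for (e = g->out[i]; e != NULL; e = e->next)
            printf("backend %d waits for backend %d\n", i, e->waits_for);
    }
    return 0;
}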

  M
  subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it.  M
  publishes to queue0-nonfull, and the workers subscribe.  An edge forms in the
  deadlock detector's graph when all of an event's subscribers wait on it.  (One
  can approximate that in userspace with a pair of advisory locks.)
 
 An edge between which two processes?

I hadn't de-fuzzed my thinking that far yet.  If you still need the answer
after this message, let me know, and I'll work on that.  As a guess, I think
it's three edges M-W1, M-W2 and M-W3.

  After thinking about it a bit more, I realized that even if we settle
  on some solution to that problem, there's another issue: the
  wait-edges created by this system don't really have the same semantics
  as regular lock waits.  Suppose A is waiting on a lock held by B and B
  is waiting for a lock held by A; that's a deadlock.  But if A is
  waiting for B to write to a tuple queue and B is waiting for A to read
  from a tuple queue, that's not a deadlock if the queues in question
  are the same.
 
  I do see a deadlock.  B wants to block until A reads from the queue, so it
  advertises that and sleeps.  Waking up deadlock_timeout later, it notices that
  A is waiting for B to write something.  B will not spontaneously suspend
  waiting to write to the queue, nor will A suspend waiting to read from the
  queue.  Thus, the deadlock is valid.  This assumes the deadlock detector
  reasons from an authoritative picture of pending waits and that we reliably
  wake up a process when the condition it sought has arrived.  We have that for
  heavyweight lock acquisitions.  The assumption may be incompatible with
  Andres's hope, quoted above, of avoiding the full lock acquisition procedure.
 
 I don't understand.  What's typically going to happen here is that the
 queue will initially be empty, and A will block.  Suppose B has a
 large message to write to the queue, let's say 100x the queue size.
 It will write the first 1% of the message into the queue, set A's
 latch, and go to sleep.  A will now wake up, drain the queue, set B's
 latch, and go to sleep.  B will then wake up, write the next chunk of
 the message, set A's latch again, and go to sleep.  They'll go back
 and forth like this until the entire message has been transmitted.
 There's no deadlock here: everything is working as designed.  But
 there may be instants where both A and B are waiting, because (for
 example) A may set B's latch and finish going to sleep before B gets
 around to waking up.

I see what you had in mind.  The problem showed up in the last sentence;
namely, the hand-off from process to process is not atomic like it is for
heavyweight locks.  That's exactly what I was (not too clearly) getting at
when I wrote, This assumes ... above.  For the sake of illustration, suppose
you replace your queue in this algorithm with a heavyweight lock and a small
buffer.  Initially, B holds the lock and A waits for the lock.  The following
event sequence repeats until the transfer is done:

B fills the buffer
B releases the lock, granting it to A during LockRelease()
B starts waiting for the lock
A empties the buffer
A releases the lock, granting it to B during LockRelease()
A starts waiting for the lock

The deadlock detector will rightly never report a deadlock.  To have the same
safety in your example, it's not enough for one process to set the latch of
another process.  The process must update something in static shared memory to
indicate that the waited-for condition (queue-nonempty for A, 

Re: [HACKERS] parallel mode and parallel contexts

2015-05-07 Thread Robert Haas
On Thu, May 7, 2015 at 3:09 AM, Noah Misch n...@leadboat.com wrote:
 Each individual queue has only a single reader and a single writer.
 In your example, there would be three queues, not just one.  Of
 course, one could design a new shared memory data structure
 representing a collection of related queues, but I still don't see
 exactly how we get around the problem of that requiring
 O(MaxBackends^2) storage space.

 If each backend waits on a uniformly-distributed 50% of other backends,
 tracking that wait graph indeed requires O(MaxBackends^2) space.  Backends
 completing useful work in the field will have far-sparser wait graphs, and
 that should inform the choice of data structure:

We can, of course, underallocate and hope for the best.

  M
  subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it.  M
  publishes to queue0-nonfull, and the workers subscribe.  An edge forms in the
  deadlock detector's graph when all of an event's subscribers wait on it.  (One
  can approximate that in userspace with a pair of advisory locks.)

 An edge between which two processes?

 I hadn't de-fuzzed my thinking that far yet.  If you still need the answer
 after this message, let me know, and I'll work on that.  As a guess, I think
 it's three edges M-W1, M-W2 and M-W3.

I think it's an edge where one end is M and the other end is fuzzily
diffused across W1, W2, and W3, because we only need one of them to
wake up. There's no way to represent such a thing in the deadlock
detector today.  We could invent one, of course, but it sounds
complicated.  Also, I suspect deadlock checking in this world reduces
to http://en.wikipedia.org/wiki/Boolean_satisfiability_problem and is
therefore NP-complete.

 I see what you had in mind.  The problem showed up in the last sentence;
 namely, the hand-off from process to process is not atomic like it is for
 heavyweight locks.  That's exactly what I was (not too clearly) getting at
 when I wrote, This assumes ... above.  For the sake of illustration, suppose
 you replace your queue in this algorithm with a heavyweight lock and a small
 buffer.  Initially, B holds the lock and A waits for the lock.  The following
 event sequence repeats until the transfer is done:

 B fills the buffer
 B releases the lock, granting it to A during LockRelease()
 B starts waiting for the lock
 A empties the buffer
 A releases the lock, granting it to B during LockRelease()
 A starts waiting for the lock

 The deadlock detector will rightly never report a deadlock.  To have the same
 safety in your example, it's not enough for one process to set the latch of
 another process.  The process must update something in static shared memory to
 indicate that the waited-for condition (queue-nonempty for A, queue-nonfull
 for B) is now satisfied.  You need such a unit of state anyway, don't you?

Yes.  It's the number of bytes read and written, and it's stored in
dynamic shared memory.  You can move some of the state to the main
shared memory segment, but I don't look forward to the performance
consequences of a lock acquire & release cycle every time the queue
fills or drains.  Of course, there is also a loss of flexibility
there: a big reason for developing this stuff in the first place was
to avoid being constrained by the main shared memory segment.
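
To make that unit of state concrete, here is a toy standalone sketch (invented
names, not the actual shm_mq layout) of a single-reader/single-writer ring
whose only shared state is the two byte counters; each side can derive "empty"
and "full" locally from them:

/* Toy single-producer/single-consumer ring; the shared state is just the
 * two byte counts.  A real implementation also needs memory barriers and
 * latch-style wakeups, which are omitted here. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define RING_SIZE 8192

typedef struct ToyQueue
{
    uint64_t bytes_written;     /* advanced only by the writer */
    uint64_t bytes_read;        /* advanced only by the reader */
    char     data[RING_SIZE];
} ToyQueue;

static bool
queue_empty(const ToyQueue *q)  /* the reader's "nonempty" wait condition */
{
    return q->bytes_written == q->bytes_read;
}

static bool
queue_full(const ToyQueue *q)   /* the writer's "nonfull" wait condition */
{
    return q->bytes_written - q->bytes_read == RING_SIZE;
}

int
main(void)
{
    ToyQueue q = {0};

    q.bytes_written += 10;      /* writer appends 10 bytes */
    printf("empty=%d full=%d\n", queue_empty(&q), queue_full(&q));
    q.bytes_read += 10;         /* reader drains them */
    printf("empty=%d full=%d\n", queue_empty(&q), queue_full(&q));
    return 0;
}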

 The number of held LWLocks is always zero when entering user-defined code and
 again when exiting user-defined code.  Therefore, the possible LWLock wait
 graphs are known at compile time, and we could prove that none contain a
 deadlock.  Therefore, we scarcely miss having an LWLock deadlock detector.
 That does not map to the tuple queue behavior at hand, because we hope to
 allow the queue's writer to run user-defined code while the queue's reader is
 waiting.  My term user-defined code does come with some hand-waving.  A
 superuser can cause an undetected deadlock via a C-language hook.  We would
 not call that a PostgreSQL bug, though the hook is literally user-defined.
 Let's keep the deadlock detector able to identify every deadlock a
 non-superuser can cause.  That includes deadlocks arising from heavyweight
 lock acquisition in parallel-safe functions.

I'm in agreement with that goal.

 I think where this project went off the rails is when we made the
 decision as to how to fix the problems in the original group locking
 approach.  In the original concept, locks between processes in the
 same parallel group just didn't ever conflict; two
 otherwise-conflicting locks held by different backends within the same
 group were regarded as those processes sharing that lock.  Andres and
 possibly others pointed out that for stuff like relation locks, that's

 I think you mean relation extension locks.

Yes, thanks.

 probably not the right behavior.  I agree with that.  It was also
 suggested that this would be less scary if we limited ourselves to
 sharing the locks held at the beginning of parallelism.  That had the
 further advantage of 

Re: [HACKERS] parallel mode and parallel contexts

2015-05-05 Thread Robert Haas
On Sat, May 2, 2015 at 12:35 PM, Noah Misch n...@leadboat.com wrote:
 Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
 model it with queue-nonempty and queue-nonfull events, one pair per queue.

Each individual queue has only a single reader and a single writer.
In your example, there would be three queues, not just one.  Of
course, one could design a new shared memory data structure
representing a collection of related queues, but I still don't see
exactly how we get around the problem of that requiring
O(MaxBackends^2) storage space.

 M
 subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it.  M
 publishes to queue0-nonfull, and the workers subscribe.  An edge forms in the
 deadlock detector's graph when all of an event's subscribers wait on it.  (One
 can approximate that in userspace with a pair of advisory locks.)

An edge between which two processes?

 After thinking about it a bit more, I realized that even if we settle
 on some solution to that problem, there's another issue: the
 wait-edges created by this system don't really have the same semantics
 as regular lock waits.  Suppose A is waiting on a lock held by B and B
 is waiting for a lock held by A; that's a deadlock.  But if A is
 waiting for B to write to a tuple queue and B is waiting for A to read
 from a tuple queue, that's not a deadlock if the queues in question
 are the same.

 I do see a deadlock.  B wants to block until A reads from the queue, so it
 advertises that and sleeps.  Waking up deadlock_timeout later, it notices that
 A is waiting for B to write something.  B will not spontaneously suspend
 waiting to write to the queue, nor will A suspend waiting to read from the
 queue.  Thus, the deadlock is valid.  This assumes the deadlock detector
 reasons from an authoritative picture of pending waits and that we reliably
 wake up a process when the condition it sought has arrived.  We have that for
 heavyweight lock acquisitions.  The assumption may be incompatible with
 Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

I don't understand.  What's typically going to happen here is that the
queue will initially be empty, and A will block.  Suppose B has a
large message to write to the queue, let's say 100x the queue size.
It will write the first 1% of the message into the queue, set A's
latch, and go to sleep.  A will now wake up, drain the queue, set B's
latch, and go to sleep.  B will then wake up, write the next chunk of
the message, set A's latch again, and go to sleep.  They'll go back
and forth like this until the entire message has been transmitted.
There's no deadlock here: everything is working as designed.  But
there may be instants where both A and B are waiting, because (for
example) A may set B's latch and finish going to sleep before B gets
around to waking up.

That's probably something that we could patch around, but I think it's
missing the larger point.  When you're dealing with locks, there is
one operation that causes blocking (acquiring a lock) and another
operation that unblocks other processes (releasing a lock).  With
message queues, there are still two operations (reading and writing),
but either of them can either cause you to block yourself, or to
unblock another process.  To put that another way, locks enforce
mutual exclusion; message queues force mutual *inclusion*.  Locks
cause blocking when two processes try to operate on the same object at
the same time; message queues cause blocking *unless* two processes
operate on the same object at the same time.  That difference seems to
me to be quite fundamental.
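
To make the back-and-forth described above concrete, here is a toy standalone
program (pthread condition variables standing in for latches; nothing here is
PostgreSQL code) in which the writer pushes a message 100x the queue size
through a small buffer while reader and writer repeatedly wake each other:

/* Toy demo: the writer fills the small queue in chunks, wakes the reader,
 * and sleeps when the queue is full; the reader drains it, wakes the writer,
 * and sleeps when the queue is empty.  Payload bytes are elided; only the
 * bookkeeping that drives the waits is shown. */
#include <pthread.h>
#include <stdio.h>

#define QUEUE_SIZE   64
#define MESSAGE_SIZE (QUEUE_SIZE * 100)

static size_t used;                 /* bytes currently sitting in the queue */
static size_t total_written;
static size_t total_read;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;  /* "A's latch" */
static pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;   /* "B's latch" */

static void *
writer(void *arg)                   /* process B */
{
    (void) arg;
    pthread_mutex_lock(&lock);
    while (total_written < MESSAGE_SIZE)
    {
        size_t chunk;

        while (used == QUEUE_SIZE)          /* full: sleep until A drains it */
            pthread_cond_wait(&nonfull, &lock);
        chunk = QUEUE_SIZE - used;
        if (chunk > MESSAGE_SIZE - total_written)
            chunk = MESSAGE_SIZE - total_written;
        used += chunk;
        total_written += chunk;
        pthread_cond_signal(&nonempty);     /* "set A's latch" */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *
reader(void *arg)                   /* process A */
{
    (void) arg;
    pthread_mutex_lock(&lock);
    while (total_read < MESSAGE_SIZE)
    {
        while (used == 0)                   /* empty: sleep until B writes */
            pthread_cond_wait(&nonempty, &lock);
        total_read += used;
        used = 0;                           /* drain the queue */
        pthread_cond_signal(&nonfull);      /* "set B's latch" */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

int
main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, reader, NULL);
    pthread_create(&b, NULL, writer, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("transferred %zu bytes through a %d-byte queue\n",
           total_read, QUEUE_SIZE);
    return 0;
}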

 So then
 you have to wonder why we're not solving problem #1, because the
 deadlock was just as certain before we generated the maximum possible
 number of tuples locally as it was afterwards.

 The 1. above reads like a benefit, not a problem.  What might you solve
 about it?

Sorry, that wasn't very clear.  Normally, it is a benefit, but in some
cases, it could result in a long delay in reporting an inevitable
deadlock.

What I mean is: suppose we construct a working mechanism where the
deadlock detector knows about tuple queue waits.  Suppose further that
we have a parallel leader and two workers cooperating to do a task
that takes 300 s and can be parallelized with perfect efficiency so
that, working together, those three processes can get the job done in
just 100 s.  If the leader tries to take a lock held in a conflicting
mode by one of the workers, the worker will fill up the tuple queue -
probably quite quickly - and wait.  We now detect a deadlock.  Our
detection is timely, and life is good.  On the other hand, suppose
that one of the *workers* tries to take a lock held in a conflicting
mode by the other worker or by the leader.  There is no immediate
deadlock, because the leader is not waiting.  Instead, we'll finish
the computation with the remaining worker and the leader, which will
take 150 seconds since we only have 

Re: [HACKERS] parallel mode and parallel contexts

2015-05-02 Thread Noah Misch
On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:
 On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:

[proposal: teach deadlock detector about parallel master waiting on workers]

  deadlock.c is far from simple, and at least I don't find the control
  flow to be particularly clear. So it's not easy.  It'd be advantageous
  to tackle things at that level because it'd avoid the need to acquire
  locks on some lock's waitqueue when blocking; we're going to do that a
  lot.
 
  But it seems to me that it should be possible to succeed: In
  FindLockCycleRecurse(), in the case that we're not waiting for an actual
  lock (checkProc->links.next == NULL) we can add a case that considers
  the 'blocking parallelism' case. ISTM that that's just a
  FindLockCycleRecurse() call on the process that we're waiting for. We'd
  either have to find the other side's locktag for DEADLOCK_INFO or invent
  another field in there; but that seems like a solvable problem.

 1. The master doesn't actually *wait* until the very end of the parallel 
 phase.
 2. When the master does wait, it waits for all of the parallel workers
 at once, not each one individually.
 
 So, I don't think anything as simplistic as teaching a blocking
 shm_mq_receive() to tip off the deadlock detector that we're waiting
 for the process on the other end of that particular queue can ever
 work.  Because of point #2, that never happens.  When I first started
 thinking about how to fix this, I said, well, that's no big deal, we
 can just advertise the whole list of processes that we're waiting for
 in shared memory, rather than just the one.  This is a bit tricky,
 though. Any general API for any backend to advertise that it's waiting
 for an arbitrary subset of the other backends would require O(n^2)
 shared memory state.  That wouldn't be completely insane, but it's not
 great, either.  For this particular case, we could optimize that down
 to O(n) by just chaining all of the children of a given parallel group
 leader in a linked list whose nodes are inlined in their PGPROCs, but that
 doesn't feel very general, because it forecloses the possibility of
 the children ever using that API, and I think they might need to.  If
 nothing else, they might need to advertise that they're blocking on
 the master if they are trying to send data, the queue is full, and
 they have to wait for the master to drain some of it before they can
 proceed.

Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
model it with queue-nonempty and queue-nonfull events, one pair per queue.  M
subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it.  M
publishes to queue0-nonfull, and the workers subscribe.  An edge forms in the
deadlock detector's graph when all of an event's subscribers wait on it.  (One
can approximate that in userspace with a pair of advisory locks.)
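
(To make the shape of that proposal concrete, here is a standalone sketch with
invented names; it shows one possible form of the bookkeeping, not a worked-out
design.)

/* Standalone sketch: bookkeeping for queue-nonempty / queue-nonfull events.
 * Each event records its publishers (who can satisfy it), its subscribers
 * (who may wait for it), and who is currently waiting; under the proposal
 * above, an event contributes edges to the deadlock detector's graph once
 * every subscriber of the event is waiting on it. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PROCS 16

typedef struct WaitEvent
{
    const char *name;                   /* e.g. "queue0-nonempty" */
    bool        publisher[MAX_PROCS];
    bool        subscriber[MAX_PROCS];
    bool        waiting[MAX_PROCS];
} WaitEvent;

static bool
event_forms_edges(const WaitEvent *ev)
{
    for (int i = 0; i < MAX_PROCS; i++)
        if (ev->subscriber[i] && !ev->waiting[i])
            return false;
    return true;
}

int
main(void)
{
    WaitEvent nonempty = {.name = "queue0-nonempty"};

    nonempty.subscriber[0] = true;      /* master M waits for this event */
    nonempty.publisher[1] = true;       /* workers W1..W3 can satisfy it */
    nonempty.publisher[2] = true;
    nonempty.publisher[3] = true;

    nonempty.waiting[0] = true;         /* M blocks on an empty queue */
    printf("%s forms edges: %s\n", nonempty.name,
           event_forms_edges(&nonempty) ? "yes" : "no");
    return 0;
}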

 After thinking about it a bit more, I realized that even if we settle
 on some solution to that problem, there's another issue: the
 wait-edges created by this system don't really have the same semantics
 as regular lock waits.  Suppose A is waiting on a lock held by B and B
 is waiting for a lock held by A; that's a deadlock.  But if A is
 waiting for B to write to a tuple queue and B is waiting for A to read
 from a tuple queue, that's not a deadlock if the queues in question
 are the same.

I do see a deadlock.  B wants to block until A reads from the queue, so it
advertises that and sleeps.  Waking up deadlock_timeout later, it notices that
A is waiting for B to write something.  B will not spontaneously suspend
waiting to write to the queue, nor will A suspend waiting to read from the
queue.  Thus, the deadlock is valid.  This assumes the deadlock detector
reasons from an authoritative picture of pending waits and that we reliably
wake up a process when the condition it sought has arrived.  We have that for
heavyweight lock acquisitions.  The assumption may be incompatible with
Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

 A further difference in the semantics of these wait-edges is that if
 process A is awaiting AccessExclusiveLock on a resource held by B, C,
 and D at AccessShareLock, it needs to wait for all of those processes
 to release their locks before it can do anything else.  On the other
 hand, if process A is awaiting tuples from B, C, and D, it just needs
 ONE of those processes to emit tuples in order to make progress.  Now

Right.

 maybe that doesn't make any difference in practice, because even if
 two of those processes are making lively progress and A is receiving
 tuples from them and processing them like gangbusters, that's probably
 not going to help the third one get unstuck.  If we adopt the approach
 of discounting that possibility, then as long as the parallel leader
 can generate tuples locally (that is, local_scan_done = false) we
 don't report the deadlock, but as soon as it can no 

Re: [HACKERS] parallel mode and parallel contexts

2015-04-30 Thread Robert Haas
On Wed, Apr 29, 2015 at 12:23 PM, Robert Haas robertmh...@gmail.com wrote:
 So, I think it makes sense to split up this patch in two.  There's no
 real debate, AFAICS, about anything in the patch other than the
 heavyweight locking stuff.  So I'd like to go ahead and commit the
 rest.  That's attached here as parallel-mode-v10.patch.

Hearing no objections, done.

Still hoping for some input on the rest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-28 Thread Robert Haas
On Sun, Apr 26, 2015 at 2:03 PM, Euler Taveira eu...@timbira.com.br wrote:
 On 19-03-2015 15:13, Robert Haas wrote:
 On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund and...@2ndquadrant.com wrote:
 Reading the README first, the rest later. So you can comment on my
 comments, while I actually look at the code. Parallelism, yay!

 I'm also looking at this piece of code and found some low-hanging fruit.

Thanks.  0001 and 0003 look good, but 0002 is actually unrelated to this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-28 Thread Robert Haas
On Sun, Apr 26, 2015 at 9:57 PM, Amit Langote
langote_amit...@lab.ntt.co.jp wrote:
 On 2015-04-22 AM 04:14, Robert Haas wrote:
 We should check IsParallelWorker() for operations that are allowed in
 the master during parallel mode, but not allowed in the workers - e.g.
 the master can scan its own temporary relations, but its workers
 can't.  We should check IsInParallelMode() for operations that are
 completely off-limits in parallel mode - i.e. writes.

 We use ereport() where we expect that SQL could hit that check, and
 elog() where we expect that only (buggy?) C code could hit that check.


 By the way, perhaps orthogonal to the basic issue Alvaro and you are
 discussing here - when playing around with (parallel-mode + parallel-safety +
 parallel heapscan + parallel seqscan), I'd observed (a while ago now) that
 if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use a parallel
 scan, the statement as a whole fails because the top-level statement (CTAS) is
 not read-only. So, the only way to make CTAS succeed is to disable seqscan, which
 may require some thought about how to report this to the user. I guess it
 happens because the would-be parallel leader of the scan would also be the one
 doing the DefineRelation() DDL. Apologies if this has been addressed since.

That sounds like a bug in the assess-parallel-safety patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-26 Thread Euler Taveira
On 19-03-2015 15:13, Robert Haas wrote:
 On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund and...@2ndquadrant.com wrote:
 Reading the README first, the rest later. So you can comment on my
 comments, while I actually look at the code. Parallelism, yay!
 
I'm also looking at this piece of code and found some low-hanging fruit.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
From 3ce5376868f61a67540915b83a15c59a31fc895a Mon Sep 17 00:00:00 2001
From: Euler Taveira eu...@timbira.com
Date: Sun, 26 Apr 2015 13:49:37 -0300
Subject: [PATCH 1/3] fix some typos.

---
 src/backend/access/heap/heapam.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index da0b70e..16d8c59 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2250,7 +2250,7 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
 	 * Unlike heap_update() and heap_delete(), an insert should never create
-	 * a combo CID, so it might be possible to relax this restrction, but
+	 * a combo CID, so it might be possible to relax this restriction, but
 	 * not without more thought and testing.
 	 */
 	if (IsInParallelMode())
@@ -2666,7 +2666,7 @@ heap_delete(Relation relation, ItemPointer tid,
 	Assert(ItemPointerIsValid(tid));
 
 	/*
-	 * Forbid this during a parallel operation, lest it allocate a combocid.
+	 * Forbid this during a parallel operation, lets it allocate a combocid.
 	 * Other workers might need that combocid for visibility checks, and we
 	 * have no provision for broadcasting it to them.
 	 */
@@ -3124,7 +3124,7 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	Assert(ItemPointerIsValid(otid));
 
 	/*
-	 * Forbid this during a parallel operation, lest it allocate a combocid.
+	 * Forbid this during a parallel operation, lets it allocate a combocid.
 	 * Other workers might need that combocid for visibility checks, and we
 	 * have no provision for broadcasting it to them.
 	 */
@@ -5435,7 +5435,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	/*
 	 * For now, parallel operations are required to be strictly read-only.
 	 * Unlike a regular update, this should never create a combo CID, so it
-	 * might be possible to relax this restrction, but not without more
+	 * might be possible to relax this restriction, but not without more
 	 * thought and testing.  It's not clear that it would be useful, anyway.
 	 */
 	if (IsInParallelMode())
-- 
2.1.4

From cf25445d9d21496b6927e9ef45e6c3815fef8ad5 Mon Sep 17 00:00:00 2001
From: Euler Taveira eu...@timbira.com
Date: Sun, 26 Apr 2015 14:24:26 -0300
Subject: [PATCH 2/3] Avoid compiler warnings about unused variables in
 assert-disabled builds.

---
 src/backend/utils/fmgr/funcapi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/fmgr/funcapi.c b/src/backend/utils/fmgr/funcapi.c
index ebd7ddd..b9f2b06 100644
--- a/src/backend/utils/fmgr/funcapi.c
+++ b/src/backend/utils/fmgr/funcapi.c
@@ -887,7 +887,7 @@ get_func_trftypes(HeapTuple procTup,
   Oid **p_trftypes)
 {
 
-	Form_pg_proc procStruct = (Form_pg_proc) GETSTRUCT(procTup);
+	Form_pg_proc procStruct PG_USED_FOR_ASSERTS_ONLY = (Form_pg_proc) GETSTRUCT(procTup);
 	Datum		protrftypes;
 	ArrayType  *arr;
 	int			nelems;
-- 
2.1.4

From 7d1716fdf84f24a1dddfa02db27d532e06c92c3d Mon Sep 17 00:00:00 2001
From: Euler Taveira eu...@timbira.com
Date: Sun, 26 Apr 2015 14:52:39 -0300
Subject: [PATCH 3/3] Fix some more typos.

---
 src/backend/access/transam/README.parallel | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/README.parallel b/src/backend/access/transam/README.parallel
index c066fff..2257e4c 100644
--- a/src/backend/access/transam/README.parallel
+++ b/src/backend/access/transam/README.parallel
@@ -76,7 +76,7 @@ Instead, we take a more pragmatic approach. First, we try to make as many of
 the operations that are safe outside of parallel mode work correctly in
 parallel mode as well.  Second, we try to prohibit common unsafe operations
 via suitable error checks.  These checks are intended to catch 100% of
-unsafe things that a user might do from the SQL interface, but code writen
+unsafe things that a user might do from the SQL interface, but code written
 in C can do unsafe things that won't trigger these checks.  The error checks
 are engaged via EnterParallelMode(), which should be called before creating
 a parallel context, and disarmed via ExitParallelMode(), which should be
@@ -108,7 +108,7 @@ worker.  This includes:
   - The combo CID mappings.  This is needed to ensure consistent answers to
 tuple visibility checks.  The need to synchronize this data structure is
 a major reason why 

Re: [HACKERS] parallel mode and parallel contexts

2015-04-26 Thread Amit Langote
On 2015-04-22 AM 04:14, Robert Haas wrote:
 
 We should check IsParallelWorker() for operations that are allowed in
 the master during parallel mode, but not allowed in the workers - e.g.
 the master can scan its own temporary relations, but its workers
 can't.  We should check IsInParallelMode() for operations that are
 completely off-limits in parallel mode - i.e. writes.
 
 We use ereport() where we expect that SQL could hit that check, and
 elog() where we expect that only (buggy?) C code could hit that check.
 

By the way, perhaps orthogonal to the basic issue Alvaro and you are
discussing here - when playing around with (parallel-mode + parallel-safety +
parallel heapscan + parallel seqscan), I'd observed (a while ago now) that
if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use a parallel
scan, the statement as a whole fails because the top-level statement (CTAS) is
not read-only. So, the only way to make CTAS succeed is to disable seqscan, which
may require some thought about how to report this to the user. I guess it
happens because the would-be parallel leader of the scan would also be the one
doing the DefineRelation() DDL. Apologies if this has been addressed since.

Thanks,
Amit





Re: [HACKERS] parallel mode and parallel contexts

2015-04-25 Thread Amit Kapila
On Tue, Apr 21, 2015 at 11:38 PM, Robert Haas robertmh...@gmail.com wrote:


 After thinking about it a bit more, I realized that even if we settle
 on some solution to that problem, there's another issue: the
 wait-edges created by this system don't really have the same semantics
 as regular lock waits.  Suppose A is waiting on a lock held by B and B
 is waiting for a lock held by A; that's a deadlock.  But if A is
 waiting for B to write to a tuple queue and B is waiting for A to read
 from a tuple queue, that's not a deadlock if the queues in question
 are the same.  If they are different queues, it might be a deadlock,
 but then again maybe not.  It may be that A is prepared to accept B's
 message from one queue, and that upon fully receiving it, it will do
 some further work that will lead it to write a tuple into the other
 queue.  If so, we're OK; if not, it's a deadlock.

I agree that the way the deadlock detector works for locks might not
be the same as the way it needs to work for communication buffers (tuple queues).
Here I think we also need to devise some different way to remove resources
from the wait/resource queue; it might not be a good idea to do it the same
way as for locks, because locks are released at transaction end whereas these
new resources (communication buffers) don't operate at transaction boundaries.

 I'm not sure
 whether you'll want to argue that that is an implausible scenario, but
 I'm not too sure it is.  The worker could be saying hey, I need some
 additional piece of your backend-local state in order to finish this
 computation, and the master could then provide it.  I don't have any
 plans like that, but it's been suggested previously by others, so it's
 not an obviously nonsensical thing to want to do.


If such a thing is not possible today, and it also seems that it is not a
good design for solving any problem, then why spend too much effort
trying to find ways to detect deadlocks for it?

 A further difference in the semantics of these wait-edges is that if
 process A is awaiting AccessExclusiveLock on a resource held by B, C,
 and D at AccessShareLock, it needs to wait for all of those processes
 to release their locks before it can do anything else.  On the other
 hand, if process A is awaiting tuples from B, C, and D, it just needs
 ONE of those processes to emit tuples in order to make progress.  Now
 maybe that doesn't make any difference in practice, because even if
 two of those processes are making lively progress and A is receiving
 tuples from them and processing them like gangbusters, that's probably
 not going to help the third one get unstuck.  If we adopt the approach
 of discounting that possibility, then as long as the parallel leader
 can generate tuples locally (that is, local_scan_done = false) we
 don't report the deadlock, but as soon as it can no longer do that
 (local_scan_done = true) then we do, even though we could still
 theoretically read more tuples from the non-stuck workers.  So then
 you have to wonder why we're not solving problem #1, because the
 deadlock was just as certain before we generated the maximum possible
 number of tuples locally as it was afterwards.


I think the deadlock detection code should run any time we have to
wait for more than the deadlock timeout.  Now I think the important point
here is when to actually start waiting: as per the current parallel_seqscan
patch we wait only after checking the tuples from all queues; we could
have done it the other way (wait if we can't fetch from one queue) as well.
It seems both have pros and cons, so we can proceed assuming the current
way is okay and consider changing it in the future once we find some
reason to do so.

Having said that, I feel this is not a problem that we should try to
solve at this point unless there is some scenario where we could hit a
deadlock due to communication buffers.  I think such a solution would
be required for more advanced forms of parallelism (like intra-query parallelism).
By the way, why are we trying to solve this problem; is there any way
we can hit it with parallel sequential scan?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-04-21 Thread Robert Haas
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:
 You'd need some kind of
 API that says pretend I'm waiting for this lock, but don't really
 wait for it, and you'd need to be darn sure that you removed yourself
 from the wait queue again before doing any other heavyweight lock
 manipulation.  Do you have specific thoughts on how to implement this?

 I've thought some about this, and I think it's a bit easier to not do it
 on the actual lock waitqueues, but teach deadlock.c about that kind of
 blocking.

 deadlock.c is far from simple, and at least I don't find the control
 flow to be particularly clear. So it's not easy.  It'd be advantageous
 to tackle things at that level because it'd avoid the need to acquire
 locks on some lock's waitqueue when blocking; we're going to do that a
 lot.

 But it seems to me that it should be possible to succeed: In
 FindLockCycleRecurse(), in the case that we're not waiting for an actual
 lock (checkProc->links.next == NULL) we can add a case that considers
 the 'blocking parallelism' case. ISTM that that's just a
 FindLockCycleRecurse() call on the process that we're waiting for. We'd
 either have to find the other side's locktag for DEADLOCK_INFO or invent
 another field in there; but that seems like a solvable problem.

I (finally) had some time to look at this today.  Initially, it looked
good.  Unfortunately, the longer I looked at it, the less convinced I
got that we could solve the problem this way.  The core of the issue
is that the Funnel node in the parallel group leader basically does
this:

while (!local_scan_done || !remote_scan_done)
{
    attempt a read from each remaining worker's tuple queue, blocking
        only if local_scan_done;
    if (we got a tuple)
        return it;
    else if (there are no remaining workers)
        remote_scan_done = true;

    attempt to produce a tuple just as if we were a worker ourselves;
    if (we got a tuple)
        return it;
    else
        local_scan_done = true;
}

Imagine that we're doing a parallel sequential scan; each worker
claims one page but goes into the tank before it has returned all of
the tuples on that page.  The master reads all the remaining pages but
must now wait for the workers to finish returning the tuples on the
pages they claimed.

So what this means is:

1. The master doesn't actually *wait* until the very end of the parallel phase.
2. When the master does wait, it waits for all of the parallel workers
at once, not each one individually.

So, I don't think anything as simplistic as teaching a blocking
shm_mq_receive() to tip off the deadlock detector that we're waiting
for the process on the other end of that particular queue can ever
work.  Because of point #2, that never happens.  When I first started
thinking about how to fix this, I said, well, that's no big deal, we
can just advertise the whole list of processes that we're waiting for
in shared memory, rather than just the one.  This is a bit tricky,
though. Any general API for any backend to advertise that it's waiting
for an arbitrary subset of the other backends would require O(n^2)
shared memory state.  That wouldn't be completely insane, but it's not
great, either.  For this particular case, we could optimize that down
to O(n) by just chaining all of the children of a given parallel group
leader in a linked list whose nodes are inlined in their PGPROCs, but that
doesn't feel very general, because it forecloses the possibility of
the children ever using that API, and I think they might need to.  If
nothing else, they might need to advertise that they're blocking on
the master if they are trying to send data, the queue is full, and
they have to wait for the master to drain some of it before they can
proceed.
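
To illustrate the O(n) variant, here is a standalone sketch (invented field
names, not the real PGPROC layout) of chaining each worker off its group
leader with list links embedded in the per-backend structs, so the total cost
is one link per backend:

/* Standalone sketch: each backend struct carries one embedded link, so
 * chaining a leader's workers costs O(MaxBackends) space in total rather
 * than a MaxBackends x MaxBackends matrix. */
#include <stdio.h>

typedef struct ToyProc
{
    int             pid;
    struct ToyProc *group_leader;       /* NULL for the leader itself */
    struct ToyProc *next_group_member;  /* singly linked chain of workers */
} ToyProc;

/* The leader walks its chain to enumerate the processes it is waiting for. */
static void
report_leader_waits(const ToyProc *leader)
{
    const ToyProc *w;

    for (w = leader->next_group_member; w != NULL; w = w->next_group_member)
        printf("pid %d waits for pid %d\n", leader->pid, w->pid);
}

int
main(void)
{
    ToyProc leader = {.pid = 100};
    ToyProc w1 = {.pid = 101, .group_leader = &leader};
    ToyProc w2 = {.pid = 102, .group_leader = &leader};
    ToyProc w3 = {.pid = 103, .group_leader = &leader};

    /* Chain the workers off the leader. */
    leader.next_group_member = &w1;
    w1.next_group_member = &w2;
    w2.next_group_member = &w3;

    report_leader_waits(&leader);
    return 0;
}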

After thinking about it a bit more, I realized that even if we settle
on some solution to that problem, there's another issue: the
wait-edges created by this system don't really have the same semantics
as regular lock waits.  Suppose A is waiting on a lock held by B and B
is waiting for a lock held by A; that's a deadlock.  But if A is
waiting for B to write to a tuple queue and B is waiting for A to read
from a tuple queue, that's not a deadlock if the queues in question
are the same.  If they are different queues, it might be a deadlock,
but then again maybe not.  It may be that A is prepared to accept B's
message from one queue, and that upon fully receiving it, it will do
some further work that will lead it to write a tuple into the other
queue.  If so, we're OK; if not, it's a deadlock.  I'm not sure
whether you'll want to argue that that is an implausible scenario, but
I'm not too sure it is.  The worker could be saying hey, I need some
additional piece of your backend-local state in order to finish this
computation, and the master could then provide it.  I don't have any
plans like that, but it's been suggested previously by others, so it's
not an obviously nonsensical thing to want to do.

A further 

Re: [HACKERS] parallel mode and parallel contexts

2015-04-21 Thread Robert Haas
On Mon, Apr 20, 2015 at 6:49 PM, Peter Geoghegan p...@heroku.com wrote:
 I see that you're using git format-patch to generate this. But the
 patch is only patch 1/4. Is that intentional? Where are the other
 pieces?

 I think that the parallel seqscan patch, and the assessing parallel
 safety patch are intended to fit together with this patch, but I can't
 find a place where there is a high level overview explaining just how
 they fit together (I notice Amit's patch has an #include
 access/parallel.h, which is here, but that wasn't trivial to figure
 out). I haven't been paying too much attention to this patch series.

The intended order of application is:

- parallel mode/contexts
- assess parallel safety
- parallel heap scan
- parallel seq scan

The parallel heap scan patch should probably be merged into the
parallel seq scan patch, which should probably then be split into
several patches, one or more with general parallel query
infrastructure and then another with stuff specific to parallel
sequential scan.  But we're not quite there yet.  Until we can get
agreement on how to finalize the parallel mode/contexts patch, I
haven't been too worried about getting the other stuff sorted out
exactly right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-21 Thread Robert Haas
On Tue, Mar 24, 2015 at 11:56 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Well, it's not actually the same message.  They're all a bit
 different.  Or mostly all of them.  And the variable part is not a
 command name, as in the PreventTransactionChain() case, so it would
 affect translatability if we did that.

 Three things that vary are 1) the fact that some check IsParallelWorker()
 and others check IsInParallelMode(), and 2) that some of them are
 ereports() while others are elog(), and 3) that some are ERROR and
 others are FATAL.  Is there a reason for these differences?

The message text also varies, because it states the particular
operation that is prohibited.

We should check IsParallelWorker() for operations that are allowed in
the master during parallel mode, but not allowed in the workers - e.g.
the master can scan its own temporary relations, but its workers
can't.  We should check IsInParallelMode() for operations that are
completely off-limits in parallel mode - i.e. writes.

We use ereport() where we expect that SQL could hit that check, and
elog() where we expect that only (buggy?) C code could hit that check.

We use FATAL for some of the checks in xact.c, for parity with other,
similar checks in xact.c that relate to checking the transaction
state.  I think this is because we assume that if the transaction
state is hosed, it's necessary to terminate the backend to recover.
In other cases, we use ERROR.
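
For anyone skimming the thread, here is a condensed illustration of the
IsParallelWorker() vs. IsInParallelMode() distinction and the ereport() style
described above; the includes, toy function names, and the second message's
wording are illustrative rather than taken from the patch:

#include "postgres.h"

#include "access/parallel.h"
#include "access/xact.h"

/* A write: off-limits anywhere in parallel mode, and reachable from SQL,
 * so it gets an ereport() with a translatable message. */
void
toy_insert_something(void)
{
	if (IsInParallelMode())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot insert tuples during a parallel operation")));
	/* ... perform the insert ... */
}

/* Something the master may do during parallel mode but a worker may not,
 * e.g. touching the leader's temporary relations, so it checks
 * IsParallelWorker() instead. */
void
toy_scan_temp_rel(void)
{
	if (IsParallelWorker())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot access temporary tables from a parallel worker")));
	/* ... scan the temporary relation ... */
}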

 (Note typo restrction in quoted paragraph above.)

Thanks, will fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-04-20 Thread Peter Geoghegan
On Thu, Mar 19, 2015 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:
 Here is yet another version of this patch.  In addition to the fixes
 mentioned above, this version includes some minor rebasing around
 recent commits, and also better handling of the case where we discover
 that we cannot launch workers after all.  This can happen because (1)
 dynamic_shared_memory_type=none, (2) the maximum number of DSM
 segments supported by the system configuration are already in use, or
 (3) the user creates a parallel context with nworkers=0.  In any of
 those cases, the system will now create a backend-private memory
 segment instead of a dynamic shared memory segment, and will skip
 steps that don't need to be done in that case.  This is obviously an
 undesirable scenario.  If we choose a parallel sequential scan, we
 want it to launch workers and really run in parallel.  Hopefully, in
 case (1) or case (3), we will avoid choosing a parallel plan in the
 first place, but case (2) is pretty hard to avoid completely, as we
 have no idea what other processes may or may not be doing with dynamic
 shared memory segments ... and, in any case, degrading to non-parallel
 execution beats failing outright.

I see that you're using git format-patch to generate this. But the
patch is only patch 1/4. Is that intentional? Where are the other
pieces?

I think that the parallel seqscan patch, and the assessing parallel
safety patch are intended to fit together with this patch, but I can't
find a place where there is a high level overview explaining just how
they fit together (I notice Amit's patch has an #include
access/parallel.h, which is here, but that wasn't trivial to figure
out). I haven't been paying too much attention to this patch series.

-- 
Peter Geoghegan




Re: [HACKERS] parallel mode and parallel contexts

2015-03-24 Thread Robert Haas
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund and...@2ndquadrant.com wrote:
  +  - The currently active user ID and security context.  Note that this is
  +the fourth user ID we restore: the initial step of binding to the correct
  +database also involves restoring the authenticated user ID.  When GUC
  +values are restored, this incidentally sets SessionUserId and OuterUserId
  +to the correct values.  This final step restores CurrentUserId.
 
  Ah. That's the answer for above. Could you just move it next to the
  other user bit?

 Well, I think it's good to keep this in the same order it happens.
 That's almost true, with the exception of the libraries, which were
 out of order.  (I've fixed that now.)

 I don't find this convincing in the least. This should explain to a
 developer which state he can rely on being shared. For that a sane order
 is helpful. In which precise order this happens doesn't seem to matter
 for that at all; it's also likely to get out of date.

I'm hoping we can agree to disagree on this point.  I like my ordering
better, you like your ordering better; we're arguing about the
ordering of paragraphs in a README.

 You'd need some kind of
 API that says pretend I'm waiting for this lock, but don't really
 wait for it, and you'd need to be darn sure that you removed yourself
 from the wait queue again before doing any other heavyweight lock
 manipulation.  Do you have specific thoughts on how to implement this?

 I've thought some about this, and I think it's a bit easier to not do it
 on the actual lock waitqueues, but teach deadlock.c about that kind of
 blocking.

 deadlock.c is far from simple, and at least I don't find the control
 flow to be particularly clear. So it's not easy.  It'd be advantageous
 to tackle things at that level because it'd avoid the need to acquire
 locks on some lock's waitqueue when blocking; we're going to do that a
 lot.

 But it seems to me that it should be possible to succeed: In
 FindLockCycleRecurse(), in the case that we're not waiting for an actual
 lock (checkProc->links.next == NULL) we can add a case that considers
 the 'blocking parallelism' case. ISTM that that's just a
 FindLockCycleRecurse() call on the process that we're waiting for. We'd
 either have to find the other side's locktag for DEADLOCK_INFO or invent
 another field in there; but that seems like a solvable problem.

I'll take a look at this.  Thanks for the pointers.

 That's a fair point. I was worried that this is going to introduce
 additional hard edges between processes. I think it actually might, but
 only when  a lock is upgraded; which really shouldn't happen for the
 copied locks.

Agreed.  And if it does, and we get a deadlock, I think I'm sorta OK
with that.  It's a known fact that lock upgrades are a deadlock
hazard, and the possible existence of obscure scenarios where that's
more likely when parallelism is involved than without it doesn't
bother me terribly.  I mean, we have to look at the specifics, but I
think it's far from obvious that

 I'm not sure if I said that somewhere before: If we aborted parallelism
 if any of the to-be-copied locks would conflict with its copy, instead
 of duplicating them, I could live with this. Then this would amount to
 jumping the lock queue, which seems reasonable to me.  Since not being
 able to do parallelism already needs to be handled gracefull, this
 doesn't seem to be too bad.

It's a tempting offer since it would shorten the path to getting this
committed, but I think that's leaving a lot on the table.  In
particular, if we ever want to have parallel CLUSTER, VACUUM FULL, or
ALTER TABLE, that's not going to get us there.  If we don't have the
right infrastructure to support that in whatever the initial commit
is, fine.  But we've got to have a plan that gets us there, or we're
just boxing ourselves into a corner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-03-24 Thread Alvaro Herrera
Robert Haas wrote:
 On Wed, Mar 18, 2015 at 7:10 PM, Andres Freund and...@2ndquadrant.com wrote:

  + /*
  +  * For now, parallel operations are required to be strictly read-only.
  +  * Unlike heap_update() and heap_delete(), an insert should never create
  +  * a combo CID, so it might be possible to relax this restrction, but
  +  * not without more thought and testing.
  +  */
  + if (IsInParallelMode())
  + ereport(ERROR,
  + (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
  +  errmsg("cannot insert tuples during a parallel operation")));
  +
 
  Minor nitpick: Should we perhaps move the ereport to a separate
  function? Akin to PreventTransactionChain()? Seems a bit nicer to not
  have the same message everywhere.
 
 Well, it's not actually the same message.  They're all a bit
 different.  Or mostly all of them.  And the variable part is not a
 command name, as in the PreventTransactionChain() case, so it would
 affect translatability if we did that.

Three things that vary are 1) the fact that some check IsParallelWorker()
and others check IsInParallelMode(), and 2) that some of them are
ereports() while others are elog(), and 3) that some are ERROR and
others are FATAL.  Is there a reason for these differences?

(Note typo restrction in quoted paragraph above.)

Maybe something similar to what commit f88d4cfc did: have a function
where all possible messages are together, and one is selected with some
enum.  That way, it's easier to maintain consistency if more cases are
added in the future.

 I don't actually think this is a big deal.

Yeah.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2015-03-24 Thread Andres Freund
On 2015-03-19 14:13:59 -0400, Robert Haas wrote:
 On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund and...@2ndquadrant.com wrote:
  Reading the README first, the rest later. So you can comment on my
  comments, while I actually look at the code. Parallelism, yay!

 Sorry, we don't support parallelism yet.  :-)

And not even proper sequential work apparently...

  +  - The currently active user ID and security context.  Note that this is
  +the fourth user ID we restore: the initial step of binding to the correct
  +database also involves restoring the authenticated user ID.  When GUC
  +values are restored, this incidentally sets SessionUserId and OuterUserId
  +to the correct values.  This final step restores CurrentUserId.
 
  Ah. That's the answer for above. Could you just move it next to the
  other user bit?

 Well, I think it's good to keep this in the same order it happens.
 That's almost true, with the exception of the libraries, which were
 out of order.  (I've fixed that now.)

I don't find this convincing in the least. This should explain to a
developer which state he can rely on being shared. For that a sane order
is helpful. In which precise order this happens doesn't seem to matter
for that at all; it's also likely to get out of date.

  +At the end of a parallel operation, which can happen either because it
  +completed successfully or because it was interrupted by an error, parallel
  +workers associated with that operation exit.  In the error case, transaction
  +abort processing in the parallel leader kills off any remaining workers, and
  +the parallel leader then waits for them to die.
 
  Very slightly awkward because first you talk about successful *or* error
  and then about abort processing.

 I don't understand what's awkward about that.  I make a general
 statement about what happens at the end of a parallel operation, and
 then the next few sentences follow up by explaining what happens in
 the error case, and what happens in the success case.

I seem to have completely overread the "In the error case," part of the
sentence. Forget what I wrote.

  +Copying locks to workers is important for the avoidance of undetected
  +deadlock between the initiating process and its parallel workers.  If the
  +initiating process holds a lock on an object at the start of parallelism,
  +and the worker subsequently attempts to acquire a lock on that same object
  +and blocks, this will result in an undetected deadlock, because the
  +initiating process cannot finish the transaction (thus releasing the lock)
  +until the worker terminates, and the worker cannot acquire the lock while
  +the initiating process holds it.  Locks which the processes involved acquire
  +and then release during parallelism do not present this hazard; they simply
  +force the processes involved to take turns accessing the protected resource.
 
  I don't think this is a strong guarantee. There very well can be lack of
  forward progress if they're waiting on each other in some way. Say the
  master backend holds the lock and waits on output from a worker. The
  worker then will endlessly wait for the lock to become free. A
  deadlock.   Or, as another scenario, consider cooperating backends that
  both try to send tuples to each other but the queue is full. A deadlock.

 The idea is that both the master and the workers are restricted to
 locks which they take and release again.  The idea is, specifically,
 to allow syscache lookups.  Taking a lock and then waiting for the
 worker is no good, which is why WaitForParallelWorkersToFinish() calls
 CheckForRetainedParallelLocks().  That would catch your first
 scenario.

 Your second scenario seems to me to be a different kind of problem.
 If that comes up, what you're going to want to do is rewrite the
 workers to avoid the deadlock by using non-blocking messaging.

I don't think that actually really solves the problem. You'll still end
up blocking with full pipes at some point. Whether you do it inside
the messaging or outside the messaging code doesn't particularly matter.

 In any case, I don't think it's really this patch's job to prevent
 deadlocks in code that doesn't exist today and in no way involves the
 lock manager.

Well, you're arguing that we need a solution in the parallelism
infrastructure for the lock deadlocks problem. I don't think it's absurd
to extend care to other cases.

  To me it seems the deadlock detector has to be enhanced to be able to
  see 'waiting for' edges. Independent of how we resolve our difference of
  opinion on the copying of locks.
 
  It seems to me that this isn't all that hard: Whenever we block waiting
  for another backend we enter ourselves on the wait queue to that
  backend's virtual transaction. When finished we take the blocking
  backend off. That, afaics, should do it. Alternatively we can just
  publish what backend we're waiting for in PGPROC and make deadlock also
  look at that; but, while 

Re: [HACKERS] parallel mode and parallel contexts

2015-03-24 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 Also: Man, trying to understand the guts of deadlock.c only made me
 understand how *friggin* expensive deadlock checking is. I'm really
 rather surprised that it only infrequently causes problems.

The reason for that is that we only run deadlock checking if something's
been waiting for at least one second, which pretty much takes it out
of any performance-relevant code pathway.  I think it would be a serious
error to put any deadlock-checking-like behavior into mainline code.
Which probably means that Andres is right that teaching deadlock.c
about any new sources of deadlock is the way to approach this.

regards, tom lane




Re: [HACKERS] parallel mode and parallel contexts

2015-03-18 Thread Andres Freund
On 2015-03-18 12:02:07 -0400, Robert Haas wrote:
 diff --git a/src/backend/access/heap/heapam.c 
 b/src/backend/access/heap/heapam.c
 index cb6f8a3..173f0ba 100644
 --- a/src/backend/access/heap/heapam.c
 +++ b/src/backend/access/heap/heapam.c
 @@ -2234,6 +2234,17 @@ static HeapTuple
  heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
   CommandId cid, int options)
  {
 + /*
 +  * For now, parallel operations are required to be strictly read-only.
 +  * Unlike heap_update() and heap_delete(), an insert should never create
 +  * a combo CID, so it might be possible to relax this restriction, but
 +  * not without more thought and testing.
 +  */
 + if (IsInParallelMode())
 + ereport(ERROR,
 + (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
 +  errmsg("cannot insert tuples during a parallel operation")));
 +

Minor nitpick: Should we perhaps move the ereport to a separate
function? Akin to PreventTransactionChain()? Seems a bit nicer to not
have the same message everywhere.
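
For illustration, such a helper might look like the sketch below (the function name and message wording are assumptions; the patch's PreventCommandIfParallelMode(), mentioned elsewhere in this thread, plays a similar role):

void
ErrorIfParallelMode(const char *what)
{
    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot %s during a parallel operation", what)));
}

/* heap_prepare_insert() could then simply do: */
ErrorIfParallelMode("insert tuples");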

 +void
 +DestroyParallelContext(ParallelContext *pcxt)
 +{
 + int i;
 +
 + /*
 +  * Be careful about order of operations here!  We remove the parallel
 +  * context from the list before we do anything else; otherwise, if an
 +  * error occurs during a subsequent step, we might try to nuke it again
 +  * from AtEOXact_Parallel or AtEOSubXact_Parallel.
 +  */
 + dlist_delete(&pcxt->node);

Hm. I'm wondering about this. What if we actually fail below? Like in
dsm_detach() or its callbacks? Then we'll enter abort again at some
point (during proc_exit at the latest) but will not wait for the child
workers. Right?
 +/*
 + * End-of-subtransaction cleanup for parallel contexts.
 + *
 + * Currently, it's forbidden to enter or leave a subtransaction while
 + * parallel mode is in effect, so we could just blow away everything.  But
 + * we may want to relax that restriction in the future, so this code
 + * contemplates that there may be multiple subtransaction IDs in pcxt_list.
 + */
 +void
 +AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId)
 +{
 + while (!dlist_is_empty(pcxt_list))
 + {
 + ParallelContext *pcxt;
 +
 + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list);
 + if (pcxt->subid != mySubId)
 + break;
 + if (isCommit)
 + elog(WARNING, "leaked parallel context");
 + DestroyParallelContext(pcxt);
 + }
 +}

 +/*
 + * End-of-transaction cleanup for parallel contexts.
 + */
 +void
 +AtEOXact_Parallel(bool isCommit)
 +{
 + while (!dlist_is_empty(pcxt_list))
 + {
 + ParallelContext *pcxt;
 +
 + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list);
 + if (isCommit)
 + elog(WARNING, "leaked parallel context");
 + DestroyParallelContext(pcxt);
 + }
 +}

Is there any reason to treat the isCommit case as a warning? To me that
sounds like a pretty much guaranteed programming error. If you're going
to argue that a couple of resource leakage warnings do similarly: I don't
think that counts for that much. For one, e.g. a leaked tupdesc has much
less severe consequences; for another, IIRC those warnings were introduced
after the fact.

 + * When running as a parallel worker, we place only a single
 + * TransactionStateData is placed on the parallel worker's
 + * state stack,

'we place .. is placed'

  /*
 + *   EnterParallelMode
 + */
 +void
 +EnterParallelMode(void)
 +{
 + TransactionState s = CurrentTransactionState;
 +
 + Assert(s->parallelModeLevel >= 0);
 +
 + ++s->parallelModeLevel;
 +}
 +
 +/*
 + *   ExitParallelMode
 + */
 +void
 +ExitParallelMode(void)
 +{
 + TransactionState s = CurrentTransactionState;
 +
 + Assert(s->parallelModeLevel > 0);
 + Assert(s->parallelModeLevel > 1 || !ParallelContextActive());
 +
 + --s->parallelModeLevel;
 +
 + if (s->parallelModeLevel == 0)
 + CheckForRetainedParallelLocks();
 +}

Hm. Is it actually ok for nested parallel mode to retain locks on their
own? Won't that pose a problem? Or did you do it that way just because
we don't have more state? If so that deserves a comment explaining that
that's the case and why that's acceptable.
 @@ -1769,6 +1904,9 @@ CommitTransaction(void)
  {
   TransactionState s = CurrentTransactionState;
   TransactionId latestXid;
 + bool        parallel;
 +
 + parallel = (s->blockState == TBLOCK_PARALLEL_INPROGRESS);

 + /* If we might have parallel workers, clean them up now. */
 + if (IsInParallelMode())
 + {
 + CheckForRetainedParallelLocks();
 + AtEOXact_Parallel(true);
 + s->parallelModeLevel = 0;
 + }

'parallel' looks strange when we're also, rightly so, doing stuff like
checking 

Re: [HACKERS] parallel mode and parallel contexts

2015-03-18 Thread Andres Freund
Hi,

Reading the README first, the rest later. So you can comment on my
comments, while I actually look at the code. Parallelism, yay!

On 2015-03-18 12:02:07 -0400, Robert Haas wrote:
 +Instead, we take a more pragmatic approach: we try to make as many of the
 +operations that are safe outside of parallel mode work correctly in parallel
 +mode as well, and we try to prohibit the rest via suitable error
 checks.

I'd say that we'd try to prohibit a bunch of common cases. Among them
the ones that are triggerable from SQL. We don't really try to prohibit
many kinds of traps, as you describe.

 +  - The authenticated user ID and current database.  Each parallel worker
 +will connect to the same database as the initiating backend, using the
 +same user ID.

This sentence immediately makes me wonder why only the authenticated
user, and not the currently active role, is mentioned.

 +  - The values of all GUCs.  Accordingly, permanent changes to the value of
 +any GUC are forbidden while in parallel mode; but temporary changes,
 +such as entering a function with non-NULL proconfig, are potentially OK.

"Potentially OK" sounds odd to me. Which danger are you seeing that isn't
relevant for norm

 +  - The combo CID mappings.  This is needed to ensure consistent answers to
 +tuple visibility checks.  The need to synchronize this data structure is
 +a major reason why we can't support writes in parallel mode: such writes
 +might create new combo CIDs, and we have no way to let other workers
 +(or the initiating backend) know about them.

Absolutely *not* in the initial version, but we really need to find a
better solution for this.

 +  - The currently active user ID and security context.  Note that this is
 +the fourth user ID we restore: the initial step of binding to the correct
 +database also involves restoring the authenticated user ID.  When GUC
 +values are restored, this incidentally sets SessionUserId and OuterUserId
 +to the correct values.  This final step restores CurrentUserId.

Ah. That's the answer for above. Could you just move it next to the
other user bit?

 +We could copy the entire transaction state stack,
 +but most of it would be useless: for example, you can't roll back to a
 +savepoint from within a parallel worker, and there are no resources
 +associated with the memory contexts or resource owners of intermediate
 +subtransactions.

I do wonder if we're not going to have to change this in the not too far
away future. But then, this isn't externally visible at all, so
whatever.

 +At the end of a parallel operation, which can happen either because it
 +completed successfully or because it was interrupted by an error, parallel
 +workers associated with that operation exit.  In the error case, transaction
 +abort processing in the parallel leader kills off any remaining workers, and
 +the parallel leader then waits for them to die.

Very slightly awkward because first you talk about successful *or* error
and then about abort processing.

 +  - Cleanup of pg_temp namespaces is not done.  The initiating backend is
 +responsible for this, too.

How could a worker have its own pg_temp namespace?

 +Lock Management
 +===============
 +
 +Certain heavyweight locks that the initiating backend holds at the beginning
 +of parallelism are copied to each worker, which unconditionally acquires them.
 +The parallel backends acquire - without waiting - each such lock that the
 +leader holds, even if that lock is self-exclusive.  This creates the unusual
 +situation that a lock which could normally only be held by a single backend
 +can be shared among several backends in a parallel group.
 +
 +Obviously, this presents significant hazards that would not be present in
 +normal execution.  If, for example, a backend were to initiate parallelism
 +while ReindexIsProcessingIndex() were true for some index, the parallel
 +backends launched at that time would neither share this state nor be excluded
 +from accessing the index via the heavyweight lock mechanism.  It is therefore
 +imperative that backends only initiate parallelism from places where it will
 +be safe for parallel workers to access the relations on which they hold locks.
 +It is also important that they not afterwards do anything which causes access
 +to those relations to become unsafe, or at least not until after parallelism
 +has concluded.  The fact that parallelism is strictly read-only means that the
 +opportunities for such mishaps are few and far between; furthermore, most
 +operations which present these hazards are DDL operations, which will be
 +rejected by CheckTableNotInUse() if parallel mode is active.
 +
 +Only relation locks, locks on database or shared objects, and perhaps the lock
 +on our transaction ID are copied to workers.

perhaps?

 Advisory locks are not copied,
 +but the leader may hold them at the start of parallelism; they cannot
 +subsequently be manipulated 

Re: [HACKERS] parallel mode and parallel contexts

2015-03-06 Thread Amit Kapila
On Sun, Feb 15, 2015 at 11:48 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Feb 13, 2015 at 2:22 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas robertmh...@gmail.com
wrote:
 
  We're not seeing eye to eye here yet, since I don't accept your
  example scenario and you don't accept mine.  Let's keep discussing.
 
  Meanwhile, here's an updated patch.
 
  A lot of cool activity is showing up here, so moved the patch to CF
2015-02.
  Perhaps Andres you could add yourself as a reviewer?

 Here's a new version of the patch with (a) some additional
 restrictions on heavyweight locking both at the start of, and during,
 parallel mode and (b) a write-up in the README explaining the
 restrictions and my theory of why the handling of heavyweight locking
 is safe.  Hopefully this goes some way towards addressing Andres's
 concerns.  I've also replaced the specific (and wrong) messages about
 advisory locks with a more generic message, as previously proposed;
 and I've fixed at least one bug.


Today, while testing parallel_seqscan patch, I encountered one
intermittent issue (it hangs in below function) and I have question
related to below function.

+void
+WaitForParallelWorkersToFinish(ParallelContext *pcxt)
+{
..
+ for (;;)
+ {
..
+ CHECK_FOR_INTERRUPTS();
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ if (pcxt->worker[i].error_mqh != NULL)
+ {
+ anyone_alive = true;
+ break;
+ }
+ }
+
+ if (!anyone_alive)
+ break;
+
+ WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
+ ResetLatch(&MyProc->procLatch);
+ }

Isn't there a race condition in this function such that after it finds
that there is some alive worker and before it does WaitLatch(), the
worker completes its work and exits, now in such a case who is
going to wake the backend waiting on procLatch?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-03-06 Thread Robert Haas
On Fri, Mar 6, 2015 at 7:01 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 Today, while testing parallel_seqscan patch, I encountered one
 intermittent issue (it hangs in below function) and I have question
 related to below function.

 +void
 +WaitForParallelWorkersToFinish(ParallelContext *pcxt)
 +{
 ..
 + for (;;)
 + {
 ..
 + CHECK_FOR_INTERRUPTS();
 + for (i = 0; i < pcxt->nworkers; ++i)
 + {
 + if (pcxt->worker[i].error_mqh != NULL)
 + {
 + anyone_alive = true;
 + break;
 + }
 + }
 +
 + if (!anyone_alive)
 + break;
 +
 + WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
 + ResetLatch(&MyProc->procLatch);
 + }

 Isn't there a race condition in this function such that after it finds
 that there is some alive worker and before it does WaitLatch(), the
 worker completes its work and exits, now in such a case who is
 going to wake the backend waiting on procLatch?

It doesn't matter whether some other backend sets the process latch
before we reach WaitLatch() or after we begin waiting.  Either way,
it's fine.

It would be bad if the process could exit without setting the latch at
all, though.  I hope that's not the case.
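
That holds because latches are level-triggered: once set, a latch stays set until it is explicitly reset, so a WaitLatch() that begins after the set returns immediately. The loop quoted above therefore boils down to the following sketch (the alive-worker test is abbreviated here):

for (;;)
{
    CHECK_FOR_INTERRUPTS();

    if (!any_worker_alive())        /* re-check the condition each iteration */
        break;

    /*
     * Even if a worker set our latch between the check above and this
     * call, WaitLatch() returns at once because the latch is still set,
     * so the wake-up cannot be lost.
     */
    WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
    ResetLatch(&MyProc->procLatch);
}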

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-02-17 Thread Andres Freund
On 2015-02-11 13:59:04 -0500, Robert Haas wrote:
 On Tue, Feb 10, 2015 at 8:45 PM, Andres Freund and...@2ndquadrant.com wrote:
  The only reason I'd like it to be active is because that'd *prohibit*
 doing the crazier stuff. There seems little reason to not do it under
  the additional protection against crazy things that'd give us?
 
 Trying to load additional libraries once parallel mode is in progress
 can provoke failures, because the load of the libraries causes the
 system to do GUC_ACTION_SET on the GUCs whose initialization was
 deferred, and that trips up the
 no-changing-things-while-in-parallel-mode checks.

Oh, that's a good point.

 I'm not sure if there's anything we can, or should, do about that.

Fine with me.

  Well, ExportSnapshot()/Import has quite a bit more restrictions than
  what you're doing... Most importantly it cannot import in transactions
  unless using read committed isolation, forces xid assignment during
  export, forces the old snapshot to stick around for the whole
  transaction and only works on a primary. I'm not actually sure it makes
  a relevant difference, but it's certainly worth calling attention to.
 
  The fudging I was wondering about certainly is unnecessary though -
  GetSnapshotData() has already performed it...
 
  I'm not particularly awake right now and I think this needs a closer
  look either by someone awake... I'm not fully convinced this is safe.
 
 I'm not 100% comfortable with it either, but I just spent some time
 looking at it and can't see what's wrong with it.  Basically, we're
 trying to get the parallel worker into a state that matches the
 master's state after doing GetTransactionSnapshot() - namely,
 CurrentSnapshot should point to the same snapshot on the master, and
 FirstSnapshotSet should be true, plus the same additional processing
 that GetTransactionSnapshot() would have done if we're in a higher
 transaction isolation level.  It's possible we don't need to mimic
 that state, but it seems like a good idea.

I plan to look at this soonish.

 Still, I wonder if we ought to be banning GetTransactionSnapshot()
 altogether.  I'm not sure if there's ever a time when it's safe for a
 worker to call that.

Why?

   Also, you don't seem to
 prohibit popping the active snapshot (should that be prohibited
 maybe?) which might bump the initiator's xmin horizon.
 
  I think as long as our transaction snapshot is installed correctly our
  xmin horizon can't advance; am I missing something?
 
  Maybe I'm missing something, but why would that be the case in a read
  committed session? ImportSnapshot() only ever calls
  SetTransactionSnapshot it such a case (why does it contain code to cope
  without it?), but your patch doesn't seem to guarantee that.
 
  But as you don't actually transport XactIsoLevel anywhere (an omission,
  probably?) that seems to mean that the SetTransactionSnapshot() won't do
  much, even if the source transaction is repeatable read.
 
 Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore
 the GUC state?

Yes, but afaics the transaction start will just overwrite it again:

static void
StartTransaction(void)
{
...
XactDeferrable = DefaultXactDeferrable;
XactIsoLevel = DefaultXactIsoLevel;
...

For a client issued BEGIN it works because utility.c does:
case TRANS_STMT_BEGIN:
case TRANS_STMT_START:
    {
        ListCell   *lc;

        BeginTransactionBlock();
        foreach(lc, stmt->options)
        {
            DefElem    *item = (DefElem *) lfirst(lc);

            if (strcmp(item->defname, "transaction_isolation") == 0)
                SetPGVariable("transaction_isolation",
                              list_make1(item->arg),
                              true);
            else if (strcmp(item->defname, "transaction_read_only") == 0)
                SetPGVariable("transaction_read_only",
                              list_make1(item->arg),
                              true);
            else if (strcmp(item->defname, "transaction_deferrable") == 0)
                SetPGVariable("transaction_deferrable",
                              list_make1(item->arg),
                              true);
        }

Pretty, isn't it?


  Your manual ones don't either, that's what made me
  complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL
  level. Instead they use the C name + parens.
 
 On further reflection, I think I should probably just change all of
 those to use a general message about advisory locks, i.e.:
 
 ERROR: 

Re: [HACKERS] parallel mode and parallel contexts

2015-02-12 Thread Michael Paquier
On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas robertmh...@gmail.com wrote:

 We're not seeing eye to eye here yet, since I don't accept your
 example scenario and you don't accept mine.  Let's keep discussing.

 Meanwhile, here's an updated patch.


A lot of cool activity is showing up here, so moved the patch to CF
2015-02. Perhaps Andres you could add yourself as a reviewer?
-- 
Michael


Re: [HACKERS] parallel mode and parallel contexts

2015-02-11 Thread Robert Haas
On Wed, Feb 11, 2015 at 1:59 PM, Robert Haas robertmh...@gmail.com wrote:
 I'm not sure what you mean by the severity of the lock.  How about
 marking the locks that the worker inherited from the parallel master
 and throwing an error if it tries to lock one of those objects in a
 mode that it does not already hold?  That'd probably catch a sizeable
 number of programming errors without tripping up many legitimate use
 cases (and we can always relax or modify the prohibition if it later
 turns out to be problematic).

Or maybe do that only when the new lock mode is stronger than one it
already holds.
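
A sketch of what that check could look like inside LockAcquire() (the LOCALLOCK fields are assumptions, and "stronger" is approximated here by comparing the numeric lock-mode values):

/* In LockAcquire(), after finding an existing LOCALLOCK for this object: */
if (locallock != NULL &&
    locallock->inheritedFromLeader &&      /* hypothetical flag */
    lockmode > locallock->maxHeldMode)     /* hypothetical field */
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
             errmsg("cannot take a stronger lock on an object locked by the parallel leader")));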

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-02-10 Thread Andres Freund
On 2015-02-10 11:49:58 -0500, Robert Haas wrote:
 On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote:
  * Don't like CreateParallelContextForExtension much as a function name -
I don't think we normally equate the fact that code is located in a
loadable library with the extension mechanism.

 Suggestions for a better name?  CreateParallelContextForLoadableFunction?

*ExternalFunction maybe? That's what dfmgr.c calls it.

  * the plain shm calculations intentionally use mul_size/add_size to deal
with overflows. On 32bit it doesn't seem impossible, but unlikely to
overflow size_t.

 Yes, I do that here too, though as with the plain shm calculations,
 only in the estimate functions.  The functions that actually serialize
 stuff don't have to worry about overflow because it's already been
 checked.

I thought I'd seen some estimation functions that didn't use it. But
apparently I was wrong.

  * In +LaunchParallelWorkers, does it make sense trying to start workers
if one failed? ISTM that's likely to not be helpful. I.e. it should
just break; after the first failure.

 It can't just break, because clearing pcxt->worker[i].error_mqh is an
 essential step.  I could add a flag variable that tracks whether any
 registrations have failed and change the if statement to if
 (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
 &pcxt->worker[i].bgwhandle)), if you want.  I thought about doing that
 but decided it was very unlikely to affect the real-world performance
 of anything, so easier just to keep the code simple. But I'll change
 it if you want.

I think it'd be better.
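
For reference, a sketch of the flagged loop (anything not quoted above, such as the launched-worker counter, is an assumption):

BackgroundWorker worker;
bool        any_registrations_failed = false;
int         i;

/* ... fill in the parts of "worker" that are the same for every slot ... */

for (i = 0; i < pcxt->nworkers; ++i)
{
    if (!any_registrations_failed &&
        RegisterDynamicBackgroundWorker(&worker,
                                        &pcxt->worker[i].bgwhandle))
        ++nworkers_launched;               /* hypothetical counter */
    else
    {
        /*
         * Once one registration fails, don't try the rest, but still
         * clear the handles so later cleanup doesn't wait for workers
         * that were never registered.
         */
        any_registrations_failed = true;
        pcxt->worker[i].bgwhandle = NULL;
        pcxt->worker[i].error_mqh = NULL;
    }
}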

  * ParallelMain restores libraries before GUC state. Given that
    libraries can, and actually somewhat frequently do, inspect GUCs on
    load, it seems better to do it the other way round? You argue "We want
    to do this before restoring GUCs, because the libraries might define
    custom variables.", but I don't buy that. It's completely normal for
namespaced GUCs to be present before a library is loaded? Especially
as you then go and process_session_preload_libraries() after setting
the GUCs.

 This is a good question, but the answer is not entirely clear to me.
 I'm thinking I should probably just remove
 process_session_preload_libraries() altogether at this point.  That
 was added at a time when RestoreLibraryState() didn't exist yet, and I
 think RestoreLibraryState() is a far superior way of handling this,
 because surely the list of libraries that the parallel leader
 *actually had loaded* is more definitive than any GUC.

That sounds like a good idea to me.

 Now, that doesn't answer the question about whether we should load
 libraries first or GUCs.  I agree that the comment's reasoning is
 bogus, but I'm not sure I understand why you think the other order is
 better.  It *is* normal for namespaced GUCs to be present before a
 library is loaded, but it's equally normal (and, I think, often more
 desirable in practice) to load the library first and then set the
 GUCs.

Well, it's pretty much never the case that the library is loaded before
postgresql.conf gucs, right? A changed postgresql.conf is the only
exception I can see. Neither is it the normal case for
session|local_preload_libraries. Not even when GUCs are loaded via
pg_db_role_setting or the startup packet...

 Generally, I think that libraries ought to be loaded as early
 as possible, because they may install hooks that change the way other
 stuff works, and should therefore be loaded before that other stuff
 happens.

While that may be desirable I don't really see a reason for this to be
treated differently for worker processes than the majority of cases
otherwise.

Anyway, I think this is a relatively minor issue.

  * Should ParallelMain maybe enter the parallel context before loading
user defined libraries? It's far from absurd to execute code touching
the database on library initialization...

 It's hard to judge without specific examples.  What kinds of things do
 they do?

I've seen code filling lookup caches and creating system objects
(including tables and extensions).

 Are they counting on a transaction being active?

 I would have thought that was a no-no, since there are many instances
 in which it won't be true.  Also, you might have just gotten loaded
 because a function stored in your library was called, so you could be
 in a transaction that's busy doing something else, or deep in a
 subtransaction stack, etc.  It seems unsafe to do very much more than
 a few syscache lookups here, even if there does happen to be a
 transaction active.

The only reason I'd like it to be active is because that'd *prohibit*
doing the crazier stuff. There seems little reason to not do it under
the additional protection against crazy things that'd give us?

  * I think restoring snapshots needs to fudge the worker's PGXACT->xmin
to be the minimum of the top transaction id and the
snapshots. 

Re: [HACKERS] parallel mode and parallel contexts

2015-02-10 Thread Robert Haas
On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund and...@2ndquadrant.com wrote:
 Observations:
 * Some trailing whitespace in the readme. Very nice otherwise.

Fixed.  Thanks.

 * Don't like CreateParallelContextForExtension much as a function name -
   I don't think we normally equate the fact that code is located in a
   loadable library with the extension mechanism.

Suggestions for a better name?  CreateParallelContextForLoadableFunction?

 * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's
   that about?

Gee, maybe I should have added a comment so I'd remember.  If the
buffer size isn't MAXALIGN'd, that would be really bad, because
shm_mq_create() assumes that it's being given an aligned address.
Maybe I should add an Assert() there.  If it is MAXALIGN'd but not
BUFFERALIGN'd, we might waste a few bytes of space, since
shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I
don't think anything will actually break.  Not sure if that's worth an
assert or not.
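
In other words, the assertion (which sits inside a function, since StaticAssertStmt is a statement) simply guards the queue-size constant; the size value below is illustrative:

#define PARALLEL_ERROR_QUEUE_SIZE      16384

/* e.g. at the top of the function that carves up the DSM segment */
StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) == PARALLEL_ERROR_QUEUE_SIZE,
                 "parallel error queue size not buffer-aligned");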

 * the plain shm calculations intentionally use mul_size/add_size to deal
   with overflows. On 32bit it doesn't seem impossible, but unlikely to
   overflow size_t.

Yes, I do that here too, though as with the plain shm calculations,
only in the estimate functions.  The functions that actually serialize
stuff don't have to worry about overflow because it's already been
checked.

 * I'd s/very // in "We might be running in a very short-lived memory
   context.". Or replace it with "too".

Removed "very".

 * In +LaunchParallelWorkers, does it make sense trying to start workers
   if one failed? ISTM that's likely to not be helpful. I.e. it should
   just break; after the first failure.

It can't just break, because clearing pcxt->worker[i].error_mqh is an
essential step.  I could add a flag variable that tracks whether any
registrations have failed and change the if statement to if
(!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker,
&pcxt->worker[i].bgwhandle)), if you want.  I thought about doing that
but decided it was very unlikely to affect the real-world performance
of anything, so easier just to keep the code simple. But I'll change
it if you want.

 * +WaitForParallelWorkersToFinish says that it waits for workers to exit
   "cleanly". To me that's ambiguous. How about "fully"?

I've removed the word "cleanly" and added a comment to more fully
explain the danger:

+ * Even if the parallel operation seems to have completed successfully, it's
+ * important to call this function afterwards.  We must not miss any errors
+ * the workers may have thrown during the parallel operation, or any that they
+ * may yet throw while shutting down.

 * ParallelMain restores libraries before GUC state. Given that
   libraries can, and actually somewhat frequently do, inspect GUCs on
   load, it seems better to do it the other way round? You argue "We want
   to do this before restoring GUCs, because the libraries might define
   custom variables.", but I don't buy that. It's completely normal for
   namespaced GUCs to be present before a library is loaded? Especially
   as you then go and process_session_preload_libraries() after setting
   the GUCs.

This is a good question, but the answer is not entirely clear to me.
I'm thinking I should probably just remove
process_session_preload_libraries() altogether at this point.  That
was added at a time when RestoreLibraryState() didn't exist yet, and I
think RestoreLibraryState() is a far superior way of handling this,
because surely the list of libraries that the parallel leader
*actually had loaded* is more definitive than any GUC.

Now, that doesn't answer the question about whether we should load
libraries first or GUCs.  I agree that the comment's reasoning is
bogus, but I'm not sure I understand why you think the other order is
better.  It *is* normal for namespaced GUCs to be present before a
library is loaded, but it's equally normal (and, I think, often more
desirable in practice) to load the library first and then set the
GUCs.  Generally, I think that libraries ought to be loaded as early
as possible, because they may install hooks that change the way other
stuff works, and should therefore be loaded before that other stuff
happens.

 * Should ParallelMain maybe enter the parallel context before loading
   user defined libraries? It's far from absurd to execute code touching
   the database on library initialization...

It's hard to judge without specific examples.  What kinds of things do
they do?  Are they counting on a transaction being active?  I would
have thought that was a no-no, since there are many instances in which
it won't be true.  Also, you might have just gotten loaded because a
function stored in your library was called, so you could be in a
transaction that's busy doing something else, or deep in a
subtransaction stack, etc.  It seems unsafe to do very much more than
a few syscache lookups here, even if there does happen to be a
transaction active.

 * 

Re: [HACKERS] parallel mode and parallel contexts

2015-02-07 Thread Andres Freund
Hi,


On 2015-02-06 22:43:21 -0500, Robert Haas wrote:
 Here's v4, with that fixed and a few more tweaks.

If you attached files generated with 'git format-patch' I could directly
apply them with the commit message and such. All at once if it's
multiple patches, as individual commits. On nontrivial patches it's nice
to see the commit message(s) along the diff(s).

Observations:
* Some trailing whitespace in the readme. Very nice otherwise.

* Don't like CreateParallelContextForExtension much as a function name -
  I don't think we normally equate the fact that code is located in a
  loadable library with the extension mechanism.

* StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's
  that about?

* the plain shm calculations intentionally use mul_size/add_size to deal
  with overflows. On 32bit it doesn't seem impossible, but unlikely to
  overflow size_t.

* I'd s/very // in "We might be running in a very short-lived memory
  context.". Or replace it with "too".

* In +LaunchParallelWorkers, does it make sense trying to start workers
  if one failed? ISTM that's likely to not be helpful. I.e. it should
  just break; after the first failure.

* +WaitForParallelWorkersToFinish says that it waits for workers to exit
  "cleanly". To me that's ambiguous. How about "fully"?

* ParallelMain restores libraries before GUC state. Given that
  libraries can, and actually somewhat frequently do, inspect GUCs on
  load, it seems better to do it the other way round? You argue "We want
  to do this before restoring GUCs, because the libraries might define
  custom variables.", but I don't buy that. It's completely normal for
  namespaced GUCs to be present before a library is loaded? Especially
  as you then go and process_session_preload_libraries() after setting
  the GUCs.

* Should ParallelMain maybe enter the parallel context before loading
  user defined libraries? It's far from absurd to execute code touching
  the database on library initialization...

* rename ParallelMain to ParallelWorkerMain?

* I think restoring snapshots needs to fudge the worker's PGXACT->xmin
  to be the minimum of the top transaction id and the
  snapshots. Otherwise if the initiating process dies badly
  (e.g. because postmaster died) the workers will continue to work,
  while other processes may remove things. Also, you don't seem to
  prohibit popping the active snapshot (should that be prohibited
  maybe?) which might bump the initiator's xmin horizon.

* Comment about 'signal_handler' in +HandleParallelMessageInterrupt
  is outdated.

* Is it really a good idea to overlay Z/ReadyForQuery with 'this worker
  exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally
  not sure it's wise to use this faux FE/BE protocol here...

* HandlingParallelMessages looks dead.

* ParallelErrorContext has the wrong comment.

* ParallelErrorContext() provides the worker's pid in the context
  message. I guess that's so it's easier to interpret when sent to the
  initiator? It'll look odd when logged by the failing process.

* We now have pretty much the same handle_sigterm in a bunch of
  places. Can't we get rid of those? You actually probably can just use
  die().

* The comments in xact.c above XactTopTransactionId pretty much assume
  that the reader knows that that is about parallel stuff.

* I'm a bit confused by the fact that Start/CommitTransactionCommand()
  emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas
  ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be
  pretty much impossible to hit a ReleaseSavepoint() with a parallel
  transaction in progress? We'd have had to end the previous transaction
  command while parallel stuff was in progress - right?
  (Internal/ReleaseCurrentSubTransaction is different, that's used in code)

* Why are you deviating from the sorting order used in other places for
  ParallelCurrentXids? That seems really weird, especially as we use
  something else a couple lines down. Especially as you actually seem to
  send stuff in xidComparator order?

* I don't think skipping AtEOXact_Namespace() entirely if parallel is a
  good idea. That function already does some other stuff than cleaning
  up temp tables. I think you should pass in parallel and do the skip in
  there.

* Start/EndParallelWorkerTransaction assert the current state, whereas
  the rest of the file FATALs in that case. I think it'd actually be
  good to be conservative and do the same in this case.

* You're manually passing function names to
  PreventCommandIfParallelMode() in a fair number of cases. I'd either
  try and keep the names consistent with what the functions are actually
  called at the sql level (adding the types in the parens) or just use
  PG_FUNCNAME_MACRO to keep them correct.

* Wait. You now copy all held relation locks as is to the standby? I
  quite doubt that's a good idea, and it's not my reading of the
  conclusions made in the group locking thread. At the very least 

Re: [HACKERS] parallel mode and parallel contexts

2015-02-04 Thread Amit Kapila
On Wed, Feb 4, 2015 at 9:40 PM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Jan 30, 2015 at 12:08 PM, Robert Haas robertmh...@gmail.com
wrote:
  Here's a new version.  Andres mentioned previously that he thought it
  would be a good idea to commit the addition of
  BackgroundWorkerInitializeConnectionByOid() separately, as he's had
  cause to want it independently of the rest of this stuff.  It would be
  useful for pg_background, too.  So I've broken that out into a
  separate patch here (bgworker-by-oid.patch) and will commit that RSN
  unless somebody thinks it's a bad idea for some reason.  AFAICS it
  should be uncontroversial.

 This is now done.

 The main patch needed some updating in light of Andres's recent
 assault on ImmediateInterruptOK (final result: Andres 1-0 IIOK) so
 here's a new version.


I think we should expose variable ParallelWorkerNumber (or if you don't
want to expose it then at least a GetParallel* function call is required to get
the value of same), as that is needed for external applications wherever
they want to allocate something for each worker; some examples w.r.t. the
parallel seq scan patch: each worker should have a separate tuple
queue, and we might also need it for the implementation of the Explain
statement.
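
For illustration, with ParallelWorkerNumber exposed a worker could locate its own slot of a per-worker resource laid out in the DSM segment, e.g. its tuple queue (the toc key, queue-size constant, and layout below are assumptions of this sketch, not the patch; toc and seg are what the worker obtained when it attached to the segment):

char       *queue_space;
shm_mq     *mq;
shm_mq_handle *mqh;

queue_space = shm_toc_lookup(toc, PARALLEL_KEY_TUPLE_QUEUES);
mq = (shm_mq *) (queue_space + ParallelWorkerNumber * TUPLE_QUEUE_SIZE);
shm_mq_set_sender(mq, MyProc);
mqh = shm_mq_attach(mq, seg, NULL);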


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-30 Thread Jim Nasby

On 1/30/15 11:08 AM, Robert Haas wrote:

The final patch attached here (parallel-dummy-v2.patch) has been
updated slightly to incorporate some prefetching logic.  It's still
just demo code and is not intended for commit.  I'm not sure whether
the prefetching logic can actually be made to improve performance,
either; if anyone feels like playing with that and reporting results
back, that would be swell.


Wouldn't we want the prefetching to happen after ReadBuffer() instead of 
before? This way we risk pushing the buffer we actually want out, plus 
if it's not already available the OS probably won't try and read it 
until it's read all the prefetch blocks.
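
In other words, something along these lines (sketch; PREFETCH_DISTANCE, nblocks, and the variable names are illustrative):

/* Read the block we need now, then hint the OS about a block further ahead. */
buffer = ReadBuffer(rel, blkno);
if (blkno + PREFETCH_DISTANCE < nblocks)
    PrefetchBuffer(rel, MAIN_FORKNUM, blkno + PREFETCH_DISTANCE);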

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] parallel mode and parallel contexts

2015-01-21 Thread Robert Haas
On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  It seems [WaitForBackgroundWorkerShutdown] has possibility to wait
  forever.
  Assume one of the worker is not able to start (not able to attach
  to shared memory or some other reason), then status returned by
  GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
  and after that it can wait forever in WaitLatch.

 I don't think that's right.  The status only remains
 BGWH_NOT_YET_STARTED until the postmaster forks the child process.

 I think the control flow can reach the above location before
 postmaster could fork all the workers.  Consider a case that
 we have launched all workers during ExecutorStart phase
 and then before postmaster could start all workers an error
 occurs in master backend and then it try to Abort the transaction
 and destroy the parallel context, at that moment it will get the
 above status and wait forever in above code.

 I am able to reproduce above scenario with debugger by using
 parallel_seqscan patch.

Hmm.  Well, if you can reproduce it, there clearly must be a bug.  But
I'm not quite sure where.  What should happen in that case is that the
process that started the worker has to wait for the postmaster to
actually start it, and then after that for the new process to die, and
then it should return.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-21 Thread Amit Kapila
On Wed, Jan 21, 2015 at 6:35 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas robertmh...@gmail.com
wrote:
  On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com
  wrote:
   It seems [WaitForBackgroundWorkerShutdown] has possibility to wait
   forever.
   Assume one of the worker is not able to start (not able to attach
   to shared memory or some other reason), then status returned by
   GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
   and after that it can wait forever in WaitLatch.
 
  I don't think that's right.  The status only remains
  BGWH_NOT_YET_STARTED until the postmaster forks the child process.
 
  I think the control flow can reach the above location before
  postmaster could fork all the workers.  Consider a case that
  we have launched all workers during ExecutorStart phase
  and then before postmaster could start all workers an error
  occurs in master backend and then it try to Abort the transaction
  and destroy the parallel context, at that moment it will get the
  above status and wait forever in above code.
 
  I am able to reproduce above scenario with debugger by using
  parallel_seqscan patch.

 Hmm.  Well, if you can reproduce it, there clearly must be a bug.  But
 I'm not quite sure where.  What should happen in that case is that the
 process that started the worker has to wait for the postmaster to
 actually start it,

Okay, I think this should solve the issue, also it should be done
before call of TerminateBackgroundWorker().


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-20 Thread Amit Kapila
On Sat, Jan 17, 2015 at 3:40 AM, Robert Haas robertmh...@gmail.com wrote:

 New patch attached.  I'm going to take the risk of calling this v1
 (previous versions have been 0.x), since I've now done something about
 the heavyweight locking issue, as well as fixed the message-looping
 bug Amit pointed out.  It doubtless needs more work, but it's starting
 to smell a bit more like a real patch.


I need some clarification regarding below code:

+BgwHandleStatus
+WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)
+{
+ BgwHandleStatus status;
+ int rc;
+ bool save_set_latch_on_sigusr1;
+
+ save_set_latch_on_sigusr1 = set_latch_on_sigusr1;
+ set_latch_on_sigusr1 = true;
+
+ PG_TRY();
+ {
+ for (;;)
+ {
+ pid_t pid;
+
+ CHECK_FOR_INTERRUPTS();
+
+ status = GetBackgroundWorkerPid(handle, &pid);
+ if (status == BGWH_STOPPED)
+ return status;
+
+ rc = WaitLatch(&MyProc->procLatch,
+   WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
+
+ if (rc & WL_POSTMASTER_DEATH)
+ return BGWH_POSTMASTER_DIED;

It seems this code has possibility to wait forever.
Assume one of the worker is not able to start (not able to attach
to shared memory or some other reason), then status returned by
GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
and after that it can wait forever in WaitLatch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-20 Thread Robert Haas
On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever.
 Assume one of the worker is not able to start (not able to attach
 to shared memory or some other reason), then status returned by
 GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
 and after that it can wait forever in WaitLatch.

I don't think that's right.  The status only remains
BGWH_NOT_YET_STARTED until the postmaster forks the child process.  At
that point it immediately changes to BGWH_STARTED.  If it starts up
and then dies early on, for example because it can't attach to shared
memory or somesuch, the status will change to BGWH_STOPPED.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-20 Thread Amit Kapila
On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  It seems [WaitForBackgroundWorkerShutdown] has possibility to wait
forever.
  Assume one of the worker is not able to start (not able to attach
  to shared memory or some other reason), then status returned by
  GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
  and after that it can wait forever in WaitLatch.

 I don't think that's right.  The status only remains
 BGWH_NOT_YET_STARTED until the postmaster forks the child process.

I think the control flow can reach the above location before
postmaster could fork all the workers.  Consider a case that
we have launched all workers during ExecutorStart phase
and then before postmaster could start all workers an error
occurs in master backend and then it try to Abort the transaction
and destroy the parallel context, at that moment it will get the
above status and wait forever in above code.

I am able to reproduce above scenario with debugger by using
parallel_seqscan patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-16 Thread Andres Freund
On 2015-01-05 11:27:57 -0500, Robert Haas wrote:
 On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote:
  * I wonder if parallel contexts shouldn't be tracked via resowners
 
 That is a good question.  I confess that I'm somewhat fuzzy about
 which things should be tracked via the resowner mechanism vs. which
 things should have their own bespoke bookkeeping.  However, the
 AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
 which makes merging them seem not particularly clean.

I'm not sure either. But I think the current location is wrong anyway -
during AtEOXact_Parallel() before running user defined queries via
AfterTriggerFireDeferred() seems wrong.

  * combocid.c should error out in parallel mode, as insurance
 
 Eh, which function?  HeapTupleHeaderGetCmin(),
 HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work
 in parallel mode. HeapTupleHeaderAdjustCmax() could
 Assert(!IsInParallelMode()) but the only calls are in heap_update()
 and heap_delete() which already have error checks, so putting yet
 another elog() there seems like overkill.

To me it seems like a good idea, but whatever.

  * I doubt it's a good idea to allow heap_insert, heap_inplace_update,
index_insert. I'm not convinced that those will be handled correct and
relaxing restrictions is easier than adding them.
 
 I'm fine with adding checks to heap_insert() and
 heap_inplace_update().  I'm not sure we really need to add anything to
 index_insert(); how are we going to get there without hitting some
 other prohibited operation first?

I'm not sure. But it's not that hard to imagine that somebody will start
adding codepaths that insert into indexes first. Think upsert.

  * I'd much rather separate HandleParallelMessageInterrupt() in one
variant that just tells the machinery there's a interrupt (called from
signal handlers) and one that actually handles them. We shouldn't even
consider adding any new code that does allocations, errors and such in
signal handlers. That's just a *bad* idea.
 
 That's a nice idea, but I didn't invent the way this crap works today.
 ImmediateInterruptOK gets set to true while performing a lock wait,
 and we need to be able to respond to errors while in that state.  I
 think there's got to be a better way to handle that than force every
 asynchronous operation to have to cope with the fact that
 ImmediateInterruptOK may be set or not set, but as of today that's
 what you have to do.

I personally think it's absolutely unacceptable to add any more of
that. That the current mess works is more luck than anything else - and
I'm pretty sure there's several bugs in it. But since I realize I can't
force you to develop an alternative solution, I tried to implement enough
infrastructure to deal with it without too much work.

As far as I can see this could relatively easily be implemented ontop of
the removal of ImmediateInterruptOK in the patch series I posted
yesterday? Afaics you just need to remove most of
+HandleParallelMessageInterrupt() and replace it by a single SetLatch().
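
A sketch of that split (names and details are assumptions, not the patch as posted): the interrupt path only records the fact and sets the latch, and the queues are drained later from a safe context.

static volatile sig_atomic_t parallel_message_pending = false;

/* interrupt path: async-signal-safe work only */
void
HandleParallelMessageInterrupt(void)
{
    int         save_errno = errno;

    parallel_message_pending = true;
    SetLatch(&MyProc->procLatch);
    errno = save_errno;
}

/* called from a safe place, e.g. via CHECK_FOR_INTERRUPTS()/ProcessInterrupts() */
void
ProcessParallelMessages(void)
{
    if (!parallel_message_pending)
        return;
    parallel_message_pending = false;
    /* ... now it is safe to read the error queues, palloc, ereport, ... */
}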

  * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
much rather place a struct there and be careful about not using
    pointers. That also would obviate the need to reserve some ids.
 
 I don't see how that's going to work with variable-size data
 structures, and a bunch of the things that we need to serialize are
 variable-size.

Meh. Just appending the variable data to the end of the structure and
calculating offsets works just fine.
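
A sketch of that layout (struct and field names are assumptions): a fixed-size header at the start of the segment records offsets to the variable-length payloads appended after it.

typedef struct FixedParallelState
{
    Oid         database_id;
    Oid         authenticated_user_id;
    Size        guc_state_offset;       /* offset of serialized GUC state */
    Size        snapshot_offset;        /* offset of serialized snapshot */
    /* variable-length payloads follow, at the offsets recorded above */
} FixedParallelState;

/* writer side; guc_state_size is the serialized length computed beforehand */
char       *base = dsm_segment_address(seg);
FixedParallelState *fps = (FixedParallelState *) base;

fps->guc_state_offset = MAXALIGN(sizeof(FixedParallelState));
fps->snapshot_offset = fps->guc_state_offset + MAXALIGN(guc_state_size);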

 Moreover, I'm not really interested in rehashing this
 design again.  I know it's not your favorite; you've said that before.

Well, if I keep having to read code using it, I'll keep being annoyed by
it. I guess we both have to live with that.


Greetings,

Andres Freund




Re: [HACKERS] parallel mode and parallel contexts

2015-01-15 Thread Robert Haas
On Thu, Jan 15, 2015 at 9:09 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  +HandleParallelMessages(void)
  +{
  ..
  ..
  + for (i = 0; i < pcxt->nworkers; ++i)
  + {
  + /*
  + * Read messages for as long as we have an error queue; if we
  + * have hit (or hit while reading) ReadyForQuery, this will go to
  + * NULL.
  + */
  + while (pcxt->worker[i].error_mqh != NULL)
  + {
  + shm_mq_result res;
  +
  + CHECK_FOR_INTERRUPTS();
  +
  + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
  + &data, true);
  + if (res == SHM_MQ_SUCCESS)
 
  Here we are checking the error queue for all the workers and this loop
  will continue until all have sent ReadyForQuery() message ('Z') which
  will make this loop continue till all workers have finished their work.
  Assume situation where first worker has completed the work and sent
  'Z' message and second worker is still sending some tuples, now above
  code will keep on waiting for 'Z' message from second worker and won't
  allow to receive tuples sent by second worker till it send 'Z' message.
 
  As each worker send its own 'Z' message after completion, so ideally
  the above code should receive the message only for worker which has
  sent the message.  I think for that it needs worker information who has
  sent the message.

 Are you talking about HandleParallelMessages() or
 WaitForParallelWorkersToFinish()?  The former doesn't wait for
 anything; it just handles any messages that are available now.

 I am talking about HandleParallelMessages().  It doesn't wait but
 it is looping which will make it run for longer time as explained
 above.  Just imagine a case where there are two workers and first
 worker has sent 'Z' message and second worker is doing some
 work, now in such a scenario loop will not finish until second worker
 also send 'Z' message or error.  Am I missing something?

Blah.  You're right.  I intended to write this loop so that it only
runs until shm_mq_receive() returns SHM_MQ_WOULD_BLOCK.  But that's
not what I did.  Will fix.
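
For reference, the intended shape of the loop might be the following sketch, which stops reading a worker's queue as soon as shm_mq_receive() reports SHM_MQ_WOULD_BLOCK (the message-dispatch call is a placeholder):

for (i = 0; i < pcxt->nworkers; ++i)
{
    while (pcxt->worker[i].error_mqh != NULL)
    {
        shm_mq_result res;
        Size        nbytes;
        void       *data;

        res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
                             &data, true /* nowait */ );
        if (res == SHM_MQ_WOULD_BLOCK)
            break;              /* nothing more right now; next worker */
        else if (res == SHM_MQ_SUCCESS)
            DispatchParallelMessage(pcxt, i, nbytes, data); /* placeholder */
        else
            pcxt->worker[i].error_mqh = NULL;   /* queue detached */
    }
}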

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-15 Thread Robert Haas
On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  + seg = dsm_attach(DatumGetInt32(main_arg));
 
  Here, I think DatumGetUInt32() needs to be used instead of
  DatumGetInt32() as the segment handle is uint32.

 OK, I'll change that in the next version.


 No issues, I have another question related to below code:

 +HandleParallelMessages(void)
 +{
 ..
 ..
 + for (i = 0; i < pcxt->nworkers; ++i)
 + {
 + /*
 + * Read messages for as long as we have an error queue; if we
 + * have hit (or hit while reading) ReadyForQuery, this will go to
 + * NULL.
 + */
 + while (pcxt->worker[i].error_mqh != NULL)
 + {
 + shm_mq_result res;
 +
 + CHECK_FOR_INTERRUPTS();
 +
 + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
 + &data, true);
 + if (res == SHM_MQ_SUCCESS)

 Here we are checking the error queue for all the workers and this loop
 will continue until all have sent ReadyForQuery() message ('Z') which
 will make this loop continue till all workers have finished their work.
 Assume situation where first worker has completed the work and sent
 'Z' message and second worker is still sending some tuples, now above
 code will keep on waiting for 'Z' message from second worker and won't
 allow to receive tuples sent by second worker till it send 'Z' message.

 As each worker send its own 'Z' message after completion, so ideally
 the above code should receive the message only for worker which has
 sent the message.  I think for that it needs worker information who has
 sent the message.

Are you talking about HandleParallelMessages() or
WaitForParallelWorkersToFinish()?  The former doesn't wait for
anything; it just handles any messages that are available now.  The
latter does wait for all workers to finish, but the intention is that
you only call it when you're ready to wind up the entire parallel
operation, so that's OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-15 Thread Amit Kapila
On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  + seg = dsm_attach(DatumGetInt32(main_arg));
 
  Here, I think DatumGetUInt32() needs to be used instead of
  DatumGetInt32() as the segment handle is uint32.

 OK, I'll change that in the next version.


No issues, I have another question related to below code:

+HandleParallelMessages(void)
+{
..
..
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ /*
+ * Read messages for as long as we have an error queue; if we
+ * have hit (or hit while reading) ReadyForQuery, this will go to
+ * NULL.
+ */
+ while (pcxt->worker[i].error_mqh != NULL)
+ {
+ shm_mq_result res;
+
+ CHECK_FOR_INTERRUPTS();
+
+ res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
+ &data, true);
+ if (res == SHM_MQ_SUCCESS)


Here we are checking the error queue for all the workers and this loop
will continue until all have sent ReadyForQuery() message ('Z') which
will make this loop continue till all workers have finished their work.
Assume situation where first worker has completed the work and sent
'Z' message and second worker is still sending some tuples, now above
code will keep on waiting for 'Z' message from second worker and won't
allow to receive tuples sent by second worker till it send 'Z' message.

As each worker send its own 'Z' message after completion, so ideally
the above code should receive the message only for worker which has
sent the message.  I think for that it needs worker information who has
sent the message.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-15 Thread Amit Kapila
On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  +HandleParallelMessages(void)
  +{
  ..
  ..
  + for (i = 0; i < pcxt->nworkers; ++i)
  + {
  + /*
  + * Read messages for as long as we have an error queue; if we
  + * have hit (or hit while reading) ReadyForQuery, this will go to
  + * NULL.
  + */
  + while (pcxt->worker[i].error_mqh != NULL)
  + {
  + shm_mq_result res;
  +
  + CHECK_FOR_INTERRUPTS();
  +
  + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
  + &data, true);
  + if (res == SHM_MQ_SUCCESS)
 
  Here we are checking the error queue for all the workers and this loop
  will continue until all have sent ReadyForQuery() message ('Z') which
  will make this loop continue till all workers have finished their work.
  Assume situation where first worker has completed the work and sent
  'Z' message and second worker is still sending some tuples, now above
  code will keep on waiting for 'Z' message from second worker and won't
  allow to receive tuples sent by second worker till it send 'Z' message.
 
  As each worker send its own 'Z' message after completion, so ideally
  the above code should receive the message only for worker which has
  sent the message.  I think for that it needs worker information who has
  sent the message.

 Are you talking about HandleParallelMessages() or
 WaitForParallelWorkersToFinish()?  The former doesn't wait for
 anything; it just handles any messages that are available now.

I am talking about HandleParallelMessages().  It doesn't wait but
it is looping which will make it run for longer time as explained
above.  Just imagine a case where there are two workers and first
worker has sent 'Z' message and second worker is doing some
work, now in such a scenario loop will not finish until second worker
also send 'Z' message or error.  Am I missing something?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-12 Thread Robert Haas
On Wed, Jan 7, 2015 at 3:54 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 Agreed, I was more concerned with calls to nextval(), which don't seem to be
 prevented in parallel mode?

It looks prevented:

/*
 * Forbid this during parallel operation because, to make it work,
 * the cooperating backends would need to share the backend-local cached
 * sequence information.  Currently, we don't support that.
 */
PreventCommandIfParallelMode("nextval()");
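
For reference, that guard is roughly of the following shape (a paraphrase,
not the exact source):

void
PreventCommandIfParallelMode(const char *cmdname)
{
	if (IsInParallelMode())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot execute %s during a parallel operation",
						cmdname)));
}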

 I was more thinking about all the add-on pl's like pl/ruby. But yeah, I
 don't see that there's much we can do if they're not using SPI.

Actually, there is: it's forbidden at the layer of heap_insert(),
heap_update(), heap_delete().  It's hard to imagine anyone trying to
modify the database at a lower level than that.
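
For illustration, the heap-level guards referred to here are roughly of this
shape (paraphrased; the exact message text in the patch may differ), placed
at the top of heap_update() and its siblings:

	if (IsInParallelMode())
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
				 errmsg("cannot update tuples during a parallel operation")));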

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-12 Thread Robert Haas
On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 + seg = dsm_attach(DatumGetInt32(main_arg));

 Here, I think DatumGetUInt32() needs to be used instead of
 DatumGetInt32() as the segment handle is uint32.

OK, I'll change that in the next version.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-08 Thread Amit Kapila
On Wed, Jan 7, 2015 at 11:03 PM, Robert Haas robertmh...@gmail.com wrote:

 I have little doubt that this version is still afflicted with various
 bugs, and the heavyweight locking issue remains to be dealt with, but
 on the whole I think this is headed in the right direction.


+ParallelMain(Datum main_arg)
{
..
+ /*
+ * Now that we have a resource owner, we can attach to the dynamic
+ * shared memory segment and read the table of contents.
+ */
+ seg = dsm_attach(DatumGetInt32(main_arg));

Here, I think DatumGetUInt32() needs to be used instead of
DatumGetInt32() as the segment handle is uint32.
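
In other words, since a dsm segment handle is a uint32, the call would
become the line below (with UInt32GetDatum() used correspondingly where the
handle is packed into the worker's main argument):

	seg = dsm_attach(DatumGetUInt32(main_arg));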

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-07 Thread Simon Riggs
On 6 January 2015 at 21:37, Simon Riggs si...@2ndquadrant.com wrote:

 I get it now and agree

Yes, very much.

Should we copy both the top-level frame and the current subxid?
Hot Standby links subxids directly to the top-level, so this would be similar.

If we copied both, we wouldn't need to special case the Get functions.
It would still be O(1).

 but please work some more on clarity of
 README, comments and variable naming!

Something other than top sounds good.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2015-01-07 Thread Robert Haas
On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs si...@2ndquadrant.com wrote:
 So when you say "Only the top frame of the transaction state stack is
 copied" you don't mean the top, you mean the bottom (the latest
 subxact)? Which then becomes the top in the parallel worker? OK...

The item most recently added to the stack is properly called the top,
but I guess it's confusing in this case because the item on the bottom
of the stack is referred to as the TopTransaction.  I'll see if I can
rephrase that.

 Those comments really belong in a README, not the first visible
 comment in xact.c

OK.

 You need to start with the explanation that parallel workers have a
 faked-up xact stack to make it easier to copy and manage. That is
 valid because we never change xact state during a worker operation.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-07 Thread Jim Nasby

On 1/7/15, 9:39 AM, Robert Haas wrote:

sequence.c: Is it safe to read a sequence value in a parallel worker if the
cache_value is > 1?

No, because the sequence cache isn't synchronized between the workers.
Maybe it would be safe if cache_value == 1, but there's not much
use-case: how often are you going to have a read-only query that uses
a sequence value.  At some point we can look at making sequences
parallel-safe, but worrying about it right now doesn't seem like a
good use of time.


Agreed, I was more concerned with calls to nextval(), which don't seem to be 
prevented in parallel mode?


This may be a dumb question, but for functions do we know that all pl's
besides C and SQL use SPI? If not I think they could end up writing in a
worker.

Well, the lower-level checks would catch that.  But it is generally
true that there's no way to prevent arbitrary C code from doing things
that are unsafe in parallel mode and that we can't tell are unsafe.
As I've said before, I think that we'll need to have a method of
labeling functions as parallel-safe or not, but this patch isn't
trying to solve that part of the problem.


I was more thinking about all the add-on pl's like pl/ruby. But yeah, I don't 
see that there's much we can do if they're not using SPI.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] parallel mode and parallel contexts

2015-01-07 Thread Robert Haas
On Tue, Jan 6, 2015 at 9:37 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 CreateParallelContext(): Does it actually make sense to have nworkers=0?
 ISTM that's a bogus case.

I'm not sure whether we'll ever use the zero-worker case in
production, but I've certainly found it useful for
performance-testing.

 Also, since the number of workers will normally be
 determined dynamically by the planner, should that check be a regular
 conditional instead of just an Assert?

I don't think that's really necessary.  It shouldn't be too much of a
stretch for the planner to come up with a non-negative value.

 In LaunchParallelWorkers() the Start Workers comment states that we give
 up registering more workers if one fails to register, but there's nothing in
 the if condition to do that, and I don't see
 RegisterDynamicBackgroundWorker() doing it either. Is the comment just
 incorrect?

Woops, that got changed at some point and I forgot to update the
comment.  Will fix.

 SerializeTransactionState(): instead of looping through the transaction
 stack to calculate nxids, couldn't we just set it to maxsize -
 sizeof(TransactionId) * 3? If we're looping a second time for safety a
 comment explaining that would be useful...

Yeah, I guess we could do that.  I don't think it really matters very
much one way or the other.

 sequence.c: Is it safe to read a sequence value in a parallel worker if the
 cache_value is > 1?

No, because the sequence cache isn't synchronized between the workers.
Maybe it would be safe if cache_value == 1, but there's not much
use-case: how often are you going to have a read-only query that uses
a sequence value.  At some point we can look at making sequences
parallel-safe, but worrying about it right now doesn't seem like a
good use of time.

 This may be a dumb question, but for functions do we know that all pl's
 besides C and SQL use SPI? If not I think they could end up writing in a
 worker.

Well, the lower-level checks would catch that.  But it is generally
true that there's no way to prevent arbitrary C code from doing things
that are unsafe in parallel mode and that we can't tell are unsafe.
As I've said before, I think that we'll need to have a method of
labeling functions as parallel-safe or not, but this patch isn't
trying to solve that part of the problem.

 @@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
  errmsg("canceling statement due to user request")));
 }
 }
 -   /* If we get here, do nothing (probably, QueryCancelPending was
 reset) */
 +   if (ParallelMessagePending)
 +   HandleParallelMessageInterrupt(false);
 ISTM it'd be good to leave that comment in place (after the if).

 RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be
 &&?  AIUI both should always be either set or 0.

Fixed.

 Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a parallel operation");

Fixed.

 The comment for RestoreSnapshot refers to set_transaction_snapshot, but that
 doesn't actually exist (or isn't referenced).

Fixed.

Will post a new version in a bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-07 Thread Simon Riggs
On 7 January 2015 at 13:11, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs si...@2ndquadrant.com wrote:
  So when you say "Only the top frame of the transaction state stack is
  copied" you don't mean the top, you mean the bottom (the latest
  subxact)? Which then becomes the top in the parallel worker? OK...

 The item most recently added to the stack is properly called the top,
 but I guess it's confusing in this case because the item on the bottom
 of the stack is referred to as the TopTransaction.  I'll see if I can
 rephrase that.

Yes, I didn't mean to suggest the confusion was introduced by you -
it's just you stepped into the mess by correctly using the word top in
a place where its meaning would be opposite to the close-by usage of
TopTransaction.

Anyway, feeling good about this now. Thanks for your patience.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2015-01-06 Thread Simon Riggs
On 6 January 2015 at 16:33, Robert Haas robertmh...@gmail.com wrote:

 These comments don’t have any explanation or justification

 OK, I rewrote them.  Hopefully it's better now.

Thanks for new version and answers.

 * Transaction stuff strikes me as overcomplicated and error prone.
 Given there is no explanation or justification for most of it, I’d be
 inclined to question why it's there

 Gosh, I was pretty pleased with how simple the transaction integration
 turned out to be.  Most of what's there right now is either (a) code
 to copy state from the master to the parallel workers or (b) code to
 throw errors if the workers try to things that aren't safe.  I suspect
 there are a few things missing, but I don't see anything there that
 looks unnecessary.

If you can explain it in more detail in comments and README, I may
agree. At present, I don't get it and it makes me nervous.

The comment says
"Only the top frame of the transaction state stack is copied to a parallel
worker"
but I'm not sure why.

Top meaning the current subxact or the main xact?
If main, why do we need XactTopTransactionId?


 This is very impressive, but it looks like we’re trying to lift too
 much weight on the first lift. If we want to go anywhere near this, we
 need to have very clear documentation about how and why its like that.
 I’m actually very sorry to say that because the review started very
 well, much much better than most.

 When I posted the group locking patch, it got criticized because it
 didn't actually do anything useful by itself; similarly, the
 pg_background patch was criticized for not being a large enough step
 toward parallelism.  So, this time, I posted something more
 comprehensive.  I don't think it's quite complete yet.  I expect a
 committable version of this patch to be maybe another 500-1000 lines
 over what I have here right now -- I think it needs to do something
 about heavyweight locking, and I expect that there are some unsafe
 things that aren't quite prohibited yet.  But the current patch is
 only 2300 lines, which is not astonishingly large for a feature of
 this magnitude; if anything, I'd say it's surprisingly small, due to a
 year and a half of effort laying the necessary groundwork via long
 series of preliminary commits.  I'm not unwilling to divide this up
 some more if we can agree on a way to do that that makes sense, but I
 think we're nearing the point where we need to take the plunge and
 say, look, this is version one of parallelism.  Thunk.

I want this also; the only debate is where to draw the line and please
don't see that as criticism.

I'm very happy it's so short, I agree it could be longer.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2015-01-06 Thread Robert Haas
On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs si...@2ndquadrant.com wrote:
 If you can explain it in more detail in comments and README, I may
 agree. At present, I don't get it and it makes me nervous.

 The comment says
 "Only the top frame of the transaction state stack is copied to a parallel
 worker"
 but I'm not sure why.

 Top meaning the current subxact or the main xact?
 If main, why do we need XactTopTransactionId?

Current subxact.

I initially thought of copying the entire TransactionStateData stack,
but Heikki suggested (via IM) that I do it this way instead.  I
believe his concern was that it's never valid to commit or roll back
to a subtransaction that is not at the top of the stack, and if you
don't copy the stack, you avoid the risk of somehow ending up in that
state.  Also, you avoid having to invent resource owners for
(sub)transactions that don't really exist in the current process.  On
the other hand, you do end up with a few special cases that wouldn't
exist with the other approach.  Still, I'm pretty happy to have taken
Heikki's advice: it was certainly simple to implement this way, plus
hopefully that way at least one person likes what I ended up with.
:-)

What else needs clarification?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-06 Thread Simon Riggs
On 6 January 2015 at 21:01, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs si...@2ndquadrant.com wrote:
 If you can explain it in more detail in comments and README, I may
 agree. At present, I don't get it and it makes me nervous.

  The comment says
  "Only the top frame of the transaction state stack is copied to a parallel
  worker"
  but I'm not sure why.

 Top meaning the current subxact or the main xact?
  If main, why do we need XactTopTransactionId?

 Current subxact.

TopTransactionStateData points to the top-level transaction data, but
XactTopTransactionId points to the current subxact.

So when you say "Only the top frame of the transaction state stack is
copied" you don't mean the top, you mean the bottom (the latest
subxact)? Which then becomes the top in the parallel worker? OK...

 I initially thought of copying the entire TransactionStateData stack,
 but Heikki suggested (via IM) that I do it this way instead.  I
 believe his concern was that it's never valid to commit or roll back
 to a subtransaction that is not at the top of the stack, and if you
 don't copy the stack, you avoid the risk of somehow ending up in that
 state.  Also, you avoid having to invent resource owners for
 (sub)transactions that don't really exist in the current process.  On
 the other hand, you do end up with a few special cases that wouldn't
 exist with the other approach.  Still, I'm pretty happy to have taken
 Heikki's advice: it was certainly simple to implement this way, plus
 hopefully that way at least one person likes what I ended up with.
 :-)

 What else needs clarification?

Those comments really belong in a README, not the first visible
comment in xact.c

You need to start with the explanation that parallel workers have a
faked-up xact stack to make it easier to copy and manage. That is
valid because we never change xact state during a worker operation.

I get it now and agree, but please work some more on clarity of
README, comments and variable naming!

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2015-01-06 Thread Jim Nasby

On 1/6/15, 10:33 AM, Robert Haas wrote:

Entrypints?

Already noted by Andres; fixed in the attached version.


Perhaps we only parallelize while drinking beer... ;)

CreateParallelContext(): Does it actually make sense to have nworkers=0? ISTM 
that's a bogus case. Also, since the number of workers will normally be 
determined dynamically by the planner, should that check be a regular 
conditional instead of just an Assert?

In LaunchParallelWorkers() the Start Workers comment states that we give up 
registering more workers if one fails to register, but there's nothing in the if 
condition to do that, and I don't see RegisterDynamicBackgroundWorker() doing it either. 
Is the comment just incorrect?

SerializeTransactionState(): instead of looping through the transaction stack 
to calculate nxids, couldn't we just set it to maxsize - sizeof(TransactionId) 
* 3? If we're looping a second time for safety a comment explaining that would 
be useful...

sequence.c: Is it safe to read a sequence value in a parallel worker if the 
cache_value is > 1?

This may be a dumb question, but for functions do we know that all pl's besides 
C and SQL use SPI? If not I think they could end up writing in a worker.

@@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
 errmsg("canceling statement due to user request")));
}
}
-   /* If we get here, do nothing (probably, QueryCancelPending was reset) 
*/
+   if (ParallelMessagePending)
+   HandleParallelMessageInterrupt(false);
ISTM it'd be good to leave that comment in place (after the if).

RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be &&? 
AIUI both should always be either set or 0.
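
(Spelled out, the suggested check would presumably read something like the
sketch below; whether RestoreComboCIDState() should assert exactly this is
an assumption.)

	/* A worker restoring combo CID state should not have created any yet. */
	Assert(comboCids == NULL && comboHash == NULL);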

Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a parallel operation");

The comment for RestoreSnapshot refers to set_transaction_snapshot, but that 
doesn't actually exist (or isn't referenced).
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] parallel mode and parallel contexts

2015-01-05 Thread Robert Haas
On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote:
 A couple remarks:
 * Shouldn't this provide more infrastructure to deal with the fact that
   we might get less parallel workers than we had planned for?

Maybe, but I'm not really sure what that should look like.  My working
theory is that individual parallel applications (executor nodes, or
functions that use parallelism internally, or whatever) should be
coded in such a way that they work correctly even if the number of
workers that starts is smaller than what they requested, or even zero.
It may turn out that this is impractical in some cases, but I don't
know what those cases are yet so I don't know what the common threads
will be.

I think parallel seq scan should work by establishing N tuple queues
(where N is the number of workers we have).  When asked to return a
tuple, the master should poll all of those queues one after another
until it finds one that contains a tuple.  If none of them contain a
tuple then it should claim a block to scan and return a tuple from
that block.  That way, if there are enough running workers to keep up
with the master's demand for tuples, the master won't do any of the
actual scan itself.  But if the workers can't keep up -- e.g. suppose
90% of the CPU consumed by the query is in the filter qual for the
scan -- then the master can pitch in along with everyone else.  As a
non-trivial fringe benefit, if the workers don't all start, or take a
while to initialize, the user backend isn't stalled meanwhile.
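
For illustration only, a rough sketch of that polling loop;
ParallelScanState, try_receive_tuple() and scan_next_block_locally() are
hypothetical names used for the sketch, not part of any posted patch:

static HeapTuple
parallel_next_tuple(ParallelScanState *pss)
{
	for (;;)
	{
		int			i;
		HeapTuple	tup;

		/* First try each worker's tuple queue, without blocking. */
		for (i = 0; i < pss->nworkers; ++i)
		{
			tup = try_receive_tuple(pss, i);	/* hypothetical non-blocking read */
			if (tup != NULL)
				return tup;
		}

		/* Nothing queued: the master claims the next block and scans it. */
		tup = scan_next_block_locally(pss);		/* hypothetical; NULL when no blocks left */
		if (tup != NULL)
			return tup;

		/* No blocks left: we're done once every worker has detached. */
		if (pss->all_workers_finished)
			return NULL;

		/* (A real implementation would wait on a latch here rather than spin.) */
	}
}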

 * I wonder if parallel contexts shouldn't be tracked via resowners

That is a good question.  I confess that I'm somewhat fuzzy about
which things should be tracked via the resowner mechanism vs. which
things should have their own bespoke bookkeeping.  However, the
AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
which makes merging them seem not particularly clean.

 * combocid.c should error out in parallel mode, as insurance

Eh, which function?  HeapTupleHeaderGetCmin(),
HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work
in parallel mode. HeapTupleHeaderAdjustCmax() could
Assert(!IsInParallelMode()) but the only calls are in heap_update()
and heap_delete() which already have error checks, so putting yet
another elog() there seems like overkill.

 * I doubt it's a good idea to allow heap_insert, heap_inplace_update,
   index_insert. I'm not convinced that those will be handled correctly and
   relaxing restrictions is easier than adding them.

I'm fine with adding checks to heap_insert() and
heap_inplace_update().  I'm not sure we really need to add anything to
index_insert(); how are we going to get there without hitting some
other prohibited operation first?

 * I'd much rather separate HandleParallelMessageInterrupt() in one
   variant that just tells the machinery there's an interrupt (called from
   signal handlers) and one that actually handles them. We shouldn't even
   consider adding any new code that does allocations, errors and such in
   signal handlers. That's just a *bad* idea.

That's a nice idea, but I didn't invent the way this crap works today.
ImmediateInterruptOK gets set to true while performing a lock wait,
and we need to be able to respond to errors while in that state.  I
think there's got to be a better way to handle that than force every
asynchronous operation to have to cope with the fact that
ImmediateInterruptOK may be set or not set, but as of today that's
what you have to do.

 * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
   much rather place a struct there and be careful about not using
   pointers. That also would obviate the need to reserve some ids.

I don't see how that's going to work with variable-size data
structures, and a bunch of the things that we need to serialize are
variable-size.  Moreover, I'm not really interested in rehashing this
design again.  I know it's not your favorite; you've said that before.
But it makes it possible to write code to do useful things in
parallel, a capability that we do not otherwise have.  And I like it
fine.

 * Doesn't the restriction in GetSerializableTransactionSnapshotInt()
   apply for repeatable read just the same?

Yeah.  I'm not sure whether we really need that check at all, because
there is a check in GetTransactionSnapshot() that probably checks the
same thing.

 * I'm not a fan of relying on GetComboCommandId() to restore things in
   the same order as before.

I thought that was a little wonky, but it's not likely to break, and
there is an elog(ERROR) there to catch it if it does, so I'd rather
not make it more complicated.

 * I'd say go ahead and commit the BgworkerByOid thing
   earlier/independently. I've seen need for that a couple times.

Woohoo!  I was hoping someone would say that.

 * s/entrypints/entrypoints/

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] parallel mode and parallel contexts

2015-01-05 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com wrote:
 * I wonder if parallel contexts shouldn't be tracked via resowners

 That is a good question.  I confess that I'm somewhat fuzzy about
 which things should be tracked via the resowner mechanism vs. which
 things should have their own bespoke bookkeeping.  However, the
 AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(),
 which makes merging them seem not particularly clean.

FWIW, the resowner mechanism was never meant as a substitute for bespoke
bookkeeping.  What it is is a helper mechanism to reduce the need for
PG_TRY blocks that guarantee that a resource-releasing function will be
called even in error paths.  I'm not sure whether that analogy applies
well in parallel-execution cases.

regards, tom lane




Re: [HACKERS] parallel mode and parallel contexts

2015-01-05 Thread Simon Riggs
On 22 December 2014 at 19:14, Robert Haas robertmh...@gmail.com wrote:

 Here is another new version, with lots of bugs fixed.

An initial blind review, independent of other comments already made on thread.

OK src/backend/access/heap/heapam.c
heapam.c prohibitions on update and delete look fine

OK src/backend/access/transam/Makefile

OK src/backend/access/transam/README.parallel
README.parallel and all concepts look good

PARTIAL src/backend/access/transam/parallel.c
wait_patiently mentioned in comment but doesn’t exist
Why not make nworkers into a uint?
Trampoline? Really? I think we should define what we mean by that,
somewhere, rather than just use the term as if it was self-evident.
Entrypints?
..

OK src/backend/access/transam/varsup.c

QUESTIONS  src/backend/access/transam/xact.c
These comments don’t have any explanation or justification

+ * This stores XID of the toplevel transaction to which we belong.
+ * Normally, it's the same as TopTransactionState.transactionId, but in
+ * a parallel worker it may not be.

+ * If we're a parallel worker, there may be additional XIDs that we need
+ * to regard as part of our transaction even though they don't appear
+ * in the transaction state stack.

This comment is copied-and-pasted too many times for safety and elegance
+   /*
+    * Workers synchronize transaction state at the beginning of each parallel
+    * operation, so we can't account for transaction state change after that
+    * point.  (Note that this check will certainly error out if s->blockState
+    * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
+    * below.)
+    */
We need a single Check() that contains more detail and comments within

Major comments

* Parallel stuff at start sounded OK
* Transaction stuff strikes me as overcomplicated and error prone.
Given there is no explanation or justification for most of it, I’d be
inclined to question why it's there
* If we have a trampoline for DSM, why don’t we use a trampoline for
snapshots, then you wouldn’t need to do all this serialize stuff

This is very impressive, but it looks like we’re trying to lift too
much weight on the first lift. If we want to go anywhere near this, we
need to have very clear documentation about how and why its like that.
I’m actually very sorry to say that because the review started very
well, much much better than most.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2015-01-05 Thread Amit Kapila
On Mon, Jan 5, 2015 at 9:57 PM, Robert Haas robertmh...@gmail.com wrote:

 On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund and...@2ndquadrant.com
wrote:

  * Doesn't the restriction in GetSerializableTransactionSnapshotInt()
apply for repeatable read just the same?

 Yeah.  I'm not sure whether we really need that check at all, because
 there is a check in GetTransactionSnapshot() that probably checks the
 same thing.


The check in GetSerializableTransactionSnapshotInt() will also prohibit
any user/caller of SetSerializableTransactionSnapshot() in parallel mode
as that can't be prohibited by GetTransactionSnapshot().


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2015-01-05 Thread Robert Haas
On Fri, Jan 2, 2015 at 9:04 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 While working on the parallel seq-scan patch to adapt it to this framework,
 I noticed a few things and have questions regarding the same.

 1.
 Currently the parallel worker just attaches to the error queue; for the
 tuple queue, do you expect that to be done in the same place or in the
 caller-supplied function?  If the latter, then we need the segment address
 as an input to that function so it can attach the queue to the segment
 (shm_mq_attach()).
 Another question I have in this regard: if we have redirected messages to
 the error queue by using pq_redirect_to_shm_mq, then how can we set up the
 tuple queue for the same purpose?  Similarly, I think more handling is
 needed for the tuple queue in the master backend, and the answer to the
 above will dictate the best way to do it.

I've come to the conclusion that it's a bad idea for tuples to be sent
through the same queue as errors.  We want errors (or notices, but
especially errors) to be processed promptly, but there may be a
considerable delay in processing tuples.  For example, imagine a plan
that looks like this:

Nested Loop
-> Parallel Seq Scan on p
-> Index Scan on q
Index Scan: q.x = p.x

The parallel workers should fill up the tuple queues used by the
parallel seq scan so that the master doesn't have to do any of that
work itself.  Therefore, the normal situation will be that those tuple
queues are all full.  If an error occurs in a worker at that point, it
can't add it to the tuple queue, because the tuple queue is full.  But
even if it could do that, the master then won't notice the error until
it reads all of the queued-up tuple messages that are in the queue in
front of the error, and maybe some messages from the other queues as
well, since it probably round-robins between the queues or something
like that.  Basically, it could do a lot of extra work before noticing
that error in there.

Now we could avoid that by having the master read messages from the
queue immediately and just save them off to local storage if they
aren't error messages.  But that's not very desirable either, because
now we have no flow-control.  The workers will just keep spamming
tuples that the master isn't ready for into the queues, and the master
will keep reading them and saving them to local storage, and
eventually it will run out of memory and die.

We could engineer some solution to this problem, of course, but it
seems quite a bit simpler to just have two queues.  The error queues
don't need to be very big (I made them 16kB, which is trivial on any
system on which you care about having working parallelism) and the
tuple queues can be sized as needed to avoid pipeline stalls.
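
For illustration, a rough sketch of the per-worker setup this implies on the
master side, using the existing shm_mq/shm_toc API; the toc keys and queue
sizes shown here are assumptions made for the sketch, not values taken from
the patch:

#define PARALLEL_KEY_ERROR_QUEUE	1		/* assumed toc keys */
#define PARALLEL_KEY_TUPLE_QUEUE	2
#define ERROR_QUEUE_SIZE			16384	/* small: errors and notices only */
#define TUPLE_QUEUE_SIZE			65536	/* sized to avoid pipeline stalls */

	shm_mq	   *error_mq;
	shm_mq	   *tuple_mq;
	shm_mq_handle *error_mqh;
	shm_mq_handle *tuple_mqh;

	/* (shm_toc_estimate_chunk/keys calls omitted for brevity) */
	error_mq = shm_mq_create(shm_toc_allocate(pcxt->toc, ERROR_QUEUE_SIZE),
							 ERROR_QUEUE_SIZE);
	tuple_mq = shm_mq_create(shm_toc_allocate(pcxt->toc, TUPLE_QUEUE_SIZE),
							 TUPLE_QUEUE_SIZE);
	shm_toc_insert(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, error_mq);
	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLE_QUEUE, tuple_mq);

	/* The master reads both queues; the worker attaches as the sender. */
	shm_mq_set_receiver(error_mq, MyProc);
	shm_mq_set_receiver(tuple_mq, MyProc);
	error_mqh = shm_mq_attach(error_mq, pcxt->seg, NULL);
	tuple_mqh = shm_mq_attach(tuple_mq, pcxt->seg, NULL);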

 2.
 Currently there is no interface for wait_for_workers_to_become_ready()
 in your patch.  Don't you think it is important that, before we start
 fetching tuples, we make sure all workers have started?  What if some
 worker fails to start?

I think that, in general, getting the most benefit out of parallelism
means *avoiding* situations where backends have to wait for each
other.  If the relation being scanned is not too large, the user
backend might be able to finish the whole scan - or a significant
fraction of it - before the workers initialize.  Of course in that
case it might have been a bad idea to parallelize in the first place,
but we should still try to make the best of the situation.  If some
worker fails to start, then instead of having the full degree N
parallelism we were hoping for, we have some degree K < N, so things
will take a little longer, but everything should still work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] parallel mode and parallel contexts

2015-01-03 Thread Andres Freund
On 2014-12-22 14:14:31 -0500, Robert Haas wrote:
 On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas robertmh...@gmail.com wrote:
  And here is a new version.

 Here is another new version, with lots of bugs fixed.

A couple remarks:
* Shouldn't this provide more infrastructure to deal with the fact that
  we might get less parallel workers than we had planned for?
* I wonder if parallel contexts shouldn't be tracked via resowners
* combocid.c should error out in parallel mode, as insurance
* I doubt it's a good idea to allow heap_insert, heap_inplace_update,
  index_insert. I'm not convinced that those will be handled correctly and
  relaxing restrictions is easier than adding them.
* I'd much rather separate HandleParallelMessageInterrupt() in one
  variant that just tells the machinery there's an interrupt (called from
  signal handlers) and one that actually handles them. We shouldn't even
  consider adding any new code that does allocations, errors and such in
  signal handlers. That's just a *bad* idea.
* I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd
  much rather place a struct there and be careful about not using
  pointers. That also would obviate the need to reserve some ids.
* Doesn't the restriction in GetSerializableTransactionSnapshotInt()
  apply for repeatable read just the same?
* I'm not a fan of relying on GetComboCommandId() to restore things in
  the same order as before.
* I'd say go ahead and commit the BgworkerByOid thing
  earlier/independently. I've seen need for that a couple times.
* s/entrypints/entrypoints/

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2015-01-02 Thread Amit Kapila
On Thu, Dec 18, 2014 at 1:23 AM, Robert Haas robertmh...@gmail.com wrote:


 In the meantime, I had a good chat with Heikki on IM yesterday which
 gave me some new ideas on how to fix up the transaction handling in
 here, which I am working on implementing.  So hopefully I will have
 that by then.

 I am also hoping Amit will be adapting his parallel seq-scan patch to
 this framework.


While working on the parallel seq-scan patch to adapt it to this framework,
I noticed a few things and have questions regarding the same.

1.
Currently the parallel worker just attaches to the error queue; for the
tuple queue, do you expect that to be done in the same place or in the
caller-supplied function?  If the latter, then we need the segment address
as an input to that function so it can attach the queue to the segment
(shm_mq_attach()).
Another question I have in this regard: if we have redirected messages to
the error queue by using pq_redirect_to_shm_mq, then how can we set up the
tuple queue for the same purpose?  Similarly, I think more handling is
needed for the tuple queue in the master backend, and the answer to the
above will dictate the best way to do it.

2.
Currently there is no interface for wait_for_workers_to_become_ready()
in your patch.  Don't you think it is important that, before we start
fetching tuples, we make sure all workers have started?  What if some
worker fails to start?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] parallel mode and parallel contexts

2014-12-21 Thread Michael Paquier
On Thu, Dec 18, 2014 at 4:53 AM, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 December 2014 at 22:52, Robert Haas robertmh...@gmail.com wrote:

 I would be remiss if I failed to mention that this patch includes work
 by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
 well as my former colleague Noah Misch; and that it would not have
 been possible without the patient support of EnterpriseDB management.

 Noted and thanks to all.

 I will make this my priority for review, but regrettably not until New
 Year, about 2.5-3 weeks away.

 OK!

 In the meantime, I had a good chat with Heikki on IM yesterday which
 gave me some new ideas on how to fix up the transaction handling in
 here, which I am working on implementing.  So hopefully I will have
 that by then.

 I am also hoping Amit will be adapting his parallel seq-scan patch to
 this framework.
Simon, if you are planning to review this patch soon, could you add
your name as a reviewer for this patch in the commit fest app?
Thanks,
-- 
Michael




Re: [HACKERS] parallel mode and parallel contexts

2014-12-19 Thread Robert Haas
On Wed, Dec 17, 2014 at 2:53 PM, Robert Haas robertmh...@gmail.com wrote:
 In the meantime, I had a good chat with Heikki on IM yesterday which
 gave me some new ideas on how to fix up the transaction handling in
 here, which I am working on implementing.  So hopefully I will have
 that by then.

And here is a new version.  This version has some real integration
with the transaction system, along the lines of what Heikki suggested
to me, so hopefully tuple visibility calculations in a parallel worker
will now produce the right answers, though I need to integrate this
with code to actually do something-or-other in parallel in order to
really test that.  There are still some problems with parallel worker
shutdown.  As hard as I tried to fight it, I'm gradually resigning
myself to the fact that we're probably going to have to set things up
so that the worker waits for all of its children to die during abort
processing (and during commit processing, but that's not bothersome).
Otherwise, to take just one example, chaos potentially ensues if you
run a parallel query in a subtransaction and then roll back to a
savepoint.  But this version just kills the workers and doesn't
actually wait for them to die; I'll see about fixing that, but wanted
to send this around for comments first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/parallel_dummy/Makefile b/contrib/parallel_dummy/Makefile
new file mode 100644
index 000..de00f50
--- /dev/null
+++ b/contrib/parallel_dummy/Makefile
@@ -0,0 +1,19 @@
+MODULE_big = parallel_dummy
+OBJS = parallel_dummy.o $(WIN32RES)
+PGFILEDESC = "parallel_dummy - dummy use of parallel infrastructure"
+
+EXTENSION = parallel_dummy
+DATA = parallel_dummy--1.0.sql
+
+REGRESS = parallel_dummy
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/parallel_dummy
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/parallel_dummy/parallel_dummy--1.0.sql b/contrib/parallel_dummy/parallel_dummy--1.0.sql
new file mode 100644
index 000..2a7251c
--- /dev/null
+++ b/contrib/parallel_dummy/parallel_dummy--1.0.sql
@@ -0,0 +1,7 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION parallel_dummy" to load this file. \quit
+
+CREATE FUNCTION parallel_dummy(sleep_time pg_catalog.int4,
+			  nworkers pg_catalog.int4)
+RETURNS pg_catalog.void STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/contrib/parallel_dummy/parallel_dummy.c b/contrib/parallel_dummy/parallel_dummy.c
new file mode 100644
index 000..99b830f
--- /dev/null
+++ b/contrib/parallel_dummy/parallel_dummy.c
@@ -0,0 +1,82 @@
+/*--
+ *
+ * parallel_dummy.c
+ *		Test harness code for parallel mode code.
+ *
+ * Copyright (C) 2013-2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/parallel_dummy/parallel_dummy.c
+ *
+ * -
+ */
+
+#include "postgres.h"
+
+#include "access/parallel.h"
+#include "access/xact.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(parallel_dummy);
+
+#define		PARALLEL_DUMMY_KEY			1
+
+typedef struct {
+	int32		sleep_time;
+} ParallelDummyInfo;
+
+void		_PG_init(void);
+void		worker_main(shm_toc *toc);
+
+Datum
+parallel_dummy(PG_FUNCTION_ARGS)
+{
+	int32		sleep_time = PG_GETARG_INT32(0);
+	int32		nworkers = PG_GETARG_INT32(1);
+	bool		already_in_parallel_mode = IsInParallelMode();
+	ParallelContext *pcxt;
+	ParallelDummyInfo *info;
+
+	if (nworkers < 1)
+		ereport(ERROR,
+				(errmsg("number of parallel workers must be positive")));
+
+	if (!already_in_parallel_mode)
+		EnterParallelMode();
+
+	pcxt = CreateParallelContextForExtension("parallel_dummy", "worker_main",
+			 nworkers);
+	shm_toc_estimate_chunk(&pcxt->estimator, sizeof(ParallelDummyInfo));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	InitializeParallelDSM(pcxt);
+	info = shm_toc_allocate(pcxt->toc, sizeof(ParallelDummyInfo));
+	info->sleep_time = sleep_time;
+	shm_toc_insert(pcxt->toc, PARALLEL_DUMMY_KEY, info);
+	LaunchParallelWorkers(pcxt);
+
+	/* here's where we do the real work ... */
+	DirectFunctionCall1(pg_sleep, Float8GetDatum((float8) sleep_time));
+
+	DestroyParallelContext(pcxt);
+
+	if (!already_in_parallel_mode)
+		ExitParallelMode();
+
+	PG_RETURN_VOID();
+}
+
+void
+worker_main(shm_toc *toc)
+{
+	ParallelDummyInfo *info;
+
+	info = shm_toc_lookup(toc, PARALLEL_DUMMY_KEY);
+	Assert(info != NULL);
+
+	/* here's where we do the real work ... */
+	DirectFunctionCall1(pg_sleep, Float8GetDatum((float8) info->sleep_time));
+}
diff --git a/contrib/parallel_dummy/parallel_dummy.control b/contrib/parallel_dummy/parallel_dummy.control
new file 

Re: [HACKERS] parallel mode and parallel contexts

2014-12-17 Thread Simon Riggs
On 12 December 2014 at 22:52, Robert Haas robertmh...@gmail.com wrote:

 I would be remiss if I failed to mention that this patch includes work
 by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
 well as my former colleague Noah Misch; and that it would not have
 been possible without the patient support of EnterpriseDB management.

Noted and thanks to all.

I will make this my priority for review, but regrettably not until New
Year, about 2.5-3 weeks away.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] parallel mode and parallel contexts

2014-12-17 Thread Robert Haas
On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 12 December 2014 at 22:52, Robert Haas robertmh...@gmail.com wrote:

 I would be remiss if I failed to mention that this patch includes work
 by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as
 well as my former colleague Noah Misch; and that it would not have
 been possible without the patient support of EnterpriseDB management.

 Noted and thanks to all.

 I will make this my priority for review, but regrettably not until New
 Year, about 2.5-3 weeks away.

OK!

In the meantime, I had a good chat with Heikki on IM yesterday which
gave me some new ideas on how to fix up the transaction handling in
here, which I am working on implementing.  So hopefully I will have
that by then.

I am also hoping Amit will be adapting his parallel seq-scan patch to
this framework.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

