Re: [HACKERS] asynchronous execution

2017-10-20 Thread Kyotaro HORIGUCHI
Hello.

A fully-asynchronous executor requires every node to be stateful and
suspendable at the point where it requests the next tuple from the
nodes underneath it. I tried a pure push-based executor but failed.

After the miserable patch upthread, I finally managed to make
executor nodes suspendable using computed jumps and got rid of
recursive calls in the executor. But it runs about 10x slower for
the simple SeqScan case (pgbench ran with 9% degradation), and the
loss doesn't seem recoverable by handy improvements. So I gave up
on that.

Then I returned to single-level asynchrony, in other words, the
simple case with async-aware nodes just above async-capable
nodes. The motive for the framework in the previous patch was
that polluting ExecProcNode with async stuff degraded the sync
(or normal) code paths; as Tom suggested, the node->ExecProcNode
trick can isolate the async code path.

The attached PoC patch theoretically has no impact on the normal
code paths and only brings gains in async cases. (The additional
members in PlanState did cause a slight degradation, though,
seemingly due to alignment.)

But I haven't obtained sufficiently stable results from the
performance tests; different builds of the same source code give
apparently different results...

Anyway, I'll show the best of several runs here.

                           original(ms)  patched(ms)  gain(%)
A: simple table scan     :      9714.70      9656.73      0.6
B: local partitioning    :      4119.44      4131.10     -0.3
C: single remote table   :      9484.86      9141.89      3.7
D: sharding (single con) :      7114.34      6751.21      5.1
E: sharding (multi con)  :      7166.56      1827.93     74.5

A and B are degradation checks, which are expected to show no
degradation.  C is the gain from postgres_fdw's command
pre-sending alone, on a single remote table.  D is the gain of
sharding over a single connection. The number of partitions/shards
is 4.  E is the gain using a dedicated connection per shard.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fc424c16e124934581a184fcadaed1e05f7672c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner to WaitEventSet and allows
the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 754154b..d459f32 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4eb6e83..e6fc3dd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ 

Re: [HACKERS] asynchronous execution

2017-09-03 Thread Kyotaro HORIGUCHI
At Thu, 31 Aug 2017 21:52:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170831.215236.135328985.horiguchi.kyot...@lab.ntt.co.jp>
> At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20170803.093057.261590619.horiguchi.kyot...@lab.ntt.co.jp>
> > > Unfortunately, that's probably another gigantic patch (that
> > > should probably be written by Andres).
> > 
> > Yeah, but building an async executor on the current style of
> > executor seems futile work, and sitting idle until the patch comes
> > is also a waste of time. So I'm planning to include the following
> > stuff in the next PoC patch, even though I'm not sure it can land
> > on the coming Andres' patch.
> > 
> > - Tuple passing outside the call stack. (I remember it appeared
> >   earlier in this thread but I couldn't find it.)
> > 
> >   This should be included in the Andres' patch.
> > 
> > - Give executor an ability to run from data-source (or driver)
> >   nodes to the root.
> > 
> >   I'm not sure this is included, but I suppose he is aiming this
> >   kind of thing.
> > 
> > - Rebuild asynchronous execution on the upside-down executor.
> 
> Anyway, I modified ExecProcNode into push-up form and it *seems*
> to work to some extent. But triggers and cursors are almost
> broken and several other regression tests fail. Some nodes, such
> as windowagg, are terribly difficult to change to this push-up
> form (using a state machine). And of course it is terribly
> inefficient.
> 
> I'm afraid that all of this may turn out to be in vain. But
> anyway, and FWIW, I'll show the work here after some cleansing
> work on it.

So, this is that. Maybe this is really a bad way to go. The worst
part is that it's terribly hard to maintain: the behavior of the
state machine constructed in this patch is hardly predictable, so
it is easily broken. During the "cleansing work" I hit many
crashes and infinite loops, and they were a bit hard to diagnose.
This will soon be broken by subsequent commits.

Anyway, and again FWIW, this is that. I'll leave it for a while
(at least for the period of this CF) and reconsider async
execution in different forms.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


poc_pushexecutor_20170904_4faa1dc.tar.bz2
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous execution

2017-08-31 Thread Kyotaro HORIGUCHI
At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170803.093057.261590619.horiguchi.kyot...@lab.ntt.co.jp>
> > Unfortunately, that's probably another gigantic patch (that
> > should probably be written by Andres).
> 
> Yeah, but building an async executor on the current style of
> executor seems futile work, and sitting idle until the patch comes
> is also a waste of time. So I'm planning to include the following
> stuff in the next PoC patch, even though I'm not sure it can land
> on the coming Andres' patch.
> 
> - Tuple passing outside the call stack. (I remember it appeared
>   earlier in this thread but I couldn't find it.)
> 
>   This should be included in the Andres' patch.
> 
> - Give executor an ability to run from data-source (or driver)
>   nodes to the root.
> 
>   I'm not sure this is included, but I suppose he is aiming this
>   kind of thing.
> 
> - Rebuild asynchronous execution on the upside-down executor.

Anyway, I modified ExecProcNode into push-up form and it *seems*
to work to some extent. But triggers and cursors are almost broken
and several other regression tests fail. Some nodes, such as
windowagg, are terribly difficult to change to this push-up form
(using a state machine). And of course it is terribly inefficient.

I'm afraid that all of this may turn out to be in vain. But
anyway, and FWIW, I'll show the work here after some cleansing
work on it.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-08-02 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Tue, 1 Aug 2017 16:27:41 -0400, Robert Haas  wrote in 

> On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
>  wrote:
> > Another is getting rid of recursive call to run an execution
> > tree.
> 
> That happens to be exactly what Andres did for expression evaluation
> in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
> generalizing that to include the plan tree as well as expression trees
> is likely to be the long-term way forward here.

I read it in the source tree. The patch converts an expression
tree into an intermediate form and then runs it on a custom-made
interpreter. Guessing from the words "upside down" from Andres,
the whole thing will become source-driven.

> Unfortunately, that's probably another gigantic patch (that
> should probably be written by Andres).

Yeah, but building an async executor on the current style of
executor seems futile work, and sitting idle until the patch comes
is also a waste of time. So I'm planning to include the following
stuff in the next PoC patch, even though I'm not sure it can land
on the coming Andres' patch.

- Tuple passing outside the call stack. (I remember it appeared
  earlier in this thread but I couldn't find it.)

  This should be included in the Andres' patch.

- Give executor an ability to run from data-source (or driver)
  nodes to the root.

  I'm not sure this is included, but I suppose he is aiming this
  kind of thing.

- Rebuild asynchronous execution on the upside-down executor.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-08-01 Thread Robert Haas
On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
 wrote:
> Another is getting rid of recursive call to run an execution
> tree.

That happens to be exactly what Andres did for expression evaluation
in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
generalizing that to include the plan tree as well as expression trees
is likely to be the long-term way forward here.  Unfortunately, that's
probably another gigantic patch (that should probably be written by
Andres).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous execution

2017-07-31 Thread Kyotaro HORIGUCHI
At Fri, 28 Jul 2017 17:31:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170728.173105.238045591.horiguchi.kyot...@lab.ntt.co.jp>
> Thank you for the comment.
> 
> At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas  wrote 
> in 
> > regression all the same.  Every type of intermediate node will have to
> > have a code path where it uses ExecAsyncRequest() /
> > ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
> 
> I understand Robert's concern and I share the same opinion. It
> needs a further, different framework.
> 
> At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas  wrote 
> in 
> > On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane  wrote:
> > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > > no cost when $extra_stuff is not needed, because you simply don't insert
> > > the wrapper function when it's not needed.  I'm not sure that it will
...
> > Yeah, I don't quite see how that would apply in this case -- what we
> > need here is not as simple as just conditionally injecting an extra
> > bit.
> 
> Thank you for the pointer, Tom. The subject (segfault in HEAD...)
> hadn't made me think that this kind of discussion was being held
> there. Anyway, it seems very close to asynchronous execution, so
> I'll catch up on it while considering how I can tie this work in
> with it.

I now understand the executor change that has just been made on
master, based on the pointed-to thread. It seems to have the
capability to let an exec node switch to async-aware with no extra
cost on non-async processing. So it would be doable to (just)
*shrink* the current framework by detaching the async-aware side
of the API. But to get the most out of asynchrony, multiple
async-capable nodes distributed under async-unaware nodes must be
able to run simultaneously.

There seem to be two ways to achieve this.

One is propagating a required-async-nodes bitmap up to the topmost
node and waiting for all the required nodes to become ready. In
the long run this requires all nodes to be async-aware, and that
apparently would have a bad performance effect on async-unaware
nodes containing async-capable nodes.

The other is getting rid of the recursive calls used to run an
execution tree. It is perhaps the same as what was mentioned as
"data-centric processing" in previous threads [1], [2], but I'd
like to pay attention to the aspect of "enabling the execution
tree to resume from an arbitrary leaf node". So I'm considering
realizing it still in a one-tuple-by-one manner instead of
collecting all tuples of a leaf node first, though I'm not sure it
is doable.


[1] 
https://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab77159a9b...@szxeml521-mbs.china.huawei.com
[2] 
https://www.postgresql.org/message-id/20160629183254.frcm3dgg54ud5...@alap3.anarazel.de

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-07-28 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas  wrote 
in 
> But if we do, then I fear we'll just be reintroducing the same
> performance regression that we introduced by switching to this
> framework from the previous one - or maybe a different one, but a
> regression all the same.  Every type of intermediate node will have to
> have a code path where it uses ExecAsyncRequest() /
> ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and

I understand Robert's concern and I think I share the same
opinion. It needs a further, different framework.

At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas  wrote 
in 
> On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane  wrote:
> > I have not been paying any attention to this thread whatsoever,
> > but I wonder if you can address your problem by building on top of
> > the ExecProcNode replacement that Andres is working on,
> > https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudi...@alap3.anarazel.de
> >
> > The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> > no cost when $extra_stuff is not needed, because you simply don't insert
> > the wrapper function when it's not needed.  I'm not sure that it will
> > scale well to several different kinds of insertions though, for instance
> > if you wanted both instrumentation and async support on the same node.
> > But maybe those two couldn't be arms-length from each other anyway,
> > in which case it might be fine as-is.
> 
> Yeah, I don't quite see how that would apply in this case -- what we
> need here is not as simple as just conditionally injecting an extra
> bit.

Thank you for the pointer, Tom. The subject (segfault in HEAD...)
hadn't made me think that this kind of discussion was being held
there. Anyway, it seems very close to asynchronous execution, so
I'll catch up on it while considering how I can tie this work in
with it.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-07-27 Thread Robert Haas
On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane  wrote:
> I have not been paying any attention to this thread whatsoever,
> but I wonder if you can address your problem by building on top of
> the ExecProcNode replacement that Andres is working on,
> https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudi...@alap3.anarazel.de
>
> The scheme he has allows $extra_stuff to be injected into ExecProcNode at
> no cost when $extra_stuff is not needed, because you simply don't insert
> the wrapper function when it's not needed.  I'm not sure that it will
> scale well to several different kinds of insertions though, for instance
> if you wanted both instrumentation and async support on the same node.
> But maybe those two couldn't be arms-length from each other anyway,
> in which case it might be fine as-is.

Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous execution

2017-07-26 Thread Tom Lane
Robert Haas  writes:
> Ostensibly, the advantage of this framework over my previous proposal
> is that it avoids inserting anything into ExecProcNode(), which is
> probably a good thing to avoid given how frequently ExecProcNode() is
> called.  Unless the parent and the child both know about asynchronous
> execution and choose to use it, everything runs exactly as it does
> today and so there is no possibility of a complaint about a
> performance hit.  As far as it goes, that is good.

> However, at a deeper level, I fear we haven't really solved the
> problem.  If an Append is directly on top of a ForeignScan node, then
> this will work.  But if an Append is indirectly on top of a
> ForeignScan node with some other stuff in the middle, then it won't -
> unless we make whichever nodes appear between the Append and the
> ForeignScan async-capable.

I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudi...@alap3.anarazel.de

The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed.  I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.

regards, tom lane




Re: [HACKERS] asynchronous execution

2017-07-26 Thread Robert Haas
On Tue, Jul 25, 2017 at 5:11 AM, Kyotaro HORIGUCHI
 wrote:
>  [ new patches ]

I spent some time today refreshing my memory of what's going with this
thread today.

Ostensibly, the advantage of this framework over my previous proposal
is that it avoids inserting anything into ExecProcNode(), which is
probably a good thing to avoid given how frequently ExecProcNode() is
called.  Unless the parent and the child both know about asynchronous
execution and choose to use it, everything runs exactly as it does
today and so there is no possibility of a complaint about a
performance hit.  As far as it goes, that is good.

However, at a deeper level, I fear we haven't really solved the
problem.  If an Append is directly on top of a ForeignScan node, then
this will work.  But if an Append is indirectly on top of a
ForeignScan node with some other stuff in the middle, then it won't -
unless we make whichever nodes appear between the Append and the
ForeignScan async-capable.  Indeed, we'd really want all kinds of
joins and aggregates to be async-capable so that examples like the one
Corey asked about in
http://postgr.es/m/CADkLM=fuvVdKvz92XpCRnb4=rj6bLOhSLifQ3RV=sb4q5rj...@mail.gmail.com
will work.

But if we do, then I fear we'll just be reintroducing the same
performance regression that we introduced by switching to this
framework from the previous one - or maybe a different one, but a
regression all the same.  Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
it seems like that will either end up duplicating a lot of code from
the regular code path or, alternatively, polluting the regular code
path with some of the async code's concerns to avoid duplication, and
maybe slowing things down.

Maybe that concern is unjustified; I'm not sure.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous execution

2017-07-25 Thread Kyotaro HORIGUCHI
Hello,

Commit 8bf58c0d9bd33686 badly conflicts with this patch, so I'll
rebase it and add a patch to refactor the function that Antonin
pointed out. That patch would be merged into the 0002 patch.

At Tue, 18 Jul 2017 16:24:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170718.162452.221576658.horiguchi.kyot...@lab.ntt.co.jp>
> I'll put an upper limit to the number of waiters processed at
> once. Then add a comment like that.
> 
> Actually, the reason I thought of simplification was that I noticed a
> small inefficiency in the way you do the compaction. In particular, I
> think it's not always necessary to swap the tail and head entries.
> Would something like this make sense?
> 
> I'm not sure, but I suppose it is rare that all of the leading
> elements in the array are still incomplete. In most cases the
> first element gets a response first.
...
> Yeah, but maybe "head" is still confusing even if reversed,
> because it is still not the head of anything. It might be less
> confusing to rewrite it in a more verbose-but-straightforward way.
> 
> 
> |  int npending = 0;
> | 
> |  /* Skip over not-completed items at the beginning */
> |  while (npending < estate->es_num_pending_async &&
> |         estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
> |    npending++;
> | 
> |  /* Scan over the rest for not-completed items */
> |  for (i = npending + 1; i < estate->es_num_pending_async; ++i)
> |  {
> |    PendingAsyncRequest *tmp;
> |    PendingAsyncRequest *curr = estate->es_pending_async[i];
> |
> |    if (curr->state == ASYNCREQ_COMPLETE)
> |      continue;
> |
> |    /* Swap the not-completed item to the tail of the first chunk */
> |    tmp = estate->es_pending_async[npending];
> |    estate->es_pending_async[npending] = curr;
> |    estate->es_pending_async[i] = tmp;
> |    ++npending;
> |  }

The last patch does something like this (with the apparent bugs
fixed).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 41ad9a7518c066da619363e6cdf8574fa00ee1e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner to WaitEventSet and allows
the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 4452ea4..ed71e7c 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..9543397 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. 

Re: [HACKERS] asynchronous execution

2017-07-18 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 11 Jul 2017 10:28:51 +0200, Antonin Houska  wrote in 
<6448.1499761731@localhost>
> Kyotaro HORIGUCHI  wrote:
> > Effectively it is a waiting queue followed by a completed list.
> > The point of the compaction is keeping the order of the waiting,
> > or not-yet-completed, requests, which is crucial to avoid a kind
> > of precedence inversion. We cannot keep the order with a
> > bitmapset used in such a way.
> 
> > The current code waits for all waiters at once and processes all
> > fired events at once, so the order in the waiting queue is
> > inessential in that case. On the other hand, I suppose waiting on
> > several tens to near a hundred remote hosts is within a realistic
> > target range, and keeping the order could be crucial if we
> > process only part of the queue at once in that case.
> > 
> > Putting significance on the deviation of response times among
> > remotes, process-all-at-once is effective. In turn we should
> > consider the effectiveness of the lifecycle of the larger wait
> > event set.
> 
> ok, I missed the fact that the order of es_pending_async entries is
> important. I think this is worth adding a comment.

I'll put an upper limit to the number of waiters processed at
once. Then add a comment like that.

> Actually, the reason I thought of simplification was that I noticed a
> small inefficiency in the way you do the compaction. In particular, I
> think it's not always necessary to swap the tail and head entries.
> Would something like this make sense?

I'm not sure, but I suppose it is rare that all of the leading
elements in the array are still incomplete. In most cases the
first element gets a response first.

> 
>   /* If any node completed, compact the array. */
>   if (any_node_done)
>   {
...
>       for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
>       {
...
>           if (tail->state == ASYNCREQ_COMPLETE)
>               continue;
> 
>           /*
>            * If the array starts with one or more incomplete requests,
>            * both head and tail point at the same item, so there's no
>            * point in swapping.
>            */
>           if (tidx > hidx)
>           {

This works to skip the first several elements when none of them is
ASYNCREQ_COMPLETE yet. I think it makes sense as long as it
doesn't harm the loop. The optimization is more effective with the
skip hoisted out of the loop, like this:

|  for (tidx = 0; tidx < estate->es_num_pending_async &&
|         estate->es_pending_async[tidx]->state != ASYNCREQ_COMPLETE;
|       ++tidx)
|    ;
|  for (; tidx < estate->es_num_pending_async; ++tidx)
...


> And besides that, I think it'd be more intuitive if the meanings of "head"
> and "tail" were reversed: if the array is iterated from lower to higher
> positions, then I'd consider the head to be at the higher position, not
> the tail.

Yeah, but maybe "head" is still confusing even if reversed,
because it is still not the head of anything. It might be less
confusing to rewrite it in a more verbose-but-straightforward way.


|  int npending = 0;
| 
|  /* Skip over not-completed items at the beginning */
|  while (npending < estate->es_num_pending_async &&
|         estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
|    npending++;
| 
|  /* Scan over the rest for not-completed items */
|  for (i = npending + 1; i < estate->es_num_pending_async; ++i)
|  {
|    PendingAsyncRequest *tmp;
|    PendingAsyncRequest *curr = estate->es_pending_async[i];
|
|    if (curr->state == ASYNCREQ_COMPLETE)
|      continue;
|
|    /* Swap the not-completed item to the tail of the first chunk */
|    tmp = estate->es_pending_async[npending];
|    estate->es_pending_async[npending] = curr;
|    estate->es_pending_async[i] = tmp;
|    ++npending;
|  }


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-07-11 Thread Antonin Houska
Kyotaro HORIGUCHI  wrote:

> > Just one idea that I had while reading the code.
> > 
> > In ExecAsyncEventLoop you iterate estate->es_pending_async, then move
> > the complete requests to the end and finally adjust
> > estate->es_num_pending_async so that the array no longer contains the
> > complete requests. I think the point is that you can then add new
> > requests to the end of the array.
> > 
> > I wonder if a set (Bitmapset) of incomplete requests would make the
> > code more efficient. The set would contain the position of each
> > incomplete request in estate->es_pending_async (I think it's the
> > myindex field of PendingAsyncRequest). If ExecAsyncEventLoop used this
> > set to retrieve the requests subject to ExecAsyncNotify etc., then the
> > compaction of estate->es_pending_async wouldn't be necessary.
> > 
> > ExecAsyncRequest would use the set to look for space for new requests
> > by iterating it and trying to find the first gap (which corresponds to
> > a completed request).
> > 
> > And finally, an item would be removed from the set at the moment the
> > request state is set to ASYNCREQ_COMPLETE.
> 
> Effectively it is a waiting queue followed by a completed list. The point
> of the compaction is keeping the order of waiting (not-yet-completed)
> requests, which is crucial to avoid a kind of priority inversion. We
> cannot keep that order using a bitmapset in such a way.
>
> The current code waits on all waiters at once and processes all fired
> events at once, so the order within the waiting queue is inessential in
> that case. On the other hand, I suppose waiting on several tens to near a
> hundred remote hosts is within the realistic target range. Keeping the
> order could be crucial if we process only a part of the queue at once in
> that case.
> 
> Putting significance on the deviation of response times of the remotes,
> process-all-at-once is effective. In turn we should consider the
> effectiveness of the lifecycle of the larger wait event set.

OK, I missed the fact that the order of es_pending_async entries is
important. I think that's worth a comment in the code.

Actually, the reason I thought of the simplification was that I noticed a
small inefficiency in the way you do the compaction. In particular, I think
it's not always necessary to swap the tail and head entries. Would something
like this make sense?


/* If any node completed, compact the array. */
if (any_node_done)
{
	int		hidx = 0,
			tidx;

	/*
	 * Swap all not-yet-completed items to the start of the array.
	 * Keep them in the same order.
	 */
	for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
	{
		PendingAsyncRequest *tail = estate->es_pending_async[tidx];

		Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);

		if (tail->state == ASYNCREQ_COMPLETE)
			continue;

		/*
		 * If the array starts with one or more incomplete requests,
		 * both head and tail point at the same item, so there's no
		 * point in swapping.
		 */
		if (tidx > hidx)
		{
			PendingAsyncRequest *head = estate->es_pending_async[hidx];

			/*
			 * Once the tail got ahead, it should only leave
			 * ASYNCREQ_COMPLETE behind. Only those can then be seen
			 * by head.
			 */
			Assert(head->state == ASYNCREQ_COMPLETE);

			estate->es_pending_async[tidx] = head;
			estate->es_pending_async[hidx] = tail;
		}

		++hidx;
	}

	estate->es_num_pending_async = hidx;
}

And besides that, I think it'd be more intuitive if the meanings of "head" and
"tail" were reversed: if the array is iterated from lower to higher positions,
then I'd consider the head to be at the higher position, not the tail.
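
As a standalone sketch of that proposal (using a simplified Req stand-in for PendingAsyncRequest rather than the actual executor structures), the loop could be written so that an element is never swapped with itself:

```c
typedef enum { REQ_WAITING, REQ_COMPLETE } ReqState;

typedef struct Req
{
	int			id;
	ReqState	state;
} Req;

/*
 * Compact incomplete requests to the array head, preserving order.
 * Returns the new number of pending entries.  No swap is performed
 * while head and tail still coincide.
 */
static int
compact_pending_noswap(Req *reqs[], int nreqs)
{
	int			hidx = 0,
				tidx;

	for (tidx = 0; tidx < nreqs; tidx++)
	{
		Req		   *tail = reqs[tidx];

		if (tail->state == REQ_COMPLETE)
			continue;

		if (tidx > hidx)
		{
			/* Slot hidx must hold a completed request here */
			Req		   *head = reqs[hidx];

			reqs[tidx] = head;
			reqs[hidx] = tail;
		}
		hidx++;
	}
	return hidx;
}
```

The single pass still keeps the relative order of incomplete requests; the only difference from an always-swapping version is that a prefix of already-pending items is left untouched.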

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] asynchronous execution

2017-07-03 Thread Kyotaro HORIGUCHI
Thank you for the thought.

This is at PoC level so I'd be grateful for this kind of
fundamental comments.

At Wed, 28 Jun 2017 20:22:24 +0200, Antonin Houska  wrote in 
<392.1498674144@localhost>
> Kyotaro HORIGUCHI  wrote:
> 
> > The patch got conflicted. This is a new version just rebased to
> > the current master. Further amendment will be made later.
> 
> Just one idea that I had while reading the code.
> 
> In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
> complete requests to the end and finally adjust estate->es_num_pending_async so
> that the array no longer contains the complete requests. I think the point is
> that then you can add new requests to the end of the array.
> 
> I wonder if a set (Bitmapset) of incomplete requests would make the code more
> efficient. The set would contain the position of each incomplete request in
> estate->es_pending_async (I think it's the myindex field of
> PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
> requests subject to ExecAsyncNotify etc., then the compaction of
> estate->es_pending_async wouldn't be necessary.
> 
> ExecAsyncRequest would use the set to look for space for new requests by
> iterating it and trying to find the first gap (which corresponds to a completed
> request).
> 
> And finally, an item would be removed from the set at the moment the request
> state is being set to ASYNCREQ_COMPLETE.

Effectively it is a waiting queue followed by a completed list. The point
of the compaction is keeping the order of waiting (not-yet-completed)
requests, which is crucial to avoid a kind of priority inversion. We cannot
keep that order using a bitmapset in such a way.

The current code waits on all waiters at once and processes all fired
events at once, so the order within the waiting queue is inessential in
that case. On the other hand, I suppose waiting on several tens to near a
hundred remote hosts is within the realistic target range. Keeping the
order could be crucial if we process only a part of the queue at once in
that case.

Putting significance on the deviation of response times of the remotes,
process-all-at-once is effective. In turn we should consider the
effectiveness of the lifecycle of the larger wait event set.
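
To illustrate the ordering problem with a sketch (not executor code): a set of slot positions is iterated in ascending numeric order, as with bms_next_member(), so once a newer request reuses a freed lower slot it is visited before an older request sitting in a higher slot, regardless of arrival order. Modeling the set as a plain bitmask:

```c
/*
 * Return the lowest set bit position in mask, or -1 if mask is empty.
 * This mirrors ascending iteration over a set of slot positions.
 */
static int
lowest_slot(unsigned int mask)
{
	for (int i = 0; i < 32; i++)
		if (mask & (1u << i))
			return i;
	return -1;
}
```

Suppose request B arrived before C, but B occupies slot 7 while C reused freed slot 5: ascending iteration visits slot 5 (the newer request) first, which is exactly the inversion that the order-preserving compaction avoids.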

Sorry for the discursive discussion, but in short, I have noticed that I
have a lot to consider here :p Thanks!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-07-03 Thread Kyotaro HORIGUCHI
Hi, I've returned.

At Thu, 29 Jun 2017 14:08:27 +0900, Amit Langote 
 wrote in 
<63a5a01c-2967-83e0-8bbf-c981404f5...@lab.ntt.co.jp>
> Hi,
> 
> On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
> > Thank you for looking this.
> > 
> > At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
> >> Can you please explain this part of make_append() ?
> >>
> >> /* Currently async on partitioned tables is not available */
> >> Assert(nasyncplans == 0 || partitioned_rels == NIL);
> >>
> >> I don't think the output of Append plan is supposed to be ordered even if 
> >> the
> >> underlying relation is partitioned. Besides ordering, is there any other
> >> reason not to use the asynchronous execution?
> 
> When making an Append for a partitioned table, among the arguments passed
> to make_append(), 'partitioned_rels' is a list of RT indexes of
> partitioned tables in the inheritance tree of which the aforementioned
> partitioned table is the root.  'appendplans' is a list of subplans for
> scanning the leaf partitions in the tree.  Note that the 'appendplans'
> list contains no members corresponding to the partitioned tables, because
> we don't need to scan them (only leaf relations contain any data).
> 
> The point of having the 'partitioned_rels' list in the resulting Append
> plan is so that the executor can identify those relations and take the
> appropriate locks on them.

Amit, thank you for the detailed explanation. I understand what it is and
that just ignoring it is enough; I then confirmed that it actually works as
before.

I'll address Antonin's comments tomorrow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-06-28 Thread Amit Langote
Hi,

On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:
> Thank you for looking this.
> 
> At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:
>> Kyotaro HORIGUCHI  wrote:
>>
>>> The patch got conflicted. This is a new version just rebased to
>>> the current master. Further amendment will be made later.
>>
>> Can you please explain this part of make_append() ?
>>
>> /* Currently async on partitioned tables is not available */
>> Assert(nasyncplans == 0 || partitioned_rels == NIL);
>>
>> I don't think the output of Append plan is supposed to be ordered even if the
>> underlying relation is partitioned. Besides ordering, is there any other
>> reason not to use the asynchronous execution?
> 
> It was just a developmental sentinel to remind me to consider declarative
> partitioning later, since I didn't have a clear idea of the differences
> (or similarities) between appendrels and partitioned_rels. It doesn't mean
> the condition can never arise. I'll check it out and support
> partitioned_rels soon. Sorry for having left it as it is.

When making an Append for a partitioned table, among the arguments passed
to make_append(), 'partitioned_rels' is a list of RT indexes of
partitioned tables in the inheritance tree of which the aforementioned
partitioned table is the root.  'appendplans' is a list of subplans for
scanning the leaf partitions in the tree.  Note that the 'appendplans'
list contains no members corresponding to the partitioned tables, because
we don't need to scan them (only leaf relations contain any data).

The point of having the 'partitioned_rels' list in the resulting Append
plan is so that the executor can identify those relations and take the
appropriate locks on them.

Thanks,
Amit





Re: [HACKERS] asynchronous execution

2017-06-28 Thread Kyotaro HORIGUCHI
Thank you for looking this.

At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska  wrote in 
<4579.1498638234@localhost>
> Kyotaro HORIGUCHI  wrote:
> 
> > The patch got conflicted. This is a new version just rebased to
> > the current master. Further amendment will be made later.
> 
> Can you please explain this part of make_append() ?
> 
> /* Currently async on partitioned tables is not available */
> Assert(nasyncplans == 0 || partitioned_rels == NIL);
> 
> I don't think the output of Append plan is supposed to be ordered even if the
> underlying relation is partitioned. Besides ordering, is there any other
> reason not to use the asynchronous execution?

It was just a developmental sentinel to remind me to consider declarative
partitioning later, since I didn't have a clear idea of the differences (or
similarities) between appendrels and partitioned_rels. It doesn't mean the
condition can never arise. I'll check it out and support partitioned_rels
soon. Sorry for having left it as it is.

> And even if there was some, the planner should ensure that executor does not
> fire the assertion statement above. The script attached shows an example how
> to cause the assertion failure.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-06-28 Thread Antonin Houska
Kyotaro HORIGUCHI  wrote:

> The patch got conflicted. This is a new version just rebased to
> the current master. Further amendment will be made later.

Just one idea that I had while reading the code.

In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
complete requests to the end and finaly adjust estate->es_num_pending_async so
that the array no longer contains the complete requests. I think the point is
that then you can add new requests to the end of the array.

I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain position of each incomplete request in
estate->es_num_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc, then the compaction of
estate->es_pending_async wouldn't be necessary.

ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to completed
request).

And finally, item would be removed from the set at the moment the request
state is being set to ASYNCREQ_COMPLETE.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] asynchronous execution

2017-06-28 Thread Antonin Houska
Kyotaro HORIGUCHI  wrote:

> The patch got conflicted. This is a new version just rebased to
> the current master. Further amendment will be made later.

Can you please explain this part of make_append() ?

/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of Append plan is supposed to be ordered even if the
underlying relation is partitioned. Besides ordering, is there any other
reason not to use the asynchronous execution?

And even if there was some, the planner should ensure that executor does not
fire the assertion statement above. The script attached shows an example how
to cause the assertion failure.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

OPT_HOST="-h localhost"
USER=postgres

for x in $(psql $OPT_HOST -c "SELECT datname FROM pg_database WHERE datname ~ 'shard.*' " postgres -A -t)

do
echo dropping $x
dropdb $OPT_HOST $x
done;

createdb $OPT_HOST shard

psql $OPT_HOST -U $USER shard -c "
CREATE EXTENSION postgres_fdw;
CREATE TABLE orders(customer_id int, product_id int)
   PARTITION BY LIST (customer_id);"
createdb $OPT_HOST shard_0
psql $OPT_HOST -U $USER shard_0 -c "
CREATE TABLE orders_0 AS
SELECT trunc(2 * random())::int AS customer_id,
   trunc(100 * random())::int AS product_id
FROM generate_series(1, 5000);
ANALYZE;"
psql $OPT_HOST -U $USER shard -c "
CREATE SERVER server_0 FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', dbname 'shard_0', fetch_size '1000',
use_remote_estimate 'true', fdw_tuple_cost '10');
CREATE USER MAPPING FOR CURRENT_USER SERVER server_0 OPTIONS (user '$USER');
IMPORT FOREIGN SCHEMA public FROM SERVER server_0 INTO public;
ALTER TABLE orders ATTACH PARTITION orders_0 FOR VALUES IN (0, 1);"

createdb $OPT_HOST shard_1
psql $OPT_HOST -U $USER shard_1 -c "
CREATE TABLE orders_1 AS
SELECT trunc(2 * random())::int + 2 AS customer_id,
   trunc(100 * random())::int + 2 AS product_id
FROM generate_series(1, 5000);
ANALYZE;"
psql $OPT_HOST -U $USER shard -c "
CREATE SERVER server_1 FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', dbname 'shard_1', fetch_size '1000',
use_remote_estimate 'true', fdw_tuple_cost '10');
CREATE USER MAPPING FOR CURRENT_USER SERVER server_1 OPTIONS (user '$USER');
IMPORT FOREIGN SCHEMA public FROM SERVER server_1 INTO public;
ALTER TABLE orders ATTACH PARTITION orders_1 FOR VALUES IN (2, 3);"


psql $OPT_HOST -U $USER shard -c "SELECT * FROM orders"



Re: [HACKERS] asynchronous execution

2017-06-21 Thread Kyotaro HORIGUCHI
The patch got conflicted. This is a new version just rebased to
the current master. Further amendment will be made later.

> The attached patch is rebased on the current master, but no
> substantial changes other than disallowing partitioned tables on
> async by assertion.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 32d5c143a679bcccee9ff29fe3807dfd8deae458 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner

WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner member to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 261e9be..c4f336d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..9543397 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 		  MyLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 4a4a287..f2509c3 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static 

Re: [HACKERS] asynchronous execution

2017-05-21 Thread Kyotaro HORIGUCHI
At Mon, 22 May 2017 13:12:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170522.131214.20936668.horiguchi.kyot...@lab.ntt.co.jp>
> > The attached patch is rebased on the current master, but no
> > substantial changes other than disallowing partitioned tables on
> > async by assertion.
> 
> This is just rebased onto the current master (d761fe2).
> I'll recheck further detail after this.

Sorry, the patch was missing some files to add.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner

WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner member to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d1cc38b..1c34114 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 53e6bf2..8c182a2 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 5afb211..1d9111e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
		  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic 

Re: [HACKERS] asynchronous execution

2017-05-21 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 04 Apr 2017 19:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170404.192539.29699823.horiguchi.kyot...@lab.ntt.co.jp>
> The attached patch is rebased on the current master, but no
> substantial changes other than disallowing partitioned tables on
> async by assertion.

This is just rebased onto the current master (d761fe2).
I'll recheck further detail after this.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner

WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner member to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d1cc38b..1c34114 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 53e6bf2..8c182a2 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 5afb211..1d9111e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
		  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait 

Re: [HACKERS] asynchronous execution

2017-04-04 Thread Kyotaro HORIGUCHI
Hello,

At Sun, 2 Apr 2017 12:21:14 -0400, Corey Huinker  
wrote in 

Re: [HACKERS] asynchronous execution

2017-04-02 Thread Corey Huinker
>
>
> I'll continue working on this from this point aiming to the next
> commit fest.
>
>
This probably will not surprise you given the many commits in the past 2
weeks, but the patches no longer apply to master:

 $ git apply
~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27:
trailing whitespace.
FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:39:
trailing whitespace.
#include "utils/resowner_private.h"
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:47:
trailing whitespace.
ResourceOwner   resowner;   /* Resource owner */
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:48:
trailing whitespace.

/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:57:
trailing whitespace.
WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL,
3);
error: patch failed: src/backend/libpq/pqcomm.c:201
error: src/backend/libpq/pqcomm.c: patch does not apply
error: patch failed: src/backend/storage/ipc/latch.c:61
error: src/backend/storage/ipc/latch.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/condition_variable.c:66
error: src/backend/storage/lmgr/condition_variable.c: patch does not apply
error: patch failed: src/backend/utils/resowner/resowner.c:124
error: src/backend/utils/resowner/resowner.c: patch does not apply
error: patch failed: src/include/storage/latch.h:101
error: src/include/storage/latch.h: patch does not apply
error: patch failed: src/include/utils/resowner_private.h:18
error: src/include/utils/resowner_private.h: patch does not apply


Re: [HACKERS] asynchronous execution

2017-03-20 Thread Kyotaro HORIGUCHI
Hello. This is the final report in this CF period.

At Fri, 17 Mar 2017 17:35:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170317.173505.152063931.horiguchi.kyot...@lab.ntt.co.jp>
> Async-capable plan is generated in planner. An Append contains at
> least one async-capable child becomes async-aware Append. So the
> async feature should be effective also for the UNION ALL case.
> 
> The following will work faster than the unpatched version.
> 
> SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL 
> SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;
> 
> I'll measure the performance for the case next week.

I found that the following query behaves the same as the
partitioned-table case.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL 
SELECT a FROM ft30 UNION ALL SELECT a FROM ft40 UNION ALL *SELECT a FROM ONLY 
pf0*) as ft;

So the difference comes from the additional async-incapable child
(the query is faster if it contains one). In both cases, the Append
node runs its children asynchronously, but behaves slightly
differently when all async-capable children are busy.

I'll continue working on this from this point aiming to the next
commit fest.

Thank you for valuable feedback.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous execution

2017-03-17 Thread Kyotaro HORIGUCHI
At Thu, 16 Mar 2017 17:16:32 -0400, Corey Huinker  
wrote in 

Re: [HACKERS] asynchronous execution

2017-03-16 Thread Corey Huinker
On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane  wrote:

> Corey Huinker  writes:
> > I reworked the test such that all of the foreign tables inherit from the
> > same parent table, and if you query that you do get async execution. But
> > it
> > doesn't work when just stringing together those foreign tables with UNION
> > ALLs.
>
> > I don't know how to proceed with this review if that was a goal of the
> > patch.
>
> Whether it was a goal or not, I'd say there is something either broken
> or incorrectly implemented if you don't see that.  The planner (and
> therefore also the executor) generally treats inheritance the same as
> simple UNION ALL.  If that's not the case here, I'd want to know why.
>
> regards, tom lane
>

Updated commitfest entry to "Returned With Feedback".


Re: [HACKERS] asynchronous execution

2017-03-16 Thread Tom Lane
Corey Huinker  writes:
> I reworked the test such that all of the foreign tables inherit from the
> same parent table, and if you query that you do get async execution. But it
> doesn't work when just stringing together those foreign tables with UNION
> ALLs.

> I don't know how to proceed with this review if that was a goal of the
> patch.

Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that.  The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL.  If that's not the case here, I'd want to know why.

regards, tom lane




Re: [HACKERS] asynchronous execution

2017-03-16 Thread Corey Huinker
On Mon, Mar 13, 2017 at 9:28 PM, Amit Langote  wrote:

> On 2017/03/14 10:08, Corey Huinker wrote:
> >> I don't think the plan itself will change as a result of applying this
> >> patch. You might however be able to observe some performance
> improvement.
> >
> > I could see no performance improvement, even with 16 separate queries
> > combined with UNION ALL. Query performance was always with +/- 10% of a
> 9.6
> > instance given the same script. I must be missing something.
>
> Hmm, maybe I'm missing something too.
>
> Anyway, here is an older message on this thread from Horiguchi-san where
> he shared some of the test cases that this patch improves performance for:
>
> https://www.postgresql.org/message-id/20161018.103051.
> 30820907.horiguchi.kyotaro%40lab.ntt.co.jp
>
> From that message:
>
> 
> I measured performance and had the following result.
>
> t0  - SELECT sum(a) FROM ;
> pl  - SELECT sum(a) FROM <4 local children>;
> pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
>
> The result is written as "time (std dev )"
>
> sync
>   t0: 3820.33 (  1.88)
>   pl: 1608.59 ( 12.06)
>  pf0: 7928.29 ( 46.58)
>  pf1: 8023.16 ( 26.43)
>
> async
>   t0: 3806.31 (  4.49)0.4% faster (should be error)
>   pl: 1629.17 (  0.29)1.3% slower
>  pf0: 6447.07 ( 25.19)   18.7% faster
>  pf1: 1876.80 ( 47.13)   76.6% faster
> 
>
> IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
> same server) measured with different implementations of the patch.
>
> Thanks,
> Amit
>

I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.

I don't know how to proceed with this review if that was a goal of the
patch.


Re: [HACKERS] asynchronous execution

2017-03-13 Thread Amit Langote
On 2017/03/14 10:08, Corey Huinker wrote:
>> I don't think the plan itself will change as a result of applying this
>> patch. You might however be able to observe some performance improvement.
> 
> I could see no performance improvement, even with 16 separate queries
> combined with UNION ALL. Query performance was always with +/- 10% of a 9.6
> instance given the same script. I must be missing something.

Hmm, maybe I'm missing something too.

Anyway, here is an older message on this thread from Horiguchi-san where
he shared some of the test cases that this patch improves performance for:

https://www.postgresql.org/message-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp

From that message:


I measured performance and had the following result.

t0  - SELECT sum(a) FROM ;
pl  - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time (std dev )"

sync
  t0: 3820.33 (  1.88)
  pl: 1608.59 ( 12.06)
 pf0: 7928.29 ( 46.58)
 pf1: 8023.16 ( 26.43)

async
  t0: 3806.31 (  4.49)0.4% faster (should be error)
  pl: 1629.17 (  0.29)1.3% slower
 pf0: 6447.07 ( 25.19)   18.7% faster
 pf1: 1876.80 ( 47.13)   76.6% faster


IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.

Thanks,
Amit






Re: [HACKERS] asynchronous execution

2017-03-13 Thread Corey Huinker
>
> I don't think the plan itself will change as a result of applying this
> patch. You might however be able to observe some performance improvement.
>
> Thanks,
> Amit
>

I could see no performance improvement, even with 16 separate queries
combined with UNION ALL. Query performance was always with +/- 10% of a 9.6
instance given the same script. I must be missing something.


Re: [HACKERS] asynchronous execution

2017-03-13 Thread Amit Langote
On 2017/03/14 6:31, Corey Huinker wrote:
> On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker 
> wrote:
> 
>>
>>> I think it will, because Append itself has been made async-capable by one
>>> of the patches and UNION ALL uses Append.  But as mentioned above, only
>>> the postgres_fdw foreign tables will be able to utilize this for now.
>>>
>>>
>> Ok, I'll re-run my test from a few weeks back and see if anything has
>> changed.
>>
> 
> 
> I'm not able to discern any difference in plan between a 9.6 instance and
> this patch.
> 
> The basic outline of my test is:
> 
> EXPLAIN ANALYZE
> SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
> UNION ALL
> SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'
> 
> 
> I've tried this test where tab1 through tab4 all are the same postgres_fdw
> foreign table.
> I've tried this test where tab1 through tab4 all are different foreign
> tables pointing to the same remote table, sharing the same server
> definition.
> I've tried this test where tab1 through tab4 all are different foreign
> tables, each with its own foreign server definition, all of which
> happen to point to the same remote cluster.
> 
> Are there some postgresql.conf settings I should set to get a decent test?

I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.

Thanks,
Amit






Re: [HACKERS] asynchronous execution

2017-03-13 Thread Corey Huinker
On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker 
wrote:

>
>> I think it will, because Append itself has been made async-capable by one
>> of the patches and UNION ALL uses Append.  But as mentioned above, only
>> the postgres_fdw foreign tables will be able to utilize this for now.
>>
>>
> Ok, I'll re-run my test from a few weeks back and see if anything has
> changed.
>


I'm not able to discern any difference in plan between a 9.6 instance and
this patch.

The basic outline of my test is:

EXPLAIN ANALYZE
SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'


I've tried this test where tab1 through tab4 all are the same postgres_fdw
foreign table.
I've tried this test where tab1 through tab4 all are different foreign
tables pointing to the same remote table, sharing the same server
definition.
I've tried this test where tab1 through tab4 all are different foreign
tables, each with its own foreign server definition, all of which
happen to point to the same remote cluster.

Are there some postgresql.conf settings I should set to get a decent test?


Re: [HACKERS] asynchronous execution

2017-03-12 Thread Corey Huinker
>
>
> I think it will, because Append itself has been made async-capable by one
> of the patches and UNION ALL uses Append.  But as mentioned above, only
> the postgres_fdw foreign tables will be able to utilize this for now.
>
>
Ok, I'll re-run my test from a few weeks back and see if anything has
changed.


Re: [HACKERS] asynchronous execution

2017-03-12 Thread Amit Langote
On 2017/03/11 8:19, Corey Huinker wrote:
> 
> On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI
> >
> wrote:
> 
> 9e43e87
> 
> 
> Patch fails on current master, but correctly applies to 9e43e87. Thanks
> for including the commit id.
> 
> Regression tests pass.
> 
> As with my last attempt at reviewing this patch, I'm confused about what
> kind of queries can take advantage of this patch. Is it only cases where a
> local table has multiple inherited foreign table children?

IIUC, Horiguchi-san's patch adds asynchronous capability for ForeignScans
driven by postgres_fdw (after building some relevant infrastructure
first).  The same might be added to other Scan nodes (and probably other
nodes as well) eventually so that more queries will benefit from
asynchronous execution.  It may just be that ForeignScans benefit more
from asynchronous execution than other Scan types.

> Will it work
> with queries where two foreign tables are referenced and combined with a
> UNION ALL?

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append.  But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.
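The behavior an async-capable Append aims for can be sketched with Python's asyncio as a hedged illustration (this is not the patch's actual code; the child names and delays are hypothetical): children issue their remote requests concurrently and the parent consumes results as each one becomes ready.

```python
import asyncio

# Hypothetical stand-ins for async-capable ForeignScan children.
async def foreign_scan(name, delay):
    await asyncio.sleep(delay)          # stands in for waiting on a remote server
    return f"tuples from {name}"

async def async_append(children):
    # Issue all remote requests up front, then consume whichever child
    # becomes ready first -- unlike a synchronous Append, which would
    # drain its children strictly in order.
    tasks = [asyncio.create_task(foreign_scan(n, d)) for n, d in children]
    return [await t for t in asyncio.as_completed(tasks)]

out = asyncio.run(async_append([("ft1", 0.05), ("ft2", 0.01)]))
print(out)  # ft2 finishes first because the children run concurrently
```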

Thanks,
Amit






Re: [HACKERS] asynchronous execution

2017-03-10 Thread Corey Huinker
On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> 9e43e87


Patch fails on current master, but correctly applies to 9e43e87. Thanks for
including the commit id.

Regression tests pass.

As with my last attempt at reviewing this patch, I'm confused about what
kind of queries can take advantage of this patch. Is it only cases where a
local table has multiple inherited foreign table children? Will it work
with queries where two foreign tables are referenced and combined with a
UNION ALL?


Re: [HACKERS] asynchronous execution

2017-02-23 Thread Kyotaro HORIGUCHI
Hello, I have completely reorganized the patch set into four patches on
the current master (9e43e87).

At Wed, 22 Feb 2017 17:39:45 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170222.173945.262776579.horiguchi.kyot...@lab.ntt.co.jp>
> Finally, I couldn't reproduce the crash for the (maybe) same case. I
> can guess two reasons for this. One is a situation where
> node->as_nasyncpending differs from estate->es_num_pending_async,
> but I couldn't find how that could happen. Another is a situation in
> postgresIterateForeignScan where the "next owner" reaches EOF but
> another waiter does not. I haven't reproduced the situation but
> fixed it for that case. In addition, I found a bug in
> ExecAsyncAppendResponse: it calls bms_add_member in an inappropriate
> way.

This turned out to be wrong. The real problem here was (maybe) that
ExecAsyncRequest can complete a tuple immediately. This causes
multiple calls to ExecAsyncRequest for the same child at once. (In
that case, the node being processed is added back to
node->as_needrequest before ExecAsyncRequest returns.)

Using a copy of node->as_needrequest would fix this, but that is
awkward, so I changed ExecAsyncRequest not to return a tuple
immediately. Instead, ExecAsyncEventLoop skips waiting if there is no
node to wait for. The tuple previously "responded" in
ExecAsyncRequest is now responded here.

In addition, the current policy of preserving es_wait_event_set
doesn't seem to work with the async-capable postgres_fdw, so the
current code clears it on every entry to ExecAppend. This needs more
thought.


I measured the performance of async execution, and it has improved
considerably over the previous version, especially in the
single-connection environment.

pf0: 4 foreign tables on a single connection
   non-async : (prev) 7928 ms -> (this time) 7993 ms
   async     : (prev) 6447 ms -> (this time) 3211 ms

pf1: 4 foreign tables, one dedicated connection per table
   non-async : (prev) 8023 ms -> (this time) 7953 ms
   async     : (prev) 1876 ms -> (this time) 1841 ms

The speedup from async execution is 60% in the single-connection
environment and 77% in the dedicated-connection environment.
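For reference, the quoted percentages can be checked against the millisecond figures above:

```python
# Sanity-check the quoted speedups against the timings above (ms).
pf0_sync, pf0_async = 7993, 3211
pf1_sync, pf1_async = 7953, 1841

pf0_gain = 100 * (pf0_sync - pf0_async) / pf0_sync
pf1_gain = 100 * (pf1_sync - pf1_async) / pf1_sync
print(round(pf0_gain), round(pf1_gain))  # 60 77
```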


> > Mmm, I reproduced it quite easily. A silly bug.
> > 
> > Something bad is happening between freeing ExecutorState memory
> > context and resource owner. Perhaps the ExecutorState is freed by
> > resowner (as a part of its anscestors) before the memory for the
> > WaitEventSet is freed. It was careless of me. I'll reconsider it.
> 
> The cause was that the WaitEventSet was placed in ExecutorState
> but registered to TopTransactionResourceOwner. I fixed it.

The attached patches are the following.

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
  Allows a WaitEventSet to be released by its resource owner.

0002-Asynchronous-execution-framework.patch
  Asynchronous execution framework based on Robert's version. All
  edits on this are merged.

0003-Make-postgres_fdw-async-capable.patch
  Make postgres_fdw async-capable.

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

  This could be merged into 0002, but I kept it separate since the
  use of these pragmas is debatable.
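The idea behind 0001 — tying a wait-event set's lifetime to a resource owner so it cannot leak when a query errors out — can be sketched in miniature. The class and method names here are illustrative analogues, not PostgreSQL's actual API:

```python
# Toy model of resource-owner-mediated cleanup: resources register with
# an owner, and the abort path releases everything the owner remembers,
# so a wait-event set created mid-query cannot leak on error.
class ResourceOwner:
    def __init__(self):
        self.resources = []

    def remember(self, resource):
        self.resources.append(resource)

    def release(self):
        # Release in reverse order of registration, then forget all.
        for resource in reversed(self.resources):
            resource.close()
        self.resources.clear()

class WaitEventSet:
    def __init__(self, owner):
        self.closed = False
        owner.remember(self)            # register at creation time

    def close(self):
        self.closed = True

owner = ResourceOwner()
wes = WaitEventSet(owner)
try:
    raise RuntimeError("query failed")  # simulated error mid-execution
except RuntimeError:
    owner.release()                     # transaction-abort path frees the set
print(wes.closed)  # True
```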


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From bcd888a98a7aa5e1bd367c83e06d598121fd2d94 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner to WaitEventSet and allows
the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7939b1f..16a5d7a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 0079ba5..a204b0c 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. 

Re: [HACKERS] asynchronous execution

2017-02-16 Thread Kyotaro HORIGUCHI
Thank you very much for testing this!

At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote  
wrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df...@lab.ntt.co.jp>
> Horiguchi-san,
> 
> On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> > I noticed that this patch is conflicting with 665d1fa (Logical
> > replication) so I rebased this. Only executor/Makefile
> > conflicted.
> 
> With the latest set of patches, I observe a crash due to an Assert failure:
> 
> #3  0x006883ed in ExecAsyncEventWait (estate=0x13c01b8,
> timeout=-1) at execAsync.c:345

This means that no pending FDW scan put itself into the waiting
stage, which leaves the whole thing stuck. It happens when nothing
is actually waiting for a result. I suppose that all of the
foreign scans ran on the same connection. Anyway, it must be a
mistake in the state transition. I'll look into it.

> I was running a query whose plan looked like:
> 
> explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
> group by 1,2 order by 1;
>   QUERY PLAN
> --
>  Sort
>Sort Key: ((ptab.tableoid)::regclass)
>->  HashAggregate
>  Group Key: (ptab.tableoid)::regclass, ptab.a
>  ->  Result
>->  Append
>  ->  Foreign Scan on ptab_1
>  ->  Foreign Scan on ptab_2
>  ->  Foreign Scan on ptab_3
>  ->  Foreign Scan on ptab_4
>  ->  Foreign Scan on ptab_5
>  ->  Foreign Scan on ptab_6
>  ->  Foreign Scan on ptab_7
>  ->  Foreign Scan on ptab_8
>  ->  Foreign Scan on ptab_9
>  ->  Foreign Scan on ptab_00010
>
> 
> The snipped part contains Foreign Scans on 90 more foreign partitions (in
> fact, I could see the crash even with 10 foreign table partitions for the
> same query).

Yeah, it seems unrelated to how many there are.

> There is a crash in one more case, which seems related to how WaitEventSet
> objects are manipulated during resource-owner-mediated cleanup of a failed
> query, such as after the FDW returned an error like below:
> 
> ERROR:  relation "public.ptab_00010" does not exist
> CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010
> 
> The backtrace in this looks like below:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> 301   lastidx = resarr->lastidx;
> (gdb)
> (gdb) bt
> #0  0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
> value=20645152) at resowner.c:301
> #1  0x009c6578 in ResourceOwnerForgetWES
> (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
> #2  0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> #3  0x009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
> at resowner.c:566
> #4  0x009c5155 in ResourceOwnerRelease (owner=0x12de768,
> phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
> '\001') at resowner.c:485
> #5  0x00524172 in AbortTransaction () at xact.c:2588
> #6  0x00524854 in AbortCurrentTransaction () at xact.c:3016
> #7  0x00836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
> dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
> #8  0x007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
> #9  0x007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
> #10 0x007a0885 in ServerLoop () at postmaster.c:1722
> #11 0x0079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
> postmaster.c:1330
> #12 0x006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228
> 
> There is a segfault when accessing the events variable, whose members seem
> to be pfreed:
> 
> (gdb) f 2
> #2  0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
> 600   ResourceOwnerForgetWES(set->resowner, set);
> (gdb) p *set
> $5 = {
>   nevents = 2139062143,
>   nevents_space = 2139062143,
>   resowner = 0x7f7f7f7f7f7f7f7f,
>   events = 0x7f7f7f7f7f7f7f7f,
>   latch = 0x7f7f7f7f7f7f7f7f,
>   latch_pos = 2139062143,
>   epoll_fd = 2139062143,
>   epoll_ret_events = 0x7f7f7f7f7f7f7f7f
> }

Mmm, I reproduced it quite easily. A silly bug.

Something bad is happening between freeing ExecutorState memory
context and resource owner. Perhaps the ExecutorState is freed by
resowner (as a part of its ancestors) before the memory for the
WaitEventSet is freed. It was careless of me. I'll reconsider it.

Great thanks for the report.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] asynchronous execution

2017-02-06 Thread Amit Langote
Horiguchi-san,

On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#0  0x003969632625 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x003969633e05 in *__GI_abort () at abort.c:92
#2  0x0098b22c in ExceptionalCondition (conditionName=0xb30e02
"!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50
"execAsync.c",
lineNumber=345) at assert.c:54
#3  0x006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
#4  0x00687ed5 in ExecAsyncEventLoop (estate=0x13c01b8,
requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5  0x006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6  0x00692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7  0x006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8  0x00692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9  0x006a596b in fetch_input_tuple (aggstate=0x13c06a0) at
nodeAgg.c:587
#10 0x006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at
nodeAgg.c:2272
#11 0x006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x00692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x00692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x0068e733 in ExecutePlan (estate=0x13c01b8,
planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8)
at execMain.c:1609
#16 0x0068c751 in standard_ExecutorRun (queryDesc=0x135c568,
direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x0068c5dc in ExecutorRun (queryDesc=0x135c568,


I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
  QUERY PLAN
--
 Sort
   Sort Key: ((ptab.tableoid)::regclass)
   ->  HashAggregate
 Group Key: (ptab.tableoid)::regclass, ptab.a
 ->  Result
   ->  Append
 ->  Foreign Scan on ptab_1
 ->  Foreign Scan on ptab_2
 ->  Foreign Scan on ptab_3
 ->  Foreign Scan on ptab_4
 ->  Foreign Scan on ptab_5
 ->  Foreign Scan on ptab_6
 ->  Foreign Scan on ptab_7
 ->  Foreign Scan on ptab_8
 ->  Foreign Scan on ptab_9
 ->  Foreign Scan on ptab_00010
   

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:

ERROR:  relation "public.ptab_00010" does not exist
CONTEXT:  Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this looks like below:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301 lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0  0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1  0x009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2  0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3  0x009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
at resowner.c:566
#4  0x009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5  0x00524172 in AbortTransaction () at xact.c:2588
#6  0x00524854 in AbortCurrentTransaction () at xact.c:3016
#7  0x00836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8  0x007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9  0x007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x007a0885 in ServerLoop () at postmaster.c:1722
#11 0x0079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x006e7549 in main (argc=3, argv=0x12aacc0) at 

Re: [HACKERS] asynchronous execution

2017-02-06 Thread Corey Huinker
On Fri, Feb 3, 2017 at 5:04 AM, Antonin Houska  wrote:

> Kyotaro HORIGUCHI  wrote:
>
> > I noticed that this patch is conflicting with 665d1fa (Logical
> > replication) so I rebased this. Only executor/Makefile
> > conflicted.
>
> I was lucky enough to see an infinite loop when using this patch, which I
> fixed by this change:
>
> diff --git a/src/backend/executor/execAsync.c
> b/src/backend/executor/execAsync.c
> new file mode 100644
> index 588ba18..9b87fbd
> *** a/src/backend/executor/execAsync.c
> --- b/src/backend/executor/execAsync.c
> *** ExecAsyncEventWait(EState *estate, long
> *** 364,369 
> --- 364,370 
>
> if ((w->events & WL_LATCH_SET) != 0)
> {
> +   ResetLatch(MyLatch);
> process_latch_set = true;
> continue;
> }



Hi, I've been testing this patch because it seemed like it would help a use
case of mine, but I can't tell whether it currently works for cases other
than a local parent table that has many child partitions which happen to be
foreign tables. Does it? I was hoping to use it for a case like:

select x, sum(y) from one_remote_table
union all
select x, sum(y) from another_remote_table
union all
select x, sum(y) from a_third_remote_table


but while aggregates do appear to be pushed down, it seems that the remote
tables are being queried in sequence. Am I doing something wrong?


Re: [HACKERS] asynchronous execution

2017-02-03 Thread Antonin Houska
Kyotaro HORIGUCHI  wrote:

> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*** ExecAsyncEventWait(EState *estate, long
*** 364,369 
--- 364,370 
  
if ((w->events & WL_LATCH_SET) != 0)
{
+   ResetLatch(MyLatch);
process_latch_set = true;
continue;
}
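Why the missing ResetLatch caused an infinite loop can be illustrated with a level-triggered flag analogue in Python, with threading.Event standing in for the process latch (this is an illustration, not the actual latch code):

```python
import threading

latch = threading.Event()   # level-triggered, like a process latch
results = []

def remote_worker():
    results.append("data ready")
    latch.set()              # wake the waiting executor

t = threading.Thread(target=remote_worker)
t.start()

latch.wait()                 # returns once the latch has been set
latch.clear()                # the ResetLatch() analogue: without this,
                             # every later wait() returns immediately and
                             # the event loop spins without making progress
t.join()
```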

Actually, it is only _almost_ fixed, because at some point one of the following

Assert(areq->state == ASYNC_WAITING);

statements fired. I think it was the immediately following one, but I can
imagine the same happening in the branch

if (process_latch_set)
...

I think the wants_process_latch field of PendingAsyncRequest is not useful
alone because the process latch can be set for reasons completely unrelated to
the asynchronous processing. If the asynchronous node should use latch to
signal its readiness, I think an additional flag is needed in the request
which tells ExecAsyncEventWait that the latch was set by the asynchronous
node.

BTW, do we really need the ASYNC_CALLBACK_PENDING state? I can imagine the
async node either changing ASYNC_WAITING directly to ASYNC_COMPLETE, or
leaving it ASYNC_WAITING if the data is not ready.


In addition, the following comments are based only on code review, I didn't
verify my understanding experimentally:

* Isn't it possible for AppendState.as_asyncresult to contain multiple
  responses from the same async node? Since the array stores TupleTableSlot
  instead of the actual tuple (so multiple items of as_asyncresult point to
  the same slot), I suspect the slot contents might not be defined when the
  Append node eventually tries to return it to the upper plan.

* For the WaitEvent subsystem to work, I think postgres_fdw should keep a
  separate libpq connection per node, not per user mapping. Currently the
  connections are cached by user mapping, but it's legal to locate multiple
  child postgres_fdw nodes of Append plan on the same remote server. I expect
  that these "co-located" nodes would currently use the same user mapping and
  therefore the same connection.
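The caching behavior being described can be modeled in a few lines (the function and key names here are hypothetical, not postgres_fdw's actual code): a cache keyed by user mapping hands the same connection to co-located sibling scans, so their remote queries cannot proceed concurrently.

```python
# Toy model of a connection cache keyed by user mapping: two Append
# children on the same remote server receive the SAME connection, so
# their queries must be serialized on it.
connection_cache = {}

class Connection:
    def __init__(self, key):
        self.key = key  # stands in for a libpq PGconn

def get_connection(user_mapping_key):
    if user_mapping_key not in connection_cache:
        connection_cache[user_mapping_key] = Connection(user_mapping_key)
    return connection_cache[user_mapping_key]

scan_a = get_connection(("server1", "alice"))
scan_b = get_connection(("server1", "alice"))   # co-located sibling scan
print(scan_a is scan_b)  # True: shared connection
```

Keying the cache per plan node instead would give each sibling its own connection, at the cost of more remote sessions.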

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at




Re: [HACKERS] asynchronous execution

2017-01-31 Thread Kyotaro HORIGUCHI
Thank you.

At Wed, 1 Feb 2017 14:11:58 +0900, Michael Paquier  
wrote in 

Re: [HACKERS] asynchronous execution

2017-01-31 Thread Michael Paquier
On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI
 wrote:
> I noticed that this patch is conflicting with 665d1fa (Logical
> replication) so I rebased this. Only executor/Makefile
> conflicted.

The patches still apply; moved to CF 2017-03. Be aware of this:
$ git diff HEAD~6 --check
contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces.
+   PendingAsyncRequest *areq,
contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces.
+   bool reinit);
src/backend/utils/resowner/resowner.c:1332: new blank line at EOF.
-- 
Michael




Re: [HACKERS] asynchronous execution

2016-11-28 Thread Kyotaro HORIGUCHI
Hello,

I cannot respond until next Monday, so I am moving this to the next CF
myself.

At Tue, 15 Nov 2016 20:25:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20161115.202513.268072050.horiguchi.kyot...@lab.ntt.co.jp>
> Hello, this is a maintenance post of rebased patches.
> I added a change of ResourceOwnerData missed in 0005.
> 
> At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI 
>  wrote in 
> <20161031.103912.217430542.horiguchi.kyot...@lab.ntt.co.jp>
> > This a PoC patch of asynchronous execution feature, based on a
> > executor infrastructure Robert proposed. These patches are
> > rebased on the current master.
> > 
> > 0001-robert-s-2nd-framework.patch
> > 
> >  Robert's executor async infrastructure. Async-driver nodes
> >  register their async-capable children, and synchronization and data
> >  transfer are done out of band of the ordinary ExecProcNode channel,
> >  so async execution no longer disturbs async-unaware nodes or slows
> >  them down.
> > 
> > 0002-Fix-some-bugs.patch
> > 
> >  Some fixes for 0001 to work. This is just to preserve the shape
> >  of 0001 patch.
> > 
> > 0003-Modify-async-execution-infrastructure.patch
> > 
> >  The original infrastructure doesn't work when multiple foreign
> >  tables are on the same connection. This makes it work.
> > 
> > 0004-Make-postgres_fdw-async-capable.patch
> > 
> >  Makes postgres_fdw work asynchronously.
> > 
> > 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch
> > 
> >  This addresses a problem Robert pointed out about the 0001 patch:
> >  the WaitEventSet used for async execution can be leaked on error.
> > 
> > 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch
> > 
> >  ExecAppend gets a bit slower due to branch misprediction
> >  penalties. This fixes it by using the unlikely() macro.
> > 
> > 0007-Add-instrumentation-to-async-execution.patch
> > 
> >  As described above for 0001, the async infrastructure conveys
> >  tuples outside the ExecProcNode channel, so EXPLAIN ANALYZE requires
> >  special treatment to show sane results. This patch attempts that.
> > 
> > 
> > A result of a performance measurement is in this message.
> > 
> > https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyot...@lab.ntt.co.jp
> > 
> > 
> > | t0  - SELECT sum(a) FROM ;
> > | pl  - SELECT sum(a) FROM <4 local children>;
> > | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> > | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
> > ...
> > | async
> > |   t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my 
> > env..)
> > |   pl: 1617.20 (  3.51)  1.26% faster (ditto)
> > |  pf0: 6680.95 (478.72)  19.5% faster
> > |  pf1: 1886.87 ( 36.25)  77.1% faster

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous execution

2016-10-25 Thread Kyotaro HORIGUCHI
Hi, this is the 7th patch, which makes instrumentation work.

EXPLAIN ANALYZE shows the following result with the previous patch set.

| Aggregate  (cost=820.25..820.26 rows=1 width=8) (actual time=4324.676..4324.676 rows=1 loops=1)
|   ->  Append  (cost=0.00..791.00 rows=11701 width=4) (actual time=0.910..3663.882 rows=400 loops=1)
| ->  Foreign Scan on ft10  (cost=100.00..197.75 rows=2925 width=4)
|   (never executed)
| ->  Foreign Scan on ft20  (cost=100.00..197.75 rows=2925 width=4)
|   (never executed)
| ->  Foreign Scan on ft30  (cost=100.00..197.75 rows=2925 width=4)
|   (never executed)
| ->  Foreign Scan on ft40  (cost=100.00..197.75 rows=2925 width=4)
|   (never executed)
| ->  Seq Scan on pf0  (cost=0.00..0.00 rows=1 width=4)
|  (actual time=0.004..0.004 rows=0 loops=1)

The current instrumentation code assumes that a node, once requested,
always either returns a tuple or reports the end of tuples. This async
framework has two points of executing the underlying nodes,
ExecAsyncRequest and ExecAsyncEventLoop, so I'm not sure whether this
is appropriate, but it seems to show sane numbers anyway.

| Aggregate  (cost=820.25..820.26 rows=1 width=8) (actual time=4571.205..4571.206 rows=1 loops=1)
|   ->  Append  (cost=0.00..791.00 rows=11701 width=4) (actual time=1.362..3893.114 rows=400 loops=1)
| ->  Foreign Scan on ft10  (cost=100.00..197.75 rows=2925 width=4)
|(actual time=1.056..770.863 rows=100 loops=1)
| ->  Foreign Scan on ft20  (cost=100.00..197.75 rows=2925 width=4)
|(actual time=0.461..767.840 rows=100 loops=1)
| ->  Foreign Scan on ft30  (cost=100.00..197.75 rows=2925 width=4)
|(actual time=0.474..782.547 rows=100 loops=1)
| ->  Foreign Scan on ft40  (cost=100.00..197.75 rows=2925 width=4)
|(actual time=0.156..765.920 rows=100 loops=1)
| ->  Seq Scan on pf0  (cost=0.00..0.00 rows=1 width=4) (never executed)


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 35c60a46f49aab72d492c798ff7eb8fc0e672250 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution

Make explain analyze give sane result when async execution has taken
place.
---
 src/backend/executor/execAsync.c  | 19 +++
 src/backend/executor/instrument.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	PendingAsyncRequest *areq = NULL;
 	int		nasync = estate->es_num_pending_async;
 
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
 	 * available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	if (areq->state == ASYNC_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
 		ExecAsyncResponse(estate, areq);
+		if (areq->requestee->instrument)
+			InstrStopNode(requestee->instrument,
+		  TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
 
 		return;
 	}
 
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
 	/* No result available now, make this node pending */
 	estate->es_num_pending_async++;
 }
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
+			if (areq->requestee->instrument)
+InstrStartNode(areq->requestee->instrument);
+
 			/* Skip it if not pending. */
 			if (areq->state == ASYNC_CALLBACK_PENDING)
 			{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 if (requestor == areq->requestor)
 	requestor_done = true;
 ExecAsyncResponse(estate, areq);
+
+if (areq->requestee->instrument)
+	InstrStopNode(areq->requestee->instrument,
+  TupIsNull((TupleTableSlot*)areq->result) ?
+  0.0 : 1.0);
 			}
+			else if (areq->requestee->instrument)
+InstrStopNode(areq->requestee->instrument, 0);
 		}
 
 		/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */

Re: [HACKERS] asynchronous execution

2016-10-25 Thread Kyotaro HORIGUCHI
This is the version rebased onto the current master (0001-0004), with
the resowner stuff (0005) and unlikely() (0006) added.

At Tue, 18 Oct 2016 10:30:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20161018.103051.30820907.horiguchi.kyot...@lab.ntt.co.jp>
> > > - Errors in the executor can leak the WaitEventSet.  Probably we need
> > > to modify ResourceOwners to be able to own WaitEventSets.
> 
> WaitEventSet itself is not leaked but epoll-fd should be closed
> at failure. This seems doable with TRY-CATCHing in
> ExecAsyncEventLoop. (not yet)

Haha, that was silly talk. The wait event set can continue to live
after a timeout, and an error can happen at any point after that. I
added an entry for wait event sets to the resource owner mechanism and
hung the ones created in ExecAsyncEventWait on
TopTransactionResourceOwner. WaitLatchOrSocket deliberately doesn't do
this, so as not to change the current behavior. WaitEventSet has no
identifier usable by resowner.c, so I currently use its address
(pointer value) for the purpose. Patch 0005 does that.

> I measured performance and had the following result.
> 
> t0  - SELECT sum(a) FROM ;
> pl  - SELECT sum(a) FROM <4 local children>;
> pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
> pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
> 
> The result is written as "time (std dev )"
> 
> sync
>   t0: 3820.33 (  1.88)
>   pl: 1608.59 ( 12.06)
>  pf0: 7928.29 ( 46.58)
>  pf1: 8023.16 ( 26.43)
> 
> async
>   t0: 3806.31 (  4.49)0.4% faster (should be error)
>   pl: 1629.17 (  0.29)1.3% slower
>  pf0: 6447.07 ( 25.19)   18.7% faster
>  pf1: 1876.80 ( 47.13)   76.6% faster
> 
> t0 is not affected since the ExecProcNode stuff has gone.
> 
> pl is getting a bit slower. (almost the same to simple seqscan of
> the previous patch) This should be a misprediction penalty.

Using the unlikely() macro in ExecAppend seems to have shaken off the
degradation.

sync
  t0: 3919.49 (  5.95)
  pl: 1637.95 (  0.75)
 pf0: 8304.20 ( 43.94)
 pf1: 8222.09 ( 28.20)

async
  t0: 3885.84 ( 40.20)  0.86% faster (should be error but stable on my env..)
  pl: 1617.20 (  3.51)  1.26% faster (ditto)
 pf0: 6680.95 (478.72)  19.5% faster
 pf1: 1886.87 ( 36.25)  77.1% faster

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 25eba7e506228ab087e8b743efb039286a8251c4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/6] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c |  49 
 src/backend/executor/Makefile   |   4 +-
 src/backend/executor/README |  43 +++
 src/backend/executor/execAmi.c  |   5 +
 src/backend/executor/execAsync.c| 462 
 src/backend/executor/nodeAppend.c   | 162 ++-
 src/backend/executor/nodeForeignscan.c  |  49 
 src/backend/nodes/copyfuncs.c   |   1 +
 src/backend/nodes/outfuncs.c|   1 +
 src/backend/nodes/readfuncs.c   |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h|  29 ++
 src/include/executor/nodeAppend.h   |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h|  15 ++
 src/include/nodes/execnodes.h   |  57 +++-
 src/include/nodes/plannodes.h   |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 906d6e6..c480945 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 			 UpperRelationKind stage,
 			 RelOptInfo *input_rel,
 			 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+			PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+  PendingAsyncRequest *areq,
+  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+		   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	

Re: [HACKERS] asynchronous execution

2016-10-17 Thread Kyotaro HORIGUCHI
Hello, this works, but ExecAppend suffers a bit of degradation.

At Mon, 03 Oct 2016 19:46:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20161003.194632.204401048.horiguchi.kyot...@lab.ntt.co.jp>
> > Some notes:
> > 
> > - EvalPlanQual rechecks are broken.

This is fixed by adding (restoring) async cancellation.

> > - EXPLAIN ANALYZE instrumentation is broken.

EXPLAIN ANALYZE seems to work, but async-specific information is
not available yet.

> > - ExecReScanAppend is broken, because the async stuff needs some way
> > of canceling an async request and I didn't invent anything like that
> > yet.

Fixed in the same way as the EvalPlanQual rechecks.

> > - The postgres_fdw changes pretend to be async but aren't actually.
> > It's just a demo of (part of) the interface at this point.

Applied my previous patch with some modifications.

> > - The postgres_fdw changes also report all pg-fdw paths as
> > async-capable, but actually the direct-modify ones aren't, so the
> > regression tests fail.

All actions other than scans now call vacate_connection() before
using a connection.

> > - Errors in the executor can leak the WaitEventSet.  Probably we need
> > to modify ResourceOwners to be able to own WaitEventSets.

The WaitEventSet itself is not leaked, but the epoll fd should be
closed on failure. This seems doable by TRY-CATCHing in
ExecAsyncEventLoop. (Not done yet.)

> > - There are probably other bugs, too.
> > 
> > Whee!
> > 
> > Note that I've tried to solve the re-entrancy problems by (1) putting
> > all of the event loop's state inside the EState rather than in local
> > variables and (2) having the function that is called to report arrival
> > of a result be thoroughly different than the function that is used to
> > return a tuple to a synchronous caller.
> > 
> > Comments welcome, if you're feeling brave enough to look at anything
> > this half-baked.

This doesn't cause reentry, since it no longer bubbles tuples up
through async-unaware nodes; the framework passes tuples through
private channels between requestors and requestees.

Anyway, I amended this, made postgres_fdw async, and finally all
regression tests passed with minor modifications. The attached patches
are the following.

0001-robert-s-2nd-framework.patch
 The patch Robert shown upthread

0002-Fix-some-bugs.patch
 A small patch to fix compilation errors in 0001.

0003-Modify-async-execution-infrastructure.patch
 Several modifications to the infrastructure. The details are
 described after the measurements below.

0004-Make-postgres_fdw-async-capable.patch
 A truly asynchronous postgres_fdw.

gentblr.sql, testrun.sh, calc.pl
 Performance test script suite:

   gentblr.sql - creates the test tables,
   testrun.sh - performs a single test run, and
   calc.pl - drives testrun.sh and summarizes its results.


I measured performance and had the following result.

t0  - SELECT sum(a) FROM ;
pl  - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time (std dev )"

sync
  t0: 3820.33 (  1.88)
  pl: 1608.59 ( 12.06)
 pf0: 7928.29 ( 46.58)
 pf1: 8023.16 ( 26.43)

async
  t0: 3806.31 (  4.49)0.4% faster (should be error)
  pl: 1629.17 (  0.29)1.3% slower
 pf0: 6447.07 ( 25.19)   18.7% faster
 pf1: 1876.80 ( 47.13)   76.6% faster

t0 is not affected since the ExecProcNode changes are gone.

pl gets a bit slower (almost the same as the simple seqscan case with
the previous patch). This should be a branch misprediction penalty.

pf0, pf1 are faster as expected.




Below is a summary of the modifications made by the 0002 and 0003 patches.

execAsync.c, execnodes.h:

  - Added include "pgstat.h" to use WAIT_EVENT_ASYNC_WAIT.

  - Changed the interface of ExecAsyncRequest to return whether a
    tuple is immediately available.

  - Made ExecAsyncConfigureWait return whether it registered at least
    one wait event. This lets the caller (ExecAsyncEventWait) know
    that it has at least one event to wait on (for safety).

    If two or more postgres_fdw nodes share one connection, only one
    of them can be waited on at a time. It is the FDW driver's
    responsibility to ensure that at least one wait event is added;
    if none is, WaitEventSetWait silently waits forever.

  - There were separate areq->callback_pending and
    areq->request_complete flags, but they always change together, so
    they are replaced with a single state variable, areq->state. A new
    enum, AsyncRequestState, for areq->state is added in execnodes.h.


nodeAppend.c:

  - Return a tuple immediately if ExecAsyncRequest says that a
tuple is available.

  - Reduced the nesting level of the for(;;) loop.

nodeForeignscan.[ch], fdwapi.h, execProcnode.c:

  - Calling postgresIterateForeignScan directly can yield tuples of
    the wrong shape; call ExecForeignScan instead.

  - Changed the interface of AsyncConfigureWait in the same way as in
    execAsync.c.

  - Added ShutdownForeignScan interface.

createplan.c, ruleutils.c, plannodes.h:

  - 

Re: [HACKERS] asynchronous execution

2016-10-05 Thread Robert Haas
On Tue, Oct 4, 2016 at 7:53 AM, Amit Khandekar  wrote:
> Also, parent pointers are not required in the new design. Thinking of
> parent pointers, now it seems the event won't get bubbled up the tree
> with the new design. But still, I think it's possible to switch over
> to the other asynchronous tree when some node in the current subtree
> is waiting. But I am not sure, will think more on that.

The bubbling-up still happens, because each node that made an async
request gets a callback with the final response - and if it is itself
the recipient of an async request, it can use that callback to respond
to that request in turn.  This version doesn't bubble up through
non-async-aware nodes, but that might be a good thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous execution

2016-10-04 Thread Amit Khandekar
On 4 October 2016 at 02:30, Robert Haas  wrote:
> On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar  
> wrote:
>> On 24 September 2016 at 06:39, Robert Haas  wrote:
>>> Since Kyotaro Horiguchi found that my previous design had a
>>> system-wide performance impact due to the ExecProcNode changes, I
>>> decided to take a different approach here: I created an async
>>> infrastructure where both the requestor and the requestee have to be
>>> specifically modified to support parallelism, and then modified Append
>>> and ForeignScan to cooperate using the new interface.  Hopefully that
>>> means that anything other than those two nodes will suffer no
>>> performance impact.  Of course, it might have other problems
>>
>> I see that the reason why you re-designed the asynchronous execution
>> implementation is because the earlier implementation showed
>> performance degradation in local sequential and local parallel scans.
>> But I checked that the ExecProcNode() changes were not that
>> significant as to cause the degradation.
>
> I think we need some testing to prove that one way or the other.  If
> you can do some - say on a plan with multiple nested loop joins with
> inner index-scans, which will call ExecProcNode() a lot - that would
> be great.  I don't think we can just rely on "it doesn't seem like it
> should be slower"
Agreed. I will come up with some tests.

> , though - ExecProcNode() is too important a function
> for us to guess at what the performance will be.

Also, parent pointers are not required in the new design. Thinking of
parent pointers, it now seems the event won't get bubbled up the tree
with the new design. But still, I think it's possible to switch over
to the other asynchronous tree when some node in the current subtree
is waiting. I am not sure, though; I will think more on that.

>
> The thing I'm really worried about with either implementation is what
> happens when we start to add asynchronous capability to multiple
> nodes.  For example, if you imagine a plan like this:
>
> Append
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
> -> Seq Scan
> -> Hash Join
>   -> Foreign Scan
>   -> Hash
> -> Seq Scan
>
> In order for this to run asynchronously, you need not only Append and
> Foreign Scan to be async-capable, but also Hash Join.  That's true in
> either approach.  Things are slightly better with the original
> approach, but the basic problem is there in both cases.  So it seems
> we need an approach that will make adding async capability to a node
> really cheap, which seems like it might be a problem.

Yes, we might have to deal with this.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous execution

2016-10-03 Thread Robert Haas
On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar  wrote:
> On 24 September 2016 at 06:39, Robert Haas  wrote:
>> Since Kyotaro Horiguchi found that my previous design had a
>> system-wide performance impact due to the ExecProcNode changes, I
>> decided to take a different approach here: I created an async
>> infrastructure where both the requestor and the requestee have to be
>> specifically modified to support parallelism, and then modified Append
>> and ForeignScan to cooperate using the new interface.  Hopefully that
>> means that anything other than those two nodes will suffer no
>> performance impact.  Of course, it might have other problems
>
> I see that the reason why you re-designed the asynchronous execution
> implementation is because the earlier implementation showed
> performance degradation in local sequential and local parallel scans.
> But I checked that the ExecProcNode() changes were not that
> significant as to cause the degradation.

I think we need some testing to prove that one way or the other.  If
you can do some - say on a plan with multiple nested loop joins with
inner index-scans, which will call ExecProcNode() a lot - that would
be great.  I don't think we can just rely on "it doesn't seem like it
should be slower", though - ExecProcNode() is too important a function
for us to guess at what the performance will be.

The thing I'm really worried about with either implementation is what
happens when we start to add asynchronous capability to multiple
nodes.  For example, if you imagine a plan like this:

Append
-> Hash Join
  -> Foreign Scan
  -> Hash
-> Seq Scan
-> Hash Join
  -> Foreign Scan
  -> Hash
-> Seq Scan

In order for this to run asynchronously, you need not only Append and
Foreign Scan to be async-capable, but also Hash Join.  That's true in
either approach.  Things are slightly better with the original
approach, but the basic problem is there in both cases.  So it seems
we need an approach that will make adding async capability to a node
really cheap, which seems like it might be a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous execution

2016-10-03 Thread Kyotaro HORIGUCHI
Thank you for the thought.


At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas  wrote 
in 
> [ Adjusting subject line to reflect the actual topic of discussion better. ]
> 
> On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas  wrote:
> > On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar  
> > wrote:
> >> For e.g., in the above plan which you specified, suppose :
> >> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so
> >> is
> >> waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
> >> 2. The event wait list already has foreign scan on a that is on a different
> >> subtree.
> >> 3. This foreign scan a happens to be ready, so in
> >> ExecAsyncWaitForNode (), ExecDispatchNode(foreign_scan_a) is called,
> >> which returns with result_ready.
> >> 4. Since it returns result_ready, it's parent node is now inserted in the
> >> callbacks array, and so it's parent (Append) is executed.
> >> 5. But, this Append planstate is already in the middle of executing Hash
> >> join, and is waiting for HashJoin.
> >
> > Ah, yeah, something like that could happen.  I've spent much of this
> > week working on a new design for this feature which I think will avoid
> > this problem.  It doesn't work yet - in fact I can't even really test
> > it yet.  But I'll post what I've got by the end of the day today so
> > that anyone who is interested can look at it and critique.
> 
> Well, I promised to post this, so here it is.  It's not really working
> all that well at this point, and it's definitely not doing anything
> that interesting, but you can see the outline of what I have in mind.
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface.  Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact.  Of course, it might have other problems

The previous framework didn't require a parent node to distinguish
async-capable children from incapable ones; the ExecProcNode changes
were needed for exactly that reason. This new one instead removes the
ExecProcNode changes by distinguishing the two kinds of node in
async-aware parents, that is, Append. Async-unaware nodes are no
longer involved in the tuple bubbling-up mechanism, so the reentrancy
problem doesn't seem to occur.

On the other hand, consider, for example, the following plan,
regardless of its practicality (there should be a better example):

(Async-unaware node)
 - NestLoop
   - Append
     - n * ForeignScan
   - Append
     - n * ForeignScan

If the NestLoop and Appends are async-aware, all of the ForeignScans
can run asynchronously under the previous framework. The topmost
NestLoop is awakened when the firing of any ForeignScan makes a tuple
bubble up to it. This is thanks to the property, provided by the
ExecProcNode changes, that a parent need not distinguish async-aware
children from unaware ones.

With the new framework, in order to do the same thing, ExecAppend
would in turn have to behave differently depending on whether its
parent is async-aware. Doing so looks bothersome, though I'm not
confident about that.

I will examine this further intensively, especially regarding
performance degradation and obstacles to completing this.


> Some notes:
> 
> - EvalPlanQual rechecks are broken.
> - EXPLAIN ANALYZE instrumentation is broken.
> - ExecReScanAppend is broken, because the async stuff needs some way
> of canceling an async request and I didn't invent anything like that
> yet.
> - The postgres_fdw changes pretend to be async but aren't actually.
> It's just a demo of (part of) the interface at this point.
> - The postgres_fdw changes also report all pg-fdw paths as
> async-capable, but actually the direct-modify ones aren't, so the
> regression tests fail.
> - Errors in the executor can leak the WaitEventSet.  Probably we need
> to modify ResourceOwners to be able to own WaitEventSets.
> - There are probably other bugs, too.
> 
> Whee!
> 
> Note that I've tried to solve the re-entrancy problems by (1) putting
> all of the event loop's state inside the EState rather than in local
> variables and (2) having the function that is called to report arrival
> of a result be thoroughly different than the function that is used to
> return a tuple to a synchronous caller.
> 
> Comments welcome, if you're feeling brave enough to look at anything
> this half-baked.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous execution

2016-09-29 Thread Kyotaro HORIGUCHI
Hello, thank you for the comment.

At Wed, 28 Sep 2016 10:00:08 +0530, Amit Khandekar  
wrote in 

Re: [HACKERS] asynchronous execution

2016-09-28 Thread Kyotaro HORIGUCHI
Sorry for the delayed response. I'll have enough time from now on,
and will address this.

At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas  wrote 
in 
> Well, I promised to post this, so here it is.  It's not really working
> all that well at this point, and it's definitely not doing anything
> that interesting, but you can see the outline of what I have in mind.
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface.  Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact.  Of course, it might have other problems
> 
> Some notes:
> 
> - EvalPlanQual rechecks are broken.
> - EXPLAIN ANALYZE instrumentation is broken.
> - ExecReScanAppend is broken, because the async stuff needs some way
> of canceling an async request and I didn't invent anything like that
> yet.
> - The postgres_fdw changes pretend to be async but aren't actually.
> It's just a demo of (part of) the interface at this point.
> - The postgres_fdw changes also report all pg-fdw paths as
> async-capable, but actually the direct-modify ones aren't, so the
> regression tests fail.
> - Errors in the executor can leak the WaitEventSet.  Probably we need
> to modify ResourceOwners to be able to own WaitEventSets.
> - There are probably other bugs, too.
> 
> Whee!
> 
> Note that I've tried to solve the re-entrancy problems by (1) putting
> all of the event loop's state inside the EState rather than in local
> variables and (2) having the function that is called to report arrival
> of a result be thoroughly different than the function that is used to
> return a tuple to a synchronous caller.
> 
> Comments welcome, if you're feeling brave enough to look at anything
> this half-baked.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous execution

2016-09-27 Thread Amit Khandekar
On 24 September 2016 at 06:39, Robert Haas  wrote:
> Since Kyotaro Horiguchi found that my previous design had a
> system-wide performance impact due to the ExecProcNode changes, I
> decided to take a different approach here: I created an async
> infrastructure where both the requestor and the requestee have to be
> specifically modified to support parallelism, and then modified Append
> and ForeignScan to cooperate using the new interface.  Hopefully that
> means that anything other than those two nodes will suffer no
> performance impact.  Of course, it might have other problems

I see that the reason why you re-designed the asynchronous execution
implementation is because the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not significant
enough to cause the degradation. It will not call
ExecAsyncWaitForNode() unless that node supports asynchronism. Do you
feel there is anywhere else in the implementation that is really
causing this degradation? That previous implementation has some
issues, but they seemed solvable. We could resolve the plan state
recursion issue by explicitly making sure that the same plan state
does not get called again while it is already executing.

Thanks
-Amit Khandekar




Re: [HACKERS] asynchronous execution

2016-09-23 Thread Robert Haas
[ Adjusting subject line to reflect the actual topic of discussion better. ]

On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas  wrote:
> On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar  
> wrote:
>> For e.g., in the above plan which you specified, suppose :
>> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so
>> is
>> waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
>> 2. The event wait list already has foreign scan on a that is on a different
>> subtree.
>> 3. This foreign scan a happens to be ready, so in
>> ExecAsyncWaitForNode (), ExecDispatchNode(foreign_scan_a) is called,
>> which returns with result_ready.
>> 4. Since it returns result_ready, its parent node is now inserted in the
>> callbacks array, and so its parent (Append) is executed.
>> 5. But, this Append planstate is already in the middle of executing Hash
>> join, and is waiting for HashJoin.
>
> Ah, yeah, something like that could happen.  I've spent much of this
> week working on a new design for this feature which I think will avoid
> this problem.  It doesn't work yet - in fact I can't even really test
> it yet.  But I'll post what I've got by the end of the day today so
> that anyone who is interested can look at it and critique.

Well, I promised to post this, so here it is.  It's not really working
all that well at this point, and it's definitely not doing anything
that interesting, but you can see the outline of what I have in mind.
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface.  Hopefully that
means that anything other than those two nodes will suffer no
performance impact.  Of course, it might have other problems.

Some notes:

- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
- Errors in the executor can leak the WaitEventSet.  Probably we need
to modify ResourceOwners to be able to own WaitEventSets.
- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different from the function that is used to
return a tuple to a synchronous caller.

Comments welcome, if you're feeling brave enough to look at anything
this half-baked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


async-wip-2016-09-23.patch
Description: binary/octet-stream



Re: [HACKERS] Asynchronous execution on FDW

2015-08-10 Thread Robert Haas
On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 I've marked this as rejected in the commitfest, because others are
 working on a more general solution with parallel workers. That's still
 work-in-progress, and it's not certain if it's going to make it into
 9.6, but if it does it will largely render this obsolete. We can revisit
 this patch later in the release cycle, if the parallel scan patch hasn't
 solved the same use case by then.

I think the really important issue for this patch is the one discussed here:

http://www.postgresql.org/message-id/ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com

You raised an important issue there but never really expressed an
opinion on the points I raised, here or on the other thread.  And
neither did anyone else except the patch author who, perhaps
unsurprisingly, thinks it's OK.  I wish we could get more discussion
about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Asynchronous execution on FDW

2015-08-10 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
  I've marked this as rejected in the commitfest, because others are
  working on a more general solution with parallel workers. That's still
  work-in-progress, and it's not certain if it's going to make it into
  9.6, but if it does it will largely render this obsolete. We can revisit
  this patch later in the release cycle, if the parallel scan patch hasn't
  solved the same use case by then.
 
 I think the really important issue for this patch is the one discussed here:
 
 http://www.postgresql.org/message-id/ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com

I agree that it'd be great to figure out the answer to #2, but I'm also
of the opinion that we can either let the user tell us through the use
of the GUCs proposed in the patch or simply not worry about the
potential for time wastage associated with starting them all at once, as
you suggested there.

 You raised an important issue there but never really expressed an
 opinion on the points I raised, here or on the other thread.  And
 neither did anyone else except the patch author who, perhaps
 unsurprisingly, thinks it's OK.  I wish we could get more discussion
 about that.

When I read the proposal, I had the same reaction: it didn't seem
like quite the right place, and it further bothered me that it was
specific to FDWs.

Perhaps not surprisingly, as I authored it, but I'm still a fan of my
proposal #1 here:

http://www.postgresql.org/message-id/20131104032604.gb2...@tamriel.snowman.net

More generally, I completely agree with the position (I believe yours,
but I might be misremembering) that we want to have this async
capability independently of, and in addition to, parallel scan.  I
don't believe one obviates the advantages of the other.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Asynchronous execution on FDW

2015-08-10 Thread Heikki Linnakangas
I've marked this as rejected in the commitfest, because others are
working on a more general solution with parallel workers. That's still
work-in-progress, and it's not certain if it's going to make it into
9.6, but if it does it will largely render this obsolete. We can revisit
this patch later in the release cycle, if the parallel scan patch hasn't
solved the same use case by then.

- Heikki





Re: [HACKERS] Asynchronous execution on FDW

2015-07-24 Thread Kouhei Kaigai
Hello Horiguchi-san,

   As for ForeignScan, it is merely an API for FDW and does nothing
   substantial so it would have nothing special to do. As for
   postgres_fdw, current patch restricts one execution per one
   foreign server at once by itself. We would have to provide
   another execution management if we want to have two or more
   simultaneous scans per one foreign server at once.
  
  Yep, your 4th patch defines a new callback to FdwRoutines and
  5th patch implements postgres_fdw specific portion.
  It shall work for distributed / sharded database environment well,
  however, its benefit is around ForeignScan only.
  Once management node kicks underlying SeqScan, ForeignScan or
  others in parallel, it also enables to run local heap scan
  asynchronously.
 
 I suppose SeqScan doesn't need an async kick since its startup cost
 is extremely low, close to nothing. (Would fetching the first several
 pages boost seqscans?) On the other hand, sort/hash would be a field
 where asynchronous execution is effective.

Startup cost is not the only advantage of asynchronous execution.
If a background worker prefetches the records that will be read soon,
while other tasks are in progress, the latency to fetch the next
record is much lower than on the usual execution path.
Suppose the next record is on neither the shared buffers nor the
operating system's page cache.
First, the upper node calls heap_getnext() to fetch the next record;
it looks up the target block in the shared buffers, then issues a
read(2) system call, and the operating system puts the calling
process to sleep until the block has been read from storage.
If an asynchronous worker has already gone through this painful code
path and the records to be read are ready at the top of a queue, it
reduces the I/O wait time dramatically.

   Sorry for the focusless discussion but does this answer some of
   your question?
  
  Hmm... Its advantage is still unclear to me. However, it is not
  fair to hijack this thread with my idea.
 
 It would be more advantageous if join/sort pushdown on fdw comes,
 where start-up cost could be extremely high...

Not only FDW. I intend to combine ParallelAppend with another idea
I previously posted, to run table joins in parallel.
In the case of partitioned foreign tables, the planner probably needs
to consider (1) FDW scan + local serial join, (2) FDW scan + local
parallel join, or (3) FDW remote join, according to the cost.

* [idea] table partition + hash join:
http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f6...@bpxm15gp.gisp.nec.co.jp

Anyway, let's have a further discussion in another thread.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com




Re: [HACKERS] Asynchronous execution on FDW

2015-07-24 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 23 Jul 2015 09:38:39 +, Kouhei Kaigai kai...@ak.jp.nec.com wrote 
in 9a28c8860f777e439aa12e8aea7694f80111b...@bpxm15gp.gisp.nec.co.jp
 I expected workloads like single shot scan on a partitioned large
 fact table on DWH system. Yep, if workload is expected to rescan
 so frequently, its expected cost shall be higher (by the cost to
 launch bgworker) than existing Append, then planner will kick out
 this path.
 
 Regarding the interaction between Limit and ParallelMergeAppend,
 it is probably the best scenario, isn't it? If Limit picks up
 the least 1000rows from a partitioned table consists of 20 child
 tables, ParallelMergeAppend can launch 20 parallel jobs that
 picks up the least 1000rows from the child relations for each.
 Probably, it is same job done in pass_down_bound() of nodeLimit.c.

Yes, I was a bit confused. That scenario is one of the least
problematic cases.

  As for ForeignScan, it is merely an API for FDW and does nothing
  substantial so it would have nothing special to do. As for
  postgres_fdw, current patch restricts one execution per one
  foreign server at once by itself. We would have to provide
  another execution management if we want to have two or more
  simultaneous scans per one foreign server at once.
 
 Yep, your 4th patch defines a new callback to FdwRoutines and
 5th patch implements postgres_fdw specific portion.
 It shall work for distributed / sharded database environment well,
 however, its benefit is around ForeignScan only.
 Once management node kicks underlying SeqScan, ForeignScan or
 others in parallel, it also enables to run local heap scan
 asynchronously.

I suppose SeqScan doesn't need an async kick since its startup cost
is extremely low, close to nothing. (Would fetching the first several
pages boost seqscans?) On the other hand, sort/hash would be a field
where asynchronous execution is effective.

  Sorry for the focusless discussion but does this answer some of
  your question?
 
 Hmm... Its advantage is still unclear to me. However, it is not
 fair to hijack this thread with my idea.

It would be more advantageous when join/sort pushdown on FDW comes,
where startup cost could be extremely high...

 I'll submit my design proposal about ParallelAppend toward the
 next commitfest. Please comment on it.

Ok, I'll come there.

   The expected waste of CPU or I/O is a common problem to be solved;
   however, I don't think it needs special-case handling in ForeignScan.
   What is your opinion?
  
  I agree with you that ForeignScan as the wrapper for FDWs doesn't
  need anything special for the case. I suppose for now that
  avoiding the penalty from abandoning too many speculatively
  executed scans (or other works on bg worker like sorts) would be
  a business of the upper node of FDWs, or somewhere else.
  
  However, I haven't dismissed the possibility that some common
  works related to resource management could be integrated into
  executor (or even into planner), but I see none for now.
 
 I also agree that it is eventually needed, but it may not be
 supported in the first version.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Asynchronous execution on FDW

2015-07-23 Thread Kouhei Kaigai
  If we have a ParallelAppend node that kicks off a background worker
  process for each underlying child node in parallel, does ForeignScan
  need to do something special?
 
 I don't see the point of the background worker in your story, but at
 least for ParallelMergeAppend, the scan would frequently be cut off by
 an upper Limit, so one more state, say "setup" - which means a worker
 is allocated but not yet started - would be useful, and the driver
 node might need to manage the number of async executions. Or the
 driven nodes might do so inversely.

I expected workloads like a single-shot scan on a large partitioned
fact table in a DWH system. Yep, if the workload is expected to
rescan that frequently, its expected cost will be higher (by the cost
of launching bgworkers) than the existing Append, and the planner
will kick out this path.

Regarding the interaction between Limit and ParallelMergeAppend,
it is probably the best scenario, isn't it? If Limit picks up
the least 1000 rows from a partitioned table consisting of 20 child
tables, ParallelMergeAppend can launch 20 parallel jobs, each of
which picks up the least 1000 rows from its child relation.
Probably, it is the same job done in pass_down_bound() of nodeLimit.c.

 As for ForeignScan, it is merely an API for FDW and does nothing
 substantial so it would have nothing special to do. As for
 postgres_fdw, current patch restricts one execution per one
 foreign server at once by itself. We would have to provide
 another execution management if we want to have two or more
 simultaneous scans per one foreign server at once.

Yep, your 4th patch defines a new callback in FdwRoutines and the
5th patch implements the postgres_fdw-specific portion.
It should work well for distributed / sharded database environments;
however, its benefit is limited to ForeignScan.
Once the management node kicks the underlying SeqScan, ForeignScan or
other nodes in parallel, it also enables local heap scans to run
asynchronously.

 Sorry for the focusless discussion but does this answer some of
 your question?

Hmm... Its advantage is still unclear to me. However, it is not
fair to hijack this thread with my idea.
I'll submit my design proposal about ParallelAppend toward the
next commitfest. Please comment on it.

  The expected waste of CPU or I/O is a common problem to be solved;
  however, I don't think it needs special-case handling in ForeignScan.
  What is your opinion?
 
 I agree with you that ForeignScan as the wrapper for FDWs doesn't
 need anything special for the case. I suppose for now that
 avoiding the penalty from abandoning too many speculatively
 executed scans (or other works on bg worker like sorts) would be
 a business of the upper node of FDWs, or somewhere else.
 
 However, I haven't dismissed the possibility that some common
 works related to resource management could be integrated into
 executor (or even into planner), but I see none for now.

I also agree that it is eventually needed, but it may not be
supported in the first version.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com





Re: [HACKERS] Asynchronous execution on FDW

2015-07-22 Thread Kouhei Kaigai
 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kyotaro HORIGUCHI
 Sent: Wednesday, July 22, 2015 4:10 PM
 To: robertmh...@gmail.com
 Cc: hlinn...@iki.fi; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] Asynchronous execution on FDW
 
 Hello, thank you for the comment.
 
 At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas robertmh...@gmail.com wrote
 in ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com
  On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
   At a quick glance, I think this has all the same problems as starting the
   execution at ExecInit phase. The correct way to do this is to kick off the
   queries in the first IterateForeignScan() call. You said that ExecProc
   phase does not fit - why not?
 
  What exactly are those problems?
 
  I can think of these:
 
  1. If the scan is parametrized, we probably can't do it for lack of
  knowledge of what they will be.  This seems easy; just don't do it in
  that case.
 
 We can put an early kick to foreign scans only for the first shot
 if we do it outside (before) ExecProc phase.
 
 Nestloop
 -> SeqScan
 -> Append
    -> Foreign (Index) Scan
    -> Foreign (Index) Scan
    ..
 
 This plan premises precise (even to some extent) estimate for
 remote query but async execution within ExecProc phase would be
 in effect for this case.
 
 
  2. It's possible that we're down inside some subtree of the plan that
  won't actually get executed.  This is trickier.
 
 As for current postgres_fdw, it is done simply abandoning queued
 result then close the cursor.
 
  Consider this:
 
  Append
  -> Foreign Scan
  -> Foreign Scan
  -> Foreign Scan
  repeat 17 more times
 
  If we don't start each foreign scan until the first tuple is fetched,
  we will not get any benefit here, because we won't fetch the first
  tuple from query #2 until we finish reading the results of query #1.
  If the result of the Append node will be needed in its entirety, we
  really, really want to launch all of those queries as early as possible.
  OTOH, if there's a Limit node with a small limit on top of the Append
  node, that could be quite wasteful.
 
 It's the nature of speculative execution, but the Limit will be
 pushed down onto every Foreign Scans near future.
 
  We could decide not to care: after all, if our limit is
  satisfied, we can just bang the remote connections shut, and if
  they wasted some CPU, well, tough luck for them.  But it would
  be nice to be smarter.  I'm not sure how, though.
 
 Appropriate fetch size will cap the harm and the case will be
 handled as I mentioned above as for postgres_fdw.

Horiguchi-san,

Let me ask an elementary question.

If we have a ParallelAppend node that kicks off a background worker
process for each underlying child node in parallel, does ForeignScan
need to do something special?

The expected waste of CPU or I/O is a common problem to be solved;
however, I don't think it needs special-case handling in ForeignScan.
What is your opinion?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com





Re: [HACKERS] Asynchronous execution on FDW

2015-07-22 Thread Kyotaro HORIGUCHI
Hello, thank you for the comment.

At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas robertmh...@gmail.com wrote 
in ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com
 On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
  At a quick glance, I think this has all the same problems as starting the
  execution at ExecInit phase. The correct way to do this is to kick off the
  queries in the first IterateForeignScan() call. You said that ExecProc
  phase does not fit - why not?
 
 What exactly are those problems?
 
 I can think of these:
 
 1. If the scan is parametrized, we probably can't do it for lack of
 knowledge of what they will be.  This seems easy; just don't do it in
 that case.

We can give foreign scans an early kick, but only for the first shot,
if we do it outside of (before) the ExecProc phase.

Nestloop
-> SeqScan
-> Append
   -> Foreign (Index) Scan
   -> Foreign (Index) Scan
   ..

This plan premises a precise (to some extent) estimate for the
remote query, but async execution within the ExecProc phase would
be effective for this case.


 2. It's possible that we're down inside some subtree of the plan that
 won't actually get executed.  This is trickier.

As for the current postgres_fdw, this is done by simply abandoning
the queued result and then closing the cursor.

 Consider this:
 
 Append
 -> Foreign Scan
 -> Foreign Scan
 -> Foreign Scan
 repeat 17 more times
 
 If we don't start each foreign scan until the first tuple is fetched,
 we will not get any benefit here, because we won't fetch the first
 tuple from query #2 until we finish reading the results of query #1.
 If the result of the Append node will be needed in its entirety, we
 really, really want to launch all of those queries as early as possible.
 OTOH, if there's a Limit node with a small limit on top of the Append
 node, that could be quite wasteful.

It's the nature of speculative execution, but the Limit will be
pushed down onto every Foreign Scan in the near future.

 We could decide not to care: after all, if our limit is
 satisfied, we can just bang the remote connections shut, and if
 they wasted some CPU, well, tough luck for them.  But it would
 be nice to be smarter.  I'm not sure how, though.

An appropriate fetch size will cap the harm, and the case will be
handled for postgres_fdw as I mentioned above.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Asynchronous execution on FDW

2015-07-22 Thread Kyotaro HORIGUCHI
Hello,

 Let me ask an elementary question.
 
 If we have a ParallelAppend node that kicks off a background worker
 process for each underlying child node in parallel, does ForeignScan
 need to do something special?

I don't see the point of the background worker in your story, but at
least for ParallelMergeAppend, the scan would frequently be cut off by
an upper Limit, so one more state, say "setup" - which means a worker
is allocated but not yet started - would be useful, and the driver
node might need to manage the number of async executions. Or the
driven nodes might do so inversely.

As for ForeignScan, it is merely an API for FDWs and does nothing
substantial, so it would have nothing special to do. As for
postgres_fdw, the current patch itself restricts execution to one
scan per foreign server at a time. We would have to provide
additional execution management if we want two or more simultaneous
scans per foreign server.

Sorry for the focusless discussion, but does this answer some of
your questions?

 The expected waste of CPU or I/O is a common problem to be solved;
 however, I don't think it needs special-case handling in ForeignScan.
 What is your opinion?

I agree with you that ForeignScan, as the wrapper for FDWs, doesn't
need anything special for this case. I suppose for now that
avoiding the penalty of abandoning too many speculatively
executed scans (or other work on bg workers, like sorts) would be
the business of the upper node of the FDWs, or somewhere else.

However, I haven't dismissed the possibility that some common work
related to resource management could be integrated into the
executor (or even into the planner), but I see none for now.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Asynchronous execution on FDW

2015-07-17 Thread Robert Haas
On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas hlinn...@iki.fi wrote:
 At a quick glance, I think this has all the same problems as starting the
 execution at ExecInit phase. The correct way to do this is to kick off the
 queries in the first IterateForeignScan() call. You said that ExecProc
 phase does not fit - why not?

What exactly are those problems?

I can think of these:

1. If the scan is parametrized, we probably can't do it for lack of
knowledge of what they will be.  This seems easy; just don't do it in
that case.
2. It's possible that we're down inside some subtree of the plan that
won't actually get executed.  This is trickier.

Consider this:

Append
-> Foreign Scan
-> Foreign Scan
-> Foreign Scan
repeat 17 more times

If we don't start each foreign scan until the first tuple is fetched,
we will not get any benefit here, because we won't fetch the first
tuple from query #2 until we finish reading the results of query #1.
If the result of the Append node will be needed in its entirety, we
really, really want to launch all of those queries as early as possible.
OTOH, if there's a Limit node with a small limit on top of the Append
node, that could be quite wasteful.  We could decide not to care:
after all, if our limit is satisfied, we can just bang the remote
connections shut, and if they wasted some CPU, well, tough luck for
them.  But it would be nice to be smarter.  I'm not sure how, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Asynchronous execution on FDW

2015-07-10 Thread Kyotaro HORIGUCHI
Hi,

 Currently there's no means to observe what it is doing from
 outside, so the additional sixth patch is to output debug
 messages about asynchronous execution.

The sixth patch was missing one message shown in the example.
Attached is the revised version.
The other patches are unchanged.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From d1ed9fe6a4e68d42653a552a680a038a0aef5683 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Fri, 10 Jul 2015 15:02:59 +0900
Subject: [PATCH 6/6] Debug message for async execution of postgres_fdw.

---
 contrib/postgres_fdw/postgres_fdw.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index dc60bcc..a8a9cc5 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -385,6 +385,25 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(routine);
 }
 
+static void
+postgresDebugLog(PgFdwScanState *fsstate, char *msg, void *ptr)
+{
+	ForeignTable *table = GetForeignTable(RelationGetRelid(fsstate->rel));
+	ForeignServer *server = GetForeignServer(table->serverid);
+
+	if (fsstate->conn)
+		ereport(LOG,
+				(errmsg("pg_fdw: [%s/%s/%p] %s",
+						get_rel_name(table->relid), server->servername,
+						fsstate->conn, msg),
+				 errhidestmt(true)));
+	else
+		ereport(LOG,
+				(errmsg("pg_fdw: [%s/%s/--] %s",
+						get_rel_name(table->relid), server->servername, msg),
+				 errhidestmt(true)));
+}
+
 /*
  * Read boolean server/table options
  * 0 is false, 1 is true, -1 is not specified
@@ -928,7 +947,10 @@ postgresStartForeignScan(ForeignScanState *node)
 	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
 
 	if (!fsstate->allow_async)
+	{
+		postgresDebugLog(fsstate, "Async start admistratively denied.", NULL);
 		return false;
+	}
 
 	/*
 	 * On the current implemnt, scans can run asynchronously if it is the
@@ -943,9 +965,11 @@ postgresStartForeignScan(ForeignScanState *node)
 		 * for details
 		 */
 		fetch_more_data(fsstate, START_ONLY);
+		postgresDebugLog(fsstate, "Async exec started.", fsstate->conn);
 		return true;
 	}
 
+	postgresDebugLog(fsstate, "Async exec denied.", NULL);
 	return false;
 }
 
@@ -2162,11 +2186,16 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
 			 */
 			if (fsstate != PFCgetAsyncScan(conn))
 			{
+				postgresDebugLog(fsstate,
+								 "Changed to sync fetch (different scan)",
+								 fsstate->conn);
 				fetch_more_data(PFCgetAsyncScan(conn), EXIT_ASYNC);
 				res = PFCexec(conn, sql);
 			}
 			else
 			{
+				postgresDebugLog(fsstate,
+								 "Async fetch", fsstate->conn);
 				/* Get result of running async fetch */
 				res = PFCgetResult(conn);
 				if (PQntuples(res) == fetch_size)
@@ -2196,6 +2225,7 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd)
 			}
 
 			/* Elsewise do synchronous query execution */
+			postgresDebugLog(fsstate, "Sync fetch.", conn);
 			PFCsetAsyncScan(conn, NULL);
 			res = PFCexec(conn, sql);
 		}
-- 
1.8.3.1




Re: [HACKERS] Asynchronous execution on FDW

2015-07-06 Thread Kyotaro HORIGUCHI
Hello, thank you for looking this.

If it is acceptable to reconstruct the executor nodes to have an
additional return state PREP_RUN or such (which means one more call
is needed for the first tuple), I'll modify the whole executor to
handle that state in the next patch.

I haven't taken the advice given so far in this sense, but I have
come to think that this is the most reasonable way to solve the problem.


==
  - It was a problem when to give the first kick for async exec. It
 is not in ExecInit phase, and ExecProc phase does not fit,
 too. An extra phase ExecPreProc or something is too
 invasive. So I tried pre-exec callback.
 
 Any init-node can register callbacks on their turn, then the
 registered callbacks are called just before ExecProc phase in
 executor. The first patch adds functions and structs to enable
 this.
 
 At a quick glance, I think this has all the same problems as starting
 the execution at ExecInit phase. The correct way to do this is to kick
 off the queries in the first IterateForeignScan() call. You said that
 ExecProc phase does not fit - why not?

Execution nodes are expected to return the first tuple if available,
but asynchronous execution cannot return the first tuple
immediately. Starting execution simultaneously on every foreign node
for the first tuple is more crucial than asynchronous fetching in
many cases, especially for cases like sort/agg pushdown on FDW.

The reason ExecProc does not fit is that a first loop that returns
no tuple looks like it would impact too large a portion of the
executor.

It was my mistake that the patch doesn't address the problem with
parameterized paths. Parameterized paths should be executed
within the ExecProc loop, so this patch would work like the
following.

- To gain the advantage of kicking off execution before the first
  ExecProc loop, non-parameterized paths are started using the
  callback feature this patch provides.

- Parameterized paths need the upper nodes to be executed before
  they start execution, so they should be started in the ExecProc
  loop, but they run asynchronously if possible.

This is rather a makeshift solution for the problem, but
considering the current trend toward parallelism, it might be
time to make the executor fit parallel execution.

If it is acceptable to reconstruct the executor nodes to have an
additional return state such as PREP_RUN (which means one more
call is needed for the first tuple), I'll modify the whole
executor to handle that state in the next patch.

I'll hate my own stupidity if this is the kind of solution you
were suggesting by "do it in ExecProc" :(

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Asynchronous execution on FDW

2015-07-03 Thread Heikki Linnakangas

On 07/02/2015 08:48 AM, Kyotaro HORIGUCHI wrote:

- It was a problem when to give the first kick for async exec. It
   is not in ExecInit phase, and ExecProc phase does not fit,
   too. An extra phase ExecPreProc or something is too
   invasive. So I tried pre-exec callback.

   Any init-node can register callbacks on their turn, then the
   registerd callbacks are called just before ExecProc phase in
   executor. The first patch adds functions and structs to enable
   this.


At a quick glance, I think this has all the same problems as starting 
the execution at ExecInit phase. The correct way to do this is to kick 
off the queries in the first IterateForeignScan() call. You said that 
ExecProc phase does not fit - why not?


- Heikki





Re: [HACKERS] Asynchronous execution on FDW

2015-07-02 Thread Kyotaro HORIGUCHI
Ouch! I mistakenly made two CF entries for this patch. Could
someone remove this entry for me?

https://commitfest.postgresql.org/5/290/

The correct entry is /5/291/


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Asynchronous execution on FDW

2015-07-02 Thread Michael Paquier
On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Ouch! I mistakenly made two CF entries for this patch. Could
 someone remove this entry for me?

 https://commitfest.postgresql.org/5/290/

 The correct entry is /5/291/

Done.
-- 
Michael




Re: [HACKERS] Asynchronous execution on FDW

2015-07-02 Thread Kyotaro HORIGUCHI
Thank you.

At Thu, 2 Jul 2015 16:02:27 +0900, Michael Paquier michael.paqu...@gmail.com 
wrote in CAB7nPqTs0YCwXedt1P=JjxFJeoj9UzLzkLuiX8=jdtpyutn...@mail.gmail.com
 On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
 horiguchi.kyot...@lab.ntt.co.jp wrote:
  Ouch! I mistakenly made two CF entries for this patch. Could
  someone remove this entry for me?
 
  https://commitfest.postgresql.org/5/290/
 
  The correct entry is /5/291/
 
 Done.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




[HACKERS] Asynchronous execution on FDW

2015-07-01 Thread Kyotaro HORIGUCHI
Hello. This is the new version of the FDW async execution feature.

The status of this feature, as of the last commitfest, is as follows.

- Async execution is valuable to have.
- But doing the first kick in the ExecInit phase is wrong.

So the design outline of this version is as follows.

- The patch set consists of three parts. The first is the
  infrastructure on the core side, the second is the code to enable
  asynchronous execution in postgres_fdw, and the third is an
  alternative set of three methods to adapt the fetch size, which
  makes asynchronous execution more effective.

- One problem was when to give the first kick for async execution.
  It is not in the ExecInit phase, and the ExecProc phase does not
  fit either. An extra phase such as ExecPreProc would be too
  invasive, so I tried a pre-exec callback.

  Any node can register callbacks during its initialization; the
  registered callbacks are then called just before the ExecProc
  phase in the executor. The first patch adds the functions and
  structs to enable this.

- The second part is unchanged from the previous version. It adds
  PgFdwConn as an extended PGconn which has some members to
  support asynchronous execution.

  Asynchronous execution is kicked off only for the first
  ForeignScan node on a given foreign server, and that state
  lasts until the next scan comes. This behavior is mainly
  controlled in fetch_more_data(), and it limits the number of
  simultaneous executions per foreign server to 1. This was
  decided because no reasonable method to limit the multiplicity
  of execution on a *single peer* has been found so far.
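  The "one asynchronous scan per connection" rule described above
  can be sketched as a small guard, shown here as a self-contained
  illustration (the type layout and the helper names
  try_begin_async/finish_async are mine, not the patch's actual
  API):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the patch's connection/scan-state types. */
typedef struct ScanState ScanState;

typedef struct PgFdwConn
{
	void	   *pgconn;			/* would be the underlying PGconn */
	ScanState  *async_scan;		/* scan currently using the connection
								 * asynchronously, or NULL */
} PgFdwConn;

struct ScanState
{
	int			dummy;
};

/*
 * Try to become the single asynchronous scan on this connection.
 * Returns 1 on success; any other scan on the same foreign server
 * falls back to synchronous fetching, keeping the multiplicity per
 * peer at 1.
 */
static int
try_begin_async(PgFdwConn *conn, ScanState *scan)
{
	if (conn->async_scan != NULL && conn->async_scan != scan)
		return 0;				/* connection is busy: run synchronously */
	conn->async_scan = scan;
	return 1;
}

/* Release the connection when the next scan takes over. */
static void
finish_async(PgFdwConn *conn, ScanState *scan)
{
	if (conn->async_scan == scan)
		conn->async_scan = NULL;
}
```

  In fetch_more_data() terms, the scan that wins the guard issues
  its FETCH asynchronously; every other scan sharing the connection
  takes the synchronous path.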

- The third part consists of three experimental approaches to an
  adaptive fetch size.

   The first is duration-based adaptation. The patch increases the
   fetch size with every FETCH execution but tries to keep the
   duration of each FETCH below 500 ms. It is not promising,
   though, because it looks very unstable and its behavior is
   nearly unforeseeable.

   The second is based on a byte-based FETCH feature. This patch
   adds to the FETCH command an argument that limits the number of
   bytes (octets) to send. But this might be an over-exposure of
   the internals: the size is counted based on the internal
   representation of a tuple, so the client needs to account for
   the overhead of that internal representation in bytes. This is
   effective but quite ugly.

   The third is the simplest and most straightforward way: adding
   a foreign table option to specify the fetch_size. Its effect is
   also in doubt, since the size of the tuples for one foreign
   table varies with the returned column list. But it is
   foreseeable for users and is a necessary knob for those who
   want to tune it. The foreign server could also have the same
   option as the default for its foreign tables, but this patch
   has not added that.
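The duration-based alternative amounts to a small feedback
controller. A minimal sketch, assuming a hypothetical
next_fetch_size() helper rather than the patch's actual code: grow
the fetch size while the last FETCH stayed under the 500 ms
target, shrink it when it went over.

```c
#include <assert.h>

#define TARGET_FETCH_MS 500.0	/* duration target from the description above */
#define MIN_FETCH_SIZE	100
#define MAX_FETCH_SIZE	100000

/*
 * Pick the next fetch size from the current one and the duration of
 * the last FETCH.  Grow geometrically while under the target, shrink
 * when over it; clamp to sane bounds.
 */
static int
next_fetch_size(int cur, double last_fetch_ms)
{
	if (last_fetch_ms < TARGET_FETCH_MS)
		cur *= 2;				/* last FETCH was fast: fetch more rows */
	else
		cur /= 2;				/* too slow: back off */

	if (cur < MIN_FETCH_SIZE)
		cur = MIN_FETCH_SIZE;
	if (cur > MAX_FETCH_SIZE)
		cur = MAX_FETCH_SIZE;
	return cur;
}
```

This also illustrates the instability mentioned above: the size
oscillates around whatever row count happens to fit in 500 ms, and
network jitter moves that target constantly.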


The attached patches are the following,

- 0001-Add-infrastructure-of-pre-execution-callbacks.patch
  Infrastructure of pre-execution callback

- 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch
  FDW asynchronous execution feature

- 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch
  Adaptive fetch size alternative 1: duration based control

- 0003b-POC-Experimental-fetch_by_size-feature.patch
  Adaptive fetch size alternative 2: FETCH by size

- 0003c-Add-foreign-table-option-to-set-fetch-size.patch
  Adaptive fetch size alternative 3: Foreign table option.

regards,
  
From eb621897d1410079c6458bc4d1914d1345eb77bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Fri, 26 Jun 2015 15:12:16 +0900
Subject: [PATCH 1/3] Add infrastructure of pre-execution callbacks.

Some executor nodes have work to do before plan tree execution.
This infrastructure provides that functionality.
---
 src/backend/executor/execMain.c  | 32 
 src/backend/executor/execUtils.c |  2 ++
 src/include/nodes/execnodes.h| 22 ++
 3 files changed, 56 insertions(+)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a1561ce..51a86b2 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -764,6 +764,35 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
 		PreventCommandIfParallelMode(CreateCommandTag((Node *) plannedstmt));
 }
 
+/*
+ * Register callbacks to be called just before plan execution.
+ */
+void
+RegisterPreExecCallback(PreExecCallback callback, EState *es, Node *nd,
+		void *arg)
+{
+	PreExecCallbackItem *item;
+
+	item = (PreExecCallbackItem *)
+		MemoryContextAlloc(es->es_query_cxt, sizeof(PreExecCallbackItem));
+	item->callback = callback;
+	item->node = nd;
+	item->arg = arg;
+
+	/* link the new item at the head of the chain */
+	item->next = es->es_preExecCallbacks;
+	es->es_preExecCallbacks = item;
+}
+
+/* Execute registered pre-exec callbacks */
+void
+RunPreExecCallbacks(EState *es)
+{
+	PreExecCallbackItem *item;
+
+	for (item = es->es_preExecCallbacks; item
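
The register/run pair in the patch (the archived hunk is cut off
above) can be illustrated with a self-contained sketch of the same
linked-list pattern. Plain malloc stands in for
MemoryContextAlloc, the EState/Node details are stubbed out, and
the example callback kick_async_scan is mine:

```c
#include <assert.h>
#include <stdlib.h>

/* Stubbed-down versions of the patch's types. */
typedef struct Node Node;
typedef struct EState EState;
typedef void (*PreExecCallback) (EState *es, Node *node, void *arg);

typedef struct PreExecCallbackItem
{
	PreExecCallback callback;
	Node	   *node;
	void	   *arg;
	struct PreExecCallbackItem *next;
} PreExecCallbackItem;

struct EState
{
	PreExecCallbackItem *es_preExecCallbacks;
};

/* Register a callback; new items are linked at the head of the chain. */
static void
RegisterPreExecCallback(PreExecCallback callback, EState *es, Node *nd,
						void *arg)
{
	PreExecCallbackItem *item = malloc(sizeof(PreExecCallbackItem));

	item->callback = callback;
	item->node = nd;
	item->arg = arg;
	item->next = es->es_preExecCallbacks;
	es->es_preExecCallbacks = item;
}

/* Run every registered callback just before the ExecProc phase. */
static void
RunPreExecCallbacks(EState *es)
{
	PreExecCallbackItem *item;

	for (item = es->es_preExecCallbacks; item != NULL; item = item->next)
		item->callback(es, item->node, item->arg);
}

/* Example callback: an FDW node would send its remote query here. */
static void
kick_async_scan(EState *es, Node *node, void *arg)
{
	(void) es;
	(void) node;
	++*(int *) arg;				/* stand-in for sending the async query */
}
```

Each ForeignScan node registers itself at init time, and the
executor fires the whole chain once before the first ExecProc
call, which is exactly the "first kick" the discussion above is
about.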