Re: [HACKERS] asynchronous execution
Hello. Fully-asynchronous executor needs that every node is stateful and suspendable at the time of requesting for the next tuples to underneath nodes. I tried pure push-base executor but failed. After the miserable patch upthread, I finally managed to make executor nodes suspendable using computational jump and got rid of recursive calls of executor. But it runs about x10 slower for simple SeqScan case. (pgbench ran with 9% degradation.) It doesn't seem recoverable by handy improvements. So I gave up that. Then I returned to single-level asynchrony, in other words, the simple case with async-aware nodes just above async-capable nodes. The motive of using the framework in the previous patch was that we had degradation on the sync (or normal) code paths by polluting ExecProcNode with async stuff and as Tom's suggestion the node->ExecProcNode trick can isolate the async code path. The attached PoC patch theoretically has no impact on the normal code paths and just brings gain in async cases. (Additional members in PlanState made degradation seemingly comes from alignment, though.) But I haven't had enough stable result from performance test. Different builds from the same source code gives apparently different results... Anyway I'll show the best one in the several times run here. original(ms) patched(ms)gain(%) A: simple table scan : 9714.70 9656.73 0.6 B: local partitioning: 4119.44 4131.10-0.3 C: single remote table : 9484.86 9141.89 3.7 D: sharding (single con) : 7114.34 6751.21 5.1 E: sharding (multi con) : 7166.56 1827.9374.5 A and B are degradation checks, which are expected to show no degradation. C is the gain only by postgres_fdw's command presending on a remote table. D is the gain of sharding on a connection. The number of partitions/shards is 4. E is the gain using dedicate connection per shard. regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From fc424c16e124934581a184fcadaed1e05f7672c8 Mon Sep 17 00:00:00 2001 From: Kyotaro HoriguchiDate: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 ++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 754154b..d459f32 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 4eb6e83..e6fc3dd 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_event might need 8 byte alignment on some @@ -591,6 +597,11 @@
Re: [HACKERS] asynchronous execution
At Thu, 31 Aug 2017 21:52:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170831.215236.135328985.horiguchi.kyot...@lab.ntt.co.jp> > At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI > wrote in > <20170803.093057.261590619.horiguchi.kyot...@lab.ntt.co.jp> > > > Unfortunately, that's probably another gigantic patch (that > > > should probably be written by Andres). > > > > Yeah, but async executor on the current style of executor seems > > furtile work, or sitting until the patch comes is also waste of > > time. So I'm planning to include the following sutff in the next > > PoC patch. Even I'm not sure it can land on the coming > > Andres'patch. > > > > - Tuple passing outside call-stack. (I remember it was in the > > past of the thread around but not found) > > > > This should be included in the Andres' patch. > > > > - Give executor an ability to run from data-source (or driver) > > nodes to the root. > > > > I'm not sure this is included, but I suppose he is aiming this > > kind of thing. > > > > - Rebuid asynchronous execution on the upside-down executor. > > Anyway, I modified ExecProcNode into push-up form and it *seems* > working to some extent. But trigger and cursors are almost broken > and several other regressions fail. Some nodes such like > windowagg are terriblly difficult to change to this push-up form > (using state machine). And of course it is terribly inefficient. > > I'm afraid that all of this turns out to be in vain. But anyway, > and FWIW, I'll show the work to here after some cleansing work on > it. So, this is that. Maybe this is really a bad way to go. Top of the bads is it's terriblly hard to maintain because the behavior of the state machine constructed in this patch is hardly predictable so easily broken. During the 'cleansing work' I had many crash or infinite-loop and they were a bit hard to diagnose.. This will be soon broken by following commits. Anyway and, again FWIW, this is that. I'll leave this for a while (at least the period of this CF) and reconsider on async in different forms. regards, -- Kyotaro Horiguchi NTT Open Source Software Center poc_pushexecutor_20170904_4faa1dc.tar.bz2 Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170803.093057.261590619.horiguchi.kyot...@lab.ntt.co.jp> > > Unfortunately, that's probably another gigantic patch (that > > should probably be written by Andres). > > Yeah, but async executor on the current style of executor seems > furtile work, or sitting until the patch comes is also waste of > time. So I'm planning to include the following sutff in the next > PoC patch. Even I'm not sure it can land on the coming > Andres'patch. > > - Tuple passing outside call-stack. (I remember it was in the > past of the thread around but not found) > > This should be included in the Andres' patch. > > - Give executor an ability to run from data-source (or driver) > nodes to the root. > > I'm not sure this is included, but I suppose he is aiming this > kind of thing. > > - Rebuid asynchronous execution on the upside-down executor. Anyway, I modified ExecProcNode into push-up form and it *seems* working to some extent. But trigger and cursors are almost broken and several other regressions fail. Some nodes such like windowagg are terriblly difficult to change to this push-up form (using state machine). And of course it is terribly inefficient. I'm afraid that all of this turns out to be in vain. But anyway, and FWIW, I'll show the work to here after some cleansing work on it. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Thank you for the comment. At Tue, 1 Aug 2017 16:27:41 -0400, Robert Haaswrote in > On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI > wrote: > > Another is getting rid of recursive call to run an execution > > tree. > > That happens to be exactly what Andres did for expression evaluation > in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think > generalizing that to include the plan tree as well as expression trees > is likely to be the long-term way forward here. I read it in the source tree. The patch implements converting expression tree to an intermediate expression then run it on a custom-made interpreter. Guessing from the word "upside down" from Andres, the whole thing will become source-driven. > Unfortunately, that's probably another gigantic patch (that > should probably be written by Andres). Yeah, but async executor on the current style of executor seems furtile work, or sitting until the patch comes is also waste of time. So I'm planning to include the following sutff in the next PoC patch. Even I'm not sure it can land on the coming Andres'patch. - Tuple passing outside call-stack. (I remember it was in the past of the thread around but not found) This should be included in the Andres' patch. - Give executor an ability to run from data-source (or driver) nodes to the root. I'm not sure this is included, but I suppose he is aiming this kind of thing. - Rebuid asynchronous execution on the upside-down executor. regrds, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHIwrote: > Another is getting rid of recursive call to run an execution > tree. That happens to be exactly what Andres did for expression evaluation in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think generalizing that to include the plan tree as well as expression trees is likely to be the long-term way forward here. Unfortunately, that's probably another gigantic patch (that should probably be written by Andres). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
At Fri, 28 Jul 2017 17:31:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170728.173105.238045591.horiguchi.kyot...@lab.ntt.co.jp> > Thank you for the comment. > > At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas wrote > in > > regression all the same. Every type of intermediate node will have to > > have a code path where it uses ExecAsyncRequest() / > > ExecAyncHogeResponse() rather than ExecProcNode() to get tuples, and > > I understand what Robert concerns and I share the same > opinion. It needs further different framework. > > At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas wrote > in > > On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane wrote: > > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at > > > no cost when $extra_stuff is not needed, because you simply don't insert > > > the wrapper function when it's not needed. I'm not sure that it will ... > > Yeah, I don't quite see how that would apply in this case -- what we > > need here is not as simple as just conditionally injecting an extra > > bit. > > Thank you for the pointer, Tom. The subject (segfault in HEAD...) > haven't made me think that this kind of discussion was held > there. Anyway it seems very closer to asynchronous execution so > I'll catch up it considering how I can associate with this. I understand the executor change which has just been made at master based on the pointed thread. This seems to have the capability to let exec-node switch to async-aware with no extra cost on non-async processing. So it would be doable to (just) *shrink* the current framework by detaching the async-aware side of the API. But to get the most out of asynchrony, it is required that multiple async-capable nodes distributed under async-unaware nodes run simultaneously. There seems two ways to achieve this. One is propagating required-async-nodes bitmap up to the topmost node and waiting for the all required nodes to become ready. In the long run this requires all nodes to be async-aware and that apparently would have bad effect on performance to async-unaware nodes containing async-capable nodes. Another is getting rid of recursive call to run an execution tree. It is perhaps the same to what mentioned as "data-centric processing" in a previous threads [1], [2], but I'd like to I pay attention on the aspect of "enableing to resume execution tree from arbitrary leaf node". So I'm considering to realize it still in one-tuple-by-one manner instead of collecting all tuples of a leaf node first. Even though I'm not sure it is doable. [1] https://www.postgresql.org/message-id/bf2827dcce55594c8d7a8f7ffd3ab77159a9b...@szxeml521-mbs.china.huawei.com [2] https://www.postgresql.org/message-id/20160629183254.frcm3dgg54ud5...@alap3.anarazel.de regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Thank you for the comment. At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haaswrote in > But if we do, then I fear we'll just be reintroducing the same > performance regression that we introduced by switching to this > framework from the previous one - or maybe a different one, but a > regression all the same. Every type of intermediate node will have to > have a code path where it uses ExecAsyncRequest() / > ExecAyncHogeResponse() rather than ExecProcNode() to get tuples, and I understand what Robert concerns and I think I share the same opinion. It needs further different framework. At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas wrote in > On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane wrote: > > I have not been paying any attention to this thread whatsoever, > > but I wonder if you can address your problem by building on top of > > the ExecProcNode replacement that Andres is working on, > > https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudi...@alap3.anarazel.de > > > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at > > no cost when $extra_stuff is not needed, because you simply don't insert > > the wrapper function when it's not needed. I'm not sure that it will > > scale well to several different kinds of insertions though, for instance > > if you wanted both instrumentation and async support on the same node. > > But maybe those two couldn't be arms-length from each other anyway, > > in which case it might be fine as-is. > > Yeah, I don't quite see how that would apply in this case -- what we > need here is not as simple as just conditionally injecting an extra > bit. Thank you for the pointer, Tom. The subject (segfault in HEAD...) haven't made me think that this kind of discussion was held there. Anyway it seems very closer to asynchronous execution so I'll catch up it considering how I can associate with this. Regards. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On Wed, Jul 26, 2017 at 5:43 PM, Tom Lanewrote: > I have not been paying any attention to this thread whatsoever, > but I wonder if you can address your problem by building on top of > the ExecProcNode replacement that Andres is working on, > https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudi...@alap3.anarazel.de > > The scheme he has allows $extra_stuff to be injected into ExecProcNode at > no cost when $extra_stuff is not needed, because you simply don't insert > the wrapper function when it's not needed. I'm not sure that it will > scale well to several different kinds of insertions though, for instance > if you wanted both instrumentation and async support on the same node. > But maybe those two couldn't be arms-length from each other anyway, > in which case it might be fine as-is. Yeah, I don't quite see how that would apply in this case -- what we need here is not as simple as just conditionally injecting an extra bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Robert Haaswrites: > Ostensibly, the advantage of this framework over my previous proposal > is that it avoids inserting anything into ExecProcNode(), which is > probably a good thing to avoid given how frequently ExecProcNode() is > called. Unless the parent and the child both know about asynchronous > execution and choose to use it, everything runs exactly as it does > today and so there is no possibility of a complaint about a > performance hit. As far as it goes, that is good. > However, at a deeper level, I fear we haven't really solved the > problem. If an Append is directly on top of a ForeignScan node, then > this will work. But if an Append is indirectly on top of a > ForeignScan node with some other stuff in the middle, then it won't - > unless we make whichever nodes appear between the Append and the > ForeignScan async-capable. I have not been paying any attention to this thread whatsoever, but I wonder if you can address your problem by building on top of the ExecProcNode replacement that Andres is working on, https://www.postgresql.org/message-id/20170726012641.bmhfcp5ajpudi...@alap3.anarazel.de The scheme he has allows $extra_stuff to be injected into ExecProcNode at no cost when $extra_stuff is not needed, because you simply don't insert the wrapper function when it's not needed. I'm not sure that it will scale well to several different kinds of insertions though, for instance if you wanted both instrumentation and async support on the same node. But maybe those two couldn't be arms-length from each other anyway, in which case it might be fine as-is. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On Tue, Jul 25, 2017 at 5:11 AM, Kyotaro HORIGUCHIwrote: > [ new patches ] I spent some time today refreshing my memory of what's going with this thread today. Ostensibly, the advantage of this framework over my previous proposal is that it avoids inserting anything into ExecProcNode(), which is probably a good thing to avoid given how frequently ExecProcNode() is called. Unless the parent and the child both know about asynchronous execution and choose to use it, everything runs exactly as it does today and so there is no possibility of a complaint about a performance hit. As far as it goes, that is good. However, at a deeper level, I fear we haven't really solved the problem. If an Append is directly on top of a ForeignScan node, then this will work. But if an Append is indirectly on top of a ForeignScan node with some other stuff in the middle, then it won't - unless we make whichever nodes appear between the Append and the ForeignScan async-capable. Indeed, we'd really want all kinds of joins and aggregates to be async-capable so that examples like the one Corey asked about in http://postgr.es/m/CADkLM=fuvVdKvz92XpCRnb4=rj6bLOhSLifQ3RV=sb4q5rj...@mail.gmail.com will work. But if we do, then I fear we'll just be reintroducing the same performance regression that we introduced by switching to this framework from the previous one - or maybe a different one, but a regression all the same. Every type of intermediate node will have to have a code path where it uses ExecAsyncRequest() / ExecAyncHogeResponse() rather than ExecProcNode() to get tuples, and it seems like that will either end up duplicating a lot of code from the regular code path or, alternatively, polluting the regular code path with some of the async code's concerns to avoid duplication, and maybe slowing things down. Maybe that concern is unjustified; I'm not sure. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Hello, 8bf58c0d9bd33686 badly conflicts with this patch, so I'll rebase this and added a patch to refactor the function that Anotonin pointed. This would be merged into 0002 patch. At Tue, 18 Jul 2017 16:24:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170718.162452.221576658.horiguchi.kyot...@lab.ntt.co.jp> > I'll put an upper limit to the number of waiters processed at > once. Then add a comment like that. > > > Actually the reason I thought of simplification was that I noticed small > > inefficiency in the way you do the compaction. In particular, I think it's > > not > > always necessary to swap the tail and head entries. Would something like > > this > > make sense? > > I'm not sure, but I suppose that it is rare that all of the first > many elements in the array are not COMPLETE. In most cases the > first element gets a response first. ... > Yeah, but maybe the "head" is still confusing even if reversed > because it is still not a head of something. It might be less > confusing by rewriting it in more verbose-but-straightforwad way. > > > | int npending = 0; > | > | /* Skip over not-completed items at the beginning */ > | while (npending < estate->es_num_pending_async && > | estate->es_pending_async[npending] != ASYNCREQ_COMPLETE) > |npending++; > | > | /* Scan over the rest for not-completed items */ > | for (i = npending + 1 ; i < estate->es_num_pending_async; ++i) > | { > |PendingAsyncRequest *tmp; > |PendingAsyncRequest *curr = estate->es_pending_async[i]; > | > |if (curr->state == ASYNCREQ_COMPLETE) > | continue; > | > | /* Move the not-completed item to the tail of the first chunk */ > |tmp = estate->es_pending_async[i]; > |estate->es_pending_async[nepending] = tmp; > |estate->es_pending_async[i] = tmp; > |++npending; > | } The last patch does something like this (with apparent bugs fixed) regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 41ad9a7518c066da619363e6cdf8574fa00ee1e5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 22 Feb 2017 09:07:49 +0900 Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 ++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 4452ea4..ed71e7c 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -220,7 +220,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 07b1364..9543397 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g.
Re: [HACKERS] asynchronous execution
Hello, At Tue, 11 Jul 2017 10:28:51 +0200, Antonin Houskawrote in <6448.1499761731@localhost> > Kyotaro HORIGUCHI wrote: > > Effectively it is a waiting-queue followed by a > > completed-list. The point of the compaction is keeping the order > > of waiting or not-yet-completed requests, which is crucial to > > avoid kind-a precedence inversion. We cannot keep the order by > > using bitmapset in such way. > > > The current code waits all waiters at once and processes all > > fired events at once. The order in the waiting-queue is > > inessential in the case. On the other hand I suppoese waiting on > > several-tens to near-hundred remote hosts is in a realistic > > target range. Keeping the order could be crucial if we process a > > part of the queue at once in the case. > > > > Putting siginificance on the deviation of response time of > > remotes, process-all-at-once is effective. In turn we should > > consider the effectiveness of the lifecycle of the larger wait > > event set. > > ok, I missed the fact that the order of es_pending_async entries is > important. I think this is worth adding a comment. I'll put an upper limit to the number of waiters processed at once. Then add a comment like that. > Actually the reason I thought of simplification was that I noticed small > inefficiency in the way you do the compaction. In particular, I think it's not > always necessary to swap the tail and head entries. Would something like this > make sense? I'm not sure, but I suppose that it is rare that all of the first many elements in the array are not COMPLETE. In most cases the first element gets a response first. > > /* If any node completed, compact the array. */ > if (any_node_done) > { ... > for (tidx = 0; tidx < estate->es_num_pending_async; > ++tidx) > { ... > if (tail->state == ASYNCREQ_COMPLETE) > continue; > > /* >* If the array starts with one or more > incomplete requests, >* both head and tail point at the same item, > so there's no >* point in swapping. >*/ > if (tidx > hidx) > { This works to skip the first several elements when all of them are ASYNCREQ_COMPLETE. I think it makes sense as long as it doesn't harm the loop. The optimization is more effective by putting out of the loop like this. | for (tidx = 0; tidx < estate->es_num_pending_async && estate->es_pending_async[tidx] == ASYNCREQ_COMPLETE; ++tidx); | for (; tidx < estate->es_num_pending_async; ++tidx) ... > And besides that, I think it'd be more intuitive if the meaning of "head" and > "tail" was reversed: if the array is iterated from lower to higher positions, > then I'd consider head to be at higher position, not tail. Yeah, but maybe the "head" is still confusing even if reversed because it is still not a head of something. It might be less confusing by rewriting it in more verbose-but-straightforwad way. | int npending = 0; | | /* Skip over not-completed items at the beginning */ | while (npending < estate->es_num_pending_async && | estate->es_pending_async[npending] != ASYNCREQ_COMPLETE) |npending++; | | /* Scan over the rest for not-completed items */ | for (i = npending + 1 ; i < estate->es_num_pending_async; ++i) | { |PendingAsyncRequest *tmp; |PendingAsyncRequest *curr = estate->es_pending_async[i]; | |if (curr->state == ASYNCREQ_COMPLETE) | continue; | |/* Move the not-completed item to the tail of the first chunk */ |tmp = estate->es_pending_async[i]; |estate->es_pending_async[nepending] = tmp; |estate->es_pending_async[i] = tmp; |++npending; | } regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Kyotaro HORIGUCHIwrote: > > Just one idea that I had while reading the code. > > > > In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the > > complete requests to the end and finaly adjust estate->es_num_pending_async > > so > > that the array no longer contains the complete requests. I think the point > > is > > that then you can add new requests to the end of the array. > > > > I wonder if a set (Bitmapset) of incomplete requests would make the code > > more > > efficient. The set would contain position of each incomplete request in > > estate->es_num_pending_async (I think it's the myindex field of > > PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the > > requests subject to ExecAsyncNotify etc, then the compaction of > > estate->es_pending_async wouldn't be necessary. > > > > ExecAsyncRequest would use the set to look for space for new requests by > > iterating it and trying to find the first gap (which corresponds to > > completed > > request). > > > > And finally, item would be removed from the set at the moment the request > > state is being set to ASYNCREQ_COMPLETE. > > Effectively it is a waiting-queue followed by a > completed-list. The point of the compaction is keeping the order > of waiting or not-yet-completed requests, which is crucial to > avoid kind-a precedence inversion. We cannot keep the order by > using bitmapset in such way. > The current code waits all waiters at once and processes all > fired events at once. The order in the waiting-queue is > inessential in the case. On the other hand I suppoese waiting on > several-tens to near-hundred remote hosts is in a realistic > target range. Keeping the order could be crucial if we process a > part of the queue at once in the case. > > Putting siginificance on the deviation of response time of > remotes, process-all-at-once is effective. In turn we should > consider the effectiveness of the lifecycle of the larger wait > event set. ok, I missed the fact that the order of es_pending_async entries is important. I think this is worth adding a comment. Actually the reason I thought of simplification was that I noticed small inefficiency in the way you do the compaction. In particular, I think it's not always necessary to swap the tail and head entries. Would something like this make sense? /* If any node completed, compact the array. */ if (any_node_done) { int hidx = 0, tidx; /* * Swap all non-yet-completed items to the start of the array. * Keep them in the same order. */ for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx) { PendingAsyncRequest *tail = estate->es_pending_async[tidx]; Assert(tail->state != ASYNCREQ_CALLBACK_PENDING); if (tail->state == ASYNCREQ_COMPLETE) continue; /* * If the array starts with one or more incomplete requests, * both head and tail point at the same item, so there's no * point in swapping. */ if (tidx > hidx) { PendingAsyncRequest *head = estate->es_pending_async[hidx]; /* * Once the tail got ahead, it should only leave * ASYNCREQ_COMPLETE behind. Only those can then be seen * by head. */ Assert(head->state == ASYNCREQ_COMPLETE); estate->es_pending_async[tidx] = head; estate->es_pending_async[hidx] = tail; } ++hidx; } estate->es_num_pending_async = hidx; } And besides that, I think it'd be more intuitive if the meaning of "head" and "tail" was reversed: if the array is iterated from lower to higher positions, then I'd consider head to be at higher position, not tail. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Thank you for the thought. This is at PoC level so I'd be grateful for this kind of fundamental comments. At Wed, 28 Jun 2017 20:22:24 +0200, Antonin Houskawrote in <392.1498674144@localhost> > Kyotaro HORIGUCHI wrote: > > > The patch got conflicted. This is a new version just rebased to > > the current master. Furtuer amendment will be taken later. > > Just one idea that I had while reading the code. > > In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the > complete requests to the end and finaly adjust estate->es_num_pending_async so > that the array no longer contains the complete requests. I think the point is > that then you can add new requests to the end of the array. > > I wonder if a set (Bitmapset) of incomplete requests would make the code more > efficient. The set would contain position of each incomplete request in > estate->es_num_pending_async (I think it's the myindex field of > PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the > requests subject to ExecAsyncNotify etc, then the compaction of > estate->es_pending_async wouldn't be necessary. > > ExecAsyncRequest would use the set to look for space for new requests by > iterating it and trying to find the first gap (which corresponds to completed > request). > > And finally, item would be removed from the set at the moment the request > state is being set to ASYNCREQ_COMPLETE. Effectively it is a waiting-queue followed by a completed-list. The point of the compaction is keeping the order of waiting or not-yet-completed requests, which is crucial to avoid kind-a precedence inversion. We cannot keep the order by using bitmapset in such way. The current code waits all waiters at once and processes all fired events at once. The order in the waiting-queue is inessential in the case. On the other hand I suppoese waiting on several-tens to near-hundred remote hosts is in a realistic target range. Keeping the order could be crucial if we process a part of the queue at once in the case. Putting siginificance on the deviation of response time of remotes, process-all-at-once is effective. In turn we should consider the effectiveness of the lifecycle of the larger wait event set. Sorry for the discursive discussion but in short, I have noticed that I have a lot to consider on this:p Thanks! regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Hi, I've returned. At Thu, 29 Jun 2017 14:08:27 +0900, Amit Langotewrote in <63a5a01c-2967-83e0-8bbf-c981404f5...@lab.ntt.co.jp> > Hi, > > On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote: > > Thank you for looking this. > > > > At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote: > >> Can you please explain this part of make_append() ? > >> > >> /* Currently async on partitioned tables is not available */ > >> Assert(nasyncplans == 0 || partitioned_rels == NIL); > >> > >> I don't think the output of Append plan is supposed to be ordered even if > >> the > >> underlying relation is partitioned. Besides ordering, is there any other > >> reason not to use the asynchronous execution? > > When making an Append for a partitioned table, among the arguments passed > to make_append(), 'partitioned_rels' is a list of RT indexes of > partitioned tables in the inheritance tree of which the aforementioned > partitioned table is the root. 'appendplans' is a list of subplans for > scanning the leaf partitions in the tree. Note that the 'appendplans' > list contains no members corresponding to the partitioned tables, because > we don't need to scan them (only leaf relations contain any data). > > The point of having the 'partitioned_rels' list in the resulting Append > plan is so that the executor can identify those relations and take the > appropriate locks on them. Amit, thank you for the detailed explanation. I understand what it is and that just ignoring it is enough, then confirmed that actually works as before. I'll then adresss Antonin's comments tomorrow. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Hi, On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote: > Thank you for looking this. > > At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote: >> Kyotaro HORIGUCHIwrote: >> >>> The patch got conflicted. This is a new version just rebased to >>> the current master. Furtuer amendment will be taken later. >> >> Can you please explain this part of make_append() ? >> >> /* Currently async on partitioned tables is not available */ >> Assert(nasyncplans == 0 || partitioned_rels == NIL); >> >> I don't think the output of Append plan is supposed to be ordered even if the >> underlying relation is partitioned. Besides ordering, is there any other >> reason not to use the asynchronous execution? > > It was just a developmental sentinel that will remind me later to > consider the declarative partitions since I didn't have an idea > of the differences (or the similarity) between appendrels and > partitioned_rels. It is never to say the condition cannot > make. I'll check it out and will support partitioned_rels sooner. > Sorry for having left it as it is. When making an Append for a partitioned table, among the arguments passed to make_append(), 'partitioned_rels' is a list of RT indexes of partitioned tables in the inheritance tree of which the aforementioned partitioned table is the root. 'appendplans' is a list of subplans for scanning the leaf partitions in the tree. Note that the 'appendplans' list contains no members corresponding to the partitioned tables, because we don't need to scan them (only leaf relations contain any data). The point of having the 'partitioned_rels' list in the resulting Append plan is so that the executor can identify those relations and take the appropriate locks on them. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Thank you for looking this. At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houskawrote in <4579.1498638234@localhost> > Kyotaro HORIGUCHI wrote: > > > The patch got conflicted. This is a new version just rebased to > > the current master. Furtuer amendment will be taken later. > > Can you please explain this part of make_append() ? > > /* Currently async on partitioned tables is not available */ > Assert(nasyncplans == 0 || partitioned_rels == NIL); > > I don't think the output of Append plan is supposed to be ordered even if the > underlying relation is partitioned. Besides ordering, is there any other > reason not to use the asynchronous execution? It was just a developmental sentinel that will remind me later to consider the declarative partitions since I didn't have an idea of the differences (or the similarity) between appendrels and partitioned_rels. It is never to say the condition cannot make. I'll check it out and will support partitioned_rels sooner. Sorry for having left it as it is. > And even if there was some, the planner should ensure that executor does not > fire the assertion statement above. The script attached shows an example how > to cause the assertion failure. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Kyotaro HORIGUCHIwrote: > The patch got conflicted. This is a new version just rebased to > the current master. Furtuer amendment will be taken later. Just one idea that I had while reading the code. In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the complete requests to the end and finaly adjust estate->es_num_pending_async so that the array no longer contains the complete requests. I think the point is that then you can add new requests to the end of the array. I wonder if a set (Bitmapset) of incomplete requests would make the code more efficient. The set would contain position of each incomplete request in estate->es_num_pending_async (I think it's the myindex field of PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the requests subject to ExecAsyncNotify etc, then the compaction of estate->es_pending_async wouldn't be necessary. ExecAsyncRequest would use the set to look for space for new requests by iterating it and trying to find the first gap (which corresponds to completed request). And finally, item would be removed from the set at the moment the request state is being set to ASYNCREQ_COMPLETE. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Kyotaro HORIGUCHIwrote: > The patch got conflicted. This is a new version just rebased to > the current master. Furtuer amendment will be taken later. Can you please explain this part of make_append() ? /* Currently async on partitioned tables is not available */ Assert(nasyncplans == 0 || partitioned_rels == NIL); I don't think the output of Append plan is supposed to be ordered even if the underlying relation is partitioned. Besides ordering, is there any other reason not to use the asynchronous execution? And even if there was some, the planner should ensure that executor does not fire the assertion statement above. The script attached shows an example how to cause the assertion failure. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at OPT_HOST="-h localhost" USER=postgres for x in $(psql $OPT_HOST -c "SELECT datname FROM pg_database WHERE datname ~ 'shard.*' " postgres -A -t) do echo dropping $x dropdb $OPT_HOST $x done; createdb $OPT_HOST shard psql $OPT_HOST -U $USER shard -c " CREATE EXTENSION postgres_fdw; CREATE TABLE orders(customer_id int, product_id int) PARTITION BY LIST (customer_id);" createdb $OPT_HOST shard_0 psql $OPT_HOST -U $USER shard_0 -c " CREATE TABLE orders_0 AS SELECT trunc(2 * random())::int AS customer_id, trunc(100 * random())::int AS product_id FROM generate_series(1, 5000); ANALYZE;" psql $OPT_HOST -U $USER shard -c " CREATE SERVER server_0 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'shard_0', fetch_size '1000', use_remote_estimate 'true', fdw_tuple_cost '10'); CREATE USER MAPPING FOR CURRENT_USER SERVER server_0 OPTIONS (user '$USER'); IMPORT FOREIGN SCHEMA public FROM SERVER server_0 INTO public; ALTER TABLE orders ATTACH PARTITION orders_0 FOR VALUES IN (0, 1);" createdb $OPT_HOST shard_1 psql $OPT_HOST -U $USER shard_1 -c " CREATE TABLE orders_1 AS SELECT trunc(2 * random())::int + 2 AS customer_id, trunc(100 * random())::int + 2 AS product_id FROM generate_series(1, 5000); ANALYZE;" psql $OPT_HOST -U $USER shard -c " CREATE SERVER server_1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'shard_1', fetch_size '1000', use_remote_estimate 'true', fdw_tuple_cost '10'); CREATE USER MAPPING FOR CURRENT_USER SERVER server_1 OPTIONS (user '$USER'); IMPORT FOREIGN SCHEMA public FROM SERVER server_1 INTO public; ALTER TABLE orders ATTACH PARTITION orders_1 FOR VALUES IN (2, 3);" psql $OPT_HOST -U $USER shard -c "SELECT * FROM orders" -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
The patch got conflicted. This is a new version just rebased to the current master. Furtuer amendment will be taken later. > The attached patch is rebased on the current master, but no > substantial changes other than disallowing partitioned tables on > async by assertion. regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 32d5c143a679bcccee9ff29fe3807dfd8deae458 Mon Sep 17 00:00:00 2001 From: Kyotaro HoriguchiDate: Wed, 22 Feb 2017 09:07:49 +0900 Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 ++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 261e9be..c4f336d 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -201,7 +201,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 07b1364..9543397 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_event might need 8 byte alignment on some @@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index b4b7d28..182f759 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) /* Create a reusable WaitEventSet. */ if (cv_wait_event_set == NULL) { - cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1); + cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1); AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL); } diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index 4a4a287..f2509c3 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -124,6 +124,7 @@ typedef struct ResourceOwnerData ResourceArray snapshotarr; /* snapshot references */ ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ + ResourceArray wesarr; /* wait event sets */ /* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */ int nlocks; /* number of owned locks */ @@ -169,6 +170,7 @@ static
Re: [HACKERS] asynchronous execution
At Mon, 22 May 2017 13:12:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170522.131214.20936668.horiguchi.kyot...@lab.ntt.co.jp> > > The attached patch is rebased on the current master, but no > > substantial changes other than disallowing partitioned tables on > > async by assertion. > > This is just rebased onto the current master (d761fe2). > I'll recheck further detail after this. Sorry, the patch was missing some files to add. regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 ++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index d1cc38b..1c34114 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -201,7 +201,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 53e6bf2..8c182a2 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_event might need 8 byte alignment on some @@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index 5afb211..1d9111e 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) /* Create a reusable WaitEventSet. */ if (cv_wait_event_set == NULL) { - cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1); + cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1); AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET, >procLatch, NULL); } diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index af46d78..a1a1121 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -124,6 +124,7 @@ typedef struct ResourceOwnerData ResourceArray snapshotarr; /* snapshot references */ ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic
Re: [HACKERS] asynchronous execution
Hello. At Tue, 04 Apr 2017 19:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170404.192539.29699823.horiguchi.kyot...@lab.ntt.co.jp> > The attached patch is rebased on the current master, but no > substantial changes other than disallowing partitioned tables on > async by assertion. This is just rebased onto the current master (d761fe2). I'll recheck further detail after this. regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 ++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index d1cc38b..1c34114 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -201,7 +201,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 53e6bf2..8c182a2 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -51,6 +51,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -77,6 +78,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. epoll_event might need 8 byte alignment on some @@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index 5afb211..1d9111e 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) /* Create a reusable WaitEventSet. */ if (cv_wait_event_set == NULL) { - cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1); + cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1); AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET, >procLatch, NULL); } diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index af46d78..a1a1121 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -124,6 +124,7 @@ typedef struct ResourceOwnerData ResourceArray snapshotarr; /* snapshot references */ ResourceArray filearr; /* open temporary files */ ResourceArray dsmarr; /* dynamic shmem segments */ + ResourceArray wesarr; /* wait
Re: [HACKERS] asynchronous execution
Hello, At Sun, 2 Apr 2017 12:21:14 -0400, Corey Huinkerwrote in
Re: [HACKERS] asynchronous execution
> > > I'll continue working on this from this point aiming to the next > commit fest. > > This probably will not surprise you given the many commits in the past 2 weeks, but the patches no longer apply to master: $ git apply ~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch /home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27: trailing whitespace. FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); /home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:39: trailing whitespace. #include "utils/resowner_private.h" /home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:47: trailing whitespace. ResourceOwner resowner; /* Resource owner */ /home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:48: trailing whitespace. /home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:57: trailing whitespace. WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); error: patch failed: src/backend/libpq/pqcomm.c:201 error: src/backend/libpq/pqcomm.c: patch does not apply error: patch failed: src/backend/storage/ipc/latch.c:61 error: src/backend/storage/ipc/latch.c: patch does not apply error: patch failed: src/backend/storage/lmgr/condition_variable.c:66 error: src/backend/storage/lmgr/condition_variable.c: patch does not apply error: patch failed: src/backend/utils/resowner/resowner.c:124 error: src/backend/utils/resowner/resowner.c: patch does not apply error: patch failed: src/include/storage/latch.h:101 error: src/include/storage/latch.h: patch does not apply error: patch failed: src/include/utils/resowner_private.h:18 error: src/include/utils/resowner_private.h: patch does not apply
Re: [HACKERS] asynchronous execution
Hello. This is the final report in this CF period. At Fri, 17 Mar 2017 17:35:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170317.173505.152063931.horiguchi.kyot...@lab.ntt.co.jp> > Async-capable plan is generated in planner. An Append contains at > least one async-capable child becomes async-aware Append. So the > async feature should be effective also for the UNION ALL case. > > The following will work faster than unpatched version.I > > SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL > SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft; > > I'll measure the performance for the case next week. I found that the following query works as the same as partitioned table. SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40 UNION ALL *SELECT a FROM ONLY pf0*) as ft; So, the difference comes from the additional async-uncapable child (faster if contains any). In both cases, Append node runs children asynchronously but slightly differently when all async-capable children are busy. I'll continue working on this from this point aiming to the next commit fest. Thank you for valuable feedback. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
At Thu, 16 Mar 2017 17:16:32 -0400, Corey Huinkerwrote in
Re: [HACKERS] asynchronous execution
On Thu, Mar 16, 2017 at 4:17 PM, Tom Lanewrote: > Corey Huinker writes: > > I reworked the test such that all of the foreign tables inherit from the > > same parent table, and if you query that you do get async execution. But > It > > doesn't work when just stringing together those foreign tables with UNION > > ALLs. > > > I don't know how to proceed with this review if that was a goal of the > > patch. > > Whether it was a goal or not, I'd say there is something either broken > or incorrectly implemented if you don't see that. The planner (and > therefore also the executor) generally treats inheritance the same as > simple UNION ALL. If that's not the case here, I'd want to know why. > > regards, tom lane > Updated commitfest entry to "Returned With Feedback".
Re: [HACKERS] asynchronous execution
Corey Huinkerwrites: > I reworked the test such that all of the foreign tables inherit from the > same parent table, and if you query that you do get async execution. But It > doesn't work when just stringing together those foreign tables with UNION > ALLs. > I don't know how to proceed with this review if that was a goal of the > patch. Whether it was a goal or not, I'd say there is something either broken or incorrectly implemented if you don't see that. The planner (and therefore also the executor) generally treats inheritance the same as simple UNION ALL. If that's not the case here, I'd want to know why. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On Mon, Mar 13, 2017 at 9:28 PM, Amit Langotewrote: > On 2017/03/14 10:08, Corey Huinker wrote: > >> I don't think the plan itself will change as a result of applying this > >> patch. You might however be able to observe some performance > improvement. > > > > I could see no performance improvement, even with 16 separate queries > > combined with UNION ALL. Query performance was always with +/- 10% of a > 9.6 > > instance given the same script. I must be missing something. > > Hmm, maybe I'm missing something too. > > Anyway, here is an older message on this thread from Horiguchi-san where > he shared some of the test cases that this patch improves performance for: > > https://www.postgresql.org/message-id/20161018.103051. > 30820907.horiguchi.kyotaro%40lab.ntt.co.jp > > From that message: > > > I measured performance and had the following result. > > t0 - SELECT sum(a) FROM ; > pl - SELECT sum(a) FROM <4 local children>; > pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; > pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; > > The result is written as "time (std dev )" > > sync > t0: 3820.33 ( 1.88) > pl: 1608.59 ( 12.06) > pf0: 7928.29 ( 46.58) > pf1: 8023.16 ( 26.43) > > async > t0: 3806.31 ( 4.49)0.4% faster (should be error) > pl: 1629.17 ( 0.29)1.3% slower > pf0: 6447.07 ( 25.19) 18.7% faster > pf1: 1876.80 ( 47.13) 76.6% faster > > > IIUC, pf0 and pf1 is the same test case (all 4 foreign tables target the > same server) measured with different implementations of the patch. > > Thanks, > Amit > I reworked the test such that all of the foreign tables inherit from the same parent table, and if you query that you do get async execution. But It doesn't work when just stringing together those foreign tables with UNION ALLs. I don't know how to proceed with this review if that was a goal of the patch.
Re: [HACKERS] asynchronous execution
On 2017/03/14 10:08, Corey Huinker wrote: >> I don't think the plan itself will change as a result of applying this >> patch. You might however be able to observe some performance improvement. > > I could see no performance improvement, even with 16 separate queries > combined with UNION ALL. Query performance was always with +/- 10% of a 9.6 > instance given the same script. I must be missing something. Hmm, maybe I'm missing something too. Anyway, here is an older message on this thread from Horiguchi-san where he shared some of the test cases that this patch improves performance for: https://www.postgresql.org/message-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp >From that message: I measured performance and had the following result. t0 - SELECT sum(a) FROM ; pl - SELECT sum(a) FROM <4 local children>; pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; The result is written as "time (std dev )" sync t0: 3820.33 ( 1.88) pl: 1608.59 ( 12.06) pf0: 7928.29 ( 46.58) pf1: 8023.16 ( 26.43) async t0: 3806.31 ( 4.49)0.4% faster (should be error) pl: 1629.17 ( 0.29)1.3% slower pf0: 6447.07 ( 25.19) 18.7% faster pf1: 1876.80 ( 47.13) 76.6% faster IIUC, pf0 and pf1 is the same test case (all 4 foreign tables target the same server) measured with different implementations of the patch. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
> > I don't think the plan itself will change as a result of applying this > patch. You might however be able to observe some performance improvement. > > Thanks, > Amit > I could see no performance improvement, even with 16 separate queries combined with UNION ALL. Query performance was always with +/- 10% of a 9.6 instance given the same script. I must be missing something.
Re: [HACKERS] asynchronous execution
On 2017/03/14 6:31, Corey Huinker wrote: > On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker> wrote: > >> >>> I think it will, because Append itself has been made async-capable by one >>> of the patches and UNION ALL uses Append. But as mentioned above, only >>> the postgres_fdw foreign tables will be able to utilize this for now. >>> >>> >> Ok, I'll re-run my test from a few weeks back and see if anything has >> changed. >> > > > I'm not able to discern any difference in plan between a 9.6 instance and > this patch. > > The basic outline of my test is: > > EXPLAIN ANALYZE > SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago' > UNION ALL > SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago' > UNION ALL > SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago' > UNION ALL > SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago' > > > I've tried this test where tab1 through tab4 all are the same postgres_fdw > foreign table. > I've tried this test where tab1 through tab4 all are different foreign > tables pointing to the same remote table sharing a the same server > definition. > I've tried this test where tab1 through tab4 all are different foreign > tables pointing each with it's own foreign server definition, all of which > happen to point to the same remote cluster. > > Are there some postgresql.conf settings I should set to get a decent test? I don't think the plan itself will change as a result of applying this patch. You might however be able to observe some performance improvement. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinkerwrote: > >> I think it will, because Append itself has been made async-capable by one >> of the patches and UNION ALL uses Append. But as mentioned above, only >> the postgres_fdw foreign tables will be able to utilize this for now. >> >> > Ok, I'll re-run my test from a few weeks back and see if anything has > changed. > I'm not able to discern any difference in plan between a 9.6 instance and this patch. The basic outline of my test is: EXPLAIN ANALYZE SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago' UNION ALL SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago' UNION ALL SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago' UNION ALL SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago' I've tried this test where tab1 through tab4 all are the same postgres_fdw foreign table. I've tried this test where tab1 through tab4 all are different foreign tables pointing to the same remote table sharing a the same server definition. I've tried this test where tab1 through tab4 all are different foreign tables pointing each with it's own foreign server definition, all of which happen to point to the same remote cluster. Are there some postgresql.conf settings I should set to get a decent test?
Re: [HACKERS] asynchronous execution
> > > I think it will, because Append itself has been made async-capable by one > of the patches and UNION ALL uses Append. But as mentioned above, only > the postgres_fdw foreign tables will be able to utilize this for now. > > Ok, I'll re-run my test from a few weeks back and see if anything has changed.
Re: [HACKERS] asynchronous execution
On 2017/03/11 8:19, Corey Huinker wrote: > > On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI >> > wrote: > > 9e43e87 > > > Patch fails on current master, but correctly applies to 9e43e87. Thanks > for including the commit id. > > Regression tests pass. > > As with my last attempt at reviewing this patch, I'm confused about what > kind of queries can take advantage of this patch. Is it only cases where a > local table has multiple inherited foreign table children? IIUC, Horiguchi-san's patch adds asynchronous capability for ForeignScan's driven by postgres_fdw (after building some relevant infrastructure first). The same might be added to other Scan nodes (and probably other nodes as well) eventually so that more queries will benefit from asynchronous execution. It may just be that ForeignScan's benefit more from asynchronous execution than other Scan types. > Will it work > with queries where two foreign tables are referenced and combined with a > UNION ALL? I think it will, because Append itself has been made async-capable by one of the patches and UNION ALL uses Append. But as mentioned above, only the postgres_fdw foreign tables will be able to utilize this for now. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI < horiguchi.kyot...@lab.ntt.co.jp> wrote: > 9e43e87 Patch fails on current master, but correctly applies to 9e43e87. Thanks for including the commit id. Regression tests pass. As with my last attempt at reviewing this patch, I'm confused about what kind of queries can take advantage of this patch. Is it only cases where a local table has multiple inherited foreign table children? Will it work with queries where two foreign tables are referenced and combined with a UNION ALL?
Re: [HACKERS] asynchronous execution
Hello, I totally reorganized the patch set to four pathces on the current master (9e43e87). At Wed, 22 Feb 2017 17:39:45 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20170222.173945.262776579.horiguchi.kyot...@lab.ntt.co.jp> > Finally, I couldn't see the crash for the (maybe) same case. I > can guess two reasons for this. One is that a situation where > node->as_nasyncpending differs from estate->es_num_pending_async, > but I couldn't find a possibility. Another is a situation in > postgresIterateForeignScan where the "next owner" reaches eof but > another waiter is not. I haven't reproduce the situation but > fixed it for the case. Addition to that I found a bug in > ExecAsyncAppendResponse. It calls bms_add_member inappropriate > way. This found to be wrong. The true problem here was (maybe) that ExecAsyncRequest can complete a tuple immediately. This causes multiple calling to ExecAsyncRequest for the same child at once. (For the case, the processing node is added again to node->as_needrequest before ExecAsyncRequest returns.) Using a copy of node->as_needrequest will fix this but it is uneasy so I changed ExecAsyncRequest not to return a tuple immediately. Instaed, ExecAsyncEventLoop skips waiting if no node to wait. The tuple previously "response"'ed in ExecAsyncRequest is now responsed here. Addition to that, the current policy of preserving of es_wait_event_set doesn't seem to work with the async-capable postgres_fdw. So the current code cleares it at every entering to ExecAppend. This needs more thoughts. I measured the performance of async-execution and it was quite good from the previous version especially for single-connection environment. pf0: 4 foreign tables on single connection non async : (prev) 7928ms -> (this time)7993ms async : (prev) 6447ms -> (this time)3211ms pf1: 4 foreign tables on dedicate connection for every table non async : (prev) 8023ms -> (this time)7953ms async : (prev) 1876ms -> (this time)1841ms Boost rate by async execution is 60% for single connectsion and 77% for dedicate connection environment. > > Mmm, I reproduces it quite easily. A silly bug. > > > > Something bad is happening between freeing ExecutorState memory > > context and resource owner. Perhaps the ExecutorState is freed by > > resowner (as a part of its anscestors) before the memory for the > > WaitEventSet is freed. It was careless of me. I'll reconsider it. > > The cause was that the WaitEventSet was placed in ExecutorState > but registered to TopTransactionResourceOwner. I fixed it. The attached patches are the following. 0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch Allows WaitEventSet to released by resource owner. 0002-Asynchronous-execution-framework.patch Asynchronous execution framework based on Robert's version. All edits on this is merged. 0003-Make-postgres_fdw-async-capable.patch Make postgres_fdw async-capable. 0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch This can be merged to 0002 but I didn't since the usage of using these pragmas is arguable. regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From bcd888a98a7aa5e1bd367c83e06d598121fd2d94 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 22 Feb 2017 09:07:49 +0900 Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 ++- src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 68 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 6 files changed, 97 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 7939b1f..16a5d7a 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -201,7 +201,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 0079ba5..a204b0c 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -62,6 +62,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use.
Re: [HACKERS] asynchronous execution
Thank you very much for testing this! At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langotewrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df...@lab.ntt.co.jp> > Horiguchi-san, > > On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote: > > I noticed that this patch is conflicting with 665d1fa (Logical > > replication) so I rebased this. Only executor/Makefile > > conflicted. > > With the latest set of patches, I observe a crash due to an Assert failure: > > #3 0x006883ed in ExecAsyncEventWait (estate=0x13c01b8, > timeout=-1) at execAsync.c:345 This means no pending fdw scan didn't let itself go to waiting stage. It leads to a stuck of the whole things. This is caused if no one acutually is waiting for result. I suppose that all of the foreign scans ran on the same connection. Anyway it should be a mistake in state transition. I'll look into it. > I was running a query whose plan looked like: > > explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab > group by 1,2 order by 1; > QUERY PLAN > -- > Sort >Sort Key: ((ptab.tableoid)::regclass) >-> HashAggregate > Group Key: (ptab.tableoid)::regclass, ptab.a > -> Result >-> Append > -> Foreign Scan on ptab_1 > -> Foreign Scan on ptab_2 > -> Foreign Scan on ptab_3 > -> Foreign Scan on ptab_4 > -> Foreign Scan on ptab_5 > -> Foreign Scan on ptab_6 > -> Foreign Scan on ptab_7 > -> Foreign Scan on ptab_8 > -> Foreign Scan on ptab_9 > -> Foreign Scan on ptab_00010 > > > The snipped part contains Foreign Scans on 90 more foreign partitions (in > fact, I could see the crash even with 10 foreign table partitions for the > same query). Yeah, it seems to me unrelated to how many they are. > There is a crash in one more case, which seems related to how WaitEventSet > objects are manipulated during resource-owner-mediated cleanup of a failed > query, such as after the FDW returned an error like below: > > ERROR: relation "public.ptab_00010" does not exist > CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010 > > The backtrace in this looks like below: > > Program terminated with signal SIGSEGV, Segmentation fault. > #0 0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, > value=20645152) at resowner.c:301 > 301 lastidx = resarr->lastidx; > (gdb) > (gdb) bt > #0 0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, > value=20645152) at resowner.c:301 > #1 0x009c6578 in ResourceOwnerForgetWES > (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317 > #2 0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600 > #3 0x009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768, > phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001') > at resowner.c:566 > #4 0x009c5155 in ResourceOwnerRelease (owner=0x12de768, > phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 > '\001') at resowner.c:485 > #5 0x00524172 in AbortTransaction () at xact.c:2588 > #6 0x00524854 in AbortCurrentTransaction () at xact.c:3016 > #7 0x00836aa6 in PostgresMain (argc=1, argv=0x12d7b08, > dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860 > #8 0x007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310 > #9 0x007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982 > #10 0x007a0885 in ServerLoop () at postmaster.c:1722 > #11 0x0079febf in PostmasterMain (argc=3, argv=0x12aacc0) at > postmaster.c:1330 > #12 0x006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228 > > There is a segfault when accessing the events variable, whose members seem > to be pfreed: > > (gdb) f 2 > #2 0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600 > 600 ResourceOwnerForgetWES(set->resowner, set); > (gdb) p *set > $5 = { > nevents = 2139062143, > nevents_space = 2139062143, > resowner = 0x7f7f7f7f7f7f7f7f, > events = 0x7f7f7f7f7f7f7f7f, > latch = 0x7f7f7f7f7f7f7f7f, > latch_pos = 2139062143, > epoll_fd = 2139062143, > epoll_ret_events = 0x7f7f7f7f7f7f7f7f > } Mmm, I reproduces it quite easily. A silly bug. Something bad is happening between freeing ExecutorState memory context and resource owner. Perhaps the ExecutorState is freed by resowner (as a part of its anscestors) before the memory for the WaitEventSet is freed. It was careless of me. I'll reconsider it. Great thanks for the report. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] asynchronous execution
Horiguchi-san, On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote: > I noticed that this patch is conflicting with 665d1fa (Logical > replication) so I rebased this. Only executor/Makefile > conflicted. With the latest set of patches, I observe a crash due to an Assert failure: #0 0x003969632625 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64 #1 0x003969633e05 in *__GI_abort () at abort.c:92 #2 0x0098b22c in ExceptionalCondition (conditionName=0xb30e02 "!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50 "execAsync.c", lineNumber=345) at assert.c:54 #3 0x006883ed in ExecAsyncEventWait (estate=0x13c01b8, timeout=-1) at execAsync.c:345 #4 0x00687ed5 in ExecAsyncEventLoop (estate=0x13c01b8, requestor=0x13c1640, timeout=-1) at execAsync.c:186 #5 0x006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257 #6 0x00692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411 #7 0x006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113 #8 0x00692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399 #9 0x006a596b in fetch_input_tuple (aggstate=0x13c06a0) at nodeAgg.c:587 #10 0x006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at nodeAgg.c:2272 #11 0x006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910 #12 0x00692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514 #13 0x006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103 #14 0x00692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506 #15 0x0068e733 in ExecutePlan (estate=0x13c01b8, planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT, sendTuples=1 '\001', numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8) at execMain.c:1609 #16 0x0068c751 in standard_ExecutorRun (queryDesc=0x135c568, direction=ForwardScanDirection, count=0) at execMain.c:341 #17 0x0068c5dc in ExecutorRun (queryDesc=0x135c568, I was running a query whose plan looked like: explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab group by 1,2 order by 1; QUERY PLAN -- Sort Sort Key: ((ptab.tableoid)::regclass) -> HashAggregate Group Key: (ptab.tableoid)::regclass, ptab.a -> Result -> Append -> Foreign Scan on ptab_1 -> Foreign Scan on ptab_2 -> Foreign Scan on ptab_3 -> Foreign Scan on ptab_4 -> Foreign Scan on ptab_5 -> Foreign Scan on ptab_6 -> Foreign Scan on ptab_7 -> Foreign Scan on ptab_8 -> Foreign Scan on ptab_9 -> Foreign Scan on ptab_00010 The snipped part contains Foreign Scans on 90 more foreign partitions (in fact, I could see the crash even with 10 foreign table partitions for the same query). There is a crash in one more case, which seems related to how WaitEventSet objects are manipulated during resource-owner-mediated cleanup of a failed query, such as after the FDW returned an error like below: ERROR: relation "public.ptab_00010" does not exist CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010 The backtrace in this looks like below: Program terminated with signal SIGSEGV, Segmentation fault. #0 0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, value=20645152) at resowner.c:301 301 lastidx = resarr->lastidx; (gdb) (gdb) bt #0 0x009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf, value=20645152) at resowner.c:301 #1 0x009c6578 in ResourceOwnerForgetWES (owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317 #2 0x00806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600 #3 0x009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768, phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001') at resowner.c:566 #4 0x009c5155 in ResourceOwnerRelease (owner=0x12de768, phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001') at resowner.c:485 #5 0x00524172 in AbortTransaction () at xact.c:2588 #6 0x00524854 in AbortCurrentTransaction () at xact.c:3016 #7 0x00836aa6 in PostgresMain (argc=1, argv=0x12d7b08, dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860 #8 0x007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310 #9 0x007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982 #10 0x007a0885 in ServerLoop () at postmaster.c:1722 #11 0x0079febf in PostmasterMain (argc=3, argv=0x12aacc0) at postmaster.c:1330 #12 0x006e7549 in main (argc=3, argv=0x12aacc0) at
Re: [HACKERS] asynchronous execution
On Fri, Feb 3, 2017 at 5:04 AM, Antonin Houskawrote: > Kyotaro HORIGUCHI wrote: > > > I noticed that this patch is conflicting with 665d1fa (Logical > > replication) so I rebased this. Only executor/Makefile > > conflicted. > > I was lucky enough to see an infinite loop when using this patch, which I > fixed by this change: > > diff --git a/src/backend/executor/execAsync.c > b/src/backend/executor/execAsync.c > new file mode 100644 > index 588ba18..9b87fbd > *** a/src/backend/executor/execAsync.c > --- b/src/backend/executor/execAsync.c > *** ExecAsyncEventWait(EState *estate, long > *** 364,369 > --- 364,370 > > if ((w->events & WL_LATCH_SET) != 0) > { > + ResetLatch(MyLatch); > process_latch_set = true; > continue; > } Hi, I've been testing this patch because seemed like it would help a use case of mine, but can't tell if it's currently working for cases other than a local parent table that has many child partitions which happen to be foreign tables. Is it? I was hoping to use it for a case like: select x, sum(y) from one_remote_table union all select x, sum(y) from another_remote_table union all select x, sum(y) from a_third_remote_table but while aggregates do appear to be pushed down, it seems that the remote tables are being queried in sequence. Am I doing something wrong?
Re: [HACKERS] asynchronous execution
Kyotaro HORIGUCHIwrote: > I noticed that this patch is conflicting with 665d1fa (Logical > replication) so I rebased this. Only executor/Makefile > conflicted. I was lucky enough to see an infinite loop when using this patch, which I fixed by this change: diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c new file mode 100644 index 588ba18..9b87fbd *** a/src/backend/executor/execAsync.c --- b/src/backend/executor/execAsync.c *** ExecAsyncEventWait(EState *estate, long *** 364,369 --- 364,370 if ((w->events & WL_LATCH_SET) != 0) { + ResetLatch(MyLatch); process_latch_set = true; continue; } Actually _almost_ fixed because at some point one of the following Assert(areq->state == ASYNC_WAITING); statements fired. I think it was the immediately following one, but I can imagine the same to happen in the branch if (process_latch_set) ... I think the wants_process_latch field of PendingAsyncRequest is not useful alone because the process latch can be set for reasons completely unrelated to the asynchronous processing. If the asynchronous node should use latch to signal it's readiness, I think an additional flag is needed in the request which tells ExecAsyncEventWait that the latch was set by the asynchronous node. BTW, do we really need the ASYNC_CALLBACK_PENDING state? I can imagine the async node either to change ASYNC_WAITING directly to ASYNC_COMPLETE, or leave it ASYNC_WAITING if the data is not ready. In addition, the following comments are based only on code review, I didn't verify my understanding experimentally: * Isn't it possible for AppendState.as_asyncresult to contain multiple responses from the same async node? Since the array stores TupleTableSlot instead of the actual tuple (so multiple items of as_asyncresult point to the same slot), I suspect the slot contents might not be defined when the Append node eventually tries to return it to the upper plan. * For the WaitEvent subsystem to work, I think postgres_fdw should keep a separate libpq connection per node, not per user mapping. Currently the connections are cached by user mapping, but it's legal to locate multiple child postgres_fdw nodes of Append plan on the same remote server. I expect that these "co-located" nodes would currently use the same user mapping and therefore the same connection. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Thank you. At Wed, 1 Feb 2017 14:11:58 +0900, Michael Paquierwrote in
Re: [HACKERS] asynchronous execution
On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHIwrote: > I noticed that this patch is conflicting with 665d1fa (Logical > replication) so I rebased this. Only executor/Makefile > conflicted. The patches still apply, moved to CF 2017-03. Be aware of that: $ git diff HEAD~6 --check contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces. + PendingAsyncRequest *areq, contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces. + bool reinit); src/backend/utils/resowner/resowner.c:1332: new blank line at EOF. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Hello, I cannot respond until next Monday, so I move this to the next CF by myself. At Tue, 15 Nov 2016 20:25:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20161115.202513.268072050.horiguchi.kyot...@lab.ntt.co.jp> > Hello, this is a maintenance post of reased patches. > I added a change of ResourceOwnerData missed in 0005. > > At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI > wrote in > <20161031.103912.217430542.horiguchi.kyot...@lab.ntt.co.jp> > > This a PoC patch of asynchronous execution feature, based on a > > executor infrastructure Robert proposed. These patches are > > rebased on the current master. > > > > 0001-robert-s-2nd-framework.patch > > > > Roberts executor async infrastructure. Async-driver nodes > > register its async-capable children and sync and data transfer > > are done out of band of ordinary ExecProcNode channel. So async > > execution no longer disturbs async-unaware node and slows them > > down. > > > > 0002-Fix-some-bugs.patch > > > > Some fixes for 0001 to work. This is just to preserve the shape > > of 0001 patch. > > > > 0003-Modify-async-execution-infrastructure.patch > > > > The original infrastructure doesn't work when multiple foreign > > tables is on the same connection. This makes it work. > > > > 0004-Make-postgres_fdw-async-capable.patch > > > > Makes postgres_fdw to work asynchronously. > > > > 0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch > > > > This addresses a problem pointed by Robers about 0001 patch, > > that WaitEventSet used for async execution can leak by errors. > > > > 0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch > > > > ExecAppend gets a bit slower by penalties of misprediction of > > branches. This fixes it by using unlikely() macro. > > > > 0007-Add-instrumentation-to-async-execution.patch > > > > As the description above for 0001, async infrastructure conveys > > tuples outside ExecProcNode channel so EXPLAIN ANALYZE requires > > special treat to show sane results. This patch tries that. > > > > > > A result of a performance measurement is in this message. > > > > https://www.postgresql.org/message-id/20161025.182150.230901487.horiguchi.kyot...@lab.ntt.co.jp > > > > > > | t0 - SELECT sum(a) FROM ; > > | pl - SELECT sum(a) FROM <4 local children>; > > | pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; > > | pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; > > ... > > | async > > | t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my > > env..) > > | pl: 1617.20 ( 3.51) 1.26% faster (ditto) > > | pf0: 6680.95 (478.72) 19.5% faster > > | pf1: 1886.87 ( 36.25) 77.1% faster -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Hi, this is the 7th patch to make instrumentation work. Explain analyze shows the following result by the previous patch set . | Aggregate (cost=820.25..820.26 rows=1 width=8) (actual time=4324.676..4324.676 | rows=1 loops=1) | -> Append (cost=0.00..791.00 rows=11701 width=4) (actual time=0.910..3663.8 |82 rows=400 loops=1) | -> Foreign Scan on ft10 (cost=100.00..197.75 rows=2925 width=4) | (never executed) | -> Foreign Scan on ft20 (cost=100.00..197.75 rows=2925 width=4) | (never executed) | -> Foreign Scan on ft30 (cost=100.00..197.75 rows=2925 width=4) | (never executed) | -> Foreign Scan on ft40 (cost=100.00..197.75 rows=2925 width=4) | (never executed) | -> Seq Scan on pf0 (cost=0.00..0.00 rows=1 width=4) | (actual time=0.004..0.004 rows=0 loops=1) The current instrument stuff assumes that requested tuple always returns a tuple or the end of tuple comes. This async framework has two point of executing underneath nodes. ExecAsyncRequest and ExecAsyncEventLoop. So I'm not sure if this is appropriate but anyway it seems to show sane numbers. | Aggregate (cost=820.25..820.26 rows=1 width=8) (actual time=4571.205..4571.206 | rows=1 loops=1) | -> Append (cost=0.00..791.00 rows=11701 width=4) (actual time=1.362..3893.1 |14 rows=400 loops=1) | -> Foreign Scan on ft10 (cost=100.00..197.75 rows=2925 width=4) |(actual time=1.056..770.863 rows=100 loops=1) | -> Foreign Scan on ft20 (cost=100.00..197.75 rows=2925 width=4) |(actual time=0.461..767.840 rows=100 loops=1) | -> Foreign Scan on ft30 (cost=100.00..197.75 rows=2925 width=4) |(actual time=0.474..782.547 rows=100 loops=1) | -> Foreign Scan on ft40 (cost=100.00..197.75 rows=2925 width=4) |(actual time=0.156..765.920 rows=100 loops=1) | -> Seq Scan on pf0 (cost=0.00..0.00 rows=1 width=4) (never executed) regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 35c60a46f49aab72d492c798ff7eb8fc0e672250 Mon Sep 17 00:00:00 2001 From: Kyotaro HoriguchiDate: Tue, 25 Oct 2016 19:04:04 +0900 Subject: [PATCH 7/7] Add instrumentation to async execution Make explain analyze give sane result when async execution has taken place. --- src/backend/executor/execAsync.c | 19 +++ src/backend/executor/instrument.c | 2 +- 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c index 40e3f67..588ba18 100644 --- a/src/backend/executor/execAsync.c +++ b/src/backend/executor/execAsync.c @@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index, PendingAsyncRequest *areq = NULL; int nasync = estate->es_num_pending_async; + if (requestee->instrument) + InstrStartNode(requestee->instrument); + /* * If the number of pending asynchronous nodes exceeds the number of * available slots in the es_pending_async array, expand the array. @@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index, if (areq->state == ASYNC_COMPLETE) { Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot)); + ExecAsyncResponse(estate, areq); + if (areq->requestee->instrument) + InstrStopNode(requestee->instrument, + TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0); return; } + if (areq->requestee->instrument) + InstrStopNode(requestee->instrument, 0); /* No result available now, make this node pending */ estate->es_num_pending_async++; } @@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout) { PendingAsyncRequest *areq = estate->es_pending_async[i]; + if (areq->requestee->instrument) +InstrStartNode(areq->requestee->instrument); + /* Skip it if not pending. */ if (areq->state == ASYNC_CALLBACK_PENDING) { @@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout) if (requestor == areq->requestor) requestor_done = true; ExecAsyncResponse(estate, areq); + +if (areq->requestee->instrument) + InstrStopNode(areq->requestee->instrument, + TupIsNull((TupleTableSlot*)areq->result) ? + 0.0 : 1.0); } + else if (areq->requestee->instrument) +InstrStopNode(areq->requestee->instrument, 0); } /* If any node completed, compact the array. */ diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c index 2614bf4..6a22a15 100644 --- a/src/backend/executor/instrument.c +++ b/src/backend/executor/instrument.c @@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples) , >bufusage_start); /* Is this the first tuple
Re: [HACKERS] asynchronous execution
This is the rebased version on the current master(-0004), and added resowner stuff (0005) and unlikely(0006). At Tue, 18 Oct 2016 10:30:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20161018.103051.30820907.horiguchi.kyot...@lab.ntt.co.jp> > > > - Errors in the executor can leak the WaitEventSet. Probably we need > > > to modify ResourceOwners to be able to own WaitEventSets. > > WaitEventSet itself is not leaked but epoll-fd should be closed > at failure. This seems doable with TRY-CATCHing in > ExecAsyncEventLoop. (not yet) Haha, that's a silly talk. The wait event can continue to live when timeout and any error can happen on the way after the that. I added an entry for wait event set to resource owner and hang ones created in ExecAsyncEventWait to TopTransactionResourceOwner. Currently WaitLatchOrSocket doesn't do so not to change the current behavior. WaitEventSet doesn't have usable identifier for resowner.c so currently I use the address(pointer value) for the purpose. The patch 0005 does that. > I measured performance and had the following result. > > t0 - SELECT sum(a) FROM ; > pl - SELECT sum(a) FROM <4 local children>; > pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; > pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; > > The result is written as "time (std dev )" > > sync > t0: 3820.33 ( 1.88) > pl: 1608.59 ( 12.06) > pf0: 7928.29 ( 46.58) > pf1: 8023.16 ( 26.43) > > async > t0: 3806.31 ( 4.49)0.4% faster (should be error) > pl: 1629.17 ( 0.29)1.3% slower > pf0: 6447.07 ( 25.19) 18.7% faster > pf1: 1876.80 ( 47.13) 76.6% faster > > t0 is not affected since the ExecProcNode stuff has gone. > > pl is getting a bit slower. (almost the same to simple seqscan of > the previous patch) This should be a misprediction penalty. Using likely macro for ExecAppend, and it seems to have shaken off the degradation. sync t0: 3919.49 ( 5.95) pl: 1637.95 ( 0.75) pf0: 8304.20 ( 43.94) pf1: 8222.09 ( 28.20) async t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..) pl: 1617.20 ( 3.51) 1.26% faster (ditto) pf0: 6680.95 (478.72) 19.5% faster pf1: 1886.87 ( 36.25) 77.1% faster regards, -- Kyotaro Horiguchi NTT Open Source Software Center >From 25eba7e506228ab087e8b743efb039286a8251c4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Wed, 12 Oct 2016 12:46:10 +0900 Subject: [PATCH 1/6] robert's 2nd framework --- contrib/postgres_fdw/postgres_fdw.c | 49 src/backend/executor/Makefile | 4 +- src/backend/executor/README | 43 +++ src/backend/executor/execAmi.c | 5 + src/backend/executor/execAsync.c| 462 src/backend/executor/nodeAppend.c | 162 ++- src/backend/executor/nodeForeignscan.c | 49 src/backend/nodes/copyfuncs.c | 1 + src/backend/nodes/outfuncs.c| 1 + src/backend/nodes/readfuncs.c | 1 + src/backend/optimizer/plan/createplan.c | 45 +++- src/include/executor/execAsync.h| 29 ++ src/include/executor/nodeAppend.h | 3 + src/include/executor/nodeForeignscan.h | 7 + src/include/foreign/fdwapi.h| 15 ++ src/include/nodes/execnodes.h | 57 +++- src/include/nodes/plannodes.h | 1 + 17 files changed, 909 insertions(+), 25 deletions(-) create mode 100644 src/backend/executor/execAsync.c create mode 100644 src/include/executor/execAsync.h diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 906d6e6..c480945 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -19,6 +19,7 @@ #include "commands/defrem.h" #include "commands/explain.h" #include "commands/vacuum.h" +#include "executor/execAsync.h" #include "foreign/fdwapi.h" #include "funcapi.h" #include "miscadmin.h" @@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKind stage, RelOptInfo *input_rel, RelOptInfo *output_rel); +static bool postgresIsForeignPathAsyncCapable(ForeignPath *path); +static void postgresForeignAsyncRequest(EState *estate, + PendingAsyncRequest *areq); +static void postgresForeignAsyncConfigureWait(EState *estate, + PendingAsyncRequest *areq, + bool reinit); +static void postgresForeignAsyncNotify(EState *estate, + PendingAsyncRequest *areq); /* * Helper functions @@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) /* Support functions for upper relation push-down */ routine->GetForeignUpperPaths = postgresGetForeignUpperPaths; + /* Support functions for async execution */ + routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable; + routine->ForeignAsyncRequest = postgresForeignAsyncRequest; +
Re: [HACKERS] asynchronous execution
Hello, this works but ExecAppend gets a bit degradation. At Mon, 03 Oct 2016 19:46:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHIwrote in <20161003.194632.204401048.horiguchi.kyot...@lab.ntt.co.jp> > > Some notes: > > > > - EvalPlanQual rechecks are broken. This is fixed by adding (restoring) async-cancelation. > > - EXPLAIN ANALYZE instrumentation is broken. EXPLAIN ANALYE seems working but async-specific information is not available yet. > > - ExecReScanAppend is broken, because the async stuff needs some way > > of canceling an async request and I didn't invent anything like that > > yet. Fixed as EvalPlanQual. > > - The postgres_fdw changes pretend to be async but aren't actually. > > It's just a demo of (part of) the interface at this point. Applied my previous patch with some modification. > > - The postgres_fdw changes also report all pg-fdw paths as > > async-capable, but actually the direct-modify ones aren't, so the > > regression tests fail. All actions other than scan does vacate_connection() to use a connection. > > - Errors in the executor can leak the WaitEventSet. Probably we need > > to modify ResourceOwners to be able to own WaitEventSets. WaitEventSet itself is not leaked but epoll-fd should be closed at failure. This seems doable with TRY-CATCHing in ExecAsyncEventLoop. (not yet) > > - There are probably other bugs, too. > > > > Whee! > > > > Note that I've tried to solve the re-entrancy problems by (1) putting > > all of the event loop's state inside the EState rather than in local > > variables and (2) having the function that is called to report arrival > > of a result be thoroughly different than the function that is used to > > return a tuple to a synchronous caller. > > > > Comments welcome, if you're feeling brave enough to look at anything > > this half-baked. This doesn't cause reentry since this no longer bubbles up tupples through async-unaware nodes. This framework passes tuples through private channels for requestor and requestees. Anyway, I amended this and made postgres_fdw async and then finally all regtests passed with minor modifications. The attached patches are the following. 0001-robert-s-2nd-framework.patch The patch Robert shown upthread 0002-Fix-some-bugs.patch A small patch to fix complation errors of 0001. 0003-Modify-async-execution-infrastructure.patch Several modifications on the infrastructure. The details are shown after the measurement below. 0004-Make-postgres_fdw-async-capable.patch True-async postgres-fdw. gentblr.sql, testrun.sh, calc.pl Performance test script suite. gentblr.sql - creates test tables. testrun.sh - does single test run and calc.pl - drives testrunc.sh and summirize its results. I measured performance and had the following result. t0 - SELECT sum(a) FROM ; pl - SELECT sum(a) FROM <4 local children>; pf0 - SELECT sum(a) FROM <4 foreign children on single connection>; pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>; The result is written as "time (std dev )" sync t0: 3820.33 ( 1.88) pl: 1608.59 ( 12.06) pf0: 7928.29 ( 46.58) pf1: 8023.16 ( 26.43) async t0: 3806.31 ( 4.49)0.4% faster (should be error) pl: 1629.17 ( 0.29)1.3% slower pf0: 6447.07 ( 25.19) 18.7% faster pf1: 1876.80 ( 47.13) 76.6% faster t0 is not affected since the ExecProcNode stuff has gone. pl is getting a bit slower. (almost the same to simple seqscan of the previous patch) This should be a misprediction penalty. pf0, pf1 are faster as expected. The below is a summary of modifications made by 0002 and 0003 patch. execAsync.c, execnodes.h: - Added include "pgstat.h" to use WAIT_EVENT_ASYNC_WAIT. - Changed the interface of ExecAsyncRequest to return if a tuple is immediately available or not. - Made ExecAsyncConfigureWait to return if it registered at least one waitevent or not. This is used to know the caller (ExecAsyncEventWait) has a event to wait (for safety). If two or more postgres_fdw nodes are sharing one connection, only one of them can be waited at once. It is a responsibility to the FDW drivers to ensure at least one wait event to be added but on failure WaitEventSetWait silently waits forever. - There were separate areq->callback_pending and areq->request_complete but they are altering together so they are replaced with one state variable areq->state. New enum AsyncRequestState for areq->state in execnodes.h. nodeAppend.c: - Return a tuple immediately if ExecAsyncRequest says that a tuple is available. - Reduced nest level of for(;;). nodeForeignscan.[ch], fdwapi.h, execProcnode.c:: - Calling postgresIterateForeignScan can yield tuples in wrong shape. Call ExecForeignScan instead. - Changed the interface of AsyncConfigureWait as execAsync.c. - Added ShutdownForeignScan interface. createplan.c, ruleutils.c, plannodes.h: -
Re: [HACKERS] asynchronous execution
On Tue, Oct 4, 2016 at 7:53 AM, Amit Khandekarwrote: > Also, parent pointers are not required in the new design. Thinking of > parent pointers, now it seems the event won't get bubbled up the tree > with the new design. But still, , I think it's possible to switch over > to the other asynchronous tree when some node in the current subtree > is waiting. But I am not sure, will think more on that. The bubbling-up still happens, because each node that made an async request gets a callback with the final response - and if it is itself the recipient of an async request, it can use that callback to respond to that request in turn. This version doesn't bubble up through non-async-aware nodes, but that might be a good thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On 4 October 2016 at 02:30, Robert Haaswrote: > On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar > wrote: >> On 24 September 2016 at 06:39, Robert Haas wrote: >>> Since Kyotaro Horiguchi found that my previous design had a >>> system-wide performance impact due to the ExecProcNode changes, I >>> decided to take a different approach here: I created an async >>> infrastructure where both the requestor and the requestee have to be >>> specifically modified to support parallelism, and then modified Append >>> and ForeignScan to cooperate using the new interface. Hopefully that >>> means that anything other than those two nodes will suffer no >>> performance impact. Of course, it might have other problems >> >> I see that the reason why you re-designed the asynchronous execution >> implementation is because the earlier implementation showed >> performance degradation in local sequential and local parallel scans. >> But I checked that the ExecProcNode() changes were not that >> significant as to cause the degradation. > > I think we need some testing to prove that one way or the other. If > you can do some - say on a plan with multiple nested loop joins with > inner index-scans, which will call ExecProcNode() a lot - that would > be great. I don't think we can just rely on "it doesn't seem like it > should be slower" Agreed. I will come up with some tests. > , though - ExecProcNode() is too important a function > for us to guess at what the performance will be. Also, parent pointers are not required in the new design. Thinking of parent pointers, now it seems the event won't get bubbled up the tree with the new design. But still, , I think it's possible to switch over to the other asynchronous tree when some node in the current subtree is waiting. But I am not sure, will think more on that. > > The thing I'm really worried about with either implementation is what > happens when we start to add asynchronous capability to multiple > nodes. For example, if you imagine a plan like this: > > Append > -> Hash Join > -> Foreign Scan > -> Hash > -> Seq Scan > -> Hash Join > -> Foreign Scan > -> Hash > -> Seq Scan > > In order for this to run asynchronously, you need not only Append and > Foreign Scan to be async-capable, but also Hash Join. That's true in > either approach. Things are slightly better with the original > approach, but the basic problem is there in both cases. So it seems > we need an approach that will make adding async capability to a node > really cheap, which seems like it might be a problem. Yes, we might have to deal with this. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekarwrote: > On 24 September 2016 at 06:39, Robert Haas wrote: >> Since Kyotaro Horiguchi found that my previous design had a >> system-wide performance impact due to the ExecProcNode changes, I >> decided to take a different approach here: I created an async >> infrastructure where both the requestor and the requestee have to be >> specifically modified to support parallelism, and then modified Append >> and ForeignScan to cooperate using the new interface. Hopefully that >> means that anything other than those two nodes will suffer no >> performance impact. Of course, it might have other problems > > I see that the reason why you re-designed the asynchronous execution > implementation is because the earlier implementation showed > performance degradation in local sequential and local parallel scans. > But I checked that the ExecProcNode() changes were not that > significant as to cause the degradation. I think we need some testing to prove that one way or the other. If you can do some - say on a plan with multiple nested loop joins with inner index-scans, which will call ExecProcNode() a lot - that would be great. I don't think we can just rely on "it doesn't seem like it should be slower", though - ExecProcNode() is too important a function for us to guess at what the performance will be. The thing I'm really worried about with either implementation is what happens when we start to add asynchronous capability to multiple nodes. For example, if you imagine a plan like this: Append -> Hash Join -> Foreign Scan -> Hash -> Seq Scan -> Hash Join -> Foreign Scan -> Hash -> Seq Scan In order for this to run asynchronously, you need not only Append and Foreign Scan to be async-capable, but also Hash Join. That's true in either approach. Things are slightly better with the original approach, but the basic problem is there in both cases. So it seems we need an approach that will make adding async capability to a node really cheap, which seems like it might be a problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Thank you for the thought. At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haaswrote in > [ Adjusting subject line to reflect the actual topic of discussion better. ] > > On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas wrote: > > On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar > > wrote: > >> For e.g., in the above plan which you specified, suppose : > >> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so > >> is > >> waiting in ExecAsyncWaitForNode(foreign_scan_on_b). > >> 2. The event wait list already has foreign scan on a that is on a different > >> subtree. > >> 3. This foreign scan a happens to be ready, so in > >> ExecAsyncWaitForNode (), ExecDispatchNode(foreign_scan_a) is called, > >> which returns with result_ready. > >> 4. Since it returns result_ready, it's parent node is now inserted in the > >> callbacks array, and so it's parent (Append) is executed. > >> 5. But, this Append planstate is already in the middle of executing Hash > >> join, and is waiting for HashJoin. > > > > Ah, yeah, something like that could happen. I've spent much of this > > week working on a new design for this feature which I think will avoid > > this problem. It doesn't work yet - in fact I can't even really test > > it yet. But I'll post what I've got by the end of the day today so > > that anyone who is interested can look at it and critique. > > Well, I promised to post this, so here it is. It's not really working > all that well at this point, and it's definitely not doing anything > that interesting, but you can see the outline of what I have in mind. > Since Kyotaro Horiguchi found that my previous design had a > system-wide performance impact due to the ExecProcNode changes, I > decided to take a different approach here: I created an async > infrastructure where both the requestor and the requestee have to be > specifically modified to support parallelism, and then modified Append > and ForeignScan to cooperate using the new interface. Hopefully that > means that anything other than those two nodes will suffer no > performance impact. Of course, it might have other problems The previous framework didn't need to distinguish async-capable and uncapable nodes from the parent node's view. The things in ExecProcNode was required for the reason. Instead, this new one removes the ExecProcNode stuff by distinguishing the two kinds of node in async-aware parents, that is, Append. This no longer involves async-unaware nodes into the tuple bubbling-up mechanism so the reentrant problem doesn't seem to occur. On the other hand, for example, the following plan, regrardless of its practicality, (there should be more good example..) (Async-unaware node) - NestLoop - Append - n * ForegnScan - Append - n * ForegnScan If the NestLoop, Append are async-aware, all of the ForeignScans can run asynchronously with the previous framework. The topmost NestLoop will be awakened after that firing of any ForenScans makes a tuple bubbles up to the NestLoop. This is because the not-need-to-distinguish-aware-or-not nature provided by the ExecProcNode stuff. On the other hand, with the new one, in order to do the same thing, ExecAppend have in turn to behave differently whether the parent is async or not. To do this will be bothersome but not with confidence. I examine this further intensively, especially for performance degeneration and obstacles to complete this. > Some notes: > > - EvalPlanQual rechecks are broken. > - EXPLAIN ANALYZE instrumentation is broken. > - ExecReScanAppend is broken, because the async stuff needs some way > of canceling an async request and I didn't invent anything like that > yet. > - The postgres_fdw changes pretend to be async but aren't actually. > It's just a demo of (part of) the interface at this point. > - The postgres_fdw changes also report all pg-fdw paths as > async-capable, but actually the direct-modify ones aren't, so the > regression tests fail. > - Errors in the executor can leak the WaitEventSet. Probably we need > to modify ResourceOwners to be able to own WaitEventSets. > - There are probably other bugs, too. > > Whee! > > Note that I've tried to solve the re-entrancy problems by (1) putting > all of the event loop's state inside the EState rather than in local > variables and (2) having the function that is called to report arrival > of a result be thoroughly different than the function that is used to > return a tuple to a synchronous caller. > > Comments welcome, if you're feeling brave enough to look at anything > this half-baked. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
Hello, thank you for the comment. At Wed, 28 Sep 2016 10:00:08 +0530, Amit Khandekarwrote in
Re: [HACKERS] asynchronous execution
Sorry for delayed response, I'll have enough time from now and address this. At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haaswrote in > Well, I promised to post this, so here it is. It's not really working > all that well at this point, and it's definitely not doing anything > that interesting, but you can see the outline of what I have in mind. > Since Kyotaro Horiguchi found that my previous design had a > system-wide performance impact due to the ExecProcNode changes, I > decided to take a different approach here: I created an async > infrastructure where both the requestor and the requestee have to be > specifically modified to support parallelism, and then modified Append > and ForeignScan to cooperate using the new interface. Hopefully that > means that anything other than those two nodes will suffer no > performance impact. Of course, it might have other problems > > Some notes: > > - EvalPlanQual rechecks are broken. > - EXPLAIN ANALYZE instrumentation is broken. > - ExecReScanAppend is broken, because the async stuff needs some way > of canceling an async request and I didn't invent anything like that > yet. > - The postgres_fdw changes pretend to be async but aren't actually. > It's just a demo of (part of) the interface at this point. > - The postgres_fdw changes also report all pg-fdw paths as > async-capable, but actually the direct-modify ones aren't, so the > regression tests fail. > - Errors in the executor can leak the WaitEventSet. Probably we need > to modify ResourceOwners to be able to own WaitEventSets. > - There are probably other bugs, too. > > Whee! > > Note that I've tried to solve the re-entrancy problems by (1) putting > all of the event loop's state inside the EState rather than in local > variables and (2) having the function that is called to report arrival > of a result be thoroughly different than the function that is used to > return a tuple to a synchronous caller. > > Comments welcome, if you're feeling brave enough to look at anything > this half-baked. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
On 24 September 2016 at 06:39, Robert Haaswrote: > Since Kyotaro Horiguchi found that my previous design had a > system-wide performance impact due to the ExecProcNode changes, I > decided to take a different approach here: I created an async > infrastructure where both the requestor and the requestee have to be > specifically modified to support parallelism, and then modified Append > and ForeignScan to cooperate using the new interface. Hopefully that > means that anything other than those two nodes will suffer no > performance impact. Of course, it might have other problems I see that the reason why you re-designed the asynchronous execution implementation is because the earlier implementation showed performance degradation in local sequential and local parallel scans. But I checked that the ExecProcNode() changes were not that significant as to cause the degradation. It will not call ExecAsyncWaitForNode() unless that node supports asynchronism. Do you feel there is anywhere else in the implementation that is really causing this degrade ? That previous implementation has some issues, but they seemed solvable. We could resolve the plan state recursion issue by explicitly making sure the same plan state does not get called again while it is already executing. Thanks -Amit Khandekar -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] asynchronous execution
[ Adjusting subject line to reflect the actual topic of discussion better. ] On Fri, Sep 23, 2016 at 9:29 AM, Robert Haaswrote: > On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar > wrote: >> For e.g., in the above plan which you specified, suppose : >> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and so >> is >> waiting in ExecAsyncWaitForNode(foreign_scan_on_b). >> 2. The event wait list already has foreign scan on a that is on a different >> subtree. >> 3. This foreign scan a happens to be ready, so in >> ExecAsyncWaitForNode (), ExecDispatchNode(foreign_scan_a) is called, >> which returns with result_ready. >> 4. Since it returns result_ready, it's parent node is now inserted in the >> callbacks array, and so it's parent (Append) is executed. >> 5. But, this Append planstate is already in the middle of executing Hash >> join, and is waiting for HashJoin. > > Ah, yeah, something like that could happen. I've spent much of this > week working on a new design for this feature which I think will avoid > this problem. It doesn't work yet - in fact I can't even really test > it yet. But I'll post what I've got by the end of the day today so > that anyone who is interested can look at it and critique. Well, I promised to post this, so here it is. It's not really working all that well at this point, and it's definitely not doing anything that interesting, but you can see the outline of what I have in mind. Since Kyotaro Horiguchi found that my previous design had a system-wide performance impact due to the ExecProcNode changes, I decided to take a different approach here: I created an async infrastructure where both the requestor and the requestee have to be specifically modified to support parallelism, and then modified Append and ForeignScan to cooperate using the new interface. Hopefully that means that anything other than those two nodes will suffer no performance impact. Of course, it might have other problems Some notes: - EvalPlanQual rechecks are broken. - EXPLAIN ANALYZE instrumentation is broken. - ExecReScanAppend is broken, because the async stuff needs some way of canceling an async request and I didn't invent anything like that yet. - The postgres_fdw changes pretend to be async but aren't actually. It's just a demo of (part of) the interface at this point. - The postgres_fdw changes also report all pg-fdw paths as async-capable, but actually the direct-modify ones aren't, so the regression tests fail. - Errors in the executor can leak the WaitEventSet. Probably we need to modify ResourceOwners to be able to own WaitEventSets. - There are probably other bugs, too. Whee! Note that I've tried to solve the re-entrancy problems by (1) putting all of the event loop's state inside the EState rather than in local variables and (2) having the function that is called to report arrival of a result be thoroughly different than the function that is used to return a tuple to a synchronous caller. Comments welcome, if you're feeling brave enough to look at anything this half-baked. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company async-wip-2016-09-23.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas hlinn...@iki.fi wrote: I've marked this as rejected in the commitfest, because others are working on a more general solution with parallel workers. That's still work-in-progress, and it's not certain if it's going to make it into 9.6, but if it does it will largely render this obsolete. We can revisit this patch later in the release cycle, if the parallel scan patch hasn't solved the same use case by then. I think the really important issue for this patch is the one discussed here: http://www.postgresql.org/message-id/ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com You raised an important issue there but never really expressed an opinion on the points I raised, here or on the other thread. And neither did anyone else except the patch author who, perhaps unsurprisingly, thinks it's OK. I wish we could get more discussion about that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
* Robert Haas (robertmh...@gmail.com) wrote: On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas hlinn...@iki.fi wrote: I've marked this as rejected in the commitfest, because others are working on a more general solution with parallel workers. That's still work-in-progress, and it's not certain if it's going to make it into 9.6, but if it does it will largely render this obsolete. We can revisit this patch later in the release cycle, if the parallel scan patch hasn't solved the same use case by then. I think the really important issue for this patch is the one discussed here: http://www.postgresql.org/message-id/ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com I agree that it'd be great to figure out the answer to #2, but I'm also of the opinion that we can either let the user tell us through the use of the GUCs proposed in the patch or simply not worry about the potential for time wastage associated with starting them all at once, as you suggested there. You raised an important issue there but never really expressed an opinion on the points I raised, here or on the other thread. And neither did anyone else except the patch author who, perhaps unsurprisingly, thinks it's OK. I wish we could get more discussion about that. When I read the proposal, I had the same reaction that it didn't seem like quite the right place and it further bothered me that it was specific to FDWs. Perhaps not surprisingly, as I authored it, but I'm still a fan of my proposal #1 here: http://www.postgresql.org/message-id/20131104032604.gb2...@tamriel.snowman.net More generally, I completely agree with the position (I believe your's, but I might be misremembering) that we want to have this async capability independently and in addition to parallel scan. I don't believe one obviates the advantages of the other. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Asynchronous execution on FDW
I've marked this as rejected in the commitfest, because others are working on a more general solution with parallel workers. That's still work-in-progress, and it's not certain if it's going to make it into 9.6, but if it does it will largely render this obsolete. We can revisit this patch later in the release cycle, if the parallel scan patch hasn't solved the same use case by then. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Hello Horiguchi-san, As for ForeignScan, it is merely an API for FDW and does nothing substantial so it would have nothing special to do. As for postgres_fdw, current patch restricts one execution per one foreign server at once by itself. We would have to provide another execution management if we want to have two or more simultaneous scans per one foreign server at once. Yep, your 4th patch defines a new callback to FdwRoutines and 5th patch implements postgres_fdw specific portion. It shall work for distributed / shaded database environment well, however, its benefit is around ForeignScan only. Once management node kicks underlying SeqScan, ForeignScan or others in parallel, it also enables to run local heap scan asynchronously. I suppose SeqScan don't need async kick since its startup cost is extremely low as nothing. (fetching first several pages would boost seqscans?) On the other hand sort/hash would be a field where asynchronous execution is in effect. Startup cost is not only advantage of asynchronous execution. If background worker prefetches the records to be read soon, during other tasks are in progress, its latency to fetch next record is much faster than usual execution path. Please assume if next record is on neither shared-buffer nor page cache of operating system. First, the upper node calls heap_getnext() to fetch next record, then it looks up the target block on the shared-buffer, then it issues read(2) system call, then operating system makes the caller process slept until this block gets read from the storage. If asynchronous worker already goes through the above painful code path and the records to be read are ready on the top of queue, it will reduce the i/o wait time dramatically. Sorry for the focusless discussion but does this answer some of your question? Hmm... Its advantage is still unclear for me. However, it is not fair to hijack this thread by my idea. It would be more advantageous if join/sort pushdown on fdw comes, where start-up cost could be extremely high... Not only FDW. I intend to combine the ParallelAppend with another idea I previously post, to run tables join in parallel. In case of partitioned foreign-tables, planner probably needs to consider (1) FDW scan + local serial join, (2) FDW scan + local parallel join, or (3) FDW remote join, according to the cost. * [idea] table partition + hash join: http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f8010f6...@bpxm15gp.gisp.nec.co.jp Anyway, let's have a further discussion in another thread. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Hello, At Thu, 23 Jul 2015 09:38:39 +, Kouhei Kaigai kai...@ak.jp.nec.com wrote in 9a28c8860f777e439aa12e8aea7694f80111b...@bpxm15gp.gisp.nec.co.jp I expected workloads like single shot scan on a partitioned large fact table on DWH system. Yep, if workload is expected to rescan so frequently, its expected cost shall be higher (by the cost to launch bgworker) than existing Append, then planner will kick out this path. Regarding of interaction between Limit and ParallelMergeAppend, it is probably the best scenario, isn't it? If Limit picks up the least 1000rows from a partitioned table consists of 20 child tables, ParallelMergeAppend can launch 20 parallel jobs that picks up the least 1000rows from the child relations for each. Probably, it is same job done in pass_down_bound() of nodeLimit.c. Yes. I confused a bit. The scenario is one of least problematic cases. As for ForeignScan, it is merely an API for FDW and does nothing substantial so it would have nothing special to do. As for postgres_fdw, current patch restricts one execution per one foreign server at once by itself. We would have to provide another execution management if we want to have two or more simultaneous scans per one foreign server at once. Yep, your 4th patch defines a new callback to FdwRoutines and 5th patch implements postgres_fdw specific portion. It shall work for distributed / shaded database environment well, however, its benefit is around ForeignScan only. Once management node kicks underlying SeqScan, ForeignScan or others in parallel, it also enables to run local heap scan asynchronously. I suppose SeqScan don't need async kick since its startup cost is extremely low as nothing. (fetching first several pages would boost seqscans?) On the other hand sort/hash would be a field where asynchronous execution is in effect. Sorry for the focusless discussion but does this answer some of your question? Hmm... Its advantage is still unclear for me. However, it is not fair to hijack this thread by my idea. It would be more advantageous if join/sort pushdown on fdw comes, where start-up cost could be extremely high... I'll submit my design proposal about ParallelAppend towards the next commit-fest. Please comment on. Ok, I'll come there. Expected waste of CPU or I/O is common problem to be solved, however, it does not need to add a special case handling to ForeignScan, I think. How about your opinion? I agree with you that ForeignScan as the wrapper for FDWs don't need anything special for the case. I suppose for now that avoiding the penalty from abandoning too many speculatively executed scans (or other works on bg worker like sorts) would be a business of the upper node of FDWs, or somewhere else. However, I haven't dismissed the possibility that some common works related to resource management could be integrated into executor (or even into planner), but I see none for now. I also agree with it is eventually needed, but may not be supported in the first version. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
If we have ParallelAppend node that kicks a background worker process for each underlying child node in parallel, does ForeignScan need to do something special? Although I don't see the point of the background worker in your story but at least for ParalleMergeAppend, it would frequently discontinues to scan by upper Limit so one more state, say setup - which mans a worker is allocated but not started- would be useful and the driver node might need to manage the number of async execution. Or the driven nodes might do so inversely. I expected workloads like single shot scan on a partitioned large fact table on DWH system. Yep, if workload is expected to rescan so frequently, its expected cost shall be higher (by the cost to launch bgworker) than existing Append, then planner will kick out this path. Regarding of interaction between Limit and ParallelMergeAppend, it is probably the best scenario, isn't it? If Limit picks up the least 1000rows from a partitioned table consists of 20 child tables, ParallelMergeAppend can launch 20 parallel jobs that picks up the least 1000rows from the child relations for each. Probably, it is same job done in pass_down_bound() of nodeLimit.c. As for ForeignScan, it is merely an API for FDW and does nothing substantial so it would have nothing special to do. As for postgres_fdw, current patch restricts one execution per one foreign server at once by itself. We would have to provide another execution management if we want to have two or more simultaneous scans per one foreign server at once. Yep, your 4th patch defines a new callback to FdwRoutines and 5th patch implements postgres_fdw specific portion. It shall work for distributed / shaded database environment well, however, its benefit is around ForeignScan only. Once management node kicks underlying SeqScan, ForeignScan or others in parallel, it also enables to run local heap scan asynchronously. Sorry for the focusless discussion but does this answer some of your question? Hmm... Its advantage is still unclear for me. However, it is not fair to hijack this thread by my idea. I'll submit my design proposal about ParallelAppend towards the next commit-fest. Please comment on. Expected waste of CPU or I/O is common problem to be solved, however, it does not need to add a special case handling to ForeignScan, I think. How about your opinion? I agree with you that ForeignScan as the wrapper for FDWs don't need anything special for the case. I suppose for now that avoiding the penalty from abandoning too many speculatively executed scans (or other works on bg worker like sorts) would be a business of the upper node of FDWs, or somewhere else. However, I haven't dismissed the possibility that some common works related to resource management could be integrated into executor (or even into planner), but I see none for now. I also agree with it is eventually needed, but may not be supported in the first version. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
-Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kyotaro HORIGUCHI Sent: Wednesday, July 22, 2015 4:10 PM To: robertmh...@gmail.com Cc: hlinn...@iki.fi; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Asynchronous execution on FDW Hello, thank you for the comment. At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas robertmh...@gmail.com wrote in ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas hlinn...@iki.fi wrote: At a quick glance, I think this has all the same problems as starting the execution at ExecInit phase. The correct way to do this is to kick off the queries in the first IterateForeignScan() call. You said that ExecProc phase does not fit - why not? What exactly are those problems? I can think of these: 1. If the scan is parametrized, we probably can't do it for lack of knowledge of what they will be. This seems easy; just don't do it in that case. We can put an early kick to foreign scans only for the first shot if we do it outside (before) ExecProc phase. Nestloop - SeqScan - Append - Foreign (Index) Scan - Foreign (Index) Scan .. This plan premises precise (even to some extent) estimate for remote query but async execution within ExecProc phase would be in effect for this case. 2. It's possible that we're down inside some subtree of the plan that won't actually get executed. This is trickier. As for current postgres_fdw, it is done simply abandoning queued result then close the cursor. Consider this: Append - Foreign Scan - Foreign Scan - Foreign Scan repeat 17 more times If we don't start each foreign scan until the first tuple is fetched, we will not get any benefit here, because we won't fetch the first tuple from query #2 until we finish reading the results of query #1. If the result of the Append node will be needed in its entirety, we really, really want to launch of those queries as early as possible. OTOH, if there's a Limit node with a small limit on top of the Append node, that could be quite wasteful. It's the nature of speculative execution, but the Limit will be pushed down onto every Foreign Scans near future. We could decide not to care: after all, if our limit is satisfied, we can just bang the remote connections shut, and if they wasted some CPU, well, tough luck for them. But it would be nice to be smarter. I'm not sure how, though. Appropriate fetch size will cap the harm and the case will be handled as I mentioned above as for postgres_fdw. Horiguchi-san, Let me ask an elemental question. If we have ParallelAppend node that kicks a background worker process for each underlying child node in parallel, does ForeignScan need to do something special? Expected waste of CPU or I/O is common problem to be solved, however, it does not need to add a special case handling to ForeignScan, I think. How about your opinion? Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Hello, thank you for the comment. At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas robertmh...@gmail.com wrote in ca+tgmoaijk1svzw_gkfu+zssxcijkfelqu2aomvuphpsfw4...@mail.gmail.com On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas hlinn...@iki.fi wrote: At a quick glance, I think this has all the same problems as starting the execution at ExecInit phase. The correct way to do this is to kick off the queries in the first IterateForeignScan() call. You said that ExecProc phase does not fit - why not? What exactly are those problems? I can think of these: 1. If the scan is parametrized, we probably can't do it for lack of knowledge of what they will be. This seems easy; just don't do it in that case. We can put an early kick to foreign scans only for the first shot if we do it outside (before) ExecProc phase. Nestloop - SeqScan - Append - Foreign (Index) Scan - Foreign (Index) Scan .. This plan premises precise (even to some extent) estimate for remote query but async execution within ExecProc phase would be in effect for this case. 2. It's possible that we're down inside some subtree of the plan that won't actually get executed. This is trickier. As for current postgres_fdw, it is done simply abandoning queued result then close the cursor. Consider this: Append - Foreign Scan - Foreign Scan - Foreign Scan repeat 17 more times If we don't start each foreign scan until the first tuple is fetched, we will not get any benefit here, because we won't fetch the first tuple from query #2 until we finish reading the results of query #1. If the result of the Append node will be needed in its entirety, we really, really want to launch of those queries as early as possible. OTOH, if there's a Limit node with a small limit on top of the Append node, that could be quite wasteful. It's the nature of speculative execution, but the Limit will be pushed down onto every Foreign Scans near future. We could decide not to care: after all, if our limit is satisfied, we can just bang the remote connections shut, and if they wasted some CPU, well, tough luck for them. But it would be nice to be smarter. I'm not sure how, though. Appropriate fetch size will cap the harm and the case will be handled as I mentioned above as for postgres_fdw. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Hello, Let me ask an elemental question. If we have ParallelAppend node that kicks a background worker process for each underlying child node in parallel, does ForeignScan need to do something special? Although I don't see the point of the background worker in your story but at least for ParalleMergeAppend, it would frequently discontinues to scan by upper Limit so one more state, say setup - which mans a worker is allocated but not started- would be useful and the driver node might need to manage the number of async execution. Or the driven nodes might do so inversely. As for ForeignScan, it is merely an API for FDW and does nothing substantial so it would have nothing special to do. As for postgres_fdw, current patch restricts one execution per one foreign server at once by itself. We would have to provide another execution management if we want to have two or more simultaneous scans per one foreign server at once. Sorry for the focusless discussion but does this answer some of your question? Expected waste of CPU or I/O is common problem to be solved, however, it does not need to add a special case handling to ForeignScan, I think. How about your opinion? I agree with you that ForeignScan as the wrapper for FDWs don't need anything special for the case. I suppose for now that avoiding the penalty from abandoning too many speculatively executed scans (or other works on bg worker like sorts) would be a business of the upper node of FDWs, or somewhere else. However, I haven't dismissed the possibility that some common works related to resource management could be integrated into executor (or even into planner), but I see none for now. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas hlinn...@iki.fi wrote: At a quick glance, I think this has all the same problems as starting the execution at ExecInit phase. The correct way to do this is to kick off the queries in the first IterateForeignScan() call. You said that ExecProc phase does not fit - why not? What exactly are those problems? I can think of these: 1. If the scan is parametrized, we probably can't do it for lack of knowledge of what they will be. This seems easy; just don't do it in that case. 2. It's possible that we're down inside some subtree of the plan that won't actually get executed. This is trickier. Consider this: Append - Foreign Scan - Foreign Scan - Foreign Scan repeat 17 more times If we don't start each foreign scan until the first tuple is fetched, we will not get any benefit here, because we won't fetch the first tuple from query #2 until we finish reading the results of query #1. If the result of the Append node will be needed in its entirety, we really, really want to launch of those queries as early as possible. OTOH, if there's a Limit node with a small limit on top of the Append node, that could be quite wasteful. We could decide not to care: after all, if our limit is satisfied, we can just bang the remote connections shut, and if they wasted some CPU, well, tough luck for them. But it would be nice to be smarter. I'm not sure how, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Hi, Currently there's no means to observe what it is doing from outside, so the additional sixth patch is to output debug messages about asynchronous execution. The sixth patch did not contain one message shown in the example. Attached is the revised version. Other patches are not changed. -- Kyotaro Horiguchi NTT Open Source Software Center From d1ed9fe6a4e68d42653a552a680a038a0aef5683 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp Date: Fri, 10 Jul 2015 15:02:59 +0900 Subject: [PATCH 6/6] Debug message for async execution of postgres_fdw. --- contrib/postgres_fdw/postgres_fdw.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index dc60bcc..a8a9cc5 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -385,6 +385,25 @@ postgres_fdw_handler(PG_FUNCTION_ARGS) PG_RETURN_POINTER(routine); } +static void +postgresDebugLog(PgFdwScanState *fsstate, char *msg, void* ptr) +{ + ForeignTable *table = GetForeignTable(RelationGetRelid(fsstate-rel)); + ForeignServer *server = GetForeignServer(table-serverid); + + if (fsstate-conn) + ereport(LOG, +(errmsg (pg_fdw: [%s/%s/%p] %s, + get_rel_name(table-relid), server-servername, + fsstate-conn, msg), + errhidestmt(true))); + else + ereport(LOG, +(errmsg (pg_fdw: [%s/%s/--] %s, + get_rel_name(table-relid), server-servername, msg), + errhidestmt(true))); +} + /* * Read boolean server/table options * 0 is false, 1 is true, -1 is not specified @@ -928,7 +947,10 @@ postgresStartForeignScan(ForeignScanState *node) PgFdwScanState *fsstate = (PgFdwScanState *)node-fdw_state; if (!fsstate-allow_async) + { + postgresDebugLog(fsstate, Async start admistratively denied., NULL); return false; + } /* * On the current implemnt, scans can run asynchronously if it is the @@ -943,9 +965,11 @@ postgresStartForeignScan(ForeignScanState *node) * for details */ fetch_more_data(fsstate, START_ONLY); + postgresDebugLog(fsstate, Async exec started., fsstate-conn); return true; } + postgresDebugLog(fsstate, Async exec denied., NULL); return false; } @@ -2162,11 +2186,16 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd) */ if (fsstate != PFCgetAsyncScan(conn)) { +postgresDebugLog(fsstate, + Changed to sync fetch (different scan), + fsstate-conn); fetch_more_data(PFCgetAsyncScan(conn), EXIT_ASYNC); res = PFCexec(conn, sql); } else { +postgresDebugLog(fsstate, + Async fetch, fsstate-conn); /* Get result of running async fetch */ res = PFCgetResult(conn); if (PQntuples(res) == fetch_size) @@ -2196,6 +2225,7 @@ fetch_more_data(PgFdwScanState *fsstate, fetch_mode cmd) } /* Elsewise do synchronous query execution */ + postgresDebugLog(fsstate, Sync fetch., conn); PFCsetAsyncScan(conn, NULL); res = PFCexec(conn, sql); } -- 1.8.3.1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Hello, thank you for looking this. If it is acceptable to reconstruct the executor nodes to have additional return state PREP_RUN or such (which means it needs one more call for the first tuple) , I'll modify the whole executor to handle the state in the next patch to do so. I haven't take the advice I had so far in this sense. But I came to think that it is the most reasonable way to solve this. == - It was a problem when to give the first kick for async exec. It is not in ExecInit phase, and ExecProc phase does not fit, too. An extra phase ExecPreProc or something is too invasive. So I tried pre-exec callback. Any init-node can register callbacks on their turn, then the registerd callbacks are called just before ExecProc phase in executor. The first patch adds functions and structs to enable this. At a quick glance, I think this has all the same problems as starting the execution at ExecInit phase. The correct way to do this is to kick off the queries in the first IterateForeignScan() call. You said that ExecProc phase does not fit - why not? Execution nodes are expected to return the first tuple if available. But asynchronous execution can not return the first tuple immediately. Simultaneous execution for the first tuple on every foreign node is crucial than asynchronous fetching for many cases, especially for the cases like sort/agg pushdown on FDW. The reason why ExecProc does not fit is that the first loop without returning tuple looks impact too large portion in executor. It is my mistake that it doesn't address the problem about parameterized paths. Parameterized paths should be executed within ExecProc loops so this patch would be like following. - To gain the advantage of kicking execution before the first ExecProc loop, non-parameterized paths are started using the callback feature this patch provides. - Parameterized paths need the upper nodes executed before it starts execution so they should be start in ExecProc loop, but runs asynchronously if possible. This is rather a makeshift solution for the problem, but considering current trend of parallelism, it might the time to make the executor to fit parallel execution. If it is acceptable to reconstruct the executor nodes to have additional return state PREP_RUN or such (which means it needs one more call for the first tuple) , I'll modify the whole executor to handle the state in the next patch to do so. I hate my stupidity if you suggested this kind of solution by do it in ExecProc:( regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
On 07/02/2015 08:48 AM, Kyotaro HORIGUCHI wrote: - It was a problem when to give the first kick for async exec. It is not in ExecInit phase, and ExecProc phase does not fit, too. An extra phase ExecPreProc or something is too invasive. So I tried pre-exec callback. Any init-node can register callbacks on their turn, then the registerd callbacks are called just before ExecProc phase in executor. The first patch adds functions and structs to enable this. At a quick glance, I think this has all the same problems as starting the execution at ExecInit phase. The correct way to do this is to kick off the queries in the first IterateForeignScan() call. You said that ExecProc phase does not fit - why not? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Ouch! I mistakenly made two CF entries for this patch. Could someone remove this entry for me? https://commitfest.postgresql.org/5/290/ The correct entry is /5/291/ == Hello. This is the new version of FDW async exection feature. The status of this feature is as follows, as of the last commitfest. - Async execution is valuable to have. - But do the first kick in ExecInit phase is wrong. So the design outline of this version is as following, - The patch set consists of three parts. The fist is the infrastracture in core-side, second is the code to enable asynchronous execution of Postgres-FDW. The third part is the alternative set of three methods to adapt fetch size, which makes asynchronous execution more effective. - It was a problem when to give the first kick for async exec. It is not in ExecInit phase, and ExecProc phase does not fit, too. An extra phase ExecPreProc or something is too invasive. So I tried pre-exec callback. Any init-node can register callbacks on their turn, then the registerd callbacks are called just before ExecProc phase in executor. The first patch adds functions and structs to enable this. - The second part is not changed from the previous version. Add PgFdwConn as a extended PgConn which have some members to support asynchronous execution. The asynchronous execution is kicked only for the first ForeignScan node on the same foreign server. And the state lasts until the next scan comes. This behavior is mainly controlled in fetch_more_data(). The behavior limits the number of simultaneous exection for one foreign server to 1. This behavior is decided from the reason that no reasonable method to limit multiplicity of execution on *single peer* was found so far. - The third part is three kind of trials of adaptive fetch size feature. The first one is duration-based adaptation. The patch increases the fetch size by every FETCH execution but try to keep the duration of every FETCH below 500 ms. But it is not promising because it looks very unstable, or the behavior is nearly unforeseeable.. The second one is based on byte-based FETCH feature. This patch adds to FETCH command an argument to limit the number of bytes (octets) to send. But this might be a over-exposure of the internals. The size is counted based on internal representation of a tuple and the client is needed to send the overhead of its internal tuple representation in bytes. This is effective but quite ugly.. The third is the most simple and straight-forward way, that is, adds a foreign table option to specify the fetch_size. The effect of this is also in doubt since the size of tuples for one foreign table would vary according to the return-columns list. But it is foreseeable for users and is a necessary knob for those who want to tune it. Foreign server also could have the same option as the default for that for foreign tables but this patch have not added it. The attached patches are the following, - 0001-Add-infrastructure-of-pre-execution-callbacks.patch Infrastructure of pre-execution callback - 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch FDW asynchronous execution feature - 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch Adaptive fetch size alternative 1: duration based control - 0003b-POC-Experimental-fetch_by_size-feature.patch Adaptive fetch size alternative 2: FETCH by size - 0003c-Add-foreign-table-option-to-set-fetch-size.patch Adaptive fetch size alternative 3: Foreign table option. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Ouch! I mistakenly made two CF entries for this patch. Could someone remove this entry for me? https://commitfest.postgresql.org/5/290/ The correct entry is /5/291/ Done. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Asynchronous execution on FDW
Thank you. At Thu, 2 Jul 2015 16:02:27 +0900, Michael Paquier michael.paqu...@gmail.com wrote in CAB7nPqTs0YCwXedt1P=JjxFJeoj9UzLzkLuiX8=jdtpyutn...@mail.gmail.com On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: Ouch! I mistakenly made two CF entries for this patch. Could someone remove this entry for me? https://commitfest.postgresql.org/5/290/ The correct entry is /5/291/ Done. -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Asynchronous execution on FDW
Hello. This is the new version of FDW async exection feature. The status of this feature is as follows, as of the last commitfest. - Async execution is valuable to have. - But do the first kick in ExecInit phase is wrong. So the design outline of this version is as following, - The patch set consists of three parts. The fist is the infrastracture in core-side, second is the code to enable asynchronous execution of Postgres-FDW. The third part is the alternative set of three methods to adapt fetch size, which makes asynchronous execution more effective. - It was a problem when to give the first kick for async exec. It is not in ExecInit phase, and ExecProc phase does not fit, too. An extra phase ExecPreProc or something is too invasive. So I tried pre-exec callback. Any init-node can register callbacks on their turn, then the registerd callbacks are called just before ExecProc phase in executor. The first patch adds functions and structs to enable this. - The second part is not changed from the previous version. Add PgFdwConn as a extended PgConn which have some members to support asynchronous execution. The asynchronous execution is kicked only for the first ForeignScan node on the same foreign server. And the state lasts until the next scan comes. This behavior is mainly controlled in fetch_more_data(). The behavior limits the number of simultaneous exection for one foreign server to 1. This behavior is decided from the reason that no reasonable method to limit multiplicity of execution on *single peer* was found so far. - The third part is three kind of trials of adaptive fetch size feature. The first one is duration-based adaptation. The patch increases the fetch size by every FETCH execution but try to keep the duration of every FETCH below 500 ms. But it is not promising because it looks very unstable, or the behavior is nearly unforeseeable.. The second one is based on byte-based FETCH feature. This patch adds to FETCH command an argument to limit the number of bytes (octets) to send. But this might be a over-exposure of the internals. The size is counted based on internal representation of a tuple and the client is needed to send the overhead of its internal tuple representation in bytes. This is effective but quite ugly.. The third is the most simple and straight-forward way, that is, adds a foreign table option to specify the fetch_size. The effect of this is also in doubt since the size of tuples for one foreign table would vary according to the return-columns list. But it is foreseeable for users and is a necessary knob for those who want to tune it. Foreign server also could have the same option as the default for that for foreign tables but this patch have not added it. The attached patches are the following, - 0001-Add-infrastructure-of-pre-execution-callbacks.patch Infrastructure of pre-execution callback - 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch FDW asynchronous execution feature - 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch Adaptive fetch size alternative 1: duration based control - 0003b-POC-Experimental-fetch_by_size-feature.patch Adaptive fetch size alternative 2: FETCH by size - 0003c-Add-foreign-table-option-to-set-fetch-size.patch Adaptive fetch size alternative 3: Foreign table option. regards, From eb621897d1410079c6458bc4d1914d1345eb77bc Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp Date: Fri, 26 Jun 2015 15:12:16 +0900 Subject: [PATCH 1/3] Add infrastructure of pre-execution callbacks. Some exec nodes have some work before plan tree execution. This infrastructure provides such functionality --- src/backend/executor/execMain.c | 32 src/backend/executor/execUtils.c | 2 ++ src/include/nodes/execnodes.h| 22 ++ 3 files changed, 56 insertions(+) diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index a1561ce..51a86b2 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -764,6 +764,35 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt) PreventCommandIfParallelMode(CreateCommandTag((Node *) plannedstmt)); } +/* + * Register callbacks to be called just before plan execution. + */ +void +RegisterPreExecCallback(PreExecCallback callback, EState *es, Node *nd, + void *arg) +{ + PreExecCallbackItem *item; + + item = (PreExecCallbackItem*) + MemoryContextAlloc(es-es_query_cxt, sizeof(PreExecCallbackItem)); + item-callback = callback; + item-node = nd; + item-arg = arg; + + /* add the new node at the end of the chain */ + item-next = es-es_preExecCallbacks; + es-es_preExecCallbacks = item; +} + +/* Execute registered pre-exec callbacks */ +void +RunPreExecCallbacks(EState *es) +{ + PreExecCallbackItem *item; + + for (item = es-es_preExecCallbacks ; item