On Mon, May 9, 2016 at 9:38 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> Is the parallel-aware Append node sufficient to run multiple nodes
> asynchronously?  (Sorry, I couldn't find enough time to code the
> feature even though we discussed it before.)
It's tempting to think that parallel query and asynchronous query are
the same thing, but I think that they are actually quite different.
Parallel query involves using multiple processes to service a query.
Asynchronous query involves using each individual process as
efficiently as possible by not having it block any more than
necessary.  You can want these things together or separately.  For
example, consider this query plan:

Append
-> Foreign Scan
-> Foreign Scan

Here, you do not want parallel query; the queries must both be
launched by the user backend, not some worker process, else you will
not get the right transaction semantics.  The parallel-aware Append
node we talked about before does not help here.  On the other hand,
consider this:

Append
-> Seq Scan
     Filter: lots_of_cpu()
-> Seq Scan
     Filter: lots_of_cpu()

Here, asynchronous query is of no help, but parallel query is great.
Now consider this hypothetical plan:

Gather
-> Hash Join
   -> Parallel Bitmap Heap Scan
      -> Bitmap Index Scan
   -> Parallel Hash
      -> Parallel Seq Scan

Let's assume that the bitmap *heap* scan can be performed in parallel
but the bitmap *index* scan can't.  That's pretty reasonable for a
first cut, actually, because the index accesses are probably touching
much less data, and we're doing little CPU work for each index page;
any delay here is likely to be I/O.  So in that world what you want
is for the first worker to begin performing the bitmap index scan and
building a shared TIDBitmap that the workers can then use to scan the
heap.  The other workers, meanwhile, could begin building the shared
hash table (which is what I intend to denote by saying that it's a
*Parallel* Hash).  If the process building the bitmap finishes before
the hash table is built, it can help build the hash table as well.
Once both operations are done, each process can scan a subset of the
rows from the bitmap heap scan and do the hash table probes for just
those rows.  To make all of this work, you need both *parallel*
query, so that you have workers, and also *asynchronous* query, so
that workers which see that the bitmap index scan is still in
progress don't get stuck waiting for it but can look around for other
work.

> In the above example, a scan on a foreign table has a longer lead
> time than a local scan.  If Append can map every child node onto an
> individual worker, the local-scan worker begins to return tuples
> first, and then mixed tuples will be returned eventually.

This is getting at an issue I'm somewhat worried about, which is
scheduling.  In my prototype, we only check for events if there are
no nodes ready for the CPU right now, but sometimes that might be a
loser.  One probably needs to check for events periodically even when
there are still nodes waiting for the CPU, but "how often?" seems
like an unanswerable question.

> However, process-internal asynchronous execution may also be
> beneficial in cases where the cost of shm_mq is not negligible
> (e.g., when no scan qualifiers are given to the worker process).  I
> think it allows prefetching to be implemented very naturally.

Yes.

>> My proposal for how to do this is to make ExecProcNode function as
>> a backward-compatibility wrapper.  For asynchronous execution, a
>> node might return a not-ready-yet indication, but if that node is
>> called via ExecProcNode, it means the caller isn't prepared to
>> receive such an indication, so ExecProcNode will just wait for the
>> node to become ready and then return the tuple.
>
> Backward compatibility is good.  In addition, the child node may
> want to know the context in which it is called.  It may want to
> switch its behavior according to the caller's expectations.  For
> example, it may be beneficial if SeqScan does more aggressive
> prefetching during asynchronous execution.

Maybe, but I'm a bit doubtful.  I'm not seeing a lot of advantage in
that sort of thing, and it will make the code a lot more complicated.
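Coming back to the ExecProcNode wrapper itself, here is roughly the
shape I have in mind.  This is only a sketch: ExecAsyncProcNode(),
ExecAsyncWait(), and TupIsNotReady() are hypothetical names used for
illustration, not functions that exist anywhere today.

    TupleTableSlot *
    ExecProcNode(PlanState *node)
    {
        TupleTableSlot *result;

        for (;;)
        {
            /* Hypothetical non-blocking entry point. */
            result = ExecAsyncProcNode(node);

            /* Anything other than "not ready yet" is a final answer. */
            if (!TupIsNotReady(result))
                break;

            /*
             * The caller came in through ExecProcNode, so it isn't
             * prepared for a not-ready indication; block until the
             * node can make progress, then retry.
             */
            ExecAsyncWait(node);
        }

        return result;
    }

A caller that does understand the not-ready indication, such as an
asynchronous Append, would call ExecAsyncProcNode() directly and,
rather than blocking, go look for another child that can make
progress.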
> Also, can we consider, at the planning stage, which data format
> will be returned from the child node?  It affects the cost of
> inter-node data exchange.  If a pair of parent and child nodes
> supports a special data format (as the existing HashJoin and Hash
> do), that should be a discount factor in the cost estimation.

I'm not sure.  The costing aspects of this need a lot more thought
than I have given them so far.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company