On Mon, May 9, 2016 at 9:38 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> Is the parallel-aware Append node sufficient to run multiple nodes
> asynchronously?  (Sorry, I haven't had enough time to code the feature,
> even though we discussed it before.)

It's tempting to think that parallel query and asynchronous query are
the same thing, but I think that they are actually quite different.
Parallel query involves using multiple processes to service a query.
Asynchronous query involves using each individual process as
efficiently as possible by not having it block any more than
necessary.  You can want these things together or separately.  For
example, consider this query plan:

Append
  -> ForeignScan
  -> ForeignScan

Here, you do not want parallel query; the queries must both be
launched by the user backend, not some worker process, else you will
not get the right transaction semantics.  The parallel-aware Append
node we talked about before does not help here.  On the other hand,
consider this:

Append
  -> Seq Scan
       Filter: lots_of_cpu()
  -> Seq Scan
       Filter: lots_of_cpu()

Here, asynchronous query is of no help, but parallel query is great.
Now consider this hypothetical plan:

Gather
  -> Hash Join
       -> Parallel Bitmap Heap Scan
            -> Bitmap Index Scan
       -> Parallel Hash
            -> Parallel Seq Scan

Let's assume that the bitmap *heap* scan can be performed in parallel
but the bitmap *index* scan can't.  That's pretty reasonable for a
first cut, actually, because the index accesses are probably touching
much less data, and we're doing little CPU work for each index page -
any delay here is likely to be I/O.

So in that world what you want is for the first worker to begin
performing the bitmap index scan and building a shared TIDBitmap that
the workers can use to scan the heap.  The other workers,
meanwhile, could begin building the shared hash table (which is what I
intend to denote by saying that it's a *Parallel* Hash).  If the
process building the bitmap finishes before the hash table is built,
it can help build the hash table as well.  Once both operations are
done, each process can scan a subset of the rows from the bitmap heap
scan and do the hash table probes for just those rows.  To make all of
this work, you need both *parallel* query, so that you have workers,
and also *asynchronous* query, so that workers which see that the
bitmap index scan is in progress don't get stuck waiting for it but
can look around for other work.

> In the above example, the scan on the foreign table has a longer lead
> time than the local scan.  If Append can map each child node onto an
> individual worker, the local scan's worker begins returning tuples
> first, and mixed tuples are returned eventually.

This is getting at an issue I'm somewhat worried about, which is
scheduling.  In my prototype, we only check for events if there are no
nodes ready for the CPU now, but sometimes that might be a loser.  One
probably needs to check for events periodically even when there are
still nodes waiting for the CPU, but "how often?" seems like an
unanswerable question.

> However, process-internal asynchronous execution may also be beneficial
> when the cost of shm_mq is not negligible (e.g., no scan qualifiers
> are given to the worker process).  I think it allows pre-fetching to be
> implemented very naturally.

Yes.

>> My proposal for how to do this is to make ExecProcNode function as a
>> backward-compatibility wrapper.  For asynchronous execution, a node
>> might return a not-ready-yet indication, but if that node is called
>> via ExecProcNode, it means the caller isn't prepared to receive such
>> an indication, so ExecProcNode will just wait for the node to become
>> ready and then return the tuple.
>>
> Backward compatibility is good.  In addition, a child node may want to
> know the context in which it is called, and switch its behavior
> according to the caller's expectation.  For example, it may be
> beneficial if SeqScan does more aggressive prefetching during
> asynchronous execution.

Maybe, but I'm a bit doubtful.  I'm not seeing a lot of advantage in
that sort of thing, and it will make the code a lot more complicated.

> Also, can we consider which data format will be returned from the child
> node during the planning stage?  It affects the cost of inter-node
> data exchange.  If a pair of parent and child nodes supports a
> special data format (as the existing HashJoin and Hash do), that
> should be a discount factor in the cost estimation.

I'm not sure.  The costing aspects of this need a lot more thought
than I have given them so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

