Re: [HACKERS] parallelizing subplan execution

2010-02-21 Thread Dimitri Fontaine
Robert Haas robertmh...@gmail.com writes:
 Probably.  For one thing, you can't use fork(), because it won't work
 on Windows.
[...]
 query.  IOW, we're going to need, well, a connection pool in core.
 *ducks, runs for cover*

Well, in fact, you're slowly getting to the interesting^W crazy part of
it.

Now that you have a connection pool in core and a way to share the same
snapshot in more than one backend, wouldn't you like any HotStandby
slave to be able to share this snapshot too? And run the subplan there?

And while at it, you'd obviously (ahem) want the slave to run the pooler
too and have the possibility to ask its master if it still has a given
snapshot available. So that any transaction (session?) that turns out
not to be read-only can get transparently run on the master instead. So
the "snapshot too old" error gets a few more reasons to exist.

Oh, of course, the next step after that is to have a single cluster be
both a slave and a master, so that we can talk about distributing the
data. Multi-node multi-TB (make it PB) is the future, didn't they say?

We now have nodes with only some of the data (that could be only some
partitions) and a way to give them subplans over the network, and a way
for them to run a write query on other hosts without telling the client
connection. Sounds fun, eh?

Regards,
-- 
dim

And I don't do drugs, not even caffeine. :)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parallelizing subplan execution (was: explain and PARAM_EXEC)

2010-02-21 Thread Greg Stark
On Sun, Feb 21, 2010 at 3:25 AM, Robert Haas robertmh...@gmail.com wrote:
 What kinds of things would be
 sensible to hand off in this way?  Well, you'd want to find nodes that
 are not likely to be repeatedly re-executed with different parameters,
 like subplans or inner-indexscans, because otherwise you'll get
 pipeline stalls handing the new parameters back and forth.  And you
 want to find nodes that are expensive for the same reason.

I think the case you want to handle is when you could execute a node
asynchronously. That is, when the rest of the plan can proceed without
the results until they are ready.

The case that Oracle handled first and best was UNION ALL where each
child can be run in separate processes.
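
To make the UNION ALL case concrete, here is a minimal sketch in Python
rather than executor code. Everything in it is an assumption for
illustration: the two scan functions are hypothetical child plans, and
threads stand in for the separate processes. It only shows the shape of
the idea: start every child, then append the results as they arrive.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_orders_2009():
    # Hypothetical child plan: one branch of the UNION ALL.
    return [("2009", 1), ("2009", 2)]


def scan_orders_2010():
    # Hypothetical child plan: the other branch.
    return [("2010", 3)]


def union_all(children):
    """Start every child at once, then append their rows in child order."""
    with ThreadPoolExecutor(max_workers=len(children)) as pool:
        futures = [pool.submit(child) for child in children]
        rows = []
        for future in futures:
            # Blocks only until this particular child has finished.
            rows.extend(future.result())
    return rows


result = union_all([scan_orders_2009, scan_orders_2010])
```

UNION ALL is a friendly case because the children share no parameters
and the parent does nothing except concatenate, so nothing has to be
handed back and forth while the children run.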



-- 
greg



Re: [HACKERS] parallelizing subplan execution

2010-02-20 Thread Mark Kirkwood

Robert Haas wrote:



It seems to me that you need to start by thinking about what kinds of
queries could be usefully parallelized.  What I think you're proposing
here, modulo large amounts of hand-waving, is that we should basically
find a branch of the query tree, cut it off, and make that branch the
responsibility of a subprocess.  What kinds of things would be
sensible to hand off in this way?  Well, you'd want to find nodes that
are not likely to be repeatedly re-executed with different parameters,
like subplans or inner-indexscans, because otherwise you'll get
pipeline stalls handing the new parameters back and forth.  And you
want to find nodes that are expensive for the same reason.  So maybe
this would work for something like a merge join on top of two sorts -
one backend could perform each sort, and then whichever one was the
child would stream the tuples to the parent for the final merge.  Of
course, this assumes the I/O subsystem can keep up, which is not a
given - if both tables are fed by the same, single spindle, it might
be worse than if you just did the sorts consecutively.
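
As a rough illustration of the merge-join-over-two-sorts case, here is a
sketch in Python, not PostgreSQL code. The function names and the sample
rows are made up, threads stand in for the two backends, and the I/O
question raised above is ignored entirely.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_join(left, right, key=lambda row: row[0]):
    """Join two inputs that are already sorted on key (duplicates allowed)."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side...
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # ...and pair it with every left row carrying the same key.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r[1:])
                i += 1
            j = j_end
    return out


def parallel_sort_merge_join(left, right):
    # Each sort is handed to its own worker; the parent does the final merge.
    with ThreadPoolExecutor(max_workers=2) as pool:
        left_sorted = pool.submit(sorted, left)
        right_sorted = pool.submit(sorted, right)
        return merge_join(left_sorted.result(), right_sorted.result())


joined = parallel_sort_merge_join(
    [(3, "c"), (1, "a"), (2, "b")],
    [(2, "x"), (3, "y"), (3, "z"), (4, "w")],
)
```

The two sorts are independent of each other and neither is re-executed
with new parameters, which is what makes them reasonable branches to cut
off and hand to another worker.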

This approach might also benefit queries that are very CPU-intensive,
on a multi-core system with spare cycles.  Suppose you have a big tall
stack of hash joins, each with a small inner rel.  The child process
does about half the joins and then pipelines the results into the
parent, which does the other half and returns the results.

But there's at least one other totally different way of thinking about
this problem, which is that you might want two processes to cooperate
in executing the SAME query node - imagine, for example, a big
sequential scan with an expensive but highly selective filter
condition, or an enormous sort.  You have all the same problems of
figuring out when it's actually going to help, of course, but the
details will likely be quite different.

I'm not really sure which one of these would be more useful in
practice - or maybe there are even other strategies.  What does
$COMPETITOR do?

I'm also ignoring the difficulties of getting hold of a second backend
in the right state - same database, same snapshot, etc.  It seems to
me unlikely that there are a substantial number of real-world
applications for which this will work very well if we have to
actually start a new backend every time we want to parallelize a
query.  IOW, we're going to need, well, a connection pool in core.
*ducks, runs for cover*

  


One thing that might work quite well is slicing the work up by partition
(properly implemented partitioning would go along with this nicely too...)
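
A minimal sketch of slicing by partition, in Python rather than anything
resembling the executor. The partition names and contents are invented,
and threads stand in for backends. Each worker computes a partial
aggregate over one partition and the parent combines the partials.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a single table, one per month.
partitions = {
    "2010_01": [10, 20, 30],
    "2010_02": [5, 15],
    "2010_03": [],
}


def partial_sum_count(rows):
    # Per-partition partial aggregate: enough state to finish an AVG later.
    return sum(rows), len(rows)


def parallel_avg(parts):
    """Aggregate each partition in its own worker, then combine the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum_count, parts.values()))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


avg = parallel_avg(partitions)
```

The partitions never overlap, so the workers need no coordination beyond
a shared snapshot, and only a small partial result per partition travels
back to the parent.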


regards

Mark

