Re: parallelizing subplan execution (was: [HACKERS] explain and PARAM_EXEC)

2010-06-30 Thread Mark Wong
On Sat, Jun 26, 2010 at 6:01 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 25, 2010 at 10:47 PM, Mark Wong mark...@gmail.com wrote:
 http://pages.cs.wisc.edu/~dewitt/includes/publications.html

 Some of these papers aren't about the type of parallelism we're
 talking about here, but the ones that I think are relevant talk mostly
 about parallelizing hash-based joins.  I think we might be lacking an
 operator or two, though, in order to do some of these things.

 This part (from the first paper linked on that page) is not terribly
 encouraging.

 Current database query optimizers do not consider all possible plans
 when optimizing a relational query. While cost models for relational
 queries running on a single processor are now well-understood
 [SELI79], they still depend on cost estimators that are a guess at
 best. Some dynamically select from among several plans at run time
 depending on, for example, the amount of physical memory actually
 available and the cardinalities of the intermediate results [GRAE89].
 To date, no query optimizers consider all the parallel algorithms for
 each operator and all the query tree organizations. More work is
 needed in this area.

 The section (from that same paper) on parallelizing hash joins and
 merge-join-over-sort is interesting, and I can definitely imagine
 those techniques being a win for us.  But I'm not too sure how we'd
 know when to apply them - that is, what algorithm would the query
 optimizer use?  I'm sure we could come up with something, but I'd get
 a warmer, fuzzier feeling if we could implement the fruits of someone
 else's research rather than rolling our own.

I found another starting point for more papers here:

http://infolab.stanford.edu/joker/joqrs.html

The links on this page don't work anymore, but many of the papers are
easily found by searching for their titles.  I've only gone through
some of the abstracts so far, but it seems to me that they discuss
query optimization techniques for parallel systems.

Regards,
Mark



Re: parallelizing subplan execution (was: [HACKERS] explain and PARAM_EXEC)

2010-06-27 Thread Simon Riggs
On Sat, 2010-06-26 at 21:01 -0400, Robert Haas wrote:

 The section (from that same paper) on parallelizing hash joins and
 merge-join-over-sort is interesting, and I can definitely imagine
 those techniques being a win for us.  But I'm not too sure how we'd
 know when to apply them - that is, what algorithm would the query
 optimizer use?  I'm sure we could come up with something, but I'd get
 a warmer, fuzzier feeling if we could implement the fruits of someone
 else's research rather than rolling our own.

You've just touched on why parallel query is hard. There is a big bucket
of executor code to write and then lots of very subtle thinking,
heuristics and usability parameters to make parallel query sensibly
optimised. You need both to make it actually work in practice (without
hints).

Parallel sub-plan execution is not a good case to start with because it
presumes only certain kinds of plans are in place. It wouldn't be
usable for the majority of plans.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services




Re: parallelizing subplan execution (was: [HACKERS] explain and PARAM_EXEC)

2010-06-26 Thread Robert Haas
On Fri, Jun 25, 2010 at 10:47 PM, Mark Wong mark...@gmail.com wrote:
 http://pages.cs.wisc.edu/~dewitt/includes/publications.html

 Some of these papers aren't about the type of parallelism we're
 talking about here, but the ones that I think are relevant talk mostly
 about parallelizing hash-based joins.  I think we might be lacking an
 operator or two, though, in order to do some of these things.

This part (from the first paper linked on that page) is not terribly
encouraging.

Current database query optimizers do not consider all possible plans
when optimizing a relational query. While cost models for relational
queries running on a single processor are now well-understood
[SELI79], they still depend on cost estimators that are a guess at
best. Some dynamically select from among several plans at run time
depending on, for example, the amount of physical memory actually
available and the cardinalities of the intermediate results [GRAE89].
To date, no query optimizers consider all the parallel algorithms for
each operator and all the query tree organizations. More work is
needed in this area.

The section (from that same paper) on parallelizing hash joins and
merge-join-over-sort is interesting, and I can definitely imagine
those techniques being a win for us.  But I'm not too sure how we'd
know when to apply them - that is, what algorithm would the query
optimizer use?  I'm sure we could come up with something, but I'd get
a warmer, fuzzier feeling if we could implement the fruits of someone
else's research rather than rolling our own.

 I'm also ignoring the difficulties of getting hold of a second backend
 in the right state - same database, same snapshot, etc.  It seems
 unlikely to me that there are a substantial number of real-world
 applications for which this will work very well if we have to
 actually start a new backend every time we want to parallelize a
 query.  IOW, we're going to need, well, a connection pool in core.
 *ducks, runs for cover*

 Do we think it's worth proving that we can execute a plan in
 parallel?  Something simple, even if not the best case, say a nested
 loop join between two tables?  Just as a starting point, before
 worrying too much about what the best thing to parallelize is, or how
 the degree of parallelism will be controlled?

Well, we can certainly DO it, I guess.  It's just a question of
whether we can make it fairly automatic and capable of delivering good
results in the real world.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



Re: parallelizing subplan execution (was: [HACKERS] explain and PARAM_EXEC)

2010-06-25 Thread Mark Wong
Hi all,

Sorry for jumping in over 4 months later...

On Sat, Feb 20, 2010 at 8:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Feb 20, 2010 at 8:31 AM, Dimitri Fontaine
 dfonta...@hi-media.com wrote:
 This is really a topic for another thread, but at 100,000 feet it
 seems to me that the hardest question is - how will you decide which
 operations to parallelize in the first place?  Actually making it
 happen is really hard, too, of course, but even to get to that point
 you have to have some model for what types of operations it makes
 sense to parallelize and how you're going to decide when it's a win.

 My naive thought would be to add some cost parameters: first, the cost
 of fork()ing another backend, then, for each supported subplan (we
 will want to add more, or maybe have a special rendez-vous-materialise
 node), some model of the data exchange cost.

 The planner would then, as usual, try to find the least costly plan,
 and would be able to compare plans with and without distributing the
 work.

 Overly naive?

 Probably.  For one thing, you can't use fork(), because it won't work
 on Windows.

 It seems to me that you need to start by thinking about what kinds of
 queries could be usefully parallelized.  What I think you're proposing
 here, modulo large amounts of hand-waving, is that we should basically
 find a branch of the query tree, cut it off, and make that branch the
 responsibility of a subprocess.  What kinds of things would be
 sensible to hand off in this way?  Well, you'd want to find nodes that
 are not likely to be repeatedly re-executed with different parameters,
 like subplans or inner-indexscans, because otherwise you'll get
 pipeline stalls handing the new parameters back and forth.  And you
 want to find nodes that are expensive for the same reason.  So maybe
 this would work for something like a merge join on top of two sorts -
 one backend could perform each sort, and then whichever one was the
 child would stream the tuples to the parent for the final merge.  Of
 course, this assumes the I/O subsystem can keep up, which is not a
 given - if both tables are fed by the same, single spindle, it might
 be worse than if you just did the sorts consecutively.

 This approach might also benefit queries that are very CPU-intensive,
 on a multi-core system with spare cycles.  Suppose you have a big tall
 stack of hash joins, each with a small inner rel.  The child process
 does about half the joins and then pipelines the results into the
 parent, which does the other half and returns the results.

 But there's at least one other totally different way of thinking about
 this problem, which is that you might want two processes to cooperate
 in executing the SAME query node - imagine, for example, a big
 sequential scan with an expensive but highly selective filter
 condition, or an enormous sort.  You have all the same problems of
 figuring out when it's actually going to help, of course, but the
 details will likely be quite different.

 I'm not really sure which one of these would be more useful in
 practice - or maybe there are even other strategies.  What does
 $COMPETITOR do?

I feel that the answer is: it depends.  To partially answer what others
are doing, I'll present, as a starting point, some papers from someone
we might recognize. :)

http://pages.cs.wisc.edu/~dewitt/includes/publications.html

Some of these papers aren't about the type of parallelism we're talking
about here, but the ones that I think are relevant talk mostly about
parallelizing hash-based joins.  I think we might be lacking an
operator or two, though, in order to do some of these things.

 I'm also ignoring the difficulties of getting hold of a second backend
 in the right state - same database, same snapshot, etc.  It seems
 unlikely to me that there are a substantial number of real-world
 applications for which this will work very well if we have to
 actually start a new backend every time we want to parallelize a
 query.  IOW, we're going to need, well, a connection pool in core.
 *ducks, runs for cover*

Do we think it's worth proving that we can execute a plan in
parallel?  Something simple, even if not the best case, say a nested
loop join between two tables?  Just as a starting point, before
worrying too much about what the best thing to parallelize is, or how
the degree of parallelism will be controlled?
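
For concreteness, here's a rough sketch of what the two-backend version
of that nested loop could compute.  All of the types and the emit()
helper are made up for illustration; this is nothing like real executor
code:

#include <stdio.h>

typedef struct { int key; int payload; } Tuple;

static void
emit(Tuple outer, Tuple inner)
{
    printf("%d: %d/%d\n", outer.key, outer.payload, inner.payload);
}

/*
 * Each backend joins its own slice of the outer relation against the
 * whole inner relation; nested-loop output needs no merge step, so
 * the two result streams can simply be concatenated.
 */
static void
worker_join(Tuple *outer, int lo, int hi, Tuple *inner, int ninner)
{
    for (int i = lo; i < hi; i++)
        for (int j = 0; j < ninner; j++)
            if (outer[i].key == inner[j].key)
                emit(outer[i], inner[j]);
}

int
main(void)
{
    Tuple   outer[4] = {{1, 10}, {2, 20}, {3, 30}, {4, 40}};
    Tuple   inner[2] = {{2, 200}, {4, 400}};

    worker_join(outer, 0, 2, inner, 2);     /* backend 1's half */
    worker_join(outer, 2, 4, inner, 2);     /* backend 2's half */
    return 0;
}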

Regards,
Mark



Re: parallelizing subplan execution (was: [HACKERS] explain and PARAM_EXEC)

2010-02-20 Thread Robert Haas
On Sat, Feb 20, 2010 at 8:31 AM, Dimitri Fontaine
dfonta...@hi-media.com wrote:
 This is really a topic for another thread, but at 100,000 feet it
 seems to me that the hardest question is - how will you decide which
 operations to parallelize in the first place?  Actually making it
 happen is really hard, too, of course, but even to get to that point
 you have to have some model for what types of operations it makes
 sense to parallelize and how you're going to decide when it's a win.

 My naive thought would be to add some cost parameters: first, the cost
 of fork()ing another backend, then, for each supported subplan (we
 will want to add more, or maybe have a special rendez-vous-materialise
 node), some model of the data exchange cost.

 The planner would then, as usual, try to find the least costly plan,
 and would be able to compare plans with and without distributing the
 work.

 Overly naive?

Probably.  For one thing, you can't use fork(), because it won't work
on Windows.
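
That said, the shape of the cost comparison being proposed above seems
roughly like this.  A minimal sketch: every name and constant here is
hypothetical, and none of it is actual planner code:

#include <stdio.h>

typedef struct
{
    double  worker_startup;     /* cost to launch a worker backend */
    double  transfer_per_tuple; /* cost to ship one tuple across */
} ParallelCostParams;

/* two independent subplans, executed one after the other */
static double
serial_cost(double cost_a, double cost_b)
{
    return cost_a + cost_b;
}

/*
 * The same two subplans executed concurrently: total time is dominated
 * by the slower one, plus the overhead of starting the worker and
 * shipping its output back to the leader.
 */
static double
parallel_cost(double cost_a, double cost_b, double ntuples_shipped,
              ParallelCostParams *p)
{
    double  slower = (cost_a > cost_b) ? cost_a : cost_b;

    return p->worker_startup + slower
        + p->transfer_per_tuple * ntuples_shipped;
}

int
main(void)
{
    ParallelCostParams p = {50.0, 0.01};

    /* two sorts costing 1000 apiece; the child ships 10000 tuples */
    printf("serial: %.0f  parallel: %.0f\n",
           serial_cost(1000, 1000),
           parallel_cost(1000, 1000, 10000, &p));
    return 0;
}

The planner would then simply keep whichever plan has the lower
estimate, exactly as it does today for any other pair of paths.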

It seems to me that you need to start by thinking about what kinds of
queries could be usefully parallelized.  What I think you're proposing
here, modulo large amounts of hand-waving, is that we should basically
find a branch of the query tree, cut it off, and make that branch the
responsibility of a subprocess.  What kinds of things would be
sensible to hand off in this way?  Well, you'd want to find nodes that
are not likely to be repeatedly re-executed with different parameters,
like subplans or inner-indexscans, because otherwise you'll get
pipeline stalls handing the new parameters back and forth.  And you
want to find nodes that are expensive for the same reason.  So maybe
this would work for something like a merge join on top of two sorts -
one backend could perform each sort, and then whichever one was the
child would stream the tuples to the parent for the final merge.  Of
course, this assumes the I/O subsystem can keep up, which is not a
given - if both tables are fed by the same, single spindle, it might
be worse than if you just did the sorts consecutively.
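
To put rough numbers on that (arbitrary cost units, purely
illustrative): if each sort costs 1000, starting the second backend
costs 50, and shipping the child's 10,000 sorted tuples costs 0.01
apiece, then

    serial:   1000 + 1000                          = 2000
    parallel: max(1000, 1000) + 50 + 0.01 * 10000  = 1150

so the split wins whenever the startup and transfer overhead stays
well under the cost of the sort it hides - and loses when the shared
spindle makes the two sorts slower together than apart.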

This approach might also benefit queries that are very CPU-intensive,
on a multi-core system with spare cycles.  Suppose you have a big tall
stack of hash joins, each with a small inner rel.  The child process
does about half the joins and then pipelines the results into the
parent, which does the other half and returns the results.
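
A toy illustration of that pipelined split, assuming POSIX fork() and
pipe() - so it has exactly the portability problem already noted, and
it is of course nothing like the real executor:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

typedef struct { int key; int payload; } Tuple;

/* stand-ins for the lower and upper halves of the join stack */
static Tuple lower_half(int i)   { Tuple t = { i, i * 2 }; return t; }
static Tuple upper_half(Tuple t) { t.payload += 1; return t; }

int
main(void)
{
    int     fd[2];
    pid_t   pid;
    Tuple   t;

    if (pipe(fd) != 0)
        return 1;

    pid = fork();
    if (pid == 0)
    {
        /* child: run the lower joins, stream each tuple to the parent */
        close(fd[0]);
        for (int i = 0; i < 10; i++)
        {
            t = lower_half(i);
            write(fd[1], &t, sizeof(t));
        }
        close(fd[1]);
        return 0;
    }

    /* parent: run the upper joins over the child's incoming stream */
    close(fd[1]);
    while (read(fd[0], &t, sizeof(t)) == (ssize_t) sizeof(t))
    {
        t = upper_half(t);
        printf("key=%d payload=%d\n", t.key, t.payload);
    }
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return 0;
}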

But there's at least one other totally different way of thinking about
this problem, which is that you might want two processes to cooperate
in executing the SAME query node - imagine, for example, a big
sequential scan with an expensive but highly selective filter
condition, or an enormous sort.  You have all the same problems of
figuring out when it's actually going to help, of course, but the
details will likely be quite different.
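
For that cooperating-workers case, the obvious first cut is block-range
partitioning: hand each of N workers a contiguous chunk of the table
and let each apply the expensive qual to its own chunk.  A sketch, with
every name invented for illustration:

#include <stdio.h>

#define NWORKERS 4

/*
 * Fetch each block in the range, apply the expensive qual, emit the
 * survivors; here it just reports its assignment.
 */
static void
scan_blocks(int worker, long first, long nblocks)
{
    printf("worker %d: blocks %ld..%ld\n",
           worker, first, first + nblocks - 1);
}

static void
parallel_seqscan(long total_blocks)
{
    long    per_worker = (total_blocks + NWORKERS - 1) / NWORKERS;

    for (int w = 0; w < NWORKERS; w++)
    {
        long    first = (long) w * per_worker;

        if (first >= total_blocks)
            break;
        /* in a real version, each call would run in its own backend */
        scan_blocks(w, first,
                    (total_blocks - first < per_worker)
                    ? total_blocks - first : per_worker);
    }
}

int
main(void)
{
    parallel_seqscan(1000);     /* e.g. a 1000-block table */
    return 0;
}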

I'm not really sure which one of these would be more useful in
practice - or maybe there are even other strategies.  What does
$COMPETITOR do?

I'm also ignoring the difficulties of getting hold of a second backend
in the right state - same database, same snapshot, etc.  It seems
unlikely to me that there are a substantial number of real-world
applications for which this will work very well if we have to
actually start a new backend every time we want to parallelize a
query.  IOW, we're going to need, well, a connection pool in core.
*ducks, runs for cover*

...Robert
