Hi all,

Sorry for jumping in over 4 months later...
On Sat, Feb 20, 2010 at 8:25 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Feb 20, 2010 at 8:31 AM, Dimitri Fontaine <dfonta...@hi-media.com> wrote:
>>> This is really a topic for another thread, but at 100,000 feet it seems to me that the hardest question is - how will you decide which operations to parallelize in the first place? Actually making it happen is really hard, too, of course, but even to get to that point you have to have some model for what types of operations it makes sense to parallelize and how you're going to decide when it's a win.
>>
>> My naive thoughts would be to add some cost parameters: first the cost of fork()ing another backend, then, for each supported subplan (we will want to add more, or maybe have a special rendez-vous-materialise node), some idea of the data exchange cost.
>>
>> Now the planner would as usual try to find the least costly plan, and will be able to compare plans with and without distributing the work.
>>
>> Overly naive?
>
> Probably. For one thing, you can't use fork(), because it won't work on Windows.
>
> It seems to me that you need to start by thinking about what kinds of queries could be usefully parallelized. What I think you're proposing here, modulo large amounts of hand-waving, is that we should basically find a branch of the query tree, cut it off, and make that branch the responsibility of a subprocess. What kinds of things would be sensible to hand off in this way? Well, you'd want to find nodes that are not likely to be repeatedly re-executed with different parameters, like subplans or inner-indexscans, because otherwise you'll get pipeline stalls handing the new parameters back and forth. And you want to find nodes that are expensive for the same reason. So maybe this would work for something like a merge join on top of two sorts - one backend could perform each sort, and then whichever one was the child would stream the tuples to the parent for the final merge. Of course, this assumes the I/O subsystem can keep up, which is not a given - if both tables are fed by the same, single spindle, it might be worse than if you just did the sorts consecutively.
>
> This approach might also benefit queries that are very CPU-intensive, on a multi-core system with spare cycles. Suppose you have a big tall stack of hash joins, each with a small inner rel. The child process does about half the joins and then pipelines the results into the parent, which does the other half and returns the results.
>
> But there's at least one other totally different way of thinking about this problem, which is that you might want two processes to cooperate in executing the SAME query node - imagine, for example, a big sequential scan with an expensive but highly selective filter condition, or an enormous sort. You have all the same problems of figuring out when it's actually going to help, of course, but the details will likely be quite different.
>
> I'm not really sure which one of these would be more useful in practice - or maybe there are even other strategies. What does $COMPETITOR do?

I feel that the answer is "it depends". To partially answer what others are doing, I'll present some papers from someone we might recognize as a starting point. :)

http://pages.cs.wisc.edu/~dewitt/includes/publications.html

Some of these papers aren't about the type of parallelism we're talking about here, but the ones I think are relevant talk mostly about parallelizing hash-based joins. I think we might be lacking an operator or two, though, to do some of these things.
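To make the data-exchange part of the merge-join-over-two-sorts example a bit more concrete, here is a rough, untested sketch. It's plain C using fork() and a pipe (so it cheerfully ignores the Windows problem), the "tuples" are bare ints, and both inputs are invented; all it shows is a child sorting one input and streaming it to the parent, which merges the stream with its own locally sorted input. Nothing here pretends to be the real executor or tuple format.

/*
 * Toy sketch: one process performs one of the two sorts and streams the
 * result to the other, which does the final merge.  POSIX fork()/pipe(),
 * bare ints standing in for tuples, made-up input data.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int
cmp_int(const void *a, const void *b)
{
    int     x = *(const int *) a;
    int     y = *(const int *) b;

    return (x > y) - (x < y);
}

int
main(void)
{
    int     parent_vals[] = {42, 7, 19, 3, 88, 23};     /* sorted locally  */
    int     child_vals[] = {15, 4, 99, 23, 8, 42, 61};  /* sorted by child */
    int     nparent = sizeof(parent_vals) / sizeof(parent_vals[0]);
    int     nchild = sizeof(child_vals) / sizeof(child_vals[0]);
    int     pipefd[2];
    pid_t   pid;

    if (pipe(pipefd) != 0)
    {
        perror("pipe");
        return 1;
    }

    pid = fork();
    if (pid < 0)
    {
        perror("fork");
        return 1;
    }

    if (pid == 0)
    {
        /* Child: sort its input and stream the result up the pipe. */
        close(pipefd[0]);
        qsort(child_vals, nchild, sizeof(int), cmp_int);
        for (int i = 0; i < nchild; i++)
            if (write(pipefd[1], &child_vals[i], sizeof(int)) != sizeof(int))
                _exit(1);
        close(pipefd[1]);
        _exit(0);
    }

    /* Parent: sort its own input while the child works on the other one. */
    close(pipefd[1]);
    qsort(parent_vals, nparent, sizeof(int), cmp_int);

    /* Merge the local stream with whatever arrives from the child. */
    {
        int     child_val;
        int     have_child = (read(pipefd[0], &child_val, sizeof(int)) == sizeof(int));
        int     i = 0;

        while (i < nparent || have_child)
        {
            if (have_child && (i >= nparent || child_val <= parent_vals[i]))
            {
                printf("child  %d\n", child_val);
                have_child = (read(pipefd[0], &child_val, sizeof(int)) == sizeof(int));
            }
            else
                printf("parent %d\n", parent_vals[i++]);
        }
    }

    close(pipefd[0]);
    waitpid(pid, NULL, 0);
    return 0;
}

Even a toy like this makes the costing question visible: the per-value read()/write() traffic is exactly the data exchange cost Dimitri suggested the planner would have to model.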
> I'm also ignoring the difficulties of getting hold of a second backend in the right state - same database, same snapshot, etc. It seems to me unlikely that there are a substantial number of real-world applications for which this will work very well if we have to actually start a new backend every time we want to parallelize a query. IOW, we're going to need, well, a connection pool in core. *ducks, runs for cover*

Do we think it's worth proving that we can execute a plan in parallel? Something simple, if not the best case - say, a nested loop join between two tables (a rough sketch of what I have in mind is in the P.S. below)? Just as a starting point, before worrying too much about what the best thing to parallelize is, or how the degree of parallelism will be controlled.

Regards,
Mark
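P.S. Since "something simple" is easier to discuss with code on the table, this is roughly the kind of throwaway experiment I mean - again plain C with fork() and pipes, toy integer arrays standing in for the two tables, and only a match count coming back. Two workers each take a slice of the outer relation, run the naive nested loop against the whole inner relation, and report their counts to the parent; snapshots, shared memory, the planner, and Windows are all ignored.

/*
 * Toy sketch: a nested loop join with the outer relation split across two
 * forked workers.  Each worker probes the whole inner relation for its
 * slice and sends a match count back over a pipe.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 2

int
main(void)
{
    /* Join condition is simple equality; the values are made up. */
    int     outer_rel[] = {1, 4, 7, 9, 4, 2, 8, 7, 5, 3};
    int     inner_rel[] = {4, 7, 7, 10, 3};
    int     nouter = sizeof(outer_rel) / sizeof(outer_rel[0]);
    int     ninner = sizeof(inner_rel) / sizeof(inner_rel[0]);
    int     pipes[NWORKERS][2];
    pid_t   pids[NWORKERS];
    long    total = 0;

    for (int w = 0; w < NWORKERS; w++)
    {
        if (pipe(pipes[w]) != 0 || (pids[w] = fork()) < 0)
        {
            perror("pipe/fork");
            return 1;
        }

        if (pids[w] == 0)
        {
            /* Worker: naive nested loop over its slice of the outer rel. */
            int     lo = w * nouter / NWORKERS;
            int     hi = (w + 1) * nouter / NWORKERS;
            long    matches = 0;

            close(pipes[w][0]);
            for (int i = lo; i < hi; i++)
                for (int j = 0; j < ninner; j++)
                    if (outer_rel[i] == inner_rel[j])
                        matches++;

            if (write(pipes[w][1], &matches, sizeof(matches)) != sizeof(matches))
                _exit(1);
            close(pipes[w][1]);
            _exit(0);
        }

        close(pipes[w][1]);     /* parent keeps only the read ends */
    }

    /* Parent: collect and sum the per-worker match counts. */
    for (int w = 0; w < NWORKERS; w++)
    {
        long    matches = 0;

        if (read(pipes[w][0], &matches, sizeof(matches)) == sizeof(matches))
            total += matches;
        close(pipes[w][0]);
        waitpid(pids[w], NULL, 0);
    }

    printf("parallel nested loop found %ld matching pairs\n", total);
    return 0;
}

The interesting questions obviously start where a toy like this stops: how the slices get chosen, how actual tuples (rather than a count) flow back to the parent, and where the second backend comes from in the first place.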