Re: [HACKERS] parallel joins, and better parallel explain

Paul Ramsey Mon, 14 Dec 2015 10:28:45 -0800

On Wed, Dec 2, 2015 at 1:55 PM, Robert Haas <[email protected]> wrote:


> Oops.  The new version I've attached should fix this.

I've been trying to see if parallel join has any effect on PostGIS
spatial join queries, which are commonly CPU bound. (My tests [1] on
simple parallel scan were very positive, though quite limited in that
they only parallelized such a small part of the work).

Like Amit, I found the current patches are missing a change to
src/include/nodes/relation.h, but just adding in "Relids
extra_lateral_rels" to JoinPathExtraData allowed a warning-free build.

The assumptions on parallel code in generally are that setting up
parallel workers is very costly compared to the work to be done, so to
get PostGIS code to parallelize I've been reduced to shoving
parallel_setup_cost very low (1) and parallel_tuple_cost as well.
Otherwise I just end up with ordinary plans. I did redefine all the
relevant functions as "parallel safe" and upped their declared costs
significantly.

I set up a 8000 record spatial table, with a spatial index, and did a
self-join on it.

explain analyze
select a.gid, b.gid from vada a join vada b
on st_intersects(a.geom, b.geom)
where a.gid != b.gid;

With no parallelism, I got this:

set max_parallel_degree = 0;

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.15..227332.48 rows=1822243 width=8) (actual
time=0.377..5528.461 rows=52074 loops=1)
   ->  Seq Scan on vada a  (cost=0.00..1209.92 rows=8792 width=1189)
(actual time=0.027..5.004 rows=8792 loops=1)
   ->  Index Scan using vada_gix on vada b  (cost=0.15..25.71 rows=1
width=1189) (actual time=0.351..0.622 rows=6 loops=8792)
         Index Cond: (a.geom && geom)
         Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
         Rows Removed by Filter: 3
 Planning time: 3.976 ms
 Execution time: 5533.573 ms


With parallelism, I got this:

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.15..226930.05 rows=1822243 width=8) (actual
time=0.840..5462.029 rows=52074 loops=1)
   ->  Gather  (cost=0.00..807.49 rows=8792 width=1189) (actual
time=0.335..39.326 rows=8792 loops=1)
         Number of Workers: 1
         ->  Parallel Seq Scan on vada a  (cost=0.00..806.61 rows=5861
width=1189) (actual time=0.015..10.167 rows=4396 loops=2)
   ->  Index Scan using vada_gix on vada b  (cost=0.15..25.71 rows=1
width=1189) (actual time=0.353..0.609 rows=6 loops=8792)
         Index Cond: (a.geom && geom)
         Filter: ((a.gid <> gid) AND _st_intersects(a.geom, geom))
         Rows Removed by Filter: 3
 Planning time: 4.019 ms
 Execution time: 5467.126 ms

Given that it's a CPU-bound process, I was hoping for something closer
to the results of the sequence tests, about 50% time reduction, based
on the two cores in my test machine.

In general either the parallel planner is way too conservative (it
seems), or we need to radically increase the costs of our PostGIS
functions (right now, most "costly" functions are cost 100, but I had
to push costs up into the 100000 range to get parallelism to kick in
sometimes). Some guidelines on cost setting would be useful, something
that says, "this function run against this kind of data is cost level
1, compare the time your function takes on 'standard' data to the
baseline function to get a cost ratio to use in the function
definition"

ATB,

P.



[1]  https://gist.github.com/pramsey/84e7a39d83cccae692f8


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] parallel joins, and better parallel explain

Reply via email to