Kim Bisgaard wrote:
Hi,

I'm having problems with the query optimizer and FULL OUTER JOIN on
PostgreSQL 7.4. I cannot get it to use my indexes with full outer joins.
I might be naive, but I think that it should be possible?

I have two BIG tables (virtually identical) with 3 NOT NULL columns
Station_id, TimeObs, Temp_XXXX, with indexes on (Station_id, TimeObs)
and valid ANALYSE (set statistics=100). I want to join the two tables
with a FULL OUTER JOIN.

When I specify the query as:

select temp_max_60min,temp_dry_at_2m
from station s natural join
temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs)
where s.wmo_id=6065
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;

I get the correct results, BUT LOUSY performance, and the following explain:

 Nested Loop Left Join  (cost=5.84..163484.08 rows=1349 width=12) (actual 
time=66146.815..119005.381 rows=1 loops=1)
   Filter: (COALESCE("outer".timeobs, "inner".timeobs) = '2004-01-01 
00:00:00'::timestamp without time zone)
   ->  Hash Join  (cost=5.84..155420.24 rows=1349 width=16) (actual 
time=8644.449..110836.038 rows=109826 loops=1)

Well, the estimate here is quite a bit off. It thinks you will be getting 1349 (which is probably why it picked a nested loop plan), but then it is getting 109826 rows. I'm guessing it is misunderstanding the selectivity of the timeobs column.

         Hash Cond: ("outer".station_id = "inner".station_id)
         ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..120615.94 rows=6956994 
width=16) (actual time=0.024..104548.515 rows=6956994 loops=1)
         ->  Hash  (cost=5.84..5.84 rows=1 width=4) (actual time=0.114..0.114 
rows=0 loops=1)
               ->  Index Scan using wmo_idx on station  (cost=0.00..5.84 rows=1 
width=4) (actual time=0.105..0.108 rows=1 loops=1)
                     Index Cond: ((wmo_id = 6065) AND ('2004-01-01 
00:00:00'::timestamp without time zone >= startdate) AND ('2004-01-01 
00:00:00'::timestamp without time zone <= enddate))
   ->  Index Scan using temp_max_60min_idx on temp_max_60min b  
(cost=0.00..5.96 rows=1 width=20) (actual time=0.071..0.071 rows=0 loops=109826)
         Index Cond: (("outer".station_id = b.station_id) AND ("outer".timeobs 
= b.timeobs))
 Total runtime: 119005.499 ms
(11 rows)

I think the bigger problem is that a full outer join says grab all rows, even if they are null.

What about this query:
SELECT temp_max_60min,temp_dry_at_2m
  FROM (station s LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
       LEFT JOIN temp_max_60min b USING (station_id, timeobs)
where s.wmo_id=6065
and timeobs='2004-1-1 0:0:0'
and '2004-1-1 0:0:0' between s.startdate and s.enddate;

After that, you should probably have a multi-column index on
(station_id, timeobs), which lets postgres use just that index for the
lookup, rather than using an index and then a filter. (Looking at your
next query you might already have that index).


If I change the query to (and thus negates the full outer join):

This is the same query, I think you messed up your copy and paste.


select temp_max_60min,temp_dry_at_2m from station s natural join temp_dry_at_2m a full outer join temp_max_60min b using (station_id, timeobs) where s.wmo_id=6065 and _a.timeobs='2004-1-1 0:0:0' and b._timeobs='2004-1-1 0:0:0' and '2004-1-1 0:0:0' between s.startdate and s.enddate;


I get wrong results (In the case where one of the records is missing in one of the tables), BUT GOOD performance, and this query plan:

 Nested Loop  (cost=0.00..17.83 rows=1 width=12) (actual time=79.221..79.236 
rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..11.82 rows=1 width=24) (actual 
time=65.517..65.526 rows=1 loops=1)
         ->  Index Scan using wmo_idx on station  (cost=0.00..5.83 rows=1 
width=4) (actual time=0.022..0.026 rows=1 loops=1)
               Index Cond: ((wmo_id = 6065) AND ('2004-01-01 00:00:00'::timestamp 
without time zone >= startdate) AND ('2004-01-01 00:00:00'::timestamp without time 
zone <= enddate))
         ->  Index Scan using temp_max_60min_idx on temp_max_60min b  
(cost=0.00..5.97 rows=1 width=20) (actual time=65.483..65.486 rows=1 loops=1)
               Index Cond: (("outer".station_id = b.station_id) AND (b.timeobs 
= '2004-01-01 00:00:00'::timestamp without time zone))
   ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m a  
(cost=0.00..6.00 rows=1 width=16) (actual time=13.694..13.698 rows=1 loops=1)
         Index Cond: (("outer".station_id = a.station_id) AND (a.timeobs = 
'2004-01-01 00:00:00'::timestamp without time zone))
 Total runtime: 79.340 ms
(9 rows)


If further info like EXPLAIN VERBOSE is useful please say so and I will provide it.

Thanks in advance!
Kim Bisgaard.

I still feel like you will have a problem with an outer join in this circumstance, because it will have to scan all of both tables.

I think what you are wanting is "give me everything where station_id =
X, and there is a row in either a or b".
I think my LEFT JOIN example does that, but I also think there would be
a subselect form which would work, which might do better. Something like:

SELECT temp_max_60min,temp_dry_at_2m
        FROM (SELECT station_id, timeobs FROM station s
               WHERE s.wmo_id=6065
                 AND timeobs = '2004-1-1 0:0:0'
                 AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate
              ) AS s
        JOIN (SELECT temp_max_60min, temp_dry_at_2m
                FROM temp_dry_at_2m a
                FULL OUTER JOIN temp_max_60min b
                USING (station_id, timeobs)
                WHERE station_id = s.station_id
                  AND timeobs = '2004-1-1 0:0:0'
             )
;

If I did this correctly, you should have a very restrictive scan done on
station, which only returns a few rows based on timeobs & station_id.
But it might be better to turn that final FULL OUTER JOIN into 2 LEFT
JOINs like I did the first time:

SELECT temp_max_60min,temp_dry_at_2m
        FROM (SELECT station_id, timeobs FROM station s
               WHERE s.wmo_id=6065
                 AND timeobs = '2004-1-1 0:0:0'
                 AND '2004-1-1 0:0:0' BETWEEN s.startdate AND s.enddate
              ) AS s
        LEFT JOIN temp_dry_at_2m a USING (station_id, timeobs)
        LEFT JOIN temp_max_60min b USING (station_id, timeobs)
;

I would hope postgres could do this from just my earlier plan. And I
hope I understand what you want, such that 2 LEFT JOINS work better than
your FULL OUTER JOIN. If you only want rows where one of both temp_dry
or temp_max exist, you probably could just add the line:

WHERE (temp_max_60_min IS NOT NULL OR temp_dry_at_2m IS NOT NULL)


John =:->

Attachment: signature.asc
Description: OpenPGP digital signature



Reply via email to