Re: [GENERAL] Slow join over three tables

Tim van der Linden Wed, 27 Apr 2016 06:23:04 -0700

On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys <[email protected]> wrote:


Hi Alban

Thanks for chiming in!

> Since you're not using age and gender in this (particular) query until the 
> rows are combined into a result set already, it doesn't make a whole lot of 
> sense to add them to the index. Moreover, since your'e ordering by created, 
> I'd at least put that directly after id:
> 
>       create index on reports (id, created);

The ORDER statement did not attribute much to the total query time, but it 
makes sense to index this.
 
> In this case, you're using the values in adverse to filter relevant rid's for 
> the FK join, so you might be better off with the inverse of above index:
>       create index on report_adverses (adverse, rid);
>       create index on report_drugs (drug, rid);

Hold the phone ... you just cracked the code.

I added these two proposed indexes (reversing their order). After purging the 
OS disk cache, the query runs in (*drum roll*) 120 ms. You managed to speed 
this up ~40 times.

The query plan:

 Sort  (cost=12108.99..12109.83 rows=337 width=41) (actual 
time=119.517..119.531 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.84 rows=337 width=41) (actual 
time=26.473..119.392 rows=448 loops=1)
         Join Filter: (d.rid = r.id)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual 
time=25.624..33.650 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual 
time=4.208..4.976 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1050kB
                     ->  Index Only Scan using report_drugs_drug_rid_idx on 
report_drugs d  (cost=0.44..415.86 rows=14496 width=8) (actual 
time=0.648..2.236 rows=14200 loops=1)
                           Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
                           Heap Fetches: 0
               ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual 
time=21.410..24.174 rows=76974 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6682kB
                     ->  Index Only Scan using report_adverses_adverse_rid_idx 
on report_adverses a  (cost=0.56..3019.24 rows=79860 width=21) (actual 
time=0.916..10.689 rows=76972 loops=1)
                           Index Cond: (adverse = ANY ('{"back pain - 
nonspecific","nonspecific back pain","back pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports 
r  (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 15.466 ms
 Execution time: 119.871 ms

This is amazing, and the slow down was purely a fact of missing indexes on the 
sweet spot.
You can now clearly see that the above two indexes are used 
(report_drugs_drug_rid_idx) and (report_adverses_adverse_rid_idx) and that the 
nested loop is much faster.

> Do these tables have a primary key and can that be used here?

Only the "reports" table has a PK, the other two don't (only a foreign key to 
"reports").

> I hope I'm not being superfluous here, but don't forget to vacuum analyze 
> after creating those indexes. Eventually autovacuum will do that for you, but 
> you could otherwise be running tests to verify the impact of adding those 
> indexes before autovacuum gets around to it.

I'm becoming the king of manual VACUUM-ing the past few days ... no need to 
point that out ;)

> Finally, perhaps it's more efficient to weed out all unrelated drugs and 
> adverses before relating them to reports, something like:
> 
> SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
> FROM reports r
> JOIN (
>       SELECT d.rid, a.adverse, d.drug
>       FROM report_drugs d
>       JOIN report_adverses a ON a.rid = d.rid 
>       WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific 
> back pain', 'back pain']) 
>       AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
> ) x ON x.rid = r.id
> ORDER BY r.created;

With the above indexes created, disk cache flushed, this query ran at the exact 
same speed, the plan:

 Sort  (cost=12108.14..12108.99 rows=337 width=41) (actual 
time=119.102..119.111 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.00 rows=337 width=41) (actual 
time=26.797..118.969 rows=448 loops=1)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual 
time=25.899..33.203 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual 
time=4.319..4.981 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1050kB
                     ->  Index Only Scan using report_drugs_drug_rid_idx on 
report_drugs d  (cost=0.44..415.86 rows=14496 width=8) (actual 
time=0.748..2.369 rows=14200 loops=1)
                           Index Cond: (drug = ANY 
('{359,360,361,362,363}'::integer[]))
                           Heap Fetches: 0
               ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual 
time=21.573..23.940 rows=76974 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6682kB
                     ->  Index Only Scan using report_adverses_adverse_rid_idx 
on report_adverses a  (cost=0.56..3019.24 rows=79860 width=21) (actual 
time=0.922..10.637 rows=76972 loops=1)
                           Index Cond: (adverse = ANY ('{"back pain - 
nonspecific","nonspecific back pain","back pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports 
r  (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 15.238 ms
 Execution time: 119.431 ms

So your hunch was right, not much difference there.

But man ... this query is now flying ... Hat's off to you, sir!

> Alban Hertroys

Cheers,
Tim


-- 
Sent via pgsql-general mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Slow join over three tables

Reply via email to