Re: [GENERAL] Slow join over three tables

Alban Hertroys Thu, 28 Apr 2016 07:40:27 -0700

On 28 April 2016 at 08:36, Tim van der Linden <t...@shisaa.jp> wrote:
> On Wed, 27 Apr 2016 13:48:06 +0200
> Alban Hertroys <haram...@gmail.com> wrote:
>
>> In this case, you're using the values in adverse to filter relevant rid's 
>> for the FK join, so you might be better off with the inverse of above index:
>>       create index on report_adverses (adverse, rid);
>>       create index on report_drugs (drug, rid);
>
> Hmmmm, like I reported yesterday this achieved a huge performance boost.
>
> However, two (new) things I like to touch on while further experimenting with 
> this query:
>
> 1. Altering Drug IDs or Adverse names effects the executing time negatively.
>
> In this example altering the drug IDs I search for makes the query 6 times 
> slower again:
>
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM reports r
> JOIN report_drugs d ON d.rid = r.id
> JOIN report_adverses a ON a.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
> pain', 'back pain'])
> AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223]) ORDER BY r.created;
>
> Different drug ID ([9557, 17848, 17880, 18223]), but the rest is the same. 
> Query plan:
>
>  Sort  (cost=31409.71..31417.48 rows=3107 width=41) (actual 
> time=662.707..662.819 rows=4076 loops=1)
>    Sort Key: r.created
>    Sort Method: quicksort  Memory: 415kB
>    ->  Nested Loop  (cost=25693.17..31229.48 rows=3107 width=41) (actual 
> time=71.748..661.743 rows=4076 loops=1)
>          ->  Merge Join  (cost=25692.61..26795.53 rows=3107 width=29) (actual 
> time=70.841..97.094 rows=4076 loops=1)
>                Merge Cond: (d.rid = a.rid)
>                ->  Sort  (cost=16332.08..16671.61 rows=135810 width=8) 
> (actual time=48.946..58.623 rows=135370 loops=1)
>                      Sort Key: d.rid
>                      Sort Method: quicksort  Memory: 12498kB
>                      ->  Index Scan using report_drugs_drug_idx on 
> report_drugs d  (cost=0.44..4753.44 rows=135810 width=8) (actual 
> time=0.681..28.441 rows=135532 loops=1)
>                            Index Cond: (drug = ANY 
> ('{9557,17848,17880,18223}'::integer[]))



You're doing ~9 times as many index lookups. A slowdown of 6x of this
part of the query seems rather reasonable.

>  Planning time: 16.438 ms
>  Execution time: 663.583 ms
>
> A total of 660 ms instead of the previous 120 ms. The amount of rows returned 
> are 4076, instead of the 448 rows of the previous query. Could this be the 
> cause of the slow down? Or is it the fact that the integers in the ANY clause 
> are further apart from one another and thus make for longer index searches?

Technically probably both, but the index node distance (assuming a
btree index) should be insignificant.

The sorting of found candidate rows on rid probably contributes the
most. I forgot how quicksort scales to the number of items to sort,
probably something like 2log(n) or something like that.

That said, you didn't happen to perform these tests (assuming >1)
during a busier period of the database server? I ask because later on
you seem to be getting fast results again after some more index
changes and I don't see cause for the difference.

> The same happens with changing the adverses, some adverses come back in ~120 
> ms, others can take up to one second.

Likely the same issue

> 2. Adding extra columns ignores indexes
>
> This one was kind of expected. Adding extra columns from both "report_drugs" 
> and "report_adverses" slows the lot down again. Extra columns added:
>
> - a.recovery
> - a.severity
> - d.reason
> - d.effectiveness
> - d.duration

I have to admit that I'm a bit behind with the current state of the
art of PostgreSQL, but last time I checked, the database needed to
look at each row in the result-set for transaction visibility
information. I recall that there was (at some point) much discussion
whether that visibility information could be added to indexes and that
there were strong arguments against doing so. Hence, I doubt that
those new index-only scans skip that step.

Unless I'm wrong there, adding non-queried fields to the index is only
going to affect your performance adversely. Relevant fields for
indices are those used in joins, those regularly used in conditions
(where-clauses) and those that are regularly sorted over.

Other fields are available from the candidate rows of the result set.
Putting them in the index just results in storing redundant
information.

That's how it always has been and I'm pretty sure that's still the case.

> As I would expect, adding these columns would make the previous multi-column 
> indexes useless. And they indeed were not used anymore.

What adding those fields to the index did was drive up the cost of
using that index. It requires more memory to cache, for example, and
this will be evicted from caches sooner than a smaller index.

I suspect the query planner also noticed that and chose for the
smaller index as that indexes usage costs are now lower than those of
your (unnecessarily) large index.

> Still fast enough ... but I was wondering why the planner would not use the 
> new index and instead fall back on the "report_drugs_drug_idx" single column 
> index.

I'm actually a bit suspicious of those numbers; those are different
drug id's than those from the first explain too.

-- 
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Slow join over three tables

Reply via email to