Re: [GENERAL] Slow join over three tables

2016-04-28 Thread Tim van der Linden
On Thu, 28 Apr 2016 16:38:53 +0200
Alban Hertroys  wrote:

Hi Alban

First off ... thank you for your time on this so far and for giving me the 
educational smacks on the head :) I appreciate it a lot. 

> You're doing ~9 times as many index lookups. A slowdown of 6x of this
> part of the query seems rather reasonable.

Again, very true. But why 9 times as many?
 
> >  Planning time: 16.438 ms
> >  Execution time: 663.583 ms
> >
> > A total of 660 ms instead of the previous 120 ms. The number of rows 
> > returned is 4076, instead of the 448 rows of the previous query. Could 
> > this be the cause of the slowdown? Or is it the fact that the integers in 
> > the ANY clause are further apart from one another and thus make for longer 
> > index searches?
> 
> Technically probably both, but the index node distance (assuming a
> btree index) should be insignificant.

It is indeed a b-tree index.

> The sorting of found candidate rows on rid probably contributes the
> most. I forgot how quicksort scales to the number of items to sort,
> probably something like 2log(n) or something like that.

Looked this up and you are right: a quicksort (in C) is of O(n log n) 
complexity ... and if I'm right that is "rather close to" linear for practical 
sizes. Less sorting is thus the message :)
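
To put rough numbers on it: the drug filter matched ~14,200 rows in the fast
query and ~135,532 in the slow one (per the plans in this thread), so a
back-of-the-envelope n * log2(n) sketch of the two sorts looks like this:

-- rough comparison-count estimate for the two quicksorts (not a cost model):
SELECT 14200 * log(2, 14200)   AS small_sort,  -- ~196k
       135532 * log(2, 135532) AS large_sort,  -- ~2.3M
       (135532 * log(2, 135532)) / (14200 * log(2, 14200)) AS ratio;  -- ~11.8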

> That said, you didn't happen to perform these tests (assuming >1)
> during a busier period of the database server? I ask because later on
> you seem to be getting fast results again after some more index
> changes and I don't see cause for the difference.

No, these tests are on a local development machine with nothing else going on 
but the database cluster.

> > 2. Adding extra columns ignores indexes
> > ...
> > - a.recovery
> > - a.severity
> > - d.reason
> > - d.effectiveness
> > - d.duration
> 
> I have to admit that I'm a bit behind with the current state of the
> art of PostgreSQL, but last time I checked, the database needed to
> look at each row in the result-set for transaction visibility
> information. I recall that there was (at some point) much discussion
> whether that visibility information could be added to indexes and that
> there were strong arguments against doing so. Hence, I doubt that
> those new index-only scans skip that step.

I see.

> Unless I'm wrong there, adding non-queried fields to the index is only
> going to affect your performance adversely. Relevant fields for
> indices are those used in joins, those regularly used in conditions
> (where-clauses) and those that are regularly sorted over.
> 
> Other fields are available from the candidate rows of the result set.
> Putting them in the index just results in storing redundant
> information.

Storing redundant information and making for bloated indexes, which the planner 
might then choose to skip, if I understand correctly. Good to know.
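
One quick way to compare the two (index names taken from the plans in this
thread; pg_relation_size reports the on-disk size of each index):

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('report_adverses_adverse_rid_idx',
                  'report_adverses_adverse_rid_severity_recovery_idx');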

> > ...
> > Still fast enough ... but I was wondering why the planner would not use the 
> > new index and instead fall back on the "report_drugs_drug_idx" single 
> > column index.
> 
> I'm actually a bit suspicious of those numbers; those are different
> drug id's than those from the first explain too.

You are completely right, this was the wrong plan ... the correct plan is:

 Sort  (cost=31757.71..31765.48 rows=3107 width=76) (actual time=722.348..722.461 rows=4076 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 495kB
   ->  Nested Loop  (cost=26041.17..31577.48 rows=3107 width=76) (actual time=139.407..721.090 rows=4076 loops=1)
         ->  Merge Join  (cost=26040.61..27143.53 rows=3107 width=64) (actual time=139.396..170.446 rows=4076 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=16332.08..16671.61 rows=135810 width=35) (actual time=108.866..119.143 rows=135370 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 13992kB
                     ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..4753.44 rows=135810 width=35) (actual time=0.038..53.758 rows=135532 loops=1)
                           Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[]))
               ->  Sort  (cost=9708.53..9904.94 rows=78565 width=29) (actual time=30.517..34.876 rows=77163 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6702kB
                     ->  Index Only Scan using report_adverses_adverse_rid_severity_recovery_idx on report_adverses a  (cost=0.56..3320.57 rows=78565 width=29) (actual time=1.005..16.135 rows=76972 loops=1)
                           Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..1.42 rows=1 width=20) (actual time=0.134..0.135 rows=1 loops=4076)
               Index Cond: (id = d.rid)
               Heap Fetches: 0
 Planning time: 

Re: [GENERAL] Slow join over three tables

2016-04-28 Thread Tim van der Linden
On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys  wrote:

> In this case, you're using the values in adverse to filter relevant rid's for 
> the FK join, so you might be better off with the inverse of above index:
>   create index on report_adverses (adverse, rid);
>   create index on report_drugs (drug, rid);

Hmmm, like I reported yesterday this achieved a huge performance boost.

However, there are two (new) things I'd like to touch on while further 
experimenting with this query:

1. Altering drug IDs or adverse names affects the execution time negatively.

In this example altering the drug IDs I search for makes the query 6 times 
slower again:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[9557, 17848, 17880, 18223])
ORDER BY r.created;

Different drug IDs ([9557, 17848, 17880, 18223]), but the rest is the same. 
Query plan:

 Sort  (cost=31409.71..31417.48 rows=3107 width=41) (actual time=662.707..662.819 rows=4076 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 415kB
   ->  Nested Loop  (cost=25693.17..31229.48 rows=3107 width=41) (actual time=71.748..661.743 rows=4076 loops=1)
         ->  Merge Join  (cost=25692.61..26795.53 rows=3107 width=29) (actual time=70.841..97.094 rows=4076 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=16332.08..16671.61 rows=135810 width=8) (actual time=48.946..58.623 rows=135370 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 12498kB
                     ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..4753.44 rows=135810 width=8) (actual time=0.681..28.441 rows=135532 loops=1)
                           Index Cond: (drug = ANY ('{9557,17848,17880,18223}'::integer[]))
               ->  Sort  (cost=9360.53..9556.94 rows=78565 width=21) (actual time=21.880..25.969 rows=77163 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6682kB
                     ->  Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a  (cost=0.56..2972.57 rows=78565 width=21) (actual time=0.983..10.744 rows=76972 loops=1)
                           Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..1.42 rows=1 width=20) (actual time=0.137..0.138 rows=1 loops=4076)
               Index Cond: (id = d.rid)
               Heap Fetches: 0
 Planning time: 16.438 ms
 Execution time: 663.583 ms

A total of 660 ms instead of the previous 120 ms. The number of rows returned 
is 4076, instead of the 448 rows of the previous query. Could this be the 
cause of the slowdown? Or is it the fact that the integers in the ANY clause 
are further apart from one another and thus make for longer index searches?

The same happens when changing the adverses: some adverses come back in ~120 
ms, others can take up to one second.

2. Adding extra columns ignores indexes

This one was kind of expected. Adding extra columns from both "report_drugs" 
and "report_adverses" slows the lot down again. Extra columns added:

- a.recovery
- a.severity
- d.reason
- d.effectiveness
- d.duration

As I expected, adding these columns made the previous multi-column indexes 
useless. And indeed, they were not used anymore.

So my initial thought was to create new indexes on the columns I now query:

- CREATE INDEX ON report_adverses(adverse, rid, severity, recovery);
- CREATE INDEX ON report_drugs(drug, rid, reason, effectiveness, duration);
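
As an aside for anyone reading this on PostgreSQL 11 or later (well after this
thread was written): a covering index can carry such payload columns in its
leaf pages without widening the btree keys. A sketch, assuming the same columns
as above:

CREATE INDEX ON report_adverses (adverse, rid) INCLUDE (severity, recovery);
CREATE INDEX ON report_drugs (drug, rid) INCLUDE (reason, effectiveness, duration);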

After running the query again, the new index on "report_adverses" got picked 
up, but the index on "report_drugs" did not:

 Sort  (cost=12365.79..12366.61 rows=329 width=76) (actual time=129.106..129.120 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 66kB
   ->  Nested Loop  (cost=11212.93..12352.04 rows=329 width=76) (actual time=31.558..128.951 rows=448 loops=1)
         Join Filter: (d.rid = r.id)
         ->  Merge Join  (cost=11212.38..11680.44 rows=329 width=64) (actual time=30.705..39.527 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1503.85..1539.82 rows=14390 width=35) (actual time=6.977..7.993 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1202kB
                     ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..510.02 rows=14390 width=35) (actual time=0.567..4.638 rows=14200 loops=1)
                           Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))

Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 13:48:06 +0200
Alban Hertroys  wrote:

Hi Alban

Thanks for chiming in!

> Since you're not using age and gender in this (particular) query until the 
> rows are combined into a result set already, it doesn't make a whole lot of 
> sense to add them to the index. Moreover, since you're ordering by created, 
> I'd at least put that directly after id:
> 
>   create index on reports (id, created);

The ORDER BY clause did not contribute much to the total query time, but it 
makes sense to index this.
 
> In this case, you're using the values in adverse to filter relevant rid's for 
> the FK join, so you might be better off with the inverse of above index:
>   create index on report_adverses (adverse, rid);
>   create index on report_drugs (drug, rid);

Hold the phone ... you just cracked the code.

I added these two proposed indexes (reversing the column order of the earlier 
ones). After purging the OS disk cache, the query runs in (*drum roll*) 120 ms. 
You managed to speed this up ~40 times.

The query plan:

 Sort  (cost=12108.99..12109.83 rows=337 width=41) (actual time=119.517..119.531 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.84 rows=337 width=41) (actual time=26.473..119.392 rows=448 loops=1)
         Join Filter: (d.rid = r.id)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.624..33.650 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.208..4.976 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: quicksort  Memory: 1050kB
                     ->  Index Only Scan using report_drugs_drug_rid_idx on report_drugs d  (cost=0.44..415.86 rows=14496 width=8) (actual time=0.648..2.236 rows=14200 loops=1)
                           Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                           Heap Fetches: 0
               ->  Sort  (cost=9521.91..9721.56 rows=79860 width=21) (actual time=21.410..24.174 rows=76974 loops=1)
                     Sort Key: a.rid
                     Sort Method: quicksort  Memory: 6682kB
                     ->  Index Only Scan using report_adverses_adverse_rid_idx on report_adverses a  (cost=0.56..3019.24 rows=79860 width=21) (actual time=0.916..10.689 rows=76972 loops=1)
                           Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                           Heap Fetches: 0
         ->  Index Only Scan using reports_id_age_gender_created_idx on reports r  (cost=0.56..2.01 rows=1 width=20) (actual time=0.190..0.191 rows=1 loops=448)
               Index Cond: (id = a.rid)
               Heap Fetches: 0
 Planning time: 15.466 ms
 Execution time: 119.871 ms

This is amazing; the slowdown was purely a matter of missing indexes on the 
sweet spot.
You can now clearly see that the above two indexes 
(report_drugs_drug_rid_idx and report_adverses_adverse_rid_idx) are used and 
that the nested loop is much faster.

> Do these tables have a primary key and can that be used here?

Only the "reports" table has a PK, the other two don't (only a foreign key to 
"reports").

> I hope I'm not being superfluous here, but don't forget to vacuum analyze 
> after creating those indexes. Eventually autovacuum will do that for you, but 
> you could otherwise be running tests to verify the impact of adding those 
> indexes before autovacuum gets around to it.

I'm becoming the king of manual VACUUM-ing the past few days ... no need to 
point that out ;)
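
For the record, a quick way to check whether autovacuum already got around to
it (straight from the statistics collector):

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('reports', 'report_drugs', 'report_adverses');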

> Finally, perhaps it's more efficient to weed out all unrelated drugs and 
> adverses before relating them to reports, something like:
> 
> SELECT r.id, r.age, r.gender, r.created, x.adverse, x.drug
> FROM reports r
> JOIN (
>   SELECT d.rid, a.adverse, d.drug
>   FROM report_drugs d
>   JOIN report_adverses a ON a.rid = d.rid 
>   WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
>   AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
> ) x ON x.rid = r.id
> ORDER BY r.created;

With the above indexes created and the disk cache flushed, this query ran at 
the exact same speed. The plan:

 Sort  (cost=12108.14..12108.99 rows=337 width=41) (actual time=119.102..119.111 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=10940.25..12094.00 rows=337 width=41) (actual time=26.797..118.969 rows=448 loops=1)
         ->  Merge Join  (cost=10939.69..11414.84 rows=337 width=29) (actual time=25.899..33.203 rows=448 loops=1)
               Merge Cond: (d.rid = a.rid)
               ->  Sort  (cost=1417.78..1454.02 rows=14496 width=8) (actual time=4.319..4.981 rows=14074 loops=1)
                     Sort Key: d.rid
                     Sort Method: 

Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 22:40:43 +1200
David Rowley  wrote:

Hi David

> > ...
> >  Planning time: 15.968 ms
> >  Execution time: 4313.755 ms
> >
> > Both the (rid, adverse) and the (id, age, gender, created) indexes are now 
> > used.
> >
> 
> Seems the (rid, adverse) is not being used. report_adverses_rid_idx
> is your (rid) index.

Ah ... good catch, I blame it on the late evening over here ;)

> > Yet ... this is "only" 1 second faster, still 4 seconds to complete this 
> > query (the join(s)). Is this truly the most that I could get out of it ... 
> > or could further tweaking to PostgreSQL's configuration be of any help here?
> 
> EXPLAIN ANALYZE also has quite a bit of timing overhead, so it might
> not be taking quite as long as you think.
> 
> How long does it take with EXPLAIN (ANALYZE, TIMING OFF) ... ?

Hmmm, this takes roughly the same amount of time: 

 Planning time: 39.889 ms
 Execution time: 4532.642 ms
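
That was with the full invocation (same query as in my opening post):

EXPLAIN (ANALYZE, TIMING OFF)
SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363])
ORDER BY r.created;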

> Or perhaps just run the query, as there's only 448 rows anyway.

Running the plain query also did not speed things up much ... query time pulled 
from the log statement: 4305.686 ms.

> -- 
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 09:14:27 +0300
Victor Yegorov  wrote:

Hi Victor

> > ...
> 
> Can you post output of `EXPLAIN (analyze, buffers)`, please?
> It'd be good to check how many buffers are hit/read during Index Scans.

Happy to, here it is:

 Sort  (cost=107727.85..107728.71 rows=344 width=41) (actual time=5357.453..5357.482 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   Buffers: shared hit=75462 read=37183
   ->  Nested Loop  (cost=1.31..107713.36 rows=344 width=41) (actual time=53.866..5357.211 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         Buffers: shared hit=75459 read=37183
         ->  Nested Loop  (cost=0.87..96362.32 rows=14319 width=28) (actual time=0.569..2214.900 rows=14200 loops=1)
               Buffers: shared hit=39949 read=17072
               ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..507.78 rows=14319 width=8) (actual time=0.567..4.406 rows=14200 loops=1)
                     Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                     Buffers: shared hit=14 read=131
               ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.68 rows=1 width=20) (actual time=0.155..0.155 rows=1 loops=14200)
                     Index Cond: (id = d.rid)
                     Buffers: shared hit=39935 read=16941
         ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=21) (actual time=0.221..0.221 rows=0 loops=14200)
               Index Cond: (rid = r.id)
               Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
               Buffers: shared hit=35510 read=20111
 Planning time: 13.962 ms
 Execution time: 5357.520 ms
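
As a side note, the table-level cache hit ratios behind buffer numbers like
these can be pulled from the statistics collector:

SELECT relname, heap_blks_hit, heap_blks_read,
       round(heap_blks_hit::numeric
             / nullif(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
WHERE relname IN ('reports', 'report_drugs', 'report_adverses');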

> -- 
> Victor Y. Yegorov

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 14:09:04 +1200
David Rowley  wrote:

Hi David

Thanks for your time on this. I tried your proposals with the results below.

> > ...
> > Under 5 ms. The same goes for querying the "adverse" column in the 
> > "report_adverses" table: under 20 ms.
> 
> I'm not sure why you're comparing this to the join plan above. They're
> very different, I can only imagine that it' because you've not quite
> understood what the EXPLAIN output is saying.

Well, that is more due to me poorly wording the question, I guess. I understand 
that both plans are totally different and that the two queries are hardly 
comparable, as two very different things are happening.

My point here was to show that the "drug" column seems to be indexed correctly 
and that, even though it is a very large table, the query executes very fast 
(all these execution times are given after flushing the OS's disk cache).
 
> > ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
> >       ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
> >             Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
> >       ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
> >             Index Cond: (id = d.rid)
> 
> This is a parameterised nested loop. This means that the inner side of
> the loop (reports), is parameterised by the outerside of the loop, you
> can see the d.rid in the Index Cond. This means that the
> reports_id_key index is looked up 14199 times. You can see from the
> plan that the nested loop here took 2038.952 milliseconds to complete,
> so about 0.144 ms per lookup, not too bad, right?

That is indeed not bad at all, thanks for pointing it out. This nested loop is 
thus responsible for a little over 40% of the execution time.

> The alternative to a parameterised nested loop plan is that a Hash
> Join plan is generated. If you imagine what would happen here, likely
> the matching report_drugs records would be hashed, then the outer side
> of the join would then perform a SeqScan over the entire 6 million
> reports records... probably not better. You can test this yourself
> with; SET enable_nestloop = 0; you might need to SET enable_mergejoin
> = 0; too. I imagine it'll be slower.

Doing this with the current (unaltered) indexes almost doubles the time and it 
seems to behave exactly as you mentioned:

 Sort  (cost=372968.28..372969.07 rows=317 width=41) (actual time=9308.174..9308.187 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Hash Join  (cost=130941.07..372955.11 rows=317 width=41) (actual time=8131.509..9307.983 rows=448 loops=1)
         Hash Cond: (r.id = d.rid)
         ->  Seq Scan on reports r  (cost=0.00..105465.00 rows=5749300 width=20) (actual time=0.237..849.564 rows=5749300 loops=1)
         ->  Hash  (cost=130937.10..130937.10 rows=317 width=29) (actual time=8130.175..8130.175 rows=448 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 25kB
               ->  Hash Join  (cost=122150.13..130937.10 rows=317 width=29) (actual time=8118.130..8130.090 rows=448 loops=1)
                     Hash Cond: (d.rid = a.rid)
                     ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..496.92 rows=13870 width=8) (actual time=0.830..3.503 rows=14200 loops=1)
                           Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
                     ->  Hash  (cost=121170.69..121170.69 rows=78320 width=21) (actual time=8116.544..8116.544 rows=76972 loops=1)
                           Buckets: 8192  Batches: 1  Memory Usage: 3609kB
                           ->  Bitmap Heap Scan on report_adverses a  (cost=2024.68..121170.69 rows=78320 width=21) (actual time=19.303..8093.257 rows=76972 loops=1)
                                 Recheck Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
                                 Heap Blocks: exact=56959
                                 ->  Bitmap Index Scan on report_adverses_adverse_idx  (cost=0.00..2005.10 rows=78320 width=0) (actual time=11.756..11.756 rows=76972 loops=1)
                                       Index Cond: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
 Planning time: 11.348 ms
 Execution time: 9308.694 ms
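
For reference, the exact knobs used for this experiment (both are standard
planner settings, reset afterwards):

SET enable_nestloop = off;
SET enable_mergejoin = off;
-- ... EXPLAIN ANALYZE the query from above ...
RESET enable_nestloop;
RESET enable_mergejoin;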

> Likely the only way to speed this up would be to create indexes;
> 
> create index on reports (id, age, gender, created);
> ...
> create index on report_adverses (rid, adverse);
> 
> this might speedup the 2nd join a bit.

Done. This does help a bit. The query runs about 1 second faster. The plan:

 Sort  (cost=66065.73..66066.59 rows=344 width=41) (actual 

Re: [GENERAL] Slow join over three tables

2016-04-27 Thread Tim van der Linden
On Wed, 27 Apr 2016 01:45:55 +
Sameer Kumar  wrote:

Hi Sameer

Thanks for taking the time to look into this!

> > ... 
> Quite clearly the nested loop joins are the most costly operations here.

Indeed.

> > ... 
> I suppose. It might help if the filters are performed before the join. I am
> not an expert on optimizer but I guess it might help if you change the join
> order and add duplicate conditions for reports-
> 
> SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
> FROM report_drugs d
> JOIN report_adverses a ON a.rid = d.rid
> JOIN reports r ON d.rid = r.id
> WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back pain', 'back pain'])
> AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

Looks like a nice approach, but it had no effect on the query time. The plan 
for this approach:

 Sort  (cost=104928.07..104928.86 rows=317 width=41) (actual time=5435.210..5435.236 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..104914.90 rows=317 width=41) (actual time=57.230..5434.930 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         ->  Nested Loop  (cost=0.87..93919.79 rows=13870 width=28) (actual time=0.569..2240.955 rows=14200 loops=1)
               ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..496.92 rows=13870 width=8) (actual time=0.565..4.678 rows=14200 loops=1)
                     Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
               ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.73 rows=1 width=20) (actual time=0.157..0.157 rows=1 loops=14200)
                     Index Cond: (id = d.rid)
         ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=21) (actual time=0.224..0.225 rows=0 loops=14200)
               Index Cond: (rid = r.id)
               Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
 Planning time: 18.512 ms
 Execution time: 5435.293 ms


> OR since you are using INNER JOIN, (As far as I understand the concept of
> joins) it won't hurt the result set if the where clause is pushed into the
> INNER JOIN criteria-

Correct. I have tried those as well, but the planner seems to take the exact 
same path, and as a result the query time is unchanged.
 
> > ...
> 
> --
> Best Regards
> Sameer Kumar | DB Solution Architect

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Slow join over three tables

2016-04-26 Thread Tim van der Linden
Hi all

I have asked this question in a somewhat different form on the DBA Stack 
Exchange site, but without much luck 
(https://dba.stackexchange.com/questions/136423/postgresql-slow-join-on-three-large-tables).
 So I apologize for possibly double posting, but I hope this might get a better 
response on the mailing list.

I'm joining three fairly large tables together, and it is slow. The tables are:

- "reports": 6 million rows
- "report_drugs": 20 million rows
- "report_adverses": 20 million rows

The table "reports" holds main report data and has a primary key column "id". 
The other two tables have a foreign key to that table with "rid". It are those 
columns that I use to join them together.

All tables have indexes on the "id"/"rid" columns and on the "drug"/"adverse" 
columns.
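
For reference, a sketch of the schema as described (column types, the payload
columns and constraint names are assumptions; index names are taken from the
plans below where known):

CREATE TABLE reports (
    id      integer PRIMARY KEY,  -- the plans show a unique index reports_id_key
    age     integer,
    gender  text,
    created timestamptz
);

CREATE TABLE report_drugs (
    rid           integer REFERENCES reports (id),
    drug          integer,
    reason        text,
    effectiveness integer,
    duration      integer
);

CREATE TABLE report_adverses (
    rid      integer REFERENCES reports (id),
    adverse  text,
    severity integer,
    recovery text
);

CREATE INDEX report_drugs_rid_idx ON report_drugs (rid);
CREATE INDEX report_drugs_drug_idx ON report_drugs (drug);
CREATE INDEX report_adverses_rid_idx ON report_adverses (rid);
CREATE INDEX report_adverses_adverse_idx ON report_adverses (adverse);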

The query:

SELECT r.id, r.age, r.gender, r.created, a.adverse, d.drug
FROM reports r
JOIN report_drugs d ON d.rid = r.id
JOIN report_adverses a ON a.rid = r.id 
WHERE a.adverse = ANY (ARRAY['back pain - nonspecific', 'nonspecific back 
pain', 'back pain']) 
AND d.drug = ANY (ARRAY[359, 360, 361, 362, 363]) ORDER BY r.created;

The plan:

Sort  (cost=105773.63..105774.46 rows=333 width=76) (actual time=5143.162..5143.185 rows=448 loops=1)
   Sort Key: r.created
   Sort Method: quicksort  Memory: 60kB
   ->  Nested Loop  (cost=1.31..105759.68 rows=333 width=76) (actual time=54.784..5142.872 rows=448 loops=1)
         Join Filter: (d.rid = a.rid)
         ->  Nested Loop  (cost=0.87..94657.59 rows=14005 width=72) (actual time=0.822..2038.952 rows=14199 loops=1)
               ->  Index Scan using report_drugs_drug_idx on report_drugs d  (cost=0.44..500.28 rows=14005 width=31) (actual time=0.669..3.900 rows=14199 loops=1)
                     Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
               ->  Index Scan using reports_id_key on reports r  (cost=0.43..6.71 rows=1 width=41) (actual time=0.143..0.143 rows=1 loops=14199)
                     Index Cond: (id = d.rid)
         ->  Index Scan using report_adverses_rid_idx on report_adverses a  (cost=0.44..0.78 rows=1 width=12) (actual time=0.218..0.218 rows=0 loops=14199)
               Index Cond: (rid = r.id)
               Filter: (adverse = ANY ('{"back pain - nonspecific","nonspecific back pain","back pain"}'::text[]))
               Rows Removed by Filter: 5
Planning time: 13.994 ms
Execution time: 5143.235 ms

This takes well over 5 seconds, which, to me, feels much too slow.
If I query each table directly with the same conditions, thus:

SELECT reason
FROM report_drugs 
WHERE drug = ANY (ARRAY[359, 360, 361, 362, 363]);

I get:

Index Scan using report_drugs_drug_idx on report_drugs  (cost=0.44..500.28 rows=14005 width=27) (actual time=0.621..4.510 rows=14199 loops=1)
  Index Cond: (drug = ANY ('{359,360,361,362,363}'::integer[]))
Planning time: 6.939 ms
Execution time: 4.759 ms

Under 5 ms. The same goes for querying the "adverse" column in the 
"report_adverses" table: under 20 ms.

This indicates to me that indeed the join itself causes a major performance 
bottleneck.

I'm running the cluster from an SSD drive, as a traditional HDD could not even 
manage the query in under 5 minutes. The system has a total of 24 GB of memory, 
runs on Debian and uses a 4 GHz 8-core i7-4790 processor.

Some important postgresql.conf readouts:

- shared_buffers = 4GB
- work_mem = 64MB
- maintenance_work_mem = 1GB
- checkpoint_segments = 50
- checkpoint_completion_target = 0.9
- autovacuum = on

Is there something I am missing here? Any help on getting this join faster is 
much appreciated.

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Multiple word synonyms (maybe?)

2015-10-22 Thread Tim van der Linden
On Wed, 21 Oct 2015 13:40:38 + (UTC)
Kevin Grittner  wrote:

> > Damn, I completely overlooked that one, and it indeed does seem
> > to come very close to what I need in this use case.
> 
> I have to admit that the name of that dictionary type threw me off
> a bit at first.

Indeed :)
 
> > ...
> 
> It has been a while, but my recollection is that I did something
> more like this:
> 
> heart attack : heartattack
> acute mi : heartattack
> mi : heartattack
> myocardial infarction : heartattack​
> 
> If my memory is to be trusted, both the original words (whichever
> are actually in the document) and the "invented" synonym
> ("heartattack") will be in the tsvector/tsquery; this results in
> all *matching* but the identical wording being considered a *closer
> match*.

Hmm, a very helpful insight and it indeed makes sense to convert each phrase 
into a "single word" mash-up so it can be lexemized.

> As with most things, I encourage you to play around with it a bit
> to see what gives the best results for you.

Yes indeed and will do!

Thank you very much for your help. If I get this up and running it might offer 
a nice opportunity to write a small post about this to expand on my PostgreSQL 
series...

> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Multiple word synonyms (maybe?)

2015-10-20 Thread Tim van der Linden
Hi All

I have a question regarding PostgreSQL's full text capabilities and 
(presumably) the synonym dictionary.

I'm currently implementing FTS on a medical-themed setup which uses 
domain-specific jargon to denote a bunch of stuff. A specific request I wish to 
implement here concerns the jargon synonyms that are heavily used.

Of course, I can simply go ahead and create my own synonym dictionary with a 
jargon-specific synonym file to feed it. However, most of the synonyms are 
composed of more than a single word. 

The term "heart attack" for example has the following "synonyms":

- Acute MI
- MI
- Myocardial infarction

As far as I understand it, the tokenizer within PostgreSQL's FTS engine splits 
words on spaces to generate tokens which are then proposed to each dictionary. 
I think it is therefore impossible to have "multi-word synonyms" in this sense, 
as multiple words cannot reach the dictionary together. The term "heart attack" 
would be presented as the tokens "heart" and "attack".
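
This is easy to verify with the stock english configuration:

SELECT to_tsvector('english', 'heart attack');
-- 'attack':2 'heart':1  (two independent lexemes; the phrase itself is gone)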

From a technical standpoint I understand FTS is about looking at individual 
words and lexemizing them ... yet from a natural language lookup perspective 
you still wish to tie "heart attack" to "Acute MI" so when a client searches on 
one, the other will turn up as well.

Should I write my own tokenizer to catch all these words and present them as a 
single token? Or is this completely outside the realm of FTS (or FTS within 
PostgreSQL)?

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Multiple word synonyms (maybe?)

2015-10-20 Thread Tim van der Linden
On Tue, 20 Oct 2015 21:57:59 +1100
rob stone  wrote:
> 
> Looking at this from an entirely different perspective, why are you not
> using ICD codes to identify patient events?
> It is a one to many relationship between patient and their events
> identified by the relevant ICD code and date.
> Given that MI has several applicable ICD codes you can use a select
> along the lines of:-
> WHERE icd_code IN (  . . . )
> 
> 
> I know it doesn't answer your question!

It does indeed not answer my direct question, but it does offer an interesting 
perspective to be used in one of the next phases of the medical application.

Thanks for the heads-up!

> Cheers,
> Rob

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Multiple word synonyms (maybe?)

2015-10-20 Thread Tim van der Linden
On Tue, 20 Oct 2015 12:02:46 +0100

> Does the Thesaurus dictionary not do what you want?
> http://www.postgresql.org/docs/current/static/textsearch-dictionaries.html#TEXTSEARCH-THESAURUS

Damn, I completely overlooked that one, and it indeed does seem to come very 
close to what I need in this use case. Thanks for jolting my memory (also 
@Kevin) :)

If I am not mistaken, this would be a valid thesaurus file:

acute mi : heart attack
mi : heart attack
myocardial infarction : heart attack

Multiple words on both ends, separated by a colon, and each line being 
functional (a unique phrase linked to its more generic replacement)?
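
And, if I read the docs right, wiring it up would look something like this
(dictionary, configuration and file names are made up; the file would live in
$SHAREDIR/tsearch_data/medical_thesaurus.ths):

CREATE TEXT SEARCH DICTIONARY medical_thesaurus (
    TEMPLATE = thesaurus,
    DictFile = medical_thesaurus,
    Dictionary = english_stem
);

CREATE TEXT SEARCH CONFIGURATION medical (COPY = english);

ALTER TEXT SEARCH CONFIGURATION medical
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH medical_thesaurus, english_stem;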

> Geoff

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full text: Ispell dictionary

2014-05-09 Thread Tim van der Linden
Hi Oleg

 btw, take a look on contrib/dict_xsyn, it's  more powerful than
 synonym dictionary.

Sorry for the late reply...and thank you for the tip.

I will check out xsyn soon. I am about to finish the third and final chapter of 
my full text series, but I could maybe write an appendix chapter which 
mentions xsyn...or just update my posts.

Cheers,
Tim



-- 
Tim van der Linden t...@shisaa.jp


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Normalization in text search ranking

2014-05-03 Thread Tim van der Linden
Hi all

Another question regarding full text, this time about ranking.
The ts_rank() and ts_rank_cd() functions accept a normalization integer/bit mask.

In the documentation the different integers are somewhat laid out and it is 
said that some take into account the document length (1 and 2) while others 
take into account the number of unique words (8 and 16).

To illustrate my following questions, take this tsvector:

'ate':9 'cat':3 'fat':2,11

Now, I was wondering how document length and unique words are calculated (from 
a high level perspective). 

Am I correct in saying that, when counting the document length, the number of 
total pointers is summed up, meaning that in the above tsvector we have 4 words 
(resulting in an integer of 4 to use to divide the float)?

And when counting unique words, would the calculation for the above tsvector be 
3, only counting the actual lexemes in there and not the number of pointers?

Also, a final question: if you use integer 8 or 16 to influence the ranking 
float calculated, you would actually punish documents that are more unique? 
Meaning that this is just another way of giving shorter documents precedence 
over longer ones?
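
To make the question concrete, this is the kind of experiment I am running
(the corpus string is made up):

SELECT ts_rank(v, q, 0) AS raw,
       ts_rank(v, q, 2) AS div_by_doc_length,
       ts_rank(v, q, 8) AS div_by_unique_words
FROM to_tsvector('english', 'the fat cat ate the fat rat') AS v,
     to_tsquery('english', 'fat') AS q;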

Thanks again!

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Full text: Ispell dictionary

2014-05-02 Thread Tim van der Linden
Good morning/afternoon all

I am currently writing a few articles about PostgreSQL's full text capabilities 
and have a question about the Ispell dictionary which I cannot seem to find an 
answer to. It is probably a very simple issue, so forgive my ignorance.

In one article I am explaining about dictionaries, and I have set up a sample 
configuration which maps most token categories to only use an Ispell dictionary 
(timusan_ispell) which has a default configuration: 

CREATE TEXT SEARCH DICTIONARY timusan_ispell (
TEMPLATE = ispell,
DictFile = en_us,
AffFile = en_us,
StopWords = english
);

When I run a simple query like SELECT to_tsvector('timusan-ispell','smiling') 
I get back the following tsvector:

'smile':1 'smiling':1

As you can see I get two lexemes with the same pointer.
The question here is: why does this happen? 

Is it normal behavior for the Ispell dictionary to emit multiple lexemes for a 
single token? And if so, is this efficient? I mean, why could it not simply 
save one lexeme 'smile' which (same as the snowball dictionary) would match 
'smiling' as well if later matched with the accompanying tsquery?

Thanks!

Cheers,
Tim


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full text: Ispell dictionary

2014-05-02 Thread Tim van der Linden
On Fri, 2 May 2014 21:12:56 +0400
Oleg Bartunov obartu...@gmail.com wrote:

Hi Oleg

Thanks for the response!

 Yes, it's normal for ispell dictionary, think about morphological dictionary.

Hmm, I see, that makes sense. I thought the morphological aspect of Ispell only 
dealt with splitting up compound words, but it also deals with deriving the 
word to a more stem-like form, correct?

As a last question on this: is there a way to stop this dictionary from 
emitting multiple lexemes? 

The reason I am asking is because, in my (fairly new) understanding of 
PostgreSQL's full text search, it is always best to have as few lexemes as 
possible saved in the vector. This to get smaller indexes and faster matching 
afterwards. Also, if you run a tsquery afterwards, you can still employ the 
power of these multiple lexemes to find a match.

Or...probably answering my own question...if I do not desire this behavior I 
should maybe not use Ispell and simply use another dictionary :)
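
Something along these lines, I suppose (configuration name made up;
english_stem is the stock snowball dictionary):

CREATE TEXT SEARCH CONFIGURATION timusan_snowball (COPY = english);

ALTER TEXT SEARCH CONFIGURATION timusan_snowball
    ALTER MAPPING FOR asciiword WITH english_stem;

SELECT to_tsvector('timusan_snowball', 'smiling');
-- 'smile':1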

Thanks again.

Cheers,
Tim


-- 
Tim van der Linden t...@shisaa.jp


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Full text: Ispell dictionary

2014-05-02 Thread Tim van der Linden
Hi Oleg

Haha, understood!

Thanks for helping me on this one.

Cheers
Tim

On May 3, 2014 7:24:08 AM GMT+09:00, Oleg Bartunov obartu...@gmail.com wrote:
Tim,

you did answer yourself - don't use ispell :)
