Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes

2010-03-22 Thread Matthew Wakeling

On Fri, 19 Mar 2010, Stephen Frost wrote:
...it has to go to an external on-disk sort (see later on, and how to 
fix that).


This was covered on this list a few months ago, in 
http://archives.postgresql.org/pgsql-performance/2009-08/msg00184.php and 
http://archives.postgresql.org/pgsql-performance/2009-08/msg00189.php


There seemed to be some consensus that allowing a materialise in front of 
an index scan might have been a good change. Was there any movement on 
this front?



Limit  (cost=66681.50..66681.50 rows=1 width=139) (actual time=7413.489..7413.489 rows=1 loops=1)
  ->  Merge Anti Join  (cost=40520.17..66681.50 rows=367793 width=139) (actual time=3705.078..7344.256 rows=101 loops=1)
        Merge Cond: ((f1.user_id = f2.ref_id) AND (f1.ref_id = f2.user_id))
        ->  Index Scan using user_ref on friends f1  (cost=0.00..26097.86 rows=2818347 width=139) (actual time=0.093..1222.592 rows=1917360 loops=1)
        ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3704.977..5043.347 rows=1990148 loops=1)
              ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3704.970..4710.703 rows=1990148 loops=1)
                    Sort Key: f2.ref_id, f2.user_id
                    Sort Method:  external merge  Disk: 49576kB
                    ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.015..508.797 rows=2818347 loops=1)
Total runtime: 7422.516 ms



If you had an index on ref_id,user_id (as well as the one on
user_id,ref_id), it'd probably be able to do in-order index traversals
on both and be really fast...  But then updates would be more expensive,
of course, since it'd have more indexes to maintain.
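For illustration, the extra index Stephen describes would look something
like this (the index name here is hypothetical):

CREATE INDEX ref_user ON friends USING btree (ref_id, user_id);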


That isn't necessarily so, at least until the issue referred to in the 
messages linked above is resolved. It depends.


Matthew

--
I've run DOOM more in the last few days than I have the last few
months.  I just love debugging ;-)  -- Linus Torvalds

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes

2010-03-22 Thread Tom Lane
Matthew Wakeling matt...@flymine.org writes:
 On Fri, 19 Mar 2010, Stephen Frost wrote:
 ...it has to go to an external on-disk sort (see later on, and how to 
 fix that).

 This was covered on this list a few months ago, in 
 http://archives.postgresql.org/pgsql-performance/2009-08/msg00184.php and 
 http://archives.postgresql.org/pgsql-performance/2009-08/msg00189.php

 There seemed to be some consensus that allowing a materialise in front of 
 an index scan might have been a good change. Was there any movement on 
 this front?

Yes, 9.0 will consider plans like

 Merge Join  (cost=0.00..14328.70 rows=100 width=488)
   Merge Cond: (a.four = b.hundred)
   ->  Index Scan using fouri on tenk1 a  (cost=0.00..1635.62 rows=1 width=244)
   ->  Materialize  (cost=0.00..1727.16 rows=1 width=244)
         ->  Index Scan using tenk1_hundred on tenk1 b  (cost=0.00..1702.16 rows=1 width=244)

Some experimentation shows that it won't insert the materialize unless
quite a bit of re-fetching is predicted (ie neither side of the join is
unique).  We might need to tweak the cost parameters once we get some
field experience with it.
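For anyone wanting to experiment, 9.0 also exposes this through a planner
setting, so the plans with and without the materialize step can be
compared. A sketch, assuming a 9.0 build and the friends table from this
thread:

SET enable_material = off;   -- disable the new materialize step
EXPLAIN ANALYZE
SELECT * FROM friends AS f1
WHERE NOT EXISTS (SELECT 1 FROM friends AS f2
                  WHERE f1.user_id = f2.ref_id AND f1.ref_id = f2.user_id);
SET enable_material = on;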

regards, tom lane



Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes

2010-03-22 Thread Justin Graf
 Message from Corin wakath...@gmail.com at 03-19-2010 01:26:35 PM
--

***snip 
The intention of the query is to find rows with no partner row. The
offset and limit are just to ignore the time needed to send the result
to the client.
---
I don't understand the point of the OFFSET; LIMIT alone will accomplish the
same thing. PG still executes the query either way; the only difference is
that without the OFFSET, PG skips the step of counting through the first
million rows before returning a record.
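In other words, something like this (a sketch: the same query as quoted
below, minus the OFFSET):

SELECT * FROM friends AS f1
WHERE NOT EXISTS (SELECT 1 FROM friends AS f2
                  WHERE f1.user_id = f2.ref_id AND f1.ref_id = f2.user_id)
LIMIT 1;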

---

SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id) OFFSET 100
LIMIT 1

Mysql uses this query plan (columns: id, select_type, table, type,
possible_keys, key, key_len, ref, rows, Extra):

1  PRIMARY             f1  index  NULL      user_ref  8  NULL                  2818860  Using where; Using index
2  DEPENDENT SUBQUERY  f2  ref    user_ref  user_ref  8  f1.ref_id,f1.user_id  1        Using index

Time: 9.8s

---
If that's a query explain from Mysql, it's worthless. The above has no
timing information: it does not tell us how long each step took, let alone
what the planner estimated each step would cost.
--

Postgres uses this query plan:
Limit  (cost=66681.50..66681.50 rows=1 width=139) (actual time=7413.489..7413.489 rows=1 loops=1)
  ->  Merge Anti Join  (cost=40520.17..66681.50 rows=367793 width=139) (actual time=3705.078..7344.256 rows=101 loops=1)
        Merge Cond: ((f1.user_id = f2.ref_id) AND (f1.ref_id = f2.user_id))
        ->  Index Scan using user_ref on friends f1  (cost=0.00..26097.86 rows=2818347 width=139) (actual time=0.093..1222.592 rows=1917360 loops=1)
        ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3704.977..5043.347 rows=1990148 loops=1)
              ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3704.970..4710.703 rows=1990148 loops=1)
                    Sort Key: f2.ref_id, f2.user_id
                    Sort Method:  external merge  Disk: 49576kB
                    ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.015..508.797 rows=2818347 loops=1)
Total runtime: 7422.516 ms

---
We can see each step PG takes and make informed decisions about which part
of the query is slow. We can see that sorting the rows takes most of the time.
---
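The "external merge  Disk: 49576kB" line shows the sort spilling to disk.
One quick session-level experiment (a sketch; 128MB is an arbitrary value)
is to give the sort enough memory to run without spilling:

SET work_mem = '128MB';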

It's already faster, which is great, but I wonder why the query plan is
that complex.


It's not complex; it's showing you all the steps, which Mysql is not showing you.


I read in the pgsql docs that using a multicolumn key is almost never
needed and only a waste of cpu/space. So I dropped the multicolumn key
and added two separate keys instead:


Where is that? I don't recall reading that. PG will only use indexes
that exactly match the WHERE/join conditions.



CREATE INDEX ref1 ON friends USING btree (ref_id);
CREATE INDEX user1 ON friends USING btree (user_id);

New query plan:
Limit  (cost=70345.04..70345.04 rows=1 width=139) (actual time=43541.709..43541.709 rows=1 loops=1)
  ->  Merge Anti Join  (cost=40520.27..70345.04 rows=367793 width=139) (actual time=3356.694..43467.818 rows=101 loops=1)
        Merge Cond: (f1.user_id = f2.ref_id)
        Join Filter: (f1.ref_id = f2.user_id)
---
Take note: the merge condition has changed. It now joins only on
f1.user_id = f2.ref_id, then filters the results using the other half of
the AND condition. Put the multicolumn index back.
---
        ->  Index Scan using user1 on friends f1  (cost=0.00..26059.79 rows=2818347 width=139) (actual time=0.031..1246.668 rows=1917365 loops=1)
        ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3356.615..14941.405 rows=130503729 loops=1)
---
Take note of what happened here: the materialize step now produces over
130 million rows, because the join is no longer limited the way it was
before.
Did you run this query against Mysql with the same kind of indexes?
---
              ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3356.611..4127.435 rows=1990160 loops=1)
                    Sort Key: f2.ref_id
                    Sort Method:  external merge  Disk: 49560kB
                    ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.012..496.174 rows=2818347 loops=1)
Total runtime: 43550.187 ms
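As noted above, putting the multicolumn index back should restore the
earlier, faster plan. Using the definition implied by the first plan (the
index name user_ref comes from there):

CREATE INDEX user_ref ON friends USING btree (user_id, ref_id);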

I also wonder why it makes a difference when adding a LIMIT clause to
the subselect in an EXISTS subselect. Shouldn't pgsql always stop after
finding a row? In mysql it makes no difference in speed; pgsql even
gets slower when adding a LIMIT to the EXISTS subselect (I hoped it
would get faster?!).



LIMIT is applied last, after all the major work is done. Also worth noting:
a LIMIT inside an EXISTS subselect prevents the planner from converting the
subquery into the merge anti join seen above, so it falls back to executing
the subquery once per candidate row, which is why it gets slower.



SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id LIMIT 1) OFFSET
100 LIMIT 1

Limit  (cost=6389166.19..6389172.58 rows=1 width=139) (actual time=54540.356..54540.356 rows=1 loops=1)
  ->  Seq Scan on friends f1  (cost=0.00..9003446.87 rows=1409174 width=139) (actual time=0.511..54460.006 rows=101 loops=1)
        Filter: 

Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes

2010-03-19 Thread Kevin Grittner
Corin wakath...@gmail.com wrote:
 
 It's already faster, which is great, but I wonder why the query
 plan is that complex.
 
Because that's the plan, out of all the ways the planner knows to
get the requested result set, which was estimated to cost the least.
If it isn't actually the fastest, that might suggest that you
should adjust your costing model.  Could you tell us more about the
machine?  Especially useful would be the amount of RAM, what else is
running on the machine, and what the disk system looks like.  The
default configuration is almost never optimal for serious production
-- it's designed to behave reasonably if someone installs on their
desktop PC to try it out.
 
 I read in the pgsql docs that using a multicolumn key is almost
 never needed and only a waste of cpu/space.
 
Where in the docs did you see that?
 
 As in my previous tests, this is only a testing environment: so
 all data is in memory, no disk activity involved at all, no swap
 etc.
 
Ah, that suggests possible configuration changes.  You can try these
out in the session to see the impact, and modify postgresql.conf if
they work out.
 
seq_page_cost = 0.01
random_page_cost = 0.01
effective_cache_size = about 3/4 of your machine's RAM
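 
For example, in a session (the effective_cache_size value below is a
placeholder assuming a machine with roughly 8GB of RAM; adjust to yours):
 
SET seq_page_cost = 0.01;
SET random_page_cost = 0.01;
SET effective_cache_size = '6GB';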
 
Also, make sure that you run VACUUM ANALYZE against the table after
initially populating it and before your benchmarks; otherwise you
might inadvertently include transient or one-time maintenance costs
in some benchmarks, or distort behavior by not yet having the
statistics present for sane optimizer choices.
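 
For example:
 
VACUUM ANALYZE friends;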
 
-Kevin
