Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes
On Fri, 19 Mar 2010, Stephen Frost wrote:
> ...it has to go to an external on-disk sort (see later on, and how to
> fix that).

This was covered on this list a few months ago, in
http://archives.postgresql.org/pgsql-performance/2009-08/msg00184.php and
http://archives.postgresql.org/pgsql-performance/2009-08/msg00189.php

There seemed to be some consensus that allowing a materialise in front of
an index scan might have been a good change. Was there any movement on
this front?

>  Limit  (cost=66681.50..66681.50 rows=1 width=139) (actual time=7413.489..7413.489 rows=1 loops=1)
>    ->  Merge Anti Join  (cost=40520.17..66681.50 rows=367793 width=139) (actual time=3705.078..7344.256 rows=101 loops=1)
>          Merge Cond: ((f1.user_id = f2.ref_id) AND (f1.ref_id = f2.user_id))
>          ->  Index Scan using user_ref on friends f1  (cost=0.00..26097.86 rows=2818347 width=139) (actual time=0.093..1222.592 rows=1917360 loops=1)
>          ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3704.977..5043.347 rows=1990148 loops=1)
>                ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3704.970..4710.703 rows=1990148 loops=1)
>                      Sort Key: f2.ref_id, f2.user_id
>                      Sort Method:  external merge  Disk: 49576kB
>                      ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.015..508.797 rows=2818347 loops=1)
>  Total runtime: 7422.516 ms

> If you had an index on ref_id,user_id (as well as the one on
> user_id,ref_id), it'd probably be able to do in-order index traversals
> on both and be really fast... But then updates would be more expensive,
> of course, since it'd have more indexes to maintain.

That isn't necessarily so, until the issue referred to in the above linked
messages is resolved. It depends.

Matthew

--
I've run DOOM more in the last few days than I have the last few months. I
just love debugging ;-)  -- Linus Torvalds

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
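[Editor's sketch: the extra index Stephen suggests could be created along the following lines. The index name ref_user is hypothetical, not from the thread.]

```sql
-- Hypothetical: a reverse-order composite index alongside the existing
-- user_ref (user_id, ref_id) index, so both sides of the merge anti join
-- could be read in index order instead of sorting f2 on disk.
CREATE INDEX ref_user ON friends USING btree (ref_id, user_id);
```

Whether this actually removes the external sort depends on the materialise-in-front-of-an-index-scan issue discussed in the linked messages.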
Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes
Matthew Wakeling <matt...@flymine.org> writes:
> On Fri, 19 Mar 2010, Stephen Frost wrote:
>> ...it has to go to an external on-disk sort (see later on, and how to
>> fix that).

> This was covered on this list a few months ago, in
> http://archives.postgresql.org/pgsql-performance/2009-08/msg00184.php and
> http://archives.postgresql.org/pgsql-performance/2009-08/msg00189.php

> There seemed to be some consensus that allowing a materialise in front of
> an index scan might have been a good change. Was there any movement on
> this front?

Yes, 9.0 will consider plans like

 Merge Join  (cost=0.00..14328.70 rows=100 width=488)
   Merge Cond: (a.four = b.hundred)
   ->  Index Scan using fouri on tenk1 a  (cost=0.00..1635.62 rows=1 width=244)
   ->  Materialize  (cost=0.00..1727.16 rows=1 width=244)
         ->  Index Scan using tenk1_hundred on tenk1 b  (cost=0.00..1702.16 rows=1 width=244)

Some experimentation shows that it won't insert the materialize unless
quite a bit of re-fetching is predicted (ie neither side of the join is
unique).  We might need to tweak the cost parameters once we get some
field experience with it.

			regards, tom lane
Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes
Message from Corin <wakath...@gmail.com> at 03-19-2010 01:26:35 PM:

> The intention of the query is to find rows with no partner row. The
> offset and limit are just to ignore the time needed to send the result
> to the client.

I don't understand the point of the OFFSET; LIMIT alone will accomplish
the same thing. PG will still execute the query; the only difference is
that PG will skip the step of counting through the first million rows
before returning a record.

> SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
> f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id) OFFSET 100 LIMIT 1
>
> Mysql uses this query plan:
> 1  PRIMARY            f1  index  NULL      user_ref  8  NULL                   2818860  Using where; Using index
> 2  DEPENDENT SUBQUERY f2  ref    user_ref  user_ref  8  f1.ref_id,f1.user_id  1        Using index
> Time: 9.8s

If that's a query explain from MySQL, it's worthless. The above has no
timing information: it does not tell us how long each step is taking,
let alone what the planner thought it would take to make the query work.

> Postgre uses this query plan:
> Limit  (cost=66681.50..66681.50 rows=1 width=139) (actual time=7413.489..7413.489 rows=1 loops=1)
>   ->  Merge Anti Join  (cost=40520.17..66681.50 rows=367793 width=139) (actual time=3705.078..7344.256 rows=101 loops=1)
>         *Merge Cond: ((f1.user_id = f2.ref_id) AND (f1.ref_id = f2.user_id))*
>         ->  Index Scan using user_ref on friends f1  (cost=0.00..26097.86 rows=2818347 width=139) (actual time=0.093..1222.592 rows=1917360 loops=1)
>         ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3704.977..5043.347 rows=1990148 loops=1)
>               ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3704.970..4710.703 rows=1990148 loops=1)
>                     Sort Key: f2.ref_id, f2.user_id
>                     Sort Method:  external merge  Disk: 49576kB
>                     ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.015..508.797 rows=2818347 loops=1)
> Total runtime: 7422.516 ms

We can see each step PG takes and make informed decisions about which
part of the query is slow. We can see that sorting the rows takes most
of the time.

> It's already faster, which is great, but I wonder why the query plan is
> that complex.

It's not complex; it is showing you all the steps, which MySQL is not
showing you.

> I read in the pqsql docs that using a multicolumn key is almost never
> needed and only a waste of cpu/space. So I dropped the multicolumn key
> and added two separate keys instead:

Where is that at??? I don't recall reading that. PG will only use
indexes that exactly match the WHERE/JOIN conditions.

> CREATE INDEX ref1 ON friends USING btree (ref_id);
> CREATE INDEX user1 ON friends USING btree (user_id);
>
> New query plan:
> Limit  (cost=70345.04..70345.04 rows=1 width=139) (actual time=43541.709..43541.709 rows=1 loops=1)
>   ->  Merge Anti Join  (cost=40520.27..70345.04 rows=367793 width=139) (actual time=3356.694..43467.818 rows=101 loops=1)
>         *Merge Cond: (f1.user_id = f2.ref_id)*
>         *Join Filter: (f1.ref_id = f2.user_id)*

Take note: the merge has changed. It now joins on f1.user_id = f2.ref_id
and then filters the results down using the AND condition. Put the index
back.

>         ->  Index Scan using user1 on friends f1  (cost=0.00..26059.79 rows=2818347 width=139) (actual time=0.031..1246.668 rows=1917365 loops=1)
>         ->  Materialize  (cost=40520.17..40555.40 rows=2818347 width=8) (actual time=3356.615..14941.405 *rows=130503729* loops=1)

Take note of what happened here: rows=130503729, because the join is no
longer limited as it was before. Did you run this query against MySQL
with the same kind of indexes?

>               ->  Sort  (cost=40520.17..40527.21 rows=2818347 width=8) (actual time=3356.611..4127.435 rows=1990160 loops=1)
>                     Sort Key: f2.ref_id
>                     Sort Method:  external merge  Disk: 49560kB
>                     ->  Seq Scan on friends f2  (cost=0.00..18143.18 rows=2818347 width=8) (actual time=0.012..496.174 rows=2818347 loops=1)
> Total runtime: 43550.187 ms
>
> I also wonder why it makes a difference when adding a LIMIT clause to
> the subselect in an EXISTS subselect. Shouldn't pgsql always stop after
> finding a row?
>
> In mysql it makes no difference in speed; pgsql even gets slower when
> adding a LIMIT to the EXISTS subselect (I hoped it would get faster?!).

Limits occur last, after all the major work is done.

> SELECT * FROM friends AS f1 WHERE NOT EXISTS (SELECT 1 FROM friends AS
> f2 WHERE f1.user_id=f2.ref_id AND f1.ref_id=f2.user_id LIMIT 1) OFFSET
> 100 LIMIT 1
>
> Limit  (cost=6389166.19..6389172.58 rows=1 width=139) (actual time=54540.356..54540.356 rows=1 loops=1)
>   ->  Seq Scan on friends f1  (cost=0.00..9003446.87 rows=1409174 width=139) (actual time=0.511..54460.006 rows=101 loops=1)
>         Filter:
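[Editor's sketch: the "put the index back" advice above amounts to dropping the two single-column indexes and restoring the original composite index, along these lines. Index names are taken from the thread.]

```sql
-- Drop the single-column indexes that forced the weaker
-- one-column merge condition plus a join filter.
DROP INDEX ref1;
DROP INDEX user1;

-- Restore the composite index so the planner can merge on
-- both columns at once, as in the first (faster) plan.
CREATE INDEX user_ref ON friends USING btree (user_id, ref_id);
```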
Re: [PERFORM] too complex query plan for not exists query and multicolumn indexes
Corin <wakath...@gmail.com> wrote:

> It's already faster, which is great, but I wonder why the query plan
> is that complex.

Because that's the plan, out of all the ways the planner knows to get
the requested result set, which was estimated to cost the least.  If it
isn't actually the fastest, that might suggest that you should adjust
your costing model.  Could you tell us more about the machine?
Especially useful would be the amount of RAM, what else is running on
the machine, and what the disk system looks like.  The default
configuration is almost never optimal for serious production -- it's
designed to behave reasonably if someone installs on their desktop PC
to try it out.

> I read in the pqsql docs that using a multicolumn key is almost never
> needed and only a waste of cpu/space.

Where in the docs did you see that?

> As in my previous tests, this is only a testing environment: so all
> data is in memory, no disk activity involved at all, no swap etc.

Ah, that suggests possible configuration changes.  You can try these
out in the session to see the impact, and modify postgresql.conf if
they work out.

seq_page_cost = 0.01
random_page_cost = 0.01
effective_cache_size = <about 3/4 of your machine's RAM>

Also, make sure that you run VACUUM ANALYZE against the table after
initially populating it and before your benchmarks; otherwise you might
inadvertently include transient or one-time maintenance costs in some
benchmarks, or distort behavior by not yet having the statistics
present for sane optimizer choices.

-Kevin
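[Editor's sketch: Kevin's suggestions could be tried per-session along these lines. The '6GB' value is an assumption standing in for roughly 3/4 of the machine's RAM; adjust to your hardware.]

```sql
-- Session-level experiment with the suggested planner cost settings;
-- persist them in postgresql.conf only if they help.
SET seq_page_cost = 0.01;
SET random_page_cost = 0.01;
SET effective_cache_size = '6GB';  -- assumption: ~3/4 of an 8GB machine

-- Refresh statistics after the initial load, before benchmarking.
VACUUM ANALYZE friends;
```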