Re: [PERFORM] bad plan and LIMIT
I looked into the distribution of the filenames, in particular I ran a query to see how for into the table the 1st filename would be found. photoshelter=# select count(*) from ps_image where lower(file_name) < 'a-400-001.jpg'; count - 8915832 As you can see the first row is almost 9 million rows into the table. (a-400-001.jpg is the first filename returned by the query) which implies the distribution is heavily non-uniform. (For uniform distribution the first row should have been within the first 500 rows, give or take) I tried the query you suggest below but it did not work well, but using it as inspiration the following query does work: photoshelter=# explain analyze select * from ( SELECT ID, lower(file_name) as lfn FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') offset 0 ) ss ORDER BY lfn ASC limit 1; QUERY PLAN Limit (cost=158946.43..158946.43 rows=1 width=52) (actual time=1539.615..1539.615 rows=1 loops=1) -> Sort (cost=158946.43..159044.80 rows=39350 width=52) (actual time=1539.613..1539.613 rows=1 loops=1) Sort Key: (lower((ps_image.file_name)::text)) Sort Method: top-N heapsort Memory: 17kB -> Limit (cost=43197.34..158356.18 rows=39350 width=36) (actual time=74.530..1499.328 rows=50237 loops=1) -> Nested Loop (cost=43197.34..158356.18 rows=39350 width=36) (actual time=74.529..1475.378 rows=50237 loops=1) -> HashAggregate (cost=43197.34..43590.84 rows=39350 width=17) (actual time=74.468..110.638 rows=50237 loops=1) -> Index Scan using gi_gallery_id on ps_gallery_image (cost=0.00..43072.80 rows=49816 width=17) (actual time=0.049..46.926 rows=50237 loops=1) Index Cond: ((gallery_id)::text = 'G7ejKGoWS_cY'::text) -> Index Scan using ps_image_pkey on ps_image (cost=0.00..2.90 rows=1 width=36) (actual time=0.025..0.025 rows=1 loops=50237) Index Cond: ((ps_image.id)::text = (ps_gallery_image.image_id)::text) Total runtime: 1540.032 ms (12 rows) Interestingly to me, while the 'offest 0' did not work as an optimization fence in the query you provided, it works as one in the query above. I had tried removing it from the above query, and the plan reverted back to the bad form. The non-uniform distribution leads me to another question, would it be possible to use partial indexes or some other technique to help the planner. Or would the fact that the relevant information, gallery ids and filenames, are split across two tables foil any attempt? In any case, I'd like to thank everyone for their input. The query above will be a big help. be well, James On May 1, 2009, at 10:57 AM, Tom Lane wrote: James Nelson writes: Hi, I'm hoping you guys can help with improving this query I'm having a problem with. The main problem is that the query plan changes depending on the value of the LIMIT clause, with small values using a poor plan and running very slowly. The two times are roughly 5 minutes for the bad plan and 1.5 secs for the good plan. photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1; The problem here is an overoptimistic assessment of how long it will take to find a match to gallery_id='G7ejKGoWS_cY' while searching in file_name order. You might be able to fix that by increasing the statistics target for gallery_id. However, if the issue is not so much how many occurrences of 'G7ejKGoWS_cY' there are as that they're all associated with high values of file_name, that won't help. In that case I think it would work to restructure the query along the lines of select * from ( SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC offset 0 ) ss limit 1; The OFFSET should act as an optimization fence to prevent the LIMIT from being used in the planning of the subquery. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] bad plan and LIMIT
The 'in' form and 'join' form produce identical plans for both limit and non-limit versions of the query, which I actually think reflects well on the query planner. I also tried a form of the query with the subselect in the from clause to try and force the order the tables were evaluated but the query planner saw through that one too. Basically this query: SELECT ps_image.id FROM (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') as ids INNER JOIN ps_image on ps_image.id = ids.image_id ORDER BY LOWER(FILE_NAME) ASC limit 1; produces the same plan as the 'in' or the 'join' form when the limit clause is present. James On May 1, 2009, at 4:32 AM, Grzegorz Jaśkiewicz wrote: use join instead of where in(); -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] bad plan and LIMIT
I had tried using exists but both the forms of the query (with limit and without) performed much worse. James On May 1, 2009, at 4:22 AM, Adam Ruth wrote: You could try changing the IN to an EXISTS, that may alter how the optimizer weighs the limit. SELECT ID FROM ps_image WHERE EXISTS (SELECT null FROM ps_gallery_image WHERE gallery_id ='G7ejKGoWS_cY' and image_id = ps_image.id) ORDER BY LOWER(FILE_NAME) ASC On 30/04/2009, at 3:51 AM, James Nelson wrote: Hi, I'm hoping you guys can help with improving this query I'm having a problem with. The main problem is that the query plan changes depending on the value of the LIMIT clause, with small values using a poor plan and running very slowly. The two times are roughly 5 minutes for the bad plan and 1.5 secs for the good plan. I have read a little about how the query planner takes into account the limit clause, and I can see the effect this has on the costs shown by explain. The problem is that the estimated cost ends up being wildly inaccurate. I'm not sure if this a problem with the planner or if it is something I am doing wrong on my end. the query (without the limit clause): SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC The ps_image table has about 24 million rows, ps_gallery_image has about 14 million. The query above produces roughly 50 thousand rows. When looking at the explain with the limit, I can see the interpolation that the planner does for the limit node (arriving at a final cost of 458.32 for this example) but not sure why it is inaccurate compared to the actual times. Thanks in advance for taking a look at this, let me know if there is additional information I should provide. Some information about the tables and the explains follow below. James Nelson [ja...@db2 ~] psql --version psql (PostgreSQL) 8.3.5 contains support for command-line editing photoshelter=# \d ps_image Table "public.ps_image" Column | Type | Modifiers ---+-- +--- id| character varying(16)| not null user_id | character varying(16)| album_id | character varying(16)| not null parent_id | character varying(16)| file_name | character varying(200) | file_size | bigint | 20 rows snipped Indexes: "ps_image_pkey" PRIMARY KEY, btree (id) "i_file_name_l" btree (lower(file_name::text)) indexes, fk constraints and triggers snipped photoshelter=# \d ps_gallery_image Table "public.ps_gallery_image" Column | Type | Modifiers ---+--+ gallery_id| character varying(16)| not null image_id | character varying(16)| not null display_order | integer | not null default 0 caption | character varying(2000) | ctime | timestamp with time zone | not null default now() mtime | timestamp with time zone | not null default now() id| character varying(16)| not null Indexes: "ps_gallery_image_pkey" PRIMARY KEY, btree (id) "gi_gallery_id" btree (gallery_id) "gi_image_id" btree (image_id) Foreign-key constraints: "ps_gallery_image_gallery_id_fkey" FOREIGN KEY (gallery_id) REFERENCES ps_gallery(id) ON DELETE CASCADE "ps_gallery_image_image_id_fkey" FOREIGN KEY (image_id) REFERENCES ps_image(id) ON DELETE CASCADE Triggers: ps_image_gi_sync AFTER INSERT OR DELETE OR UPDATE ON ps_gallery_image FOR EACH ROW EXECUTE PROCEDURE ps_image_sync() = = = = = = = = = = = = = = = = = explain analyze for bad plan photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1; QUERY PLAN - Limit (cost=0.00..458.32 rows=1 width=36) (actual time=709831.847..709831.847 rows=1 loops=1) -> Nested Loop IN Join (cost=0.00..17700128.78 rows=38620 width=36) (actual time=709831.845..709831.845 rows=1 loops=1) -> Index Scan using i_file_name_l on ps_image (cost=0.00..1023863.22 rows=24460418 width=36) (actual time=0.063..271167.293 rows=8876340 loops=1) -> Index Scan using gi_image_id on ps_gallery_image (cost=0.00..0.85 rows=1 width=17) (actual time=0.048..0.048 rows=0 loops=8876340) Index Cond: ((ps_gallery_image.image_id)::text = (ps_image.id)::text) Filte
Re: [PERFORM] bad plan and LIMIT
James Nelson writes: > Hi, I'm hoping you guys can help with improving this query I'm having > a problem with. The main problem is that the query plan changes > depending on the value of the LIMIT clause, with small values using a > poor plan and running very slowly. The two times are roughly 5 minutes > for the bad plan and 1.5 secs for the good plan. > photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN > (SELECT image_id FROM ps_gallery_image WHERE > gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1; The problem here is an overoptimistic assessment of how long it will take to find a match to gallery_id='G7ejKGoWS_cY' while searching in file_name order. You might be able to fix that by increasing the statistics target for gallery_id. However, if the issue is not so much how many occurrences of 'G7ejKGoWS_cY' there are as that they're all associated with high values of file_name, that won't help. In that case I think it would work to restructure the query along the lines of select * from ( SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC offset 0 ) ss limit 1; The OFFSET should act as an optimization fence to prevent the LIMIT from being used in the planning of the subquery. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] bad plan and LIMIT
EXISTS won't help much either, postgresql is not too fast, when it comes to that sort of approach. join is always going to be fast, it is about time you learn joins and use them ;) -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] bad plan and LIMIT
use join instead of where in(); -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] bad plan and LIMIT
You could try changing the IN to an EXISTS, that may alter how the optimizer weighs the limit. SELECT ID FROM ps_image WHERE EXISTS (SELECT null FROM ps_gallery_image WHERE gallery_id ='G7ejKGoWS_cY' and image_id = ps_image.id) ORDER BY LOWER(FILE_NAME) ASC On 30/04/2009, at 3:51 AM, James Nelson wrote: Hi, I'm hoping you guys can help with improving this query I'm having a problem with. The main problem is that the query plan changes depending on the value of the LIMIT clause, with small values using a poor plan and running very slowly. The two times are roughly 5 minutes for the bad plan and 1.5 secs for the good plan. I have read a little about how the query planner takes into account the limit clause, and I can see the effect this has on the costs shown by explain. The problem is that the estimated cost ends up being wildly inaccurate. I'm not sure if this a problem with the planner or if it is something I am doing wrong on my end. the query (without the limit clause): SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC The ps_image table has about 24 million rows, ps_gallery_image has about 14 million. The query above produces roughly 50 thousand rows. When looking at the explain with the limit, I can see the interpolation that the planner does for the limit node (arriving at a final cost of 458.32 for this example) but not sure why it is inaccurate compared to the actual times. Thanks in advance for taking a look at this, let me know if there is additional information I should provide. Some information about the tables and the explains follow below. James Nelson [ja...@db2 ~] psql --version psql (PostgreSQL) 8.3.5 contains support for command-line editing photoshelter=# \d ps_image Table "public.ps_image" Column | Type | Modifiers ---+-- +--- id| character varying(16)| not null user_id | character varying(16)| album_id | character varying(16)| not null parent_id | character varying(16)| file_name | character varying(200) | file_size | bigint | 20 rows snipped Indexes: "ps_image_pkey" PRIMARY KEY, btree (id) "i_file_name_l" btree (lower(file_name::text)) indexes, fk constraints and triggers snipped photoshelter=# \d ps_gallery_image Table "public.ps_gallery_image" Column | Type | Modifiers ---+--+ gallery_id| character varying(16)| not null image_id | character varying(16)| not null display_order | integer | not null default 0 caption | character varying(2000) | ctime | timestamp with time zone | not null default now() mtime | timestamp with time zone | not null default now() id| character varying(16)| not null Indexes: "ps_gallery_image_pkey" PRIMARY KEY, btree (id) "gi_gallery_id" btree (gallery_id) "gi_image_id" btree (image_id) Foreign-key constraints: "ps_gallery_image_gallery_id_fkey" FOREIGN KEY (gallery_id) REFERENCES ps_gallery(id) ON DELETE CASCADE "ps_gallery_image_image_id_fkey" FOREIGN KEY (image_id) REFERENCES ps_image(id) ON DELETE CASCADE Triggers: ps_image_gi_sync AFTER INSERT OR DELETE OR UPDATE ON ps_gallery_image FOR EACH ROW EXECUTE PROCEDURE ps_image_sync() = = = = = = = = = = = = = = = == explain analyze for bad plan photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1; QUERY PLAN - Limit (cost=0.00..458.32 rows=1 width=36) (actual time=709831.847..709831.847 rows=1 loops=1) -> Nested Loop IN Join (cost=0.00..17700128.78 rows=38620 width=36) (actual time=709831.845..709831.845 rows=1 loops=1) -> Index Scan using i_file_name_l on ps_image (cost=0.00..1023863.22 rows=24460418 width=36) (actual time=0.063..271167.293 rows=8876340 loops=1) -> Index Scan using gi_image_id on ps_gallery_image (cost=0.00..0.85 rows=1 width=17) (actual time=0.048..0.048 rows=0 loops=8876340) Index Cond: ((ps_gallery_image.image_id)::text = (ps_image.id)::text) Filter: ((ps_gallery_image.gallery_id)::text = 'G7ejKGoWS_cY'::text) Total runtime: 709831.932 ms = = = = = = = = = = = = = = = =
[PERFORM] bad plan and LIMIT
Hi, I'm hoping you guys can help with improving this query I'm having a problem with. The main problem is that the query plan changes depending on the value of the LIMIT clause, with small values using a poor plan and running very slowly. The two times are roughly 5 minutes for the bad plan and 1.5 secs for the good plan. I have read a little about how the query planner takes into account the limit clause, and I can see the effect this has on the costs shown by explain. The problem is that the estimated cost ends up being wildly inaccurate. I'm not sure if this a problem with the planner or if it is something I am doing wrong on my end. the query (without the limit clause): SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC The ps_image table has about 24 million rows, ps_gallery_image has about 14 million. The query above produces roughly 50 thousand rows. When looking at the explain with the limit, I can see the interpolation that the planner does for the limit node (arriving at a final cost of 458.32 for this example) but not sure why it is inaccurate compared to the actual times. Thanks in advance for taking a look at this, let me know if there is additional information I should provide. Some information about the tables and the explains follow below. James Nelson [ja...@db2 ~] psql --version psql (PostgreSQL) 8.3.5 contains support for command-line editing photoshelter=# \d ps_image Table "public.ps_image" Column | Type | Modifiers ---+-- +--- id| character varying(16)| not null user_id | character varying(16)| album_id | character varying(16)| not null parent_id | character varying(16)| file_name | character varying(200) | file_size | bigint | 20 rows snipped Indexes: "ps_image_pkey" PRIMARY KEY, btree (id) "i_file_name_l" btree (lower(file_name::text)) indexes, fk constraints and triggers snipped photoshelter=# \d ps_gallery_image Table "public.ps_gallery_image" Column | Type | Modifiers ---+--+ gallery_id| character varying(16)| not null image_id | character varying(16)| not null display_order | integer | not null default 0 caption | character varying(2000) | ctime | timestamp with time zone | not null default now() mtime | timestamp with time zone | not null default now() id| character varying(16)| not null Indexes: "ps_gallery_image_pkey" PRIMARY KEY, btree (id) "gi_gallery_id" btree (gallery_id) "gi_image_id" btree (image_id) Foreign-key constraints: "ps_gallery_image_gallery_id_fkey" FOREIGN KEY (gallery_id) REFERENCES ps_gallery(id) ON DELETE CASCADE "ps_gallery_image_image_id_fkey" FOREIGN KEY (image_id) REFERENCES ps_image(id) ON DELETE CASCADE Triggers: ps_image_gi_sync AFTER INSERT OR DELETE OR UPDATE ON ps_gallery_image FOR EACH ROW EXECUTE PROCEDURE ps_image_sync() = = = = = = = = = = = = = explain analyze for bad plan photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1; QUERY PLAN - Limit (cost=0.00..458.32 rows=1 width=36) (actual time=709831.847..709831.847 rows=1 loops=1) -> Nested Loop IN Join (cost=0.00..17700128.78 rows=38620 width=36) (actual time=709831.845..709831.845 rows=1 loops=1) -> Index Scan using i_file_name_l on ps_image (cost=0.00..1023863.22 rows=24460418 width=36) (actual time=0.063..271167.293 rows=8876340 loops=1) -> Index Scan using gi_image_id on ps_gallery_image (cost=0.00..0.85 rows=1 width=17) (actual time=0.048..0.048 rows=0 loops=8876340) Index Cond: ((ps_gallery_image.image_id)::text = (ps_image.id)::text) Filter: ((ps_gallery_image.gallery_id)::text = 'G7ejKGoWS_cY'::text) Total runtime: 709831.932 ms = = = = = = = = = = = = = explain analyze for good plan photoshelter=# explain analyze SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 600;