I looked into the distribution of the filenames, in particular I ran a query to see how for into the table the 1st filename would be found.

photoshelter=# select count(*) from ps_image where lower(file_name) < 'a-400-001.jpg';
  count
---------
 8915832


As you can see the first row is almost 9 million rows into the table. (a-400-001.jpg is the first filename returned by the query) which implies the distribution is heavily non-uniform. (For uniform distribution the first row should have been within the first 500 rows, give or take)

I tried the query you suggest below but it did not work well, but using it as inspiration the following query does work:

photoshelter=# explain analyze select * from (
 SELECT ID, lower(file_name) as lfn FROM ps_image WHERE id IN
   (SELECT image_id FROM ps_gallery_image WHERE
    gallery_id='G00007ejKGoWS_cY')
 offset 0
 ) ss
ORDER BY lfn ASC
limit 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=158946.43..158946.43 rows=1 width=52) (actual time=1539.615..1539.615 rows=1 loops=1) -> Sort (cost=158946.43..159044.80 rows=39350 width=52) (actual time=1539.613..1539.613 rows=1 loops=1)
         Sort Key: (lower((ps_image.file_name)::text))
         Sort Method:  top-N heapsort  Memory: 17kB
-> Limit (cost=43197.34..158356.18 rows=39350 width=36) (actual time=74.530..1499.328 rows=50237 loops=1) -> Nested Loop (cost=43197.34..158356.18 rows=39350 width=36) (actual time=74.529..1475.378 rows=50237 loops=1) -> HashAggregate (cost=43197.34..43590.84 rows=39350 width=17) (actual time=74.468..110.638 rows=50237 loops=1) -> Index Scan using gi_gallery_id on ps_gallery_image (cost=0.00..43072.80 rows=49816 width=17) (actual time=0.049..46.926 rows=50237 loops=1) Index Cond: ((gallery_id)::text = 'G00007ejKGoWS_cY'::text) -> Index Scan using ps_image_pkey on ps_image (cost=0.00..2.90 rows=1 width=36) (actual time=0.025..0.025 rows=1 loops=50237) Index Cond: ((ps_image.id)::text = (ps_gallery_image.image_id)::text)
 Total runtime: 1540.032 ms
(12 rows)

Interestingly to me, while the 'offest 0' did not work as an optimization fence in the query you provided, it works as one in the query above. I had tried removing it from the above query, and the plan reverted back to the bad form.

The non-uniform distribution leads me to another question, would it be possible to use partial indexes or some other technique to help the planner. Or would the fact that the relevant information, gallery ids and filenames, are split across two tables foil any attempt?

In any case, I'd like to thank everyone for their input. The query above will be a big help.

be well,

  James


On May 1, 2009, at 10:57 AM, Tom Lane wrote:

James Nelson <ja...@photoshelter.com> writes:
Hi, I'm hoping you guys can help with improving this query I'm having
a problem with. The main problem is that the query plan changes
depending on the value of the LIMIT clause, with small values using a
poor plan and running very slowly. The two times are roughly 5 minutes
for the bad plan and 1.5 secs for the good plan.

photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN
(SELECT image_id FROM ps_gallery_image WHERE
gallery_id='G00007ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;

The problem here is an overoptimistic assessment of how long it will
take to find a match to gallery_id='G00007ejKGoWS_cY' while searching
in file_name order.  You might be able to fix that by increasing the
statistics target for gallery_id.  However, if the issue is not so
much how many occurrences of 'G00007ejKGoWS_cY' there are as that
they're all associated with high values of file_name, that won't
help.  In that case I think it would work to restructure the query
along the lines of

select * from (
 SELECT ID FROM ps_image WHERE id IN
   (SELECT image_id FROM ps_gallery_image WHERE
    gallery_id='G00007ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC
 offset 0
 ) ss
limit 1;

The OFFSET should act as an optimization fence to prevent the LIMIT
from being used in the planning of the subquery.

                        regards, tom lane


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Reply via email to