Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


I looked into the distribution of the filenames, in particular I ran a  
query to see how for into the table the 1st filename would be found.


photoshelter=# select count(*) from ps_image where lower(file_name) <  
'a-400-001.jpg';

  count
-
 8915832


As you can see the first row is almost 9 million rows into the table.  
(a-400-001.jpg is the first filename returned by the query) which  
implies the distribution is heavily non-uniform. (For uniform  
distribution the first row should have been within the first 500 rows,  
give or take)


I tried the query you suggest below but it did not work well, but  
using it as inspiration the following query does work:


photoshelter=# explain analyze select * from (
 SELECT ID, lower(file_name) as lfn FROM ps_image WHERE id IN
   (SELECT image_id FROM ps_gallery_image WHERE
gallery_id='G7ejKGoWS_cY')
 offset 0
 ) ss
ORDER BY lfn ASC
limit 1;
 QUERY 
 PLAN


 Limit  (cost=158946.43..158946.43 rows=1 width=52) (actual  
time=1539.615..1539.615 rows=1 loops=1)
   ->  Sort  (cost=158946.43..159044.80 rows=39350 width=52) (actual  
time=1539.613..1539.613 rows=1 loops=1)

 Sort Key: (lower((ps_image.file_name)::text))
 Sort Method:  top-N heapsort  Memory: 17kB
 ->  Limit  (cost=43197.34..158356.18 rows=39350 width=36)  
(actual time=74.530..1499.328 rows=50237 loops=1)
   ->  Nested Loop  (cost=43197.34..158356.18 rows=39350  
width=36) (actual time=74.529..1475.378 rows=50237 loops=1)
 ->  HashAggregate  (cost=43197.34..43590.84  
rows=39350 width=17) (actual time=74.468..110.638 rows=50237 loops=1)
   ->  Index Scan using gi_gallery_id on  
ps_gallery_image  (cost=0.00..43072.80 rows=49816 width=17) (actual  
time=0.049..46.926 rows=50237 loops=1)
 Index Cond: ((gallery_id)::text =  
'G7ejKGoWS_cY'::text)
 ->  Index Scan using ps_image_pkey on ps_image   
(cost=0.00..2.90 rows=1 width=36) (actual time=0.025..0.025 rows=1  
loops=50237)
   Index Cond: ((ps_image.id)::text =  
(ps_gallery_image.image_id)::text)

 Total runtime: 1540.032 ms
(12 rows)

Interestingly to me, while the  'offest 0'  did not work as an  
optimization fence in the query you provided, it works as one in the  
query above. I had tried removing it from the above query, and the  
plan reverted back to the bad form.


The non-uniform distribution leads me to another question, would it be  
possible to use partial indexes or some other technique to help the  
planner. Or would the fact that the relevant information, gallery ids  
and filenames, are split across two tables foil any attempt?


In any case, I'd like to thank everyone for their input. The query  
above will be a big help.


be well,

  James


On May 1, 2009, at 10:57 AM, Tom Lane wrote:


James Nelson  writes:

Hi, I'm hoping you guys can help with improving this query I'm having
a problem with. The main problem is that the query plan changes
depending on the value of the LIMIT clause, with small values using a
poor plan and running very slowly. The two times are roughly 5  
minutes

for the bad plan and 1.5 secs for the good plan.



photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN
(SELECT image_id FROM ps_gallery_image WHERE
gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;


The problem here is an overoptimistic assessment of how long it will
take to find a match to gallery_id='G7ejKGoWS_cY' while searching
in file_name order.  You might be able to fix that by increasing the
statistics target for gallery_id.  However, if the issue is not so
much how many occurrences of 'G7ejKGoWS_cY' there are as that
they're all associated with high values of file_name, that won't
help.  In that case I think it would work to restructure the query
along the lines of

select * from (
 SELECT ID FROM ps_image WHERE id IN
   (SELECT image_id FROM ps_gallery_image WHERE
gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC
 offset 0
 ) ss
limit 1;

The OFFSET should act as an optimization fence to prevent the LIMIT
from being used in the planning of the subquery.

regards, tom lane



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


The 'in' form and 'join' form produce identical plans for both limit  
and non-limit versions of the query, which I actually think reflects  
well on the query planner. I also tried a form of the query with the  
subselect in the from clause to try and force the order the tables  
were evaluated but the query planner saw through that one too.  
Basically this query:


SELECT ps_image.id FROM
	(SELECT image_id FROM ps_gallery_image WHERE  
gallery_id='G7ejKGoWS_cY') as ids
INNER JOIN ps_image on ps_image.id = ids.image_id ORDER BY  
LOWER(FILE_NAME) ASC limit 1;


produces the same plan as the 'in' or the 'join' form when the limit  
clause is present.


 James



On May 1, 2009, at 4:32 AM, Grzegorz Jaśkiewicz wrote:


use join instead of where in();



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


I had tried using exists but both the forms of the query (with limit  
and without) performed much worse.


   James

On May 1, 2009, at 4:22 AM, Adam Ruth wrote:

You could try changing the IN to an EXISTS, that may alter how the  
optimizer weighs the limit.



SELECT ID FROM ps_image WHERE EXISTS (SELECT null FROM  
ps_gallery_image WHERE gallery_id ='G7ejKGoWS_cY' and image_id =  
ps_image.id) ORDER BY LOWER(FILE_NAME) ASC


On 30/04/2009, at 3:51 AM, James Nelson wrote:



Hi, I'm hoping you guys can help with improving this query I'm  
having a problem with. The main problem is that the query plan  
changes depending on the value of the LIMIT clause, with small  
values using a poor plan and running very slowly. The two times are  
roughly 5 minutes for the bad plan and 1.5 secs for the good plan.


I have read a little about how the query planner takes into account  
the limit clause, and I can see the effect this has on the costs  
shown by explain. The problem is that the estimated cost ends up  
being wildly inaccurate. I'm not sure if this a problem with the  
planner or if it is something I am doing wrong on my end.


the query (without the limit clause):

SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM  
ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY  
LOWER(FILE_NAME) ASC


The ps_image table has about 24 million rows, ps_gallery_image has  
about 14 million. The query above produces roughly 50 thousand rows.


When looking at the explain with the limit, I can see the  
interpolation that the planner does for the limit node (arriving at  
a final cost of 458.32 for this example) but not sure why it is  
inaccurate compared to the actual times.


Thanks in advance for taking a look at this, let me know if there  
is additional information I should provide.


Some information about the tables  and the explains follow below.

James Nelson

[ja...@db2 ~] psql --version
psql (PostgreSQL) 8.3.5
contains support for command-line editing

photoshelter=# \d ps_image
Table "public.ps_image"
 Column |   Type   | Modifiers
---+-- 
+---

id| character varying(16)| not null
user_id   | character varying(16)|
album_id  | character varying(16)| not null
parent_id | character varying(16)|
file_name | character varying(200)   |
file_size | bigint   |
 20 rows snipped 
Indexes:
 "ps_image_pkey" PRIMARY KEY, btree (id)
 "i_file_name_l" btree (lower(file_name::text))
 indexes, fk constraints and triggers snipped 

photoshelter=# \d ps_gallery_image
   Table "public.ps_gallery_image"
 Column |   Type   |   Modifiers
---+--+
gallery_id| character varying(16)| not null
image_id  | character varying(16)| not null
display_order | integer  | not null default 0
caption   | character varying(2000)  |
ctime | timestamp with time zone | not null default now()
mtime | timestamp with time zone | not null default now()
id| character varying(16)| not null
Indexes:
 "ps_gallery_image_pkey" PRIMARY KEY, btree (id)
 "gi_gallery_id" btree (gallery_id)
 "gi_image_id" btree (image_id)
Foreign-key constraints:
 "ps_gallery_image_gallery_id_fkey" FOREIGN KEY (gallery_id)  
REFERENCES ps_gallery(id) ON DELETE CASCADE
 "ps_gallery_image_image_id_fkey" FOREIGN KEY (image_id) REFERENCES  
ps_image(id) ON DELETE CASCADE

Triggers:
 ps_image_gi_sync AFTER INSERT OR DELETE OR UPDATE ON  
ps_gallery_image FOR EACH ROW EXECUTE PROCEDURE ps_image_sync()


= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
=

explain analyze for bad plan

photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN  
(SELECT image_id FROM ps_gallery_image WHERE  
gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;
  QUERY 
 PLAN

-
Limit  (cost=0.00..458.32 rows=1 width=36) (actual  
time=709831.847..709831.847 rows=1 loops=1)
->  Nested Loop IN Join  (cost=0.00..17700128.78 rows=38620  
width=36) (actual time=709831.845..709831.845 rows=1 loops=1)
  ->  Index Scan using i_file_name_l on ps_image   
(cost=0.00..1023863.22 rows=24460418 width=36) (actual  
time=0.063..271167.293 rows=8876340 loops=1)
  ->  Index Scan using gi_image_id on ps_gallery_image   
(cost=0.00..0.85 rows=1 width=17) (actual time=0.048..0.048 rows=0  
loops=8876340)
Index Cond: ((ps_gallery_image.image_id)::text =  
(ps_image.id)::text)
Filte

Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Tom Lane
James Nelson  writes:
> Hi, I'm hoping you guys can help with improving this query I'm having  
> a problem with. The main problem is that the query plan changes  
> depending on the value of the LIMIT clause, with small values using a  
> poor plan and running very slowly. The two times are roughly 5 minutes  
> for the bad plan and 1.5 secs for the good plan.

> photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN  
> (SELECT image_id FROM ps_gallery_image WHERE  
> gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;

The problem here is an overoptimistic assessment of how long it will
take to find a match to gallery_id='G7ejKGoWS_cY' while searching
in file_name order.  You might be able to fix that by increasing the
statistics target for gallery_id.  However, if the issue is not so
much how many occurrences of 'G7ejKGoWS_cY' there are as that
they're all associated with high values of file_name, that won't
help.  In that case I think it would work to restructure the query
along the lines of

select * from (
  SELECT ID FROM ps_image WHERE id IN  
(SELECT image_id FROM ps_gallery_image WHERE  
 gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC
  offset 0
  ) ss
limit 1;

The OFFSET should act as an optimization fence to prevent the LIMIT
from being used in the planning of the subquery.

regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Grzegorz Jaśkiewicz
EXISTS won't help much either, postgresql is not too fast, when it
comes to that sort of approach.
join is always going to be fast, it is about time you learn joins and
use them ;)

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Grzegorz Jaśkiewicz
use join instead of where in();

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] bad plan and LIMIT

2009-05-01 Thread Adam Ruth
You could try changing the IN to an EXISTS, that may alter how the  
optimizer weighs the limit.


SELECT ID FROM ps_image WHERE EXISTS (SELECT null FROM  
ps_gallery_image WHERE gallery_id ='G7ejKGoWS_cY' and image_id =  
ps_image.id) ORDER BY LOWER(FILE_NAME) ASC


On 30/04/2009, at 3:51 AM, James Nelson wrote:



Hi, I'm hoping you guys can help with improving this query I'm  
having a problem with. The main problem is that the query plan  
changes depending on the value of the LIMIT clause, with small  
values using a poor plan and running very slowly. The two times are  
roughly 5 minutes for the bad plan and 1.5 secs for the good plan.


I have read a little about how the query planner takes into account  
the limit clause, and I can see the effect this has on the costs  
shown by explain. The problem is that the estimated cost ends up  
being wildly inaccurate. I'm not sure if this a problem with the  
planner or if it is something I am doing wrong on my end.


the query (without the limit clause):

SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM  
ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY  
LOWER(FILE_NAME) ASC


The ps_image table has about 24 million rows, ps_gallery_image has  
about 14 million. The query above produces roughly 50 thousand rows.


When looking at the explain with the limit, I can see the  
interpolation that the planner does for the limit node (arriving at  
a final cost of 458.32 for this example) but not sure why it is  
inaccurate compared to the actual times.


Thanks in advance for taking a look at this, let me know if there is  
additional information I should provide.


Some information about the tables  and the explains follow below.

James Nelson

[ja...@db2 ~] psql --version
psql (PostgreSQL) 8.3.5
contains support for command-line editing

photoshelter=# \d ps_image
 Table "public.ps_image"
  Column |   Type   | Modifiers
---+-- 
+---

id| character varying(16)| not null
user_id   | character varying(16)|
album_id  | character varying(16)| not null
parent_id | character varying(16)|
file_name | character varying(200)   |
file_size | bigint   |
 20 rows snipped 
Indexes:
  "ps_image_pkey" PRIMARY KEY, btree (id)
  "i_file_name_l" btree (lower(file_name::text))
 indexes, fk constraints and triggers snipped 

photoshelter=# \d ps_gallery_image
Table "public.ps_gallery_image"
  Column |   Type   |   Modifiers
---+--+
gallery_id| character varying(16)| not null
image_id  | character varying(16)| not null
display_order | integer  | not null default 0
caption   | character varying(2000)  |
ctime | timestamp with time zone | not null default now()
mtime | timestamp with time zone | not null default now()
id| character varying(16)| not null
Indexes:
  "ps_gallery_image_pkey" PRIMARY KEY, btree (id)
  "gi_gallery_id" btree (gallery_id)
  "gi_image_id" btree (image_id)
Foreign-key constraints:
  "ps_gallery_image_gallery_id_fkey" FOREIGN KEY (gallery_id)  
REFERENCES ps_gallery(id) ON DELETE CASCADE
  "ps_gallery_image_image_id_fkey" FOREIGN KEY (image_id) REFERENCES  
ps_image(id) ON DELETE CASCADE

Triggers:
  ps_image_gi_sync AFTER INSERT OR DELETE OR UPDATE ON  
ps_gallery_image FOR EACH ROW EXECUTE PROCEDURE ps_image_sync()


= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
==

explain analyze for bad plan

photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN  
(SELECT image_id FROM ps_gallery_image WHERE  
gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;
   QUERY 
 PLAN

-
Limit  (cost=0.00..458.32 rows=1 width=36) (actual  
time=709831.847..709831.847 rows=1 loops=1)
 ->  Nested Loop IN Join  (cost=0.00..17700128.78 rows=38620  
width=36) (actual time=709831.845..709831.845 rows=1 loops=1)
   ->  Index Scan using i_file_name_l on ps_image   
(cost=0.00..1023863.22 rows=24460418 width=36) (actual  
time=0.063..271167.293 rows=8876340 loops=1)
   ->  Index Scan using gi_image_id on ps_gallery_image   
(cost=0.00..0.85 rows=1 width=17) (actual time=0.048..0.048 rows=0  
loops=8876340)
 Index Cond: ((ps_gallery_image.image_id)::text =  
(ps_image.id)::text)
 Filter: ((ps_gallery_image.gallery_id)::text =  
'G7ejKGoWS_cY'::text)

Total runtime: 709831.932 ms

= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
=

[PERFORM] bad plan and LIMIT

2009-05-01 Thread James Nelson


Hi, I'm hoping you guys can help with improving this query I'm having  
a problem with. The main problem is that the query plan changes  
depending on the value of the LIMIT clause, with small values using a  
poor plan and running very slowly. The two times are roughly 5 minutes  
for the bad plan and 1.5 secs for the good plan.


I have read a little about how the query planner takes into account  
the limit clause, and I can see the effect this has on the costs shown  
by explain. The problem is that the estimated cost ends up being  
wildly inaccurate. I'm not sure if this a problem with the planner or  
if it is something I am doing wrong on my end.


the query (without the limit clause):

SELECT ID FROM ps_image WHERE id IN (SELECT image_id FROM  
ps_gallery_image WHERE gallery_id='G7ejKGoWS_cY') ORDER BY  
LOWER(FILE_NAME) ASC


The ps_image table has about 24 million rows, ps_gallery_image has  
about 14 million. The query above produces roughly 50 thousand rows.


When looking at the explain with the limit, I can see the  
interpolation that the planner does for the limit node (arriving at a  
final cost of 458.32 for this example) but not sure why it is  
inaccurate compared to the actual times.


Thanks in advance for taking a look at this, let me know if there is  
additional information I should provide.


Some information about the tables  and the explains follow below.

James Nelson

[ja...@db2 ~] psql --version
psql (PostgreSQL) 8.3.5
contains support for command-line editing

photoshelter=# \d ps_image
  Table "public.ps_image"
   Column |   Type   | Modifiers
---+-- 
+---

id| character varying(16)| not null
user_id   | character varying(16)|
album_id  | character varying(16)| not null
parent_id | character varying(16)|
file_name | character varying(200)   |
file_size | bigint   |
 20 rows snipped 
Indexes:
   "ps_image_pkey" PRIMARY KEY, btree (id)
   "i_file_name_l" btree (lower(file_name::text))
 indexes, fk constraints and triggers snipped 

photoshelter=# \d ps_gallery_image
 Table "public.ps_gallery_image"
   Column |   Type   |   Modifiers
---+--+
gallery_id| character varying(16)| not null
image_id  | character varying(16)| not null
display_order | integer  | not null default 0
caption   | character varying(2000)  |
ctime | timestamp with time zone | not null default now()
mtime | timestamp with time zone | not null default now()
id| character varying(16)| not null
Indexes:
   "ps_gallery_image_pkey" PRIMARY KEY, btree (id)
   "gi_gallery_id" btree (gallery_id)
   "gi_image_id" btree (image_id)
Foreign-key constraints:
   "ps_gallery_image_gallery_id_fkey" FOREIGN KEY (gallery_id)  
REFERENCES ps_gallery(id) ON DELETE CASCADE
   "ps_gallery_image_image_id_fkey" FOREIGN KEY (image_id) REFERENCES  
ps_image(id) ON DELETE CASCADE

Triggers:
   ps_image_gi_sync AFTER INSERT OR DELETE OR UPDATE ON  
ps_gallery_image FOR EACH ROW EXECUTE PROCEDURE ps_image_sync()


= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 


explain analyze for bad plan

photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN  
(SELECT image_id FROM ps_gallery_image WHERE  
gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 1;
QUERY 
 PLAN

-
Limit  (cost=0.00..458.32 rows=1 width=36) (actual  
time=709831.847..709831.847 rows=1 loops=1)
  ->  Nested Loop IN Join  (cost=0.00..17700128.78 rows=38620  
width=36) (actual time=709831.845..709831.845 rows=1 loops=1)
->  Index Scan using i_file_name_l on ps_image   
(cost=0.00..1023863.22 rows=24460418 width=36) (actual  
time=0.063..271167.293 rows=8876340 loops=1)
->  Index Scan using gi_image_id on ps_gallery_image   
(cost=0.00..0.85 rows=1 width=17) (actual time=0.048..0.048 rows=0  
loops=8876340)
  Index Cond: ((ps_gallery_image.image_id)::text =  
(ps_image.id)::text)
  Filter: ((ps_gallery_image.gallery_id)::text =  
'G7ejKGoWS_cY'::text)

Total runtime: 709831.932 ms

= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 
= 


explain analyze for good plan

photoshelter=# explain  analyze SELECT ID FROM ps_image WHERE id IN  
(SELECT image_id FROM ps_gallery_image WHERE  
gallery_id='G7ejKGoWS_cY') ORDER BY LOWER(FILE_NAME) ASC limit 600;