Re: [GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-31 Thread Peter Eisentraut
I don't have an answer for you, but this report looks suspiciously
similar to the one I posted the other day at
http://archives.postgresql.org/pgsql-hackers/2011-08/msg01224.php,
which, now that I think about it, also manifested itself after the
upgrade to 8.4.8.

On tis, 2011-08-30 at 15:24 -0700, Ben Chobot wrote:
 We recently took a copy of our production data (running on 8.4.2),
 scrubbed many data fields, and then loaded it onto a qa server
 (running 8.4.8). We're seeing some odd planner performance that I
 think might be a bug, though I'm hoping it's just idiocy on my part.
 I've analyzed things and looked into pg_stats and it seems as if the
 relevant columns have about the same statistics. 
 
 
 I've managed to simplify the query, but if I make it any simpler, then
 the two servers end up with the same good plan. The query is down to:
 
 
 SELECT machines.quota_purchased 
 FROM machines
 WHERE NOT deleted AND machines.user_id IN (
 SELECT id FROM users WHERE user_group_id IN (
  select 607547 offset 0
 ) OFFSET 0
   );
 
 
 
 
 (Those OFFSET 0 clauses are in there to protect us from planner
 regressions we saw when moving to 8.4. When we move to 9, they can
 hopefully go away.)
 
 
 On the production server, this returns a fairly accurate plan:
 
 
 
 QUERY PLAN
 --------------------------------------------------------------------------
  Nested Loop  (cost=843.59..1447.90 rows=243 width=8) (actual time=0.044..0.045 rows=1 loops=1)
    ->  HashAggregate  (cost=843.59..845.59 rows=200 width=4) (actual time=0.027..0.027 rows=1 loops=1)
          ->  Limit  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.024..0.025 rows=1 loops=1)
                ->  Nested Loop  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.023..0.024 rows=1 loops=1)
                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..804.18 rows=1575 width=8) (actual time=0.017..0.018 rows=1 loops=1)
                            Index Cond: (users.user_group_id = (607547))
    ->  Index Scan using machines_sid_un on machines  (cost=0.00..3.00 rows=1 width=12) (actual time=0.015..0.015 rows=1 loops=1)
          Index Cond: (machines.user_id = users.id)
  Total runtime: 0.121 ms
 (12 rows)
 
 
 
 
 On the QA server, things are not so accurate. It doesn't hurt the
 timing of this simplified query much, but when put into the actual
 query, the row estimation being off by 6 orders of magnitude really
 throws the planning in the wrong direction. The plan on the QA server
 is:
 
 
 
 QUERY PLAN
 ---------------------------------------------------------------------------
  Nested Loop  (cost=1887.16..3671.20 rows=1192462 width=8) (actual time=0.049..0.051 rows=1 loops=1)
    ->  HashAggregate  (cost=1887.16..1889.16 rows=200 width=4) (actual time=0.032..0.033 rows=1 loops=1)
          ->  Limit  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.029 rows=1 loops=1)
                ->  Nested Loop  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.028 rows=1 loops=1)
                      ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
                            ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
                                  ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
                      ->  Index Scan using users_user_groups_idx on users  (cost=0.00..1849.20 rows=1517 width=8) (actual time=0.015..0.016 rows=1 loops=1)
                            Index Cond: (users.user_group_id = (607547))
    ->  Index Scan using machines_sid_un on machines  (cost=0.00..8.90 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
          Index Cond: (machines.user_id = users.id)
  Total runtime: 0.148 ms
 (12 rows)
 
 
 
 
 
 
 The problem here, I think, is that the QA server believes that running
 a nested loop over 200 users.id values and joining them against
 machines.user_id will produce about 1M rows. The production server sees
 this more accurately as the nearly 1:1 relationship that it is.
 
 
 The reason I wonder if this might be a bug is that if I change the
 obtuse clause "WHERE user_group_id IN (SELECT 607547 OFFSET 0)" to
 simply "WHERE user_group_id IN (607547)", then the plan collapses to
 the same plan on both servers:
 
 
 explain analyze SELECT 

Re: [GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-31 Thread Peter Eisentraut
On ons, 2011-08-31 at 10:42 +0300, Peter Eisentraut wrote:
 I don't have an answer for you, but this report looks suspiciously
 similar to the one I posted the other day at
 http://archives.postgresql.org/pgsql-hackers/2011-08/msg01224.php,
 which, now that I think about it, also manifested itself after the
 upgrade to 8.4.8.

See this thread:
http://archives.postgresql.org/pgsql-performance/2011-08/msg00248.php

It looks like there are a number of users affected by this.

 

Re: [GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-31 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 I don't have an answer for you, but this report looks suspiciously
 similar to the one I posted the other day at
 http://archives.postgresql.org/pgsql-hackers/2011-08/msg01224.php,
 which, now that I think about it, also manifested itself after the
 upgrade to 8.4.8.

I think there is more than one thing going on here.  I've identified a
logic error in this 8.4 change:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7f3eba30
which is that it is relying on vardata[12]->rel->rows to provide the
number of rows coming in to the semijoin, but that's only accurate for a
single level of join.  With two nested semijoins you get a pretty wacko
answer --- unless you prevent them from being folded by inserting OFFSET
0.  So that's definitely a bug, but it doesn't explain Ben's complaint
because he's griping about a case where he did have OFFSET 0.  (I wonder
though if this is the 8.4 planner regression that he put in the OFFSET
for originally.  Sure would be nice if people reported such things
instead of hacking around them and imagining that they'll get fixed
magically.)  It also doesn't explain Mark Kirkwood's complaint, since
he's showing test cases that involve only one join.

The only relevant-looking change I can find between 8.4.7 and 8.4.8 is
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=0ae8b300388c2a3eaf90e6e6f13d6be1f4d4ac2d
which again should not have caused the amount of excitement we're seeing
on this thread, since it should represent no worse than a reversion to
8.3 behavior.  Possibly what we're after is some earlier 8.4.x patch.
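
The query shape Tom describes can be sketched as follows. This is a
hypothetical illustration (table and column names are invented, not from
the thread) of two nested IN subqueries, i.e. nested semijoins, that the
planner folds together unless an OFFSET 0 fence is added:

```sql
-- Hypothetical sketch: two nested IN subqueries (semijoins). When the
-- planner flattens both levels into one join tree, the 8.4 selectivity
-- code described above sees the wrong input row count for the inner
-- semijoin, producing a wildly off estimate.
EXPLAIN
SELECT * FROM t_outer
WHERE t_outer.ref_id IN (
    SELECT id FROM t_mid
    WHERE t_mid.grp_id IN (SELECT id FROM t_inner)
);

-- Adding OFFSET 0 prevents the subqueries from being folded, so each
-- semijoin is estimated at its own level:
EXPLAIN
SELECT * FROM t_outer
WHERE t_outer.ref_id IN (
    SELECT id FROM t_mid
    WHERE t_mid.grp_id IN (SELECT id FROM t_inner) OFFSET 0
);
```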

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-31 Thread Ben Chobot
On Aug 31, 2011, at 10:47 AM, Tom Lane wrote:


Tom, if there's anything else we can provide that might help you out, let me know. 
We're currently about to install an earlier 8.4 version to see if the problem 
goes away. Is there a particular version you'd be interested in?


Re: [GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-31 Thread Tom Lane
Ben Chobot be...@silentmedia.com writes:
 Tom, if there's anything else we can provide that might help you out, let me know.

If you could extract a self-contained test case for the bad estimation,
that would be useful.
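
A self-contained test case of the sort requested here would use only
synthetic data, so anyone can run it. A hypothetical skeleton (the table
definitions and data distributions are invented, not Ben's actual schema):

```sql
-- Hypothetical self-contained reproduction skeleton: build small
-- synthetic tables, gather statistics, then compare the row estimates
-- in the EXPLAIN ANALYZE output across server versions.
CREATE TABLE users    (id int PRIMARY KEY, user_group_id int);
CREATE TABLE machines (user_id int, deleted boolean, quota_purchased bigint);

INSERT INTO users
    SELECT g, g % 1000 FROM generate_series(1, 100000) g;
INSERT INTO machines
    SELECT g, false, 10 FROM generate_series(1, 100000) g;

ANALYZE users;
ANALYZE machines;

EXPLAIN ANALYZE
SELECT machines.quota_purchased
FROM machines
WHERE NOT deleted AND machines.user_id IN (
    SELECT id FROM users WHERE user_group_id IN (
        SELECT 607547 OFFSET 0
    ) OFFSET 0
);
```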

regards, tom lane



Re: [GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-31 Thread Ben Chobot
On Aug 31, 2011, at 11:10 AM, Tom Lane wrote:

 Ben Chobot be...@silentmedia.com writes:
 Tom, if there's anything else we can provide that might help you out, let me know.
 
 If you could extract a self-contained test case for the bad estimation,
 that would be useful.

OK, we'll pull something together. In the meantime, we can confirm that 
reverting from 8.4.8 to 8.4.3 fixes things.

Re: [GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-31 Thread Ben Chobot
On Aug 31, 2011, at 11:53 AM, Ben Chobot wrote:

 OK, we'll pull something together. In the meantime, we can confirm that 
 reverting from 8.4.8 to 8.4.3 fixes things.

 and FWIW 9.0.3 does the right thing as well.

[GENERAL] regression between 8.4.8 and 8.4.2?

2011-08-30 Thread Ben Chobot
We recently took a copy of our production data (running on 8.4.2), scrubbed 
many data fields, and then loaded it onto a qa server (running 8.4.8). We're 
seeing some odd planner performance that I think might be a bug, though I'm 
hoping it's just idiocy on my part. I've analyzed things and looked into 
pg_stats and it seems as if the relevant columns have about the same 
statistics. 
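
A statistics comparison of the kind described can be done by running a
query along these lines on both servers and diffing the output (a sketch;
which columns matter here is an assumption):

```sql
-- Hypothetical example of comparing planner statistics across servers:
-- the same query run on production and QA should report similar
-- n_distinct, null_frac, and correlation values for the join columns.
SELECT tablename, attname, n_distinct, null_frac, correlation
FROM pg_stats
WHERE tablename IN ('machines', 'users')
  AND attname   IN ('user_id', 'id', 'user_group_id', 'deleted');
```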

I've managed to simplify the query, but if I make it any simpler, then the two 
servers end up with the same good plan. The query is down to:

SELECT machines.quota_purchased 
FROM machines
WHERE NOT deleted AND machines.user_id IN (
SELECT id FROM users WHERE user_group_id IN (
 select 607547 offset 0
) OFFSET 0
  );


(Those OFFSET 0 clauses are in there to protect us from planner regressions we 
saw when moving to 8.4. When we move to 9, they can hopefully go away.)
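
The fence works because OFFSET 0 is a no-op for the result set but stops
the planner from pulling the subquery up into the outer query. A minimal
sketch of the idiom, using the tables from the query above:

```sql
-- Without the fence, the IN subquery may be flattened into the outer
-- join tree and planned as a single unit:
SELECT machines.quota_purchased
FROM machines
WHERE machines.user_id IN (SELECT id FROM users WHERE user_group_id = 607547);

-- With OFFSET 0, the subquery is kept as a separate plan node
-- (a no-op limit acting as an optimizer fence):
SELECT machines.quota_purchased
FROM machines
WHERE machines.user_id IN (
    SELECT id FROM users WHERE user_group_id = 607547 OFFSET 0
);
```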

On the production server, this returns a fairly accurate plan:

QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=843.59..1447.90 rows=243 width=8) (actual time=0.044..0.045 rows=1 loops=1)
   ->  HashAggregate  (cost=843.59..845.59 rows=200 width=4) (actual time=0.027..0.027 rows=1 loops=1)
         ->  Limit  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.024..0.025 rows=1 loops=1)
               ->  Nested Loop  (cost=0.02..823.90 rows=1575 width=4) (actual time=0.023..0.024 rows=1 loops=1)
                     ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
                           ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
                                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=1)
                     ->  Index Scan using users_user_groups_idx on users  (cost=0.00..804.18 rows=1575 width=8) (actual time=0.017..0.018 rows=1 loops=1)
                           Index Cond: (users.user_group_id = (607547))
   ->  Index Scan using machines_sid_un on machines  (cost=0.00..3.00 rows=1 width=12) (actual time=0.015..0.015 rows=1 loops=1)
         Index Cond: (machines.user_id = users.id)
 Total runtime: 0.121 ms
(12 rows)


On the QA server, things are not so accurate. It doesn't hurt the timing of 
this simplified query much, but when put into the actual query, the row 
estimation being off by 6 orders of magnitude really throws the planning in the 
wrong direction. The plan on the QA server is:

QUERY PLAN
---------------------------------------------------------------------------
 Nested Loop  (cost=1887.16..3671.20 rows=1192462 width=8) (actual time=0.049..0.051 rows=1 loops=1)
   ->  HashAggregate  (cost=1887.16..1889.16 rows=200 width=4) (actual time=0.032..0.033 rows=1 loops=1)
         ->  Limit  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.029 rows=1 loops=1)
               ->  Nested Loop  (cost=0.02..1868.20 rows=1517 width=4) (actual time=0.027..0.028 rows=1 loops=1)
                     ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=1)
                           ->  Limit  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
                                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
                     ->  Index Scan using users_user_groups_idx on users  (cost=0.00..1849.20 rows=1517 width=8) (actual time=0.015..0.016 rows=1 loops=1)
                           Index Cond: (users.user_group_id = (607547))
   ->  Index Scan using machines_sid_un on machines  (cost=0.00..8.90 rows=1 width=12) (actual time=0.013..0.013 rows=1 loops=1)
         Index Cond: (machines.user_id = users.id)
 Total runtime: 0.148 ms
(12 rows)



The problem here, I think, is that the QA server believes that running a nested 
loop over 200 users.id values and joining them against machines.user_id will 
produce about 1M rows. The production server sees this more accurately as the 
nearly 1:1 relationship that it is.

The reason I wonder if this might be a bug is that if I change the obtuse 
clause "WHERE user_group_id IN (SELECT 607547 OFFSET 0)" to simply 
"WHERE user_group_id IN (607547)", then the plan collapses to the same plan 
on both servers:

explain analyze SELECT machines.quota_purchased
FROM machines
WHERE NOT deleted AND machines.user_id IN (
SELECT id FROM users WHERE user_group_id IN (