Re: [SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='

2013-01-15 Thread Tom Lane
Venky Kandaswamy <ve...@adchemy.com> writes:
On 9.1, I am running into a curious issue.

It's not very curious at all, or at least people on pgsql-performance
(the right list for this sort of question) would have figured it out
quickly.  You're getting a crummy plan because of a crummy row estimate.
When you do this:

  WHERE a.date_id = 20120228

you get this:

 ->  Index Scan using alps_agg_date_id on bi2003.alps_agg a
     (cost=0.00..17870.00 rows=26292 width=1350) (actual time=0.047..142.383 rows=36132 loops=1)
       Output: a.date_id, a.page_group, a.page, a.int_alloc_type, a.componentset, a.adc_visit, upper((a.adc_visit)::text)
       Index Cond: (a.date_id = 20120228)
       Filter: ((a.page)::text = 'ddi_671'::text)

26K estimated rows versus 36K actual isn't the greatest estimate in the
world, but it's plenty good enough.  But when you do this:

  WHERE a.date_id BETWEEN 20120228 AND 20120228

you get this:

 ->  Index Scan using alps_agg_date_id on bi2003.alps_agg a
     (cost=0.00..10.12 rows=1 width=1350)
       Output: a.date_id, a.adc_visit, a.page_group, a.page, a.int_alloc_type, a.componentset, a.variation_tagset, a.page_instance
       Index Cond: ((a.date_id >= 20120228) AND (a.date_id <= 20120228))
       Filter: ((a.page)::text = 'ddi_671'::text)

so the bogus estimate of only one row causes the planner to pick an
entirely different plan, which would probably be a great choice if there
were indeed only one such row, but with 36000 of them it's horrid.

The reason the row estimate is so crummy is that a zero-width interval
is an edge case for range estimates.  We've seen this before, although
usually it's not quite this bad.

There's been some talk of making the estimate for x >= a AND x <= b
always be at least as much as the estimate for x = a, but this would
increase the cost of making the estimate by quite a bit, and make things
actually worse in some cases (in particular, if a > b then a nil
estimate is indeed the right thing).

You might look into whether queries formed like date_id >= 20120228 AND
date_id < 20120229 give you more robust estimates at the edge cases.
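For what it's worth, all three predicate forms select exactly the same rows
for integer date_id values; only the planner's row estimate differs between
them.  A minimal sketch of that set-equivalence (not from the thread; it uses
Python's sqlite3 and a hypothetical alps_agg-like table purely to show the
rewrite is safe -- the estimate behavior itself is PostgreSQL-specific):

```python
# Show that equality, zero-width BETWEEN, and the half-open range all
# select the same rows for an integer date_id.  Table and data are
# made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alps_agg (date_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO alps_agg VALUES (?, ?)",
    [(20120227, "ddi_671"), (20120228, "ddi_671"),
     (20120228, "other"), (20120229, "ddi_671")],
)

queries = {
    "equality":  "SELECT count(*) FROM alps_agg WHERE date_id = 20120228",
    "between":   "SELECT count(*) FROM alps_agg"
                 " WHERE date_id BETWEEN 20120228 AND 20120228",
    "half_open": "SELECT count(*) FROM alps_agg"
                 " WHERE date_id >= 20120228 AND date_id < 20120229",
}
counts = {name: conn.execute(sql).fetchone()[0]
          for name, sql in queries.items()}
print(counts)  # every form counts the same rows
```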

BTW, I notice in your EXPLAIN results that the same range restriction
has been propagated to b.date_id:

 ->  Index Scan using event_agg_date_id on bi2003.event_agg b
     (cost=0.00..10.27 rows=1 width=1694)
       Output: b.date_id, b.vcset, b.eventcountset, b.eventvalueset
       Index Cond: ((b.date_id >= 20120228) AND (b.date_id <= 20120228))

I'd expect that to happen automatically for a simple equality
constraint, but not for a range constraint.  Did you do that manually
and not tell us about it?

regards, tom lane


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='

2013-01-15 Thread Venky Kandaswamy
Thanks for the quick and detailed response, Tom. 

Yes, I did add a redundant WHERE clause with a restriction on b.date_id to the
range queries. This appears to speed things up, since it does an index scan on
the b table before the merge join.

We will get more intelligent on query generation (our system generates queries 
on the fly) to work around this problem. 
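The kind of generator change described above could collapse a degenerate
range into plain equality and render everything else half-open -- a rough
sketch, with the function name and parameterization being illustrative
rather than Adchemy's actual generator:

```python
# Illustrative workaround for the zero-width BETWEEN problem: emit '='
# when the bounds coincide, otherwise a half-open range over integers,
# so the planner never sees a zero-width interval.  Hypothetical helper.
def range_predicate(column: str, lo: int, hi: int) -> str:
    if lo == hi:
        # Degenerate range: plain equality gets a sane row estimate.
        return f"{column} = {lo}"
    # Half-open form; upper bound bumped by one since date_id is an integer.
    return f"{column} >= {lo} AND {column} < {hi + 1}"

print(range_predicate("a.date_id", 20120228, 20120228))
# a.date_id = 20120228
print(range_predicate("a.date_id", 20120228, 20120301))
# a.date_id >= 20120228 AND a.date_id < 20120302
```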


Venky Kandaswamy

Principal Engineer, Adchemy Inc.

925-200-7124


From: Tom Lane [t...@sss.pgh.pa.us]
Sent: Tuesday, January 15, 2013 2:30 PM
To: Venky Kandaswamy
Cc: pgsql-gene...@postgresql.org; pgsql-sql@postgresql.org
Subject: Re: [SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='



