Thanks for the quick and detailed response, Tom.
Yes, I did add a redundant WHERE clause restricting b.date_id in the range
queries. This appears to speed things up, since the planner then does an index
scan on the b table before the merge join.
We will make our query generation more intelligent (our system generates
queries on the fly) to work around this problem.
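For illustration, here is a minimal sketch of the kind of redundant restriction
I mean, using the table and column names from the plans below; the exact query
shape and join condition are assumptions, since our queries are generated on
the fly:

    -- Hypothetical sketch: the original restriction on a.date_id, plus a
    -- redundant copy of the same restriction on b.date_id so the planner
    -- can index-scan event_agg before the merge join.
    SELECT a.date_id, a.page, b.eventcountset
    FROM   alps_agg  a
    JOIN   event_agg b ON b.date_id = a.date_id
    WHERE  a.page = 'ddi_671'
      AND  a.date_id BETWEEN 20120228 AND 20120301
      AND  b.date_id BETWEEN 20120228 AND 20120301;  -- manually added, redundant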
Venky Kandaswamy
Principal Engineer, Adchemy Inc.
925-200-7124
From: Tom Lane [t...@sss.pgh.pa.us]
Sent: Tuesday, January 15, 2013 2:30 PM
To: Venky Kandaswamy
Cc: pgsql-gene...@postgresql.org; pgsql-sql@postgresql.org
Subject: Re: [SQL] Curious problem of using BETWEEN with start and end being
the same versus EQUALS '='
Venky Kandaswamy <ve...@adchemy.com> writes:
On 9.1, I am running into a curious issue.
It's not very curious at all, or at least people on pgsql-performance
(the right list for this sort of question) would have figured it out
quickly. You're getting a crummy plan because of a crummy row estimate.
When you do this:
WHERE a.date_id = 20120228
you get this:
->  Index Scan using alps_agg_date_id on bi2003.alps_agg
a (cost=0.00..17870.00 rows=26292 width=1350) (actual time=0.047..142.383
rows=36132 loops=1)
Output: a.date_id, a.page_group, a.page,
a.int_alloc_type, a.componentset, a.adc_visit, upper((a.adc_visit)::text)
Index Cond: (a.date_id = 20120228)
Filter: ((a.page)::text = 'ddi_671'::text)
26K estimated rows versus 36K actual isn't the greatest estimate in the
world, but it's plenty good enough. But when you do this:
WHERE a.date_id BETWEEN 20120228 AND 20120228
you get this:
->  Index Scan using alps_agg_date_id on bi2003.alps_agg a
(cost=0.00..10.12 rows=1 width=1350)
Output: a.date_id, a.adc_visit, a.page_group, a.page,
a.int_alloc_type, a.componentset, a.variation_tagset, a.page_instance
Index Cond: ((a.date_id >= 20120228) AND (a.date_id <= 20120228))
Filter: ((a.page)::text = 'ddi_671'::text)
so the bogus estimate of only one row causes the planner to pick an
entirely different plan, which would probably be a great choice if there
were indeed only one such row, but with 36000 of them it's horrid.
The reason the row estimate is so crummy is that a zero-width interval
is an edge case for range estimates. We've seen this before, although
usually it's not quite this bad.
There's been some talk of making the estimate for x >= a AND x <= b
always be at least as much as the estimate for x = a, but this would
increase the cost of making the estimate by quite a bit, and make things
actually worse in some cases (in particular, if a > b then a nil
estimate is indeed the right thing).
You might look into whether queries formed like date_id >= 20120228 AND
date_id < 20120229 give you more robust estimates at the edge cases.
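For concreteness, a one-day window written that way would look something like
this (a sketch only; whether the estimates actually improve on your data is
something to check with EXPLAIN ANALYZE):

    -- Half-open range over a single day: >= the start value and < the next
    -- value, which avoids the zero-width BETWEEN edge case.
    SELECT a.date_id, a.page
    FROM   alps_agg a
    WHERE  a.page = 'ddi_671'
      AND  a.date_id >= 20120228
      AND  a.date_id <  20120229;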
BTW, I notice in your EXPLAIN results that the same range restriction
has been propagated to b.date_id:
->  Index Scan using event_agg_date_id on bi2003.event_agg b
(cost=0.00..10.27 rows=1 width=1694)
Output: b.date_id, b.vcset, b.eventcountset, b.eventvalueset
Index Cond: ((b.date_id >= 20120228) AND (b.date_id <= 20120228))
I'd expect that to happen automatically for a simple equality
constraint, but not for a range constraint. Did you do that manually
and not tell us about it?
regards, tom lane