On Tue, Jun 13, 2006 at 06:04:42PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <[EMAIL PROTECTED]> writes:
> > Also, I'm not sure that the behavior is entirely changed, either. On a
> > 8.1.4 database I'm still seeing a difference between now() - interval
> > and a hard-coded date.
>
> It'd depend on the context, possibly, but it's easy to show that the
> current planner does fold "now() - interval_constant" when making
> estimates. Simple example:
Turns out the difference is between feeding a date vs a timestamp into the
query... I would have thought that since date is a date that the WHERE clause
would be casted to a date if it was a timestamptz, but I guess not...
stats=# explain select * from email_contrib where project_id=8 and date >=
now()-'15 days'::interval;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib
(cost=0.01..45405.83 rows=14225 width=24)
Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval)))
(2 rows)
stats=# explain select * from email_contrib where project_id=8 AND date >=
'2006-05-29 22:09:56.814897+00'::date;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib
(cost=0.00..48951.74 rows=15336 width=24)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
(2 rows)
stats=# explain select * from email_contrib where project_id=8 AND date >=
'2006-05-29 22:09:56.814897+00'::timestamp;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib
(cost=0.00..45472.76 rows=14246 width=24)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29
22:09:56.814897'::timestamp without time zone))
(2 rows)
Actual row count is 109071; reason for the vast difference is querying on two
columns.
I know comming up with general-purpose multicolumn stats is extremely
difficult, but can't we at least add histograms for multi-column indexes?? In
this case that would most likely make the estimate dead-on, because there's an
index on project_id, date.
Details below for the morbidly curious/bored...
stats=# \d email_contrib
Table "public.email_contrib"
Column | Type | Modifiers
------------+---------+-----------
project_id | integer | not null
id | integer | not null
date | date | not null
team_id | integer |
work_units | bigint | not null
Indexes:
"email_contrib_pkey" PRIMARY KEY, btree (project_id, id, date), tablespace
"raid10"
"email_contrib__pk24" btree (id, date) WHERE project_id = 24, tablespace
"raid10"
"email_contrib__pk25" btree (id, date) WHERE project_id = 25, tablespace
"raid10"
"email_contrib__pk8" btree (id, date) WHERE project_id = 8, tablespace
"raid10"
"email_contrib__project_date" btree (project_id, date), tablespace "raid10"
"email_contrib__project_id" btree (project_id), tablespace "raid10"
"email_contrib__team_id" btree (team_id), tablespace "raid10"
Foreign-key constraints:
"fk_email_contrib__id" FOREIGN KEY (id) REFERENCES stats_participant(id) ON
UPDATE CASCADE
"fk_email_contrib__team_id" FOREIGN KEY (team_id) REFERENCES
stats_team(team) ON UPDATE CASCADE
Tablespace: "raid10"
stats=# explain analyze select * from email_contrib where project_id=8 and date
>= now()-'15 days'::interval;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib
(cost=0.01..45475.95 rows=14247 width=24) (actual time=0.294..264.345
rows=109071 loops=1)
Index Cond: ((project_id = 8) AND (date >= (now() - '15 days'::interval)))
Total runtime: 412.167 ms
(3 rows)
stats=# select now()-'15 days'::interval;
?column?
-------------------------------
2006-05-29 22:09:56.814897+00
(1 row)
stats=# explain analyze select * from email_contrib where project_id=8 and date
>= '2006-05-29 22:09:56.814897+00';
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib
(cost=0.00..48951.74 rows=15336 width=24) (actual time=0.124..229.800
rows=116828 loops=1)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
Total runtime: 391.240 ms
(3 rows)
stats=# explain select * from email_contrib where project_id=8 and date >=
'2006-05-29 22:09:56.814897+00'::date;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using email_contrib__project_date on email_contrib
(cost=0.00..48951.74 rows=15336 width=24)
Index Cond: ((project_id = 8) AND (date >= '2006-05-29'::date))
(2 rows)
So casting to date doesn't change anything, but dropping project_id from the
where clause certainly does...
stats=# explain analyze select * from email_contrib where date >= now()-'15
days'::interval;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on email_contrib (cost=847355.98..1256538.96 rows=152552
width=24) (actual time=74886.028..75267.633 rows=148894 loops=1)
Recheck Cond: (date >= (now() - '15 days'::interval))
-> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98
rows=152552 width=0) (actual time=74885.690..74885.690 rows=148894 loops=1)
Index Cond: (date >= (now() - '15 days'::interval))
Total runtime: 75472.490 ms
(5 rows)
That estimate is dead-on. So it appears it's yet another case of cross-column
stats. :( But there's still a difference between now()-interval and something
hard-coded:
stats=# explain analyze select * from email_contrib where date >= '2006-05-29
22:09:56.814897+00'::date;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on email_contrib (cost=847355.98..1278756.22 rows=164256
width=24) (actual time=19356.752..19623.450 rows=159348 loops=1)
Recheck Cond: (date >= '2006-05-29'::date)
-> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98
rows=164256 width=0) (actual time=19356.391..19356.391 rows=159348 loops=1)
Index Cond: (date >= '2006-05-29'::date)
Total runtime: 19841.614 ms
(5 rows)
stats=# explain analyze select * from email_contrib where date >= (now()-'15
days'::interval)::date;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on email_contrib (cost=847355.98..1279988.15 rows=164256
width=24) (actual time=19099.417..19372.167 rows=159348 loops=1)
Recheck Cond: (date >= ((now() - '15 days'::interval))::date)
-> Bitmap Index Scan on email_contrib__project_date (cost=0.00..847355.98
rows=164256 width=0) (actual time=19099.057..19099.057 rows=159348 loops=1)
Index Cond: (date >= ((now() - '15 days'::interval))::date)
Total runtime: 19589.785 ms
Aha! It's the casting to date that changes things.
The stats target is 100...
stats=# select attname, n_distinct from pg_stats where
tablename='email_contrib';
attname | n_distinct
------------+------------
project_id | 6
team_id | 4104
work_units | 6795
date | 3034
id | 35301
(5 rows)
The n_distinct for project_id and date both look about right.
stats=# select * from pg_stats where tablename='email_contrib' and
attname='project_id';
-[ RECORD 1 ]-----+------------------------------------------------------------
schemaname | public
tablename | email_contrib
attname | project_id
null_frac | 0
avg_width | 4
n_distinct | 6
most_common_vals | {205,5,8,25,24,3}
most_common_freqs | {0.4273,0.419833,0.0933667,0.0514667,0.00506667,0.00296667}
histogram_bounds |
correlation | 0.605662
stats=# select relpages,reltuples from pg_class where relname='email_contrib';
relpages | reltuples
----------+-------------
996524 | 1.35509e+08
If we look at how many rows would match project_id 8 and any 15 dates...
stats=# SELECT 1.35509e+08 * 0.0933667 / 3034 * 15;
?column?
------------------------
62551.2268472313777195
We come up with something much closer to reality (116828 rows). I guess the
problem is in the histogram for date; where the last 3 values are:
2005-11-02,2006-03-05,2006-06-11
--
Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED]
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster