Here is a query on a partitioned schema that produces a very bad query plan.
The tables are fully vacuumed, analyzed with a statistics target of 40, and have no bloat
(they were created with pure inserts, no updates or deletes).
I already know of at least three bugs involving the query planner and partitions;
they are listed at the end of this message.
Below are explain analyze examples on just one partition. Expanding the date
range to span multiple partitions only makes this worse.
First, the query written against the partitions directly, which is fast and
produces the right plan:
(Although, wouldn't it use less memory if it hashed the small table and
scanned the larger one against that, rather than vice versa? Or am I reading
the explain wrong?)
rr=> explain analyze select pl.id, sl.terms from p_pp_logs_026_2008_12_17 pl,
p_s_logs_026_2008_12_17 sl where sl.session = pl.session and sl.terms <>
'null' and sl.terms is not null;
                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=31470.89..37199.26 rows=38597 width=27) (actual time=743.782..772.705 rows=29907 loops=1)
   Hash Cond: ((pl.session)::text = (sl.session)::text)
   ->  Seq Scan on pp_logs_026_2008_12_17 pl  (cost=0.00..357.49 rows=8449 width=46) (actual time=0.007..3.064 rows=8449 loops=1)
   ->  Hash  (cost=24822.35..24822.35 rows=531883 width=57) (actual time=743.597..743.597 rows=531453 loops=1)
         ->  Seq Scan on s_logs_026_2008_12_17 sl  (cost=0.00..24822.35 rows=531883 width=57) (actual time=0.011..392.242 rows=532067 loops=1)
               Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text))
 Total runtime: 779.431 ms
In the above, the estimated and actual rows are about right, only a bit off
after the join.
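(Rough arithmetic on the parenthetical question above, using the row counts and widths the planner itself reports: hashing the small arm should need on the order of 8,449 * 46 bytes, roughly 0.4 MB, while hashing the large arm needs on the order of 531,883 * 57 bytes, roughly 30 MB, plus per-tuple overhead in both cases. So hashing the small table and probing it with the large one does look like it would use far less memory.)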
We partition by s_id and date; the general form of the query, used to scan
more than one combination of those values, fails badly. In the query below,
which restricts the data to the same tables (plus the parent tables, which hold
one irrelevant row of dummy data so that ANALYZE works on them), the planner is
extremely wrong:
explain analyze select pl.id, sl.terms from pp_logs pl, s_logs sl where
pl.s_id = 26 and sl.s_id = 26 and sl.session = pl.session and sl.terms <>
'null' and sl.terms is not null and pl.date = '2008-12-17' and sl.date =
'2008-12-17';
                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=79036.43..416160.16 rows=22472099 width=27) (actual time=7089.961..7839.665 rows=29907 loops=1)
   Merge Cond: ((pl.session)::text = (sl.session)::text)
   ->  Sort  (cost=968.98..990.10 rows=8450 width=46) (actual time=193.778..196.690 rows=8449 loops=1)
         Sort Key: pl.session
         Sort Method: quicksort Memory: 1043kB
         ->  Append  (cost=0.00..417.84 rows=8450 width=46) (actual time=36.824..133.157 rows=8449 loops=1)
               ->  Seq Scan on pp_logs pl  (cost=0.00..18.10 rows=1 width=21) (actual time=0.056..0.056 rows=0 loops=1)
                     Filter: ((s_id = 26) AND (date = '2008-12-17'::date))
               ->  Seq Scan on pp_logs_026_2008_12_17 pl  (cost=0.00..399.74 rows=8449 width=46) (actual time=36.766..129.317 rows=8449 loops=1)
                     Filter: ((s_id = 26) AND (date = '2008-12-17'::date))
   ->  Sort  (cost=78067.45..79397.16 rows=531884 width=57) (actual time=6895.648..7094.701 rows=552412 loops=1)
         Sort Key: sl.session
         Sort Method: quicksort Memory: 92276kB
         ->  Append  (cost=0.00..27483.28 rows=531884 width=57) (actual time=33.759..1146.936 rows=532067 loops=1)
               ->  Seq Scan on s_logs sl  (cost=0.00..1.19 rows=1 width=22) (actual time=0.036..0.036 rows=0 loops=1)
                     Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text) AND (s_id = 26) AND (date = '2008-12-17'::date))
               ->  Seq Scan on s_logs_026_2008_12_17 sl  (cost=0.00..27482.09 rows=531883 width=57) (actual time=33.721..903.607 rows=532067 loops=1)
                     Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text) AND (s_id = 26) AND (date = '2008-12-17'::date))
 Total runtime: 7861.723 ms
Some partitions have about 10x the data of the one above, which makes the sort
take roughly 30x longer but the hash only about 10x longer. The sort also uses
far, far more memory than the hash would if it hashed the small table and
hash-joined the large one against it.
As you can see in the above example, this inner join, whose two arms the
planner knows contain roughly 530K and 8.5K rows, is estimated to produce 22M
output rows. What?
Did it lose its ability to even roughly know n_distinct? Can it take n_distinct
for each table, combine them into a (weighted) worst-case estimate, and arrive
at something less bizarre?
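Back-computing the implied join selectivity from the numbers above (a rough sketch, assuming the usual outer_rows * inner_rows * selectivity form of the join estimate):

  parent-table query:      22,472,099 / (8,450 * 531,884)  = 0.005       (~1/200 distinct join keys)
  direct-partition query:      38,597 / (8,449 * 531,883) ~= 0.0000086   (~1/116,000 distinct join keys)

That 0.005 looks like a fallback selectivity rather than anything derived from the real session statistics, which fits the suspicion below that the nearly empty parent tables are polluting the estimate.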
If I scan across more tables, the actual output grows roughly linearly, but the
estimate looks like it's A*ln(A*B), where A is the size of the larger arm and B
is the size of the smaller one. Having an inner join produce more rows than one
of its arms is hard to do with n_distinct numbers like these, as the
direct-to-table query's estimate shows.
Perhaps the parent table (with one row) is getting in the way?
Any suggestions on how to get a good plan when accessing this via the parent
table to scan multiple dates?
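The rewrite that avoids the problem today is to spell out each partition pair by hand and UNION ALL the results, something like the sketch below (the second pair of partition names is hypothetical, just to show the shape); but that defeats the point of having parent tables:

select pl.id, sl.terms
  from p_pp_logs_026_2008_12_17 pl, p_s_logs_026_2008_12_17 sl
 where sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null
union all
select pl.id, sl.terms
  from p_pp_logs_026_2008_12_18 pl, p_s_logs_026_2008_12_18 sl  -- hypothetical next day's partitions
 where sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null;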
Known (to me) partition bugs related to the query planner / stats:
* Row width estimates for scans across multiple tables are wrong: just max() of
the individual table widths rather than a weighted average (this could have
been fixed in 8.3.4+ without me noticing, though).
* Column selectivity is not combined rationally into the aggregate statistics
for a scan of multiple tables in one arm. For example, two partitions may each
have an int column with only 30 distinct values over 2M rows. The planner will
decide that the selectivity is 10%, rather than the worst case of
(30+30)/(total rows) or the best case of 30/(total rows). There are many
variations on this theme, not just the above type of error.
* An empty partition (usually the parent table) propagates default statistics
through the errors above, leading to even stranger query plans. Default
statistics for an empty table are generally poor, but they are worse in the
cases above because they override good stats. Furthermore, ANALYZE won't write
stats for an empty table, so unless you put dummy rows in it you are guaranteed
to get bad plans.
I have put a single dummy row in each parent table, with minimum column width
on the text columns, and analyzed it to help with the above. All partitions are
fully vacuumed with no bloat and fully analyzed. After a day they become
read-only.
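For reference, the dummy-row workaround looks roughly like this (a sketch; the column lists and values here are placeholders, the real tables have more columns):

insert into pp_logs (id, session, s_id, date) values (0, '', 0, '2000-01-01');
insert into s_logs (session, terms, s_id, date) values ('', '', 0, '2000-01-01');
analyze pp_logs;
analyze s_logs;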
The only time I wish I had hints is when I'm working on partitioned tables;
most other times, minor query modifications plus good statistics get a good
plan (though perhaps with a less natural query syntax). The planner is not
adequate on partitioned tables.
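The closest thing to a hint available today is the enable_* settings; a sketch of the kind of blunt, session-local override that should push this query back toward the hash join (far too coarse to be a real answer):

begin;
set local enable_mergejoin = off;  -- forbid the merge join for this transaction only
select pl.id, sl.terms
  from pp_logs pl, s_logs sl
 where pl.s_id = 26 and sl.s_id = 26
   and sl.session = pl.session
   and sl.terms <> 'null' and sl.terms is not null
   and pl.date = '2008-12-17' and sl.date = '2008-12-17';
commit;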