Hi,
I have a table "table1" with ~100k rows, the table having "flag1"
column. The value of "flag1" is NULL in 85k+ rows, and it's TRUE in
7k+ rows, and FALSE in 6k rows. I use EXPLAIN to get apprx. number of
rows for simple SELECT queries. But in case of "...WHERE NOT flag1"
the optimizer is completely wrong:
-- it's OK here, the estimation is fine
test=# EXPLAIN ANALYZE SELECT * FROM table1 WHERE flag1;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on table1 (cost=0.00..9962.84 rows=7875 width=532) (actual
time=0.107..134.729 rows=7652 loops=1)
Filter: flag1
Total runtime: 139.460 ms
(3 rows)
-- here the optimizer thinks we have 90k+ rows with "flag1 = FALSE",
while the real number of rows is 6k+
test=# EXPLAIN ANALYZE SELECT * FROM table1 WHERE NOT flag1;
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on table1 (cost=0.00..9962.84 rows=91809 width=532) (actual
time=0.087..110.596 rows=6243 loops=1)
Filter: (NOT flag1)
Total runtime: 114.414 ms
(3 rows)
I've checked the available statistics and found that Postgres
actually knows (approximately) how many FALSE values are present in
the table:
test=# SELECT null_frac, n_distinct, most_common_vals,
most_common_freqs FROM pg_stats
WHERE tablename='table1' AND attname='flag1';
null_frac | n_distinct | most_common_vals | most_common_freqs
-----------+------------+------------------+-------------------
  0.864667 |          2 | {t,f}            | {0.079,0.056}
(1 row)
So I've started to think that this is a shortcoming of the optimizer
code: Postgres counts both FALSEs and NULLs when estimating
"var <> const" expressions.
1) backend/utils/adt/selfuncs.c, in neqsel() we have:
    ...
    result = DatumGetFloat8(DirectFunctionCall4(eqsel, ...
    ...
    result = 1.0 - result;
    PG_RETURN_FLOAT8(result);
    ...
-- so, there is a wrong assumption that for "var <> const"
expressions we may simply take the estimate for "var = const" and
subtract it from 1. NULLs are ignored here: by ternary logic, a NULL
satisfies neither "=" nor "<>", so we must subtract the NULL fraction
as well. This will improve row estimation for "var <> const" queries
(but not when we deal with the boolean datatype -- see (2)!). If
there are no objections, I'll send a patch; it is straightforward.
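For the record, here is roughly what I have in mind (an untested
sketch, not the final patch; I'm assuming stanullfrac can be fetched
with get_restriction_variable(), the same way other estimators do
it):

    ...
    float8      nullfrac = 0.0;
    VariableStatData vardata;
    Node       *other;
    bool        varonleft;

    /* look up stanullfrac for the variable, if ANALYZE stats exist */
    if (get_restriction_variable(root, args, varRelid,
                                 &vardata, &other, &varonleft))
    {
        if (HeapTupleIsValid(vardata.statsTuple))
            nullfrac = ((Form_pg_statistic)
                        GETSTRUCT(vardata.statsTuple))->stanullfrac;
        ReleaseVariableStats(vardata);
    }
    ...
    result = DatumGetFloat8(DirectFunctionCall4(eqsel, ...
    ...
    /*
     * A NULL satisfies neither "=" nor "<>", so exclude the NULL
     * fraction instead of assuming sel(<>) = 1 - sel(=).
     */
    result = 1.0 - result - nullfrac;
    CLAMP_PROBABILITY(result);
    PG_RETURN_FLOAT8(result);
    ...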
2) In the case of "WHERE flag1 = FALSE" or "WHERE flag1 <> TRUE", the
planner rewrites the query to "WHERE NOT flag1" and then uses the
logic in backend/optimizer/path/clausesel.c, where, again, we see the
same NULL-ignoring approach:
    ...
    else if (not_clause(clause))
    {
        /* inverse of the selectivity of the underlying clause */
        s1 = 1.0 - clause_selectivity(root,
                                      (Node *) get_notclausearg((Expr *) clause),
                                      varRelid,
                                      jointype);
    ...
I'm not sure how to improve this properly. AFAICS, at this level we
have no knowledge about the data we're dealing with (am I right?), so
I'm afraid that for booleans there is no way to improve the
optimizer. If my thoughts in (1) are correct and we improve the
estimation for "<>", we will end up in a situation where using
booleans can decrease performance due to wrong row count estimates.
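The only half-formed idea I have is to special-case a bare boolean
Var under NOT and route it through booltestsel(), which already knows
how to subtract stanullfrac. A rough, untested sketch (I'm assuming
booltestsel() can be reached from clausesel.c with the signature it
has in selfuncs.c):

    else if (not_clause(clause))
    {
        Node   *arg = (Node *) get_notclausearg((Expr *) clause);

        if (IsA(arg, Var) && ((Var *) arg)->vartype == BOOLOID)
        {
            /*
             * "NOT var" passes the WHERE clause only when var IS
             * FALSE (NULL input yields NULL, which is rejected), so
             * reuse the IS FALSE estimator, which handles NULLs.
             */
            s1 = booltestsel(root, IS_FALSE, arg, varRelid, jointype);
        }
        else
        {
            /* inverse of the selectivity of the underlying clause */
            s1 = 1.0 - clause_selectivity(root, arg, varRelid, jointype);
        }
    }

If I read booltestsel() right, this is also why "WHERE flag1 IS
FALSE" is estimated well already: BooleanTest nodes go through
booltestsel(), which subtracts stanullfrac -- so at the SQL level,
writing "flag1 IS FALSE" instead of "NOT flag1" looks like a
workaround even today.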
I'd appreciate any help and ideas that would improve the situation.
P. S. I'm using the current HEAD version of Postgres; statistics were
updated with ANALYZE before running the queries.
--
Best regards,
Nikolay