Hi All.

The NULL element always suprises me in unpleasant ways..  my brain simply
cant really understand the logic, so please let me know if this is
one of the cases where I just should spend way more efforts into fixing that instead.

I have a table with a "null_frac" of 0.5 and i have tested that a where
clause that evaluates to null isnt included in the result:

testdb=# select id from testtable where  null @@ to_tsquery('testterm80');
 id
----
(0 rows)

Then I'd expect to have the null_fraq taken into account when computing the
estimates for the query:

testdb=# explain select id from testtable where fts @@ to_tsquery('testterm80');
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..1985.03 rows=1966 width=4)
   Filter: (fts @@ to_tsquery('testterm80'::text))
(2 rows)

Whereas it actually does it if I explicitly add the "fts is not null" clause to the query.

testdb=# explain select id from testtable where fts @@ to_tsquery('testterm80') and fts is not null;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on testtable  (cost=130.34..1735.19 rows=983 width=4)
Recheck Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS NOT NULL)) -> Bitmap Index Scan on testtable_fts_idx (cost=0.00..130.09 rows=983 width=0) Index Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS NOT NULL))
(4 rows)

When something evaluates to "null" isn't included in the result, shouldn't the query-planner
then take the null_frac into account when computing the estimate?

Trying to do the same thing using integers and the < operator seem to take the null_frac into
account.

Below snippet allows to reproduce the dataset.


create table testtable (id serial primary key, document text, fts tsvector);
create index on testtable using gist(fts);
CREATE OR REPLACE FUNCTION public.filltable(rows integer)
 RETURNS boolean
 LANGUAGE plpgsql
AS $function$
    DECLARE
        count integer;
    BEGIN
        count := 0;
        LOOP
            EXIT WHEN count = rows;
            count := count +1;
insert into testtable(document,fts) select document,to_tsvector('english',document) from (select string_agg(concat,' ') as document from (select concat('testterm' || generate_series(1,floor(random()*100)::integer))) as foo) as bar;
        END LOOP;
        RETURN TRUE;
    END;
$function$

select filltable(10000);
testdb=# update testtable set fts = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO: "testtable": scanned 1835 of 1835 pages, containing 10002 live rows and 5001 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# select null_frac from pg_stats where attname = 'fts';
 null_frac
-----------
       0.5
(1 row)

... trying with integers:

testdb=# ALTER  TABLE testtable add column testint integer;
ALTER TABLE
testdb=# update testtable set testint = floor(random()*100);
UPDATE 10002
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO: "testtable": scanned 2186 of 2186 pages, containing 10002 live rows and 10002 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# update testtable set testint = null where id %2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO: "testtable": scanned 2282 of 2282 pages, containing 10002 live rows and 13335 dead rows; 10002 rows in sample, 10002 estimated total rows
analyzze ANALYZE
testdb=# explain select id from testtable where testint = 50;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=64 width=4)
   Filter: (testint = 50)
(2 rows)

testdb=# explain select id from testtable where testint = 1;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=48 width=4)
   Filter: (testint = 1)
(2 rows)

testdb=# explain select id from testtable where testint < 50;
                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on testtable  (cost=0.00..2407.03 rows=2470 width=4)
   Filter: (testint < 50)
(2 rows)


(found on 8.4 and reproduced on git-head)

Attached patch tries to align the behaviour

Thanks.

--
Jesper
>From 4b4be27864f683a9b3464b86d6cfa567f8ab6bd2 Mon Sep 17 00:00:00 2001
From: Jesper Krogh <j...@novozymes.com>
Date: Thu, 17 Feb 2011 22:21:52 +0100
Subject: [PATCH] Take null_frac into account in calculating selectivity for @@ operator. This makes behaviour align with what > operator does for integers.

---
 src/backend/tsearch/ts_selfuncs.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 8ce9fb4..a4f3219 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -180,14 +180,14 @@ tsquerysel(VariableStatData *vardata, Datum constval)
 			 * There is a most-common-elements slot for the tsvector Var, so
 			 * use that.
 			 */
-			selec = mcelem_tsquery_selec(query, values, nvalues,
+			selec = (1.0 - stats->stanullfrac ) * mcelem_tsquery_selec(query, values, nvalues,
 										 numbers, nnumbers);
 			free_attstatsslot(TEXTOID, values, nvalues, numbers, nnumbers);
 		}
 		else
 		{
 			/* No most-common-elements info, so do without */
-			selec = tsquery_opr_selec_no_stats(query);
+			selec = (1.0 - stats->stanullfrac ) * tsquery_opr_selec_no_stats(query);
 		}
 	}
 	else
-- 
1.7.0.4

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to