Hi.

I am working on getting full-text search to work and have
come across something I think looks a bit strange.

The document base is around 350,000 documents, and
I have set the statistics target on the tsvector column
to 1000, since the default of 100 seems way off.
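For reference, the target was bumped with something along these lines (statement typed from memory; schema name taken from the plans below):

# ALTER TABLE efam.reference ALTER COLUMN document_tsvector SET STATISTICS 1000;

and then the column was re-analyzed: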

# ANALYZE verbose reference (document_tsvector);
INFO:  analyzing "reference"
INFO: "reference": scanned 14486 of 14486 pages, containing 350174 live rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
ANALYZE

Ok, so analyze examined almost all rows. Looking into "most_common_freqs" I find:

# select count(unnest) from (select unnest(most_common_freqs) from pg_stats where attname = 'document_tsvector') as foo;
 count
-------
  2810
(1 row)


But the distribution is very "flat" at the end; the last 128 values are exactly
1.00189e-05
which means that any term sitting outside the array would get an estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
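
The flat tail can be seen directly by pulling out the smallest frequencies (same pg_stats row as above; output not pasted again):

# select freq from (select unnest(most_common_freqs) as freq from pg_stats
  where attname = 'document_tsvector') as foo order by freq asc limit 10;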

So far I have no idea whether this is bad or good, so here are a couple of sample runs of terms that
sit outside the "most_common_vals" array:

# explain analyze select id from efam.reference where document_tsvector @@ to_tsquery('searchterm') order by id limit 2000;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

Ok, the query planner estimates that there are 2 rows .. exactly as predicted above, so that part works as expected, but
in fact there are 1612 rows that match.

So, analyze has sampled roughly 6 out of 7 rows in the table, and this term exists in 1612/350174 rows ~ freq: 0.0046, which is way higher than the lower bound of 1.00189e-05 .. it should have been sitting somewhere around the middle of the 2810
values collected in "most_common_vals".
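
If anyone wants to reproduce the actual frequency directly, a query like this should do ('searchterm' being the same placeholder as in the plan above):

# select count(*), count(*)::float / (select count(*) from efam.reference) as freq
  from efam.reference where document_tsvector @@ to_tsquery('searchterm');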

So the "most_common_vals" seems to contain a lot of values that should never have been kept in favor
of other values that are more common.

In practice, just cranking the statistics target up high enough seems to solve the problem, but doesn't
it look like something is wrong in how the statistics are collected?

# select version();
                                                          version
---------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit


--
Jesper
