Hi.

I am working on getting full-text search to work and have
come across something I think looks a bit strange.

The document base is around 350,000 documents, and
I have set the statistics target on the tsvector column
to 1000, since the default of 100 seems way off.
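For reference, the target was bumped with something along these lines (statement typed from memory; schema name taken from the plans below):

# ALTER TABLE efam.reference ALTER COLUMN document_tsvector SET STATISTICS 1000;

and then the column was re-analyzed: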

# ANALYZE verbose reference (document_tsvector);
INFO:  analyzing "reference"
INFO: "reference": scanned 14486 of 14486 pages, containing 350174 live rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
ANALYZE

Ok, so analyze examined almost all rows. Looking into "most_common_freqs" I find:

# select count(unnest) from (select unnest(most_common_freqs) from pg_stats where attname = 'document_tsvector') as foo;
 count
-------
  2810
(1 row)


But the distribution is very "flat" at the end; the last 128 values are exactly
1.00189e-05
which means that any term sitting outside the array would get an estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
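
The flat tail can be seen directly by pulling out the smallest frequencies (same pg_stats row as above; output not pasted again):

# select freq from (select unnest(most_common_freqs) as freq from pg_stats
  where attname = 'document_tsvector') as foo order by freq asc limit 10;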

So far I have no idea whether this is bad or good, so here are a couple of sample runs of terms that
sit outside the "most_common_vals" array:

# explain analyze select id from efam.reference where document_tsvector @@ to_tsquery('searchterm') order by id limit 2000;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method:  quicksort  Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

Ok, the query planner estimates that there are 2 rows .. exactly as predicted above, so that part works as expected, but
in fact there are 1612 rows that match.

So, analyze has sampled roughly 6 out of 7 rows in the table, and this term exists in 1612/350174 rows ~ freq: 0.0046, which is way higher than the lower bound of 1.00189e-05 .. it should have been sitting somewhere around the middle of the 2810
values collected in "most_common_vals".
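
If anyone wants to reproduce the actual frequency directly, a query like this should do ('searchterm' being the same placeholder as in the plan above):

# select count(*), count(*)::float / (select count(*) from efam.reference) as freq
  from efam.reference where document_tsvector @@ to_tsquery('searchterm');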

So the "most_common_vals" seems to contain a lot of values that should never have been kept in favor
of other values that are more common.

In practice, just cranking the statistics target up high enough seems to solve the problem, but doesn't
it look like something is wrong in how the statistics are collected?

# select version();
                                                          version
---------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit


--
Jesper
