On 2010-05-29 15:56, Jan Urbański wrote:
On 29/05/10 12:34, Jesper Krogh wrote:
On 2010-05-28 23:47, Jan Urbański wrote:
On 28/05/10 22:22, Tom Lane wrote:
Now I tried to substitute some numbers there. Assuming the English language
has ~1e6 words, H(W) is around 6.5, and let's assume a statistics target of 100.

I chose s = 1/((st + 10) * H(W)), because under Zipf's law 1/(K * H(W)) is
roughly the frequency of the K-th most common word, and the top 10 English
words will most probably be stopwords, so we will never see them in the input.
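To make that concrete (just plugging in the numbers above): with st = 100 and
H(W) ~ 6.5,

    s = 1 / ((100 + 10) * 6.5) ~ 1/715 ~ 0.0014

that is, a minimum frequency of about 0.14%.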

I think you should skip the assumption about stop-words; users may use a
setup where stop-words are needed in the index, or a language other than the
typical one (and they don't seem to influence the math that much).
It turns out it has a nearly linear influence on the bucket width and on the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet; results below.

How about setting it to "some default" in the first analyze round, but
setting it to the count of MCEs with a frequency of 1 in the subsequent
analyze rounds?

Isn't it the same "type" of logic that is used for collecting statistics
for "array-types", say integer-arrays and text arrays?
AFAIK statistics for everything other than tsvectors are built from whole
column values. ts_typanalyze is the only typanalyze function that takes the
trouble of looping over the actual contents of each cell; all the others just
compare whole arrays (which means that for a text[] field you will probably
get a quite useless MCV entry).
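A quick way to see that (a sketch with a made-up table, nothing from your
schema): even though 'common' appears in every array below, it can only show
up in the statistics as part of exact duplicate arrays.

CREATE TABLE arrtest (tags text[]);
INSERT INTO arrtest
  SELECT ARRAY['common', 'tag' || (i % 100)]
    FROM generate_series(1, 10000) i;
ANALYZE arrtest;
-- the MCV list holds whole arrays like {common,tag1}, never single elements
SELECT most_common_vals
  FROM pg_stats
 WHERE tablename = 'arrtest' AND attname = 'tags';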

In another area, I was thinking about modelling a complete tree structure,
where I would like to store each complete sub-branch as an int[] of the node
ids in the set and then index them using GIN (see the sketch below). Based on
the above information that seems like a "really bad idea".
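Concretely, something like this (hypothetical names, using the built-in GIN
opclass for arrays):

-- each row stores the complete set of node ids in one sub-branch
CREATE TABLE subtree (
    root_id  int PRIMARY KEY,
    node_ids int[] NOT NULL
);
CREATE INDEX subtree_nodes_idx ON subtree USING gin (node_ids);
-- all sub-branches containing node 42
SELECT root_id FROM subtree WHERE node_ids @> ARRAY[42];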

Wouldn't it make sense to treat array types like the tsvectors?

The results are attached in a text (CSV) file, to preserve formatting. Based
on them I'd like to propose setting both top_stopwords and error_factor to 100.

I know it is not perceived as the correct way to do things, but I would
really like to keep the "stop words" in the dataset and have something that
is robust to that.

There are two reasons for that wish. One is that the application becomes more
general: I really cannot stop my users from searching for stop-words, and
they would expect the "full set" rather than the "empty set" we get now.

The other is that the list of stop words is by no means finite and would very
much depend on the input data set.

I would try to add the stop-words to the dictionary, so they still work but
don't occupy that much space in the actual index. That seems to solve the
same task with fewer issues for the users and more general code around it
(see the sketch below).

I can "fairly easy" try out patches or do other kind of testing.
I'll try to come up with a patch for you to try and fiddle with these
values before Monday.

Excellent.


testdb=# explain select id from testdb.reference
         where document_tsvector @@ plainto_tsquery('where')
         order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference
         where document_tsvector @@ plainto_tsquery('where')
         order by id limit 50;
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

testdb=#

I would indeed have expected the first 50 rows ordered by id, which should be
trivial to extract.
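For what it's worth, the dropping seems to happen already when the query is
parsed; with the default english configuration plainto_tsquery simply returns
an empty tsquery:

testdb=# select plainto_tsquery('english', 'where');
NOTICE:  text-search query contains only stop words or doesn't contain lexemes, ignored
 plainto_tsquery
-----------------

(1 row)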

--
Jesper
