Oleg Bartunov wrote:
Jan,

the problem is well known and frequently requested. From your proposal it's
not clear what the idea is.

I guess the first approach could be to populate some more columns in pg_statistic for tables with tsvector columns. I see some statistics are already being gathered (the histogram_bounds column of pg_stats is populated for tsvector columns), so maybe one could build on that? Even remembering a few of the most frequently appearing lexemes could, in my opinion, help. I plotted distinct lexemes against the number of documents containing them (basically the output of stat()) for one of our databases and came out with this:
http://www.fiok.pl/~jurbanski/kaired-depesze.png
The three high values are really stopwords, and partially because of that I wrote my first FTS patch. But this shows that if we remembered the ~40 most frequent lexemes, we could give much better estimates for popular queries (and I think the queries that hurt performance most are the ones for which the row count is underestimated).
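
For concreteness, here is roughly what those "most frequent lexemes" would look like. The documents table and body_tsv column are made-up names, and ts_stat() is the 8.3 built-in corresponding to the old contrib/tsearch2 stat():

  -- The ~40 most frequent lexemes and the number of documents containing
  -- them: roughly the per-lexeme data ANALYZE could remember for a
  -- tsvector column, next to the histogram it already builds.
  SELECT word, ndoc
  FROM ts_stat('SELECT body_tsv FROM documents')
  ORDER BY ndoc DESC
  LIMIT 40;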

As for a more general solution, I'd have to read deeper into the tsearch code to understand how the tsvector type and the @@ operator work, and give it a bit more thought. I'm planning to do that in the next three weeks (read: before the student application period starts). Maybe some kind of heuristic could be implemented? Possibly someone could load some information specific to her language, which would tell the planner roughly how common a given word is.
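
As a first step in that reading, one can at least see which selectivity estimators the @@ operator is wired up to right now; this is just a catalog lookup, nothing new:

  -- Restriction- and join-selectivity functions currently attached to
  -- the tsvector @@ tsquery operator.
  SELECT oprname, oprrest, oprjoin
  FROM pg_operator
  WHERE oprname = '@@'
    AND oprleft = 'tsvector'::regtype
    AND oprright = 'tsquery'::regtype;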

Another approach would be to return lower estimates for tsqueries consisting of more ANDed parts: 'X'::tsquery is usually far less selective (i.e. matches many more rows) than 'X & Y & Z & V'::tsquery.
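
A very crude sketch of how that could work, assuming we had per-lexeme document counts like the ones above (same made-up table and column names): treat the lexemes of an AND-only tsquery as independent and multiply their individual selectivities.

  -- Estimate the selectivity of 'x & y & z' as the product of per-lexeme
  -- document frequencies; exp(sum(ln(...))) is just a product in SQL.
  -- Lexemes missing from ts_stat() simply drop out, so this overestimates
  -- for rare terms - it's only meant to show the direction.
  SELECT exp(sum(ln(s.ndoc::float8 / t.n))) AS estimated_selectivity
  FROM ts_stat('SELECT body_tsv FROM documents') AS s,
       (SELECT count(*)::float8 AS n FROM documents) AS t
  WHERE s.word IN ('x', 'y', 'z');

Real queries of course also contain OR and negation, and lexemes within a document are far from independent, so the actual estimator would need something smarter, but even something this naive might beat what the planner can do without any per-lexeme statistics.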

I searched through the list archives but couldn't find any previous attempts at solving this problem - were there any?

Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

