Jan Urbański <[EMAIL PROTECTED]> writes:
> Still, there's a decision to be made: after how many lexemes should the
> pruning occur?
The way I think it ought to work is that the number of lexemes stored in the final pg_statistic entry is statistics_target times a constant (perhaps 100). I don't like having it vary depending on tsvector width --- why for example should a column having a few wide tsvectors get a bigger stats entry than one with many narrow ones? (Not to mention the issue of having to estimate the average or max width before you can start the counting run.) But in any case, given a target number of lexemes to accumulate, I'd suggest pruning with that number as the bucket width (pruning distance). Or perhaps use some multiple of the target number, but the number itself seems about right.

The LC paper says that the bucket width w is equal to ceil(1/e) where e is the maximum frequency estimation error, and that the maximum number of table entries needed is log(eN)/e after N lexemes have been scanned. For the values of e and N we are going to be dealing with, this is likely to work out to a few times 1/e, in other words the table size is a few times w. (They prove it's at most 7w given reasonable assumptions about data distribution, regardless of how big N gets; though I think our values for N aren't large enough for that to matter.)

The existing compute_minimal_stats code uses a table size of twice the target number of values, so setting w to maybe a half or a third of the target number would reproduce the current space usage. I don't see a problem with letting it get a little bigger though, especially since we can expect that the lexemes aren't very long. (compute_minimal_stats can't assume that for arbitrary data types...)

			regards, tom lane
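
For concreteness, here is a minimal, self-contained sketch of the LC scheme as the paper describes it: count incoming lexemes in buckets of width w, and at each bucket boundary discard any entry whose observed count plus its maximum possible undercount (delta) no longer exceeds the current bucket number. The lc_add/lc_prune names and the flat linear-scan table are made up for illustration only; this is not the actual ts_typanalyze code.

/*
 * Minimal sketch of Lossy Counting (Manku & Motwani).  Illustrative only:
 * the identifiers and the linear-scan table are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

typedef struct
{
	char	   *lexeme;		/* tracked item */
	int			count;		/* occurrences seen since entry creation */
	int			delta;		/* maximum possible undercount */
} LCEntry;

#define MAX_ENTRIES 1024	/* generous: the paper bounds the table by ~7w */

static LCEntry entries[MAX_ENTRIES];
static int	n_entries = 0;
static int	w;				/* bucket width = ceil(1/e) */
static int	b_current = 1;	/* current bucket number */
static long n_seen = 0;		/* lexemes scanned so far */

/* At a bucket boundary, drop entries that cannot have frequency >= e*N. */
static void
lc_prune(void)
{
	int			i,
				j = 0;

	for (i = 0; i < n_entries; i++)
	{
		if (entries[i].count + entries[i].delta > b_current)
			entries[j++] = entries[i];
		else
			free(entries[i].lexeme);
	}
	n_entries = j;
	b_current++;
}

/* Account for one incoming lexeme. */
static void
lc_add(const char *lexeme)
{
	int			i;

	for (i = 0; i < n_entries; i++)
	{
		if (strcmp(entries[i].lexeme, lexeme) == 0)
		{
			entries[i].count++;
			goto counted;
		}
	}
	if (n_entries >= MAX_ENTRIES)
	{
		fprintf(stderr, "entry table overflow\n");
		exit(1);
	}
	/* new entry: it may have been seen (and pruned) in earlier buckets */
	entries[n_entries].lexeme = strdup(lexeme);
	entries[n_entries].count = 1;
	entries[n_entries].delta = b_current - 1;
	n_entries++;

counted:
	/* prune at every bucket boundary, i.e. once per w lexemes */
	if (++n_seen % w == 0)
		lc_prune();
}

int
main(void)
{
	const char *stream[] = {"cat", "dog", "cat", "fish", "cat",
							"dog", "bird", "cat", "dog", "cat"};
	double		e = 0.2;	/* maximum frequency estimation error */
	int			i;

	w = (int) ceil(1.0 / e);	/* bucket width, per the paper */

	for (i = 0; i < 10; i++)
		lc_add(stream[i]);

	/* survivors undercount their true frequency by at most delta <= e*N */
	for (i = 0; i < n_entries; i++)
		printf("%s: count %d (delta %d)\n",
			   entries[i].lexeme, entries[i].count, entries[i].delta);
	return 0;
}

Putting rough numbers on the space bound as it applies here: with statistics_target = 100 and a constant of 100, the target is 10000 lexemes, so w = 10000 and e = 1/10000. Scanning, say, N = 10^7 lexemes gives log(eN)/e = ln(1000) * 10000, about 69000 table entries at worst (if the paper's log is the natural log), which is just under the 7w ceiling.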