Jan Urbański <[EMAIL PROTECTED]> writes:
> Once again: the whole algorithm is a ripoff from analyze.c, with the
> dllist instead of an array because you don't know how big a tracking
> list you will need, and with a hashtable, because the tracking list
> will probably be large and looking up things linearly wouldn't be fun.
Hmmm ... I had forgotten how compute_minimal_stats() works, and looking at it now I'm just itching to rewrite it. You have to realize that that code was written with the assumptions that (1) it was only going to be used for a few weird second-class-citizen data types, and (2) the stats target would typically be around 10 or so. It really wasn't designed to be industrial-strength code. (It was still a big improvement over what we'd had before :-(.) So I'm not very comfortable with taking it as the design base for something that does need to be industrial-strength.

Your point about not wanting the most-recently-added lexeme to drop off first is a good one, but the approach is awfully fragile. Consider what happens if (or when) all the lexemes in the list reach count 2 or more. All other lexemes seen later will fight over the last list slot, and have no chance of getting higher in the list; so the algorithm will completely fail to adapt to any changes in the input statistics that occur after that point is reached.

I'm thinking that the idea of periodically cutting off a bunch of low-scoring items, instead of trying to do it "exactly", would actually be better, because it'd still have a chance of tracking new data even when the counts get larger.

I don't recommend doing two passes over the data, because it's fairly likely that tsvectors would be large enough to be toasted, which'd make fetching them expensive. One idea is to start out with some reasonable estimate of the max lexemes per tsvector (say a couple hundred) and realloc the list bigger in sizable jumps (say 2X) when the estimate is seen to be exceeded.
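A rough sketch of that grow-by-doubling plus periodic-pruning scheme, for concreteness. This is a hypothetical illustration, not the analyze.c or patch code: TrackItem, Tracker, and the cutoff rule are invented names, and a real version would look entries up through the patch's hashtable rather than the linear scan used here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: keep a growable tracking list, double it when
 * the initial guess is exceeded, and periodically discard low-scoring
 * entries so newly seen lexemes can still climb. */
typedef struct
{
    const char *lexeme;
    int         count;
} TrackItem;

typedef struct
{
    TrackItem  *items;
    int         nitems;
    int         maxitems;
} Tracker;

static void
tracker_init(Tracker *t, int initial_guess)
{
    t->items = malloc(initial_guess * sizeof(TrackItem));
    t->nitems = 0;
    t->maxitems = initial_guess;
}

static void
tracker_add(Tracker *t, const char *lexeme)
{
    int i;

    /* a real version would use the hashtable here, not a linear scan */
    for (i = 0; i < t->nitems; i++)
    {
        if (strcmp(t->items[i].lexeme, lexeme) == 0)
        {
            t->items[i].count++;
            return;
        }
    }
    if (t->nitems >= t->maxitems)
    {
        /* grow in sizable jumps (2X), not one slot at a time */
        t->maxitems *= 2;
        t->items = realloc(t->items, t->maxitems * sizeof(TrackItem));
    }
    t->items[t->nitems].lexeme = lexeme;
    t->items[t->nitems].count = 1;
    t->nitems++;
}

/* Cut off a bunch of low-scoring items in one go, instead of evicting
 * "exactly" one victim per incoming lexeme; survivors keep their counts,
 * so new data still has a chance of displacing stale entries later. */
static void
tracker_prune(Tracker *t, int cutoff)
{
    int i, j = 0;

    for (i = 0; i < t->nitems; i++)
    {
        if (t->items[i].count >= cutoff)
            t->items[j++] = t->items[i];
    }
    t->nitems = j;
}
```

The point of the prune step is exactly the adaptivity argument above: since it frees many slots at once, lexemes that only start appearing late in the scan can still accumulate counts, instead of fighting over a single last slot.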
Another possibility is to have a prescan that only determines the width of the physically largest tsvector (you can get the width without detoasting, so this would be quite cheap), then estimate the number of lexemes corresponding to that using some fudge-factor guess about lexeme sizes, and stick with the resulting list size regardless of what the actual lexeme counts are. I kinda like the latter, since its behavior is order-independent.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
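A toy version of that width-based sizing might look like the following. Both constants are assumptions made up for illustration (the fudge factor is an arbitrary guess at average lexeme footprint; the floor is the "couple hundred" starting estimate from the message), not anything PostgreSQL actually uses.

```c
/* Hypothetical sizing helper: turn the byte width of the physically
 * largest tsvector (obtainable without detoasting) into a tracking-list
 * size.  BYTES_PER_LEXEME is a pure fudge factor; MIN_TRACK_SLOTS is
 * the "couple hundred" floor suggested above. */
#define BYTES_PER_LEXEME 32
#define MIN_TRACK_SLOTS  200

static int
estimate_track_size(int max_tsvector_width)
{
    int guess = max_tsvector_width / BYTES_PER_LEXEME;

    return (guess > MIN_TRACK_SLOTS) ? guess : MIN_TRACK_SLOTS;
}
```

Because the size is fixed by the prescan rather than by running counts, the result is the same no matter what order the rows happen to be visited in, which is the order-independence property mentioned above.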