Tom Lane wrote:
target would typically be around 10 or so.  It really wasn't designed to
be industrial-strength code.  (It was still a big improvement over what
we'd had before :-(.)  So I'm not very comfortable with taking it as the
design base for something that does need to be industrial-strength.

I did some googling and found a paper describing a method of determining most frequent elements from a stream called Lossy Counting:
http://www.vldb.org/conf/2002/S10P03.pdf
The following is an adaptation of an algorithm described in section 4.2 of that paper. I'll summarize it and propose an implementation using existing PG data structures.

The data structure (called D) is a set of entries in the form (e, f, d), where e is the lexeme, f is an integer representing the estimated frequency and d is the maximum possible error in f. We process tsvectors in batches. After each w tsvectors we will be pruning our data structure, deleting entries that are not frequent enough. Let b be the number of the current batch (tsvectors 1 to w are processed with b = 1, tsvectors w+1 to w*2 are processed with b = 2 and so on). Observe that w is an arbitrary number. Initially, D is empty. For each lexeme e we look up its entry in D. If it's found, we increment f in that entry. If not, we add a new entry in the form of (e, 1, b - 1).
After each w tsvectors we delete all entries from D, that have f + d <= b.

For example, consider such a list of tsvectors (numbers stand for lexemes, each line is a tsvector):
1 2 3
2 3 4 5
1 2
1 3 4
2 3 4 5
2 4 5 6

Let w = 3.
After processing the first tsvector D would look like so:
(1, 1, 0), (2, 1, 0), (3, 1, 0)
After processing the second tsvector D would be:
(1, 1, 0), (2, 2, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)
After the third tsvector we'd have:
(1, 2, 0), (2, 3, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)

And now we'd do the pruning. b = 1, so all entries with f + d <= 1 are discarded. D is then:
(1, 2, 0), (2, 3, 0), (3, 2, 0)

After processing the fourth tsvector we get:
(1, 3, 0), (2, 3, 0), (3, 3, 0), (4, 1, 1)
And then:
(1, 3, 0), (2, 4, 0), (3, 4, 0), (4, 2, 1), (5, 1, 1)
And finally:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1), (6, 1, 1)

Now comes the time for the second pruning. b = 2, to D is left with:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1)

Finally we return entries with the largest values of f, their number depending on statistics_target. That ends the algorithm.

The original procedure from the paper prunes the entries after each "element batch" is processed, but the batches are of equal size. With tsvectors it seems sensible to divide lexemes into batches of different sizes, so the pruning always takes place after all the lexemes from a tsvector have been processed. I'm not totally sure if that change does not destroy some properties of that algorithm, but it looks reasonably sane.

As to the implementation: like Tom suggested, we could use a hashtable as D and an array of pointers to the hashtable entries, that would get updated each time a new hashtable entry would be added. Pruning would then require a qsort() of that table taking (f + d) as the sort key, but other than that we could just add new entries at the end of the array. An open question is: what size should that array be? I think that by knowing how long is the longest tsvector we could come up with a good estimate, but this needs to be thought over. Anyway, we can always repalloc().

Getting the max tsvector length by looking at the datum widths would be a neat trick. I just wonder: if we have to detoast all these tsvectors anyway, is it not possible to do that and do two passes over already detoasted datums? I've got to admit I don't really understand how toasting works, so I'm probably completely wrong on that.

Does this algorithm sound sensible? I might have not understood some things from that publication, they give some proofs of why their algorithm yields good results and takes up little space, but I'm not all that certain it can be so easily adapted.

If you think the Lossy Counting method has potential, I could test it somehow. Using my current work I could extract a stream of lexemes as ANALYZE sees it and run it through a python implementation of the algorithm to see if the result makes sense.

This is getting more complicated, than I thought it would :^)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to