Re: [HACKERS] gsoc, text search selectivity and dllist enhancments

Jan Urbański Mon, 07 Jul 2008 14:55:16 -0700

Tom Lane wrote:

target would typically be around 10 or so.  It really wasn't designed to
be industrial-strength code.  (It was still a big improvement over what
we'd had before :-(.)  So I'm not very comfortable with taking it as the
design base for something that does need to be industrial-strength.

I did some googling and found a paper describing a method of determiningmost frequent elements from a stream called Lossy Counting:

http://www.vldb.org/conf/2002/S10P03.pdf

The following is an adaptation of an algorithm described in section 4.2of that paper. I'll summarize it and propose an implementation usingexisting PG data structures.

The data structure (called D) is a set of entries in the form (e, f, d),where e is the lexeme, f is an integer representing the estimatedfrequency and d is the maximum possible error in f. We process tsvectorsin batches. After each w tsvectors we will be pruning our datastructure, deleting entries that are not frequent enough. Let b be thenumber of the current batch (tsvectors 1 to w are processed with b = 1,tsvectors w+1 to w*2 are processed with b = 2 and so on). Observe that wis an arbitrary number.Initially, D is empty. For each lexeme e we look up its entry in D. Ifit's found, we increment f in that entry. If not, we add a new entry inthe form of (e, 1, b - 1).

After each w tsvectors we delete all entries from D, that have f + d <= b.

For example, consider such a list of tsvectors (numbers stand forlexemes, each line is a tsvector):

1 2 3
2 3 4 5
1 2
1 3 4
2 3 4 5
2 4 5 6

Let w = 3.
After processing the first tsvector D would look like so:
(1, 1, 0), (2, 1, 0), (3, 1, 0)
After processing the second tsvector D would be:
(1, 1, 0), (2, 2, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)
After the third tsvector we'd have:
(1, 2, 0), (2, 3, 0), (3, 2, 0), (4, 1, 0), (5, 1, 0)

And now we'd do the pruning. b = 1, so all entries with f + d <= 1 arediscarded. D is then:

(1, 2, 0), (2, 3, 0), (3, 2, 0)

After processing the fourth tsvector we get:
(1, 3, 0), (2, 3, 0), (3, 3, 0), (4, 1, 1)
And then:
(1, 3, 0), (2, 4, 0), (3, 4, 0), (4, 2, 1), (5, 1, 1)
And finally:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1), (6, 1, 1)

Now comes the time for the second pruning. b = 2, to D is left with:
(1, 3, 0), (2, 5, 0), (3, 4, 0), (4, 3, 1), (5, 2, 1)

Finally we return entries with the largest values of f, their numberdepending on statistics_target. That ends the algorithm.

The original procedure from the paper prunes the entries after each"element batch" is processed, but the batches are of equal size. Withtsvectors it seems sensible to divide lexemes into batches of differentsizes, so the pruning always takes place after all the lexemes from atsvector have been processed. I'm not totally sure if that change doesnot destroy some properties of that algorithm, but it looks reasonably sane.

As to the implementation: like Tom suggested, we could use a hashtableas D and an array of pointers to the hashtable entries, that would getupdated each time a new hashtable entry would be added. Pruning wouldthen require a qsort() of that table taking (f + d) as the sort key, butother than that we could just add new entries at the end of the array.An open question is: what size should that array be? I think that byknowing how long is the longest tsvector we could come up with a goodestimate, but this needs to be thought over. Anyway, we can alwaysrepalloc().

Getting the max tsvector length by looking at the datum widths would bea neat trick. I just wonder: if we have to detoast all these tsvectorsanyway, is it not possible to do that and do two passes over alreadydetoasted datums? I've got to admit I don't really understand howtoasting works, so I'm probably completely wrong on that.

Does this algorithm sound sensible? I might have not understood somethings from that publication, they give some proofs of why theiralgorithm yields good results and takes up little space, but I'm notall that certain it can be so easily adapted.

If you think the Lossy Counting method has potential, I could test itsomehow. Using my current work I could extract a stream of lexemes asANALYZE sees it and run it through a python implementation of thealgorithm to see if the result makes sense.


This is getting more complicated, than I thought it would :^)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] gsoc, text search selectivity and dllist enhancments

Reply via email to