On Dec 27, 2010, at 23:49, Kevin Grittner wrote:
> Robert Haas <robertmh...@gmail.com> wrote:
>
>> With respect to (b), I think I'd need to see a much more detailed
>> design for how you intend to make this work. Off the top of my
>> head there seems to be some pretty serious feasibility problems.
>
> I had one random thought on that -- it seemed like a large concern
> was that there would need to be at least an occasional scan of the
> entire table to rebuild the distinct value information.
I believe we could actually avoid that.

First, the paper "An Optimal Algorithm for the Distinct Elements Problem" actually contains an algorithm which *does* handle deletes - it's called the "L_0" estimate there.

Second, as Tomas pointed out, the stream-based estimator is essentially a simplified version of a bloom filter. It starts out with a field of N zero bits, and sets K of them to 1 for each value v in the stream. Which bits are set to 1 depends on some hash function(s) H_i(v). It's then easy to compute how many 1-bits you'd expect to find in the bit field after seeing D distinct values, and by inverting that relationship you can estimate D from the number of 1-bits actually present in the bit field.

To avoid having to rescan large tables, instead of storing one such bit field, we'd store one per B pages of data. We'd then only need to scan the range of B pages around every updated or deleted tuple, and could afterwards compute a new global estimate of D by combining the individual bit fields with bitwise OR. Since the need to regularly VACUUM tables hit by updates or deletes won't go away any time soon, we could piggy-back the bit field rebuilding onto VACUUM to avoid a second scan.

A good value for B would probably be around 1000 * <size of bit field> / <page size>. If the bit field needs ~100kB, that'd make B ~= 12000 pages ~= 100MB.

best regards,
Florian Pflug
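To make the idea concrete, here's a minimal sketch of such a bit-field estimator. All names and parameter choices (BITS, K, the SHA-256-based hash) are illustrative assumptions, not anything from PostgreSQL; the inversion used is the usual linear-counting formula D ~= -(N/K) * ln(zeros/N), and two per-range bit fields are merged with bitwise OR before estimating over their union:

```python
# Illustrative sketch only -- parameters and hashing scheme are made up,
# not taken from any actual PostgreSQL implementation.
import hashlib
import math

BITS = 4096  # N: bits per bit field
K = 2        # K: bits set per value


def positions(value, k=K, n=BITS):
    """Yield the k bit positions H_i(value) for a value."""
    for i in range(k):
        h = hashlib.sha256(("%d:%s" % (i, value)).encode()).digest()
        yield int.from_bytes(h[:8], "big") % n


def add(field, value):
    """Set the K bits for one value in the stream."""
    for p in positions(value):
        field[p // 8] |= 1 << (p % 8)


def estimate(field, n=BITS, k=K):
    """Estimate D by inverting E[zero bits] ~= n * exp(-k*D/n)."""
    ones = sum(bin(b).count("1") for b in field)
    zeros = n - ones
    if zeros == 0:
        return float("inf")  # field saturated; estimate unusable
    return -n / k * math.log(zeros / n)


# One bit field per range of B pages; a global estimate comes from
# OR-ing the per-range fields (OR of the fields corresponds to the
# union of the value sets they saw).
field_a = bytearray(BITS // 8)
field_b = bytearray(BITS // 8)
for v in range(500):
    add(field_a, v)
for v in range(500, 1000):
    add(field_b, v)

union = bytearray(a | b for a, b in zip(field_a, field_b))
global_estimate = estimate(union)  # should be close to 1000
```

After an update or delete, only the one affected range's field would be rebuilt by rescanning its B pages; the cheap OR-and-estimate step then refreshes the global D without touching the rest of the table.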