> -----Original Message-----
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 24, 2005 2:08 PM
> To: Andrew Dunstan
> Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
> 
> [...]
> Actually, that paper looks *really* promising.   Does anyone here have
> enough math to solve for D(sub)Md on page 6?   I'd like to test it on
> samples of < 0.01%.    
> [...]

D_Md = [1 - sqrt(f_1 / s)] D_b + sqrt(f_1 / s) D_B

s = block size

f~_1 = median frequency within blocks for distinct values occurring in
  only one block

D_b = d + f_1^(b+1)

d = distinct classes in the sample

f_1^(b+1) = number of distinct values occurring in a single block in
  a sample of b+1 blocks

D_B = d + [B / (b + 1)] f_1^(b+1)

b = sample size (in blocks)

B = total table size (in blocks)

f_k and f~_k are the only tricky functions here, but they are easy to 
understand:

Suppose our column contains values from the set {a, b, c, ..., z}.
Suppose we have a sample of b = 10 blocks.
Suppose that the value 'c' occurs in exactly 3 blocks (we don't care
how often it occurs *within* those blocks).
Suppose that the value 'f' also occurs in exactly 3 blocks.
Suppose that the values 'h', 'p', and 'r' occur in exactly 3 blocks.
Suppose that no other value occurs in exactly 3 blocks.

f_3^b = 5

This is because there are 5 distinct values that occur in exactly
3 blocks.  f_1^b is the number of distinct values that occur in
exactly 1 block, regardless of how often it occurs within that block.

Note that when you select a sample size of b blocks, you actually
need to sample b+1 blocks to compute f_1^(b+1).  This is actually
pedantry since all occurrences of b in the formula are really b+1.

f~ is slightly trickier.  First, we pick the distinct values that
occur in only one block.  Then, we count how often each value
occurs within its block.  To wit:

Suppose we have a set {d, q, y, z} of values that occur in only
one block.
Suppose that d occurs 3x, q occurs 1x, y occurs 8x, and z occurs 6x.

The function f- would take the mean of these counts to determine
the "cluster frequency".  So f- here would be 4.5.  This allows
one to compute D_MF.

The function f~ takes the median of this sample, which is 3 or 6
(or I suppose you could even take the mean of the two medians if
you wanted).

No tricky math involved.  That should be enough to tell you how to
write the estimator.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East,  Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

Reply via email to