> -----Original Message----- > From: Josh Berkus [mailto:[EMAIL PROTECTED] > Sent: Sunday, April 24, 2005 2:08 PM > To: Andrew Dunstan > Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? > > [...] > Actually, that paper looks *really* promising. Does anyone here have > enough math to solve for D(sub)Md on page 6? I'd like to test it on > samples of < 0.01%. > [...]
D_Md = [1 - sqrt(f_1 / s)] D_b + sqrt(f_1 / s) D_B s = block size f~_1 = median frequency within blocks for distinct values occurring in only one block D_b = d + f_1^(b+1) d = distinct classes in the sample f_1^(b+1) = number of distinct values occurring in a single block in a sample of b+1 blocks D_B = d + [B / (b + 1)] f_1^(b+1) b = sample size (in blocks) B = total table size (in blocks) f_k and f~_k are the only tricky functions here, but they are easy to understand: Suppose our column contains values from the set {a, b, c, ..., z}. Suppose we have a sample of b = 10 blocks. Suppose that the value 'c' occurs in exactly 3 blocks (we don't care how often it occurs *within* those blocks). Suppose that the value 'f' also occurs in exactly 3 blocks. Suppose that the values 'h', 'p', and 'r' occur in exactly 3 blocks. Suppose that no other value occurs in exactly 3 blocks. f_3^b = 5 This is because there are 5 distinct values that occur in exactly 3 blocks. f_1^b is the number of distinct values that occur in exactly 1 block, regardless of how often it occurs within that block. Note that when you select a sample size of b blocks, you actually need to sample b+1 blocks to compute f_1^(b+1). This is actually pedantry since all occurrences of b in the formula are really b+1. f~ is slightly trickier. First, we pick the distinct values that occur in only one block. Then, we count how often each value occurs within its block. To wit: Suppose we have a set {d, q, y, z} of values that occur in only one block. Suppose that d occurs 3x, q occurs 1x, y occurs 8x, and z occurs 6x. The function f- would take the mean of these counts to determine the "cluster frequency". So f- here would be 4.5. This allows one to compute D_MF. The function f~ takes the median of this sample, which is 3 or 6 (or I suppose you could even take the mean of the two medians if you wanted). No tricky math involved. That should be enough to tell you how to write the estimator. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129 ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]