Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-04 Thread John A Meinel
Josh Berkus wrote: Mischa, Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) Bad analogy, but funny. The issue with page-based vs. pure random sampling is th

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Josh Berkus : > Mischa, > > > Okay, although given the track record of page-based sampling for > > n-distinct, it's a bit like looking for your keys under the > streetlight, > > rather than in the alley where you dropped them :-) > > Bad analogy, but funny. Bad analogy? Page-

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
John, > But doesn't an index only sample one column at a time, whereas with > page-based sampling, you can sample all of the columns at once. Hmmm. Yeah, we're not currently doing that though. Another good idea ... -- --Josh Josh Berkus Aglio Database Solutions San Francisco --

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Josh Berkus
Mischa, > Okay, although given the track record of page-based sampling for > n-distinct, it's a bit like looking for your keys under the streetlight, > rather than in the alley where you dropped them :-) Bad analogy, but funny. The issue with page-based vs. pure random sampling is that to do, fo

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Mischa Sandberg
Quoting Markus Schaber <[EMAIL PROTECTED]>: > Hi, Josh, > > Josh Berkus wrote: > > > Yes, actually. We need 3 different estimation methods: > > 1 for tables where we can sample a large % of pages (say, >= 0.1) > > 1 for tables where we sample a small % of pages but are "easily > estimated" > >

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-05-03 Thread Markus Schaber
Hi, Josh, Josh Berkus wrote: > Yes, actually. We need 3 different estimation methods: > 1 for tables where we can sample a large % of pages (say, >= 0.1) > 1 for tables where we sample a small % of pages but are "easily estimated" > 1 for tables which are not easily estimated by we can't afford

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Marko Ristola
First I will comment my original idea. Second I will give another improved suggestion (an idea). I hope, that they will be useful for you. (I don't know, wether the first one was useful at all because it showed, that I and some others of us are not very good with statistics :( ) I haven't looked ab

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-28 Thread Mischa Sandberg
Quoting Josh Berkus : > > >Perhaps I can save you some time (yes, I have a degree in Math). If I > > >understand correctly, you're trying extrapolate from the correlation > > >between a tiny sample and a larger sample. Introducing the tiny sample > > >into any decision can only produce a less accu

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Greg Stark
"Dave Held" <[EMAIL PROTECTED]> writes: > > Actually, it's more to characterize how large of a sample > > we need. For example, if we sample 0.005 of disk pages, and > > get an estimate, and then sample another 0.005 of disk pages > > and get an estimate which is not even close to the first > >

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Dave Held
> -Original Message- > From: Josh Berkus [mailto:[EMAIL PROTECTED] > Sent: Wednesday, April 27, 2005 10:25 AM > To: Andrew Dunstan > Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation;

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Josh Berkus
Mischa, > >Perhaps I can save you some time (yes, I have a degree in Math). If I > >understand correctly, you're trying extrapolate from the correlation > >between a tiny sample and a larger sample. Introducing the tiny sample > >into any decision can only produce a less accurate result than just

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Dave Held
> -Original Message- > From: Greg Stark [mailto:[EMAIL PROTECTED] > Sent: Wednesday, April 27, 2005 1:00 AM > To: Tom Lane > Cc: Rod Taylor; Greg Stark; pgsql-hackers@postgresql.org; > Gurmeet Manku > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Dave Held
TECTED] > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? > > [...] > 2. In a single scan, it is possible to estimate n_distinct by using > a very simple algorithm: > > "Distinct sampling for highly-accurate answers to distinct val

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Rod Taylor
On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: > This one looks *really* good. > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf > > It does require a single full table scan Ack.. Not by default please. I have a few large append-only tables (vacuum isn't necessary) whi

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Andrew Dunstan
Mischa Sandberg wrote: Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Gurmeet Manku
Hi everybody! Perhaps the following papers are relevant to the discussion here (their contact authors have been cc'd): 1. The following proposes effective algorithms for using block-level sampling for n_distinct estimation: "Effective use of block-level sampling in statistics estimat

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-27 Thread Simon Riggs
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote: > 2. In a single scan, it is possible to estimate n_distinct by using > a very simple algorithm: > > "Distinct sampling for highly-accurate answers to distinct value > queries and event reports" by Gibbons, VLDB 2001. > > http://ww

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes: > Rod Taylor <[EMAIL PROTECTED]> writes: > > If when we have partitions, that'll be good enough. If partitions aren't > > available this would be quite painful to anyone with large tables -- > > much as the days of old used to be painful for ANALYZE. > > Yeah

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Mischa Sandberg
Quoting Andrew Dunstan <[EMAIL PROTECTED]>: > After some more experimentation, I'm wondering about some sort of > adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. > > The idea would be that we take a sample (either of fixed size, or > some sm

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Tom Lane
Rod Taylor <[EMAIL PROTECTED]> writes: > If when we have partitions, that'll be good enough. If partitions aren't > available this would be quite painful to anyone with large tables -- > much as the days of old used to be painful for ANALYZE. Yeah ... I am very un-enthused about these suggestions

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Rod Taylor
On Tue, 2005-04-26 at 19:28 -0400, Greg Stark wrote: > Rod Taylor <[EMAIL PROTECTED]> writes: > > > On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: > > > This one looks *really* good. > > > > > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf > > > > > > It does require a

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Greg Stark
Rod Taylor <[EMAIL PROTECTED]> writes: > On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: > > This one looks *really* good. > > > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf > > > > It does require a single full table scan > > Ack.. Not by default please. > > I have

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Greg Stark
This one looks *really* good. http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf It does require a single full table scan but it works in O(n) time and constant space and it guarantees the confidence intervals for the estimates it provides like the histograms do for regular range s

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Andrew Dunstan
Simon Riggs wrote: The comment * Every value in the sample appeared more than once. Assume * the column has just these values. doesn't seem to apply when using larger samples, as Josh is using. Looking at Josh's application it does seem likely that when taking a sample, all site

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Simon Riggs
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: > >> It's not just the scan --- you also have to sort, or something like > >> that, if you want to count distinct values. I doubt anyone is really > >

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Josh Berkus
Simon, > Could it be that we have overlooked this simple explanation and that the > Haas and Stokes equation is actually quite good, but just not being > applied? That's probably part of it, but I've tried Haas and Stokes on a pure random sample and it's still bad, or more specifically overly co

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-26 Thread Simon Riggs
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote: > Josh Berkus writes: > > Overall, our formula is inherently conservative of n_distinct. That is, I > > believe that it is actually computing the *smallest* number of distinct > > values which would reasonably produce the given sample, rather

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Dave Held
> -Original Message- > From: Andrew Dunstan [mailto:[EMAIL PROTECTED] > Sent: Monday, April 25, 2005 3:43 PM > To: josh@agliodbs.com > Cc: pgsql-perform; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? &

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes: > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: >> It's not just the scan --- you also have to sort, or something like >> that, if you want to count distinct values. I doubt anyone is really >> going to consider this a feasible answer for large tables.

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Dave Held
> -Original Message- > From: Josh Berkus [mailto:[EMAIL PROTECTED] > Sent: Sunday, April 24, 2005 2:08 PM > To: Andrew Dunstan > Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_disti

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Andrew Dunstan
Josh Berkus wrote: Simon, Tom: While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. Setting up these samples as a % of da

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Josh Berkus
Simon, Tom: While it's not possible to get accurate estimates from a fixed size sample, I think it would be possible from a small but scalable sample: say, 0.1% of all data pages on large tables, up to the limit of maintenance_work_mem. Setting up these samples as a % of data pages, rather th

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Josh Berkus
Guys, > While it's not possible to get accurate estimates from a fixed size sample, > I think it would be possible from a small but scalable sample: say, 0.1% of > all data pages on large tables, up to the limit of maintenance_work_mem. BTW, when I say "accurate estimates" here, I'm talking about

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Simon Riggs
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > My suggested hack for PostgreSQL is to have an option to *not* sample, > > just to scan the whole table and find n_distinct accurately. > > ... > > What price a single scan of a table, however large, wh

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Dave Held
> -Original Message- > From: Tom Lane [mailto:[EMAIL PROTECTED] > Sent: Monday, April 25, 2005 10:23 AM > To: Simon Riggs > Cc: josh@agliodbs.com; Greg Stark; Marko Ristola; pgsql-perform; > pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_disti

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes: > My suggested hack for PostgreSQL is to have an option to *not* sample, > just to scan the whole table and find n_distinct accurately. > ... > What price a single scan of a table, however large, when incorrect > statistics could force scans and sorts to occu

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-25 Thread Simon Riggs
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote: Greg Stark wrote > > I looked into this a while back when we were talking about changing the > > sampling method. The conclusions were discouraging. Fundamentally, using > > constant sized samples of data for n_distinct is bogus. Constant sized

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Tom Lane
Josh Berkus writes: > Tom, how does our heuristic sampling work? Is it pure random sampling, or > page sampling? Manfred probably remembers better than I do, but I think the idea is to approximate pure random sampling as best we can without actually examining every page of the table.

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Marko Ristola
Here is my opinion. I hope this helps. Maybe there is no one good formula: On boolean type, there are at most 3 distinct values. There is an upper bound for fornames in one country. There is an upper bound for last names in one country. There is a fixed number of states and postal codes in one coun

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Folks, > I wonder if this paper has anything that might help: > http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I > were more of a statistician I might be able to answer :-) Actually, that paper looks *really* promising. Does anyone here have enough math to solve for D(s

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Josh Berkus
Andrew, > The math in the paper does not seem to look at very low levels of q (= > sample to pop ratio). Yes, I think that's the failing. Mind you, I did more testing and found out that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy (which I would consider acceptable)

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-24 Thread Andrew Dunstan
Tom Lane wrote: Josh Berkus writes: Overall, our formula is inherently conservative of n_distinct. That is, I believe that it is actually computing the *smallest* number of distinct values which would reasonably produce the given sample, rather than the *median* one. This is contrary to

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Tom Lane
Josh Berkus writes: > Overall, our formula is inherently conservative of n_distinct. That is, I > believe that it is actually computing the *smallest* number of distinct > values which would reasonably produce the given sample, rather than the > *median* one. This is contrary to the notes in

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Andrew Dunstan
Josh Berkus said: > > > Well, unusual distributions are certainly tough. But I think the > problem exists even for relatively well-distributed populations. > Part of it is, I believe, the formula we are using: > > n*d / (n - f1 + f1*n/N) > [snip] > > This is so broken, in fact, that I'm wonderin

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
People, > Can someone whose math is more recent than calculus in 1989 take a look at > that paper, and look at the formula toward the bottom of page 10, and see > if we are correctly interpreting it?    I'm particularly confused as to > what "q" and "d-sub-n" represent.  Thanks! Actually, I manag

Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?

2005-04-23 Thread Josh Berkus
Greg, > I looked into this a while back when we were talking about changing the > sampling method. The conclusions were discouraging. Fundamentally, using > constant sized samples of data for n_distinct is bogus. Constant sized > samples only work for things like the histograms that can be analyze