Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 21:24 +0100, Manfred Koizar wrote: > On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs <[EMAIL PROTECTED]> wrote: > > I enclose a patch for checking out block sampling. > Can't comment on the merits of block sampling and your implementation thereof. But your thoughts ar

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes: > Tom has not spoken against checking for UNIQUE constraints: he is just pointing out that there never could be a constraint in the case I was identifying. More generally, arguing for or against any system-wide change on the basis of performance of compu

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Manfred Koizar
On Fri, 13 Jan 2006 19:18:29 +0000, Simon Riggs <[EMAIL PROTECTED]> wrote: > I enclose a patch for checking out block sampling. Can't comment on the merits of block sampling and your implementation thereof. Just some nitpicking: |! * Row Sampling: As of May 2004, we use the Vitter algorithm to c
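
For context, the Vitter algorithm named in that analyze.c comment is a reservoir-sampling technique: it maintains a fixed-size uniform random sample over a stream of rows whose total count isn't known in advance. A minimal Python sketch of the underlying idea, using the simpler Algorithm R rather than Vitter's skip-ahead variant:

    import random

    def reservoir_sample(rows, k):
        """Uniform k-row sample from a stream of unknown length."""
        sample = []
        for t, row in enumerate(rows):
            if t < k:
                sample.append(row)        # fill the reservoir first
            else:
                j = random.randint(0, t)  # 0..t inclusive
                if j < k:                 # keep row with probability k/(t+1)
                    sample[j] = row
        return sample

Vitter's contribution is computing how many rows can be skipped between replacements, so the sampler need not draw a random number for every row.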

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Simon Riggs
On Mon, 2006-01-16 at 12:26 -0600, Jim C. Nasby wrote: > On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote: > > Josh Berkus writes: > > >> It's also worth mentioning that for datatypes that only have an "=" operator the performance of compute_minimal_stats is O(N^2) when values

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-16 Thread Jim C. Nasby
On Fri, Jan 13, 2006 at 11:37:38PM -0500, Tom Lane wrote: > Josh Berkus writes: > >> It's also worth mentioning that for datatypes that only have an "=" operator the performance of compute_minimal_stats is O(N^2) when values are unique, so increasing sample size is a very bad idea in tha
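
The O(N^2) behaviour arises because, with only an "=" operator, the sampled values can't be sorted: each incoming value must be compared against every value tracked so far, and when all values are distinct the tracking list grows to the full sample size. A schematic Python sketch of that pattern (not the actual compute_minimal_stats code):

    def track_values_eq_only(values):
        tracked = []                      # [value, count] pairs
        for v in values:
            for item in tracked:          # linear scan per incoming row
                if item[0] == v:
                    item[1] += 1
                    break
            else:
                tracked.append([v, 1])    # unique values keep the list growing
        return tracked

With N unique values this performs roughly N^2/2 equality tests, which is why enlarging the sample is so costly in that case.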

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-13 Thread Tom Lane
Josh Berkus writes: >> It's also worth mentioning that for datatypes that only have an "=" operator the performance of compute_minimal_stats is O(N^2) when values are unique, so increasing sample size is a very bad idea in that case. > Hmmm ... does ANALYZE check for UNIQUE constraints? Ou

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-13 Thread Josh Berkus
Simon, > It's also worth mentioning that for datatypes that only have an "=" operator the performance of compute_minimal_stats is O(N^2) when values are unique, so increasing sample size is a very bad idea in that case. > It may be possible to re-sample the sample, so that we get only one row

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-13 Thread Simon Riggs
On Fri, 2006-01-06 at 15:26 -0800, Josh Berkus wrote: > Anyway, since the proof is in the pudding, Simon and I will be working on some demo code for different sampling methods so that we can debate results rather than theory. I enclose a patch for checking out block sampling. This is not pr

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-10 Thread Jim C. Nasby
FWIW, results from an FBSD 6.0 box. Between every run I ran code to zero 800MB (out of 1GB) of memory, which should have effectively flushed the OS cache (it would just put the OS into swapping). Reading 1% (1280/128000 blocks, 1048576000 bytes): total time 6870595us, MB/s 1.53, effective MB/s 152.62. Rea
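
The two throughput figures decode as follows: "MB/s" is the bytes actually read divided by elapsed time, and "effective MB/s" is the whole table size divided by the same time, i.e. an effective rate of 152 MB/s means the 1% random read finished in the time a 152 MB/s sequential scan of the whole table would have taken. Checking the 1% run:

    blocks_read, block_size = 1280, 8192
    table_bytes = 1_048_576_000            # 128000 blocks of 8192 bytes
    elapsed_s = 6_870_595 / 1e6            # reported total time

    actual = blocks_read * block_size / elapsed_s / 1e6    # ~1.53 MB/s
    effective = table_bytes / elapsed_s / 1e6              # ~152.62 MB/s
    print(f"{actual:.2f} {effective:.2f}")

So the seeks reduce actual throughput to about 1/100th of the effective rate.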

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-10 Thread Greg Stark
Simon Riggs <[EMAIL PROTECTED]> writes: > On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote: > > So it's not the 8k block reading that's fooling Linux into reading ahead 32k. > > It seems 32k readahead is the default for Linux, or perhaps it's the sequential access pattern that's tr

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-10 Thread Simon Riggs
On Mon, 2006-01-09 at 22:08 -0500, Greg Stark wrote: > So it's not the 8k block reading that's fooling Linux into reading ahead 32k. > It seems 32k readahead is the default for Linux, or perhaps it's the sequential access pattern that's triggering it. Nah, Linux 2.6 uses flexible readahead logi

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark
Greg Stark <[EMAIL PROTECTED]> writes: > Well my theory was sort of half right. It has nothing to do with fooling Linux into thinking it's a sequential read. Apparently this filesystem was created with 32k blocks. I don't remember if that was intentional or if ext2/3 did it automatically ba

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark
Greg Stark <[EMAIL PROTECTED]> writes: > > > These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as readin

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark
> > These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as reading 20% of the table. I have a theory. My test

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Greg Stark
Simon Riggs <[EMAIL PROTECTED]> writes: > - yes, I would expect the results you get. If you sample 5% of rows and each block has on average at least 20 rows, then we should expect the majority of blocks to be hit. These results are from my test program. 5% means 5% of 8k blocks from the test
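
Simon's "majority of blocks" expectation can be made precise: if each row is sampled independently with probability p and a block holds r rows, the chance a block escapes untouched is (1-p)^r. A quick check of the claim:

    def frac_blocks_touched(p, rows_per_block):
        # Expected fraction of blocks with at least one sampled row,
        # assuming independent row sampling (an approximation; the real
        # sampler draws an exact number of rows).
        return 1 - (1 - p) ** rows_per_block

    print(frac_blocks_touched(0.05, 20))    # ~0.64
    print(frac_blocks_touched(0.05, 100))   # ~0.99

So a 5% row sample touches about two thirds of the blocks at 20 rows per block, and essentially all of them at 100 rows per block, which is why row-percentage sampling approaches a full-table read so quickly.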

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Lukas Smith
Simon Riggs wrote: - yes, the random sampling is random - please read the code and comments - yes, I would expect the results you get. If you sample 5% of rows and each block has on average at least 20 rows, then we should expect the majority of blocks to be hit. and it seems from the benchma

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-09 Thread Simon Riggs
On Fri, 2006-01-06 at 16:13 -0500, Greg Stark wrote: > "Jim C. Nasby" <[EMAIL PROTECTED]> writes: > > Before we start debating merits of proposals based on random reads, can someone confirm that the sampling code actually does read randomly? I looked at it yesterday; there is a comment t

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-08 Thread Kenneth Marshall
On Fri, Jan 06, 2006 at 06:36:52PM -0500, Greg Stark wrote: > Josh Berkus writes: > > > These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think readin

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Greg Stark
Josh Berkus writes: > > These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as reading 20% of the table. > It's about

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Josh Berkus
Greg, > These numbers don't make much sense to me. It seems like 5% is about as slow as reading the whole file which is even worse than I expected. I thought I was being a bit pessimistic to think reading 5% would be as slow as reading 20% of the table. It's about what *I* expected. Disk s
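
A back-of-envelope sketch with assumed 2006-era disk numbers (not figures from the thread): roughly 8 ms per random 8k read and roughly 60 MB/s sequential makes the crossover obvious:

    table_mb, blocks = 1000.0, 128_000
    seek_ms, seq_mb_s = 8.0, 60.0          # assumed drive characteristics

    random_5pct_s = 0.05 * blocks * seek_ms / 1000   # ~51 s of seeks
    full_scan_s = table_mb / seq_mb_s                # ~17 s sequential
    print(random_5pct_s, full_scan_s)

Under those assumptions, randomly reading 5% of the blocks already costs about three full sequential scans, matching the observation that 5% sampling can be as slow as reading the whole file.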

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Greg Stark
"Jim C. Nasby" <[EMAIL PROTECTED]> writes: > Before we start debating merits of proposals based on random reads, can > someone confirm that the sampling code actually does read randomly? I > looked at it yesterday; there is a comment that states that blocks to be > scanned are passed to the analy

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes: > Before we start debating merits of proposals based on random reads, can > someone confirm that the sampling code actually does read randomly? Well, it's not so much that it's not "random", as that it's not sequential --- it skips blocks, and therefore y

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-06 Thread Jim C. Nasby
On Fri, Jan 06, 2006 at 01:24:41AM -0500, Greg Stark wrote: > > 5% based on block-based sampling is reasonable; that means a straight 5% of the on-disk size of the table, so 5gb for a 100gb table. With random-row sampling, that would require as much as 25% of the table, making it easier

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark
Josh Berkus writes: > Greg, > > We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that
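
The two-stage scheme being defended here samples blocks first and then samples rows from within just those blocks, rather than keeping every row of each chosen block. A simplified Python sketch of the idea (the real analyze.c code is considerably more careful about bias):

    import random

    def two_stage_sample(nblocks, rows_in_block, n):
        """Pick n random blocks, then reservoir-sample n rows from
        only those blocks. rows_in_block(b) yields block b's rows."""
        chosen = random.sample(range(nblocks), min(n, nblocks))
        sample, seen = [], 0
        for b in chosen:
            for row in rows_in_block(b):
                if len(sample) < n:
                    sample.append(row)
                else:
                    j = random.randint(0, seen)   # 0..seen inclusive
                    if j < n:
                        sample[j] = row
                seen += 1
        return sample

Because the second stage re-randomizes across the rows of the chosen blocks, no block contributes all of its rows, which is what keeps within-block correlation from skewing the sample.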

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Jim C. Nasby
On Thu, Jan 05, 2006 at 10:12:29AM -0500, Greg Stark wrote: > Worse, my recollection from the paper I mentioned earlier was that sampling small percentages like 3-5% didn't get you an acceptable accuracy. Before you got anything reliable you found you were sampling very large percentages of t

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Josh Berkus
Greg, > We *currently* use a block based sampling algorithm that addresses this issue by taking care to select rows within the selected blocks in an unbiased way. You were proposing reading *all* the records from the selected blocks, which throws away that feature. The block-based algorithm

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark
Josh Berkus writes: > > These statements are at odds with my admittedly basic understanding of statistics. Isn't the power of a sample more related to the absolute size of the sample than the sample as fraction of the population? Why not just pick a smallish sample size, say a

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Greg Stark
Josh Berkus writes: > > Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. > > If you read the entire block and not appropriate samples then you'll introduce systematic sampl

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Rod Taylor
> > Do you *really* want the median estimate in these cases? Are you certain you do not want something with the opposite behavior of Chaudhuri's estimate so that for small sample sizes the bias is toward a high estimate of D? > > (Converges on D from the right instead of the left.)
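
For reference, one concrete estimator from this line of work is the "guaranteed-error" estimator (GEE) of Charikar, Chaudhuri, Motwani and Narasayya; whether that exact estimator is the "Chaudhuri's estimate" meant here is an assumption. It scales up only the values seen exactly once:

    from math import sqrt

    def gee_estimate(sample_counts, N):
        """sample_counts: value -> frequency in an n-row sample from
        an N-row table. Returns an estimate of D."""
        n = sum(sample_counts.values())
        f1 = sum(1 for c in sample_counts.values() if c == 1)
        d = len(sample_counts)
        return sqrt(N / n) * f1 + (d - f1)   # boost singletons by sqrt(N/n)

The quoted concern is that the estimator under discussion is biased low at small sample sizes, converging on D "from the left", where a high-biased alternative might fail more gracefully.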

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-05 Thread Simon Riggs
On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > The approach I suggested uses the existing technique for selecting random blocks, then either an exhaustive check on all of the rows in a block or the existing random row approach, dependin

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus
Folks, Nope, it's definitely proportional. As a simple example, a sample of 500 rows in a table of 1000 rows should yield stats estimates with 90%+ accuracy. But a sample of 500 rows in a 600,000,000 row table is so small as to be nearly useless; it's quite possible to get all the same val
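
The intuition can be checked directly: with 500 rows from a 600M-row table, samples from tables with wildly different D values look essentially identical, because nearly every sampled value shows up exactly once either way. A toy illustration (hypothetical numbers, uniform value distribution):

    import random

    def signature(D, n=500):
        """Sample n rows from a table whose values are spread evenly
        over D distinct values; report (distinct seen, repeated seen)."""
        counts = {}
        for _ in range(n):
            v = random.randrange(D)
            counts[v] = counts.get(v, 0) + 1
        dups = sum(1 for c in counts.values() if c > 1)
        return len(counts), dups

    print(signature(1_000_000))      # ~ (500, 0): all singletons
    print(signature(600_000_000))    # ~ (500, 0): indistinguishable

Since both samples are almost all singletons, they carry nearly no information to separate D = 1M from D = 600M, which is the sense in which the required sample is proportional to the table rather than constant.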

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus
Greg, Only if your sample is random and independent. The existing mechanism tries fairly hard to ensure that every record has an equal chance of being selected. If you read the entire block and not appropriate samples then you'll introduce systematic sampling errors. For example, if you read an

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus
Trent, Sorry to interrupt. The discussion is interesting, but I need some help to follow along. Thought-out commentary is welcome. Is "replace the algorithm" the same as saying "contextually use some estimate of D that is not Chaudhuri"? Yes. I favor a block-based approach like Brutlag, l

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Greg Stark
Simon Riggs <[EMAIL PROTECTED]> writes: > The approach I suggested uses the existing technique for selecting random blocks, then either an exhaustive check on all of the rows in a block or the existing random row approach, depending upon available memory. We need to check all of the rows in

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Trent Shipley
Sorry to interrupt. The discussion is interesting, but I need some help to follow along. On Wednesday 2006-01-04 17:07, Josh Berkus wrote: > Simon, > > - Are there any performance issues that can be directly attributed to mis-estimation of N-Distinct ("D") by the ANALYZE command? > Yes.

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote: > I think you're right that a reasonable sample size for this kind of estimate is going to be proportional to the table size, not the constant-sized sample that regular statistics need. Agreed [I said exactly that in April]; the counter

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
On Wed, 2006-01-04 at 17:57 -0600, Jim C. Nasby wrote: > On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote: > > 3. We should also apply multi-column heuristics to the estimation of D, once we have estimated all columns. For column groups (pairs, triples etc) that form part of a PK

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Greg Stark
Josh Berkus writes: > Tom, > > In general, estimating n-distinct from a sample is just plain a hard problem, and it's probably foolish to suppose we'll ever be able to do it robustly. What we need is to minimize the impact when we get it wrong. > Well, I think it's pretty wel

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
On Wed, 2006-01-04 at 14:49 -0500, Tom Lane wrote: > Simon Riggs <[EMAIL PROTECTED]> writes: > > [ ... a large amount of analysis based on exactly one test case ... ] [Hmmm, those are your opinions, not my words. Funny guy ;-) ] The one test case just happens to be a very common 1:M relationship,

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus
Simon, > - Are there any performance issues that can be directly attributed to mis-estimation of N-Distinct ("D") by the ANALYZE command? Yes. There's at least one query (maybe two) from TPC-H which bombs because of bad N-distinct estimation, even with stats_target = 1000. Based on my exper

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Jim C. Nasby
On Wed, Jan 04, 2006 at 07:10:29PM +0000, Simon Riggs wrote: > 3. We should also apply multi-column heuristics to the estimation of D, once we have estimated all columns. For column groups (pairs, triples etc) that form part of a PK, we know that it must be true that D1 * D2 ... Dk >= N. In m
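
The constraint in point 3 follows because each of the N rows carries a distinct combination of the PK columns, so the per-column distinct counts must multiply to at least N. A sketch of the proposed sanity clamp (illustrative, not existing ANALYZE behaviour):

    def clamp_pk_group(d_estimates, N):
        """Scale up per-column D estimates for columns forming a PK
        so that their product is at least the row count N."""
        prod = 1
        for d in d_estimates:
            prod *= d
        if prod >= N:
            return d_estimates            # already consistent
        factor = (N / prod) ** (1.0 / len(d_estimates))
        return [d * factor for d in d_estimates]

    # 2-column PK on a 1,000,000-row table, estimates clearly too low:
    print(clamp_pk_group([500, 800], 1_000_000))   # product raised to N

Spreading the correction evenly across the columns is just one choice; the thread leaves open how the adjustment should be apportioned.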

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Josh Berkus
Tom, > In general, estimating n-distinct from a sample is just plain a hard > problem, and it's probably foolish to suppose we'll ever be able to > do it robustly. What we need is to minimize the impact when we get > it wrong. Well, I think it's pretty well proven that to be accurate at all yo

Re: [HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Tom Lane
Simon Riggs <[EMAIL PROTECTED]> writes: > [ ... a large amount of analysis based on exactly one test case ... ] I think you are putting too much emphasis on fixing one case and not enough on considering what may happen in other cases ... In general, estimating n-distinct from a sample is just pla
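
For context on what "minimize the impact" is working around: when the sample contains values seen exactly once, analyze.c estimates n_distinct with the Duj1 estimator of Haas and Stokes (per the comments in that file). A sketch:

    def haas_stokes_duj1(n, N, d, f1):
        """n rows sampled from an N-row table; d distinct values seen,
        f1 of them exactly once. Returns the Duj1 estimate of D."""
        if f1 == 0:
            return d          # no singletons: assume all values seen
        return n * d / (n - f1 + f1 * n / N)

    # e.g. 30k rows sampled of 10M: 25k distinct, 22k singletons
    print(haas_stokes_duj1(30_000, 10_000_000, 25_000, 22_000))  # ~93k

The estimator's weakness, as the thread shows, is that the singleton count in a small sample says little about how many rare values the unsampled 99%+ of the table holds.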

[HACKERS] Improving N-Distinct estimation by ANALYZE

2006-01-04 Thread Simon Riggs
Improving N-Distinct estimation
===
v1.1

OBJECTIVES

Answer these points...
- Are there any performance issues that can be directly attributed to mis-estimation of N-Distinct ("D") by the ANALYZE command?
- If so, can we do better than we currently achieve? How?
- W