Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-06 Thread Ron Mayer
Tom Lane wrote: One objection to this is that after moving "off the gold standard" of 1.0 = one page fetch, there is no longer any clear meaning to the cost estimate units; you're faced with the fact that they're just an arbitrary scale. I'm not sure that's such a bad thing, though. It seems
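
A concrete reading of the "arbitrary scale" point: only the ratios between the cost parameters matter when the planner compares plans, so rescaling every parameter by the same constant should produce identical plan choices, just with bigger printed numbers. A hedged sketch (seq_page_cost anticipates the knob Tom's proposal implies; it did not exist yet when this thread was written, arriving in 8.2):

    -- stock scale: 1.0 = one sequential page fetch
    SET seq_page_cost = 1.0;           -- the implied knob; 8.2+
    SET random_page_cost = 4.0;
    SET cpu_tuple_cost = 0.01;
    SET cpu_index_tuple_cost = 0.001;
    SET cpu_operator_cost = 0.0025;

    -- same ratios on a 10x scale: the planner should choose the same
    -- plans, and only the displayed cost numbers change
    SET seq_page_cost = 10.0;
    SET random_page_cost = 40.0;
    SET cpu_tuple_cost = 0.1;
    SET cpu_index_tuple_cost = 0.01;
    SET cpu_operator_cost = 0.025;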

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-05 Thread Jim Nasby
On Jun 4, 2006, at 5:09 PM, Tom Lane wrote: Greg Stark <[EMAIL PROTECTED]> writes: Hannu Krosing <[EMAIL PROTECTED]> writes: On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: Might also be worth adding analyze delay settings, a la vacuum_cost_delay. ANALYZE already respects th

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-05 Thread Hannu Krosing
On Sun, 2006-06-04 at 18:09, Tom Lane wrote: > Greg Stark <[EMAIL PROTECTED]> writes: > > Hannu Krosing <[EMAIL PROTECTED]> writes: > >> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: > >>> Might also be worth adding analyze delay settings, a la vacuum_co

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-04 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes: > Hannu Krosing <[EMAIL PROTECTED]> writes: >> On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: >>> Might also be worth adding analyze delay settings, a la vacuum_cost_delay. ANALYZE already respects the vacuum delay settings. >> Actua

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-04 Thread Greg Stark
Hannu Krosing <[EMAIL PROTECTED]> writes: > On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: > > Might also be worth adding analyze delay settings, a la vacuum_cost_delay. > Actually we should have delay settings for all potential (almost-)full-scan service ops, - V

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-04 Thread Greg Stark
Greg Stark <[EMAIL PROTECTED]> writes: > I see improvements up to around 256M blocks or so, but my data is pretty questionable since I'm busy watching TV in MythTV in another window. It's on another drive but it still seems to be making the numbers jump around a bit. Oops, I meant blocks of 2

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Greg Stark
Greg Stark <[EMAIL PROTECTED]> writes: > Perhaps what this indicates is that the real meat is in track sampling, not block sampling. Fwiw, I've done a little benchmarking and I'm starting to think this isn't a bad idea. I see a dramatic speed improvement for samples of 1-10% as the block size

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Greg Stark
Hannu Krosing <[EMAIL PROTECTED]> writes: > Disks can read at full rotation speed, so skipping (not reading) some blocks will not make reading the remaining blocks from the same track faster. And if there are more than 20 8k pages per track, you still have a very high probability you need
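
Hannu's point, worked through under his own assumptions (20 8k pages per track, a 5% block sample, independently chosen blocks):

    SELECT 1 - power(0.95, 20) AS p_track_has_sample;
    -- ~0.64: about two thirds of all tracks hold at least one sampled
    -- page, so the disk still has to sweep most tracks anyway and
    -- skipping blocks saves little rotational time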

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Mike Benoit
pgbench appears to already support arbitrary SQL queries with the -f switch, so why couldn't we just make it a little smarter and have people enable SQL query logging for a day or two, then pass the log off to pgbench: pgbench -f Seems to me like that wouldn't be too difficult to do, and would g
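
A sketch of the workflow Mike describes; log_statement and pgbench's -f switch are real, but the file name is a placeholder and the raw log would need massaging into pgbench's one-statement-per-line script format:

    # postgresql.conf: capture the production workload for a day or two
    log_statement = 'all'

    # then replay the captured statements (queries.sql is hypothetical)
    pgbench -n -f queries.sql -c 10 -t 1000 mydb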

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Hannu Krosing
On Fri, 2006-06-02 at 16:23, Greg Stark wrote: > And a 5% sample is pretty big. In fact my tests earlier showed the i/o from 5% block sampling took just as long as reading all the blocks. Even if we figure out what's causing that (IMHO surprising) result and improve matter

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Hannu Krosing
On Sat, 2006-06-03 at 10:43, Jim Nasby wrote: > Might also be worth adding analyze delay settings, a la vacuum_cost_delay. Actually we should have delay settings for all potential (almost-)full-scan service ops, - VACUUM, ANALYSE, CREATE INDEX, ADD CONSTRAINT, maybe more - s
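
For reference, the existing cost-based delay knobs being generalized here, which (per Tom elsewhere in the thread) ANALYZE already honors. Values are illustrative, not recommendations:

    SET vacuum_cost_delay = 10;      -- ms slept when the limit is hit (default 0: off)
    SET vacuum_cost_limit = 200;     -- accumulated cost that triggers a sleep
    SET vacuum_cost_page_hit = 1;    -- charge per buffer found in cache
    SET vacuum_cost_page_miss = 10;  -- charge per page read in
    SET vacuum_cost_page_dirty = 20; -- charge per page dirtied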

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-03 Thread Jim Nasby
On Jun 2, 2006, at 5:24 PM, Todd A. Cook wrote: Josh Berkus wrote: Greg, Tom, But for most users analyze doesn't really have to run as often as vacuum. One sequential scan per night doesn't seem like that big a deal to me. Clearly you don't have any 0.5 TB databases. Perhaps something lik

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Tom Lane
Rod Taylor <[EMAIL PROTECTED]> writes: >> One objection to this is that after moving "off the gold standard" of 1.0 = one page fetch, there is no longer any clear meaning to the cost estimate units; you're faced with the fact that they're just an arbitrary scale. I'm not sure that's such

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Rod Taylor
> One objection to this is that after moving "off the gold standard" of 1.0 = one page fetch, there is no longer any clear meaning to the cost estimate units; you're faced with the fact that they're just an arbitrary scale. I'm not sure that's such a bad thing, though. For instance, some

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark
Josh Berkus writes: > Greg, Tom, > > But for most users analyze doesn't really have to run as often as vacuum. One sequential scan per night doesn't seem like that big a deal to me. > Clearly you don't have any 0.5 TB databases. Actually I did, not so long ago. Sequential scans

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Tom Lane
I wrote: > In general it seems to me that for CPU-bound databases, the default values of the cpu_xxx_cost variables are too low. ... rather than telling people to manipulate all three of these variables individually, I think it might also be a good idea to provide a new GUC variable named so
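
The knobs in question, with their stock defaults noted; cpu_index_tuple_cost = 0.005 is the figure Tom floats in this thread, the other raised values are purely illustrative of the direction:

    SET cpu_index_tuple_cost = 0.005;  -- default 0.001; Tom's suggested bump
    SET cpu_tuple_cost = 0.03;         -- default 0.01 (illustrative only)
    SET cpu_operator_cost = 0.0075;    -- default 0.0025 (illustrative only)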

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Todd A. Cook
Josh Berkus wrote: Greg, Tom, But for most users analyze doesn't really have to run as often as vacuum. One sequential scan per night doesn't seem like that big a deal to me. Clearly you don't have any 0.5 TB databases. Perhaps something like "ANALYZE FULL"? Then only those who need the
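
Short of a hypothetical ANALYZE FULL, the existing per-column knob already trades a bigger sample for a slower ANALYZE; the table and column names below are placeholders:

    -- raise the per-column statistics target before analyzing
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
    ANALYZE orders;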

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Josh Berkus
Greg, Tom, > But for most users analyze doesn't really have to run as often as vacuum. One sequential scan per night doesn't seem like that big a deal to me. Clearly you don't have any 0.5 TB databases. > > I'd still be worried about the CPU pain though. ANALYZE can afford to expend a

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark
Tom Lane <[EMAIL PROTECTED]> writes: > Greg Stark <[EMAIL PROTECTED]> writes: > > And a 5% sample is pretty big. In fact my tests earlier showed the i/o from 5% block sampling took just as long as reading all the blocks. Even if we figure out what's causing that (IMHO surprising) re

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread David Fetter
On Fri, Jun 02, 2006 at 01:39:32PM -0700, Michael Dean wrote: > I'm sorry to interrupt your esoteric (to me) discussion, but I have a very simple question: would you define a "good unbiased sample"? My statistics professor Dan Price (rest his soul) would tell me there are only random sample

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes: > And a 5% sample is pretty big. In fact my tests earlier showed the i/o from 5% block sampling took just as long as reading all the blocks. Even if we figure out what's causing that (IMHO surprising) result and improve matters I would only expect it t

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Michael Dean
Greg Stark wrote: Josh Berkus writes: Using a variety of synthetic and real-world data sets, we show that distinct sampling gives estimates for distinct values queries that are within 0%-10%, whereas previous methods were typically 50%-250% off, across the spectrum of data sets and q

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark
Josh Berkus writes: > > Using a variety of synthetic and real-world data sets, we show that distinct sampling gives estimates for distinct values queries that are within 0%-10%, whereas previous methods were typically 50%-250% off, across the spectrum of data sets and queries

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Josh Berkus
Greg, > Using a variety of synthetic and real-world data sets, we show that distinct sampling gives estimates for distinct values queries that are within 0%-10%, whereas previous methods were typically 50%-250% off, across the spectrum of data sets and queries studied. Aha. It's a
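
One way to measure the error the paper quantifies on your own data: compare the planner's stored estimate with the full-scan truth. Table and column names are placeholders:

    -- the stored estimate (negative values mean a fraction of reltuples)
    SELECT n_distinct FROM pg_stats
     WHERE tablename = 'orders' AND attname = 'customer_id';

    -- ground truth, at full-scan cost
    SELECT count(DISTINCT customer_id) FROM orders;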

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Greg Stark
David Fetter <[EMAIL PROTECTED]> writes: > > In the prior discussions someone posted the paper with the algorithm I mentioned. That paper mentions that previous work showed poor results at estimating n_distinct even with sample sizes as large as 50% or more. > Which paper? People

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Kenneth Marshall
Josh, Greg, and Tom, I do not know how sensitive the plans will be to the correlation, but one thought might be to map the histogram X histogram correlation to a square grid of values. Then you can map them to an integer which would give you 8 x 8 with binary values, a 5 x 5 with 4 values per poin
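
A sketch of the 8 x 8 binary case Ken describes: if i and j are the histogram bucket numbers of the two columns for each sampled row, the whole occupancy grid packs into a single 64-bit value, one bit per cell. The table name and columns are hypothetical:

    -- set bit i*8 + j for every occupied cell (i, j), 0 <= i, j < 8
    SELECT bit_or(1::bigint << (i * 8 + j)) AS occupancy_grid
      FROM sampled_bucket_pairs;  -- one row per sampled (i, j) pair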

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-02 Thread Nicolai Petri
Hi All, Just a small comment from a mortal user. On Thursday 01 June 2006 19:28, Josh Berkus wrote: > 5. random_page_cost (as previously discussed) is actually a function of relatively immutable hardware statistics, and as such should not need to exist as a GUC once the cost model is fixed. I

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Tom Lane
Mark Kirkwood <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> With this model, the disk cost to fetch a single index entry will be estimated as random_page_cost (default 4.0) rather than the current fixed 2.0. This shouldn't hurt things too much for simple indexscans --- especially since

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Mark Kirkwood
Tom Lane wrote: Another thing that's bothering me is that the index access cost computation (in genericcostestimate) is looking sillier and sillier:

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of
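
The comment quoted above introduces an estimate that boils down to roughly ceil(selectivity * pages_in_index). A back-of-the-envelope check against the catalogs (the 1% selectivity and the index name are assumptions for illustration):

    -- roughly genericcostestimate's page count for a 1%-selective scan
    SELECT ceil(0.01 * relpages) AS est_index_pages_fetched
      FROM pg_class WHERE relname = 'my_index';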

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread David Fetter
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote: > Josh Berkus writes: > > Greg, Tom, > > > a) We already use block based sampling to reduce overhead. If you're talking about using the entire block and not just randomly sampled tuples from within those blocks the

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark
Josh Berkus writes: > Greg, Tom, > > a) We already use block based sampling to reduce overhead. If you're talking about using the entire block and not just randomly sampled tuples from within those blocks then your sample will be biased. > There are actually some really good equati

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Tom Lane
"Jim C. Nasby" <[EMAIL PROTECTED]> writes: > Speaking of plan instability, something that's badly needed is the > ability to steer away from query plans that *might* be the most optimal, > but also will fail horribly should the cost estimates be wrong. You sure that doesn't leave us with the empty

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Jim C. Nasby
On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote: > These would all be nice things to know, but I'm afraid it's pie in the sky. We have no reasonable way to get those numbers. (And if we could get them, there would be another set of problems, namely plan instability: the planner's c

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Jim C. Nasby
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote: > Josh Berkus writes: > > 1. n-distinct estimation is bad, as previously discussed; > > 2. our current heuristics sampling methods prevent us from sampling more than 0.5% of any reasonably large table, causing all stat

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg, > > 1) You have n^2 possible two-column combinations. That's a lot of processing and storage. > Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2. Oops, bad math. Andrew pointed out it's actually n*(n-1)/2, not n!. Also, we could omit columns unlikely to corre
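
The corrected count, worked for a concrete table width:

    -- unordered two-column combinations of n columns: n * (n - 1) / 2
    SELECT 50 * 49 / 2 AS pairs_for_50_columns;  -- 1225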

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg, Tom, > a) We already use block based sampling to reduce overhead. If you're > talking about using the entire block and not just randomly sampled > tuples from within those blocks then your sample will be biased. There are actually some really good equations to work with this, estimating bo

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark
Josh Berkus writes: > > However it will only make sense if people are willing to accept that analyze will need a full table scan -- at least for tables where the DBA knows that good n_distinct estimates are necessary. > What about block-based sampling? Sampling 1 in 20 disk pages, r
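
For what it's worth, much later releases exposed exactly this kind of page-level sampling as TABLESAMPLE SYSTEM (9.5+), which makes the "1 in 20 pages" idea easy to picture; the table name is a placeholder:

    -- reads roughly 5% of the table's pages, keeping every tuple on
    -- each sampled page (hence the bias concern discussed here)
    SELECT count(*) FROM orders TABLESAMPLE SYSTEM (5);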

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Tom Lane
Josh Berkus writes: > Yeah. I've refrained from proposing changes because it's a pick-up-sticks. If we start modifying the model, we need to fix *everything*, not just one item. And then educate our users that they need to use the GUC variables in a different way. Here are the issues I

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Greg, > I'm convinced these two are more connected than you believe. Actually, I think they are inseparable. > I might be interested in implementing that algorithm that was posted a while back that involved generating good unbiased samples of discrete values. The algorithm was quite clever a

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Greg Stark
Josh Berkus writes: > 1. n-distinct estimation is bad, as previously discussed; > 2. our current heuristics sampling methods prevent us from sampling more than 0.5% of any reasonably large table, causing all statistics on those tables to be off for any table with irregular distribution of

Re: [HACKERS] More thoughts about planner's cost estimates

2006-06-01 Thread Josh Berkus
Tom, As you know, this is something I think about a bit too, though not nearly as deeply as you. In general it seems to me that for CPU-bound databases, the default values of the cpu_xxx_cost variables are too low. I am tempted to raise the default value of cpu_index_tuple_cost to 0.005, whi

[HACKERS] More thoughts about planner's cost estimates

2006-05-31 Thread Tom Lane
I've been thinking about the planner's costing problems again, particularly in connection with Philippe Lang's complaint here: http://archives.postgresql.org/pgsql-general/2006-05/msg01526.php Investigation showed that the planner is much too optimistic about the usefulness of a multi-index BitmapAnd
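
The plan shape at issue, for a hypothetical table t with separate indexes on columns a and b (EXPLAIN output abridged and illustrative):

    EXPLAIN SELECT * FROM t WHERE a = 42 AND b = 7;

    --  Bitmap Heap Scan on t
    --    Recheck Cond: ((a = 42) AND (b = 7))
    --    ->  BitmapAnd
    --          ->  Bitmap Index Scan on t_a_idx
    --                Index Cond: (a = 42)
    --          ->  Bitmap Index Scan on t_b_idx
    --                Index Cond: (b = 7)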