Re: [HACKERS] benchmarking the query planner

2009-04-02 Thread Robert Haas
On Thu, Mar 19, 2009 at 4:04 AM, ITAGAKI Takahiro wrote: > Robert Haas wrote: >> >> Works for me. Especially if you want to think more about ANALYZE before >> >> changing that. >> > >> > Well, it's something that would be sane to contemplate adding in 8.4. >> > It's way too late for any of this o

Re: [HACKERS] benchmarking the query planner

2009-03-19 Thread ITAGAKI Takahiro
Robert Haas wrote: > >> Works for me. Especially if you want to think more about ANALYZE before > >> changing that. > > > > Well, it's something that would be sane to contemplate adding in 8.4. > > It's way too late for any of this other stuff to happen in this release. > > I'm thinking about t

Re: [HACKERS] benchmarking the query planner

2009-03-11 Thread Robert Haas
On Fri, Dec 12, 2008 at 3:59 PM, Tom Lane wrote: > Simon Riggs writes: >> On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote: >>> Simon Riggs writes: Solutions can also include * manual mechanism for setting ndistinct that doesn't keep getting overwritten by subsequent ANALYZEs >>>

Re: [HACKERS] benchmarking the query planner

2009-01-02 Thread Greg Smith
On Thu, 11 Dec 2008, Tom Lane wrote: On the whole I think we have some evidence here to say that upping the default value of default_stats_target to 100 wouldn't be out of line, but 1000 definitely is. Comments? Circling back to where this started from now that the discussion has slowed down

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Simon Riggs writes: > On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote: >> Simon Riggs writes: >>> Solutions can also include >>> * manual mechanism for setting ndistinct that doesn't keep getting >>> overwritten by subsequent ANALYZEs >> >> Hmm, that might actually be the most practical answer

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 13:20 -0500, Tom Lane wrote: > Simon Riggs writes: > > Solutions can also include > > * manual mechanism for setting ndistinct that doesn't keep getting > > overwritten by subsequent ANALYZEs > > Hmm, that might actually be the most practical answer for large, > reasonably-

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Ron Mayer
Gregory Stark wrote: Simon Riggs writes: The amount of I/O could stay the same, just sample all rows on block. [...] It will also introduce strange biases. For instance in a clustered table it'll think there are a lot more duplicates than there really are because it'll see lots of similar va

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 13:43 -0500, Tom Lane wrote: > Simon Riggs writes: > > On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote: > >> Could we skip the hyperbole please? > > > Some of the ndistinct values are very badly off, and in the common cases > > I cited previously, consistently so. > > >

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Greg Stark
On Fri, Dec 12, 2008 at 6:31 PM, Simon Riggs wrote: > > Why not keep the random algorithm we have now, but scan the block into a > separate hash table for ndistinct estimation. That way we keep the > correct random rows for other purposes. It seems to me that what you have to do is look at a set

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Greg Stark
On Fri, Dec 12, 2008 at 6:18 PM, Tom Lane wrote: > I seem to recall Greg suggesting that there were ways to estimate > ndistinct without sorting, but short of a fundamental algorithm change > there's not going to be a win here. Not sure if you meant me, but I can say the approach I saw documented

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Robert Haas
>> Perhaps a table-level option to scan the whole table instead of >> estimating would be appropriate? >> > ANALYZE FULL foo ? I was thinking more like a flag that you could set on the table itself, that would apply to all future analyzes. ...Robert -- Sent via pgsql-hackers mailing list (pgsql

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Euler Taveira de Oliveira
Robert Haas escreveu: >> Which raises the issue, if we could get better statistics by passing >> the whole table, why not do that when VACUUM ANALYZE is run? > > I think the reason is "because the next autovacuum would undo it". > Ok, but even if autovacuum will undo it, almost-static-dataset wou

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Simon Riggs writes: > On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote: >> Could we skip the hyperbole please? > Some of the ndistinct values are very badly off, and in the common cases > I cited previously, consistently so. > Once I'm certain the rescue helicopter has seen me, I'll stop waving

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 13:18 -0500, Tom Lane wrote: > I seem to recall Greg suggesting that there were ways to estimate > ndistinct without sorting, but short of a fundamental algorithm change > there's not going to be a win here. Hash table? Haas Stokes suggests a Bloom filter. Why not keep the
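Simon's per-block hash-table idea can be sketched roughly as follows. The table layout, data, and function name here are invented for illustration; a real implementation (or the Bloom filter from the Haas/Stokes paper he cites) would bound memory rather than keep an exact set:

```python
import random

def sampled_block_ndistinct(blocks, block_ids):
    """Scan every row of the chosen blocks into a hash table (a Python
    set) and count distinct values, with no sort of the sample needed."""
    seen = set()
    for bid in block_ids:
        for value in blocks[bid]:
            seen.add(value)
    return len(seen)

# Toy table: 100 blocks of 50 rows each, drawn from 300 distinct keys.
random.seed(42)
blocks = [[random.randrange(300) for _ in range(50)] for _ in range(100)]
sampled_ids = random.sample(range(100), 10)
d = sampled_block_ndistinct(blocks, sampled_ids)
```

The point of the separate structure is that the sample rows kept for histograms and MCVs stay properly random while every row on each visited block still feeds the distinct count.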

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 13:10 -0500, Tom Lane wrote: > If we don't > have at least ballpark-correct figures for cost and number of output > rows, we'll start making mistakes at higher plan levels. That's definitely where we are now. Had a call with a customer today where I said "it's OK, its onl

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Simon Riggs writes: > Solutions can also include > * manual mechanism for setting ndistinct that doesn't keep getting > overwritten by subsequent ANALYZEs Hmm, that might actually be the most practical answer for large, reasonably-static tables. Especially if we expose the "negative stadistinct"
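The "negative stadistinct" Tom refers to is the existing pg_statistic convention: a negative value records ndistinct as a fraction of the row count rather than an absolute number. A small decoder of that convention, with illustrative figures:

```python
def interpret_stadistinct(stadistinct, ntuples):
    """Decode pg_statistic.stadistinct:
    > 0 : an absolute number of distinct values,
    < 0 : ndistinct scales with table size as -stadistinct * ntuples,
      0 : unknown."""
    if stadistinct > 0:
        return stadistinct
    if stadistinct < 0:
        return -stadistinct * ntuples
    return None

# -1.0 marks a fully unique column at any table size, which is why a
# manually set negative value survives table growth better than an
# absolute count would.
```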

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Simon Riggs writes: > As I said, we would only increase sample for ndistinct, not for others. How will you do that? Keep in mind that one of the things we have to do to compute ndistinct is to sort the sample. ISTM that the majority of the cost of a larger sample is going to get expended anyway

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 18:01 +, Greg Stark wrote: > I think you need to find two different formulas, one which represents > a clustered table and one which represents randomly distributed data. > Then you need a way to measure just how clustered the data is so you > know how much weight to giv

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Bruce Momjian
Alvaro Herrera wrote: > Tom Lane escribió: > > > If you want ANALYZE to be cheap then you simply don't get to have a > > trustworthy value of ndistinct. > > But then, maybe it's not all that critical that ANALYZE is cheap. For > example, if we were to rework VACUUM ANALYZE so that on the same pa

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Bruce Momjian
Alvaro Herrera wrote: > Robert Haas escribió: > > > Which raises the issue, if we could get better statistics by passing > > > the whole table, why not do that when VACUUM ANALYZE is run? > > > > I think the reason is "because the next autovacuum would undo it". > > Is there any way to "merge" th

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Alvaro Herrera writes: > Is there any way to "merge" the statistics? i.e. if a full table scan > is done to compute precise statistics, and later a regular analyze scan > is done, then perhaps instead of clobbering the previous stats, you > merge them with the new ones, thus not completely losing

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Simon Riggs writes: > On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote: >> Perhaps a better plan is to try to de-emphasize use of ndistinct, >> though I concede I have no idea how to do that. > We don't actually care about the accuracy of the ndistinct much, just > the accuracy of our answer to

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 14:03 -0300, Alvaro Herrera wrote: > Partial vacuum is a roadblock for this though :-( Actually, perhaps it's an enabler for looking at changed stats? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Greg Stark
On Fri, Dec 12, 2008 at 5:33 PM, Simon Riggs wrote: >> Incidentally we *do* do block sampling. We pick random blocks and then pick >> random records within those blocks. This was new in, uh, 7.4? 8.0? Sometime >> around then. It dramatically reduced the i/o requirements but there were long >> disc

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Alvaro Herrera writes: > Tom Lane escribió: >> If you want ANALYZE to be cheap then you simply don't get to have a >> trustworthy value of ndistinct. > But then, maybe it's not all that critical that ANALYZE is cheap. For > example, if we were to rework VACUUM ANALYZE so that on the same pass >

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
"Robert Haas" writes: > On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs wrote: >>> The existing sampling mechanism is tied to solid statistics. >> >> Sounds great, but its not true. The sample size is not linked to data >> volume, so how can it possibly give a consistent confidence range? > It is

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 17:05 +, Gregory Stark wrote: > Simon Riggs writes: > > > The amount of I/O could stay the same, just sample all rows on block. > > Lifting the sample size will help large tables. Will it be perfect? No. > > But I'll take "better" over "not working at all". > > That wi

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Kevin Grittner
>>> "Robert Haas" wrote: >> Which raises the issue, if we could get better statistics by passing >> the whole table, why not do that when VACUUM ANALYZE is run? > > I think the reason is "because the next autovacuum would undo it". The table has 32.4 million rows. autovacuum_analyze_scale_fac

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 16:10 +, Gregory Stark wrote: > Uhm, this is a survey of lots of different methods and does lots of > analysis. > I don't see any simple conclusions about stability. Perhaps I'm just > missing > it in the technical details. Could you point out exactly what part of > the >

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Gregory Stark
Simon Riggs writes: > The amount of I/O could stay the same, just sample all rows on block. > Lifting the sample size will help large tables. Will it be perfect? No. > But I'll take "better" over "not working at all". That will just raise the table size at which the problems start. It'll still b

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Alvaro Herrera
Tom Lane escribió: > If you want ANALYZE to be cheap then you simply don't get to have a > trustworthy value of ndistinct. But then, maybe it's not all that critical that ANALYZE is cheap. For example, if we were to rework VACUUM ANALYZE so that on the same pass that VACUUM cleans each heap page

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Alvaro Herrera
Robert Haas escribió: > > Which raises the issue, if we could get better statistics by passing > > the whole table, why not do that when VACUUM ANALYZE is run? > > I think the reason is "because the next autovacuum would undo it". Is there any way to "merge" the statistics? i.e. if a full table

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote: > Perhaps a better plan is to try to de-emphasize use of ndistinct, > though I concede I have no idea how to do that. We don't actually care about the accuracy of the ndistinct much, just the accuracy of our answer to the question "given work_me

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 11:16 -0500, Tom Lane wrote: > If you want ANALYZE to be cheap then you simply don't get to have > a trustworthy value of ndistinct. Understood, but a cheap ANALYZE isn't always a higher priority than all other considerations. Solutions can also include * allowing user t

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 16:10 +, Gregory Stark wrote: > Right, but increasing our sample size by a factor of 150 for a 100M > row table doesn't seem like a reasonable solution to one metric being > bogus. > > For that matter, if we do consider sampling 5% of the table we may as > well just go a

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Robert Haas
> Which raises the issue, if we could get better statistics by passing > the whole table, why not do that when VACUUM ANALYZE is run? I think the reason is "because the next autovacuum would undo it". Perhaps a table-level option to scan the whole table instead of estimating would be appropriate?

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Kevin Grittner
>>> "Nathan Boley" wrote: > Can anyone suggest a good data set to test this sort of question on? Where we have the biggest problem with bad estimates is on complex searches involving many joins where the main criterion limiting the result set is a name. The estimate based on the histogram is o

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
Gregory Stark writes: > For that matter, if we do consider sampling 5% of the table we may as well > just go ahead and scan the whole table. It wouldn't take much longer and it > would actually produce good estimates. Yeah. Anything over a small fraction of a percent is going to imply fetching e

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Gregory Stark
Simon Riggs writes: > On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote: >> It is a pretty well-known mathematical fact that for something like an >> opinion poll your margin of error does not depend on the size of the >> population but only on the size of your sample. > > Yes, I agree with th
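The polling mathematics Robert invokes is the standard error of a proportion: the population size never appears in the formula (ignoring the finite-population correction, which only shrinks the error). A worked sketch; note that, as the rest of the thread argues, this guarantee covers proportion-style statistics such as histogram bounds and MCV frequencies, not ndistinct:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p estimated from a simple
    random sample of n rows; the table size does not appear."""
    return z * math.sqrt(p * (1 - p) / n)

# A 30,000-row sample pins a frequency down to roughly +/-0.6% whether
# the table holds a million rows or a billion.
me = margin_of_error(0.5, 30000)
```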

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
"Greg Stark" writes: > On Fri, Dec 12, 2008 at 2:35 PM, Tom Lane wrote: >> But having said that, I have wondered whether we should consider >> allowing the sample to grow to fill maintenance_work_mem > Hm, so I wonder what this does to the time analyze takes. I think it > would be the only thing

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 09:35 -0500, Tom Lane wrote: > AFAICS, marginal enlargements in the sample size aren't going to help > much for ndistinct --- you really need to look at most or all of the > table to be guaranteed anything about that. > > But having said that, I have wondered whether we sho

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote: > On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs wrote: > >> The existing sampling mechanism is tied to solid statistics. It > >> provides the correct sample size to get a consistent confidence range > >> for range queries. This is the same math

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Tom Lane
"Robert Haas" writes: > On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane wrote: >> Maybe so. If we stick to the other design (end both lists at a preset >> frequency threshold) then the math clearly goes through the same as >> before, just with num_mcvs that are determined differently. But can >> we

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Greg Stark
On Fri, Dec 12, 2008 at 2:35 PM, Tom Lane wrote: > AFAICS, marginal enlargements in the sample size aren't going to help > much for ndistinct --- you really need to look at most or all of the > table to be guaranteed anything about that. Well you only need to maintain a fixed percentage of the ta

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Robert Haas
On Thu, Dec 11, 2008 at 10:12 PM, Tom Lane wrote: > "Robert Haas" writes: >> I had this idle thought too, but I didn't write it down because... > >>> ought to be, but it seems like it ought to be possible to determine >>> that given a desired maximum error in the overall estimate. I'm also >>> n

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Robert Haas
On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs wrote: >> The existing sampling mechanism is tied to solid statistics. It >> provides the correct sample size to get a consistent confidence range >> for range queries. This is the same mathematics which governs election >> polling and other surveys. Th

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Thu, 2008-12-11 at 18:52 -0500, Tom Lane wrote: > Simon Riggs writes: > > On Thu, 2008-12-11 at 22:29 +, Gregory Stark wrote: > >>> And I would like it even more if the sample size increased according > >>> to table size, since that makes ndistinct values fairly random for > >>> large tabl

Re: [HACKERS] benchmarking the query planner

2008-12-12 Thread Simon Riggs
On Fri, 2008-12-12 at 02:23 +, Greg Stark wrote: > The existing sampling mechanism is tied to solid statistics. It > provides the correct sample size to get a consistent confidence range > for range queries. This is the same mathematics which governs election > polling and other surveys. The

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Robert Haas" writes: > I had this idle thought too, but I didn't write it down because... >> ought to be, but it seems like it ought to be possible to determine >> that given a desired maximum error in the overall estimate. I'm also >> not very clear on what the "total frequency" computations (

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Robert Haas
>> OK, I'll bite. How do we decide where to put the cutoff? If we make >> it too high, it will have a negative effect on join selectivity >> estimates; if it's too low, it won't really address the problem we're >> trying to fix. I randomly propose p = 0.001, which should limit >> eqjoinsel() to

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
>> Isn't a selectivity estimate of x = v as ( the number of values in v's >> histogram bucket ) / ( number of distinct values in v's histogram >> bucket ) pretty rational? Thats currently what we do for non-mcv >> values, except that we look at ndistinct over the whole table instead >> of individua

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Greg Stark
On Thu, Dec 11, 2008 at 11:44 PM, Simon Riggs wrote: > > On Thu, 2008-12-11 at 22:29 +, Gregory Stark wrote: > >> > And I would like it even more if the sample size increased according >> to table size, since that makes ndistinct values fairly random for >> large >> > tables. >> >> Unfortunate

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Robert Haas" writes: >> It might be best to stop when the frequency drops below some threshold, >> rather than taking a fixed number of entries. > OK, I'll bite. How do we decide where to put the cutoff? If we make > it too high, it will have a negative effect on join selectivity > estimates;

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Robert Haas
> I think though that the case for doing so is pretty good. "MCVs" that > are beyond the K'th entry can't possibly have frequencies greater than > 1/K, and in most cases it'll be a lot less. So the incremental > contribution to the accuracy of the join selectivity estimate drops off > pretty quic

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Nathan Boley" writes: > Isn't a selectivity estimate of x = v as ( the number of values in v's > histogram bucket ) / ( number of distinct values in v's histogram > bucket ) pretty rational? Thats currently what we do for non-mcv > values, except that we look at ndistinct over the whole table ins

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
Thanks for the response. >> Well, ISTM there is a profound difference. For scalarineqsel we care >> about the total number of values in a bucket. For eqsel we care about >> the total number of *distinct* values in each bucket > > Really? > Well, we would obviously also care about the total numbe

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Nathan Boley" writes: > Well, ISTM there is a profound difference. For scalarineqsel we care > about the total number of values in a bucket. For eqsel we care about > the total number of *distinct* values in each bucket Really? > IMHO, the whole idea of increasing mcv's seems a mistake. Why no

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
>> What is the specific difference between what you are talking about and >> what scalarineqsel already implements? > > Hmm... Northing new. Feel sorry for bothering you. I did not realize > histograms are implemented. > Well, ISTM there is a profound difference. For scalarineqsel we care about th

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
Bruce Momjian writes: > Why is selfuncs.c::var_eq_const() doing a linear scan over the MCV array > instead of having the list sorted and doing a binary search on the > array? 1. Keeping the MCV array sorted by frequency allows cheap extraction of less-accurate MCV lists; you just take the first N
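Tom's first point is that keeping the MCV array in frequency order makes a cheaper, less accurate list a simple prefix slice, which the value ordering a binary search needs would lose. A toy illustration with made-up values (not the actual selfuncs.c code):

```python
from bisect import bisect_left

# MCV list sorted by descending frequency, as pg_statistic stores it.
mcv_by_freq = [("apple", 0.30), ("pear", 0.20), ("plum", 0.10), ("fig", 0.05)]

def truncate_mcvs(mcvs, n):
    """In frequency order, a less accurate MCV list is just a prefix."""
    return mcvs[:n]

def lookup_linear(mcvs, v):
    """What var_eq_const effectively does: a linear scan of the values."""
    for value, freq in mcvs:
        if value == v:
            return freq
    return None

# Binary search would need a second ordering, sorted by value instead.
mcv_by_value = sorted(mcv_by_freq)

def lookup_binary(sorted_mcvs, v):
    i = bisect_left(sorted_mcvs, (v,))
    if i < len(sorted_mcvs) and sorted_mcvs[i][0] == v:
        return sorted_mcvs[i][1]
    return None
```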

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
Simon Riggs writes: > On Thu, 2008-12-11 at 22:29 +, Gregory Stark wrote: >>> And I would like it even more if the sample size increased according >>> to table size, since that makes ndistinct values fairly random for >>> large tables. >> >> Unfortunately _any_ ndistinct estimate based on a s

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Bruce Momjian
Tom Lane wrote: > "Robert Haas" writes: > >> On the whole I think we have some evidence here to say that upping the > >> default value of default_stats_target to 100 wouldn't be out of line, > >> but 1000 definitely is. Comments? > > > Do you think there's any value in making it scale based on t

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Kevin Grittner
>>> Simon Riggs wrote: > On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote: >> I don't see >> much value in a data-type-dependent default anyway I am dubious about the value there, too, at least for our environment. > I would prefer distinct type or domain specific defaults Now that would

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Simon Riggs
On Thu, 2008-12-11 at 22:29 +, Gregory Stark wrote: > > And I would like it even more if the sample size increased according > to table size, since that makes ndistinct values fairly random for > large > > tables. > > Unfortunately _any_ ndistinct estimate based on a sample of the table > is

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
Simon Riggs writes: > On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote: >> Simon Riggs writes: >>> I would like it even more if there was a data type specific default. >>> Currently we have a special case for boolean, but that's it. >> >> No, we don't (or if we do I'd be interested to know wher

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Robert Haas" writes: > On Thu, Dec 11, 2008 at 5:45 PM, Tom Lane wrote: >> Simon Riggs writes: >>> I would like it even more if there was a data type specific default. >>> Currently we have a special case for boolean, but that's it. >> >> No, we don't (or if we do I'd be interested to know whe

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Simon Riggs
On Thu, 2008-12-11 at 17:45 -0500, Tom Lane wrote: > Simon Riggs writes: > > I would like it even more if there was a data type specific default. > > Currently we have a special case for boolean, but that's it. > > No, we don't (or if we do I'd be interested to know where). Your commit, selfun

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Robert Haas
On Thu, Dec 11, 2008 at 5:45 PM, Tom Lane wrote: > Simon Riggs writes: >> I would like it even more if there was a data type specific default. >> Currently we have a special case for boolean, but that's it. > > No, we don't (or if we do I'd be interested to know where). utils/adt/selfuncs.c, get

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
Gregory Stark writes: > Tom Lane writes: >> BTW, does anyone have an opinion about changing the upper limit for >> default_stats_target to, say, 10000? These tests suggest that you >> wouldn't want such a value for a column used as a join key, but >> I can see a possible argument for high values

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Gregory Stark
Tom Lane writes: > 500 114.076 > 600 157.535 > 700 211.189 > 800 269.731 > 900 335.427 > 1000 409.638 >... > BTW, does anyone have an opinion about changing the upper limit for > default_stats_target to, say, 10000? These tests suggest that you > wouldn't want such a value for a colu

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Vladimir Sitnikov
> > What is the specific difference between what you are talking about and > what scalarineqsel already implements? > Hmm... Northing new. Feel sorry for bothering you. I did not realize histograms are implemented. Regards, Vladimir Sitnikov

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
Simon Riggs writes: > I would like it even more if there was a data type specific default. > Currently we have a special case for boolean, but that's it. No, we don't (or if we do I'd be interested to know where). I don't see much value in a data-type-dependent default anyway --- would you make

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
Gregory Stark writes: > Simon Riggs writes: >> And I would like it even more if the sample size increased according to >> table size, since that makes ndistinct values fairly random for large >> tables. > Unfortunately _any_ ndistinct estimate based on a sample of the table is going > to be pret

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Gregory Stark
Simon Riggs writes: > On Thu, 2008-12-11 at 13:09 -0500, Tom Lane wrote: > >> On the whole I think we have some evidence here to say that upping the >> default value of default_stats_target to 100 wouldn't be out of line, >> but 1000 definitely is. Comments? > > Sounds good to me. > > I would li

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Vladimir Sitnikov" writes: >> I think that would likely be redundant with the histogram. >> > I am afraid I do not get you. AFAICS what you're proposing *is* a histogram, just awkwardly described. What is the specific difference between what you are talking about and what scalarineqsel already

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Simon Riggs
On Thu, 2008-12-11 at 13:09 -0500, Tom Lane wrote: > On the whole I think we have some evidence here to say that upping the > default value of default_stats_target to 100 wouldn't be out of line, > but 1000 definitely is. Comments? Sounds good to me. I would like it even more if there was a da

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Vladimir Sitnikov
> > > I think that would likely be redundant with the histogram. > I am afraid I do not get you. I mean histograms should be considered when it comes to increasing number of MCV entries (at least for numeric/timestamp values). With histogram lower number of entries could be sufficient to get reason

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Nathan Boley
> One more direction could be implementing "MCV" for range of values (group > values and interpolate in between). Consider statistics on timestamp column > that says that for "2008-December" there are as many X rows, for > "2008-November" as many as Y, etc. That could be used for rather accurate >

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Vladimir Sitnikov" writes: > One more direction could be implementing "MCV" for range of values (group > values and interpolate in between). Consider statistics on timestamp column > that says that for "2008-December" there are as many X rows, for > "2008-November" as many as Y, etc. That could

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Vladimir Sitnikov
> > > > There's something in what you say, but consider that we have pretty > much unanimous agreement that 10 is too small. I think we should > try to fix the problem, not just gradually ratchet up the value until > people start complaining in the other direction. (Also, we should have > plenty

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Robert Haas" writes: >> On the whole I think we have some evidence here to say that upping the >> default value of default_stats_target to 100 wouldn't be out of line, >> but 1000 definitely is. Comments? > Do you think there's any value in making it scale based on the size of > the table? As

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Robert Haas
On Thu, Dec 11, 2008 at 2:06 PM, Tom Lane wrote: > "Vladimir Sitnikov" writes: >> Do you consider using hash tables? > > Doubt it's really worth it, unless there's some way to amortize the > setup cost across multiple selectivity estimations; which would surely > complicate life. > > One thing th

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Vladimir Sitnikov
> > > > Do you consider using hash tables? > > Doubt it's really worth it, unless there's some way to amortize the > setup cost across multiple selectivity estimations; which would surely > complicate life. > MCV lists are updated only during "analyze" phase, don't they? If the "setup cost" is the

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Robert Haas
> When looking at these numbers one might think the threshold of pain > is about 50, rather than 100 which is where I'd put it for the text > example. However, this is probably an extreme worst case. > > On the whole I think we have some evidence here to say that upping the > default value of defa

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Vladimir Sitnikov" writes: > Do you consider using hash tables? Doubt it's really worth it, unless there's some way to amortize the setup cost across multiple selectivity estimations; which would surely complicate life. One thing that just now occurred to me is that as long as we maintain the c

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Vladimir Sitnikov
> > > BTW, does anyone have an opinion about changing the upper limit for > default_stats_target to, say, 10000? These tests suggest that you > wouldn't want such a value for a column used as a join key, but > I can see a possible argument for high values in text search and > similar applications.

Re: [HACKERS] benchmarking the query planner

2008-12-11 Thread Tom Lane
"Robert Haas" writes: > Ah, that makes sense. Here's a test case based on Greg's. This is > definitely more than linear once you get above about n = 80, but it's > not quadratic either. n = 1000 is only 43x n = 80, and while that's > surely more than 1000/80 = 12.5, it's also a lot less than (1

Re: [HACKERS] benchmarking the query planner

2008-12-10 Thread Robert Haas
> Yeah, that would be fast. To see a quadratic case you need MCV arrays > that have little or no overlap of common values --- then each element of > the first will be compared (in vain) to all or most of the elements in > the second. Ah, that makes sense. Here's a test case based on Greg's. Thi
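The quadratic case Tom describes, where each element of one MCV list is compared in vain against all of the other, has the shape of a nested loop. This sketch mirrors the shape of the work in eqjoinsel, not its actual code:

```python
def mcv_join_matches(mcv1, mcv2):
    """Compare every value of the first MCV list against the second,
    stopping early on a match; two lists with no overlap hit the worst
    case of len(mcv1) * len(mcv2) comparisons."""
    comparisons = 0
    matched = []
    for v1 in mcv1:
        for v2 in mcv2:
            comparisons += 1
            if v1 == v2:
                matched.append(v1)
                break
    return matched, comparisons

# Non-overlapping lists of 100 values each: 10,000 comparisons, no hits.
hits, work = mcv_join_matches(list(range(100)), list(range(100, 200)))
```

With heavy overlap the inner loop breaks early and the measured cost looks linear, which is consistent with the profiling Robert reports in the next message up-thread.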

Re: [HACKERS] benchmarking the query planner

2008-12-10 Thread Tom Lane
"Robert Haas" <[EMAIL PROTECTED]> writes: > On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <[EMAIL PROTECTED]> wrote: >> I tried a different query, trying to get quadratic growth and again failed. >> It > The profiling results I sent the other day show an exactly-linear > increase in the number o

Re: [HACKERS] benchmarking the query planner

2008-12-09 Thread Robert Haas
On Mon, Dec 8, 2008 at 10:24 AM, Gregory Stark <[EMAIL PROTECTED]> wrote: > I tried a different query, trying to get quadratic growth and again failed. It The profiling results I sent the other day show an exactly-linear increase in the number of times eqjoinsel invokes FunctionCall2. Reading thro

Re: [HACKERS] benchmarking the query planner (was Re: Simple postgresql.conf wizard)

2008-12-08 Thread Robert Haas
On Mon, Dec 8, 2008 at 11:56 AM, Tom Lane <[EMAIL PROTECTED]> wrote: > Greg Stark <[EMAIL PROTECTED]> writes: >> That might only be the case when the pg_statistic record is in shared >> buffers. > > Yeah, it seems unlikely that disabling compression is a good idea in the > bigger scheme of things.

Re: [HACKERS] benchmarking the query planner (was Re: Simple postgresql.conf wizard)

2008-12-08 Thread Tom Lane
Greg Stark <[EMAIL PROTECTED]> writes: > That might only be the case when the pg_statistic record is in shared > buffers. Yeah, it seems unlikely that disabling compression is a good idea in the bigger scheme of things. > Also I wonder if eqjoinsel and company might need to be made more > toa

Re: [HACKERS] benchmarking the query planner (was Re: Simple postgresql.conf wizard)

2008-12-06 Thread Greg Stark
That might only be the case when the pg_statistic record is in shared buffers. Also I wonder if eqjoinsel and company might need to be made more toast-aware by detoasting all the things it needs once rather than every time it accesses them. greg On 6 Dec 2008, at 06:19 PM, "Robert Haas"

[HACKERS] benchmarking the query planner (was Re: Simple postgresql.conf wizard)

2008-12-06 Thread Robert Haas
Sorry for top posting but we are getting a bit far afield from the original topic. I followed up the tests I did last night: http://archives.postgresql.org/pgsql-hackers/2008-12/msg00369.php I benchmarked 100 iterations of EXPLAIN on the query Greg Stark put together as a synthetic benchmark for