On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
> I think you're right that a reasonable sample size for this kind of
> estimate is going to be proportional to the table size, not the
> constant sized sample that regular statistics need.
Agreed [I said exactly that in April]; the counter-argument at that time was that proportional samples on large tables often lead to external sorts, and hence unacceptably long run times, since we need to sort the values for each attribute in turn. I've proposed limiting ourselves to maintenance_work_mem (I credit Josh with that idea from April). If you can allocate 1 GB of memory to an ANALYZE, then that would be a very large proportion of a medium sized table (or partition). Considering how few rows we sample at the moment, increasing the actual sample size by many thousands would have a very beneficial effect.

> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good.
> >
> > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> >
> > It does require a single full table scan

The Distinct Sampling approach you mention would scan the whole table and also use large in-memory hash tables. So it uses more I/O, about the same memory, and probably less CPU, since no sorting is required. The technique is essentially the same implementation as a HashAgg, just one that discards rows in a predictable manner when it would otherwise spill out of memory. It doesn't identify columns that scale with N, nor does it calculate correlation.

That's much the same as rewriting Count(Distinct) to use hashing, which is a TODO item. So perhaps you could plan that code to cover the Distinct Sampling approach at the same time. Hmmm. I'll think about that.

Best Regards, Simon Riggs
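P.S. To make the "predictable loss" idea concrete, here is a toy sketch in C, in the spirit of the paper above. It is illustrative only, not PostgreSQL code: the hash function, the tiny MAX_ENTRIES budget and all the names are invented, and a real implementation would key a proper hash table on the column datum rather than scanning an array.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MAX_ENTRIES 64          /* stand-in for the memory budget */

static uint32_t entries[MAX_ENTRIES];
static int      n_entries = 0;
static int      level = 0;      /* current sampling level */

/* Thomas Wang's 32-bit integer mix; any decent hash would do */
static uint32_t
hash_u32(uint32_t k)
{
    k = (k ^ 61) ^ (k >> 16);
    k = k + (k << 3);
    k = k ^ (k >> 4);
    k = k * 0x27d4eb2d;
    k = k ^ (k >> 15);
    return k;
}

/* a hash survives at level L if its low L bits are all zero */
static bool
in_sample(uint32_t h, int lvl)
{
    return (h & (((uint64_t) 1 << lvl) - 1)) == 0;
}

static void
add_value(uint32_t value)
{
    uint32_t    h = hash_u32(value);
    int         i;

    if (!in_sample(h, level))
        return;                 /* filtered out at the current level */

    for (i = 0; i < n_entries; i++)
        if (entries[i] == value)
            return;             /* distinct value already sampled */

    /* out of memory: raise the level and evict, predictably */
    while (n_entries >= MAX_ENTRIES)
    {
        int         kept = 0;

        level++;
        for (i = 0; i < n_entries; i++)
            if (in_sample(hash_u32(entries[i]), level))
                entries[kept++] = entries[i];
        n_entries = kept;

        if (!in_sample(h, level))
            return;             /* new value no longer qualifies */
    }

    entries[n_entries++] = value;
}

int
main(void)
{
    uint32_t    v;

    /* a "table" with 10000 distinct values, each appearing 3 times */
    for (v = 0; v < 10000; v++)
    {
        add_value(v);
        add_value(v);
        add_value(v);
    }

    /* each surviving entry stands for roughly 2^level distinct values */
    printf("kept %d entries at level %d, estimate = %.0f distinct\n",
           n_entries, level,
           n_entries * (double) ((uint64_t) 1 << level));
    return 0;
}

The point is that eviction is driven purely by each value's hash, so the surviving entries stay a sample of the distinct values seen so far and n_entries * 2^level gives the ndistinct estimate, all in one scan and bounded memory.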