> -----Original Message-----
> From: Greg Stark [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 27, 2005 1:00 AM
> To: Tom Lane
> Cc: Rod Taylor; Greg Stark; pgsql-hackers@postgresql.org;
>     Gurmeet Manku
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
>          suggested?
>
> Tom Lane <[EMAIL PROTECTED]> writes:
>
> > Rod Taylor <[EMAIL PROTECTED]> writes:
> > > If when we have partitions, that'll be good enough. If
> > > partitions aren't available this would be quite painful
> > > to anyone with large tables -- much as the days of old
> > > used to be painful for ANALYZE.
> >
> > Yeah ... I am very un-enthused about these suggestions to
> > make ANALYZE go back to doing a full scan ...
I don't see why ANALYZE would always have to do a full scan. Clearly,
this statistic would only be useful to people who need a very accurate
n_distinct on tables where the current metric does not work very well,
and applying a specialized solution to every table doesn't seem like an
efficient way to go about things. Instead, the distinct-sampling
mechanism should be purely optional, and probably entirely separate
from the vanilla ANALYZE mechanism, because it works differently.

If it were designed that way, the full table scan would be a one-time
cost that would not even need to be paid if the user turned on the
mechanism at table creation. Thereafter, the statistic would be updated
incrementally, which just distributes the cost of ANALYZE over the
INSERT/UPDATE/DELETEs. Obviously, the total cost is higher, because you
touch every record that hits the table, but that's the price you pay
for a good n_distinct. The block estimator should probably become the
default, since it works within the current ANALYZE paradigm of sampling
the data.

> [...]
> For most use cases users have to run vacuum occasionally. In
> those cases "vacuum analyze" would be no worse than a straight
> normal vacuum.

And that's only if you do a full table scan every time. In the
incremental implementation, there are no lump-sum costs involved except
when the statistic is first initialized.

> Note that this algorithm doesn't require storing more data
> because of the large scan or performing large sorts per
> column. It's purely O(n) time and O(1) space.

And it should be emphasized that distinct sampling not only gives you a
good n_distinct for query planning; it also gives you a very fast
approximate answer to related aggregate queries. So you're getting more
than just query tuning for that cost.

> On the other hand, if you have tables you aren't vacuuming
> that means you perform zero updates or deletes.
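To make the O(n)-time / O(1)-space claim concrete, here is a minimal
Python sketch of a distinct-sampling estimator in the style discussed
in this thread (not the exact algorithm from any particular paper): a
bounded sample of distinct values is kept, retaining only values whose
hash has at least `level` trailing zero bits, and the level is raised
whenever the sample overflows. The class name, `max_size` parameter,
and hash choice are all my own illustrative assumptions.

```python
import hashlib


class DistinctSample:
    """Illustrative distinct-sampling sketch: keep a bounded sample of
    distinct values whose hash falls in a geometrically shrinking
    region; estimate n_distinct as |sample| * 2**level."""

    def __init__(self, max_size=256):
        self.max_size = max_size  # hard bound on memory: O(1) space
        self.level = 0            # current sampling level
        self.sample = set()       # retained distinct values

    def _hash_level(self, value):
        # Trailing-zero count of a stable 64-bit hash of the value.
        h = int.from_bytes(
            hashlib.blake2b(repr(value).encode(), digest_size=8).digest(),
            "big")
        if h == 0:
            return 64
        return (h & -h).bit_length() - 1

    def insert(self, value):
        # Called once per inserted row: O(1) amortized work, so a full
        # pass over n rows is O(n) total.
        if self._hash_level(value) >= self.level:
            self.sample.add(value)
            while len(self.sample) > self.max_size:
                # Raise the level and evict values that no longer
                # qualify; each survivor passes with probability 1/2.
                self.level += 1
                self.sample = {v for v in self.sample
                               if self._hash_level(v) >= self.level}

    def estimate(self):
        return len(self.sample) * (2 ** self.level)
```

While the table has at most `max_size` distinct values, the level stays
at 0 and the estimate is exact; beyond that, the sample halves as the
level rises, keeping memory bounded. The retained sample itself is what
makes the fast approximate answers to related distinct-value aggregate
queries possible, which is the "more than just query tuning" point
above. Handling DELETEs is the hard part for any such incremental
scheme and is deliberately omitted here.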
> In which case some sort of incremental statistics updating
> would be a good solution. A better solution even than sampling.

Which, for the large data-warehousing situations where this mechanism
seems most useful, would probably be the most common case.

__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637  320.253.7800  800.752.8129