On Wed, Mar 07, 2012 at 07:51:42PM -0500, Tom Lane wrote: > Alexander Korotkov <aekorot...@gmail.com> writes: > > True. If (max count - min count + 1) is small, enumerating of frequencies > > is both more compact and more precise representation. Simultaneously, > > if (max count - min count + 1) is large, we can run out of > > statistics_target with such representation. We can use same representation > > of count distribution as for scalar column value: MCV and HISTOGRAM, but it > > would require additional statkind and statistics slot. Probably, you've > > better ideas? > > I wasn't thinking of introducing two different representations, > but just trimming the histogram length when it's larger than necessary. > > On reflection my idea above is wrong; for example assume that we have a > column with 900 arrays of length 1 and 100 arrays of length 2. Going by > what I said, we'd reduce the histogram to {1,2}, which might accurately > capture the set of lengths present but fails to show that 1 is much more > common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries) > would capture the situation perfectly in one-tenth the space that the > current logic does.
Granted. When the next sample finds 899/101 instead, though, the optimization vanishes. You save 90% of the space, perhaps 10% of the time. If you want to materially narrow typical statistics, Alexander's proposal looks like the way to go. I'd guess array columns always having DEC <= default_statistics_target are common enough to make that representation the dominant representation, if not the only necessary representation. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers