Josh Berkus wrote:
Mischa,
Okay, although given the track record of page-based sampling for
n-distinct, it's a bit like looking for your keys under the streetlight,
rather than in the alley where you dropped them :-)
Bad analogy, but funny.
The issue with page-based vs. pure random sampling is th
Quoting Josh Berkus :
> Mischa,
>
> > Okay, although given the track record of page-based sampling for
> > n-distinct, it's a bit like looking for your keys under the
> streetlight,
> > rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.
Bad analogy? Page-
John,
> But doesn't an index only sample one column at a time, whereas with
> page-based sampling, you can sample all of the columns at once.
Hmmm. Yeah, we're not currently doing that though. Another good idea ...
--
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Mischa,
> Okay, although given the track record of page-based sampling for
> n-distinct, it's a bit like looking for your keys under the streetlight,
> rather than in the alley where you dropped them :-)
Bad analogy, but funny.
The issue with page-based vs. pure random sampling is that to do, fo
Quoting Markus Schaber <[EMAIL PROTECTED]>:
> Hi, Josh,
>
> Josh Berkus wrote:
>
> > Yes, actually. We need 3 different estimation methods:
> > 1 for tables where we can sample a large % of pages (say, >= 0.1)
> > 1 for tables where we sample a small % of pages but are "easily
> estimated"
> >
Hi, Josh,
Josh Berkus wrote:
> Yes, actually. We need 3 different estimation methods:
> 1 for tables where we can sample a large % of pages (say, >= 0.1)
> 1 for tables where we sample a small % of pages but are "easily estimated"
> 1 for tables which are not easily estimated but we can't afford
First I will comment on my original idea.
Second I will give another, improved suggestion (an idea).
I hope that they will be useful for you.
(I don't know whether the first one was useful at all, because it showed
that I and some others of us are not very good with statistics :( )
I haven't looked ab
Quoting Josh Berkus :
> > >Perhaps I can save you some time (yes, I have a degree in Math). If I
> > >understand correctly, you're trying to extrapolate from the correlation
> > >between a tiny sample and a larger sample. Introducing the tiny sample
> > >into any decision can only produce a less accu
"Dave Held" <[EMAIL PROTECTED]> writes:
> > Actually, it's more to characterize how large of a sample
> > we need. For example, if we sample 0.005 of disk pages, and
> > get an estimate, and then sample another 0.005 of disk pages
> > and get an estimate which is not even close to the first
> >
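A rough sketch of how that "sample twice and compare" check could look, using a deliberately naive scale-up estimator just so the two samples have something to disagree about; the 0.5% fraction and the factor-of-two tolerance are arbitrary placeholders, not anything proposed in the thread:

import random

def scale_up_estimate(pages, sampled_ids):
    # Deliberately naive: distinct values seen in the sampled pages, scaled by
    # the inverse of the sampled fraction.  Known to be badly biased; used here
    # only to illustrate the comparison loop.
    seen = set()
    for pid in sampled_ids:
        seen.update(pages[pid])
    return len(seen) * len(pages) / len(sampled_ids)

def samples_agree(pages, fraction=0.005, tolerance=2.0):
    # Draw two disjoint page samples of the same size and report whether their
    # n_distinct estimates are within a factor of `tolerance` of each other.
    k = max(1, int(len(pages) * fraction))
    ids = random.sample(range(len(pages)), 2 * k)
    e1 = scale_up_estimate(pages, ids[:k])
    e2 = scale_up_estimate(pages, ids[k:])
    return max(e1, e2) <= tolerance * min(e1, e2), (e1, e2)

# toy table: 2,000 pages of 100 rows each, values drawn from 20,000 possibilities
rows = [random.randint(1, 20_000) for _ in range(200_000)]
pages = [rows[i:i + 100] for i in range(0, len(rows), 100)]
print(samples_agree(pages))

If the two estimates disagree badly, that is evidence the sample fraction is too small for this column, which is the kind of self-check being described above.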
> -Original Message-
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 27, 2005 10:25 AM
> To: Andrew Dunstan
> Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation;
Mischa,
> >Perhaps I can save you some time (yes, I have a degree in Math). If I
> >understand correctly, you're trying to extrapolate from the correlation
> >between a tiny sample and a larger sample. Introducing the tiny sample
> >into any decision can only produce a less accurate result than just
> -Original Message-
> From: Greg Stark [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 27, 2005 1:00 AM
> To: Tom Lane
> Cc: Rod Taylor; Greg Stark; pgsql-hackers@postgresql.org;
> Gurmeet Manku
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> [...]
> 2. In a single scan, it is possible to estimate n_distinct by using
> a very simple algorithm:
>
> "Distinct sampling for highly-accurate answers to distinct val
On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> This one looks *really* good.
>
> http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
>
> It does require a single full table scan
Ack.. Not by default please.
I have a few large append-only tables (vacuum isn't necessary) whi
Mischa Sandberg wrote:
Perhaps I can save you some time (yes, I have a degree in Math). If I
understand correctly, you're trying to extrapolate from the correlation
between a tiny sample and a larger sample. Introducing the tiny sample
into any decision can only produce a less accurate result than
Hi everybody!
Perhaps the following papers are relevant to the discussion here
(their contact authors have been cc'd):
1. The following proposes effective algorithms for using block-level
sampling for n_distinct estimation:
"Effective use of block-level sampling in statistics estimat
On Tue, 2005-04-26 at 15:00 -0700, Gurmeet Manku wrote:
> 2. In a single scan, it is possible to estimate n_distinct by using
> a very simple algorithm:
>
> "Distinct sampling for highly-accurate answers to distinct value
> queries and event reports" by Gibbons, VLDB 2001.
>
> http://ww
Tom Lane <[EMAIL PROTECTED]> writes:
> Rod Taylor <[EMAIL PROTECTED]> writes:
> > If when we have partitions, that'll be good enough. If partitions aren't
> > available this would be quite painful to anyone with large tables --
> > much as the days of old used to be painful for ANALYZE.
>
> Yeah
Quoting Andrew Dunstan <[EMAIL PROTECTED]>:
> After some more experimentation, I'm wondering about some sort of
> adaptive algorithm, a bit along the lines suggested by Marko
> Ristola, but limited to 2 rounds.
>
> The idea would be that we take a sample (either of fixed size, or
> some sm
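The message is truncated here, so the details of the proposed 2-round scheme are lost; the following is only a guess at its general shape: use a cheap first-round sample to decide whether a larger second round is worth paying for. The Haas/Stokes-style estimator, the sample sizes, and the singleton-fraction threshold are all assumptions made for illustration:

import random
from collections import Counter

def haas_stokes(sample, total_rows):
    # The estimator discussed elsewhere in the thread: n*d / (n - f1 + f1*n/N)
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / total_rows)

def two_round_estimate(column, first_rows=3000, factor=10, f1_threshold=0.5):
    # Round 1: small random row sample.  If most sampled values are singletons
    # (the hard case for any estimator), take a second, larger sample;
    # otherwise keep the cheap estimate.
    first = random.sample(column, min(first_rows, len(column)))
    singles = sum(1 for c in Counter(first).values() if c == 1)
    if singles / len(first) < f1_threshold:
        return haas_stokes(first, len(column))
    second = random.sample(column, min(first_rows * factor, len(column)))
    return haas_stokes(second, len(column))

# toy column: 1,000,000 rows drawn uniformly from 200,000 possible values
col = [random.randint(1, 200_000) for _ in range(1_000_000)]
print(round(two_round_estimate(col)))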
Rod Taylor <[EMAIL PROTECTED]> writes:
> If when we have partitions, that'll be good enough. If partitions aren't
> available this would be quite painful to anyone with large tables --
> much as the days of old used to be painful for ANALYZE.
Yeah ... I am very un-enthused about these suggestions
On Tue, 2005-04-26 at 19:28 -0400, Greg Stark wrote:
> Rod Taylor <[EMAIL PROTECTED]> writes:
>
> > On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > > This one looks *really* good.
> > >
> > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> > >
> > > It does require a
Rod Taylor <[EMAIL PROTECTED]> writes:
> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good.
> >
> > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> >
> > It does require a single full table scan
>
> Ack.. Not by default please.
>
> I have
This one looks *really* good.
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
It does require a single full table scan but it works in O(n) time and
constant space and it guarantees the confidence intervals for the estimates it
provides like the histograms do for regular range s
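For anyone skimming the archive, the core trick in that paper is roughly the following single-pass, bounded-memory scheme. This is a simplified adaptive-sampling sketch, not the paper's exact algorithm (which additionally keeps per-value counts); the buffer size and hash function are arbitrary choices:

import hashlib
import random

def h64(value):
    # stand-in hash of the datum; any well-mixed 64-bit hash would do
    return int.from_bytes(
        hashlib.blake2b(str(value).encode(), digest_size=8).digest(), "big")

def distinct_sample_estimate(stream, max_buffer=10_000):
    # Keep a value when the low `level` bits of its hash are zero; when the
    # buffer overflows, raise the level and purge, which halves the expected
    # buffer size.  Each surviving distinct value then stands for 2**level of them.
    level = 0
    kept = set()
    for v in stream:
        hv = h64(v)
        if hv & ((1 << level) - 1) == 0:
            kept.add(hv)
            while len(kept) > max_buffer:
                level += 1
                kept = {x for x in kept if x & ((1 << level) - 1) == 0}
    return len(kept) * (1 << level)

values = [random.randint(1, 50_000) for _ in range(500_000)]
print(distinct_sample_estimate(values), "vs true", len(set(values)))

Memory stays bounded by max_buffer no matter how large the table is, which is why the approach needs only the single sequential scan mentioned above, and nothing worse.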
Simon Riggs wrote:
The comment
* Every value in the sample appeared more than once. Assume
* the column has just these values.
doesn't seem to apply when using larger samples, as Josh is using.
Looking at Josh's application it does seem likely that when taking a
sample, all site
On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> >> It's not just the scan --- you also have to sort, or something like
> >> that, if you want to count distinct values. I doubt anyone is really
> >
Simon,
> Could it be that we have overlooked this simple explanation and that the
> Haas and Stokes equation is actually quite good, but just not being
> applied?
That's probably part of it, but I've tried Haas and Stokes on a pure random
sample and it's still bad, or more specifically overly co
On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
> Josh Berkus writes:
> > Overall, our formula is inherently conservative of n_distinct. That is, I
> > believe that it is actually computing the *smallest* number of distinct
> > values which would reasonably produce the given sample, rather
> -Original Message-
> From: Andrew Dunstan [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 25, 2005 3:43 PM
> To: josh@agliodbs.com
> Cc: pgsql-perform; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
Simon Riggs <[EMAIL PROTECTED]> writes:
> On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
>> It's not just the scan --- you also have to sort, or something like
>> that, if you want to count distinct values. I doubt anyone is really
>> going to consider this a feasible answer for large tables.
> -Original Message-
> From: Josh Berkus [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 24, 2005 2:08 PM
> To: Andrew Dunstan
> Cc: Tom Lane; Greg Stark; Marko Ristola; pgsql-perform;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_disti
Josh Berkus wrote:
Simon, Tom:
While it's not possible to get accurate estimates from a fixed size sample, I
think it would be possible from a small but scalable sample: say, 0.1% of all
data pages on large tables, up to the limit of maintenance_work_mem.
Setting up these samples as a % of da
Simon, Tom:
While it's not possible to get accurate estimates from a fixed size sample, I
think it would be possible from a small but scalable sample: say, 0.1% of all
data pages on large tables, up to the limit of maintenance_work_mem.
Setting up these samples as a % of data pages, rather th
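A back-of-the-envelope illustration of what "0.1% of data pages, capped by maintenance_work_mem" works out to, assuming the default 8 KB page size and an example 64 MB maintenance_work_mem:

BLCKSZ = 8192                          # default PostgreSQL page size in bytes
maintenance_work_mem = 64 * 1024**2    # example setting: 64 MB

def pages_to_sample(table_bytes, fraction=0.001):
    total_pages = table_bytes // BLCKSZ
    wanted = int(total_pages * fraction)
    cap = maintenance_work_mem // BLCKSZ   # pages that fit in the memory budget
    return total_pages, wanted, min(wanted, cap)

for gb in (1, 10, 100, 1000):
    total, wanted, actual = pages_to_sample(gb * 1024**3)
    print(f"{gb:5d} GB table: {total:>9d} pages, 0.1% = {wanted:>6d}, sampled = {actual}")

With these example settings the cap only starts to bite somewhere around a 60 GB table; below that the sample really is 0.1% of the pages.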
Guys,
> While it's not possible to get accurate estimates from a fixed size sample,
> I think it would be possible from a small but scalable sample: say, 0.1% of
> all data pages on large tables, up to the limit of maintenance_work_mem.
BTW, when I say "accurate estimates" here, I'm talking about
On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > My suggested hack for PostgreSQL is to have an option to *not* sample,
> > just to scan the whole table and find n_distinct accurately.
> > ...
> > What price a single scan of a table, however large, wh
> -Original Message-
> From: Tom Lane [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 25, 2005 10:23 AM
> To: Simon Riggs
> Cc: josh@agliodbs.com; Greg Stark; Marko Ristola; pgsql-perform;
> pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] Bad n_disti
Simon Riggs <[EMAIL PROTECTED]> writes:
> My suggested hack for PostgreSQL is to have an option to *not* sample,
> just to scan the whole table and find n_distinct accurately.
> ...
> What price a single scan of a table, however large, when incorrect
> statistics could force scans and sorts to occu
On Sat, 2005-04-23 at 16:39 -0700, Josh Berkus wrote:
Greg Stark wrote:
> > I looked into this a while back when we were talking about changing the
> > sampling method. The conclusions were discouraging. Fundamentally, using
> > constant sized samples of data for n_distinct is bogus. Constant sized
Josh Berkus writes:
> Tom, how does our heuristic sampling work? Is it pure random sampling, or
> page sampling?
Manfred probably remembers better than I do, but I think the idea is
to approximate pure random sampling as best we can without actually
examining every page of the table.
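A schematic of what "approximate pure random sampling without reading every page" usually means in practice: pick random pages first, then reservoir-sample rows within them. This is only an illustration of the general two-stage idea, not the actual ANALYZE code, and the one-page-per-target-row heuristic is an assumption:

import random

def two_stage_sample(pages, target_rows):
    # Stage 1: choose random pages, bounding the I/O.
    # Stage 2: reservoir-sample rows from those pages so that, among the rows
    # actually read, each has an equal chance of ending up in the sample.
    n_pages = min(len(pages), target_rows)
    chosen = sorted(random.sample(range(len(pages)), n_pages))  # sorted => sequential-ish I/O
    reservoir, seen = [], 0
    for pid in chosen:
        for row in pages[pid]:
            seen += 1
            if len(reservoir) < target_rows:
                reservoir.append(row)
            else:
                j = random.randrange(seen)
                if j < target_rows:
                    reservoir[j] = row
    return reservoir

table = [[random.randint(1, 1000) for _ in range(100)] for _ in range(5_000)]
print(len(two_stage_sample(table, 3000)))

The catch, and much of what this thread is about, is that rows sharing a page are often correlated, so the result is close to but not quite a uniform row sample, and estimators that assume pure random sampling can be thrown off by it.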
Here is my opinion.
I hope this helps.
Maybe there is no one good formula:
On a boolean type, there are at most 3 distinct values.
There is an upper bound for forenames in one country.
There is an upper bound for last names in one country.
There is a fixed number of states and postal codes in one coun
Folks,
> I wonder if this paper has anything that might help:
> http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
> were more of a statistician I might be able to answer :-)
Actually, that paper looks *really* promising. Does anyone here have enough
math to solve for D(s
Andrew,
> The math in the paper does not seem to look at very low levels of q (=
> sample to pop ratio).
Yes, I think that's the failing. Mind you, I did more testing and found out
that for D/N ratios of 0.1 to 0.3, the formula only works within 5x accuracy
(which I would consider acceptable)
Tom Lane wrote:
Josh Berkus writes:
Overall, our formula is inherently conservative of n_distinct. That is, I
believe that it is actually computing the *smallest* number of distinct
values which would reasonably produce the given sample, rather than the
*median* one. This is contrary to
Josh Berkus writes:
> Overall, our formula is inherently conservative of n_distinct. That is, I
> believe that it is actually computing the *smallest* number of distinct
> values which would reasonably produce the given sample, rather than the
> *median* one. This is contrary to the notes in
Josh Berkus said:
>
>
> Well, unusual distributions are certainly tough. But I think the
> problem exists even for relatively well-distributed populations.
> Part of it is, I believe, the formula we are using:
>
> n*d / (n - f1 + f1*n/N)
>
[snip]
>
> This is so broken, in fact, that I'm wonderin
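For readers following along, plugging a few made-up sample profiles into that formula (n = sampled rows, d = distinct values in the sample, f1 = values seen exactly once, N = rows in the table) shows the behaviour being complained about:

def haas_stokes(n, d, f1, N):
    # n*d / (n - f1 + f1*n/N), exactly as quoted above
    return n * d / (n - f1 + f1 * n / N)

N, n = 10_000_000, 30_000          # 30,000-row sample from a 10M-row table
for d, f1 in [(30_000, 30_000),    # every sampled value unique
              (25_000, 20_000),    # repeats starting to show
              (5_000, 500)]:       # heavy repetition in the sample
    print(d, f1, round(haas_stokes(n, d, f1, N)))

When every sampled value is unique the estimate shoots up to N, but as soon as the sample shows substantial repetition the estimate stays close to d itself; that matches the observation elsewhere in the thread that the formula tends toward the smallest number of distinct values consistent with the sample.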
People,
> Can someone whose math is more recent than calculus in 1989 take a look at
> that paper, and look at the formula toward the bottom of page 10, and see
> if we are correctly interpreting it? I'm particularly confused as to
> what "q" and "d-sub-n" represent. Thanks!
Actually, I manag
Greg,
> I looked into this a while back when we were talking about changing the
> sampling method. The conclusions were discouraging. Fundamentally, using
> constant sized samples of data for n_distinct is bogus. Constant sized
> samples only work for things like the histograms that can be analyze