Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Robert Haas
On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Atri Sharma atri.j...@gmail.com writes:
 One of the factors that leads to bad estimates is that the histogram of a
 column's values maintained by the planner grows stale over time as the data
 in the column changes. So the histogram is no longer a very accurate view
 of the data, which leads to bad selectivity estimates.

 TBH, this is so far down the list of problems that it'll be a long time
 before we need to worry about it.  It's certainly not the number one
 priority for any project to model risk in the planner.

 The thing that I think is probably the number one problem is estimates
 that depend on an assumption of uniform distribution of sought-after rows
 among those encountered by a scan.  This is usually where bad plans for
 LIMIT queries are coming from.  We could certainly add some sort of fudge
 factor to those costs, but I'd like to have a more-or-less principled
 framework for doing so.

I think the problem is, in some sense, more basic than that.  I think
the kind of query we're talking about here is:

SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1

Assume for the sake of argument that there are 100 rows that would be
returned in the absence of the limit.  Let SC and TC be the startup
cost and total cost of the index scan.  As a matter of general policy,
we're going to say that the cost of this is SC + 0.01 * (TC - SC).
What makes this path look appealing to the planner is that SC is small
relative to TC.  If we knew, for example, that we weren't going to
find the first match until 90% of the way through the index scan, then
we could set SC = 90% * TC and, all else being equal, the planner
would make the right decision.
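
For illustration, here is a minimal Python sketch of that arithmetic, using
made-up SC/TC values (the 0.01 is simply LIMIT 1 divided by the 100 rows
expected to match):

# Illustrative only: how the LIMIT costing described above plays out.
SC, TC = 1.0, 10000.0          # assumed startup/total cost of the index scan
limit_rows, est_rows = 1, 100  # LIMIT 1, 100 rows expected to match

# The planner pro-rates the scan cost by the fraction of rows fetched.
limit_cost = SC + (limit_rows / est_rows) * (TC - SC)                  # ~101

# If the first match really sits 90% of the way into the scan, an honest
# startup cost would be about 0.9 * TC, making this path far less appealing.
honest_SC = 0.9 * TC
honest_cost = honest_SC + (limit_rows / est_rows) * (TC - honest_SC)   # ~9010

print(limit_cost, honest_cost)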

So you might think that the problem here is that we're assuming
uniform density.  Let's say there are a million rows in the table, and
there are 100 that match our criteria, so the first one is going to
happen 1/10,000'th of the way through the table.  Thus we set SC =
0.0001 * TC, and that turns out to be an underestimate if the
distribution isn't as favorable as we're hoping.  However, that is NOT
what we are doing.  What we are doing is setting SC = 0.  I mean, not
quite 0, but yeah, effectively 0.  Essentially, no matter how selective
the filter condition may be, we assume that it will match *the very first
row*.

So we're not assuming the average case and getting hosed when things
come out worse than average.  We're assuming the *best* case.  So
unless things happen to really swing in our favor, we get hosed.

Now it might be that a fudge factor of 2 or 1.5 or 10 or 3 or 17 is
appropriate, so that we actually assume we're going to have to scan a
little more of the index than we expect.  That can perhaps be
justified by the possibility that there may actually be NO rows
matching the filter condition, and we'll have to try scanning the
entire index to get off the ground.  We could also try to come up with
a mathematical model for that.  But that fudge factor would presumably
be a multiplier on the effort of finding the first tuple.  And right
now we assume that finding the first tuple will be trivial.  So I
think we should fix THAT problem first, and then if that turns out to
be insufficient, we can worry about what further fudging is required.
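
As a very rough sketch of that idea, with a purely hypothetical fudge knob
(nothing like this exists in the costing code today):

# Hypothetical sketch: inflate the estimated effort of reaching the first
# matching tuple by a fudge multiplier, rather than assuming it is trivial.
def fudged_startup(SC, TC, matching_rows, fudge=2.0):
    # uniform-density estimate of the cost to reach the first match
    first_tuple_effort = (TC - SC) / max(matching_rows, 1)
    # scaled up to hedge against unfavorable orderings or zero matches
    return SC + fudge * first_tuple_effort

print(fudged_startup(SC=1.0, TC=10000.0, matching_rows=100))  # ~201 (vs ~101)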

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So you might think that the problem here is that we're assuming
 uniform density.  Let's say there are a million rows in the table, and
 there are 100 that match our criteria, so the first one is going to
 happen 1/10,000'th of the way through the table.  Thus we set SC =
 0.0001 * TC, and that turns out to be an underestimate if the
 distribution isn't as favorable as we're hoping.  However, that is NOT
 what we are doing.  What we are doing is setting SC = 0.  I mean, not
 quite 0, but yeah, effectively 0.  Essentially, no matter how selective
 the filter condition may be, we assume that it will match *the very first
 row*.

I think this is wrong.  Yeah, the SC may be 0 or near it, but the time to
fetch the first tuple is estimated as SC + (TC-SC)/N.

regards, tom lane




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Atri Sharma
On Thu, Mar 20, 2014 at 8:10 PM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Mar 18, 2014 at 2:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Atri Sharma atri.j...@gmail.com writes:
  One of the factors that leads to bad estimates is that the histogram of a
  column's values maintained by the planner grows stale over time as the
  data in the column changes. So the histogram is no longer a very accurate
  view of the data, which leads to bad selectivity estimates.
 
  TBH, this is so far down the list of problems that it'll be a long time
  before we need to worry about it.  It's certainly not the number one
  priority for any project to model risk in the planner.
 
  The thing that I think is probably the number one problem is estimates
  that depend on an assumption of uniform distribution of sought-after rows
  among those encountered by a scan.  This is usually where bad plans for
  LIMIT queries are coming from.  We could certainly add some sort of fudge
  factor to those costs, but I'd like to have a more-or-less principled
  framework for doing so.

 I think the problem is, in some sense, more basic than that.  I think
 the kind of query we're talking about here is:

 SELECT * FROM foo WHERE unlikely ORDER BY indexed_column LIMIT 1

 Assume for the sake of argument that there are 100 rows that would be
 returned in the absence of the limit.  Let SC and TC be the startup
 cost and total cost of the index scan.  As a matter of general policy,
 we're going to say that the cost of this is SC + 0.01 * (TC - SC).
 What makes this path look appealing to the planner is that SC is small
 relative to TC.  If we knew, for example, that we weren't going to
 find the first match until 90% of the way through the index scan, then
 we could set SC = 90% * TC and, all else being equal, the planner
 would make the right decision.

 So you might think that the problem here is that we're assuming
 uniform density.  Let's say there are a million rows in the table, and
 there are 100 that match our criteria, so the first one is going to
 happen 1/10,000'th of the way through the table.  Thus we set SC =
 0.0001 * TC, and that turns out to be an underestimate if the
 distribution isn't as favorable as we're hoping.  However, that is NOT
 what we are doing.  What we are doing is setting SC = 0.  I mean, not
 quite 0, but yeah, effectively 0.  Essentially, no matter how selective
 the filter condition may be, we assume that it will match *the very first
 row*.



Can't we reuse the same histogram we have in the planner right now for
this? I mean, AFAIK, the heuristic we have is that the histogram divides
the data into equal-size buckets; we find the bucket in which our predicate
value lies, then take some part of that bucket plus all of the buckets
before it, right?

So, suppose a query is SELECT * FROM table WHERE a < 10; we shall find the
bucket that 10 lies in, right?

Now, why can't we take the estimate of all the buckets behind the bucket
in which our value is present? Will that estimate not give us the fraction
of tuples that are expected to be before the first matching row?

It's pretty wild, but I wanted to know if my understanding of this scenario
is correct or not.
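
For illustration, a minimal Python sketch of the bucket-summing heuristic
described above, with made-up bucket boundaries; note that this yields a
selectivity fraction, not the physical position of the first matching row:

# Estimate the fraction of rows with a < 10 from an equi-depth histogram of
# bucket boundaries (made-up data; assumes the value falls within the range).
bounds = [1, 4, 7, 12, 20, 35, 50, 80, 120, 200, 500]  # 10 equi-depth buckets
value = 10

def fraction_below(bounds, value):
    nbuckets = len(bounds) - 1
    for i in range(nbuckets):
        lo, hi = bounds[i], bounds[i + 1]
        if value < hi:
            # whole buckets below this one, plus a linear slice of this bucket
            partial = (value - lo) / (hi - lo)
            return (i + partial) / nbuckets
    return 1.0

print(fraction_below(bounds, value))   # ~0.26: estimated selectivity of a < 10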

Regards,

Atri

-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Tom Lane
Atri Sharma atri.j...@gmail.com writes:
 Now, why can't we take the estimate of all the buckets behind the bucket
 in which our value is present? Will that estimate not give us the fraction
 of tuples that are expected to be before the first matching row?

Uh, no, not unless you assume that the table happens to be perfectly
sorted by the column's value.

regards, tom lane




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Atri Sharma
On Thu, Mar 20, 2014 at 8:51 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Atri Sharma atri.j...@gmail.com writes:
  Now, why can't we take the estimate of all the buckets behind the bucket
  in which our value is present? Will that estimate not give us the
  fraction of tuples that are expected to be before the first matching row?

 Uh, no, not unless you assume that the table happens to be perfectly
 sorted by the column's value.




Yes, that is true. So, if an attribute has an index present, can we do this
somehow?

Regards,

Atri



-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-20 Thread Robert Haas
On Thu, Mar 20, 2014 at 10:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So you might think that the problem here is that we're assuming
 uniform density.  Let's say there are a million rows in the table, and
 there are 100 that match our criteria, so the first one is going to
 happen 1/10,000'th of the way through the table.  Thus we set SC =
 0.0001 * TC, and that turns out to be an underestimate if the
 distribution isn't as favorable as we're hoping.  However, that is NOT
 what we are doing.  What we are doing is setting SC = 0.  I mean, not
 quite 0, but yeah, effectively 0.  Essentially, no matter how selective
 the filter condition may be, we assume that it will match *the very first
 row*.

 I think this is wrong.  Yeah, the SC may be 0 or near it, but the time to
 fetch the first tuple is estimated as SC + (TC-SC)/N.

Hmm, you're right, and experimentation confirms that the total cost of
the limit comes out to about TC/selectivity.  So scratch that theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-18 Thread Josh Berkus

 On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah.  I would like to see the planner's cost estimates extended to
 include some sort of uncertainty estimate, whereupon risk-averse people
 could ask it to prefer low-uncertainty plans over high-uncertainty ones
 (the plans we typically choose for ORDER BY ... LIMIT queries being great
 examples of the latter).  But it's a long way from wishing that to making
 it so.  Right now it's not even clear (to me anyway) how we'd measure or
 model such uncertainty.

This is not a model, but here's some starting thoughts:

A high-risk plan has three components:

a) our statistical data is out-of-date or inadequate

b) the potential execution time is high if our selectivity estimates are
wrong

c) the cost ratio of certain operations is wrong.

Factor (a) can be modeled two ways:

1. If last_analyze was a long time ago, the risk is increased.
   (Ideally, we'd have some idea of the change rate on the table vs. the
   last analyze time; right now we don't have those stats.)

2. Certain patterns, such as multi-column selectivity and GIN/GiST
selectivity, are known to have poor estimates and to be higher risk.
Certainly, selectivity functions which have been programmed with a flat
coefficient (like the default 0.05 selectivity for gist_ops) could also
return a fairly high risk factor.

Factor (b) can be modeled simply by estimating the cost of a plan where
all row estimates are changed by 10X, or even better by a calculation based
on the risk factor from (a).  This would then give us the failure cost of
the bad plan.  Note that we need to estimate in both directions, both for
higher estimates and lower ones; "abort early" plans fail because the rows
returned are lower than expected, for example.

Estimating (b) would be expensive if we did it for every combination of
wrong estimates across the entire plan, so I'm wondering whether it would
be adequate to just estimate the effect of the selectivity being off on a
per-node basis.
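
As a rough illustration of the per-node idea, a hypothetical sketch in
which cost_fn stands in for whatever costing applies to a node (this is
not actual planner code):

# Re-cost a node with its row estimate off by 10X in either direction and
# report the worst-case blowup relative to the base cost.
def node_risk(cost_fn, est_rows, factor=10.0):
    base = cost_fn(est_rows)
    worst = max(cost_fn(est_rows * factor), cost_fn(est_rows / factor))
    return worst / base   # >1: degrades when the estimate is off

# A node whose cost grows linearly with its input is less risky than one
# whose cost grows quadratically (say, a misestimated nested loop).
print(node_risk(lambda rows: 10 + 0.01 * rows, 1000))           # ~5.5
print(node_risk(lambda rows: 10 + 0.0001 * rows * rows, 1000))  # ~91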

We can't realistically estimate (c) at all (i.e. if we knew the cost factor
was wrong, we'd fix it), so I suggest ignoring it for risk estimation.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-18 Thread Atri Sharma
On Tue, Mar 18, 2014 at 11:43 PM, Josh Berkus j...@agliodbs.com wrote:


  On Mon, Mar 17, 2014 at 8:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Yeah.  I would like to see the planner's cost estimates extended to
  include some sort of uncertainty estimate, whereupon risk-averse people
  could ask it to prefer low-uncertainty plans over high-uncertainty ones
  (the plans we typically choose for ORDER BY ... LIMIT queries being great
  examples of the latter).  But it's a long way from wishing that to making
  it so.  Right now it's not even clear (to me anyway) how we'd measure or
  model such uncertainty.



I have been thinking of some ways to attach a risk estimate to each
selectivity that our planner gives. One way to do it is as follows:

One of the factors that leads to bad estimates is that the histogram of a
column's values maintained by the planner grows stale over time as the data
in the column changes. So the histogram is no longer a very accurate view
of the data, which leads to bad selectivity estimates.

One thing we can try is to attach an error factor to the selectivity we
produce. This lets us factor in the probability that the data has changed,
and an estimate of the difference between the current histogram and the
histogram of the data actually present in the column.

We can use the Central Limit Theorem
(http://en.wikipedia.org/wiki/Central_limit_theorem). Essentially, the
theorem says that given a distribution with finite mean and variance, we
can take random independent samples from the data and calculate the mean
and standard deviation of each sample. If we take a large enough number of
samples and plot the sample means, they will approximately follow a normal
distribution.
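
For illustration, a small Python sketch of that sampling behavior (this is
just the standard CLT, nothing planner-specific):

# Means of repeated random samples from a skewed "column" cluster around the
# true mean and are roughly normally distributed; their spread (the standard
# error) shows how much to trust a single sample-based estimate.
import random, statistics

column = [random.expovariate(1.0) for _ in range(100000)]   # skewed data

# Take many independent 200-row samples and record each sample's mean.
sample_means = [statistics.mean(random.sample(column, 200)) for _ in range(500)]

print(statistics.mean(sample_means), statistics.stdev(sample_means))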

What is interesting is that this lets us estimate the SD of a given dataset
from that curve, and that SD should be directly proportional to how much
the actual data deviates from the planner's histogram.

I am no mathematician, hence it's hard for me to explain; I think the link
at [1] will be more helpful.

So, we can have a probability value for the random variable and that shall
model the confidence we have in our estimate.

I may be wrong in some parts but I hope I have been able to convey the
general idea.

If this idea develops, I shall be happy to work on it, but my hands are
full with ROLLUPS right now, so for my part it will take time. I just want
to float the idea and get a general feel for it at this point.

Please let me know your comments and feedback on this.

Regards,

Atri

[1]: http://www.theriac.org/DeskReference/viewDocument.php?id=177
-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Risk Estimation WAS: Planner hints in Postgresql

2014-03-18 Thread Tom Lane
Atri Sharma atri.j...@gmail.com writes:
 One of the factors that leads to bad estimates is that the histogram of a
 column's values maintained by the planner grows stale over time as the data
 in the column changes. So the histogram is no longer a very accurate view
 of the data, which leads to bad selectivity estimates.

TBH, this is so far down the list of problems that it'll be a long time
before we need to worry about it.  It's certainly not the number one
priority for any project to model risk in the planner.

The thing that I think is probably the number one problem is estimates
that depend on an assumption of uniform distribution of sought-after rows
among those encountered by a scan.  This is usually where bad plans for
LIMIT queries are coming from.  We could certainly add some sort of fudge
factor to those costs, but I'd like to have a more-or-less principled
framework for doing so.

regards, tom lane

