Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-19 Thread Daniel Farina
On Mon, Mar 19, 2012 at 6:19 PM, Jeff Davis  wrote:
> On Sat, 2012-03-17 at 12:48 +, Simon Riggs wrote:
>> The problems are as I described them
>>
>> (1) no account made for sparsity, and other factors leading to an
>> overestimate of rows (N)
>>
>> (2) inappropriate assumption of the effect of LIMIT m, which causes a
>> costly SeqScan to appear better than an IndexScan for low m/N, when in
>> fact that is seldom the case.
>>
>> Overestimating N in (1) inverts the problem, so that an overestimate
>> isn't the safe thing at all.
>
> I think the actual problem has more to do with risk. The planner doesn't
> know how uniform the distribution of the table is, which introduces risk
> for the table scan.
>
> I would tend to agree that for low selectivity fraction and a very low
> limit (e.g. 1-3 in your example) and a large table, it doesn't seem like
> a good risk to use a table scan. I don't know how that should be modeled
> or implemented though.

FWIW, we have been bitten by the uniformity assumption a number of
times, but we kind of know what to look for.  It also has an
interesting interpretation as applied to garbage (insomuch as queries
all carry a conceptual "WHERE xmin...xmax... predicate).  Sometimes a
table is mostly garbage for a particular value, and the index scan
constantly has to seek the next tuple in its search to find
non-garbage.

I don't know how many times the uniformity assumption has held up fine
(guess: many), but cases of where an index cond
 has that little "Filter:" footnote has been a ticking timebomb for a
number of people I've interacted with.  It can be subtle because the
amount of scanned data may not be enormous, so upon re-running the
query it is not nearly as pathological as the original run given cache
effects, the only way to make it very visible is to look at the
EXPLAIN with BUFFERS option and note how many pages are being
discarded by the filter/how many pages of the underlying relation and
indexes are being processed only to be discarded.

Were I writing a OLTP-query performance linting tool, queries with
partially-matched index conditions might even be on my list of things
to warn about.  That is unfortunate for cases where the unmatched part
of the index condition may only qualify, say, five records in all
situations, but the database has no construct to know and enforce this
AFAIK.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-19 Thread Jeff Davis
On Sat, 2012-03-17 at 12:48 +, Simon Riggs wrote:
> The problems are as I described them
> 
> (1) no account made for sparsity, and other factors leading to an
> overestimate of rows (N)
> 
> (2) inappropriate assumption of the effect of LIMIT m, which causes a
> costly SeqScan to appear better than an IndexScan for low m/N, when in
> fact that is seldom the case.
> 
> Overestimating N in (1) inverts the problem, so that an overestimate
> isn't the safe thing at all.

I think the actual problem has more to do with risk. The planner doesn't
know how uniform the distribution of the table is, which introduces risk
for the table scan.

I would tend to agree that for low selectivity fraction and a very low
limit (e.g. 1-3 in your example) and a large table, it doesn't seem like
a good risk to use a table scan. I don't know how that should be modeled
or implemented though.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-17 Thread Simon Riggs
On Sat, Mar 17, 2012 at 11:33 AM, Greg Stark  wrote:
> On Sat, Mar 17, 2012 at 9:34 AM, Simon Riggs  wrote:
>> My wish was to register this as both a common and significant bug,
>
> It has definitely come up here before many times.
>
> However at root the problem is part of the general class of not
> understanding how two different columns are related. Postgres is
> assuming they're entirely independent and therefore all the values of
> x are uniformly distributed over values of y.
>
> To plan this better in your case it would have to know that blah_id <=
> 72572020 is not equally likely for user_id = ANY
> ('{list}'::integer[]) as it is for the table as a whole.

That is not the root cause in this case, though I agree that is also a
problem. The first plan is more complex because of an ORDER BY that
favours the index scan, but that's actually irrelevant to the case;
the use of blah_id is actually the user using the fact that things are
allocated in date order to avoid doing a date sort.

The problems are as I described them

(1) no account made for sparsity, and other factors leading to an
overestimate of rows (N)

(2) inappropriate assumption of the effect of LIMIT m, which causes a
costly SeqScan to appear better than an IndexScan for low m/N, when in
fact that is seldom the case.

Overestimating N in (1) inverts the problem, so that an overestimate
isn't the safe thing at all.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-17 Thread Greg Stark
On Sat, Mar 17, 2012 at 9:34 AM, Simon Riggs  wrote:
> My wish was to register this as both a common and significant bug,

It has definitely come up here before many times.

However at root the problem is part of the general class of not
understanding how two different columns are related. Postgres is
assuming they're entirely independent and therefore all the values of
x are uniformly distributed over values of y.

To plan this better in your case it would have to know that blah_id <=
72572020 is not equally likely for user_id = ANY
('{list}'::integer[]) as it is for the table as a whole.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-17 Thread Simon Riggs
On Fri, Mar 16, 2012 at 9:11 PM, Tom Lane  wrote:

> Simon Riggs  writes:
>> 2. We assume that if values do exist that they have rows uniformly
>> distributed across the whole table like rungs on a ladder.
>
> Well, yeah.  That's sometimes wrong, but not always.  In the absence
> of evidence to the contrary, I think it's a better assumption than
> most others.

The evidence I showed proves it is x1000 worse. I'm not sure how you
say it is "better than most" or that there is no evidence.


> You are not the first person to have run into this type of problem.

Yes - if I thought it was an isolated problem I would not have brought it up.

Also, I don't bring it up because I need help; the actual problem has
workarounds. But rewriting SQL because of a planner problem is not
something that highlights our strengths.


> If it were easily solved by some minor hack, we would have done that
> long since.  The problem is to not throw the baby out with the
> bathwater.
>
> I can see an argument for downgrading the assumed effectiveness of
> restrictions that are applied as qpquals (ie, not indexquals), but
> the LIMIT node coster doesn't have easy access to the information
> needed for that, especially not in situations more complicated than
> a single-table scan.

My wish was to register this as both a common and significant bug,
whatever the solution is.

The bug appears in very simple queries, so we can't hide behind some
doubt as to the exact cause.

Few planning errors are wrong by more than 1000 times; this plan gets
worse every time the client gains new customers, so its not just bad,
its getting worse. Multiple different queries show the same symptoms,
so its common also.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-16 Thread Simon Riggs
On Fri, Mar 16, 2012 at 9:39 PM, Jeff Davis  wrote:
> On Fri, 2012-03-16 at 18:25 +, Simon Riggs wrote:
>> Any time we apply a LIMIT clause to a plan with a SeqScan or
>> unqualified IndexScan, we shouldn't assume the scan will do less than
>> say 10% of the table. It might, but its an unsafe assumption because
>> as the selectivity decreases so does the safety of the assumption that
>> rows are uniformly distributed.
>
> Just trying to follow along. You mean "as the selectivity _increases_
> the safety of the assumption that the rows are uniformly distributed
> decreases", right?

Selectivity meaning the value between 0 and 1 that describes, in the
planner, the fraction of rows we will get back from a scan. 1.0 means
100% of rows. When selectivity is low, that means very few rows will
come back. I think you are using "high selectivity" as meaning
"returns few of the rows", so you understand me, but just flip the
meaning of the words.

When you have lots of rows, its a good assumption they are spread out
and a scan will find some of them quickly.

When you have very few rows, assuming they are evenly spaced is just
weird. Clumpiness of some kind seems much more likely. Much more
easily understood if the values are dates, for example.

Given the estimated number of rows is deliberately a worst case (i,e.
high), that sounds like the scan will work. Yet the reality is that
doing the scan is incredibly costly when those assumptions break,
which they do, often. Especially when the values don't exist at all
because the table is sparse.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-16 Thread Jeff Davis
On Fri, 2012-03-16 at 18:25 +, Simon Riggs wrote:
> Any time we apply a LIMIT clause to a plan with a SeqScan or
> unqualified IndexScan, we shouldn't assume the scan will do less than
> say 10% of the table. It might, but its an unsafe assumption because
> as the selectivity decreases so does the safety of the assumption that
> rows are uniformly distributed.

Just trying to follow along. You mean "as the selectivity _increases_
the safety of the assumption that the rows are uniformly distributed
decreases", right?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Incorrect assumptions with low LIMITs

2012-03-16 Thread Tom Lane
Simon Riggs  writes:
> 2. We assume that if values do exist that they have rows uniformly
> distributed across the whole table like rungs on a ladder.

Well, yeah.  That's sometimes wrong, but not always.  In the absence
of evidence to the contrary, I think it's a better assumption than
most others.

> Any time we apply a LIMIT clause to a plan with a SeqScan or
> unqualified IndexScan, we shouldn't assume the scan will do less than
> say 10% of the table.

This is a horrid idea, because it will stop the planner from using
fast-start plans in many situations where they are wins.

> Other ideas welcome.

You are not the first person to have run into this type of problem.
If it were easily solved by some minor hack, we would have done that
long since.  The problem is to not throw the baby out with the
bathwater.

I can see an argument for downgrading the assumed effectiveness of
restrictions that are applied as qpquals (ie, not indexquals), but
the LIMIT node coster doesn't have easy access to the information
needed for that, especially not in situations more complicated than
a single-table scan.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers