Re: [HACKERS] improving GROUP BY estimation
Dean Rasheed writes: > Here's an updated patch with references to both papers, and a more > detailed justification for the formula, along with the other changes > discussed. Note that although equation (2) in the Dell'Era paper looks > different from the Yao formula, it's actually the same. Looks good to me. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
On 31 March 2016 at 22:02, Tom Lane wrote:
> I'm just concerned about what happens when the Dellera paper stops being
> available. I don't mind including that URL as a backup to the written-out
> argument I just suggested.
>

Here's an updated patch with references to both papers, and a more detailed
justification for the formula, along with the other changes discussed. Note
that although equation (2) in the Dell'Era paper looks different from the Yao
formula, it's actually the same.

Regards,
Dean

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
new file mode 100644
index a6555e9..99f5f7c
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3439,9 +3439,51 @@ estimate_num_groups(PlannerInfo *root, L
         reldistinct = clamp;

         /*
-         * Multiply by restriction selectivity.
+         * Update the estimate based on the restriction selectivity,
+         * guarding against division by zero when reldistinct is zero.
+         * Also skip this if we know that we are returning all rows.
          */
-        reldistinct *= rel->rows / rel->tuples;
+        if (reldistinct > 0 && rel->rows < rel->tuples)
+        {
+            /*
+             * Given a table containing N rows with n distinct values in a
+             * uniform distribution, if we select p rows at random then
+             * the expected number of distinct values selected is
+             *
+             *     n * (1 - product((N-N/n-i)/(N-i), i=0..p-1))
+             *
+             *     = n * (1 - (N-N/n)! / (N-N/n-p)! * (N-p)! / N!)
+             *
+             * See "Approximating block accesses in database
+             * organizations", S. B. Yao, Communications of the ACM,
+             * Volume 20 Issue 4, April 1977 Pages 260-261.
+             *
+             * Alternatively, re-arranging the terms from the factorials,
+             * this may be written as
+             *
+             *     n * (1 - product((N-p-i)/(N-i), i=0..N/n-1))
+             *
+             * This form of the formula is more efficient to compute in
+             * the common case where p is larger than N/n. Additionally,
+             * as pointed out by Dell'Era, if i << N for all terms in the
+             * product, it can be approximated by
+             *
+             *     n * (1 - ((N-p)/N)^(N/n))
+             *
+             * See "Expected distinct values when selecting from a bag
+             * without replacement", Alberto Dell'Era,
+             * http://www.adellera.it/investigations/distinct_balls/.
+             *
+             * The condition i << N is equivalent to n >> 1, so this is a
+             * good approximation when the number of distinct values in
+             * the table is large. It turns out that this formula also
+             * works well even when n is small.
+             */
+            reldistinct *=
+                (1 - pow((rel->tuples - rel->rows) / rel->tuples,
+                         rel->tuples / reldistinct));
+        }
+        reldistinct = clamp_row_est(reldistinct);

         /*
          * Update estimate of total distinct groups.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out new file mode 100644 index de64ca7..0fc93d9 --- a/src/test/regress/expected/subselect.out +++ b/src/test/regress/expected/subselect.out @@ -807,27 +807,24 @@ select * from int4_tbl where explain (verbose, costs off) select * from int4_tbl o where (f1, f1) in (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1); - QUERY PLAN --- - Hash Join + QUERY PLAN + + Hash Semi Join Output: o.f1 Hash Cond: (o.f1 = "ANY_subquery".f1) -> Seq Scan on public.int4_tbl o Output: o.f1 -> Hash Output: "ANY_subquery".f1, "ANY_subquery".g - -> HashAggregate + -> Subquery Scan on "ANY_subquery" Output: "ANY_subquery".f1, "ANY_subquery".g - Group Key: "ANY_subquery".f1, "ANY_subquery".g - -> Subquery Scan on "ANY_subquery" - Output: "ANY_subquery".f1, "ANY_subquery".g - Filter: ("ANY_subquery".f1 = "ANY_subquery".g) - -> HashAggregate - Output: i.f1, (generate_series(1, 2) / 10) - Group Key: i.f1 - -> Seq Scan on public.int4_tbl i - Output: i.f1 -(18 rows) + Filter: ("ANY_subquery".f1 = "ANY_subquery".g) + -> HashAggregate + Output: i.f1, (generate_series(1, 2) / 10) + Group Key: i.f1 + -> Seq Scan on public.int4_tbl i + Output: i.f1 +(15 rows) select * from int4_tbl o where (f1, f1) in (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1); -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
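To see how the formula in the patch above behaves outside the planner, here is a small standalone C sketch (not part of the patch; the function name and the sample table sizes are invented for illustration). It evaluates n * (1 - ((N-p)/N)^(N/n)) for a hypothetical table of 10 million rows containing 10,000 distinct values, and shows the estimate climbing toward n as more rows are selected, which is the behaviour the new comment describes.

    /*
     * Standalone illustration (hypothetical, not part of the patch) of the
     * estimator used above:  n * (1 - ((N-p)/N)^(N/n)).
     */
    #include <math.h>
    #include <stdio.h>

    /* N = rows in the table, n = distinct values, p = rows selected */
    static double
    expected_distinct(double N, double n, double p)
    {
        /* same guards as the patch: skip when n is zero or all rows are returned */
        if (n <= 0 || p >= N)
            return n;
        return n * (1.0 - pow((N - p) / N, N / n));
    }

    int
    main(void)
    {
        double  rows[] = {1000, 10000, 100000, 1000000};
        int     i;

        for (i = 0; i < 4; i++)
            printf("select %7.0f of 10000000 rows -> expect %7.1f of 10000 distinct\n",
                   rows[i], expected_distinct(10000000.0, 10000.0, rows[i]));
        return 0;
    }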
Re: [HACKERS] improving GROUP BY estimation
Alvaro Herrera writes: > Tom Lane wrote: >> See "Approximating block accesses in database organizations", S. B. Yao, >> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261 > That sounds nice all right, but I'm not sure it's actually helpful, > because the article text is not available anywhere. I doubt most people > will spend 15 bucks to buy that paper ... so we don't actually know > whether the paper supports the chosen formula :-) unless you have a > CACM subscription and can verify it. Well, I do and I did, see my previous response. > I think it's good to have the ACM reference anyway, for posterity, but > it'd be good to (additionally) have something that people can read. I'm just concerned about what happens when the Dellera paper stops being available. I don't mind including that URL as a backup to the written-out argument I just suggested. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Dean Rasheed writes: > Yeah, that makes sense. In fact, if we only apply the adjustment when > reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first > argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it > is guaranteed to be non-negative. If rel->rows >= rel->tuples (not > sure if it can be greater), then we just want the original > reldistinct. Works for me. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Dean Rasheed writes: > On 31 March 2016 at 21:40, Tom Lane wrote: >> Let's use something like this: >> See "Approximating block accesses in database organizations", S. B. Yao, >> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261 Actually, having now looked at both the Dellera paper and the Yao one, what the latter shows seems to be equivalent to Dellera's equation (2) (the first one in his section 2.2). But what the code is actually using is the approximation in the second equation in 2.2, and that one is certainly not in Yao, although it's not hard to see how you get from the first to the second if you assume i << Nr. I think it'd be worth writing out equation (2), attributing it to the Yao paper, and then saying something like "Alberto Dellera points out that if Nd is large, so that all the values of i are much less than Nr, then this formula can be approximated by [the formula used in the code]. It turns out that that formula also works well even when Nd is small." regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
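To spell out the step Tom describes, using the notation of the final code comment (N rows, n distinct values, p selected rows; Dell'Era's Nr and Nd are N and n here), the rearranged Yao formula is

    n \left( 1 - \prod_{i=0}^{N/n - 1} \frac{N - p - i}{N - i} \right).

Since i only runs up to N/n - 1, the condition "i << Nr for all terms" amounts to n >> 1, in which case every factor is approximately (N - p)/N and the product collapses to

    n \left( 1 - \left( \frac{N - p}{N} \right)^{N/n} \right),

which is the approximation the code actually uses. (This is just the algebra restated for convenience, not material beyond the surrounding messages.)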
Re: [HACKERS] improving GROUP BY estimation
Tom Lane wrote: > > The article text refers to this 1977 S. B. Yao paper "Approximating > > block accesses in database organizations" which doesn't appear to be > > available online, except behind ACM's paywall at > > http://dl.acm.org/citation.cfm?id=359475 > > Well, a CACM citation is perfectly fine by my lights (especially one > that's that far back and therefore certainly patent-free ...) > > Let's use something like this: > > See "Approximating block accesses in database organizations", S. B. Yao, > Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261 That sounds nice all right, but I'm not sure it's actually helpful, because the article text is not available anywhere. I doubt most people will spend 15 bucks to buy that paper ... so we don't actually know whether the paper supports the chosen formula :-) unless you have a CACM subscription and can verify it. I think it's good to have the ACM reference anyway, for posterity, but it'd be good to (additionally) have something that people can read. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
On 31 March 2016 at 21:40, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> Another minor gripe is the use of a random URL as justification. This >>> code will still be around when that URL exists nowhere but the Wayback >>> Machine. Can't we find a more formal citation to use? > >> The article text refers to this 1977 S. B. Yao paper "Approximating >> block accesses in database organizations" which doesn't appear to be >> available online, except behind ACM's paywall at >> http://dl.acm.org/citation.cfm?id=359475 > > Well, a CACM citation is perfectly fine by my lights (especially one > that's that far back and therefore certainly patent-free ...) > > Let's use something like this: > > See "Approximating block accesses in database organizations", S. B. Yao, > Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261 > Sounds good. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
On 31 March 2016 at 20:18, Tom Lane wrote: > Also, I wonder if it'd be a good idea to provide a guard against division > by zero --- we know rel->tuples > 0 at this point, but I'm less sure that > reldistinct can't be zero. In the same vein, I'm worried about the first > argument of pow() being slightly negative due to roundoff error, leading > to a NaN result. Yeah, that makes sense. In fact, if we only apply the adjustment when reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it is guaranteed to be non-negative. If rel->rows >= rel->tuples (not sure if it can be greater), then we just want the original reldistinct. > Maybe we should also consider clamping the final reldistinct estimate to > an integer with clamp_row_est(). The existing code doesn't do that but > it seems like a good idea on general principles. OK, that seems sensible. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Alvaro Herrera writes: > Tom Lane wrote: >> Another minor gripe is the use of a random URL as justification. This >> code will still be around when that URL exists nowhere but the Wayback >> Machine. Can't we find a more formal citation to use? > The article text refers to this 1977 S. B. Yao paper "Approximating > block accesses in database organizations" which doesn't appear to be > available online, except behind ACM's paywall at > http://dl.acm.org/citation.cfm?id=359475 Well, a CACM citation is perfectly fine by my lights (especially one that's that far back and therefore certainly patent-free ...) Let's use something like this: See "Approximating block accesses in database organizations", S. B. Yao, Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261 regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Tom Lane wrote: > Another minor gripe is the use of a random URL as justification. This > code will still be around when that URL exists nowhere but the Wayback > Machine. Can't we find a more formal citation to use? The article text refers to this 1977 S. B. Yao paper "Approximating block accesses in database organizations" which doesn't appear to be available online, except behind ACM's paywall at http://dl.acm.org/citation.cfm?id=359475 -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Dean Rasheed writes: > On 30 March 2016 at 14:03, Tomas Vondra wrote: >> Attached is v4 of the patch > Thanks, I think this is good to go, except that I think we need to use > pow() rather than powl() because AIUI powl() is new to C99, and so > won't necessarily be available on all supported platforms. I don't > think we need worry about loss of precision, since that would only be > an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or > so, and it seems unlikely we'll get anywhere near that any time soon. I took a quick look. I concur with using pow() not powl(); the latter is not in SUS v2 which is our baseline portability expectation, and in fact there is *noplace* where we expect long double to work. Moreover, I don't believe that any of the estimates we're working with are so accurate that a double-width power result would be a useful improvement. Also, I wonder if it'd be a good idea to provide a guard against division by zero --- we know rel->tuples > 0 at this point, but I'm less sure that reldistinct can't be zero. In the same vein, I'm worried about the first argument of pow() being slightly negative due to roundoff error, leading to a NaN result. Maybe we should also consider clamping the final reldistinct estimate to an integer with clamp_row_est(). The existing code doesn't do that but it seems like a good idea on general principles. Another minor gripe is the use of a random URL as justification. This code will still be around when that URL exists nowhere but the Wayback Machine. Can't we find a more formal citation to use? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
On 30 March 2016 at 14:03, Tomas Vondra wrote: > Attached is v4 of the patch Thanks, I think this is good to go, except that I think we need to use pow() rather than powl() because AIUI powl() is new to C99, and so won't necessarily be available on all supported platforms. I don't think we need worry about loss of precision, since that would only be an issue if rel->rows / rel->tuples were smaller than maybe 10^-14 or so, and it seems unlikely we'll get anywhere near that any time soon. I think this is a good, well thought-out change, so unless anyone objects I'll push it (probably this weekend). Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Hi, On 03/22/2016 03:40 PM, Alexander Korotkov wrote: I think you should send a revision of patch including comments proposed by Dean Rasheed. I'm switching patch status to waiting on author in commitfest. Attached is v4 of the patch - the only difference w.r.t. v3 is that I've used the comment as proposed by Dean. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services estimate-num-groups-v4.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Hi Tomas, On 3/22/16 10:40 AM, Alexander Korotkov wrote: I think you should send a revision of patch including comments proposed by Dean Rasheed. I'm switching patch status to waiting on author in commitfest. We're getting pretty close to the end of the CF. Do you know when you will have a new patch ready? Thanks, -- -David da...@pgmasters.net -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Hi, Tomas! On Mon, Mar 21, 2016 at 2:39 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed > wrote: > >> Probably a better URL to give is >> http://www.adellera.it/investigations/distinct_balls/ which has a link >> to the PDF version of the paper and also some supporting material. >> >> However, while that paper is in general very clear, I don't think it >> gives a very clear explanation of that particular formula, and it >> doesn't explain what it represents. It merely says that if "i" can be >> ignored "for some reason (e.g. i << Nr)", then that formula is an >> approximation to the exact "without replacement" formula, which is the >> subject of that paper. >> >> But actually, that formula is the exact formula for the expected >> number of distinct values when selecting *with replacement* from a >> collection where each of the distinct values occurs an equal number of >> times. So I think we should say that. >> >> Perhaps something along the lines of: >> >> /* >> * Update the estimate based on the restriction selectivity. >> * >> * This uses the formula for the expected number of distinct >> values >> * when selecting with replacement from a collection where >> each of >> * the distinct values occurs an equal number of times (a >> uniform >> * distribution of values). This is a very close >> approximation to >> * the more correct method of selecting without replacement >> when >> * the number of distinct values is quite large --- for >> example, >> * see http://www.adellera.it/investigations/distinct_balls/. >> * Additionally, it also turns out to work very well even >> when the >> * number of distinct values is small. >> */ >> > > +1 > Thank you for work on this patch. The formula you propose and explanation > look great! > I think you should send a revision of patch including comments proposed by Deam Rasheed. I'm switching patch status to waiting on author in commitfest. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] improving GROUP BY estimation
Hi, Dean! On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed wrote: > Probably a better URL to give is > http://www.adellera.it/investigations/distinct_balls/ which has a link > to the PDF version of the paper and also some supporting material. > > However, while that paper is in general very clear, I don't think it > gives a very clear explanation of that particular formula, and it > doesn't explain what it represents. It merely says that if "i" can be > ignored "for some reason (e.g. i << Nr)", then that formula is an > approximation to the exact "without replacement" formula, which is the > subject of that paper. > > But actually, that formula is the exact formula for the expected > number of distinct values when selecting *with replacement* from a > collection where each of the distinct values occurs an equal number of > times. So I think we should say that. > > Perhaps something along the lines of: > > /* > * Update the estimate based on the restriction selectivity. > * > * This uses the formula for the expected number of distinct > values > * when selecting with replacement from a collection where > each of > * the distinct values occurs an equal number of times (a > uniform > * distribution of values). This is a very close approximation > to > * the more correct method of selecting without replacement > when > * the number of distinct values is quite large --- for > example, > * see http://www.adellera.it/investigations/distinct_balls/. > * Additionally, it also turns out to work very well even when > the > * number of distinct values is small. > */ > +1 Thank you for work on this patch. The formula you propose and explanation look great! -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] improving GROUP BY estimation
On 18 March 2016 at 00:37, Tomas Vondra wrote: >> On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote: >>> I think that a better formula to use would be >>> >>> reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples / >>> reldistinct) > > Attached is a v3 of the patch using this formula instead of the original > one. Interestingly, that apparently reduces the number of regression tests > that get broken to a single one. > Cool. > I'm not sure whether we need to provide a link to the PDF the formula comes > from - perhaps we should? > Probably a better URL to give is http://www.adellera.it/investigations/distinct_balls/ which has a link to the PDF version of the paper and also some supporting material. However, while that paper is in general very clear, I don't think it gives a very clear explanation of that particular formula, and it doesn't explain what it represents. It merely says that if "i" can be ignored "for some reason (e.g. i << Nr)", then that formula is an approximation to the exact "without replacement" formula, which is the subject of that paper. But actually, that formula is the exact formula for the expected number of distinct values when selecting *with replacement* from a collection where each of the distinct values occurs an equal number of times. So I think we should say that. Perhaps something along the lines of: /* * Update the estimate based on the restriction selectivity. * * This uses the formula for the expected number of distinct values * when selecting with replacement from a collection where each of * the distinct values occurs an equal number of times (a uniform * distribution of values). This is a very close approximation to * the more correct method of selecting without replacement when * the number of distinct values is quite large --- for example, * see http://www.adellera.it/investigations/distinct_balls/. * Additionally, it also turns out to work very well even when the * number of distinct values is small. */ Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
On 03/13/2016 11:09 PM, Tomas Vondra wrote:
Hi,

On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote:
On 4 March 2016 at 13:10, Tomas Vondra wrote:
...
>>> I think that a better formula to use would be reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples / reldistinct)

Attached is a v3 of the patch using this formula instead of the original
one. Interestingly, that apparently reduces the number of regression tests
that get broken to a single one.

I'm not sure whether we need to provide a link to the PDF the formula comes
from - perhaps we should?

I've also repeated the tests for the two tables (dependent and independent
columns), comparing the actual number of groups and different estimates, and
the results look like this (v3 is the formula used in this patch):

1) independent

              |   10 |   50 |  100 |  500 |  1000 |  5000
    ----------+------+------+------+------+-------+------
    actual    |  919 | 3829 | 6244 | 9944 | 10001 | 10001
    current   |   10 |   50 |  102 |  516 |  1018 |  4996
    new (v1)  |  973 | 4001 | 6382 | 9897 |  9951 |  9951
    new (v3)  | 1117 | 3852 | 6229 | 9943 | 10004 | 10004

2) dependent

              |   10 |   50 |  100 |  500 |  1000 |  5000
    ----------+------+------+------+------+-------+------
    actual    |   10 |   50 |  100 |  500 |  1000 |  5000
    current   |   10 |   53 |  105 |  508 |  1016 |  5014
    new (v1)  |  880 | 4105 | 6472 | 9955 | 10018 | 10018
    new (v3)  |  807 | 3680 | 6050 | 9916 |  9983 |  9983

I only collected numbers for the new estimator, the other numbers are just a
copy from the previous message. So there might be minor differences due to
slightly different ndistinct estimates etc. Anyway, the numbers are obviously
quite close to the formula from v1 of the patch, plus the formula gives
better estimates when scanning nearly all rows.

regards

--
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

estimate-num-groups-v3.patch
Description: binary/octet-stream

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Hi, On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote: > On 4 March 2016 at 13:10, Tomas Vondra > wrote: > > > > The risk associated with over-estimation is that we may switch from > > HashAggregate to GroupAggregate, and generally select paths better > > suited for large number of rows. Not great, because the plan may > > not be > > optimal and thus run much slower - but that usually only happens on > > small tables, because on large tables we would probably end up > > using > > those same paths anyway. > > > > With under-estimation, the main risks are much graver, as we may > > end up > > using HashAggregate only to get killed by OOM. When we're lucky not > > to > > hit that, my experience this often leads to a cascade of NestedLoop > > nodes processing orders of magnitude more tuples than expected. > > Awful. > > > > So I think the risk is asymmetric, and that's why I like the new > > estimator more, because it tends to over-estimate, but in a very > > reasonable way. > > > Agreed. Under-estimating is worse than over-estimating. > > > - reldistinct *= rel->rows / rel->tuples; > + reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, > rel->rows) > > Looking at this, I agree that this new formula from [1] is generally > better than the old one in most (but not all) cases. > > One problem with it is that it no longer takes into account > rel->tuples, which means that when returning all the tuples from the > table, it no longer gives an estimate equal to the total number of > distinct values in the table. In fact it tends to underestimate when > returning nearly all the rows from the table. Yes, that's a good point. > > The old formula is clearly not a good general-purpose estimate. > However, there is one case where it does return an accurate estimate > -- when all the values in the underlying table are distinct. In this > case the new formula consistently produces underestimates, while the > old formula works well. For example: > > CREATE TABLE t AS SELECT (1 * random())::int a, > i::int b > FROM generate_series(1,100) s(i); > ANALYSE t; > > EXPLAIN ANALYSE > SELECT a FROM t GROUP BY a; > > So there are clearly around 1 million distinct values for "a", but > the new formula gives an estimate of around 630,000. That's not a > massive underestimate, but it's sufficiently obvious and visible that > it would be good to do better if we can. > > > I think that a better formula to use would be > > reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples / > reldistinct) > > This comes from [2], which gives a general formula for selection > without replacement, and then gives the above as an approximation to > the uniform distribution case. It's not really made clear in that > paper, but that approximation is actually the "with replacement" > approximation, but starting from different initial assumptions to > give a formula that has better boundary conditions and special-case > behaviour, as described below. > > ... > > For most values of N and n, the approximation from [2] is almost > indistinguishable from the exact formula, whereas the formula from > [1] tends to underestimate when selecting a large number of distinct > values (e.g., try setting n=90 in the plot above). Yes, I agree that formula you propose seems to behave better. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
On 4 March 2016 at 13:10, Tomas Vondra wrote: > The risk associated with over-estimation is that we may switch from > HashAggregate to GroupAggregate, and generally select paths better > suited for large number of rows. Not great, because the plan may not be > optimal and thus run much slower - but that usually only happens on > small tables, because on large tables we would probably end up using > those same paths anyway. > > With under-estimation, the main risks are much graver, as we may end up > using HashAggregate only to get killed by OOM. When we're lucky not to > hit that, my experience this often leads to a cascade of NestedLoop > nodes processing orders of magnitude more tuples than expected. Awful. > > So I think the risk is asymmetric, and that's why I like the new > estimator more, because it tends to over-estimate, but in a very > reasonable way. > Agreed. Under-estimating is worse than over-estimating. - reldistinct *= rel->rows / rel->tuples; + reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows) Looking at this, I agree that this new formula from [1] is generally better than the old one in most (but not all) cases. One problem with it is that it no longer takes into account rel->tuples, which means that when returning all the tuples from the table, it no longer gives an estimate equal to the total number of distinct values in the table. In fact it tends to underestimate when returning nearly all the rows from the table. The old formula is clearly not a good general-purpose estimate. However, there is one case where it does return an accurate estimate -- when all the values in the underlying table are distinct. In this case the new formula consistently produces underestimates, while the old formula works well. For example: CREATE TABLE t AS SELECT (1 * random())::int a, i::int b FROM generate_series(1,100) s(i); ANALYSE t; EXPLAIN ANALYSE SELECT a FROM t GROUP BY a; So there are clearly around 1 million distinct values for "a", but the new formula gives an estimate of around 630,000. That's not a massive underestimate, but it's sufficiently obvious and visible that it would be good to do better if we can. I think that a better formula to use would be reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples / reldistinct) This comes from [2], which gives a general formula for selection without replacement, and then gives the above as an approximation to the uniform distribution case. It's not really made clear in that paper, but that approximation is actually the "with replacement" approximation, but starting from different initial assumptions to give a formula that has better boundary conditions and special-case behaviour, as described below. For the record, here's a quick analysis of how the 2 formulae come about: Assume there are: N rows in the table n distinct values (n <= N) p rows are chosen at random (p <= N) so the problem is to estimate how many distinct values there will be in the p rows chosen. Both approaches work by first estimating the probability that some particular value x is *not* chosen. [1] starts from the assumption that each row of the table has a probability of 1/n of being equal to x. So the probability that x is not the first value chosen is P(not x, 1) = 1 - 1/n Then, if the value chosen first is replaced, the probability that x is not the second value chosen is the same. 
The "with replacement" approximation makes each choice an independent probability, so the overall probability that x is not in any of the p rows chosen is just the product of the p individual probabilities, which is just P(not x, p) = (1 - 1/n)^p Then of course the probability that x *is* chosen at least once in the p rows is P(x, p) = 1 - (1 - 1/n)^p Finally, since there are n possible values of x, and they're all equally likely in the table, the expected number of distinct values is D(p) = n * (1 - (1 - 1/n)^p) The flaw in this approach is that for large p the "with replacement" approximation gets worse and worse, and the probabilities P(x, p) systematically under-estimate the actual probability that x is chosen, which increases as more values not equal to x are chosen. By the time the last value is chosen P(x, p=N) should actually be 1. [2] starts from a different initial assumption (uniform distribution case) -- for any given value x, assume that the table contains N/n instances of x (ignoring the fact that that might not be an integer). Note that with this assumption there is guaranteed to be at least one instance of x, which is not the case with the above approach. Now consider the first instance of x in the table. If we choose p rows from the table, the probability that that first instance of x is not chosen is P(not x[1], p) = 1 - p / N Making the same "with replacement" simplification, the probability that the second instance of x is not chose
Re: [HACKERS] improving GROUP BY estimation
On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote: > > On Mar 3, 2016, at 11:27 AM, Alexander Korotkov > > wrote: > > > > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra > > wrote: > > So yes, each estimator works great for exactly the opposite cases. But > > notice that typically, the results of the new formula is much higher than > > the old one, sometimes by two orders of magnitude (and it shouldn't be > > difficult to construct examples of much higher differences). > > > > The table also includes the 'average' estimator you propose, but it's > > rather obvious that the result is always much closer to the new value, > > simply because > > > >(small number) + (huge number) > >-- > > 2 > > > > is always much closer to the huge number. We're usually quite happy > > when the estimates are within the same order of magnitude, so whether > > it's K or K/2 makes pretty much no difference. > > > > I believe that Mark means geometrical average, i.e. sqrt((small number) * > > (huge number)). Ah, OK. Haven't realized you meant geometric mean. With that it looks like this: 1) independent 10 50 100 50010005000 -- actual 919382962449944 10001 10001 current 10 50 102 51610184996 new 97340016382989799519951 average 49120253205520654847473 geom. mean99 447 807226031837051 2) dependent 10 50 100 50010005000 -- actual10 50 100 50010005000 current 10 53 105 50810165014 new 880410564729955 10018 10018 average 44520793288523155177516 geom. mean94 466 824224931907087 To better illustrate the differences, we may use the "ratio error" defined as err(a,b) = MAX(a/b, b/a) where 'a' is the actual value and 'b' is the estimate. That gives us: 1) independent 10 50 100 50010005000 -- current 91.9076.5861.2219.279.822.00 new1.06 1.04 1.02 1.001.011.01 average1.87 1.89 1.95 1.911.821.34 geom. mean 9.32 8.56 7.74 4.403.141.42 2) dependent 10 50 100 50010005000 -- current1.00 1.06 1.05 1.021.021.00 new 88.0082.1064.7219.91 10.022.00 average 44.5041.5832.8810.465.521.50 geom. mean 9.38 9.33 8.24 4.503.191.42 So the geometric mean seems to be working better than plain average. But I don't think we can simply conclude we should use the geometric mean of the estimates, as it assumes the risks of over- and under-estimating the cardinality are the same. Because they aren't. > > Yes, that is what I proposed upthread. I'm not wedded to that, though. > In general, I am with Tomas on this one, believing that his estimate > will be much better than the current estimate. But I believe the *best* > estimate will be somewhere between his and the current one, and I'm > fishing for any decent way of coming up with a weighted average that > is closer to his than to the current, but not simply equal to his proposal. > > The reason I want the formula to be closer to Tomas's than to the > current is that I think that on average, across all tables, across all > databases, in practice it will be closer to the right estimate than the > current formula. That's just my intuition, and so I can't defend it. > But if my intuition is right, the best formula we can adopt would be one > that is moderated from his by a little bit, and in the direction of the > estimate that the current code generates. I think looking merely at what fraction of data sets (or queries) uses dependent GROUP BY and WHERE clauses is not sufficient. The general behavior of the two GROUP BY estimators is that the current one tends to under-estimate, while the new one tends to over-estimate. 
(It's not difficult to construct counter-examples by using more complex dependencies between the columns etc. but let's ignore that.)

The risk associated with over-estimation is that we may switch from HashAggregate to GroupAggregate, and generally select paths better suited for large number of rows. Not great, because the plan may not be optimal and thus run much slower - but that usually only happens on small tables, because on large tables we would probably end up using those same paths anyway.

With under-estimation, the main risks are much graver, as we may end up using HashAggregate only to get killed by OOM. When we're lucky not to hit that, my experience this often leads to a cascade of NestedLoop nodes processing orders of magnitude more tuples than expected. Awful.

So I think the risk is asymmetric, and that's why I like the new estimator more, because it tends to over-estimate, but in a very reasonable way.
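The "ratio error" and the geometric mean used in the tables above are easy to reproduce. Here is a throwaway C sketch (not from the thread; the helper and variable names are invented) that checks the first column of the "independent" table, where the actual number of groups is 919 and the two estimates are 10 and 973:

    #include <math.h>
    #include <stdio.h>

    /* err(a,b) = MAX(a/b, b/a), as defined in the message above */
    static double
    ratio_error(double actual, double estimate)
    {
        return fmax(actual / estimate, estimate / actual);
    }

    int
    main(void)
    {
        double  actual = 919.0;
        double  current_est = 10.0;
        double  new_est = 973.0;
        double  geom_mean = sqrt(current_est * new_est);    /* ~99, as in the table */

        printf("current:   err = %.2f\n", ratio_error(actual, current_est)); /* 91.90 */
        printf("new:       err = %.2f\n", ratio_error(actual, new_est));     /* 1.06 */
        printf("geom.mean: %.0f, err = %.2f\n", geom_mean,
               ratio_error(actual, geom_mean));                              /* 99, 9.32 */
        return 0;
    }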
Re: [HACKERS] improving GROUP BY estimation
> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov > wrote: > > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra > wrote: > So yes, each estimator works great for exactly the opposite cases. But notice > that typically, the results of the new formula is much higher than the old > one, sometimes by two orders of magnitude (and it shouldn't be difficult to > construct examples of much higher differences). > > The table also includes the 'average' estimator you propose, but it's rather > obvious that the result is always much closer to the new value, simply because > >(small number) + (huge number) >-- > 2 > > is always much closer to the huge number. We're usually quite happy when the > estimates are within the same order of magnitude, so whether it's K or K/2 > makes pretty much no difference. > > I believe that Mark means geometrical average, i.e. sqrt((small number) * > (huge number)). Yes, that is what I proposed upthread. I'm not wedded to that, though. In general, I am with Tomas on this one, believing that his estimate will be much better than the current estimate. But I believe the *best* estimate will be somewhere between his and the current one, and I'm fishing for any decent way of coming up with a weighted average that is closer to his than to the current, but not simply equal to his proposal. The reason I want the formula to be closer to Tomas's than to the current is that I think that on average, across all tables, across all databases, in practice it will be closer to the right estimate than the current formula. That's just my intuition, and so I can't defend it. But if my intuition is right, the best formula we can adopt would be one that is moderated from his by a little bit, and in the direction of the estimate that the current code generates. I could easily lose this debate merely for lack of a principled basis for saying how far toward the current estimate the new estimate should be adjusted. The geometric average is one suggestion, but I don't have a principled argument for it. Like I said above, I'm fishing for a decent formula here. Mark Dilger -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra wrote: > So yes, each estimator works great for exactly the opposite cases. But > notice that typically, the results of the new formula is much higher than > the old one, sometimes by two orders of magnitude (and it shouldn't be > difficult to construct examples of much higher differences). > > The table also includes the 'average' estimator you propose, but it's > rather obvious that the result is always much closer to the new value, > simply because > >(small number) + (huge number) >-- > 2 > > is always much closer to the huge number. We're usually quite happy when > the estimates are within the same order of magnitude, so whether it's K or > K/2 makes pretty much no difference. I believe that Mark means geometrical average, i.e. sqrt((small number) * (huge number)). -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] improving GROUP BY estimation
Hi, On 03/03/2016 06:40 PM, Mark Dilger wrote: On Mar 3, 2016, at 8:35 AM, Tomas Vondra wrote: Hi, On 03/03/2016 12:53 PM, Alexander Korotkov wrote: Hi, Tomas! I've assigned to review this patch. I've checked version estimate-num-groups-v2.txt by Mark Dilger. It applies to head cleanly, passes corrected regression tests. About correlated/uncorrelated cases. I think now optimizer mostly assumes all distributions to be independent. I think we should follow this assumption in this case also until we have fundamentally better option (like your multivariate statistics). @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows, reldistinct = clamp; /* -* Multiply by restriction selectivity. +* Estimate number of distinct values expected in given number of rows. */ -reldistinct *= rel->rows / rel->tuples; +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows)); /* * Update estimate of total distinct groups. I think we need way more explanation in comments here (possibly with external links). For now, it looks like formula which appears from nowhere. I thought the details (particularly the link to stackexchange, discussing the formula) would be enough, but let me elaborate. The current formula reldistinct *= rel->rows / rel->tuples; simply multiplies the ndistinct estimate with selectivity. So when you select 1% of the table, the estimate says we'll see 1% of ndistinct values. But clearly that's naive, because for example when you have 10k distinct values and you select 10M rows, you should expect nearly all of them in the sample. The current formula assumes total dependence between the columns. You can create tables where this is true, and the formula gives precisely the right estimate. I'm not claiming this is common, or that it should be the default assumption, but it is one end of the spectrum of possible correct estimates. I'm not really sure what you mean by dependence between columns, as both the old and new formula only works with total cardinality and selectivity, and has absolutely no idea about multiple columns. In case you mean dependence between columns in the GROUP BY and columns used to compute the selectivity (i.e. referenced in WHERE), then perhaps you could say it like that, although I think there's no such explicit assumption. And that's what the formula does - it gives you the expected number of distinct values, when selecting 'k' values from 'd' distinct values with replacements: count(k, d) = d * (1 - ((d-1)/d) ^ k) It's assuming the distinct values are equally probable (uniform distribution) and that the probabilities do not change (that's the "with replacement"). The new formula assumes total independence between the columns. You can likewise create tables where this is true, and you did so upthread, and for those tables it gives precisely the right estimate. This is the other end of the spectrum of possible correct estimates. In the absence of multivariate statistics, either because you haven't committed that work yet, or because the multivariate data the planner needs hasn't been collected, choosing an estimate exactly in the middle of the spectrum (whatever that would mean mathematically) would minimize the maximum possible error in the estimate. FWIW the current version of multivariate statistics can't really help in this case. Perhaps it will help in the future, but it's way more complicated that it might seem as it requires transferring information from the WHERE clause to the GROUP BY clause. 
I have the sense that my point of view is in the minority, because the expectation is the problems caused by independence assumptions will only be fixed when multivariate statistics are implemented and available, and so we should just keep the independence assumptions everywhere until that time. I tend to disagree with that, on the grounds that even when that work is finished, we'll never have complete coverage of every possible set of columns and what their degree of dependence is. Well, in a sense you're right that those estimates work best in different situations. The problem with constructing an estimator the way you propose (simply returning an average) is that in both the extreme cases (perfect dependence or independence) one of those estimates is really bad. Moreover the new formula typically produces higher values than the old one, Consider for example this: CREATE TABLE t AS SELECT (1 * random())::int a, (1 * random())::int b FROM generate_series(1,100) s(i); and let's see estimates for queries like this: SELECT a FROM t WHERE b < $1 GROUP BY a; Then for different values of $1 we get this: | 10 | 50 | 100 | 500 | 1000 | 5000 actual | 919 | 3829 | 6244 | 9944 | 10001 | 10001 current | 10 | 5
Re: [HACKERS] improving GROUP BY estimation
On Thu, Mar 3, 2016 at 7:35 PM, Tomas Vondra wrote: > On 03/03/2016 12:53 PM, Alexander Korotkov wrote: > >> I've assigned to review this patch. >> >> I've checked version estimate-num-groups-v2.txt by Mark Dilger. >> It applies to head cleanly, passes corrected regression tests. >> >> About correlated/uncorrelated cases. I think now optimizer mostly assumes >> all distributions to be independent. I think we should follow this >> assumption in this case also until we have fundamentally better option >> (like >> your multivariate statistics). >> >> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List >> *groupExprs, >> double input_rows, >> reldistinct = clamp; >> >> /* >> -* Multiply by restriction selectivity. >> +* Estimate number of distinct values expected in given number of rows. >> */ >> -reldistinct *= rel->rows / rel->tuples; >> +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows)); >> >> /* >> * Update estimate of total distinct groups. >> >> I think we need way more explanation in comments here (possibly with >> external links). For now, it looks like formula which appears from >> nowhere. >> > > I thought the details (particularly the link to stackexchange, discussing > the formula) would be enough, but let me elaborate. > > The current formula > > reldistinct *= rel->rows / rel->tuples; > > simply multiplies the ndistinct estimate with selectivity. So when you > select 1% of the table, the estimate says we'll see 1% of ndistinct values. > But clearly that's naive, because for example when you have 10k distinct > values and you select 10M rows, you should expect nearly all of them in the > sample. > > And that's what the formula does - it gives you the expected number of > distinct values, when selecting 'k' values from 'd' distinct values with > replacements: > > count(k, d) = d * (1 - ((d-1)/d) ^ k) > > It's assuming the distinct values are equally probable (uniform > distribution) and that the probabilities do not change (that's the "with > replacement"). > > I guess we could relax those assumptions and for example use the MCV > statistics to further improve the estimate, and also relax the 'with > replacement' but that'd make the formula way more complicated. > > [1] > http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers Your explanation in the first mail was good enough. Not it's even better. But actually I mean that we need to include some brief explanation into source code (either comments or readme). It would be good if one who want to understand this could do without searching mailing list archives or git history. >> Also, note that rel->tuples gone away from formula parameters. So, we >> actually don't care about how may tuples are in the table. This is because >> this formula assumes that same tuple could be selected multiple times. For >> low numbers this can lead to some errors. Consider selecting 2 from 3 >> distinct tuples. If you can't select same tuple twice then you always >> selecting 2 distinct tuples. But according to formula you will select 5/3 >> in >> average. I think this error in small numbers is negligible, especially >> because getting rid of this error would require way more computations. But >> it worth mentioning in comments though. >> > > Well, yeah. That's the consequence of 'with replacement' assumption. > > I guess we could handle this somehow, though. 
For example we can compute > the minimum possible number of distinct values like this: > > average_rows_per_value = (tuples / ndistinct); > min_estimate = ceil(rows / average_rows_per_value); > > and use that as a minimum for the estimate. In the example you've given > this would say > > average_rows_per_value = (3 / 3) = 1; > min_estimate = ceil(2 / 1) = 2; > > So it serves as a correction for the 'with replacement' assumption. With > only 2 distinct values we'd get this: > > average_rows_per_value = (3 / 2) = 1.5; > min_estimate = ceil(2 / 1.5) = 2; > > Of course, it's just a heuristics, so this may fail in some other cases. I'm not sure we actually need it. Even in worst case error doesn't seems to be very big. But feel free to add this heuristics, it looks OK for me. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
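A minimal sketch of the lower-bound heuristic quoted above, for anyone who wants to experiment with it (the idea was only floated in the discussion, not committed, and the function name is invented):

    #include <math.h>
    #include <stdio.h>

    /*
     * Under the uniform-distribution assumption, each of the ndistinct values
     * covers tuples/ndistinct rows, so selecting 'rows' rows must touch at
     * least ceil(rows / (tuples/ndistinct)) distinct values.
     */
    static double
    min_distinct_estimate(double tuples, double ndistinct, double rows)
    {
        double  average_rows_per_value = tuples / ndistinct;

        return ceil(rows / average_rows_per_value);
    }

    int
    main(void)
    {
        /* the examples from the message: selecting 2 of 3 rows */
        printf("3 distinct values in 3 rows, pick 2 -> at least %.0f\n",
               min_distinct_estimate(3, 3, 2));     /* 2 */
        printf("2 distinct values in 3 rows, pick 2 -> at least %.0f\n",
               min_distinct_estimate(3, 2, 2));     /* 2 */
        return 0;
    }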
Re: [HACKERS] improving GROUP BY estimation
> On Mar 3, 2016, at 8:35 AM, Tomas Vondra wrote: > > Hi, > > On 03/03/2016 12:53 PM, Alexander Korotkov wrote: >> Hi, Tomas! >> >> I've assigned to review this patch. >> >> I've checked version estimate-num-groups-v2.txt by Mark Dilger. >> It applies to head cleanly, passes corrected regression tests. >> >> About correlated/uncorrelated cases. I think now optimizer mostly assumes >> all distributions to be independent. I think we should follow this >> assumption in this case also until we have fundamentally better option (like >> your multivariate statistics). >> >> @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List >> *groupExprs, >> double input_rows, >> reldistinct = clamp; >> >> /* >> -* Multiply by restriction selectivity. >> +* Estimate number of distinct values expected in given number of rows. >> */ >> -reldistinct *= rel->rows / rel->tuples; >> +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows)); >> >> /* >> * Update estimate of total distinct groups. >> >> I think we need way more explanation in comments here (possibly with >> external links). For now, it looks like formula which appears from nowhere. > > I thought the details (particularly the link to stackexchange, discussing the > formula) would be enough, but let me elaborate. > > The current formula > >reldistinct *= rel->rows / rel->tuples; > > simply multiplies the ndistinct estimate with selectivity. So when you select > 1% of the table, the estimate says we'll see 1% of ndistinct values. But > clearly that's naive, because for example when you have 10k distinct values > and you select 10M rows, you should expect nearly all of them in the sample. The current formula assumes total dependence between the columns. You can create tables where this is true, and the formula gives precisely the right estimate. I'm not claiming this is common, or that it should be the default assumption, but it is one end of the spectrum of possible correct estimates. > And that's what the formula does - it gives you the expected number of > distinct values, when selecting 'k' values from 'd' distinct values with > replacements: > >count(k, d) = d * (1 - ((d-1)/d) ^ k) > > It's assuming the distinct values are equally probable (uniform distribution) > and that the probabilities do not change (that's the "with replacement"). The new formula assumes total independence between the columns. You can likewise create tables where this is true, and you did so upthread, and for those tables it gives precisely the right estimate. This is the other end of the spectrum of possible correct estimates. In the absence of multivariate statistics, either because you haven't committed that work yet, or because the multivariate data the planner needs hasn't been collected, choosing an estimate exactly in the middle of the spectrum (whatever that would mean mathematically) would minimize the maximum possible error in the estimate. I have the sense that my point of view is in the minority, because the expectation is the problems caused by independence assumptions will only be fixed when multivariate statistics are implemented and available, and so we should just keep the independence assumptions everywhere until that time. I tend to disagree with that, on the grounds that even when that work is finished, we'll never have complete coverage of every possible set of columns and what their degree of dependence is. Perhaps I am badly mistaken. Can somebody more familiar with the issues than I am please disabuse me of my wrongheadedness? 
Mark Dilger -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] improving GROUP BY estimation
Hi, On 03/03/2016 12:53 PM, Alexander Korotkov wrote: Hi, Tomas! I've assigned to review this patch. I've checked version estimate-num-groups-v2.txt by Mark Dilger. It applies to head cleanly, passes corrected regression tests. About correlated/uncorrelated cases. I think now optimizer mostly assumes all distributions to be independent. I think we should follow this assumption in this case also until we have fundamentally better option (like your multivariate statistics). @@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows, reldistinct = clamp; /* -* Multiply by restriction selectivity. +* Estimate number of distinct values expected in given number of rows. */ -reldistinct *= rel->rows / rel->tuples; +reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows)); /* * Update estimate of total distinct groups. I think we need way more explanation in comments here (possibly with external links). For now, it looks like formula which appears from nowhere. I thought the details (particularly the link to stackexchange, discussing the formula) would be enough, but let me elaborate. The current formula reldistinct *= rel->rows / rel->tuples; simply multiplies the ndistinct estimate with selectivity. So when you select 1% of the table, the estimate says we'll see 1% of ndistinct values. But clearly that's naive, because for example when you have 10k distinct values and you select 10M rows, you should expect nearly all of them in the sample. And that's what the formula does - it gives you the expected number of distinct values, when selecting 'k' values from 'd' distinct values with replacements: count(k, d) = d * (1 - ((d-1)/d) ^ k) It's assuming the distinct values are equally probable (uniform distribution) and that the probabilities do not change (that's the "with replacement"). I guess we could relax those assumptions and for example use the MCV statistics to further improve the estimate, and also relax the 'with replacement' but that'd make the formula way more complicated. [1] http://math.stackexchange.com/questions/72223/finding-expected-number-of-distinct-values-selected-from-a-set-of-integers Also, note that rel->tuples gone away from formula parameters. So, we actually don't care about how may tuples are in the table. This is because this formula assumes that same tuple could be selected multiple times. For low numbers this can lead to some errors. Consider selecting 2 from 3 distinct tuples. If you can't select same tuple twice then you always selecting 2 distinct tuples. But according to formula you will select 5/3 in average. I think this error in small numbers is negligible, especially because getting rid of this error would require way more computations. But it worth mentioning in comments though. Well, yeah. That's the consequence of 'with replacement' assumption. I guess we could handle this somehow, though. For example we can compute the minimum possible number of distinct values like this: average_rows_per_value = (tuples / ndistinct); min_estimate = ceil(rows / average_rows_per_value); and use that as a minimum for the estimate. In the example you've given this would say average_rows_per_value = (3 / 3) = 1; min_estimate = ceil(2 / 1) = 2; So it serves as a correction for the 'with replacement' assumption. With only 2 distinct values we'd get this: average_rows_per_value = (3 / 2) = 1.5; min_estimate = ceil(2 / 1.5) = 2; Of course, it's just a heuristics, so this may fail in some other cases. 
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] improving GROUP BY estimation
Hi, Tomas!

I've been assigned to review this patch. I've checked version
estimate-num-groups-v2.txt by Mark Dilger. It applies to head cleanly and
passes the corrected regression tests.

About correlated/uncorrelated cases: I think the optimizer currently assumes
all distributions to be independent. I think we should follow that
assumption in this case as well, until we have a fundamentally better option
(like your multivariate statistics).

@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
         reldistinct = clamp;

         /*
-         * Multiply by restriction selectivity.
+         * Estimate number of distinct values expected in given number of rows.
          */
-        reldistinct *= rel->rows / rel->tuples;
+        reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));

         /*
          * Update estimate of total distinct groups.

I think we need way more explanation in the comments here (possibly with
external links). For now, it looks like a formula that appears out of
nowhere.

Also, note that rel->tuples has gone away from the formula parameters, so we
actually don't care how many tuples are in the table. This is because the
formula assumes that the same tuple can be selected multiple times. For low
numbers this can lead to some errors. Consider selecting 2 from 3 distinct
tuples. If you can't select the same tuple twice, you always select 2
distinct tuples, but according to the formula you select 5/3 on average. I
think this error at small numbers is negligible, especially because getting
rid of it would require way more computation. But it's worth mentioning in
the comments, though.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
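To make the 2-from-3 example concrete (the arithmetic below is mine, not
part of the review), the with-replacement formula from the patch gives

    count(2, 3) = 3 * (1 - (2/3)^2) = 3 * 5/9 = 5/3 ~= 1.67

whereas drawing 2 of the 3 tuples without replacement always yields exactly
2 distinct values, since each tuple carries a different value. That gap of
about a third of a group is the small-sample error being described here, and
it shrinks quickly as the number of selected rows and distinct values grows.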
Re: [HACKERS] improving GROUP BY estimation
Hi,

On 02/26/2016 04:32 AM, Mark Dilger wrote:
>> On Feb 25, 2016, at 4:59 PM, Tomas Vondra wrote:
>>
>> ...
>>
>> The culprit here is that the two columns are not independent, but are
>> (rather strongly) correlated due to the way you've generated the data.
>
> Yes, that was intentional. Your formula is exactly correct, so far as I
> can tell, for completely uncorrelated data. I don't have any tables with
> completely uncorrelated data, and was therefore interested in what
> happens when the data is correlated and your patch is applied.
>
> BTW, the whole reason I responded to your post is that I think I would
> like to have your changes in the code base. I'm just playing Devil's
> Advocate here, to see if there is room for any improvement.

Sure, that's how I understood it. I just wanted to point out the
correlation, as that might not have been obvious to others.

> Thanks for the patch submission. I hope my effort to review it is on the
> whole more helpful than harmful.

Thanks for the feedback!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] improving GROUP BY estimation
> On Feb 25, 2016, at 4:59 PM, Tomas Vondra wrote:
>
> Hi,
>
> On 02/26/2016 12:16 AM, Mark Dilger wrote:
>>
>> It is not hard to write test cases where your patched version
>> overestimates the number of rows by a very similar factor as the old
>> code underestimates them. My very first test, which was not
>> specifically designed to demonstrate this, happens to be one such
>> example:
>>
>> CREATE TABLE t (a INT, b int);
>> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,1000) gs;
>> ANALYZE t;
>> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>>                              QUERY PLAN
>> ---------------------------------------------------------------------
>>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>>    Group Key: a
>>    ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>>          Filter: (b < 1000)
>> (4 rows)
>>
>> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>>  count
>> -------
>>     32
>> (1 row)
>>
>> So, it estimates 850 rows where only 32 are returned. Without applying
>> your patch, it estimates just 1 row where 32 are returned. That's an
>> overestimate of roughly 26 times, rather than an underestimate of 32
>> times.
>
> The culprit here is that the two columns are not independent, but are
> (rather strongly) correlated due to the way you've generated the data.

Yes, that was intentional. Your formula is exactly correct, so far as I can
tell, for completely uncorrelated data. I don't have any tables with
completely uncorrelated data, and was therefore interested in what happens
when the data is correlated and your patch is applied.

BTW, the whole reason I responded to your post is that I think I would like
to have your changes in the code base. I'm just playing Devil's Advocate
here, to see if there is room for any improvement.

> In cases like this (with correlated columns), it's mostly just luck how
> accurate estimates you get, no matter which of the estimators you use.
> It's pretty easy to generate arbitrary errors by breaking the
> independence assumption, and it's not just this particular place of the
> planner. And it won't change until we add some sort of smartness about
> dependencies between columns.
>
> I think over-estimates are a bit more acceptable in this case, e.g.
> because of the HashAggregate OOM risk. Also, I've seen too many nested
> loop cascades due to under-estimates recently, but that's a bit
> unrelated.

I have seen similar problems in systems I maintain, hence my interest in
your patch.

>> As a patch review, I'd say that your patch does what you claim it does,
>> and it applies cleanly, and passes the regression tests with my
>> included modifications. I think there needs to be some discussion on
>> the list about whether the patch is a good idea.
>
> Yes, that's why I haven't bothered with fixing the regression tests in
> the patch, actually.

Right, but for the group as a whole, I thought it might generate some
feedback if people saw what else changed in the regression results. If you
look, the changes have to do with plans chosen and row estimates -- exactly
the sort of stuff you would expect to change.

Thanks for the patch submission. I hope my effort to review it is on the
whole more helpful than harmful.

Mark Dilger
Re: [HACKERS] improving GROUP BY estimation
Hi,

On 02/26/2016 12:16 AM, Mark Dilger wrote:
>
> It is not hard to write test cases where your patched version
> overestimates the number of rows by a very similar factor as the old
> code underestimates them. My very first test, which was not specifically
> designed to demonstrate this, happens to be one such example:
>
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,1000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>    Group Key: a
>    ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>          Filter: (b < 1000)
> (4 rows)
>
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>  count
> -------
>     32
> (1 row)
>
> So, it estimates 850 rows where only 32 are returned. Without applying
> your patch, it estimates just 1 row where 32 are returned. That's an
> overestimate of roughly 26 times, rather than an underestimate of 32
> times.

The culprit here is that the two columns are not independent, but are
(rather strongly) correlated due to the way you've generated the data.

In cases like this (with correlated columns), it's mostly just luck how
accurate estimates you get, no matter which of the estimators you use. It's
pretty easy to generate arbitrary errors by breaking the independence
assumption, and it's not just this particular place of the planner. And it
won't change until we add some sort of smartness about dependencies between
columns.

I think over-estimates are a bit more acceptable in this case, e.g. because
of the HashAggregate OOM risk. Also, I've seen too many nested loop
cascades due to under-estimates recently, but that's a bit unrelated.

> As a patch review, I'd say that your patch does what you claim it does,
> and it applies cleanly, and passes the regression tests with my included
> modifications. I think there needs to be some discussion on the list
> about whether the patch is a good idea.

Yes, that's why I haven't bothered with fixing the regression tests in the
patch, actually.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] improving GROUP BY estimation
> On Feb 25, 2016, at 3:16 PM, Mark Dilger wrote:
>
>> On Feb 23, 2016, at 5:12 PM, Tomas Vondra wrote:
>>
>> So much better. Clearly, there are cases where this will over-estimate
>> the cardinality - for example when the values are somehow correlated.
>
> I applied your patch, which caused a few regression tests to fail.
> Attached is a patch that includes the necessary changes to the expected
> test results.
>
> It is not hard to write test cases where your patched version
> overestimates the number of rows by a very similar factor as the old
> code underestimates them. My very first test, which was not specifically
> designed to demonstrate this, happens to be one such example:
>
> CREATE TABLE t (a INT, b int);
> INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,1000) gs;
> ANALYZE t;
> EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
>                              QUERY PLAN
> ---------------------------------------------------------------------
>  HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
>    Group Key: a
>    ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
>          Filter: (b < 1000)
> (4 rows)
>
> SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
>  count
> -------
>     32
> (1 row)
>
> So, it estimates 850 rows where only 32 are returned. Without applying
> your patch, it estimates just 1 row where 32 are returned. That's an
> overestimate of roughly 26 times, rather than an underestimate of 32
> times.
>
> As a patch review, I'd say that your patch does what you claim it does,
> and it applies cleanly, and passes the regression tests with my included
> modifications. I think there needs to be some discussion on the list
> about whether the patch is a good idea.
>
> Mark Dilger

Assuming the goal is to minimize the degree to which the estimates are
inaccurate, I get better results by a kind of averaging of the old values
from the current code base and the new values from Tomas's patch. I tried
this, and at least for the few examples we've been throwing around, I found
the results in the worst-case examples were not as wildly inaccurate as in
either of the other two approaches:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..c83135a 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3414,6 +3414,8 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		Assert(rel->reloptkind == RELOPT_BASEREL);
 		if (rel->tuples > 0)
 		{
+			double		old_factor;	/* The way it is currently done on master */
+			double		new_factor;	/* The way Tomas Vondra proposes doing it */
 			/*
 			 * Clamp to size of rel, or size of rel / 10 if multiple Vars. The
 			 * fudge factor is because the Vars are probably correlated but we
@@ -3440,7 +3442,10 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			/*
 			 * Multiply by restriction selectivity.
 			 */
-			reldistinct *= rel->rows / rel->tuples;
+			old_factor = rel->rows / rel->tuples;	/* old way */
+			new_factor = (1 - powl((reldistinct - 1) / reldistinct, rel->rows));	/* Tomas's way */
+
+			reldistinct *= sqrt(old_factor * new_factor);	/* average of old way and new way, sort of */
 
 			/*
 			 * Update estimate of total distinct groups.
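For what it's worth, here is a small standalone C sketch (my code, not
planner code; compile with -lm) comparing the three selectivity factors on
statistics loosely resembling the test table above. The tuple count and
ndistinct values are assumptions, since the exact figures depend on what
ANALYZE actually records:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double tuples = 10000000.0;   /* assumed table size */
    double reldistinct = 3163.0;  /* assumed ndistinct of column a */
    double rows = 1000.0;         /* rows surviving the b < 1000 filter */

    double old_factor = rows / tuples;                                    /* master */
    double new_factor = 1 - powl((reldistinct - 1) / reldistinct, rows);  /* patched */
    double avg_factor = sqrt(old_factor * new_factor);                    /* geometric mean */

    printf("old estimate: %8.1f\n", reldistinct * old_factor);  /* well below 1, reported as 1 */
    printf("new estimate: %8.1f\n", reldistinct * new_factor);  /* in the hundreds */
    printf("avg estimate: %8.1f\n", reldistinct * avg_factor);  /* somewhere in between */

    return 0;
}

With these assumed statistics the old factor drives the estimate below 1,
the new factor lands in the hundreds, and the geometric mean ends up within
roughly a factor of two of the 32 groups the query actually returned --
consistent with the observation above that the compromise is less wildly
wrong in the worst cases, though the exact numbers depend on the real
statistics.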
Re: [HACKERS] improving GROUP BY estimation
> On Feb 23, 2016, at 5:12 PM, Tomas Vondra wrote:
>
> So much better. Clearly, there are cases where this will over-estimate
> the cardinality - for example when the values are somehow correlated.

I applied your patch, which caused a few regression tests to fail. Attached
is a patch that includes the necessary changes to the expected test results.

It is not hard to write test cases where your patched version overestimates
the number of rows by a very similar factor as the old code underestimates
them. My very first test, which was not specifically designed to demonstrate
this, happens to be one such example:

CREATE TABLE t (a INT, b int);
INSERT INTO t SELECT sqrt(gs)::int, gs FROM generate_series(1,1000) gs;
ANALYZE t;
EXPLAIN SELECT a FROM t WHERE b < 1000 GROUP BY a;
                             QUERY PLAN
---------------------------------------------------------------------
 HashAggregate  (cost=169250.21..169258.71 rows=850 width=4)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..169247.71 rows=1000 width=4)
         Filter: (b < 1000)
(4 rows)

SELECT COUNT(*) FROM (SELECT a FROM t WHERE b < 1000 GROUP BY a) AS ss;
 count
-------
    32
(1 row)

So, it estimates 850 rows where only 32 are returned. Without applying your
patch, it estimates just 1 row where 32 are returned. That's an overestimate
of roughly 26 times, rather than an underestimate of 32 times.

As a patch review, I'd say that your patch does what you claim it does, and
it applies cleanly, and passes the regression tests with my included
modifications. I think there needs to be some discussion on the list about
whether the patch is a good idea.

Mark Dilger

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 46c95b0..82a7f7f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3438,9 +3438,9 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		reldistinct = clamp;
 
 		/*
-		 * Multiply by restriction selectivity.
+		 * Estimate number of distinct values expected in given number of rows.
 		 */
-		reldistinct *= rel->rows / rel->tuples;
+		reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, rel->rows));
 
 		/*
 		 * Update estimate of total distinct groups.
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 59d7877..d9dd5ca 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -3951,17 +3951,17 @@ select d.* from d left join (select * from b group by b.id, b.c_id) s on d.a = s.id;
              QUERY PLAN
 --------------------------------------
- Merge Left Join
-   Merge Cond: (d.a = s.id)
-   ->  Sort
-         Sort Key: d.a
-         ->  Seq Scan on d
+ Merge Right Join
+   Merge Cond: (s.id = d.a)
    ->  Sort
          Sort Key: s.id
          ->  Subquery Scan on s
                ->  HashAggregate
                      Group Key: b.id
                      ->  Seq Scan on b
+   ->  Sort
+         Sort Key: d.a
+         ->  Seq Scan on d
 (11 rows)
 
 -- similarly, but keying off a DISTINCT clause
@@ -3970,17 +3970,17 @@ select d.* from d left join (select distinct * from b) s on d.a = s.id;
              QUERY PLAN
 --------------------------------------
- Merge Left Join
-   Merge Cond: (d.a = s.id)
-   ->  Sort
-         Sort Key: d.a
-         ->  Seq Scan on d
+ Merge Right Join
+   Merge Cond: (s.id = d.a)
    ->  Sort
          Sort Key: s.id
          ->  Subquery Scan on s
                ->  HashAggregate
                      Group Key: b.id, b.c_id
                      ->  Seq Scan on b
+   ->  Sort
+         Sort Key: d.a
+         ->  Seq Scan on d
 (11 rows)
 
 -- check join removal works when uniqueness of the join condition is enforced
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index de64ca7..0fc93d9 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -807,27 +807,24 @@ select * from int4_tbl where
 explain (verbose, costs off)
 select * from int4_tbl o where (f1, f1) in
   (select f1, generate_series(1,2) / 10 g from int4_tbl i group by f1);
-                  QUERY PLAN
-----------------------------------------------
- Hash Join
+                    QUERY PLAN
+------------------------------------------------
+ Hash Semi Join
   Output: o.f1
   Hash Cond: (o.f1 = "ANY_subquery".f1)
   ->  Seq Scan on public.int4_tbl o
         Output: o.f1
   ->  Hash
         Output: "ANY_subquery".f1, "ANY_subquery".g
-