Re: [HACKERS] improving GROUP BY estimation

2016-04-02 Thread Tom Lane
Dean Rasheed writes: > Here's an updated patch with references to both papers, and a more > detailed justification for the formula, along with the other changes > discussed. Note that although equation (2) in the Dell'Era paper looks > different from the Yao formula,

Re: [HACKERS] improving GROUP BY estimation

2016-04-02 Thread Dean Rasheed
On 31 March 2016 at 22:02, Tom Lane wrote: > I'm just concerned about what happens when the Dellera paper stops being > available. I don't mind including that URL as a backup to the written-out > argument I just suggested. > Here's an updated patch with references to both

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Alvaro Herrera writes: > Tom Lane wrote: >> See "Approximating block accesses in database organizations", S. B. Yao, >> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261 > That sounds nice all right, but I'm not sure it's actually helpful, > because

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Dean Rasheed writes: > Yeah, that makes sense. In fact, if we only apply the adjustment when > reldistinct > 0 and rel->rows < rel->tuples, and rewrite the first > argument to pow() as (rel->tuples - rel->rows) / rel->tuples, then it > is guaranteed to be non-negative.

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Dean Rasheed writes: > On 31 March 2016 at 21:40, Tom Lane wrote: >> Let's use something like this: >> See "Approximating block accesses in database organizations", S. B. Yao, >> Communications of the ACM, Volume 20 Issue 4, April 1977 Pages 260-261

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Alvaro Herrera
Tom Lane wrote: > > The article text refers to this 1977 S. B. Yao paper "Approximating > > block accesses in database organizations" which doesn't appear to be > > available online, except behind ACM's paywall at > > http://dl.acm.org/citation.cfm?id=359475 > > Well, a CACM citation is

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Dean Rasheed
On 31 March 2016 at 21:40, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> Another minor gripe is the use of a random URL as justification. This >>> code will still be around when that URL exists nowhere but the Wayback >>> Machine.

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Dean Rasheed
On 31 March 2016 at 20:18, Tom Lane wrote: > Also, I wonder if it'd be a good idea to provide a guard against division > by zero --- we know rel->tuples > 0 at this point, but I'm less sure that > reldistinct can't be zero. In the same vein, I'm worried about the first >

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Alvaro Herrera writes: > Tom Lane wrote: >> Another minor gripe is the use of a random URL as justification. This >> code will still be around when that URL exists nowhere but the Wayback >> Machine. Can't we find a more formal citation to use? > The article text

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Alvaro Herrera
Tom Lane wrote: > Another minor gripe is the use of a random URL as justification. This > code will still be around when that URL exists nowhere but the Wayback > Machine. Can't we find a more formal citation to use? The article text refers to this 1977 S. B. Yao paper "Approximating block

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Tom Lane
Dean Rasheed writes: > On 30 March 2016 at 14:03, Tomas Vondra wrote: >> Attached is v4 of the patch > Thanks, I think this is good to go, except that I think we need to use > pow() rather than powl() because AIUI powl() is new to C99, and

Re: [HACKERS] improving GROUP BY estimation

2016-03-31 Thread Dean Rasheed
On 30 March 2016 at 14:03, Tomas Vondra wrote: > Attached is v4 of the patch Thanks, I think this is good to go, except that I think we need to use pow() rather than powl() because AIUI powl() is new to C99, and so won't necessarily be available on all supported

Re: [HACKERS] improving GROUP BY estimation

2016-03-30 Thread Tomas Vondra
Hi, On 03/22/2016 03:40 PM, Alexander Korotkov wrote: I think you should send a revision of patch including comments proposed by Deam Rasheed. I'm switching patch status to waiting on author in commitfest. Attached is v4 of the patch - the only difference w.r.t. v3 is that I've used the

Re: [HACKERS] improving GROUP BY estimation

2016-03-29 Thread David Steele
Hi Thomas, On 3/22/16 10:40 AM, Alexander Korotkov wrote: I think you should send a revision of patch including comments proposed by Deam Rasheed. I'm switching patch status to waiting on author in commitfest. We're getting pretty close to the end of the CF. Do you know when you will have

Re: [HACKERS] improving GROUP BY estimation

2016-03-22 Thread Alexander Korotkov
Hi, Tomas! On Mon, Mar 21, 2016 at 2:39 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed > wrote: > >> Probably a better URL to give is >> http://www.adellera.it/investigations/distinct_balls/ which has a link

Re: [HACKERS] improving GROUP BY estimation

2016-03-21 Thread Alexander Korotkov
Hi, Dean! On Fri, Mar 18, 2016 at 1:20 PM, Dean Rasheed wrote: > Probably a better URL to give is > http://www.adellera.it/investigations/distinct_balls/ which has a link > to the PDF version of the paper and also some supporting material. > > However, while that paper

Re: [HACKERS] improving GROUP BY estimation

2016-03-19 Thread Dean Rasheed
On 18 March 2016 at 00:37, Tomas Vondra wrote: >> On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote: >>> I think that a better formula to use would be >>> >>> reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples / >>> reldistinct) > > Attached is a v3

Re: [HACKERS] improving GROUP BY estimation

2016-03-19 Thread Tomas Vondra
On 03/13/2016 11:09 PM, Tomas Vondra wrote: Hi, On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote: On 4 March 2016 at 13:10, Tomas Vondra wrote: ... >>> I think that a better formula to use would be reldistinct *= (1 - powl(1 - rel-rows / rel->tuples,

Re: [HACKERS] improving GROUP BY estimation

2016-03-13 Thread Tomas Vondra
Hi, On Sun, 2016-03-13 at 15:24 +, Dean Rasheed wrote: > On 4 March 2016 at 13:10, Tomas Vondra > wrote: > > > > The risk associated with over-estimation is that we may switch from > > HashAggregate to GroupAggregate, and generally select paths better > >

Re: [HACKERS] improving GROUP BY estimation

2016-03-13 Thread Dean Rasheed
On 4 March 2016 at 13:10, Tomas Vondra wrote: > The risk associated with over-estimation is that we may switch from > HashAggregate to GroupAggregate, and generally select paths better > suited for large number of rows. Not great, because the plan may not be >

Re: [HACKERS] improving GROUP BY estimation

2016-03-04 Thread Tomas Vondra
On Thu, 2016-03-03 at 11:42 -0800, Mark Dilger wrote: > > On Mar 3, 2016, at 11:27 AM, Alexander Korotkov > > wrote: > > > > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra > > wrote: > > So yes, each estimator works great for exactly the

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Mark Dilger
> On Mar 3, 2016, at 11:27 AM, Alexander Korotkov > wrote: > > On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra > wrote: > So yes, each estimator works great for exactly the opposite cases. But notice > that typically, the results of the

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Alexander Korotkov
On Thu, Mar 3, 2016 at 10:16 PM, Tomas Vondra wrote: > So yes, each estimator works great for exactly the opposite cases. But > notice that typically, the results of the new formula is much higher than > the old one, sometimes by two orders of magnitude (and it

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Tomas Vondra
Hi, On 03/03/2016 06:40 PM, Mark Dilger wrote: On Mar 3, 2016, at 8:35 AM, Tomas Vondra wrote: Hi, On 03/03/2016 12:53 PM, Alexander Korotkov wrote: Hi, Tomas! I've assigned to review this patch. I've checked version estimate-num-groups-v2.txt by Mark

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Alexander Korotkov
On Thu, Mar 3, 2016 at 7:35 PM, Tomas Vondra wrote: > On 03/03/2016 12:53 PM, Alexander Korotkov wrote: > >> I've assigned to review this patch. >> >> I've checked version estimate-num-groups-v2.txt by Mark Dilger. >> It applies to head cleanly, passes corrected

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Mark Dilger
> On Mar 3, 2016, at 8:35 AM, Tomas Vondra wrote: > > Hi, > > On 03/03/2016 12:53 PM, Alexander Korotkov wrote: >> Hi, Tomas! >> >> I've assigned to review this patch. >> >> I've checked version estimate-num-groups-v2.txt by Mark Dilger. >> It applies to head

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Tomas Vondra
Hi, On 03/03/2016 12:53 PM, Alexander Korotkov wrote: Hi, Tomas! I've assigned to review this patch. I've checked version estimate-num-groups-v2.txt by Mark Dilger. It applies to head cleanly, passes corrected regression tests. About correlated/uncorrelated cases. I think now optimizer

Re: [HACKERS] improving GROUP BY estimation

2016-03-03 Thread Alexander Korotkov
Hi, Tomas! I've assigned to review this patch. I've checked version estimate-num-groups-v2.txt by Mark Dilger. It applies to head cleanly, passes corrected regression tests. About correlated/uncorrelated cases. I think now optimizer mostly assumes all distributions to be independent. I think we

Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Tomas Vondra
Hi, On 02/26/2016 04:32 AM, Mark Dilger wrote: On Feb 25, 2016, at 4:59 PM, Tomas Vondra wrote: ... The culprit here is that the two columns are not independent, but are (rather strongly) correlated due to the way you've generated the data. Yes, that was

Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Mark Dilger
> On Feb 25, 2016, at 4:59 PM, Tomas Vondra > wrote: > > Hi, > > On 02/26/2016 12:16 AM, Mark Dilger wrote: >> >> It is not hard to write test cases where your patched version overestimates >> the number of rows by a very similar factor as the old code

Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Tomas Vondra
Hi, On 02/26/2016 12:16 AM, Mark Dilger wrote: It is not hard to write test cases where your patched version overestimates the number of rows by a very similar factor as the old code underestimates them. My very first test, which was not specifically designed to demonstrate this, happens to

Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Mark Dilger
> On Feb 25, 2016, at 3:16 PM, Mark Dilger wrote: > > >> On Feb 23, 2016, at 5:12 PM, Tomas Vondra >> wrote: >> >> >> >> So much better. Clearly, there are cases where this will over-estimate the >> cardinality - for example when the

Re: [HACKERS] improving GROUP BY estimation

2016-02-25 Thread Mark Dilger
> On Feb 23, 2016, at 5:12 PM, Tomas Vondra > wrote: > > > > So much better. Clearly, there are cases where this will over-estimate the > cardinality - for example when the values are somehow correlated. > I applied your patch, which caused a few regression

[HACKERS] improving GROUP BY estimation

2016-02-23 Thread Tomas Vondra
Hi, while working of using correlation statistics when estimating GROUP BY (or generally whatever uses estimate_num_groups), I've noticed that we apply the selectivity in a rather naive way. After estimating number of groups in the whole table - either by reading pg_stats.n_distinct for a