Re: [HACKERS] multivariate statistics (v19)
On Sun, Feb 12, 2017 at 10:35:04AM +0000, Dean Rasheed wrote:
> On 11 February 2017 at 01:17, Tomas Vondra wrote:
> > Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> > sloppy because the number of attributes in the statistics is currently
> > limited to 8, so the overflows are currently not an issue. But it
> > doesn't hurt to make it future-proof, in case we change that mostly
> > artificial limit sometime in the future.
>
> Ah right, so it can't overflow at present, but it's neater to have an
> overflow-proof algorithm.
>
> Thinking about the exactness of the division steps is quite interesting.
> Actually, the order of the multiplying factors doesn't matter as long as
> the divisors are in increasing order. So in both my proposal:
>
>     result = 1
>     for (i = 1; i <= k; i++)
>         result = (result * (n-k+i)) / i;
>
> and David's proposal, which has the multiplying factors in the opposite
> order, and is equivalent to:
>
>     result = 1
>     for (i = 1; i <= k; i++)
>         result = (result * (n-i+1)) / i;
>
> the divisions are exact at each step. The first time through the loop it
> divides by 1, which is trivially exact. The second time it divides by 2,
> having multiplied by 2 consecutive factors, one of which is therefore
> guaranteed to be divisible by 2. The third time it divides by 3, having
> multiplied by 3 consecutive factors, one of which is therefore
> guaranteed to be divisible by 3, and so on.

Right. You know you can use integer division, which makes sense, as counts
of combinations of discrete sets are always integers.

> My approach originally seemed more logical to me because of the way it
> derives from the recurrence relation binomial(n, k) = binomial(n-1, k-1)
> * n / k, but they both work fine as long as they have suitable overflow
> checks.

Right. We could even cache those checks (sorry) based on data type limits
by architecture and OS, if performance on those operations ever matters
that much.

> It's also interesting that descriptions of this algorithm tend to talk
> about setting k to min(k, n-k) at the start as an optimisation step, as
> I did in fact, whereas it's actually more than that -- it helps prevent
> unnecessary intermediate overflows when k > n/2. Of course, that's not a
> worry for the current use of this function, but it's good to have a
> robust algorithm.

Indeed. :)

Best,
David.

--
David Fetter  http://fetter.org/
Re: [HACKERS] multivariate statistics (v19)
On 11 February 2017 at 01:17, Tomas Vondra wrote:
> Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
> sloppy because the number of attributes in the statistics is currently
> limited to 8, so the overflows are currently not an issue. But it
> doesn't hurt to make it future-proof, in case we change that mostly
> artificial limit sometime in the future.

Ah right, so it can't overflow at present, but it's neater to have an
overflow-proof algorithm.

Thinking about the exactness of the division steps is quite interesting.
Actually, the order of the multiplying factors doesn't matter as long as
the divisors are in increasing order. So in both my proposal:

    result = 1
    for (i = 1; i <= k; i++)
        result = (result * (n-k+i)) / i;

and David's proposal, which has the multiplying factors in the opposite
order, and is equivalent to:

    result = 1
    for (i = 1; i <= k; i++)
        result = (result * (n-i+1)) / i;

the divisions are exact at each step. The first time through the loop it
divides by 1, which is trivially exact. The second time it divides by 2,
having multiplied by 2 consecutive factors, one of which is therefore
guaranteed to be divisible by 2. The third time it divides by 3, having
multiplied by 3 consecutive factors, one of which is therefore guaranteed
to be divisible by 3, and so on.

My approach originally seemed more logical to me because of the way it
derives from the recurrence relation binomial(n, k) = binomial(n-1, k-1)
* n / k, but they both work fine as long as they have suitable overflow
checks.

It's also interesting that descriptions of this algorithm tend to talk
about setting k to min(k, n-k) at the start as an optimisation step, as I
did in fact, whereas it's actually more than that -- it helps prevent
unnecessary intermediate overflows when k > n/2. Of course, that's not a
worry for the current use of this function, but it's good to have a
robust algorithm.

Regards,
Dean
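[For illustration, the overflow-checked loop discussed above translates to
C along these lines. This is a sketch only -- the function name, error
handling, and exact types are placeholders, not the patch's actual code:]

    #include <limits.h>
    #include <stdint.h>

    /*
     * Compute binomial(n, k) with an exact integer division at each step.
     * Each intermediate result is itself a smaller binomial coefficient,
     * so it never exceeds the final result.
     */
    static int
    binomial(int n, int k)
    {
        int64_t result = 1;

        if (k < 0 || k > n)
            return 0;

        /* use the symmetry binomial(n, k) == binomial(n, n-k) */
        if (k > n - k)
            k = n - k;

        for (int i = 1; i <= k; i++)
        {
            /* exact: one of i consecutive factors is divisible by i */
            result = (result * (n - k + i)) / i;
            if (result > INT_MAX)
                return -1;  /* overflow; real code would raise an error */
        }
        return (int) result;
    }

Because result is re-checked against INT_MAX after every step, the product
result * (n - k + i) is bounded by INT_MAX squared and therefore cannot
overflow the 64-bit accumulator, which is exactly the property Dean
describes below.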
Re: [HACKERS] multivariate statistics (v19)
On 02/08/2017 07:40 PM, Dean Rasheed wrote:
> On 8 February 2017 at 16:09, David Fetter wrote:
> > Combinations are n!/(k! * (n-k)!), so computing those is more along
> > the lines of:
> >
> >     unsigned long long
> >     choose(unsigned long long n, unsigned long long k) {
> >         if (k > n) {
> >             return 0;
> >         }
> >         unsigned long long r = 1;
> >         for (unsigned long long d = 1; d <= k; ++d) {
> >             r *= n--;
> >             r /= d;
> >         }
> >         return r;
> >     }
> >
> > which greatly reduces the chance of overflow.
>
> Hmm, but that doesn't actually prevent overflows, since it can overflow
> in the multiplication step, and there is no protection against that. In
> the algorithm I presented, the inputs and the intermediate result are
> kept below INT_MAX, so the multiplication step cannot overflow the
> 64-bit integer, and it will only raise an overflow error if the actual
> result won't fit in a 32-bit int.
>
> Actually a crucial part of that, which I failed to mention previously,
> is the first step replacing k with min(k, n-k). This is necessary for
> inputs like (100,99), which should return 100, and which must be
> computed as 100 choose 1, not 100 choose 99, otherwise it will overflow
> internally before getting to the final result.

Thanks for the feedback, I'll fix this. I've allowed myself to be a bit
sloppy because the number of attributes in the statistics is currently
limited to 8, so the overflows are currently not an issue. But it doesn't
hurt to make it future-proof, in case we change that mostly artificial
limit sometime in the future.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
Re: [HACKERS] multivariate statistics (v19)
On 8 February 2017 at 16:09, David Fetter wrote:
> Combinations are n!/(k! * (n-k)!), so computing those is more along the
> lines of:
>
>     unsigned long long
>     choose(unsigned long long n, unsigned long long k) {
>         if (k > n) {
>             return 0;
>         }
>         unsigned long long r = 1;
>         for (unsigned long long d = 1; d <= k; ++d) {
>             r *= n--;
>             r /= d;
>         }
>         return r;
>     }
>
> which greatly reduces the chance of overflow.

Hmm, but that doesn't actually prevent overflows, since it can overflow
in the multiplication step, and there is no protection against that. In
the algorithm I presented, the inputs and the intermediate result are
kept below INT_MAX, so the multiplication step cannot overflow the 64-bit
integer, and it will only raise an overflow error if the actual result
won't fit in a 32-bit int.

Actually a crucial part of that, which I failed to mention previously, is
the first step replacing k with min(k, n-k). This is necessary for inputs
like (100,99), which should return 100, and which must be computed as 100
choose 1, not 100 choose 99, otherwise it will overflow internally before
getting to the final result.

Regards,
Dean
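[To make the internal overflow concrete -- illustrative numbers, not from
the original message: in a loop like the choose() quoted above, which
multiplies downwards from n, the running value after d steps is the
intermediate coefficient binomial(n, d), so computing 100 choose 99 walks
through the huge central coefficients, e.g.

    binomial(100, 50) ~ 1.01e29,  versus  2^64 ~ 1.84e19

which is far beyond even unsigned 64-bit range. After the min(k, n-k)
substitution the same input is computed as 100 choose 1, which finishes in
a single exact step with no intermediate larger than the final result.]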
Re: [HACKERS] multivariate statistics (v19)
On Wed, Feb 08, 2017 at 03:23:25PM +0000, Dean Rasheed wrote:
> On 6 February 2017 at 21:26, Alvaro Herrera wrote:
> > Tomas Vondra wrote:
> > > On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> > > > Nearby, some auxiliary functions such as n_choose_k and
> > > > num_combinations are not documented. What is it that they do? I'd
> > > > move these to the end of the file, keeping the important entry
> > > > points at the top of the file.
> > >
> > > I'd say n-choose-k is a pretty widely known term from combinatorics.
> > > The comment would essentially say just 'this is n-choose-k' which
> > > seems rather pointless. So as much as I dislike self-documenting
> > > code, this actually seems like a good case of that.
> >
> > Actually, we do have such comments all over the place. I knew this as
> > "n sobre k", so the English name doesn't immediately ring a bell with
> > me until I look it up; I think the function comment could just say
> > "n_choose_k -- this function returns the binomial coefficient".
>
> One of the things you have to watch out for when writing code to
> compute binomial coefficients is integer overflow, since the numerator
> and denominator get large very quickly. For example, the current code
> will overflow for n=13, k=12, which really isn't that large.
>
> This can be avoided by computing the product in reverse and using a
> larger datatype, like a 64-bit integer, to store a single intermediate
> result. The point about multiplying the terms in reverse is that it
> guarantees that each intermediate result is an exact integer (a smaller
> binomial coefficient), so there is no need to track separate numerators
> and denominators, and you avoid huge intermediate factorials. Here's
> what that looks like in pseudo-code:
>
>     binomial(int n, int k):
>         # Save computational effort by using the symmetry of the
>         # binomial coefficients
>         k = min(k, n-k);
>
>         # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
>         # starting from binomial(n-k, 0) = 1, and computing the sequence
>         # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
>         #
>         # Note that each intermediate result is an exact integer.
>         int64 result = 1;
>         for (int i = 1; i <= k; i++)
>         {
>             result = (result * (n-k+i)) / i;
>             if (result > INT_MAX)
>                 Raise overflow error
>         }
>         return (int) result;
>
> Note also that I think num_combinations(n) is just an expensive way of
> calculating 2^n - n - 1.

Combinations are n!/(k! * (n-k)!), so computing those is more along the
lines of:

    unsigned long long
    choose(unsigned long long n, unsigned long long k) {
        if (k > n) {
            return 0;
        }
        unsigned long long r = 1;
        for (unsigned long long d = 1; d <= k; ++d) {
            r *= n--;
            r /= d;
        }
        return r;
    }

which greatly reduces the chance of overflow.

Best,
David.

--
David Fetter  http://fetter.org/
Re: [HACKERS] multivariate statistics (v19)
On 6 February 2017 at 21:26, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> > On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> > > Nearby, some auxiliary functions such as n_choose_k and
> > > num_combinations are not documented. What is it that they do? I'd
> > > move these to the end of the file, keeping the important entry
> > > points at the top of the file.
> >
> > I'd say n-choose-k is a pretty widely known term from combinatorics.
> > The comment would essentially say just 'this is n-choose-k' which
> > seems rather pointless. So as much as I dislike self-documenting code,
> > this actually seems like a good case of that.
>
> Actually, we do have such comments all over the place. I knew this as
> "n sobre k", so the English name doesn't immediately ring a bell with me
> until I look it up; I think the function comment could just say
> "n_choose_k -- this function returns the binomial coefficient".

One of the things you have to watch out for when writing code to compute
binomial coefficients is integer overflow, since the numerator and
denominator get large very quickly. For example, the current code will
overflow for n=13, k=12, which really isn't that large.

This can be avoided by computing the product in reverse and using a
larger datatype, like a 64-bit integer, to store a single intermediate
result. The point about multiplying the terms in reverse is that it
guarantees that each intermediate result is an exact integer (a smaller
binomial coefficient), so there is no need to track separate numerators
and denominators, and you avoid huge intermediate factorials. Here's what
that looks like in pseudo-code:

    binomial(int n, int k):
        # Save computational effort by using the symmetry of the
        # binomial coefficients
        k = min(k, n-k);

        # Compute the result using binomial(n, k) = binomial(n-1, k-1) * n / k,
        # starting from binomial(n-k, 0) = 1, and computing the sequence
        # binomial(n-k+1, 1), binomial(n-k+2, 2), ...
        #
        # Note that each intermediate result is an exact integer.
        int64 result = 1;
        for (int i = 1; i <= k; i++)
        {
            result = (result * (n-k+i)) / i;
            if (result > INT_MAX)
                Raise overflow error
        }
        return (int) result;

Note also that I think num_combinations(n) is just an expensive way of
calculating 2^n - n - 1.

Regards,
Dean
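[Dean's closing identity can be spelled out as an aside: if
num_combinations(n) counts the combinations of two or more of n columns,
then summing binomial coefficients gives

    sum over k = 2..n of binomial(n, k)
        = 2^n - binomial(n, 0) - binomial(n, 1)
        = 2^n - n - 1

since the full sum over k = 0..n counts all 2^n subsets, minus the empty
set and the n single-column subsets.]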
Re: [HACKERS] multivariate statistics (v19)
Still about 0003.

dependencies.c: the comment at the top of the file should contain some
details about what it is implementing, and a general description of the
algorithm and data structures. As before, it's best to have the main
entry point build_mv_dependencies at the top, then the other public
functions, keeping the internal routines at the bottom of the file. That
eases code study for future readers. (Minimizing the number of function
prototypes is not a goal.)

What is MVSTAT_DEPS_TYPE_BASIC? Is "functional dependencies" really
BASIC? I wonder if it should be TYPE_FUNCTIONAL_DEPS or something.

As with pg_ndistinct_out, there's no need to pstrdup(str.data), as it's
already palloc'ed in the right context.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
Re: [HACKERS] multivariate statistics (v19)
Looking at 0003, I notice that gram.y is changed to add a WITH ( .. )
clause. If it's not specified, an error is raised. If you create stats
with (ndistinct) then you can't alter it later to add "dependencies" or
whatever; unless I misunderstand, you have to drop the statistics and
create another one. Probably in a forthcoming patch we should have ALTER
support to add a stats type. Also, why isn't the default to build
everything, rather than nothing?

BTW, almost everything in the backend could be inside "utils/", so let's
not do that -- let's just create src/backend/statistics/ for all your
code.

Here are a few notes taken while reading README.dependencies -- some
typos, two questions.

diff --git a/src/backend/utils/mvstats/README.dependencies b/src/backend/utils/mvstats/README.dependencies
index 908f094..7f3ed3d 100644
--- a/src/backend/utils/mvstats/README.dependencies
+++ b/src/backend/utils/mvstats/README.dependencies
@@ -36,7 +36,7 @@
 design choice to model the dataset in denormalized way, either because of
 performance or to make querying easier.
 
-soft dependencies
+Soft dependencies
 -----------------
 
 Real-world data sets often contain data errors, either because of data entry
@@ -48,7 +48,7 @@
 rendering the approach mostly useless even for slightly noisy data sets, or
 result in sudden changes in behavior depending on minor differences between
 samples provided to ANALYZE.
 
-For this reason the statistics implementes "soft" functional dependencies,
+For this reason the statistics implements "soft" functional dependencies,
 associating each functional dependency with a degree of validity (a number
 number between 0 and 1). This degree is then used to combine selectivities
 in a smooth manner.
@@ -75,6 +75,7 @@
 The algorithm also requires a minimum size of the group to consider it
 consistent (currently 3 rows in the sample). Small groups make it less likely
 to break the consistency.
 
+## What is it that we store in the catalog?
 
 Clause reduction (planner/optimizer)
 ------------------------------------
@@ -95,12 +96,12 @@
 example for (a,b,c) we first use (a,b=>c) to break the computation into
 and then apply (a=>b) the same way on P(a=?,b=?).
 
-Consistecy of clauses
+Consistency of clauses
 ---------------------
 
 Functional dependencies only express general dependencies between columns,
 without referencing particular values. This assumes that the equality clauses
-are in fact consistent with the functinal dependency, i.e. that given a
+are in fact consistent with the functional dependency, i.e. that given a
 dependency (a=>b), the value in (b=?) clause is the value determined by (a=?).
 If that's not the case, the clauses are "inconsistent" with the functional
 dependency and the result will be over-estimation.
@@ -111,6 +112,7 @@
 set will be empty, but we'll estimate the selectivity using the ZIP condition.
 In this case the default estimation based on AVIA principle happens to work
 better, but mostly by chance.
 
+## what is AVIA principle?
 
 This issue is the price for the simplicity of functional dependencies. If the
 application frequently constructs queries with clauses inconsistent with

--
Álvaro Herrera                https://www.2ndQuadrant.com/
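[The clause reduction the README sketches can be written out as follows --
an illustration of the idea using d for the degree of validity, not a
formula quoted from the patch: with a soft dependency (a,b=>c) of
degree d,

    P(a=?, b=?, c=?) = P(a=?, b=?) * (d + (1 - d) * P(c=?))

i.e. with weight d the dependency holds and the clause on c adds no
further filtering, and with weight (1 - d) the columns are treated as
independent. Applying (a=>b) to P(a=?, b=?) in the same way reduces
everything to single-column selectivities.]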
Re: [HACKERS] multivariate statistics (v19)
Tomas Vondra wrote:
> On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> > Nearby, some auxiliary functions such as n_choose_k and
> > num_combinations are not documented. What is it that they do? I'd
> > move these to the end of the file, keeping the important entry points
> > at the top of the file.
>
> I'd say n-choose-k is a pretty widely known term from combinatorics. The
> comment would essentially say just 'this is n-choose-k' which seems
> rather pointless. So as much as I dislike self-documenting code, this
> actually seems like a good case of that.

Actually, we do have such comments all over the place. I knew this as "n
sobre k", so the English name doesn't immediately ring a bell with me
until I look it up; I think the function comment could just say
"n_choose_k -- this function returns the binomial coefficient".

> > I see this patch has an estimate_ndistinct() which claims to be a
> > re-implementation of code already in analyze.c, but it is actually a
> > lot simpler than what analyze.c does. I've been wondering if it'd be a
> > good idea to use some of this code so that some routines are moved out
> > of analyze.c; good implementations of statistics-related functions
> > would live in src/backend/statistics/ where they can be used both by
> > analyze.c and your new mvstats stuff. (More generally I am beginning
> > to wonder if the new directory should be just src/backend/statistics.)
>
> I'll look into that. I have to check if I ignored some assumptions or
> corner cases analyze.c deals with.

Maybe it's not terribly important to refactor analyze.c from the get go,
but let's give the subdir a more general name. Hence my vote for having
the subdir be "statistics" instead of "mvstats".

> > In another subthread you seem to have surrendered to the opinion that
> > the new catalog should be called pg_statistics_ext, just in case in
> > the future we come up with additional things to put in it. However,
> > given its schema, with a "starelid / stakeys", is it sensible to think
> > that we're going to get anything other than something that involves
> > multiple variables? Maybe it should just be "pg_statistics_multivar"
> > and if something else comes along we create another catalog with an
> > appropriate schema. Heck, how does this catalog serve the purpose of
> > cross-table statistics in the first place, given that it has room to
> > record a single relid only? Are you thinking that in the future you'd
> > change starelid into an oidvector column?
>
> Yes, I think the starelid will turn into OID vector. The reason why I
> haven't done that in the current version of the catalog is to keep it
> simple.

OK -- as long as we know what the way forward is, I'm good. Still, my
main point was that even if we have multiple rels, this catalog will be
about having multivariate statistics, and not different kinds of
statistical data. I would keep pg_mv_statistics, really.

> > The comment in gram.y about the CREATE STATISTICS is at odds with what
> > is actually allowed by the grammar.
>
> Which comment?

This one:

    * CREATE STATISTICS stats_name ON relname (columns) WITH (options)

the production actually says:

    CREATE STATISTICS any_name ON '(' columnList ')' FROM qualified_name

> > I think the name of a statistics is only useful to DROP/ALTER it,
> > right? I wonder why it's useful that statistics belongs in a schema.
> > Perhaps it should be a global object? I suppose the name collisions
> > would become bothersome if you have many mvstats.
>
> I think it shouldn't be a global object. I consider them to be a part of
> a schema (just like indexes, for example). Imagine you have a
> multi-tenant database, with each tenant using exactly the same schema
> (tables/indexes), but kept in different schemas. Why shouldn't it be
> possible to also use the same set of statistics for each tenant?

True. Suggestion withdrawn.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
Re: [HACKERS] multivariate statistics (v19)
On Thu, Feb 2, 2017 at 3:59 AM, Tomas Vondra wrote:
> There's a subtle difference between pg_node_tree and the data types for
> statistics - pg_node_tree stores the value as a string (matching the
> nodeToString output), so the _in function is fairly simple. Of course,
> stringToNode() assumes safe input, which is why the input is disabled.
>
> OTOH the statistics are stored in an optimized binary format, allowing
> the value to be used directly (without having to do expensive parsing
> etc).
>
> I was thinking that the easiest way to add support for _in would be to
> add a bunch of Nodes for the statistics, along with in/out functions,
> but keeping the internal binary representation. But that'll be tricky to
> do in a safe way - even if those nodes are coded in a very defensive
> way, I'd bet there'll be ways to inject unsafe nodes.
>
> So I'm OK with not having the _in for now. If needed, it's possible to
> construct the statistics as a bytea using a bit of C code. That's at
> least obviously unsafe, as is anything written in C, touching memory.

Since these data types are already special-purpose, I don't really see
why it would be desirable to entangle them with the existing code for
serializing and deserializing Nodes. Whether or not it's absolutely
necessary for these types to have input functions, it seems at least
possible that it would be useful, and it becomes much less likely that we
can make that work if it's piggybacking on stringToNode().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] multivariate statistics (v19)
On 02/01/2017 11:52 PM, Alvaro Herrera wrote:
> Still looking at 0002.
>
> pg_ndistinct_in disallows input, claiming that pg_node_tree does the
> same thing. But pg_node_tree does it for security reasons: you could
> crash the backend if you supplied a malicious value. I don't think that
> applies to pg_ndistinct_in. Perhaps it will be useful to inject fake
> stats at some point, so why not allow it? It shouldn't be complicated
> (though it does require writing some additional code, so perhaps that's
> one reason we don't want to allow input of these values).

Yes, I haven't written the code, and I'm not sure it's a very practical
way to inject custom statistics. But if we decide to allow that in the
future, we can probably add the code.

There's a subtle difference between pg_node_tree and the data types for
statistics - pg_node_tree stores the value as a string (matching the
nodeToString output), so the _in function is fairly simple. Of course,
stringToNode() assumes safe input, which is why the input is disabled.

OTOH the statistics are stored in an optimized binary format, allowing
the value to be used directly (without having to do expensive parsing
etc).

I was thinking that the easiest way to add support for _in would be to
add a bunch of Nodes for the statistics, along with in/out functions, but
keeping the internal binary representation. But that'll be tricky to do
in a safe way - even if those nodes are coded in a very defensive way,
I'd bet there'll be ways to inject unsafe nodes.

So I'm OK with not having the _in for now. If needed, it's possible to
construct the statistics as a bytea using a bit of C code. That's at
least obviously unsafe, as is anything written in C, touching memory.

> The comment on top of pg_ndistinct_out is missing the "_out"; also it
> talks about histograms, which is not what this is about.

OK, will fix.

> In the same function, a trivial point: you don't need to pstrdup() the
> .data out of a stringinfo; it's already palloc'ed in the right context
> -- just PG_RETURN_CSTRING(str.data) and forget about "ret". Saves you
> one line.

Will fix too.

> Nearby, some auxiliary functions such as n_choose_k and
> num_combinations are not documented. What is it that they do? I'd move
> these to the end of the file, keeping the important entry points at the
> top of the file.

I'd say n-choose-k is a pretty widely known term from combinatorics. The
comment would essentially say just 'this is n-choose-k' which seems
rather pointless. So as much as I dislike self-documenting code, this
actually seems like a good case of that.

> I see this patch has an estimate_ndistinct() which claims to be a
> re-implementation of code already in analyze.c, but it is actually a
> lot simpler than what analyze.c does. I've been wondering if it'd be a
> good idea to use some of this code so that some routines are moved out
> of analyze.c; good implementations of statistics-related functions
> would live in src/backend/statistics/ where they can be used both by
> analyze.c and your new mvstats stuff. (More generally I am beginning to
> wonder if the new directory should be just src/backend/statistics.)

I'll look into that. I have to check if I ignored some assumptions or
corner cases analyze.c deals with.

> common.h does not belong in src/backend/utils/mvstats; IMO it should be
> called src/include/utils/mvstat.h. Also, it must not include
> postgres.h, and it probably doesn't need most of the #includes it has;
> those are better put into whatever includes it. It definitely needs a
> guarding #ifdef MVSTATS_H around its whole content too. An include file
> is not just a way to avoid #includes in other files; it is supposed to
> be a minimally invasive way of exporting the structs and functions
> implemented in some file into other files. So it must be kept minimal.

Will do.

> psql/tab-complete.c compares the wrong version number (9.6 instead of
> 10).
>
> Is it important to have a cast from pg_ndistinct to bytea? I think it's
> odd that outputting it as bytea yields something completely different
> than as text. (The bytea is not human readable and cannot be used for
> future input, so what is the point?)

Because it internally is a bytea, and it seems useful to have the ability
to inspect the bytea value directly (e.g. to see the length of the bytea
and not the string output).

> In another subthread you seem to have surrendered to the opinion that
> the new catalog should be called pg_statistics_ext, just in case in the
> future we come up with additional things to put in it. However, given
> its schema, with a "starelid / stakeys", is it sensible to think that
> we're going to get anything other than something that involves multiple
> variables? Maybe it should just be "pg_statistics_multivar" and if
> something else comes along we create another catalog with an
> appropriate schema. Heck, how does this catalog serve the purpose of
> cross-table statistics in the first place, given that it has room to
> record a single relid only? Are you thinking that in the future you'd
> change starelid into an oidvector column?

Yes, I think the starelid will turn into OID vector. The reason why I
haven't done that in the current version of the catalog is to keep it
simple.

> The comment in gram.y about the CREATE STATISTICS is at odds with what
> is actually allowed by the grammar.

Which comment?

> I think the name of a statistics is only useful to DROP/ALTER it,
> right? I wonder why it's useful that statistics belongs in a schema.
> Perhaps it should be a global object? I suppose the name collisions
> would become bothersome if you have many mvstats.

I think it shouldn't be a global object. I consider them to be a part of
a schema (just like indexes, for example). Imagine you have a
multi-tenant database, with each tenant using exactly the same schema
(tables/indexes), but kept in different schemas. Why shouldn't it be
possible to also use the same set of statistics for each tenant?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
Re: [HACKERS] multivariate statistics (v19)
Still looking at 0002.

pg_ndistinct_in disallows input, claiming that pg_node_tree does the same
thing. But pg_node_tree does it for security reasons: you could crash the
backend if you supplied a malicious value. I don't think that applies to
pg_ndistinct_in. Perhaps it will be useful to inject fake stats at some
point, so why not allow it? It shouldn't be complicated (though it does
require writing some additional code, so perhaps that's one reason we
don't want to allow input of these values).

The comment on top of pg_ndistinct_out is missing the "_out"; also it
talks about histograms, which is not what this is about.

In the same function, a trivial point: you don't need to pstrdup() the
.data out of a stringinfo; it's already palloc'ed in the right context --
just PG_RETURN_CSTRING(str.data) and forget about "ret". Saves you one
line.

Nearby, some auxiliary functions such as n_choose_k and num_combinations
are not documented. What is it that they do? I'd move these to the end of
the file, keeping the important entry points at the top of the file.

I see this patch has an estimate_ndistinct() which claims to be a
re-implementation of code already in analyze.c, but it is actually a lot
simpler than what analyze.c does. I've been wondering if it'd be a good
idea to use some of this code so that some routines are moved out of
analyze.c; good implementations of statistics-related functions would
live in src/backend/statistics/ where they can be used both by analyze.c
and your new mvstats stuff. (More generally I am beginning to wonder if
the new directory should be just src/backend/statistics.)

common.h does not belong in src/backend/utils/mvstats; IMO it should be
called src/include/utils/mvstat.h. Also, it must not include postgres.h,
and it probably doesn't need most of the #includes it has; those are
better put into whatever includes it. It definitely needs a guarding
#ifdef MVSTATS_H around its whole content too. An include file is not
just a way to avoid #includes in other files; it is supposed to be a
minimally invasive way of exporting the structs and functions implemented
in some file into other files. So it must be kept minimal.

psql/tab-complete.c compares the wrong version number (9.6 instead of
10).

Is it important to have a cast from pg_ndistinct to bytea? I think it's
odd that outputting it as bytea yields something completely different
than as text. (The bytea is not human readable and cannot be used for
future input, so what is the point?)

In another subthread you seem to have surrendered to the opinion that the
new catalog should be called pg_statistics_ext, just in case in the
future we come up with additional things to put in it. However, given its
schema, with a "starelid / stakeys", is it sensible to think that we're
going to get anything other than something that involves multiple
variables? Maybe it should just be "pg_statistics_multivar" and if
something else comes along we create another catalog with an appropriate
schema. Heck, how does this catalog serve the purpose of cross-table
statistics in the first place, given that it has room to record a single
relid only? Are you thinking that in the future you'd change starelid
into an oidvector column?

The comment in gram.y about the CREATE STATISTICS is at odds with what is
actually allowed by the grammar.

I think the name of a statistics is only useful to DROP/ALTER it, right?
I wonder why it's useful that statistics belongs in a schema. Perhaps it
should be a global object? I suppose the name collisions would become
bothersome if you have many mvstats.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
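[Spelled out, the pstrdup() simplification suggested above looks like this
-- a sketch of the point, not the patch's actual function body:]

    /* before: an unnecessary copy of already-palloc'ed data */
    char *ret = pstrdup(str.data);
    PG_RETURN_CSTRING(ret);

    /* after: StringInfo data is palloc'ed in the current context already */
    PG_RETURN_CSTRING(str.data);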
Re: [HACKERS] multivariate statistics (v19)
On 01/31/2017 07:52 AM, Amit Langote wrote:
> On 2017/01/31 6:57, Tomas Vondra wrote:
> > On 01/30/2017 09:37 PM, Alvaro Herrera wrote:
> > > Looks good to me. I don't think we need to keep the names very short
> > > -- I would propose "standistinct", "stahistogram", "stadependencies".
> >
> > Yeah, I got annoyed by the short names too.
> >
> > This however reminds me that perhaps pg_mv_statistic is not the best
> > name. I know others proposed pg_statistic_ext (and pg_stats_ext), and
> > while I wasn't a big fan initially, I think it's a better name. People
> > generally don't know what 'multivariate' means, while 'extended' is
> > better known (e.g. because Oracle uses it for similar stuff).
> >
> > So I think I'll switch to that name too.
>
> +1 to pg_statistics_ext. Maybe even pg_statistics_extended, however
> being that verbose may not be warranted.

Yeah, I think pg_statistic_extended / pg_stats_extended seems fine.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
Re: [HACKERS] multivariate statistics (v19)
On 2017/01/31 6:57, Tomas Vondra wrote:
> On 01/30/2017 09:37 PM, Alvaro Herrera wrote:
> > Looks good to me. I don't think we need to keep the names very short
> > -- I would propose "standistinct", "stahistogram", "stadependencies".
>
> Yeah, I got annoyed by the short names too.
>
> This however reminds me that perhaps pg_mv_statistic is not the best
> name. I know others proposed pg_statistic_ext (and pg_stats_ext), and
> while I wasn't a big fan initially, I think it's a better name. People
> generally don't know what 'multivariate' means, while 'extended' is
> better known (e.g. because Oracle uses it for similar stuff).
>
> So I think I'll switch to that name too.

+1 to pg_statistics_ext. Maybe even pg_statistics_extended, however being
that verbose may not be warranted.

Thanks,
Amit
Re: [HACKERS] multivariate statistics (v19)
On Tue, Jan 31, 2017 at 6:57 AM, Tomas Vondra wrote:
> This however reminds me that perhaps pg_mv_statistic is not the best
> name. I know others proposed pg_statistic_ext (and pg_stats_ext), and
> while I wasn't a big fan initially, I think it's a better name. People
> generally don't know what 'multivariate' means, while 'extended' is
> better known (e.g. because Oracle uses it for similar stuff).
>
> So I think I'll switch to that name too.

I have moved this patch to the next CF, with Álvaro as reviewer.

--
Michael
Re: [HACKERS] multivariate statistics (v19)
On 01/30/2017 09:37 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> > The 'built' flags may be easily replaced with a check whether the
> > bytea-like columns are NULL, and the 'enabled' columns may be replaced
> > by the array of char, just like you proposed.
> >
> > That'd give us a single catalog looking like this:
> >
> >     pg_mv_statistics
> >         starelid
> >         staname
> >         stanamespace
> >         staowner      -- all the above as currently
> >         staenabled    array of "char" {d,f,s}
> >         stakeys
> >         stadeps       (dependencies)
> >         standist      (ndistinct coefficients)
> >         stamcv        (MCV list)
> >         stahist       (histogram)
> >
> > Which is probably a better / simpler structure than the current one.
>
> Looks good to me. I don't think we need to keep the names very short --
> I would propose "standistinct", "stahistogram", "stadependencies".

Yeah, I got annoyed by the short names too.

This however reminds me that perhaps pg_mv_statistic is not the best
name. I know others proposed pg_statistic_ext (and pg_stats_ext), and
while I wasn't a big fan initially, I think it's a better name. People
generally don't know what 'multivariate' means, while 'extended' is
better known (e.g. because Oracle uses it for similar stuff).

So I think I'll switch to that name too.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
Re: [HACKERS] multivariate statistics (v19)
Tomas Vondra wrote:
> The 'built' flags may be easily replaced with a check whether the
> bytea-like columns are NULL, and the 'enabled' columns may be replaced
> by the array of char, just like you proposed.
>
> That'd give us a single catalog looking like this:
>
>     pg_mv_statistics
>         starelid
>         staname
>         stanamespace
>         staowner      -- all the above as currently
>         staenabled    array of "char" {d,f,s}
>         stakeys
>         stadeps       (dependencies)
>         standist      (ndistinct coefficients)
>         stamcv        (MCV list)
>         stahist       (histogram)
>
> Which is probably a better / simpler structure than the current one.

Looks good to me. I don't think we need to keep the names very short --
I would propose "standistinct", "stahistogram", "stadependencies".

Thanks,

--
Álvaro Herrera                https://www.2ndQuadrant.com/
Re: [HACKERS] multivariate statistics (v19)
On 01/30/2017 05:12 PM, Alvaro Herrera wrote:
> Hmm. So we have a catalog pg_mv_statistics which stores two things:
>
> 1. the configuration regarding mvstats that has been requested by the
>    user via CREATE/ALTER STATISTICS
> 2. the actual values captured from the above, via ANALYZE
>
> I think this conflates two things that really are separate, given their
> different timings and usage patterns. This decision is causing the
> catalog to have enabled/built flag columns for each set of stats
> requested, which looks a bit odd. In particular, the fact that you have
> to heap_update the catalog in order to add more stuff as it's built
> looks inconvenient.
>
> Have you thought about having the "requested" bits be separate from the
> actual computed values? Something like
>
>     pg_mv_statistics
>         starelid
>         staname
>         stanamespace
>         staowner      -- all the above as currently
>         staenabled    array of "char" {d,f,s}
>         stakeys
>         // no CATALOG_VARLEN here
>
> where each char in the staenabled array has a #define and indicates one
> type, "ndistinct", "functional dep", "selectivity" etc.
>
> The actual values computed by ANALYZE would live in a catalog like:
>
>     pg_mv_statistics_values
>         stvstaid    -- OID of the corresponding pg_mv_statistics row. Needed?

Definitely needed. How else would you know which MCV list and histogram
belong together? This works just like in pg_statistic - when both MCV and
histograms are enabled for the statistic, we first build the MCV list,
then the histogram on the remaining rows. So we need to pair them.

>         stvrelid    -- same as starelid
>         stvkeys     -- same as stakeys
>     #ifdef CATALOG_VARLEN
>         stvkind     'd' or 'f' or 's', etc
>         stvvalue    the bytea blob
>     #endif
>
> I think that would be simpler, both conceptually and in terms of code.

I think the main issue here is that it throws away the special data types
(pg_histogram, pg_mcv, pg_ndistinct, pg_dependencies), which I think are
a neat idea and would like to keep. This would throw that away, making
everything bytea again. I don't like that.

> The other angle to consider is planner-side: how does the planner get
> to the values? I think as far as the planner goes, the first catalog
> doesn't matter at all, because a statistics type that has been enabled
> but not computed is not interesting at all; the planner only cares
> about the values in the second catalog (this is why I added stvkeys).
> Currently you're just caching a single pg_mv_statistics row in
> get_relation_info (and only if any of the "built" flags is set), which
> is simple. With my proposed change, you'd need to keep multiple
> pg_mv_statistics_values rows.
>
> But maybe you already tried something like what I propose and there's a
> reason not to do it?

Honestly, I don't see how this improves the situation. We still need to
cache data for exactly one catalog, so how is that simpler? The way I see
it, it actually makes things more complicated, because now we have two
catalogs to manage instead of one (e.g. when doing DROP STATISTICS, or
after ALTER TABLE ... DROP COLUMN).

The 'built' flags may be easily replaced with a check whether the
bytea-like columns are NULL, and the 'enabled' columns may be replaced by
the array of char, just like you proposed.

That'd give us a single catalog looking like this:

    pg_mv_statistics
        starelid
        staname
        stanamespace
        staowner      -- all the above as currently
        staenabled    array of "char" {d,f,s}
        stakeys
        stadeps       (dependencies)
        standist      (ndistinct coefficients)
        stamcv        (MCV list)
        stahist       (histogram)

Which is probably a better / simpler structure than the current one.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
Re: [HACKERS] multivariate statistics (v19)
On 01/30/2017 05:55 PM, Alvaro Herrera wrote:
> Minor nitpicks:
>
> Let me suggest to use get_attnum() in CreateStatistics instead of
> SearchSysCacheAttName for each column. Also, we use type AttrNumber for
> attribute numbers rather than int16. Finally, in the same function you
> have an erroneous ERRCODE_UNDEFINED_COLUMN which should be
> ERRCODE_DUPLICATE_COLUMN in the loop that searches for duplicates.
>
> May I suggest that compare_int16 be named attnum_cmp (just to be
> consistent with other qsort comparators) and look like
>
>     return *((const AttrNumber *) a) - *((const AttrNumber *) b);
>
> instead of memcmp?

Yes, I think this is pretty much what Kyotaro-san pointed out in his
review. I'll go through the patch and make sure the correct data types
are used.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
Re: [HACKERS] multivariate statistics (v19)
Hello,

On 01/26/2017 10:03 AM, Ideriha, Takeshi wrote:
> Though I pointed out these typos and so on, I believe this feedback is
> lower priority compared to the source code itself. So please work on my
> feedback if you have time.

I think getting the comments (and docs in general) right is just as
important as the code. So thank you for your review!

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
Re: [HACKERS] multivariate statistics (v19)
On 01/26/2017 10:43 AM, Dilip Kumar wrote:
> histograms
> ----------
>
>     + if (matches[i] == MVSTATS_MATCH_FULL)
>     +     s += mvhist->buckets[i]->ntuples;
>     + else if (matches[i] == MVSTATS_MATCH_PARTIAL)
>     +     s += 0.5 * mvhist->buckets[i]->ntuples;
>
> Wouldn't it be better to take some percentage of the bucket, based on
> the number of distinct elements, for partially matching buckets?

I don't think so, for the same reason why ineq_histogram_selectivity() in
selfuncs.c uses

    binfrac = 0.5;

for partial bucket matches - it provides the minimum average error. Even
if we knew the number of distinct items in the bucket, we have no idea
what the distribution within the bucket looks like. Maybe 99% of the
bucket is covered by a single distinct value, maybe all the items are
squashed on one side of the bucket, etc.

Moreover we don't really know the number of distinct values in the bucket
- we only know the number of distinct items in the sample, and only while
building the histogram. I don't think it makes much sense to estimate the
number of distinct items in a bucket, because the buckets contain only
very few rows, so the estimates would be wildly inaccurate.

>     +static int
>     +update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
>     +                              int2vector *stakeys,
>     +                              MVSerializedHistogram mvhist,
>     +                              int nmatches, char *matches,
>     +                              bool is_or)
>     +{
>     +    int i;
>
> For each clause we are processing all the buckets. Can't we use some
> data structure which can make searching multi-dimensional information
> faster?

No, we're not processing all buckets for each clause. We're only
processing buckets that were not "ruled out" by preceding clauses. That's
the whole point of the bitmap. For example for the condition (a=1) AND
(b=2), the code will first evaluate (a=1) on all buckets, and then (b=2)
but only on buckets where (a=1) was evaluated as true. Similarly for OR
clauses.

> Something like an HTree or RTree? Maybe storing the histogram in these
> formats will be difficult?

Maybe, but I don't want to do that in the first version. I'm not opposed
to doing that in the future, if we find out the v1 histograms are not
efficient (I don't think we will, based on tests I did while working on
the patch). Support for other histogram implementations is pretty much
why there is a 'type' field in the struct. For now I think we should
stick with the simple implementation.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
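[The bitmap mechanism Tomas describes can be sketched in C roughly as
follows -- an illustration only; MVSTATS_MATCH_NONE and the loop
structure are assumptions here, not the patch's exact code:]

    /*
     * Evaluate one clause against the histogram, refining the match
     * bitmap left behind by the clauses processed so far.
     */
    for (int i = 0; i < mvhist->nbuckets; i++)
    {
        /* AND semantics: skip buckets earlier clauses already ruled out */
        if (!is_or && matches[i] == MVSTATS_MATCH_NONE)
            continue;

        /* OR semantics: skip buckets already matched in full */
        if (is_or && matches[i] == MVSTATS_MATCH_FULL)
            continue;

        /*
         * ... evaluate the clause against the bucket's boundaries and
         * update matches[i] to NONE / PARTIAL / FULL accordingly ...
         */
    }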
Re: [HACKERS] multivariate statistics (v19)
Minor nitpicks:

Let me suggest to use get_attnum() in CreateStatistics instead of
SearchSysCacheAttName for each column. Also, we use type AttrNumber for
attribute numbers rather than int16. Finally, in the same function you
have an erroneous ERRCODE_UNDEFINED_COLUMN which should be
ERRCODE_DUPLICATE_COLUMN in the loop that searches for duplicates.

May I suggest that compare_int16 be named attnum_cmp (just to be
consistent with other qsort comparators) and look like

    return *((const AttrNumber *) a) - *((const AttrNumber *) b);

instead of memcmp?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
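[Written out in full, the suggested comparator would look something like
this -- a sketch assuming the standard qsort() comparator signature;
since AttrNumber is a 16-bit type, the subtraction cannot overflow the
int result:]

    /* qsort comparator for arrays of AttrNumber */
    static int
    attnum_cmp(const void *a, const void *b)
    {
        return *((const AttrNumber *) a) - *((const AttrNumber *) b);
    }

    /* usage, e.g.: qsort(attnums, natts, sizeof(AttrNumber), attnum_cmp); */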
Re: [HACKERS] multivariate statistics (v19)
Tomas Vondra wrote:
> On 01/03/2017 05:22 PM, Tomas Vondra wrote:
> > On 01/03/2017 02:42 PM, Dilip Kumar wrote:
> ...
> > > I think it should be easily reproducible, in case it's not I can
> > > send a call stack or core dump.
> >
> > Thanks for the report. It was trivial to reproduce and it turned out
> > to be a fairly simple bug. Will send a new version of the patch soon.
>
> Attached is v22 of the patch series, rebased to current master and
> fixing the reported bug. I haven't made any other changes - the issues
> reported by Petr are mostly minor, so I've decided to wait a bit more
> for (hopefully) other reviews.

Hmm. So we have a catalog pg_mv_statistics which stores two things:

1. the configuration regarding mvstats that has been requested by the
   user via CREATE/ALTER STATISTICS
2. the actual values captured from the above, via ANALYZE

I think this conflates two things that really are separate, given their
different timings and usage patterns. This decision is causing the
catalog to have enabled/built flag columns for each set of stats
requested, which looks a bit odd. In particular, the fact that you have
to heap_update the catalog in order to add more stuff as it's built looks
inconvenient.

Have you thought about having the "requested" bits be separate from the
actual computed values? Something like

    pg_mv_statistics
        starelid
        staname
        stanamespace
        staowner      -- all the above as currently
        staenabled    array of "char" {d,f,s}
        stakeys
        // no CATALOG_VARLEN here

where each char in the staenabled array has a #define and indicates one
type, "ndistinct", "functional dep", "selectivity" etc.

The actual values computed by ANALYZE would live in a catalog like:

    pg_mv_statistics_values
        stvstaid    -- OID of the corresponding pg_mv_statistics row. Needed?
        stvrelid    -- same as starelid
        stvkeys     -- same as stakeys
    #ifdef CATALOG_VARLEN
        stvkind     'd' or 'f' or 's', etc
        stvvalue    the bytea blob
    #endif

I think that would be simpler, both conceptually and in terms of code.

The other angle to consider is planner-side: how does the planner get to
the values? I think as far as the planner goes, the first catalog doesn't
matter at all, because a statistics type that has been enabled but not
computed is not interesting at all; the planner only cares about the
values in the second catalog (this is why I added stvkeys). Currently
you're just caching a single pg_mv_statistics row in get_relation_info
(and only if any of the "built" flags is set), which is simple. With my
proposed change, you'd need to keep multiple pg_mv_statistics_values
rows.

But maybe you already tried something like what I propose and there's a
reason not to do it?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
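[Concretely, the per-type #defines might look like this -- hypothetical
names; only the letters {d,f,s} and their meanings are given above:]

    /* allowed values in the staenabled "char" array */
    #define STATS_ENABLED_NDISTINCT     'd'  /* ndistinct coefficients */
    #define STATS_ENABLED_DEPENDENCIES  'f'  /* functional dependencies */
    #define STATS_ENABLED_SELECTIVITY   's'  /* selectivity (MCV, histogram) */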
Re: [HACKERS] multivariate statistics (v19)
Hello, I'll return to this since it should welcome more eyeballs.

At Thu, 26 Jan 2017 09:03:10 +0000, "Ideriha, Takeshi" wrote in
<4E72940DA2BF16479384A86D54D0988A565822A9@G01JPEXMBKW04>
> Hi
>
> When you have time, could you rebase the patches? Some patches cannot
> be applied to the current HEAD.

For those who are willing to look at this,
352a24a1f9d6f7d4abb1175bfd22acc358f43140 breaks this, so the tree just
before that commit accepts these patches cleanly.

> The 0001 patch can be applied, but the following 0002 patch cannot be.
>
> I've just started reading your patch (mainly the docs and README, not
> yet the source code.)
>
> Though these are minor things, I've found some typos or mistakes in the
> document and README.
>
> >+ statistics on the table. The statistics will be created in the in the
> >+ current database. The statistics will be owned by the user issuing
>
> Regarding line 629 at
> 0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v22.patch, there
> is a double "in the".
>
> >+ knowledge of a value in the first column is sufficient for detemining the
> >+ value in the other column. Then functional dependencies are built on those
>
> Regarding line 701 at 0002-PATCH, "determining" is mistakenly spelled
> "detemining".
>
> >@@ -0,0 +1,98 @@
> >+Multivariate statististics
> >+==
>
> Regarding line 2415 at 0002-PATCH, "statististics" should be
> "statistics".
>
> >+    CREATE STATISTICS
> >+    define a new statistics
>
> >+    DROP STATISTICS
> >+    remove a statistics
>
> Regarding lines 612 and 771 at 0002-PATCH, I assume saying "multiple
> statistics" explicitly would be easier for users to understand, since
> these commands aren't for the statistics we already have in
> pg_statistics, in my understanding.
>
> >+ [1] http://en.wikipedia.org/wiki/Database_normalization
>
> Regarding line 386 at 0003-PATCH, is it better to change this link to
> this one: https://en.wikipedia.org/wiki/Functional_dependency ?
> README.dependencies directly cites the above link.
>
> Though I pointed out these typos and so on, I believe this feedback is
> lower priority compared to the source code itself. So please work on my
> feedback if you have time.

README.dependencies:

> dependencies, and for each one count the number of rows rows consistent it.

"of rows rows consistent it" => "of rows consistent with it"?

> are in fact consistent with the functinal dependency, i.e. that given the a

"that given the a" => "that given a"?

dependencies.c, dependency_dgree():

- k is assumed to be larger than 1; I think an assertion is required.

- The comment "/* end of the preceding group */" seems better placed just
  after the "if (multi_sort..." that currently follows it.

- The following comment seems mis-edited:

      * If there is a single are no contradicting rows, count the group
      * as supporting, otherwise contradicting.

  Maybe it should be like the following? (The variable counting the first
  "contradiction" is named "n_violations", which seems somewhat
  confusing.)

      * If there are no violating rows up to here, count the group
      * as supporting, otherwise contradicting.

- Under the comment "/* first columns match, but the last one does not",
  the comparison

      else if (multi_sort_compare_dims((k - 1), (k - 1), ...

  should use multi_sort_compare_dim, not multi_sort_compare_dims.

- This function counts "n_contradicting_rows", but it is not referenced.
  Anyway, n_contradicting_rows = numrows - n_supporing_rows, so both it
  and n_contradicting seem unnecessary.

build_mv_dependencies(): in the comment

    * covering jut 2 columns, to the largest ones, covering all columns
    * included int the statistics. We start from the smallest ones because we

line 1: "jut" => "just"; line 2: "int" => "in".

mvstats.h:

- struct MVDependencyData / MVDependenciesData: the variable-length
  member at the end of these structs should be defined using
  FLEXIBLE_ARRAY_MEMBER, per convention.

- I'm not sure how much it impacts performance, but some struct members
  seem to have types that are a bit too wide. For example,
  MVDependenciesData.type is int32, but it can only have the value '1'
  for now and it won't grow to two digits. Also, ndeps cannot be so
  large.

common.c: multi_sort_compare_dims needs a comment.

general: This patch uses int16 as the type of attribute numbers, but it
might be better to use AttrNumber for the purpose. (Specifically, it
seems to be defined as the type for an attribute index, but it is also
used as the variable for the number of attributes.)

Sorry in advance for the random comments. I'll study this further.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
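[For reference, the FLEXIBLE_ARRAY_MEMBER convention mentioned above
looks like this -- an illustrative sketch; the actual field list of
MVDependenciesData in the patch may differ:]

    typedef struct MVDependenciesData
    {
        int32        vl_len_;   /* varlena header (never touched directly) */
        uint32       type;      /* type of the statistics */
        uint32       ndeps;     /* number of dependencies */
        /* the variable-length part must come last */
        MVDependency deps[FLEXIBLE_ARRAY_MEMBER];
    } MVDependenciesData;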
Re: [HACKERS] multivariate statistics (v19)
On Thu, Jan 5, 2017 at 3:27 AM, Tomas Vondra wrote:
> Thanks. Those plans match my experiments with the TPC-H data set,
> although I've been playing with the smallest scale (1GB).
>
> It's not very difficult to make the estimation error arbitrarily large,
> e.g. by using perfectly correlated (identical) columns.

I have done an initial review of the ndistinct and histogram patches;
here are a few review comments.

ndistinct
---------

1. Duplicate statistics:

    postgres=# create statistics s with (ndistinct) on (a,c) from t;
    2017-01-07 16:21:54.575 IST [63817] ERROR:  duplicate key value violates unique constraint "pg_mv_statistic_name_index"
    2017-01-07 16:21:54.575 IST [63817] DETAIL:  Key (staname, stanamespace)=(s, 2200) already exists.
    2017-01-07 16:21:54.575 IST [63817] STATEMENT:  create statistics s with (ndistinct) on (a,c) from t;
    ERROR:  duplicate key value violates unique constraint "pg_mv_statistic_name_index"
    DETAIL:  Key (staname, stanamespace)=(s, 2200) already exists.

For duplicate statistics, I think we can check whether the statistics
already exist and give a more meaningful error, something like:
statistics "s" already exists.

2. Typo:

    + /*
    +  * Sort the attnums, which makes detecting duplicies somewhat
    +  * easier, and it does not hurt (it does not affect the efficiency,
    +  * onlike for indexes, for example).
    +  */

/onlike/unlike

3. Typo:

    /*
     * Find attnims of MV stats using the mvoid.
     */
    int2vector *
    find_mv_attnums(Oid mvoid, Oid *relid)

/attnims/attnums

histograms
----------

    + if (matches[i] == MVSTATS_MATCH_FULL)
    +     s += mvhist->buckets[i]->ntuples;
    + else if (matches[i] == MVSTATS_MATCH_PARTIAL)
    +     s += 0.5 * mvhist->buckets[i]->ntuples;

Wouldn't it be better to take some percentage of the bucket, based on the
number of distinct elements, for partially matching buckets?

    +static int
    +update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
    +                              int2vector *stakeys,
    +                              MVSerializedHistogram mvhist,
    +                              int nmatches, char *matches,
    +                              bool is_or)
    +{
    +    int i;

For each clause we are processing all the buckets. Can't we use some data
structure which can make searching multi-dimensional information faster?
Something like an HTree or RTree? Maybe storing the histogram in these
formats will be difficult?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] multivariate statistics (v19)
Hi When you have time, could you rebase the patches? Some of them cannot be applied to the current HEAD: the 0001 patch can be applied, but the following 0002 patch cannot. I've just started reading your patch (mainly the docs and README, not yet the source code). Though these are minor things, I've found some typos and mistakes in the documentation and README. >+ statistics on the table. The statistics will be created in the in the >+ current database. The statistics will be owned by the user issuing Regarding line 629 at 0002-PATCH-shared-infrastructure-and-ndistinct-coeffi-v22.patch, there is a double "in the". >+ knowledge of a value in the first column is sufficient for detemining the >+ value in the other column. Then functional dependencies are built on those Regarding line 701 at 0002-PATCH, "determining" is mistakenly spelled "detemining". >@@ -0,0 +1,98 @@ >+Multivariate statististics >+== Regarding line 2415 at 0002-PATCH, "statististics" should be statistics >+ >+ CREATE STATISTICS >+ define a new statistics >+ >+ >+ DROP STATISTICS >+ remove a statistics >+ Regarding line 612 and 771 at 0002-PATCH, I assume saying "multivariate statistics" explicitly is easier for users to understand, since these commands don't apply to the statistics we already have in pg_statistic, in my understanding. >+ [1] http://en.wikipedia.org/wiki/Database_normalization Regarding line 386 at 0003-PATCH, is it better to change this link to this one: https://en.wikipedia.org/wiki/Functional_dependency ? README.dependencies cites the above link directly. Though I pointed out these typos and so on, I believe this feedback is lower priority than the source code itself, so please work on it only when you have time. regards, Ideriha Takeshi -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Wed, Jan 25, 2017 at 9:56 PM, Alvaro Herrera wrote: > Michael Paquier wrote: >> And nothing has happened since. Are there people willing to review >> this patch and help it proceed? > > I am going to grab this patch as committer. Thanks, that's good to know. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
Michael Paquier wrote: > On Wed, Jan 4, 2017 at 11:35 AM, Tomas Vondra > wrote: > > Attached is v22 of the patch series, rebased to current master and fixing > > the reported bug. I haven't made any other changes - the issues reported by > > Petr are mostly minor, so I've decided to wait a bit more for (hopefully) > > other reviews. > > And nothing has happened since. Are there people willing to review > this patch and help it proceed? I am going to grab this patch as committer. > As this patch is quite large, I am not sure if it is fit to join the > last CF. Thoughts? All patches, regardless of size, are welcome to join any commitfest. The last commitfest is not different in that regard. The rule I remember is that patches may not arrive *for the first time* in the last commitfest. This patch has already seen a lot of work in previous commitfests, so it's fine. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Wed, Jan 4, 2017 at 11:35 AM, Tomas Vondra wrote: > On 01/03/2017 05:22 PM, Tomas Vondra wrote: >> >> On 01/03/2017 02:42 PM, Dilip Kumar wrote: > > ... >>> >>> I think it should be easily reproducible, in case it's not I can send >>> call stack or core dump. >>> >> >> Thanks for the report. It was trivial to reproduce and it turned out to >> be a fairly simple bug. Will send a new version of the patch soon. >> > > Attached is v22 of the patch series, rebased to current master and fixing > the reported bug. I haven't made any other changes - the issues reported by > Petr are mostly minor, so I've decided to wait a bit more for (hopefully) > other reviews. And nothing has happened since. Are there people willing to review this patch and help it proceed? As this patch is quite large, I am not sure if it is fit to join the last CF. Thoughts? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 01/04/2017 03:21 PM, Dilip Kumar wrote: On Wed, Jan 4, 2017 at 8:05 AM, Tomas Vondra wrote: Attached is v22 of the patch series, rebased to current master and fixing the reported bug. I haven't made any other changes - the issues reported by Petr are mostly minor, so I've decided to wait a bit more for (hopefully) other reviews. v22 fixes the problem I reported. In my test, I observed that group by estimation is much better with the ndistinct stat. Here is one example: postgres=# explain analyze select p_brand, p_type, p_size from part group by p_brand, p_type, p_size; QUERY PLAN --- HashAggregate (cost=37992.00..38992.00 rows=10 width=36) (actual time=953.359..1011.302 rows=186607 loops=1) Group Key: p_brand, p_type, p_size -> Seq Scan on part (cost=0.00..30492.00 rows=100 width=36) (actual time=0.013..163.672 rows=100 loops=1) Planning time: 0.194 ms Execution time: 1020.776 ms (5 rows) postgres=# CREATE STATISTICS s2 WITH (ndistinct) on (p_brand, p_type, p_size) from part; CREATE STATISTICS postgres=# analyze part; ANALYZE postgres=# explain analyze select p_brand, p_type, p_size from part group by p_brand, p_type, p_size; QUERY PLAN --- HashAggregate (cost=37992.00..39622.46 rows=163046 width=36) (actual time=935.162..992.944 rows=186607 loops=1) Group Key: p_brand, p_type, p_size -> Seq Scan on part (cost=0.00..30492.00 rows=100 width=36) (actual time=0.013..156.746 rows=100 loops=1) Planning time: 0.308 ms Execution time: 1001.889 ms In the above example, Without MVStat-> estimated: 10 Actual: 186607 With MVStat-> estimated: 163046 Actual: 186607 Thanks. Those plans match my experiments with the TPC-H data set, although I've been playing with the smallest scale (1GB). It's not very difficult to make the estimation error arbitrarily large, e.g. by using perfectly correlated (identical) columns. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Wed, Jan 4, 2017 at 8:05 AM, Tomas Vondra wrote: > Attached is v22 of the patch series, rebased to current master and fixing > the reported bug. I haven't made any other changes - the issues reported by > Petr are mostly minor, so I've decided to wait a bit more for (hopefully) > other reviews. v22 fixes the problem I reported. In my test, I observed that group by estimation is much better with the ndistinct stat. Here is one example: postgres=# explain analyze select p_brand, p_type, p_size from part group by p_brand, p_type, p_size; QUERY PLAN --- HashAggregate (cost=37992.00..38992.00 rows=10 width=36) (actual time=953.359..1011.302 rows=186607 loops=1) Group Key: p_brand, p_type, p_size -> Seq Scan on part (cost=0.00..30492.00 rows=100 width=36) (actual time=0.013..163.672 rows=100 loops=1) Planning time: 0.194 ms Execution time: 1020.776 ms (5 rows) postgres=# CREATE STATISTICS s2 WITH (ndistinct) on (p_brand, p_type, p_size) from part; CREATE STATISTICS postgres=# analyze part; ANALYZE postgres=# explain analyze select p_brand, p_type, p_size from part group by p_brand, p_type, p_size; QUERY PLAN --- HashAggregate (cost=37992.00..39622.46 rows=163046 width=36) (actual time=935.162..992.944 rows=186607 loops=1) Group Key: p_brand, p_type, p_size -> Seq Scan on part (cost=0.00..30492.00 rows=100 width=36) (actual time=0.013..156.746 rows=100 loops=1) Planning time: 0.308 ms Execution time: 1001.889 ms In the above example, Without MVStat-> estimated: 10 Actual: 186607 With MVStat-> estimated: 163046 Actual: 186607 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
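The fallback these plans are comparing against can be sketched as follows (a standalone illustration of the independence assumption, not the planner's actual code); when a multivariate ndistinct statistic covers the grouping columns, its stored estimate for the whole column group (163046 above) is used instead:

    #include <math.h>

    /*
     * Independence assumption: multiply the per-column ndistinct estimates
     * and clamp the product to the table's row count.  When the grouping
     * columns are correlated this can be off by orders of magnitude, which
     * is what the multivariate ndistinct statistics are meant to fix.
     */
    static double
    groups_independent(const double *col_ndistinct, int ncols, double ntuples)
    {
        double      est = 1.0;
        int         i;

        for (i = 0; i < ncols; i++)
            est *= col_ndistinct[i];
        return fmin(est, ntuples);
    }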
Re: [HACKERS] multivariate statistics (v19)
On 12/30/2016 02:12 PM, Petr Jelinek wrote: On 12/12/16 22:50, Tomas Vondra wrote: On 12/12/2016 12:26 PM, Amit Langote wrote: Hi Tomas, On 2016/10/30 4:23, Tomas Vondra wrote: Hi, Attached is v20 of the multivariate statistics patch series, doing mostly the changes outlined in the preceding e-mail from October 11. The patch series currently has these parts: * 0001 : (FIX) teach pull_varno about RestrictInfo * 0002 : (PATCH) shared infrastructure and ndistinct coefficients Hi, I went over these two (IMHO those could easily be considered as a minimal committable set even if the user-visible functionality they provide is rather limited). Yes, although I still have my doubts 0001 is the right way to make pull_varnos work. It's probably related to the bigger design question, because moving the statistics selection to an earlier phase could make it unnecessary, I guess. dropping statistics --- The statistics may be dropped automatically using DROP STATISTICS. After ALTER TABLE ... DROP COLUMN, statistics referencing are: (a) dropped, if the statistics would reference only one column (b) retained, but modified on the next ANALYZE This should be documented in user-visible form if you plan to keep it (it does make sense to me). Yes, I plan to keep it. I agree it should be documented, probably on the ALTER TABLE page (and linked from CREATE/DROP statistics pages). + therefore perfectly correlated. Providing additional information about + correlation between columns is the purpose of multivariate statistics, + and the rest of this section thoroughly explains how the planner + leverages them to improve estimates. + + + + For additional details about multivariate statistics, see + src/backend/utils/mvstats/README.stats. There are additional + READMEs for each type of statistics, mentioned in the following + sections. + + + I don't think this qualifies as "thoroughly explains" ;) OK, I'll drop the "thoroughly" ;-) + +Oid +get_statistics_oid(List *names, bool missing_ok) No comment? + case OBJECT_STATISTICS: + msg = gettext_noop("statistics \"%s\" does not exist, skipping"); + name = NameListToString(objname); + break; This sounds somewhat weird (plural vs singular). Ah, right - it should be either "statistic ... does not" or "statistics ... do not". I think "statistics" is the right choice here, because (a) we have CREATE STATISTICS and (b) it may be a combination of statistics, e.g. histogram + MCV. + * XXX Maybe this should check for duplicate stats. Although it's not clear + * what "duplicate" would mean here (wheter to compare only keys or also + * options). Moreover, we don't do such checks for indexes, although those + * store tuples and recreating a new index may be a way to fix bloat (which + * is a problem statistics don't have). + */ +ObjectAddress +CreateStatistics(CreateStatsStmt *stmt) I don't think we should check duplicates TBH so I would remove the XXX (also "wheter" is a typo, but if you remove that paragraph it does not matter). Yes, I came to the same conclusion - we can only really check for exact matches (same set of columns, same choice of statistic types), but that's fairly useless. I'll remove the XXX. + if (true) + { Huh? Yeah, that's a bit of a weird pattern. It's a remnant of copy-pasting the preceding block, which looks like this if (hasindex) { ... } But we've decided not to add a similar flag for the statistics. I'll move the block to a separate function (instead of merging it directly into the function, which is already a bit largish). 
+ +List * +RelationGetMVStatList(Relation relation) +{ ... + +void +update_mv_stats(Oid mvoid, MVNDistinct ndistinct, + int2vector *attrs, VacAttrStats **stats) ... +static double +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows, + int2vector *attrs, VacAttrStats **stats, + int k, int *combination) +{ Again, these deserve comment. OK, will add. I'll try to look at other patches in the series as time permits. thanks -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
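As a sketch of the header comments being promised here, one of the functions might be documented along these lines (wording is illustrative only; the actual description should come from the patch):

    /*
     * RelationGetMVStatList
     *		Return the list of multivariate statistics defined on the
     *		given relation.
     */
    List *
    RelationGetMVStatList(Relation relation);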
Re: [HACKERS] multivariate statistics (v19)
On 01/03/2017 02:42 PM, Dilip Kumar wrote: On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra wrote: attached is v21 of the patch series, rebased to current master (resolving the duplicate OID and a few trivial merge conflicts), and also fixing some of the issues you reported. I wanted to test the grouping estimation behaviour with TPC-H. While testing I found a crash, so I thought of reporting it. My setup details: TPC-H scale factor: 5 Applied all the patches of the v21 series, and ran the queries below. postgres=# analyze part; ANALYZE postgres=# CREATE STATISTICS s2 WITH (ndistinct) on (p_brand, p_type, p_size) from part; CREATE STATISTICS postgres=# analyze part; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. I think it should be easily reproducible; in case it's not, I can send a call stack or core dump. Thanks for the report. It was trivial to reproduce and it turned out to be a fairly simple bug. Will send a new version of the patch soon. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Tue, Dec 13, 2016 at 3:20 AM, Tomas Vondra wrote: > attached is v21 of the patch series, rebased to current master (resolving > the duplicate OID and a few trivial merge conflicts), and also fixing some > of the issues you reported. I wanted to test the grouping estimation behaviour with TPC-H. While testing I found a crash, so I thought of reporting it. My setup details: TPC-H scale factor: 5 Applied all the patches of the v21 series, and ran the queries below. postgres=# analyze part; ANALYZE postgres=# CREATE STATISTICS s2 WITH (ndistinct) on (p_brand, p_type, p_size) from part; CREATE STATISTICS postgres=# analyze part; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. I think it should be easily reproducible; in case it's not, I can send a call stack or core dump. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 12/12/16 22:50, Tomas Vondra wrote: > On 12/12/2016 12:26 PM, Amit Langote wrote: >> >> Hi Tomas, >> >> On 2016/10/30 4:23, Tomas Vondra wrote: >>> Hi, >>> >>> Attached is v20 of the multivariate statistics patch series, doing >>> mostly >>> the changes outlined in the preceding e-mail from October 11. >>> >>> The patch series currently has these parts: >>> >>> * 0001 : (FIX) teach pull_varno about RestrictInfo >>> * 0002 : (PATCH) shared infrastructure and ndistinct coefficients Hi, I went over these two (IMHO those could easily be considered as a minimal committable set even if the user-visible functionality they provide is rather limited). > dropping statistics > --- > > The statistics may be dropped automatically using DROP STATISTICS. > > After ALTER TABLE ... DROP COLUMN, statistics referencing are: > > (a) dropped, if the statistics would reference only one column > > (b) retained, but modified on the next ANALYZE This should be documented in user-visible form if you plan to keep it (it does make sense to me). > + therefore perfectly correlated. Providing additional information about > + correlation between columns is the purpose of multivariate statistics, > + and the rest of this section thoroughly explains how the planner > + leverages them to improve estimates. > + > + > + > + For additional details about multivariate statistics, see > + src/backend/utils/mvstats/README.stats. There are additional > + READMEs for each type of statistics, mentioned in the > following > + sections. > + > + > + I don't think this qualifies as "thoroughly explains" ;) > + > +Oid > +get_statistics_oid(List *names, bool missing_ok) No comment? > + case OBJECT_STATISTICS: > + msg = gettext_noop("statistics \"%s\" does not exist, > skipping"); > + name = NameListToString(objname); > + break; This sounds somewhat weird (plural vs singular). > + * XXX Maybe this should check for duplicate stats. Although it's not clear > + * what "duplicate" would mean here (wheter to compare only keys or also > + * options). Moreover, we don't do such checks for indexes, although those > + * store tuples and recreating a new index may be a way to fix bloat (which > + * is a problem statistics don't have). > + */ > +ObjectAddress > +CreateStatistics(CreateStatsStmt *stmt) I don't think we should check duplicates TBH so I would remove the XXX (also "wheter" is a typo, but if you remove that paragraph it does not matter). > + if (true) > + { Huh? > + > +List * > +RelationGetMVStatList(Relation relation) > +{ ... > + > +void > +update_mv_stats(Oid mvoid, MVNDistinct ndistinct, > + int2vector *attrs, VacAttrStats **stats) ... > +static double > +ndistinct_for_combination(double totalrows, int numrows, HeapTuple *rows, > +int2vector *attrs, VacAttrStats **stats, > +int k, int *combination) > +{ Again, these deserve comment. I'll try to look at other patches in the series as time permits. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 12/12/16 22:50, Tomas Vondra wrote: >> + >> +SELECT pg_mv_stats_dependencies_show(stadeps) >> + FROM pg_mv_statistic WHERE staname = 's1'; >> + >> + pg_mv_stats_dependencies_show >> +--- >> + (1) => 2, (2) => 1 >> +(1 row) >> + >> >> Couldn't this somehow show actual column names, instead of attribute >> numbers? >> > > Yeah, I was thinking about that too. The trouble is that's table-level > metadata, so we don't have that kind of info serialized within the data > type (e.g. because it would not handle column renames etc.). > > It might be possible to explicitly pass the table OID as a parameter of > the function, but it seemed a bit ugly to me. I think it makes sense to have such a function; this is not the type's output function, so I think it's OK for it to have the OID as input, especially since in the use case shown above you can use starelid easily. > > FWIW, as I wrote in this thread, the place where this patch series needs > feedback most desperately is integration into the optimizer. Currently > all the magic happens in clausesel.c and does not leave it. I think it > would be good to move some of that (particularly the choice of > statistics to apply) to an earlier stage, and store the information > within the plan tree itself, so that it's available outside clausesel.c > (e.g. for EXPLAIN - showing which stats were picked seems useful). > > I was thinking it might work similarly to the foreign key estimation > patch (100340e2). It might even be more efficient, as the current code > may end up repeating the selection of statistics multiple times. But > enriching the plan tree turned out to be way more invasive than I'm > comfortable with (but maybe that'd be OK). > In theory it seems like a possibly reasonable approach to me, mainly because mv statistics are user-defined objects. I guess we'd have to see at least some PoC to see how invasive it is. But I ultimately think that feedback from a committer who is more familiar with the planner is needed here. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
Hi Tomas, On 2016/10/30 4:23, Tomas Vondra wrote: > Hi, > > Attached is v20 of the multivariate statistics patch series, doing mostly > the changes outlined in the preceding e-mail from October 11. > > The patch series currently has these parts: > > * 0001 : (FIX) teach pull_varno about RestrictInfo > * 0002 : (PATCH) shared infrastructure and ndistinct coefficients > * 0003 : (PATCH) functional dependencies (only the ANALYZE part) > * 0004 : (PATCH) selectivity estimation using functional dependencies > * 0005 : (PATCH) multivariate MCV lists > * 0006 : (PATCH) multivariate histograms > * 0007 : (WIP) selectivity estimation using ndistinct coefficients > * 0008 : (WIP) use multiple statistics for estimation > * 0009 : (WIP) psql tab completion basics Unfortunately, this failed to compile because of the duplicate_oids error. The partitioning patch consumed the same OIDs as used in this patch. I will try to read the patches in some more detail, but in the meantime, here are some comments/nitpicks on the documentation: No updates to doc/src/sgml/catalogs.sgml? + + The examples presented in used + statistics about individual columns to compute selectivity estimates. + When estimating conditions on multiple columns, the planner assumes + independence and multiplies the selectivities. When the columns are + correlated, the independence assumption is violated, and the estimates + may be seriously off, resulting in poor plan choices. + The term independence is used in isolation - independence of what? Independence of the distributions of values in separate columns? Also, the phrase "seriously off" could perhaps be replaced by more rigorous terminology; it might be unclear to some readers. Perhaps: wildly inaccurate, :) + +EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1; + QUERY PLAN +- + Seq Scan on t (cost=0.00..170.00 rows=100 width=8) (actual time=0.031..2.870 rows=100 loops=1) + Filter: (a = 1) + Rows Removed by Filter: 9900 + Planning time: 0.092 ms + Execution time: 3.103 ms Is there a reason why examples in "67.2. Multivariate Statistics" (like the one above) use EXPLAIN ANALYZE, whereas those in "67.1. Row Estimation Examples" (also, other relevant chapters) use just EXPLAIN? + the final 0.01% estimate. The plan however shows that this results in + a significant under-estimate, as the actual number of rows matching the s/under-estimate/underestimate/g + + For additional details about multivariate statistics, see + src/backend/utils/mvstats/README.statsc. There are additional + README for each type of statistics, mentioned in the following + sections. + Referring to source tree READMEs seems novel around this portion of the documentation, but I think there are some references not too far away. This is under the VII. Internals chapter anyway, so that might be OK. In any case, s/README.statsc/README.stats/g Also, s/additional README/additional READMEs/g (tags omitted for brevity) +used in definitions of database normal forms. When simplified, saying that +b is functionally dependent on a means that Maybe, s/When simplified/In simple terms/g +In normalized databases, only functional dependencies on primary keys +and super keys are allowed. In practice however many data sets are not +fully normalized, for example thanks to intentional denormalization for +performance reasons. The table t is an example of a data +with functional dependencies. As a=b for all rows in the +table, a is functionally dependent on b and +b is functionally dependent on a. "super keys" sounds like a new term. 
s/for example thanks to/for example, thanks to/g (or due to instead of thanks to) How about: s/an example of a data with/an example of a schema with/g Perhaps, s/a=b/a = b/g (additional white space) +Similarly to per-column statistics, multivariate statistics are stored in I notice that "similar to" is used more often than "similarly to". But that might be OK. + This shows that the statistics is defined on table t, Perhaps: "the statistics is" -> "the statistics are", or "the statistic is" + lists attnums of the columns (references + pg_attribute). While this text may be OK on the catalog description page, it might be better to expand attnums here as "attribute numbers", dropping the parenthesized phrase altogether. + +SELECT pg_mv_stats_dependencies_show(stadeps) + FROM pg_mv_statistic WHERE staname = 's1'; + + pg_mv_stats_dependencies_show +--- + (1) => 2, (2) => 1 +(1 row) + Couldn't this somehow show actual column names, instead of attribute numbers? Will read more later. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
Hi everyone, thanks for the reviews. Let me sum up the feedback so far, and outline my plans for the next patch version that I'd like to submit for CF 2016-11. 1) syntax changes I agree with the changes proposed by Dean, although only a subset of the syntax is going to be supported until we add support for either join or partial statistics. So something like this: CREATE STATISTICS name [ WITH (options) ] ON (column1, column2 [, ...]) FROM table That shouldn't be a difficult change. 2) catalog names I'm not sure what the best names are, so I'm fine with using whatever is the consensus. That being said, I'm not sure I like extending the catalog to also support non-multivariate statistics (like for example statistics on expressions). While that would be a clearly useful feature, it seems like a slightly different use case and perhaps a separate catalog would be better. So maybe pg_statistic_ext is not the best name. 3) special data type(s) to store statistics I agree using an opaque bytea value is not very nice. I see Heikki proposed using something like pg_node_tree, and maybe storing all the statistics in a single value. I assume the pg_node_tree was meant only as inspiration for how to build a pseudo-type on top of a varlena value. I agree that's a good idea, and I plan to do something like that - say, adding pg_mcv, pg_histogram, pg_ndistinct and pg_dependencies data types. Heikki also mentioned that maybe JSONB would be a good way to store the statistics. I don't think so - firstly, it only supports a subset of data types, so we'd be unable to store statistics for some data types (or we'd have to store them as text, which sucks). Also, there's a fair amount of smartness in how the statistics are stored (e.g. how the histogram bucket boundaries are deduplicated, or how the estimation uses the serialized representation directly). We'd lose all of that when using JSONB. Similarly for storing all the statistics in a single value - I see no reason why keeping the statistics in separate columns would be a bad idea (after all, that's kinda the point of relational databases). Also, there are perfectly valid cases when the caller only needs a particular type of statistic - e.g. when estimating GROUP BY we'll only need the ndistinct coefficients. Why should we force the caller to fetch and detoast everything, and throw away probably 99% of that? So my plan here is to define pseudo-types similar to how pg_node_tree is defined. That does not seem like a tremendous amount of work. 4) functional dependencies Several people mentioned they don't like how functional dependencies are detected at ANALYZE time, particularly that there's a sudden jump between 0 and 1. Instead, a continuous "dependency degree" between 0 and 1 was proposed. I'm fine with that, although that makes "clause reduction" (deciding that we don't need to estimate one of the clauses at all, as it's implied by some other clause) impossible. But that's fine; the functional dependencies will still be much less expensive than the other statistics. I'm wondering how this will interact with transitivity, though. IIRC the current implementation is able to detect transitive dependencies and use that to reduce storage space etc. In any case, this significantly complicates the functional dependencies, which were meant as a trivial type of statistics, mostly to establish the shared infrastructure. Which brings me to ndistinct. 
5) ndistinct So far, the ndistinct coefficients were lumped at the very end of the patch, and the statistic was only built but not used for any sort of estimation. I agree with Dean that perhaps it'd be better to move this to the very beginning, and use it as the simplest statistic to build the infrastructure instead of functional dependencies (which only gets truer due to the changes in functional dependencies, discussed in the preceding section). I think it's probably a good idea and I plan to do that, so the patch series will probably look like this: * 001 - CREATE STATISTICS infrastructure with ndistinct coefficients * 002 - use ndistinct coefficients to improve GROUP BY estimates * 003 - use ndistinct coefficients in clausesel.c (not sure) * 004 - add functional dependencies (build + clausesel.c) * 005 - add multivariate MCV (build + clausesel.c) * 006 - add multivariate histograms (build + clausesel.c) I'm not sure about using the ndistinct coefficients in clausesel.c to estimate regular conditions - it's the place for which ndistinct coefficients were originally proposed by Kyotaro-san, but I seem to remember it was non-trivial to choose the best statistics when there were other types of stats available. But I'll look into that. 6) combining statistics I've decided not to re-submit this part of the patch until the basic functionality gets in. I do think it's a very useful feature (despite having my doubts about the ex
Re: [HACKERS] multivariate statistics (v19)
On 4 October 2016 at 09:15, Heikki Linnakangas wrote: > However, for tables and views, each object you store in those views is a > "table" or "view", but with this thing, the object you store is > "statistics". Would you have a catalog table called "pg_scissor"? > No, probably not (unless it was storing individual scissor blades). However, in this case, we have related pre-existing catalog tables, so... > We call the current system table "pg_statistic", though. I agree we should > call it pg_mv_statistic, in singular, to follow the example of pg_statistic. > > Of course, the user-friendly system view on top of that is called > "pg_stats", just to confuse things more :-). > I agree. Given where we are, with a pg_statistic table and a pg_stats view, I think the least worst solution is to have a pg_statistic_ext table, and then maybe a pg_stats_ext view. >> It doesn't seem like the end of the world that it doesn't >> match the user-facing syntax. A bigger concern is the use of "mv" in >> the name, because as has already been pointed out, this table may also >> in the future be used to store univariate expression and partial >> statistics, so I think we should drop the "mv" and go with something >> like pg_statistic_ext, or some other more general name. > > > Also, "mv" makes me think of materialized views, which is completely > unrelated to this. > Yeah, I hadn't thought of that. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 04/10/16 20:37, Dean Rasheed wrote: On 4 October 2016 at 04:25, Michael Paquier wrote: OK. A second thing was related to the use of schemas in the new system catalogs. As mentioned in [1], those could be removed. [1]: https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com. That doesn't work, because if the intention is to be able to one day support statistics across multiple tables, you can't assume that the statistics are in the same schema as the table. In fact, if multi-table statistics are to be allowed in the future, I think you want to move away from thinking of statistics as depending on and referring to a single table, and handle them more like views -- i.e., store a pg_node_tree representing the from_clause and add multiple dependencies at statistics creation time. That was what I was getting at upthread when I suggested the alternate syntax, and also answers Tomas' question about how JOIN might one day be supported. Of course, if we don't think that we will ever support multi-table statistics, that all goes away, and you may as well make the statistics name local to the table, but I think that's a bit limiting. One way or the other, I think this is a question that needs to be answered now. My vote is to leave expansion room to support multi-table statistics in the future. Regards, Dean I can see multi-table statistics being useful if one is trying to optimise indexes for multiple joins. I'm assuming that the statistics can be accessed by the user as well as the planner? (I've only lightly followed this thread, so I might have missed significant relevant details!) Cheers, Gavin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 10/04/2016 10:49 AM, Dean Rasheed wrote: On 30 September 2016 at 12:10, Heikki Linnakangas wrote: I fear that using "statistics" as the name of the new object might get a bit awkward. "statistics" is a plural, but we use it as the name of a single object, like "pants" or "scissors". Not sure I have any better ideas though. "estimator"? "statistics collection"? Or perhaps it should be singular, "statistic". I note that you actually called the system table "pg_mv_statistic", in singular. I think it's OK. The functional dependency is a single statistic, but MCV lists and histograms are multiple statistics (multiple facts about the data sampled), so in general when you create one of these new objects, you are creating multiple statistics about the data. Ok. I don't really have any better ideas, was just hoping that someone else would. Also I find "CREATE STATISTIC" just sounds a bit clumsy compared to "CREATE STATISTICS". Agreed. The convention for naming system catalogs seems to be to use the singular for tables and plural for views, so I guess we should stick with that. However, for tables and views, each object you store in those views is a "table" or "view", but with this thing, the object you store is "statistics". Would you have a catalog table called "pg_scissor"? We call the current system table "pg_statistic", though. I agree we should call it pg_mv_statistic, in singular, to follow the example of pg_statistic. Of course, the user-friendly system view on top of that is called "pg_stats", just to confuse things more :-). It doesn't seem like the end of the world that it doesn't match the user-facing syntax. A bigger concern is the use of "mv" in the name, because as has already been pointed out, this table may also in the future be used to store univariate expression and partial statistics, so I think we should drop the "mv" and go with something like pg_statistic_ext, or some other more general name. Also, "mv" makes me think of materialized views, which is completely unrelated to this. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 30 September 2016 at 12:10, Heikki Linnakangas wrote: > I fear that using "statistics" as the name of the new object might get a bit > awkward. "statistics" is a plural, but we use it as the name of a single > object, like "pants" or "scissors". Not sure I have any better ideas though. > "estimator"? "statistics collection"? Or perhaps it should be singular, > "statistic". I note that you actually called the system table > "pg_mv_statistic", in singular. > I think it's OK. The functional dependency is a single statistic, but MCV lists and histograms are multiple statistics (multiple facts about the data sampled), so in general when you create one of these new objects, you are creating multiple statistics about the data. Also I find "CREATE STATISTIC" just sounds a bit clumsy compared to "CREATE STATISTICS". The convention for naming system catalogs seems to be to use the singular for tables and plural for views, so I guess we should stick with that. It doesn't seem like the end of the world that it doesn't match the user-facing syntax. A bigger concern is the use of "mv" in the name, because as has already been pointed out, this table may also in the future be used to store univariate expression and partial statistics, so I think we should drop the "mv" and go with something like pg_statistic_ext, or some other more general name. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 4 October 2016 at 04:25, Michael Paquier wrote: > OK. A second thing was related to the use of schemas in the new system > catalogs. As mentioned in [1], those could be removed. > [1]: > https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com. > That doesn't work, because if the intention is to be able to one day support statistics across multiple tables, you can't assume that the statistics are in the same schema as the table. In fact, if multi-table statistics are to be allowed in the future, I think you want to move away from thinking of statistics as depending on and referring to a single table, and handle them more like views -- i.e., store a pg_node_tree representing the from_clause and add multiple dependencies at statistics creation time. That was what I was getting at upthread when I suggested the alternate syntax, and also answers Tomas' question about how JOIN might one day be supported. Of course, if we don't think that we will ever support multi-table statistics, that all goes away, and you may as well make the statistics name local to the table, but I think that's a bit limiting. One way or the other, I think this is a question that needs to be answered now. My vote is to leave expansion room to support multi-table statistics in the future. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
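A sketch of the dependency-recording step suggested above, using the regular pg_depend machinery at CREATE STATISTICS time; MvStatisticRelationId is a hypothetical stand-in for the new catalog's OID macro, while the rest is the standard catalog API:

    #include "postgres.h"
    #include "catalog/dependency.h"
    #include "catalog/objectaddress.h"
    #include "catalog/pg_class.h"
    #include "nodes/pg_list.h"

    /*
     * Record one dependency per table referenced by the new statistics
     * object, so that dropping any of the referenced tables also takes
     * care of the statistics.
     */
    static void
    record_stats_dependencies(Oid statoid, List *relids)
    {
        ObjectAddress myself,
                    referenced;
        ListCell   *lc;

        ObjectAddressSet(myself, MvStatisticRelationId, statoid);

        foreach(lc, relids)
        {
            ObjectAddressSet(referenced, RelationRelationId, lfirst_oid(lc));
            recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
        }
    }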
Re: [HACKERS] multivariate statistics (v19)
On Mon, Oct 3, 2016 at 8:25 PM, Heikki Linnakangas wrote: > Yeah. The idea was to use something like pg_node_tree to store all the > different kinds of statistics, the histogram, the MCV, and the functional > dependencies, in one datum. Or JSON, maybe. It sounds better than an opaque > bytea blob, although I'd prefer something more relational. For the > functional dependencies, I think we could get away with a simple float > array, so let's do that in the first cut, and revisit this for the MCV and > histogram later. OK. A second thing was related to the use of schemas in the new system catalogs. As mentioned in [1], those could be removed. [1]: https://www.postgresql.org/message-id/cab7npqtu40q5_nsghvomjfbyh1hdtqmbfdj+kwfjspam35b...@mail.gmail.com. > Separate columns for the functional dependencies, the MCVs, > and the histogram, probably makes sense anyway. Probably.. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 10/03/2016 04:46 AM, Michael Paquier wrote: On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas wrote: This patch set is in pretty good shape, the only problem is that it's so big that no-one seems to have the time or courage to do the final touches and commit it. Did you see my suggestions about simplifying its SQL structure? You could shave some code without impacting the base set of features. Yeah. The idea was to use something like pg_node_tree to store all the different kinds of statistics, the histogram, the MCV, and the functional dependencies, in one datum. Or JSON, maybe. It sounds better than an opaque bytea blob, although I'd prefer something more relational. For the functional dependencies, I think we could get away with a simple float array, so let's do that in the first cut, and revisit this for the MCV and histogram later. Separate columns for the functional dependencies, the MCVs, and the histogram, probably makes sense anyway. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
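A minimal sketch of the "simple float array" idea above: pack one dependency degree per element into a float8[] that could be stored in a catalog column. The one-degree-per-dependency layout and the ordering are assumed conventions, not the patch's:

    #include "postgres.h"
    #include "catalog/pg_type.h"
    #include "utils/array.h"

    /*
     * Build a float8[] holding one degree per detected dependency, in the
     * same order as the dependency list (assumed convention).
     */
    static ArrayType *
    degrees_to_array(const double *degrees, int ndeps)
    {
        Datum      *elems = (Datum *) palloc(ndeps * sizeof(Datum));
        int         i;

        for (i = 0; i < ndeps; i++)
            elems[i] = Float8GetDatum(degrees[i]);

        return construct_array(elems, ndeps, FLOAT8OID,
                               sizeof(float8), FLOAT8PASSBYVAL, 'd');
    }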
Re: [HACKERS] multivariate statistics (v19)
On Fri, Sep 30, 2016 at 8:10 PM, Heikki Linnakangas wrote: > This patch set is in pretty good shape, the only problem is that it's so big > that no-one seems to have the time or courage to do the final touches and > commit it. Did you see my suggestions about simplifying its SQL structure? You could shave some code without impacting the base set of features. > I fear that using "statistics" as the name of the new object might get a bit > awkward. "statistics" is a plural, but we use it as the name of a single > object, like "pants" or "scissors". Not sure I have any better ideas though. > "estimator"? "statistics collection"? Or perhaps it should be singular, > "statistic". I note that you actually called the system table > "pg_mv_statistic", in singular. > > I'm not a big fan of storing the stats as just a bytea blob, and having to > use special functions to interpret it. By looking at the patch, it's not > clear to me what we actually store for functional dependencies. A list of > attribute numbers? Could we store them simply as an int[]? (I'm not a big > fan of the hack in pg_statistic, that allows storing arrays of any data type > in the same column, though. But for functional dependencies, I don't think > we need that.) I am marking this patch as returned with feedback for now. > Overall, this is going to be a great feature! +1. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
This patch set is in pretty good shape, the only problem is that it's so big that no-one seems to have the time or courage to do the final touches and commit it. If we just focus on the functional dependencies part for now, I think we might get somewhere. I peeked at the MCV and histogram patches too, and I think they make total sense as well, and are a natural extension of the functional dependencies patch. So if we just focus on that for now, I don't think we will paint ourselves into a corner. (more below) On 09/14/2016 01:01 AM, Tomas Vondra wrote: On 09/12/2016 04:08 PM, Dean Rasheed wrote: Regarding the statistics themselves, I read the description of soft functional dependencies, and I'm somewhat skeptical about that algorithm. I don't like the arbitrary thresholds or the sudden jump from independence to dependence and clause reduction. As others have said, I think this should account for a continuous spectrum of dependence from fully independent to fully dependent, and combine clause selectivities in a way based on the degree of dependence. For example, if you computed an estimate for the fraction 'f' of the table's rows for which a -> b, then it might be reasonable to combine the selectivities using P(a,b) = P(a) * (f + (1-f) * P(b)) Yeah, I agree that the thresholds resulting in sudden changes between "dependent" and "not dependent" are annoying. The question is whether it makes sense to fix that, though - the functional dependencies were meant as the simplest form of statistics, allowing us to get the rest of the infrastructure in. I'm OK with replacing the true/false dependencies with a degree of dependency between 0 and 1, but I'm a bit afraid it'll result in complaints that the first patch got too large / complicated. +1 for using a floating degree between 0 and 1, rather than a boolean. It also contradicts the idea of using functional dependencies as a low-overhead type of statistics, filtering the list of clauses that need to be estimated using more expensive types of statistics (MCV lists, histograms, ...). Switching to a degree of dependency would prevent removal of "unnecessary" clauses. That sounds OK to me, although I'm not deeply familiar with this patch yet. Of course, having just a single number that tells you the columns are correlated, tells you nothing about whether the clauses on those columns are consistent with that correlation. For example, in the following table CREATE TABLE t(a int, b int); INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x); 'b' is functionally dependent on 'a' (and vice versa), but if you query the rows with a<50 and with b<50, those clauses behave essentially independently, because they're not consistent with the functional dependence between 'a' and 'b', so the best way to combine their selectivities is just to multiply them, as we currently do. So whilst it may be interesting to determine that 'b' is functionally dependent on 'a', it's not obvious whether that fact by itself should be used in the selectivity estimates. Perhaps it should, on the grounds that it's best to attempt to use all the available information, but only if there are no more detailed statistics available. In any case, knowing that there is a correlation can be used as an indicator that it may be worthwhile to build more detailed multivariate statistics like a MCV list or a histogram on those columns. Right. IIRC this is actually described in the README as "incompatible conditions". 
While implementing it, I concluded that this is OK and it's up to the developer to decide whether the queries are compatible with the "assumption of compatibility". But maybe this reasoning is bogus and makes (the current implementation of) functional dependencies unusable in practice. I think that's OK. It seems like a good assumption that the conditions are "compatible" with the functional dependency. For two reasons: 1) A query with compatible clauses is much more likely to occur in real life. Why would you run a query with incompatible ZIP and city clauses? 2) If the conditions were in fact incompatible, the query is likely to return 0 rows, and will bail out very quickly, even if the estimates are way off and you choose a non-optimal plan. There are exceptions, of course: an index scan might be able to conclude that there are no rows much quicker than a seqscan, but as a general rule of thumb, a query that returns 0 rows isn't very sensitive to the chosen plan. And of course, as long as we're not collecting these statistics automatically, if it doesn't work for your application, just don't collect them. I fear that using "statistics" as the name of the new object might get a bit awkward. "statistics" is a plural, but we use it as the name of a single object, like "pants" or "scissors". Not sure I have any better ideas though. "estimator"? "statistics collection"? Or perhaps it should be singular, "statistic". I not
Re: [HACKERS] multivariate statistics (v19)
Hi, Thanks for looking into this! On 09/12/2016 04:08 PM, Dean Rasheed wrote: On 3 August 2016 at 02:58, Tomas Vondra wrote: Attached is v19 of the "multivariate stats" patch series Hi, I started looking at this - just at a very high level - I've not read much of the detail yet, but here are some initial review comments. I think the overall infrastructure approach for CREATE STATISTICS makes sense, and I agree with other suggestions upthread that it would be useful to be able to build statistics on arbitrary expressions, although that doesn't need to be part of this patch, it's useful to keep that in mind as a possible future extension of this initial design. I can imagine it being useful to be able to create user-defined statistics on an arbitrary list of expressions, and I think that would include univariate as well as multivariate statistics. Perhaps that's something to take into account in the naming of things, e.g., as David Rowley suggested, something like pg_statistic_ext, rather than pg_mv_statistic. I also like the idea that this might one day be extended to support statistics across multiple tables, although I think that might be challenging to achieve -- you'd need a method of taking a random sample of rows from a join between 2 or more tables. However, if the intention is to be able to support that one day, I think that needs to be accounted for in the syntax now -- specifically, I think it will be too limiting to only support things extending the current syntax of the form table1(col1, col2, ...), table2(col1, col2, ...), because that precludes building statistics on an expression referring to columns from more than one table. So I think we should plan further ahead and use a syntax giving greater flexibility in the future, for example something structured more like a query (like CREATE VIEW): CREATE STATISTICS name [ WITH (options) ] ON expression [, ...] FROM table [, ...] WHERE condition where the first version of the patch would only support expressions that are simple column references, and would require at least 2 such columns from a single table with no WHERE clause, i.e.: CREATE STATISTICS name [ WITH (options) ] ON column1, column2 [, ...] FROM table For multi-table statistics, a WHERE clause would typically be needed to specify how the tables are expected to be joined, but potentially such a clause might also be useful in single-table statistics, to build partial statistics on a commonly queried subset of the table, just like a partial index. Hmm, the "partial statistics" idea seems interesting. It would allow us to provide additional / more detailed statistics only for a subset of a table. However, I'm not sure about the join case - how would the syntax work with outer joins? But as you said, we only need CREATE STATISTICS name [ WITH (options) ] ON (column1, column2 [, ...]) FROM table WHERE condition until we add support for join statistics. Regarding the statistics themselves, I read the description of soft functional dependencies, and I'm somewhat skeptical about that algorithm. I don't like the arbitrary thresholds or the sudden jump from independence to dependence and clause reduction. As others have said, I think this should account for a continuous spectrum of dependence from fully independent to fully dependent, and combine clause selectivities in a way based on the degree of dependence. 
For example, if you computed an estimate for the fraction 'f' of the table's rows for which a -> b, then it might be reasonable to combine the selectivities using P(a,b) = P(a) * (f + (1-f) * P(b)) Yeah, I agree that the thresholds resulting in sudden changes between "dependent" and "not dependent" are annoying. The question is whether it makes sense to fix that, though - the functional dependencies were meant as the simplest form of statistics, allowing us to get the rest of the infrastructure in. I'm OK with replacing the true/false dependencies with a degree of dependency between 0 and 1, but I'm a bit afraid it'll result in complaints that the first patch got too large / complicated. It also contradicts the idea of using functional dependencies as a low-overhead type of statistics, filtering the list of clauses that need to be estimated using more expensive types of statistics (MCV lists, histograms, ...). Switching to a degree of dependency would prevent removal of "unnecessary" clauses. Of course, having just a single number that tells you the columns are correlated, tells you nothing about whether the clauses on those columns are consistent with that correlation. For example, in the following table CREATE TABLE t(a int, b int); INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x); 'b' is functionally dependent on 'a' (and vice versa), but if you query the rows with a<50 and with b<50, those clauses behave essentially independently, because they're not consistent with the functional
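To make the combination rule quoted above concrete, a small standalone example with illustrative numbers: P(a) = 0.1, P(b) = 0.2 and a dependency degree f = 0.8. Full independence (f = 0) would give 0.02, full dependence (f = 1) would give 0.1, and the rule lands in between:

    #include <stdio.h>

    int
    main(void)
    {
        double      sel_a = 0.1;    /* P(a) */
        double      sel_b = 0.2;    /* P(b) */
        double      f = 0.8;        /* estimated degree of a -> b */

        /* P(a,b) = P(a) * (f + (1 - f) * P(b)) */
        double      sel_ab = sel_a * (f + (1.0 - f) * sel_b);

        printf("%g\n", sel_ab);     /* prints 0.084 */
        return 0;
    }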
Re: [HACKERS] multivariate statistics (v19)
On 3 August 2016 at 02:58, Tomas Vondra wrote: > Attached is v19 of the "multivariate stats" patch series Hi, I started looking at this - just at a very high level - I've not read much of the detail yet, but here are some initial review comments. I think the overall infrastructure approach for CREATE STATISTICS makes sense, and I agree with other suggestions upthread that it would be useful to be able to build statistics on arbitrary expressions, although that doesn't need to be part of this patch, it's useful to keep that in mind as a possible future extension of this initial design. I can imagine it being useful to be able to create user-defined statistics on an arbitrary list of expressions, and I think that would include univariate as well as multivariate statistics. Perhaps that's something to take into account in the naming of things, e.g., as David Rowley suggested, something like pg_statistic_ext, rather than pg_mv_statistic. I also like the idea that this might one day be extended to support statistics across multiple tables, although I think that might be challenging to achieve -- you'd need a method of taking a random sample of rows from a join between 2 or more tables. However, if the intention is to be able to support that one day, I think that needs to be accounted for in the syntax now -- specifically, I think it will be too limiting to only support things extending the current syntax of the form table1(col1, col2, ...), table2(col1, col2, ...), because that precludes building statistics on an expression referring to columns from more than one table. So I think we should plan further ahead and use a syntax giving greater flexibility in the future, for example something structured more like a query (like CREATE VIEW): CREATE STATISTICS name [ WITH (options) ] ON expression [, ...] FROM table [, ...] WHERE condition where the first version of the patch would only support expressions that are simple column references, and would require at least 2 such columns from a single table with no WHERE clause, i.e.: CREATE STATISTICS name [ WITH (options) ] ON column1, column2 [, ...] FROM table For multi-table statistics, a WHERE clause would typically be needed to specify how the tables are expected to be joined, but potentially such a clause might also be useful in single-table statistics, to build partial statistics on a commonly queried subset of the table, just like a partial index. Of course, I'm not suggesting that the current patch do any of that -- it's big enough as it is. I'm just throwing out possible future directions this might go in, so that we don't get painted into a corner when designing the syntax for the current patch. Regarding the statistics themselves, I read the description of soft functional dependencies, and I'm somewhat skeptical about that algorithm. I don't like the arbitrary thresholds or the sudden jump from independence to dependence and clause reduction. As others have said, I think this should account for a continuous spectrum of dependence from fully independent to fully dependent, and combine clause selectivities in a way based on the degree of dependence. For example, if you computed an estimate for the fraction 'f' of the table's rows for which a -> b, then it might be reasonable to combine the selectivities using P(a,b) = P(a) * (f + (1-f) * P(b)) Of course, having just a single number that tells you the columns are correlated, tells you nothing about whether the clauses on those columns are consistent with that correlation. 
For example, in the following table CREATE TABLE t(a int, b int); INSERT INTO t SELECT x/10, ((x/10)*789)%100 FROM generate_series(0,999) g(x); 'b' is functionally dependent on 'a' (and vice versa), but if you query the rows with a<50 and with b<50, those clauses behave essentially independently, because they're not consistent with the functional dependence between 'a' and 'b', so the best way to combine their selectivities is just to multiply them, as we currently do. So whilst it may be interesting to determine that 'b' is functionally dependent on 'a', it's not obvious whether that fact by itself should be used in the selectivity estimates. Perhaps it should, on the grounds that it's best to attempt to use all the available information, but only if there are no more detailed statistics available. In any case, knowing that there is a correlation can be used as an indicator that it may be worthwhile to build more detailed multivariate statistics like a MCV list or a histogram on those columns. Looking at the ndistinct coefficient 'q', I think it would be better if the recorded statistic were just the estimate for ndistinct(a,b,...) rather than a ratio of ndistinct values. That's a more fundamental statistic, and it's easier to document and easier to interpret. Also, I don't believe that the coefficient 'q' is the right number to use for clause estimation: Looking at README.ndistinct, I'm skeptical about the
Re: [HACKERS] multivariate statistics (v19)
On Wed, Aug 24, 2016 at 2:03 AM, Robert Haas wrote: > ISTR that you were going to try to look at this patch set. It seems > from the discussion that it's not really ready for serious > consideration for commit yet, but also that some high-level design > comments from you at this stage could go a long way toward making sure > that the final form of the patch is something that will be acceptable. > > I'd really like to see us get some kind of capability along these > lines, but I'm sure it will go a lot better if you or Dean handle it > than if I try to do it ... not to mention that there are only so many > hours in the day. Agreed. What I have been able to look at until now is the high-level structure of the patch. I think that we should really shave down 0002 and simplify it to get a core infrastructure in place, but the core patch is at another level, and it would be good to get some feedback regarding the structure of the patch and whether it is moving in the right direction or not. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Tue, Aug 2, 2016 at 9:58 PM, Tomas Vondra wrote: > Attached is v19 of the "multivariate stats" patch series - essentially v18 > rebased on top of current master. Tom: ISTR that you were going to try to look at this patch set. It seems from the discussion that it's not really ready for serious consideration for commit yet, but also that some high-level design comments from you at this stage could go a long way toward making sure that the final form of the patch is something that will be acceptable. I'd really like to see us get some kind of capability along these lines, but I'm sure it will go a lot better if you or Dean handle it than if I try to do it ... not to mention that there are only so many hours in the day. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 08/10/2016 06:41 AM, Michael Paquier wrote:

On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra wrote:

1) enriching the query tree with multivariate statistics info

Right now all the stuff related to multivariate statistics estimation happens in clausesel.c - matching conditions to statistics, selection of statistics to use (if there are multiple usable stats), etc. So pretty much all this info is internal to clausesel.c and does not get outside.

This does not seem bad to me at first sight but...

I'm starting to think that some of the steps (matching quals to stats, selection of stats) should happen in a "preprocess" step before the actual estimation, storing the information (which stats to use, etc.) in a new type of node in the query tree - something like RestrictInfo. I believe this needs to happen sometime after deconstruct_jointree() as that builds RestrictInfo nodes, and looking at planmain.c, right after extract_restriction_or_clauses seems about right. Haven't tried, though.

This would move all the "statistics selection" logic from clausesel.c, separating it from the "actual estimation" and simplifying the code.

But more importantly, I think we'll need to show some of the data in EXPLAIN output. With per-column statistics it's fairly straightforward to determine which statistics are used and how. But with multivariate stats things are often more complicated - there may be multiple candidate statistics (e.g. histograms covering different subsets of the conditions), it's possible to apply them in different orders, etc. But EXPLAIN can't show the info if it's ephemeral and available only within clausesel.c (and thrown away after the estimation).

This gives a good reason to not do that in clausesel.c; it would be really cool to be able to get some information regarding the stats used with a simple EXPLAIN.

I've been thinking about this, and I'm afraid it's way more complicated in practice. It essentially means doing something like

  rel->baserestrictinfo = enrichWithStatistics(rel->baserestrictinfo);

for each table (and in the future maybe also for joins etc.) But as the name suggests the list should only include RestrictInfo nodes, which seems to contradict the transformation. For example with conditions

  WHERE (a=1) AND (b=2) AND (c=3)

the list will contain 3 RestrictInfos. But if there's a statistics on (a,b,c), we need to note that somehow - my plan was to inject a node storing this information, something like (a bit simplified):

  StatisticsInfo {
      Oid   statisticsoid;    /* OID of the statistics */
      List *mvconditions;     /* estimate using the statistics */
      List *otherconditions;  /* estimate the old way */
  }

But that'd clearly violate the assumption that baserestrictinfo only contains RestrictInfo. I don't think it's feasible (or desirable) to rework all the places to expect both RestrictInfo and the new node. I can think of two alternatives:

1) keep the transformed list as a separate list, next to baserestrictinfo

This obviously fixes the issue, as each caller can decide which node it wants. But it also means we need to maintain two lists instead of one, and keep them synchronized.

2) embed the information into the existing tree

It might be possible to store the information in existing nodes, i.e. each node would track whether it's estimated the "old way" or using multivariate statistics (and which one). But it would require changing many of the existing nodes (at least those compatible with multivariate statistics: currently OpExpr, NullTest, ...). And it also seems fairly difficult to reconstruct the information during the estimation, as it'd be necessary to look for other nodes to be estimated by the same statistics. Which seems to defeat the idea of preprocessing to some degree.

So I'm not sure what the best solution is. I'm leaning to (1), i.e. keeping a separate list, but I'd welcome other ideas.
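To sketch what (1) might mean on the estimation side - purely illustrative, with made-up field and helper names, none of this is actual patch code:

  /*
   * Sketch of option (1): "statsclauseinfo" is a hypothetical list kept
   * on RelOptInfo next to baserestrictinfo, with one StatisticsInfo node
   * per statistics object chosen by the preprocess step. Both helper
   * functions called below are placeholders.
   */
  static Selectivity
  clauselist_selectivity_with_stats(PlannerInfo *root, RelOptInfo *rel)
  {
      Selectivity sel = 1.0;
      ListCell   *lc;

      foreach(lc, rel->statsclauseinfo)
      {
          StatisticsInfo *info = (StatisticsInfo *) lfirst(lc);

          /* clauses covered by the statistics with OID info->statisticsoid */
          sel *= mv_clauselist_selectivity(root, info);

          /* the remaining clauses, estimated the old way */
          sel *= ordinary_clauselist_selectivity(root, info->otherconditions);
      }

      return sel;
  }

And since such a list would survive past the estimation, EXPLAIN could later walk the same nodes to report which statistics were applied to which clauses.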
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Thu, Aug 11, 2016 at 3:34 AM, Tomas Vondra wrote:
> On 08/10/2016 02:23 PM, Michael Paquier wrote:
>> On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra wrote:
>>> The idea is that the syntax should work even for statistics built on
>>> multiple tables, e.g. to provide better statistics for joins. That's why
>>> the schema may be specified (as each table might be in different schema),
>>> and so on.
>>
>> So you mean that the same statistics could be shared between tables?
>> But as this is visibly not a concept introduced yet in this set of
>> patches, why not just cut it off for now to simplify the whole? If
>> there is no schema-related field in pg_mv_statistics we could still
>> add it later if it proves to be useful.
>
> Yes, I think creating statistics on multiple tables is one of the possible
> future directions. One of the previous patch versions introduced ALTER TABLE
> ... ADD STATISTICS syntax, but that ran into issues in gram.y, and given the
> multi-table possibilities the CREATE STATISTICS seems like a much better
> idea anyway.
>
> But I guess you're right we may make this a bit more strict now, and relax
> it in the future if needed. For example as we only support single-table
> statistics at this point, we may remove the schema and always create the
> statistics in the schema of the table.

This would simplify the code a bit, so I'd suggest removing that from the first shot. If there is demand for it, keeping the infrastructure open for this extension is what we had better do.

> But I don't think we should make the statistics names unique only within a
> table (instead of within the schema).

They could be made unique using (name, table_oid, column_list).

There is a lot of mumbo-jumbo regarding the way dependencies are stored, mainly serialize_mv_dependencies and deserialize_mv_dependencies, which convert them between bytea and dependency trees. That's not cool and not portable, because pg_mv_statistic represents that as pure bytea. I would suggest creating a generic data type that does those operations, named something like pg_dependency_tree, and then using that in those new catalogs. pg_node_tree is a precedent for such a thing. New features could as well make use of this new data type if we are able to design it in a generic enough way, so that would be a base patch that the current 0002 applies on top of.

>>> Interesting idea, haven't thought about that. So are you suggesting to
>>> add a data type for each statistics type (dependencies, MCV, histogram, ...)?
>>
>> Yes, that would be something like that; it would actually perhaps be
>> better to have one single data type, and be able to switch between
>> each model easily instead of putting byteas in the catalog.
>
> Hmmm, not sure about that. For example what about combinations of statistics
> - e.g. when we have MCV list on the most common values and a histogram on
> the rest? Should we store both as a single value, or would that be in two
> separate values, or what?

The same statistics can combine two different things; using different columns may depend on how readable things get... Btw, for the format we could get inspired by pg_node_tree, with a pg_stat_tree:

  {HISTOGRAM :arg {BUCKET :index 0 :minvals ... }}
  {DEPENDENCY :arg {:elt "a => c" ...} ... }
  {MCV :arg {:index 0 :values {0,0} ... } ... }

Please consider that as a tentative idea to make things more friendly. Others may have a different opinion on the matter.
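As a purely illustrative sketch of that tentative idea (the struct and function below are hypothetical, not from the patch; only StringInfo is the backend's), rendering one dependency in such a text form could look like:

  #include "lib/stringinfo.h"

  /* hypothetical in-memory form of one functional dependency */
  typedef struct MVDependency
  {
      int     nattrs;     /* number of attributes in the dependency */
      int16  *attnums;    /* attribute numbers; last one is the implied column */
  } MVDependency;

  /* emit a pg_node_tree-style text representation of one dependency */
  static void
  append_dependency_text(StringInfo buf, const MVDependency *dep)
  {
      int     i;

      appendStringInfoString(buf, "{DEPENDENCY :attnums (");
      for (i = 0; i < dep->nattrs; i++)
          appendStringInfo(buf, "%s%d", (i > 0) ? " " : "", dep->attnums[i]);
      appendStringInfoString(buf, ")}");
  }

A data type with in/out functions built on this kind of rendering would make the catalog contents readable in psql, which is most of the benefit over a raw bytea.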
-- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 08/10/2016 02:23 PM, Michael Paquier wrote:

On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra wrote:

On 08/10/2016 06:41 AM, Michael Paquier wrote:

Patch 0001: there have been comments about that before, and you have put the checks on RestrictInfo in a couple of variables of pull_varnos_walker, so nothing to say from here.

I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's a proposed solution?

I think that's reasonable.

Well, to me the 0001 feels more like a temporary workaround rather than a proper solution. I just don't know how to deal with it so I've kept it for now. Pretty sure there will be complaints that adding RestrictInfo to the expression walkers is not a nice idea.

>> ...

The idea is that the syntax should work even for statistics built on multiple tables, e.g. to provide better statistics for joins. That's why the schema may be specified (as each table might be in different schema), and so on.

So you mean that the same statistics could be shared between tables? But as this is visibly not a concept introduced yet in this set of patches, why not just cut it off for now to simplify the whole? If there is no schema-related field in pg_mv_statistics we could still add it later if it proves to be useful.

Yes, I think creating statistics on multiple tables is one of the possible future directions. One of the previous patch versions introduced ALTER TABLE ... ADD STATISTICS syntax, but that ran into issues in gram.y, and given the multi-table possibilities the CREATE STATISTICS seems like a much better idea anyway.

But I guess you're right we may make this a bit more strict now, and relax it in the future if needed. For example as we only support single-table statistics at this point, we may remove the schema and always create the statistics in the schema of the table. But I don't think we should make the statistics names unique only within a table (instead of within the schema).

The difference between those two cases is that if we allow multi-table statistics in the future, we can simply allow specifying the schema and everything will work just fine. But it'd break the second case, as it might result in conflicts in existing schemas. I do realize this might be seen as a case of "future proofing" based on dubious predictions of how something might work, but OTOH this (schema inherited from table, unique within a schema) is pretty consistent with how this works for indexes.

  + /*

Spurious noise in the patch.

  + /* check that at least some statistics were requested */
  + if (!build_dependencies)
  +     ereport(ERROR,
  +             (errcode(ERRCODE_SYNTAX_ERROR),
  +              errmsg("no statistics type (dependencies) was requested")));

So, WITH (dependencies) is mandatory in any case. Why not just drop it from the first cut then?

Because the follow-up patches extend this to require at least one statistics type. So in 0004 it becomes

  if (!(build_dependencies || build_mcv))

and in 0005 it's

  if (!(build_dependencies || build_mcv || build_histogram))

We might drop it from 0002 (and assume build_dependencies=true), and then add the check in 0004. But it seems a bit pointless.

This is a complicated set of patches. I'd think that we should try to simplify things as much as possible first, and the WITH clause is not mandatory to have as of 0002.

OK, I can remove the WITH from the 0002 part. Not a big deal.

The statistics definition reorders the columns by itself depending on their order. For example:

  create table aa (a int, b int);
  create statistics aas on aa(b, a) with (dependencies);
  \d aa
  "public.aas" (dependencies) ON (a, b)

As this defines a correlation between multiple columns, isn't it wrong to assume that (b, a) and (a, b) are always the same correlation? I don't recall such properties as being always commutative (old memories, I suck at stats in general). [...reading README...] So this is caused by implementation limitations that restrict the analysis to interactions between two columns. Still it seems incorrect to reorder the user-visible portion.
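(For concreteness: the reordering in question is typically just a canonical sort of the attribute numbers at creation time. A minimal generic sketch, assuming the definition is stored as an array of attribute numbers - this is not the patch's actual code:)

  #include <stdlib.h>

  /* comparator for attribute numbers (int16 in the backend) */
  static int
  attnum_cmp(const void *a, const void *b)
  {
      return (int) *(const short *) a - (int) *(const short *) b;
  }

  /* at CREATE STATISTICS time, store the column list in canonical order */
  qsort(attnums, nattrs, sizeof(short), attnum_cmp);
  /* from here on, ON (b, a) and ON (a, b) yield the identical definition */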
I don't follow. If you talk about Pearson's correlation, that clearly does not depend on the order of columns - it's perfectly independent of that. If you talk about correlation in the wider sense (i.e. arbitrary dependence between columns), that might depend - but I don't remember a single piece of the patch where this might be a problem.

Yes, based on what is done in the patch that may not be a problem, but I am wondering if this is not restricting things too much.

Let's keep the code as it is. If we run into this issue in the future, we can easily relax this - there's nothing depending on the ordering of attnums, IIRC.

Also, which README states that we can only analyze interactions between two columns? That's pretty clearly not the case - the patch should handle dependencies between more columns without any problems. I have
Re: [HACKERS] multivariate statistics (v19)
On 08/10/2016 02:24 PM, Michael Paquier wrote: On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek wrote: On 10/08/16 13:33, Tomas Vondra wrote: On 08/10/2016 06:41 AM, Michael Paquier wrote: On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra 2) combining multiple statistics I think the ability to combine multivariate statistics (covering different subsets of conditions) is important and useful, but I'm starting to think that the current implementation may not be the correct one (which is why I haven't written the SGML docs about this part of the patch series yet). Assume there's a table "t" with 3 columns (a, b, c), and that we're estimating query: SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 but that we only have two statistics (a,b) and (b,c). The current patch does about this: P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2) i.e. it estimates the first two conditions using (a,b), and then estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient, but it only works as long as the query contains conditions "connecting" the two statistics. So if we remove the "b=2" condition from the query, this stops working. This is trying to make the algorithm smarter than the user, which is something I'd think we could live without. In this case statistics on (a,c) or (a,b,c) are missing. And what if the user does not want to make use of stats for (a,c) because he only defined (a,b) and (b,c)? I don't think so. Obviously, if you have statistics covering all the conditions - great, we can't really do better than that. But there's a crucial relation between the number of dimensions of the statistics and accuracy of the statistics. Let's say you have statistics on 8 columns, and you split each dimension twice to build a histogram - that's 256 buckets right there, and we only get ~50% selectivity in each dimension (the actual histogram building algorithm is more complex, but you get the idea). I think it makes sense to pursue this, but I also think we can easily live with not having it in the first version that gets committed and doing it as follow-up patch. This patch is large and complicated enough. As this is not a mandatory piece to get a basic support, I'd suggest just to drop that for later. Which is why combining multiple statistics is in part 0006 and all the previous parts simply choose the single "best" statistics ;-) I'm perfectly fine with committing just the first few parts, and leaving 0006 for the next major version. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 08/10/2016 03:29 PM, Ants Aasma wrote: On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra wrote: 2) combining multiple statistics I think the ability to combine multivariate statistics (covering different subsets of conditions) is important and useful, but I'm starting to think that the current implementation may not be the correct one (which is why I haven't written the SGML docs about this part of the patch series yet). While researching this topic a few years ago I came across a paper on this exact topic called "Consistently Estimating the Selectivity of Conjuncts of Predicates" [1]. While effective it seems to be quite heavy-weight, so would probably need support for tiered optimization. [1] https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf I think I've read that paper some time ago, and IIRC it's solving the same problem but in a very different way - instead of combining the statistics directly, it relies on the "partial" selectivities and then estimates the total selectivity using the maximum-entropy principle. I think it's a nice idea and it probably works fine in many cases, but it kinda throws away part of the information (that we could get by matching the statistics against each other directly). But I'll keep that paper in mind, and we can revisit this solution later. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Wed, Aug 3, 2016 at 4:58 AM, Tomas Vondra wrote: > 2) combining multiple statistics > > I think the ability to combine multivariate statistics (covering different > subsets of conditions) is important and useful, but I'm starting to think > that the current implementation may not be the correct one (which is why I > haven't written the SGML docs about this part of the patch series yet). While researching this topic a few years ago I came across a paper on this exact topic called "Consistently Estimating the Selectivity of Conjuncts of Predicates" [1]. While effective it seems to be quite heavy-weight, so would probably need support for tiered optimization. [1] https://courses.cs.washington.edu/courses/cse544/11wi/papers/markl-vldb-2005.pdf Regards, Ants Aasma -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Wed, Aug 10, 2016 at 8:50 PM, Petr Jelinek wrote: > On 10/08/16 13:33, Tomas Vondra wrote: >> >> On 08/10/2016 06:41 AM, Michael Paquier wrote: >>> >>> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra 2) combining multiple statistics I think the ability to combine multivariate statistics (covering different subsets of conditions) is important and useful, but I'm starting to think that the current implementation may not be the correct one (which is why I haven't written the SGML docs about this part of the patch series yet). Assume there's a table "t" with 3 columns (a, b, c), and that we're estimating query: SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 but that we only have two statistics (a,b) and (b,c). The current patch does about this: P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2) i.e. it estimates the first two conditions using (a,b), and then estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient, but it only works as long as the query contains conditions "connecting" the two statistics. So if we remove the "b=2" condition from the query, this stops working. >>> >>> >>> This is trying to make the algorithm smarter than the user, which is >>> something I'd think we could live without. In this case statistics on >>> (a,c) or (a,b,c) are missing. And what if the user does not want to >>> make use of stats for (a,c) because he only defined (a,b) and (b,c)? >>> >> >> I don't think so. Obviously, if you have statistics covering all the >> conditions - great, we can't really do better than that. >> >> But there's a crucial relation between the number of dimensions of the >> statistics and accuracy of the statistics. Let's say you have statistics >> on 8 columns, and you split each dimension twice to build a histogram - >> that's 256 buckets right there, and we only get ~50% selectivity in each >> dimension (the actual histogram building algorithm is more complex, but >> you get the idea). > > I think it makes sense to pursue this, but I also think we can easily live > with not having it in the first version that gets committed and doing it as > follow-up patch. This patch is large and complicated enough. As this is not a mandatory piece to get a basic support, I'd suggest just to drop that for later. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Wed, Aug 10, 2016 at 8:33 PM, Tomas Vondra wrote: > On 08/10/2016 06:41 AM, Michael Paquier wrote: >> Patch 0001: there have been comments about that before, and you have >> put the checks on RestrictInfo in a couple of variables of >> pull_varnos_walker, so nothing to say from here. >> > > I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's > a proposed solution? I think that's reasonable. >> Patch 0002: >> + >> + CREATE STATISTICS will create a new multivariate >> + statistics on the table. The statistics will be created in the in the >> + current database. The statistics will be owned by the user issuing >> + the command. >> + >> s/in the/in the/. >> >> + >> + Create table t1 with two functionally dependent >> columns, i.e. >> + knowledge of a value in the first column is sufficient for detemining >> the >> + value in the other column. Then functional dependencies are built on >> those >> + columns: >> s/detemining/determining/ >> >> + >> + If a schema name is given (for example, CREATE STATISTICS >> + myschema.mystat ...) then the statistics is created in the >> specified >> + schema. Otherwise it is created in the current schema. The name of >> + the table must be distinct from the name of any other statistics in >> the >> + same schema. >> + >> I would just assume that a statistics is located on the schema of the >> relation it depends on. So the thing that may be better to do is just: >> - Register the OID of the table a statistics depends on but not the >> schema. >> - Give up on those query extensions related to the schema. >> - Allow the same statistics name to be used for multiple tables. >> - Just fail if a statistics name is being reused on the table again. >> It may be better to complain about that even if the column list is >> different. >> - Register the dependency between the statistics and the table. > > The idea is that the syntax should work even for statistics built on > multiple tables, e.g. to provide better statistics for joins. That's why the > schema may be specified (as each table might be in different schema), and so > on. So you mean that the same statistics could be shared between tables? But as this is visibly not a concept introduced yet in this set of patches, why not just cut it off for now to simplify the whole? If there is no schema-related field in pg_mv_statistics we could still add it later if it proves to be useful. >> + >> /* >> Spurious noise in the patch. >> >> + /* check that at least some statistics were requested */ >> + if (!build_dependencies) >> + ereport(ERROR, >> + (errcode(ERRCODE_SYNTAX_ERROR), >> +errmsg("no statistics type (dependencies) was >> requested"))); >> So, WITH (dependencies) is mandatory in any case. Why not just >> dropping it from the first cut then? > > > Because the follow-up patches extend this to require at least one statistics > type. So in 0004 it becomes > > if (!(build_dependencies || build_mcv)) > > and in 0005 it's > > if (!(build_dependencies || build_mcv || build_histogram)) > > We might drop it from 0002 (and assume build_dependencies=true), and then > add the check in 0004. But it seems a bit pointless. This is a complicated set of patches. I'd think that we should try to simplify things as much as possible first, and the WITH clause is not mandatory to have as of 0002. >> Statistics definition reorder the columns by itself depending on their >> order. 
For example: >> create table aa (a int, b int); >> create statistics aas on aa(b, a) with (dependencies); >> \d aa >> "public.aas" (dependencies) ON (a, b) >> As this defines a correlation between multiple columns, isn't it wrong >> to assume that (b, a) and (a, b) are always the same correlation? I >> don't recall such properties as being always commutative (old >> memories, I suck at stats in general). [...reading README...] So this >> is caused by the implementation limitations that only limit the >> analysis between interactions of two columns. Still it seems incorrect >> to reorder the user-visible portion. > > I don't follow. If you talk about Pearson's correlation, that clearly does > not depend on the order of columns - it's perfectly independent of that. If > you talk about about correlation in the wider sense (i.e. arbitrary > dependence between columns), that might depend - but I don't remember a > single piece of the patch where this might be a problem. Yes, based on what is done in the patch that may not be a problem, but I am wondering if this is not restricting things too much. > Also, which README states that we can only analyze interactions between two > columns? That's pretty clearly not the case - the patch should handle > dependencies between more columns without any problems. I have noticed that the patch evaluates all the set of permutations possible using a column list, it seems to me though that say if we have three columns (a,
Re: [HACKERS] multivariate statistics (v19)
On 10/08/16 13:33, Tomas Vondra wrote: On 08/10/2016 06:41 AM, Michael Paquier wrote: On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra 2) combining multiple statistics I think the ability to combine multivariate statistics (covering different subsets of conditions) is important and useful, but I'm starting to think that the current implementation may not be the correct one (which is why I haven't written the SGML docs about this part of the patch series yet). Assume there's a table "t" with 3 columns (a, b, c), and that we're estimating query: SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 but that we only have two statistics (a,b) and (b,c). The current patch does about this: P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2) i.e. it estimates the first two conditions using (a,b), and then estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient, but it only works as long as the query contains conditions "connecting" the two statistics. So if we remove the "b=2" condition from the query, this stops working. This is trying to make the algorithm smarter than the user, which is something I'd think we could live without. In this case statistics on (a,c) or (a,b,c) are missing. And what if the user does not want to make use of stats for (a,c) because he only defined (a,b) and (b,c)? I don't think so. Obviously, if you have statistics covering all the conditions - great, we can't really do better than that. But there's a crucial relation between the number of dimensions of the statistics and accuracy of the statistics. Let's say you have statistics on 8 columns, and you split each dimension twice to build a histogram - that's 256 buckets right there, and we only get ~50% selectivity in each dimension (the actual histogram building algorithm is more complex, but you get the idea). I think it makes sense to pursue this, but I also think we can easily live with not having it in the first version that gets committed and doing it as follow-up patch. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 08/10/2016 06:41 AM, Michael Paquier wrote: On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra wrote: ... But more importantly, I think we'll need to show some of the data in EXPLAIN output. With per-column statistics it's fairly straightforward to determine which statistics are used and how. But with multivariate stats things are often more complicated - there may be multiple candidate statistics (e.g. histograms covering different subsets of the conditions), it's possible to apply them in different orders, etc. But EXPLAIN can't show the info if it's ephemeral and available only within clausesel.c (and thrown away after the estimation). This gives a good reason to not do that in clauserel.c, it would be really cool to be able to get some information regarding the stats used with a simple EXPLAIN. I think there are two separate questions: (a) Whether the query plan is "enriched" with information about statistics, or whether this information is ephemeral and available only in clausesel.c. (b) Where exactly this enrichment happens. Theoretically we might enrich the query plan (add nodes with info about the statistics), so that EXPLAIN gets the info, and it might still happen in clausesel.c. 2) combining multiple statistics I think the ability to combine multivariate statistics (covering different subsets of conditions) is important and useful, but I'm starting to think that the current implementation may not be the correct one (which is why I haven't written the SGML docs about this part of the patch series yet). Assume there's a table "t" with 3 columns (a, b, c), and that we're estimating query: SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3 but that we only have two statistics (a,b) and (b,c). The current patch does about this: P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2) i.e. it estimates the first two conditions using (a,b), and then estimates (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient, but it only works as long as the query contains conditions "connecting" the two statistics. So if we remove the "b=2" condition from the query, this stops working. This is trying to make the algorithm smarter than the user, which is something I'd think we could live without. In this case statistics on (a,c) or (a,b,c) are missing. And what if the user does not want to make use of stats for (a,c) because he only defined (a,b) and (b,c)? I don't think so. Obviously, if you have statistics covering all the conditions - great, we can't really do better than that. But there's a crucial relation between the number of dimensions of the statistics and accuracy of the statistics. Let's say you have statistics on 8 columns, and you split each dimension twice to build a histogram - that's 256 buckets right there, and we only get ~50% selectivity in each dimension (the actual histogram building algorithm is more complex, but you get the idea). I see this as probably the most interesting part of the patch, and quite useful. But we'll definitely get the single-statistics estimate first, no doubt about that. Patch 0001: there have been comments about that before, and you have put the checks on RestrictInfo in a couple of variables of pull_varnos_walker, so nothing to say from here. I don't follow. Are you suggesting 0001 is a reasonable fix, or that there's a proposed solution? Patch 0002: + + CREATE STATISTICS will create a new multivariate + statistics on the table. The statistics will be created in the in the + current database. 
The statistics will be owned by the user issuing + the command. + s/in the/in the/. + + Create table t1 with two functionally dependent columns, i.e. + knowledge of a value in the first column is sufficient for detemining the + value in the other column. Then functional dependencies are built on those + columns: s/detemining/determining/ + + If a schema name is given (for example, CREATE STATISTICS + myschema.mystat ...) then the statistics is created in the specified + schema. Otherwise it is created in the current schema. The name of + the table must be distinct from the name of any other statistics in the + same schema. + I would just assume that a statistics is located on the schema of the relation it depends on. So the thing that may be better to do is just: - Register the OID of the table a statistics depends on but not the schema. - Give up on those query extensions related to the schema. - Allow the same statistics name to be used for multiple tables. - Just fail if a statistics name is being reused on the table again. It may be better to complain about that even if the column list is different. - Register the dependency between the statistics and the table. The idea is that the syntax should work even for statistics built on multiple tables, e.g. to provide better statistics for joins. That's why the schema may be specified (as each table might be in different schema),
Re: [HACKERS] multivariate statistics (v19)
On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra wrote:
> 1) enriching the query tree with multivariate statistics info
>
> Right now all the stuff related to multivariate statistics estimation
> happens in clausesel.c - matching conditions to statistics, selection of
> statistics to use (if there are multiple usable stats), etc. So pretty much
> all this info is internal to clausesel.c and does not get outside.

This does not seem bad to me at first sight but...

> I'm starting to think that some of the steps (matching quals to stats,
> selection of stats) should happen in a "preprocess" step before the actual
> estimation, storing the information (which stats to use, etc.) in a new type
> of node in the query tree - something like RestrictInfo.
>
> I believe this needs to happen sometime after deconstruct_jointree() as that
> builds RestrictInfo nodes, and looking at planmain.c, right after
> extract_restriction_or_clauses seems about right. Haven't tried, though.
>
> This would move all the "statistics selection" logic from clausesel.c,
> separating it from the "actual estimation" and simplifying the code.
>
> But more importantly, I think we'll need to show some of the data in EXPLAIN
> output. With per-column statistics it's fairly straightforward to determine
> which statistics are used and how. But with multivariate stats things are
> often more complicated - there may be multiple candidate statistics (e.g.
> histograms covering different subsets of the conditions), it's possible to
> apply them in different orders, etc.
>
> But EXPLAIN can't show the info if it's ephemeral and available only within
> clausesel.c (and thrown away after the estimation).

This gives a good reason to not do that in clausesel.c; it would be really cool to be able to get some information regarding the stats used with a simple EXPLAIN.

> 2) combining multiple statistics
>
> I think the ability to combine multivariate statistics (covering different
> subsets of conditions) is important and useful, but I'm starting to think
> that the current implementation may not be the correct one (which is why I
> haven't written the SGML docs about this part of the patch series yet).
>
> Assume there's a table "t" with 3 columns (a, b, c), and that we're
> estimating query:
>
>    SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3
>
> but that we only have two statistics (a,b) and (b,c). The current patch does
> about this:
>
>    P(a=1,b=2,c=3) = P(a=1,b=2) * P(c=3|b=2)
>
> i.e. it estimates the first two conditions using (a,b), and then estimates
> (c=3) using (b,c) with "b=2" as a condition. Now, this is very efficient,
> but it only works as long as the query contains conditions "connecting" the
> two statistics. So if we remove the "b=2" condition from the query, this
> stops working.

This is trying to make the algorithm smarter than the user, which is something I'd think we could live without. In this case statistics on (a,c) or (a,b,c) are missing. And what if the user does not want to make use of stats for (a,c) because he only defined (a,b) and (b,c)?

Patch 0001: there have been comments about that before, and you have put the checks on RestrictInfo in a couple of variables of pull_varnos_walker, so nothing to say from here.

Patch 0002:

  +
  +  CREATE STATISTICS will create a new multivariate
  +  statistics on the table. The statistics will be created in the in the
  +  current database. The statistics will be owned by the user issuing
  +  the command.
  +

s/in the in the/in the/.

  +
  +  Create table t1 with two functionally dependent columns, i.e.
  +  knowledge of a value in the first column is sufficient for detemining the
  +  value in the other column. Then functional dependencies are built on those
  +  columns:

s/detemining/determining/

  +
  +  If a schema name is given (for example, CREATE STATISTICS
  +  myschema.mystat ...) then the statistics is created in the specified
  +  schema. Otherwise it is created in the current schema. The name of
  +  the table must be distinct from the name of any other statistics in the
  +  same schema.
  +

I would just assume that a statistics is located on the schema of the relation it depends on. So the thing that may be better to do is just:
- Register the OID of the table a statistics depends on but not the schema.
- Give up on those query extensions related to the schema.
- Allow the same statistics name to be used for multiple tables.
- Just fail if a statistics name is being reused on the table again. It may be better to complain about that even if the column list is different.
- Register the dependency between the statistics and the table.

  +ALTER STATISTICS name OWNER TO { new_owner | CURRENT_USER | SESSION_USER }

On the same line, is OWNER TO really necessary? I could have assumed that if a user is able to query the set of columns related to a statistics, he should have access to it.

  =# create statistics aa_a_b3 on aam (a, b) with (dependencies);
  ERROR: 23505: duplicate key value vio
Re: [HACKERS] multivariate statistics (v19)
On Sat, Aug 6, 2016 at 2:38 AM, Tomas Vondra wrote: > On 08/05/2016 06:24 AM, Michael Paquier wrote: >> >> On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra >> wrote: >>> >>> Attached is v19 of the "multivariate stats" patch series - essentially >>> v18 >>> rebased on top of current master. Aside from a few bug fixes, the main >>> improvement is addition of SGML docs demonstrating the statistics in a >>> way >>> similar to the current "Row Estimation Examples" (and the docs are >>> actually >>> in the same section). I've tried to keep the right amount of technical >>> detail (and pointing to the right README for additional details), but >>> this >>> may need improvements. I have not written docs explaining how statistics >>> may >>> be combined yet (more about this later). >> >> >> What we have here is quite something: >> $ git diff master --stat | tail -n1 >> 77 files changed, 12809 insertions(+), 65 deletions(-) >> I will try to get familiar on the topic and added myself as a reviewer >> of this patch. Hopefully I'll get feedback soon. > > > Yes, it's a large patch. Although 25% of the insertions are SGML docs, > regression tests and READMEs, and large part of the remaining ~9k insertions > are comments. But it may still be overwhelming, no doubt about that. > > FWIW, if someone is interested in the patch but is unsure where to start, > I'm ready to help with that as much as possible. For example if you happen > to go to PostgresOpen, feel free to drag me to a corner and ask me as many > questions as you want ... Sure. Only PGconf SV is on my track this year. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On 08/05/2016 06:24 AM, Michael Paquier wrote: On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra wrote: Attached is v19 of the "multivariate stats" patch series - essentially v18 rebased on top of current master. Aside from a few bug fixes, the main improvement is addition of SGML docs demonstrating the statistics in a way similar to the current "Row Estimation Examples" (and the docs are actually in the same section). I've tried to keep the right amount of technical detail (and pointing to the right README for additional details), but this may need improvements. I have not written docs explaining how statistics may be combined yet (more about this later). What we have here is quite something: $ git diff master --stat | tail -n1 77 files changed, 12809 insertions(+), 65 deletions(-) I will try to get familiar on the topic and added myself as a reviewer of this patch. Hopefully I'll get feedback soon. Yes, it's a large patch. Although 25% of the insertions are SGML docs, regression tests and READMEs, and large part of the remaining ~9k insertions are comments. But it may still be overwhelming, no doubt about that. FWIW, if someone is interested in the patch but is unsure where to start, I'm ready to help with that as much as possible. For example if you happen to go to PostgresOpen, feel free to drag me to a corner and ask me as many questions as you want ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] multivariate statistics (v19)
On Wed, Aug 3, 2016 at 10:58 AM, Tomas Vondra wrote:
> Attached is v19 of the "multivariate stats" patch series - essentially v18
> rebased on top of current master. Aside from a few bug fixes, the main
> improvement is addition of SGML docs demonstrating the statistics in a way
> similar to the current "Row Estimation Examples" (and the docs are actually
> in the same section). I've tried to keep the right amount of technical
> detail (and pointing to the right README for additional details), but this
> may need improvements. I have not written docs explaining how statistics may
> be combined yet (more about this later).

What we have here is quite something:

  $ git diff master --stat | tail -n1
   77 files changed, 12809 insertions(+), 65 deletions(-)

I will try to get familiar with the topic and have added myself as a reviewer of this patch. Hopefully I'll get feedback soon. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers