Re: pg_stats and range statistics
On Sat, Nov 25, 2023 at 11:14 AM Egor Rogov wrote: > > Hi Alexander, > > On 25.11.2023 02:06, Alexander Korotkov wrote: > > > > In conclusion of all of the above, I decided to revise the patch and > > show the bounds histogram as it's stored in pg_statistic. I revised > > the docs correspondingly. > > > So basically we returned to what it all has started from? I guess it's > better than nothing, although I have to admit that two-array > representation is much more readable. Unfortunately it brings in a > surprising amount of complexity. Yep, it is. > Anyway, thanks for looking into this! And thank you for the feedback! -- Regards, Alexander Korotkov
Re: pg_stats and range statistics
On Sat, Nov 25, 2023 at 10:58 AM jian he wrote: > On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov > wrote: > > > > Hi! > > Additionally, I found that the current patch can't handle infinite > > range bounds and discards information about inclusiveness of range > > bounds. The infinite bounds could be represented as NULL (while I'm > > not sure how good this representation is). Regarding inclusiveness, I > > don't see the possibility to represent them in a reasonable way within > > an array of base types. I also don't feel good about discarding the > > accuracy in the pg_stats view. > > > > in range_length_histogram, maybe we can document that when calculating > the length of a range, inclusiveness will be true. I've revised the patchset. Edited comment in pg_statistic.h as you proposed. And I've added to the documentation a short note on how the range length histogram is calculated. -- Regards, Alexander Korotkov 0001-Update-comments-for-pg_statistic-catalog--20231125-2.patch Description: Binary data 0002-Display-length-and-bounds-histograms-in-p-20231125-2.patch Description: Binary data
Re: pg_stats and range statistics
Hi Alexander, On 25.11.2023 02:06, Alexander Korotkov wrote: In conclusion of all of the above, I decided to revise the patch and show the bounds histogram as it's stored in pg_statistic. I revised the docs correspondingly. So basically we returned to what it all has started from? I guess it's better than nothing, although I have to admit that two-array representation is much more readable. Unfortunately it brings in a surprising amount of complexity. Anyway, thanks for looking into this!
Re: pg_stats and range statistics
On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov wrote: > > Hi! > Additionally, I found that the current patch can't handle infinite > range bounds and discards information about inclusiveness of range > bounds. The infinite bounds could be represented as NULL (while I'm > not sure how good this representation is). Regarding inclusiveness, I > don't see the possibility to represent them in a reasonable way within > an array of base types. I also don't feel good about discarding the > accuracy in the pg_stats view. > in range_length_histogram, maybe we can document that when calculating the length of a range, inclusiveness will be true.
Re: pg_stats and range statistics
On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov wrote: > > Hi! > > I'm going to push this if there are no objections. > > -- > Regards, > Alexander Korotkov src/include/catalog/pg_statistic.h 268: * range type's subdiff function. Only non-null rows are considered. should it be: * range type's subdiff function. Only non-null, non-empty rows are considered. Other than that, it looks fine to me.
Re: pg_stats and range statistics
Hi! On Wed, Sep 6, 2023 at 6:18 PM jian he wrote: > +lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) > should be > + > ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) > > +upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) > should be > + > ranges_upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) > > https://www.postgresql.org/docs/current/catalog-pg-type.html > there is no association between numrange and their base type numeric. > so for template: anyarray ranges_lower(anyarray). I don't think we can > input numrange array and return a numeric array. > > https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC > >> When the return value of a function is declared as a polymorphic type, > >> there must be at least one argument position that is also >> polymorphic, > >> and the actual data type(s) supplied for the polymorphic arguments > >> determine the actual result type for that call. > > > regression=# select > ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4), > numrange(5.5,6.6)]); > ranges_lower > --- > {1.1,3.3,5.5} > (1 row) > regression=# \gdesc > Column|Type > --+ > ranges_lower | numrange[] > (1 row) > > I don't think you can cast literal ' {1.1,3.3,5.5}' to numrange[]. Thank you for noticing this. Indeed, our polymorphic type system doesn't support this case. In order to support this, we need something like "anyrangearray" pseudo-type. However, it seems overkill to introduce a new pseudo-type just to update pg_stats. Additionally, I found that the current patch can't handle infinite range bounds and discards information about inclusiveness of range bounds. The infinite bounds could be represented as NULL (while I'm not sure how good this representation is). Regarding inclusiveness, I don't see the possibility to represent them in a reasonable way within an array of base types. I also don't feel good about discarding the accuracy in the pg_stats view. In conclusion of all of the above, I decided to revise the patch and show the bounds histogram as it's stored in pg_statistic. I revised the docs correspondingly. Also for some reason, the patch added description of new columns to the documentation of pg_user_mapping table. I've fixed that by moving them to the documentation of pg_stats view. Also, I've extracted the new comment in pg_statistic.h into a separate patch. I'm going to push this if there are no objections. -- Regards, Alexander Korotkov 0001-Add-comment-to-pg_statistic-catalog-table-20231125.patch Description: Binary data 0002-Display-length-and-bounds-histograms-in-pg_-20231125.patch Description: Binary data
Re: pg_stats and range statistics
hi. I played around with the 2023-Apr 4 latest patch. +lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) should be + ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) +upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) should be + ranges_upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) https://www.postgresql.org/docs/current/catalog-pg-type.html there is no association between numrange and their base type numeric. so for template: anyarray ranges_lower(anyarray). I don't think we can input numrange array and return a numeric array. https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC >> When the return value of a function is declared as a polymorphic type, there >> must be at least one argument position that is also >> polymorphic, and the >> actual data type(s) supplied for the polymorphic arguments determine the >> actual result type for that call. regression=# select ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4), numrange(5.5,6.6)]); ranges_lower --- {1.1,3.3,5.5} (1 row) regression=# \gdesc Column|Type --+ ranges_lower | numrange[] (1 row) I don't think you can cast literal ' {1.1,3.3,5.5}' to numrange[].
Re: pg_stats and range statistics
On Fri, 24 Mar 2023 at 14:48, Egor Rogov wrote: > > Done. > There is one thing I'm not sure what to do about. This check: > > if (typentry->typtype != TYPTYPE_RANGE) > ereport(ERROR, > (errcode(ERRCODE_DATATYPE_MISMATCH), >errmsg("expected array of ranges"))); > > doesn't work, because the range_get_typcache() call errors out first > ("type %u is not a range type"). The message doesn't look friendly > enough for user-faced SQL function. Should we duplicate > range_get_typcache's logic and replace the error message? > Okay. I've corrected the examples a bit. It sounds like you've addressed Tomas's feedback and still have one open question. Fwiw I rebased it, it seemed to merge fine automatically. I've updated the CF entry to Needs Review. But at this late date it may have to wait until the next release. -- Gregory Stark As Commitfest Manager From 87424880f1a970448979681684e6916f33567eeb Mon Sep 17 00:00:00 2001 From: Greg Stark Date: Mon, 3 Apr 2023 17:04:11 -0400 Subject: [PATCH] pg_stats and range statistics --- doc/src/sgml/func.sgml| 36 ++ doc/src/sgml/system-views.sgml| 40 +++ src/backend/catalog/system_views.sql | 30 - src/backend/utils/adt/rangetypes_typanalyze.c | 107 ++ src/include/catalog/pg_proc.dat | 10 ++ src/test/regress/expected/rangetypes.out | 22 src/test/regress/expected/rules.out | 34 +- src/test/regress/sql/rangetypes.sql | 7 ++ 8 files changed, 284 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 918a492234..548078a12e 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19643,6 +19643,24 @@ SELECT NULLIF(value, '(none)') ... + + + + ranges_lower + +ranges_lower ( anyarray ) +anyarray + + +Extracts lower bounds of ranges in the array (NULL if +the range is empty or the lower bound is infinite). + + +lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) +{1.1,3.3} + + + @@ -19661,6 +19679,24 @@ SELECT NULLIF(value, '(none)') ... + + + + ranges_upper + +ranges_upper ( anyarray ) +anyarray + + +Extracts upper bounds of ranges (NULL if the +range is empty or the upper bound is infinite). + + +upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) +{2.2,4.4} + + + diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index bb1a418450..d7760838ae 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -3784,6 +3784,46 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx non-null elements. (Null for scalar types.) + + + + range_length_histogram anyarray + + + A histogram of the lengths of non-empty and non-null range values of a + range type column. (Null for non-range types.) + + + + + + range_empty_frac float4 + + + Fraction of column entries whose values are empty ranges. + (Null for non-range types.) + + + + + + range_lower_histogram anyarray + + + A histogram of lower bounds of non-empty and non-null range values. + (Null for non-range types.) + + + + + + range_upper_histogram anyarray + + + A histogram of upper bounds of non-empty and non-null range values. + (Null for non-range types.) + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 574cbc2e44..3fb7ee448f 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -243,7 +243,35 @@ CREATE VIEW pg_stats WITH (security_barrier) AS WHEN stakind3 = 5 THEN stanumbers3 WHEN stakind4 = 5 THEN stanumbers4 WHEN stakind5 = 5 THEN stanumbers5 -END AS elem_count_histogram +END AS elem_count_histogram, +CASE +WHEN stakind1 = 6 THEN stanumbers1[1] +WHEN stakind2 = 6 THEN stanumbers2[1] +WHEN stakind3 = 6 THEN stanumbers3[1] +WHEN stakind4 = 6 THEN stanumbers4[1] +WHEN stakind5 = 6 THEN stanumbers5[1] +END AS range_empty_frac, +CASE +WHEN stakind1 = 6 THEN stavalues1 +WHEN stakind2 = 6 THEN stavalues2 +WHEN stakind3 = 6 THEN stavalues3 +WHEN stakind4 = 6 THEN stavalues4 +WHEN stakind5 = 6 THEN stavalues5 +END AS range_length_histogram, +CASE +WHE
Re: pg_stats and range statistics
On 24.03.2023 01:46, Tomas Vondra wrote: So if you could clean it up a bit, and do something about the two open items I mentioned (a bunch of tests on different array, I've added some tests to resgress/sql/rangetypes.sql, based on the same dataset that is used to test lower() and upper(). and behavior consistent with lower/upper), Done. This required to switch from construct_array(), which doesn't support NULLs, to construct_md_array(), which does. A nice side effect is that now we also support multidimentional arrays. I've moved a common part of ranges_lower_bounds() and ranges_upper_bounds() to ranges_bounds_common(), following Justin's advice. There is one thing I'm not sure what to do about. This check: if (typentry->typtype != TYPTYPE_RANGE) ereport(ERROR, (errcode(ERRCODE_DATATYPE_MISMATCH), errmsg("expected array of ranges"))); doesn't work, because the range_get_typcache() call errors out first ("type %u is not a range type"). The message doesn't look friendly enough for user-faced SQL function. Should we duplicate range_get_typcache's logic and replace the error message? that'd be great. Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray) functions? This approach is okay with me. Tomas, have you made up your mind? I think the function approach is fine, but in my January 22 message I was wondering why we're not actually naming them simply lower/upper. I'd expect from lower(anyarray) function to return the lowest element in the array. This name doesn't hint that the function takes an array of ranges. So, ranges_ prefix seems justified to me. Do we want to document these functions? They are very pg_statistic-specific and won't be useful for end users imo. I don't see why not to document them. Sure, we're using them in a fairly specific context, but I don't see why not to let people use them too (which would be hard without docs). Okay. I've corrected the examples a bit. The patch is attached. Thanks, Egor diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 0987eb805a..c4d86e1679 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19446,6 +19446,24 @@ SELECT NULLIF(value, '(none)') ... + + + + ranges_lower + +ranges_lower ( anyarray ) +anyarray + + +Extracts lower bounds of ranges in the array (NULL if +the range is empty or the lower bound is infinite). + + +lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) +{1.1,3.3} + + + @@ -19464,6 +19482,24 @@ SELECT NULLIF(value, '(none)') ... + + + + ranges_upper + +ranges_upper ( anyarray ) +anyarray + + +Extracts upper bounds of ranges (NULL if the +range is empty or the upper bound is infinite). + + +upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)]) +{2.2,4.4} + + + diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 7c8fc3f654..a48d775467 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -3783,6 +3783,46 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx non-null elements. (Null for scalar types.) + + + + range_length_histogram anyarray + + + A histogram of the lengths of non-empty and non-null range values of a + range type column. (Null for non-range types.) + + + + + + range_empty_frac float4 + + + Fraction of column entries whose values are empty ranges. + (Null for non-range types.) + + + + + + range_lower_histogram anyarray + + + A histogram of lower bounds of non-empty and non-null range values. + (Null for non-range types.) + + + + + + range_upper_histogram anyarray + + + A histogram of upper bounds of non-empty and non-null range values. + (Null for non-range types.) + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8ea159dbde..882344a2d7 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -243,7 +243,35 @@ CREATE VIEW pg_stats WITH (security_barrier) AS WHEN stakind3 = 5 THEN stanumbers3 WHEN stakind4 = 5 THEN stanumbers4 WHEN stakind5 = 5 THEN stanumbers5 -END AS elem_count_histogram +END AS elem_count_histogram, +CASE +WHEN stakind1 = 6 THEN stanumbers1[1] +WHEN stakind2 = 6 THEN stanumbers2[1] +WHEN stakind3 = 6 THEN stanumbers3[1] +W
Re: pg_stats and range statistics
On 3/20/23 20:54, Egor Rogov wrote: > On 20.03.2023 22:27, Gregory Stark (as CFM) wrote: >> On Sun, 22 Jan 2023 at 18:22, Tomas Vondra >> wrote: >>> I wonder if we have other functions doing something similar, i.e. >>> accepting a polymorphic type and then imposing additional restrictions >>> on it. >> Meh, there's things like array comparison functions that require both >> arguments to be the same kind of arrays. And array_agg that requires >> the elements to be the same type as the state array (ie, same type as >> the first element). Not sure there are any taking just one specific >> type though. >> Shouldn't this add some sql tests ? >>> Yeah, I guess we should have a couple tests calling these functions on >>> different range arrays. >>> >>> This reminds me lower()/upper() have some extra rules about handling >>> empty ranges / infinite boundaries etc. These functions should behave >>> consistently (as if we called lower() in a loop) and I'm pretty sure >>> that's not the current state. >> Are we still waiting on these two items? Egor, do you think you'll >> have a chance to work it for this month? > > > I can try to tidy things up, but I'm not sure if we reached a consensus. > We don't have any objections, and that's probably the best consensus we can get here, I guess ... So if you could clean it up a bit, and do something about the two open items I mentioned (a bunch of tests on different array, and behavior consistent with lower/upper), that'd be great. > Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray) > functions? This approach is okay with me. Tomas, have you made up your > mind? > I think the function approach is fine, but in my January 22 message I was wondering why we're not actually naming them simply lower/upper. > Do we want to document these functions? They are very > pg_statistic-specific and won't be useful for end users imo. > I don't see why not to document them. Sure, we're using them in a fairly specific context, but I don't see why not to let people use them too (which would be hard without docs). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_stats and range statistics
On 20.03.2023 22:27, Gregory Stark (as CFM) wrote: On Sun, 22 Jan 2023 at 18:22, Tomas Vondra wrote: I wonder if we have other functions doing something similar, i.e. accepting a polymorphic type and then imposing additional restrictions on it. Meh, there's things like array comparison functions that require both arguments to be the same kind of arrays. And array_agg that requires the elements to be the same type as the state array (ie, same type as the first element). Not sure there are any taking just one specific type though. Shouldn't this add some sql tests ? Yeah, I guess we should have a couple tests calling these functions on different range arrays. This reminds me lower()/upper() have some extra rules about handling empty ranges / infinite boundaries etc. These functions should behave consistently (as if we called lower() in a loop) and I'm pretty sure that's not the current state. Are we still waiting on these two items? Egor, do you think you'll have a chance to work it for this month? I can try to tidy things up, but I'm not sure if we reached a consensus. Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray) functions? This approach is okay with me. Tomas, have you made up your mind? Do we want to document these functions? They are very pg_statistic-specific and won't be useful for end users imo.
Re: pg_stats and range statistics
On Sun, 22 Jan 2023 at 18:22, Tomas Vondra wrote: > > I wonder if we have other functions doing something similar, i.e. > accepting a polymorphic type and then imposing additional restrictions > on it. Meh, there's things like array comparison functions that require both arguments to be the same kind of arrays. And array_agg that requires the elements to be the same type as the state array (ie, same type as the first element). Not sure there are any taking just one specific type though. > > Shouldn't this add some sql tests ? > > Yeah, I guess we should have a couple tests calling these functions on > different range arrays. > > This reminds me lower()/upper() have some extra rules about handling > empty ranges / infinite boundaries etc. These functions should behave > consistently (as if we called lower() in a loop) and I'm pretty sure > that's not the current state. Are we still waiting on these two items? Egor, do you think you'll have a chance to work it for this month? -- Gregory Stark As Commitfest Manager
Re: pg_stats and range statistics
On 23.01.2023 13:01, Egor Rogov wrote: On 23.01.2023 02:21, Tomas Vondra wrote: On 1/22/23 22:33, Justin Pryzby wrote: On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote: On 1/21/23 19:53, Egor Rogov wrote: Hi Tomas, On 21.01.2023 00:50, Tomas Vondra wrote: This simply adds two functions, accepting/producing anyarray - one for lower bounds, one for upper bounds. I don't think it can be done with a plain subquery (or at least I don't know how). Anyarray is an alien to SQL, so functions are well justified here. What makes me a bit uneasy is two almost identical functions. Should we consider other options like a function with an additional parameter or a function returning an array of bounds arrays (which is somewhat wasteful, but probably it doesn't matter much here)? I thought about that, but I think the alternatives (e.g. a single function with a parameter determining which boundary to return). But I don't think it's better. What about a common function, maybe called like: ranges_upper_bounds(PG_FUNCTION_ARGS) { AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0); Oid element_type = AARR_ELEMTYPE(array); TypeCacheEntry *typentry; /* Get information about range type; note column might be a domain */ typentry = range_get_typcache(fcinfo, getBaseType(element_type)); return ranges_bounds_common(typentry, array, false); } That saves 40 LOC. Thanks, that's better. But I'm still not sure it's a good idea to add function with anyarray argument, when we need it to be an array of ranges ... I wonder if we have other functions doing something similar, i.e. accepting a polymorphic type and then imposing additional restrictions on it. I couldn't find such examples, but adding an adhoc polymorphic type just doesn't look right for me. Besides, you'll end up adding not just anyrangearray type, but also anymultirangearray, anycompatiblerangearray, anycompatiblemultirangearray, and maybe their "non"-counterparts like anynonrangearray, and all of these are not of much use. And one day you may need an array of arrays or something... I wonder if it's possible to teach SQL to work with anyarray type - at runtime the actual type of anyarray elements is known, right? In fact, unnest() alone is enough to eliminate the need of C functions altogether. When started to look at how we deal with anyarray columns, I came across the following comment in parse_coerce.c for enforce_generic_type_consistency(): * A special case is that we could see ANYARRAY as an actual_arg_type even * when allow_poly is false (this is possible only because pg_statistic has * columns shown as anyarray in the catalogs). It makes me realize how anyarray as-a-real-type is specific to pg_statistic. Even if it's possible to somehow postpone type inference for this case from parse time to execute time, it clearly doesn't worth the effort. So, I am for the simplest possible approach, that is, the two proposed functions ranges_upper(anyarray) and ranges_lower(anyarray). I am not even sure if it's worth documenting them, as they are very pg_statistic-specific and likely won't be useful for end users. Shouldn't this add some sql tests ? Yeah, I guess we should have a couple tests calling these functions on different range arrays. This reminds me lower()/upper() have some extra rules about handling empty ranges / infinite boundaries etc. These functions should behave consistently (as if we called lower() in a loop) and I'm pretty sure that's not the current state. I can try to tidy things up, but first we need to decide on the general approach.
Re: pg_stats and range statistics
Hi, On 23.01.2023 02:21, Tomas Vondra wrote: On 1/22/23 22:33, Justin Pryzby wrote: On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote: On 1/21/23 19:53, Egor Rogov wrote: Hi Tomas, On 21.01.2023 00:50, Tomas Vondra wrote: This simply adds two functions, accepting/producing anyarray - one for lower bounds, one for upper bounds. I don't think it can be done with a plain subquery (or at least I don't know how). Anyarray is an alien to SQL, so functions are well justified here. What makes me a bit uneasy is two almost identical functions. Should we consider other options like a function with an additional parameter or a function returning an array of bounds arrays (which is somewhat wasteful, but probably it doesn't matter much here)? I thought about that, but I think the alternatives (e.g. a single function with a parameter determining which boundary to return). But I don't think it's better. What about a common function, maybe called like: ranges_upper_bounds(PG_FUNCTION_ARGS) { AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0); Oid element_type = AARR_ELEMTYPE(array); TypeCacheEntry *typentry; /* Get information about range type; note column might be a domain */ typentry = range_get_typcache(fcinfo, getBaseType(element_type)); return ranges_bounds_common(typentry, array, false); } That saves 40 LOC. Thanks, that's better. But I'm still not sure it's a good idea to add function with anyarray argument, when we need it to be an array of ranges ... I wonder if we have other functions doing something similar, i.e. accepting a polymorphic type and then imposing additional restrictions on it. I couldn't find such examples, but adding an adhoc polymorphic type just doesn't look right for me. Besides, you'll end up adding not just anyrangearray type, but also anymultirangearray, anycompatiblerangearray, anycompatiblemultirangearray, and maybe their "non"-counterparts like anynonrangearray, and all of these are not of much use. And one day you may need an array of arrays or something... I wonder if it's possible to teach SQL to work with anyarray type - at runtime the actual type of anyarray elements is known, right? In fact, unnest() alone is enough to eliminate the need of C functions altogether. Shouldn't this add some sql tests ? Yeah, I guess we should have a couple tests calling these functions on different range arrays. This reminds me lower()/upper() have some extra rules about handling empty ranges / infinite boundaries etc. These functions should behave consistently (as if we called lower() in a loop) and I'm pretty sure that's not the current state. I can try to tidy things up, but first we need to decide on the general approach. regards
Re: pg_stats and range statistics
On 1/22/23 22:33, Justin Pryzby wrote: > On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote: >> On 1/21/23 19:53, Egor Rogov wrote: >>> Hi Tomas, >>> On 21.01.2023 00:50, Tomas Vondra wrote: This simply adds two functions, accepting/producing anyarray - one for lower bounds, one for upper bounds. I don't think it can be done with a plain subquery (or at least I don't know how). >>> >>> Anyarray is an alien to SQL, so functions are well justified here. What >>> makes me a bit uneasy is two almost identical functions. Should we >>> consider other options like a function with an additional parameter or a >>> function returning an array of bounds arrays (which is somewhat >>> wasteful, but probably it doesn't matter much here)? >>> >> >> I thought about that, but I think the alternatives (e.g. a single >> function with a parameter determining which boundary to return). But I >> don't think it's better. > > What about a common function, maybe called like: > > ranges_upper_bounds(PG_FUNCTION_ARGS) > { > AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0); > Oid element_type = AARR_ELEMTYPE(array); > TypeCacheEntry *typentry; > > /* Get information about range type; note column might be a domain */ > typentry = range_get_typcache(fcinfo, getBaseType(element_type)); > > return ranges_bounds_common(typentry, array, false); > } > > That saves 40 LOC. > Thanks, that's better. But I'm still not sure it's a good idea to add function with anyarray argument, when we need it to be an array of ranges ... I wonder if we have other functions doing something similar, i.e. accepting a polymorphic type and then imposing additional restrictions on it. > Shouldn't this add some sql tests ? > Yeah, I guess we should have a couple tests calling these functions on different range arrays. This reminds me lower()/upper() have some extra rules about handling empty ranges / infinite boundaries etc. These functions should behave consistently (as if we called lower() in a loop) and I'm pretty sure that's not the current state. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_stats and range statistics
On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote: > On 1/21/23 19:53, Egor Rogov wrote: > > Hi Tomas, > > On 21.01.2023 00:50, Tomas Vondra wrote: > >> This simply adds two functions, accepting/producing anyarray - one for > >> lower bounds, one for upper bounds. I don't think it can be done with a > >> plain subquery (or at least I don't know how). > > > > Anyarray is an alien to SQL, so functions are well justified here. What > > makes me a bit uneasy is two almost identical functions. Should we > > consider other options like a function with an additional parameter or a > > function returning an array of bounds arrays (which is somewhat > > wasteful, but probably it doesn't matter much here)? > > > > I thought about that, but I think the alternatives (e.g. a single > function with a parameter determining which boundary to return). But I > don't think it's better. What about a common function, maybe called like: ranges_upper_bounds(PG_FUNCTION_ARGS) { AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0); Oid element_type = AARR_ELEMTYPE(array); TypeCacheEntry *typentry; /* Get information about range type; note column might be a domain */ typentry = range_get_typcache(fcinfo, getBaseType(element_type)); return ranges_bounds_common(typentry, array, false); } That saves 40 LOC. Shouldn't this add some sql tests ? -- Justin
Re: pg_stats and range statistics
On 1/21/23 19:53, Egor Rogov wrote: > Hi Tomas, > > On 21.01.2023 00:50, Tomas Vondra wrote: >> Hi Egor, >> >> While reviewing a patch improving join estimates for ranges [1] I >> realized we don't show stats for ranges in pg_stats, and I recalled we >> had this patch. >> >> I rebased the v2, and I decided to took a stab at showing separate >> histograms for lower/upper histogram bounds. I believe it makes it way >> more readable, which is what pg_stats is about IMHO. > > > Thanks for looking into this. > > I have to admit it looks much better this way, so +1. > OK, good to hear. > >> This simply adds two functions, accepting/producing anyarray - one for >> lower bounds, one for upper bounds. I don't think it can be done with a >> plain subquery (or at least I don't know how). > > > Anyarray is an alien to SQL, so functions are well justified here. What > makes me a bit uneasy is two almost identical functions. Should we > consider other options like a function with an additional parameter or a > function returning an array of bounds arrays (which is somewhat > wasteful, but probably it doesn't matter much here)? > I thought about that, but I think the alternatives (e.g. a single function with a parameter determining which boundary to return). But I don't think it's better. Moreover, I think this is pretty similar to lower/upper, which already work on range values. So if we have separate functions for that, we should do the same thing here. I renamed the functions to ranges_lower/ranges_upper, but maybe why not to even call the functions lower/upper too? The main trouble with the function I can think of is that we only have anyarray type, not anyrangearray. So the functions will get called for arbitrary array, and the check that it's array of ranges happens inside. I'm not sure if that's a good or bad idea, or what would it take to add a new polymorphic type ... For now I at least kept "ranges_" to make it less likely. > >> Finally, it renames the empty_range_frac to start with range_, per the >> earlier discussion. I wonder if the new column names for lower/upper >> bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are >> too long ... > > > It seems so. The ending -s should be left out since it's a single > histogram now. And I think that > range_lower_histogram/range_upper_histogram are descriptive enough. > > I'm adding one more patch to shorten the column names, refresh the docs, > and make 'make check' happy (unfortunately, we have to edit > src/regress/expected/rules.out every time pg_stats definition changes). > Thanks. I noticed the docs were added to pg_user_mapping by mistake, not to pg_stats. So I fixed that, and I also added the new functions. Finally, I reordered the fields a bit - moved range_empty_frac to keep the histogram fields together. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom 8049d2b7e5d1636d5fb2b7d421d6b29a39389fb3 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Fri, 20 Jan 2023 20:50:41 +0100 Subject: [PATCH 1/2] Display length and bounds histograms in pg_stats Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these slot kinds were introduced in 918eee0c49. This commit adds the missing fields to pg_stats. TODO: catalog version bump --- doc/src/sgml/catalogs.sgml| 40 ++ src/backend/catalog/system_views.sql | 30 - src/backend/utils/adt/rangetypes_typanalyze.c | 118 ++ src/include/catalog/pg_proc.dat | 10 ++ src/include/catalog/pg_statistic.h| 3 + src/test/regress/expected/rules.out | 34 - 6 files changed, 233 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index c1e4048054..e0851c52b4 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9634,6 +9634,46 @@ SCRAM-SHA-256$:&l User mapping specific options, as keyword=value strings + + + + range_length_histogram anyarray + + + A histogram of the lengths of non-empty and non-null range values of a + range type column. (Null for non-range types.) + + + + + + range_empty_frac float4 + + + Fraction of column entries whose values are empty ranges. + (Null for non-range types.) + + + + + + range_lower_histogram anyarray + + + A histogram of lower bounds of non-empty and non-null range values. + (Null for non-range types.) + + + + + + range_upper_histogram anyarray + + + A histogram of upper bounds of non-empty and non-null range values. + (Null for non-range types.) + + diff --git a/src/backend/catalog/system_views.
Re: pg_stats and range statistics
Hi Tomas, On 21.01.2023 00:50, Tomas Vondra wrote: Hi Egor, While reviewing a patch improving join estimates for ranges [1] I realized we don't show stats for ranges in pg_stats, and I recalled we had this patch. I rebased the v2, and I decided to took a stab at showing separate histograms for lower/upper histogram bounds. I believe it makes it way more readable, which is what pg_stats is about IMHO. Thanks for looking into this. I have to admit it looks much better this way, so +1. This simply adds two functions, accepting/producing anyarray - one for lower bounds, one for upper bounds. I don't think it can be done with a plain subquery (or at least I don't know how). Anyarray is an alien to SQL, so functions are well justified here. What makes me a bit uneasy is two almost identical functions. Should we consider other options like a function with an additional parameter or a function returning an array of bounds arrays (which is somewhat wasteful, but probably it doesn't matter much here)? Finally, it renames the empty_range_frac to start with range_, per the earlier discussion. I wonder if the new column names for lower/upper bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are too long ... It seems so. The ending -s should be left out since it's a single histogram now. And I think that range_lower_histogram/range_upper_histogram are descriptive enough. I'm adding one more patch to shorten the column names, refresh the docs, and make 'make check' happy (unfortunately, we have to edit src/regress/expected/rules.out every time pg_stats definition changes). regards [1] https://commitfest.postgresql.org/41/3821/ From b339c0eab5616cec61e9d9e85398034861608d30 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Fri, 20 Jan 2023 20:50:41 +0100 Subject: [PATCH 1/3] Display length and bounds histograms in pg_stats Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these slot kinds were introduced in 918eee0c49. This commit adds the missing fields to pg_stats. TODO: catalog version bump --- doc/src/sgml/catalogs.sgml | 32 src/backend/catalog/system_views.sql | 23 +++- src/include/catalog/pg_statistic.h | 3 +++ src/test/regress/expected/rules.out | 26 +- 4 files changed, 82 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index c1e4048054e..c8bd84c56eb 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9634,6 +9634,38 @@ SCRAM-SHA-256$:&l User mapping specific options, as keyword=value strings + + + + range_length_histogram anyarray + + + A histogram of the lengths of non-empty and non-null range values of a + range type column. (Null for non-range types.) + + + + + + empty_range_frac float4 + + + Fraction of column entries whose values are empty ranges. + (Null for non-range types.) + + + + + + range_bounds_histograms anyarray + + + Histograms of lower and upper bounds of non-empty, non-null ranges, + combined into a single array of range values. The lower and upper bounds + of each value correspond to the histograms of lower and upper bounds + respectively. (Null for non-range types.) + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8608e3fa5b1..ccd6c7ffdb7 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -243,7 +243,28 @@ CREATE VIEW pg_stats WITH (security_barrier) AS WHEN stakind3 = 5 THEN stanumbers3 WHEN stakind4 = 5 THEN stanumbers4 WHEN stakind5 = 5 THEN stanumbers5 -END AS elem_count_histogram +END AS elem_count_histogram, +CASE +WHEN stakind1 = 6 THEN stavalues1 +WHEN stakind2 = 6 THEN stavalues2 +WHEN stakind3 = 6 THEN stavalues3 +WHEN stakind4 = 6 THEN stavalues4 +WHEN stakind5 = 6 THEN stavalues5 +END AS range_length_histogram, +CASE +WHEN stakind1 = 6 THEN stanumbers1[1] +WHEN stakind2 = 6 THEN stanumbers2[1] +WHEN stakind3 = 6 THEN stanumbers3[1] +WHEN stakind4 = 6 THEN stanumbers4[1] +WHEN stakind5 = 6 THEN stanumbers5[1] +END AS empty_range_frac, +CASE +WHEN stakind1 = 7 THEN stavalues1 +WHEN stakind2 = 7 THEN stavalues2 +WHEN stakind3 = 7 THEN stavalues3 +WHEN stakind4 = 7 THEN stavalues4 +WHEN stakind5 = 7 THEN stavalues5 +END AS range_bounds_histograms FROM pg_statistic s JOIN pg_class c ON (c.oi
Re: pg_stats and range statistics
Hi Egor, While reviewing a patch improving join estimates for ranges [1] I realized we don't show stats for ranges in pg_stats, and I recalled we had this patch. I rebased the v2, and I decided to took a stab at showing separate histograms for lower/upper histogram bounds. I believe it makes it way more readable, which is what pg_stats is about IMHO. This simply adds two functions, accepting/producing anyarray - one for lower bounds, one for upper bounds. I don't think it can be done with a plain subquery (or at least I don't know how). Finally, it renames the empty_range_frac to start with range_, per the earlier discussion. I wonder if the new column names for lower/upper bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are too long ... regards [1] https://commitfest.postgresql.org/41/3821/ -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL CompanyFrom b339c0eab5616cec61e9d9e85398034861608d30 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Fri, 20 Jan 2023 20:50:41 +0100 Subject: [PATCH 1/3] Display length and bounds histograms in pg_stats Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these slot kinds were introduced in 918eee0c49. This commit adds the missing fields to pg_stats. TODO: catalog version bump --- doc/src/sgml/catalogs.sgml | 32 src/backend/catalog/system_views.sql | 23 +++- src/include/catalog/pg_statistic.h | 3 +++ src/test/regress/expected/rules.out | 26 +- 4 files changed, 82 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index c1e4048054e..c8bd84c56eb 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9634,6 +9634,38 @@ SCRAM-SHA-256$:&l User mapping specific options, as keyword=value strings + + + + range_length_histogram anyarray + + + A histogram of the lengths of non-empty and non-null range values of a + range type column. (Null for non-range types.) + + + + + + empty_range_frac float4 + + + Fraction of column entries whose values are empty ranges. + (Null for non-range types.) + + + + + + range_bounds_histograms anyarray + + + Histograms of lower and upper bounds of non-empty, non-null ranges, + combined into a single array of range values. The lower and upper bounds + of each value correspond to the histograms of lower and upper bounds + respectively. (Null for non-range types.) + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8608e3fa5b1..ccd6c7ffdb7 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -243,7 +243,28 @@ CREATE VIEW pg_stats WITH (security_barrier) AS WHEN stakind3 = 5 THEN stanumbers3 WHEN stakind4 = 5 THEN stanumbers4 WHEN stakind5 = 5 THEN stanumbers5 -END AS elem_count_histogram +END AS elem_count_histogram, +CASE +WHEN stakind1 = 6 THEN stavalues1 +WHEN stakind2 = 6 THEN stavalues2 +WHEN stakind3 = 6 THEN stavalues3 +WHEN stakind4 = 6 THEN stavalues4 +WHEN stakind5 = 6 THEN stavalues5 +END AS range_length_histogram, +CASE +WHEN stakind1 = 6 THEN stanumbers1[1] +WHEN stakind2 = 6 THEN stanumbers2[1] +WHEN stakind3 = 6 THEN stanumbers3[1] +WHEN stakind4 = 6 THEN stanumbers4[1] +WHEN stakind5 = 6 THEN stanumbers5[1] +END AS empty_range_frac, +CASE +WHEN stakind1 = 7 THEN stavalues1 +WHEN stakind2 = 7 THEN stavalues2 +WHEN stakind3 = 7 THEN stavalues3 +WHEN stakind4 = 7 THEN stavalues4 +WHEN stakind5 = 7 THEN stavalues5 +END AS range_bounds_histograms FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid) JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum) LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace) diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h index 8770c5b4c60..10401dece0d 100644 --- a/src/include/catalog/pg_statistic.h +++ b/src/include/catalog/pg_statistic.h @@ -152,6 +152,9 @@ DECLARE_FOREIGN_KEY((starelid, staattnum), pg_attribute, (attrelid, attnum)); * data "kind" will appear in any particular slot. Instead, search the * stakind fields to see if the desired data is available. (The standard * function get_attstatsslot() may be used for this.) + * + * Note: The pg_stats view needs to be modified whenever a new slot kind is + * added to core. */ /* dif
Re: pg_stats and range statistics
Hi Tomas, On 12.07.2021 16:04, Tomas Vondra wrote: On 7/12/21 1:10 PM, Egor Rogov wrote: Hi, thanks for the review and corrections. On 11.07.2021 21:54, Soumyadeep Chakraborty wrote: Hello, This should have been added with [1]. Excerpt from the documentation: "pg_stats is also designed to present the information in a more readable format than the underlying catalog — at the cost that its schema must be extended whenever new slot types are defined for pg_statistic." [2] So, I added a reminder in pg_statistic.h. Good point. Attached is v2 of this patch with some cosmetic changes. I wonder why "TODO: catalog version bump"? This patch doesn't change catalog structure, or I miss something? It changes system_views.sql, which is catalog change, as it redefines the pg_stats system view (it adds 3 more columns). So it changes what you get after initdb, hence catversion has to be bumped. Renamed the columns a bit and updated the docs to be a bit more descriptive. (range_length_empty_frac -> empty_range_frac, range_bounds_histogram -> range_bounds_histograms) I intended to make the same prefix ("range_") for all columns concerned with range types, although I'm fine with the proposed naming. Yeah, I'd vote to change empty_range_frac -> range_empty_frac. One question: We do have the option of representing the histogram of lower bounds separately from the histogram of upper bounds, as two separate view columns. Don't know if there is much utility though and there is a fair bit of added complexity: see below. Thoughts? I thought about it too, and decided not to transform the underlying data structure. As far as I can see, pg_stats never employed such transformations. For example, STATISTIC_KIND_DECHIST is an array containing the histogram followed by the average in its last element. It is shown in pg_stats.elem_count_histogram as is, although it arguably may be splitted into two fields. All in all, I believe pg_stats's job is to "unpack" stavalues and stanumbers into meaningful fields, and not to try to go deeper than that. Not firm opinion, but the pg_stats is meant to be easier to read/understand for humans. So far the transformation were simple because all the data was fairly simple, but the range stuff may need more complex transformation. For example we do quite a bit more in pg_stats_ext views, because it deals with multi-column stats. In pg_stats_ext, yes, but not in pg_stats (at least until now). Since no one has expressed a strong desire for a more complex transformation, should we proceed with the proposed approach (with further renaming empty_range_frac -> range_empty_frac as you suggested)? Or should we wait more for someone to weigh in? regards
Re: pg_stats and range statistics
On 7/12/21 1:10 PM, Egor Rogov wrote: > Hi, > > thanks for the review and corrections. > > On 11.07.2021 21:54, Soumyadeep Chakraborty wrote: >> Hello, >> >> This should have been added with [1]. >> >> Excerpt from the documentation: >> "pg_stats is also designed to present the information in a more readable >> format than the underlying catalog — at the cost that its schema must >> be extended whenever new slot types are defined for pg_statistic." [2] >> >> So, I added a reminder in pg_statistic.h. > > Good point. > > >> Attached is v2 of this patch with some cosmetic changes. > > I wonder why "TODO: catalog version bump"? This patch doesn't change > catalog structure, or I miss something? > It changes system_views.sql, which is catalog change, as it redefines the pg_stats system view (it adds 3 more columns). So it changes what you get after initdb, hence catversion has to be bumped. > >> Renamed the columns a >> bit and updated the docs to be a bit more descriptive. >> (range_length_empty_frac -> empty_range_frac, range_bounds_histogram -> >> range_bounds_histograms) > > I intended to make the same prefix ("range_") for all columns concerned > with range types, although I'm fine with the proposed naming. > Yeah, I'd vote to change empty_range_frac -> range_empty_frac. > >> One question: >> >> We do have the option of representing the histogram of lower bounds >> separately >> from the histogram of upper bounds, as two separate view columns. >> Don't know if >> there is much utility though and there is a fair bit of added >> complexity: see >> below. Thoughts? > > I thought about it too, and decided not to transform the underlying data > structure. As far as I can see, pg_stats never employed such > transformations. For example, STATISTIC_KIND_DECHIST is an array > containing the histogram followed by the average in its last element. It > is shown in pg_stats.elem_count_histogram as is, although it arguably > may be splitted into two fields. All in all, I believe pg_stats's job is > to "unpack" stavalues and stanumbers into meaningful fields, and not to > try to go deeper than that. > Not firm opinion, but the pg_stats is meant to be easier to read/understand for humans. So far the transformation were simple because all the data was fairly simple, but the range stuff may need more complex transformation. For example we do quite a bit more in pg_stats_ext views, because it deals with multi-column stats. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_stats and range statistics
Hi, thanks for the review and corrections. On 11.07.2021 21:54, Soumyadeep Chakraborty wrote: Hello, This should have been added with [1]. Excerpt from the documentation: "pg_stats is also designed to present the information in a more readable format than the underlying catalog — at the cost that its schema must be extended whenever new slot types are defined for pg_statistic." [2] So, I added a reminder in pg_statistic.h. Good point. Attached is v2 of this patch with some cosmetic changes. I wonder why "TODO: catalog version bump"? This patch doesn't change catalog structure, or I miss something? Renamed the columns a bit and updated the docs to be a bit more descriptive. (range_length_empty_frac -> empty_range_frac, range_bounds_histogram -> range_bounds_histograms) I intended to make the same prefix ("range_") for all columns concerned with range types, although I'm fine with the proposed naming. One question: We do have the option of representing the histogram of lower bounds separately from the histogram of upper bounds, as two separate view columns. Don't know if there is much utility though and there is a fair bit of added complexity: see below. Thoughts? I thought about it too, and decided not to transform the underlying data structure. As far as I can see, pg_stats never employed such transformations. For example, STATISTIC_KIND_DECHIST is an array containing the histogram followed by the average in its last element. It is shown in pg_stats.elem_count_histogram as is, although it arguably may be splitted into two fields. All in all, I believe pg_stats's job is to "unpack" stavalues and stanumbers into meaningful fields, and not to try to go deeper than that. My attempts via SQL (unnest -> lower|upper -> array_agg) were futile given unnest does not play nice with anyarray. For instance: select unnest(stavalues1) from pg_statistic; ERROR: cannot determine element type of "anyarray" argument Maybe the only option is to write a UDF pg_get_{lower|upper}_bounds_histogram which can do something similar to what calc_hist_selectivity does: /* * Convert histogram of ranges into histograms of its lower and upper * bounds. */ nhist = hslot.nvalues; hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist); hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist); for (i = 0; i < nhist; i++) { bool empty; range_deserialize(rng_typcache, DatumGetRangeTypeP(hslot.values[i]), &hist_lower[i], &hist_upper[i], &empty); /* The histogram should not contain any empty ranges */ if (empty) elog(ERROR, "bounds histogram contains an empty range"); } This is looking good and ready. [1] https://github.com/postgres/postgres/commit/918eee0c497c88260a2e107318843c9b1947bc6f [2] https://www.postgresql.org/docs/devel/view-pg-stats.html Regards, Soumyadeep (VMware)
Re: pg_stats and range statistics
Hello, This should have been added with [1]. Excerpt from the documentation: "pg_stats is also designed to present the information in a more readable format than the underlying catalog — at the cost that its schema must be extended whenever new slot types are defined for pg_statistic." [2] So, I added a reminder in pg_statistic.h. Attached is v2 of this patch with some cosmetic changes. Renamed the columns a bit and updated the docs to be a bit more descriptive. (range_length_empty_frac -> empty_range_frac, range_bounds_histogram -> range_bounds_histograms) One question: We do have the option of representing the histogram of lower bounds separately from the histogram of upper bounds, as two separate view columns. Don't know if there is much utility though and there is a fair bit of added complexity: see below. Thoughts? My attempts via SQL (unnest -> lower|upper -> array_agg) were futile given unnest does not play nice with anyarray. For instance: select unnest(stavalues1) from pg_statistic; ERROR: cannot determine element type of "anyarray" argument Maybe the only option is to write a UDF pg_get_{lower|upper}_bounds_histogram which can do something similar to what calc_hist_selectivity does: /* * Convert histogram of ranges into histograms of its lower and upper * bounds. */ nhist = hslot.nvalues; hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist); hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist); for (i = 0; i < nhist; i++) { bool empty; range_deserialize(rng_typcache, DatumGetRangeTypeP(hslot.values[i]), &hist_lower[i], &hist_upper[i], &empty); /* The histogram should not contain any empty ranges */ if (empty) elog(ERROR, "bounds histogram contains an empty range"); } This is looking good and ready. [1] https://github.com/postgres/postgres/commit/918eee0c497c88260a2e107318843c9b1947bc6f [2] https://www.postgresql.org/docs/devel/view-pg-stats.html Regards, Soumyadeep (VMware) From 1f2ed0911eb3f7a53b16047cefb499692daf5ef6 Mon Sep 17 00:00:00 2001 From: Egor Rogov Date: Sun, 11 Jul 2021 11:09:45 -0700 Subject: [PATCH v2 1/1] Display length and bounds histograms in pg_stats Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these slot kinds were introduced in 918eee0c49. This commit adds the missing fields to pg_stats. TODO: catalog version bump --- doc/src/sgml/catalogs.sgml | 32 src/backend/catalog/system_views.sql | 23 +++- src/include/catalog/pg_statistic.h | 3 +++ src/test/regress/expected/rules.out | 26 +- 4 files changed, 82 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index f517a7d4af..7168ff9a13 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -12877,6 +12877,38 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx non-null elements. (Null for scalar types.) + + + + range_length_histogram anyarray + + + A histogram of the lengths of non-empty and non-null range values of a + range type column. (Null for non-range types.) + + + + + + empty_range_frac float4 + + + Fraction of column entries whose values are empty ranges. + (Null for non-range types.) + + + + + + range_bounds_histograms anyarray + + + Histograms of lower and upper bounds of non-empty, non-null ranges, + combined into a single array of range values. The lower and upper bounds + of each value correspond to the histograms of lower and upper bounds + respectively. (Null for non-range types.) + + diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 999d984068..d8bc622ad5 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -243,7 +243,28 @@ CREATE VIEW pg_stats WITH (security_barrier) AS WHEN stakind3 = 5 THEN stanumbers3 WHEN stakind4 = 5 THEN stanumbers4 WHEN stakind5 = 5 THEN stanumbers5 -END AS elem_count_histogram +END AS elem_count_histogram, +CASE +WHEN stakind1 = 6 THEN stavalues1 +WHEN stakind2 = 6 THEN stavalues2 +WHEN stakind3 = 6 THEN stavalues3 +WHEN stakind4 = 6 THEN stavalues4 +WHEN stakind5 = 6 THEN stavalues5 +END AS range_length_histogram, +CASE +WHEN stakind1 = 6 THEN stanumbers1[1] +WHEN stakind2 = 6 THEN stanumbers2[1] +WHEN stakind3 = 6 THEN stanumbers3[1] +WHEN stakind4 = 6 THEN stanumbers4[1] +WHEN stakind5 = 6 THEN stanumbers5[1] +END AS empty_range_frac, +CASE +WHEN stakind1 = 7 THEN stavalues1 +
Re: pg_stats and range statistics
On 6/18/21 6:22 PM, Egor Rogov wrote: Hi, Statistics for range types are not currently exposed in pg_stats view (i.e. STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and STATISTIC_KIND_BOUNDS_HISTOGRAM). Shouldn't they? If so, here is a patch for adding them. I think they should be exposed - I don't see why not to do that. I noticed this when working on the count-min sketch experiment too, so thanks for this patch. FWIW I've added the patch to the next CF: https://commitfest.postgresql.org/33/3184/ regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company