Re: pg_stats and range statistics

2023-11-25 Thread Alexander Korotkov
On Sat, Nov 25, 2023 at 11:14 AM Egor Rogov  wrote:
>
> Hi Alexander,
>
> On 25.11.2023 02:06, Alexander Korotkov wrote:
> >
> > In conclusion of all of the above, I decided to revise the patch and
> > show the bounds histogram as it's stored in pg_statistic.  I revised
> > the docs correspondingly.
>
>
> So basically we returned to what it all has started from? I guess it's
> better than nothing, although I have to admit that two-array
> representation is much more readable. Unfortunately it brings in a
> surprising amount of complexity.

Yep, it is.

> Anyway, thanks for looking into this!

And thank you for the feedback!

--
Regards,
Alexander Korotkov




Re: pg_stats and range statistics

2023-11-25 Thread Alexander Korotkov
On Sat, Nov 25, 2023 at 10:58 AM jian he  wrote:
> On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov  
> wrote:
> >
> > Hi!
> > Additionally, I found that the current patch can't handle infinite
> > range bounds and discards information about inclusiveness of range
> > bounds.  The infinite bounds could be represented as NULL (while I'm
> > not sure how good this representation is).  Regarding inclusiveness, I
> > don't see the possibility to represent them in a reasonable way within
> > an array of base types.  I also don't feel good about discarding the
> > accuracy in the pg_stats view.
> >
>
> in range_length_histogram, maybe we can document that when calculating
> the length of a range, inclusiveness will be true.

I've revised the patchset.  Edited comment in pg_statistic.h as you
proposed.  And I've added to the documentation a short note on how the
range length histogram is calculated.

--
Regards,
Alexander Korotkov


0001-Update-comments-for-pg_statistic-catalog--20231125-2.patch
Description: Binary data


0002-Display-length-and-bounds-histograms-in-p-20231125-2.patch
Description: Binary data


Re: pg_stats and range statistics

2023-11-25 Thread Egor Rogov

Hi Alexander,

On 25.11.2023 02:06, Alexander Korotkov wrote:


In conclusion of all of the above, I decided to revise the patch and
show the bounds histogram as it's stored in pg_statistic.  I revised
the docs correspondingly.



So basically we returned to what it all has started from? I guess it's 
better than nothing, although I have to admit that two-array 
representation is much more readable. Unfortunately it brings in a 
surprising amount of complexity.


Anyway, thanks for looking into this!






Re: pg_stats and range statistics

2023-11-25 Thread jian he
On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov  wrote:
>
> Hi!
> Additionally, I found that the current patch can't handle infinite
> range bounds and discards information about inclusiveness of range
> bounds.  The infinite bounds could be represented as NULL (while I'm
> not sure how good this representation is).  Regarding inclusiveness, I
> don't see the possibility to represent them in a reasonable way within
> an array of base types.  I also don't feel good about discarding the
> accuracy in the pg_stats view.
>

in range_length_histogram, maybe we can document that when calculating
the length of a range, inclusiveness will be true.




Re: pg_stats and range statistics

2023-11-25 Thread jian he
On Sat, Nov 25, 2023 at 7:06 AM Alexander Korotkov  wrote:
>
> Hi!
>
> I'm going to push this if there are no objections.
>
> --
> Regards,
> Alexander Korotkov

src/include/catalog/pg_statistic.h
268:  * range type's subdiff function. Only non-null rows are considered.

should it be:  * range type's subdiff function. Only non-null,
non-empty rows are considered.

Other than that, it looks fine to me.




Re: pg_stats and range statistics

2023-11-24 Thread Alexander Korotkov
Hi!

On Wed, Sep 6, 2023 at 6:18 PM jian he  wrote:
> +lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
> should be
> +
> ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
>
> +upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
> should be
> +
> ranges_upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
>
> https://www.postgresql.org/docs/current/catalog-pg-type.html
> there is no association between numrange and their base type numeric.
> so for template: anyarray ranges_lower(anyarray). I don't think we can
> input numrange array and return a numeric array.
>
> https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
> >> When the return value of a function is declared as a polymorphic type, 
> >> there must be at least one argument position that is also >> polymorphic, 
> >> and the actual data type(s) supplied for the polymorphic arguments 
> >> determine the actual result type for that call.
>
>
> regression=# select
> ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4),
> numrange(5.5,6.6)]);
>  ranges_lower
> ---
>  {1.1,3.3,5.5}
> (1 row)
> regression=# \gdesc
> Column|Type
> --+
>  ranges_lower | numrange[]
> (1 row)
>
> I don't think you can cast literal ' {1.1,3.3,5.5}' to numrange[].

Thank you for noticing this.  Indeed, our polymorphic type system
doesn't support this case.  In order to support this, we need
something like "anyrangearray" pseudo-type.  However, it seems
overkill to introduce a new pseudo-type just to update pg_stats.

Additionally, I found that the current patch can't handle infinite
range bounds and discards information about inclusiveness of range
bounds.  The infinite bounds could be represented as NULL (while I'm
not sure how good this representation is).  Regarding inclusiveness, I
don't see the possibility to represent them in a reasonable way within
an array of base types.  I also don't feel good about discarding the
accuracy in the pg_stats view.

In conclusion of all of the above, I decided to revise the patch and
show the bounds histogram as it's stored in pg_statistic.  I revised
the docs correspondingly.

Also for some reason, the patch added description of new columns to
the documentation of pg_user_mapping table.  I've fixed that by moving
them to the documentation of pg_stats view.

Also, I've extracted the new comment in pg_statistic.h into a separate patch.

I'm going to push this if there are no objections.

--
Regards,
Alexander Korotkov


0001-Add-comment-to-pg_statistic-catalog-table-20231125.patch
Description: Binary data


0002-Display-length-and-bounds-histograms-in-pg_-20231125.patch
Description: Binary data


Re: pg_stats and range statistics

2023-09-06 Thread jian he
hi. I played around with the 2023-Apr 4 latest patch.

+lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
should be
+
ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])

+upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
should be
+
ranges_upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])

https://www.postgresql.org/docs/current/catalog-pg-type.html
there is no association between numrange and their base type numeric.
so for template: anyarray ranges_lower(anyarray). I don't think we can
input numrange array and return a numeric array.

https://www.postgresql.org/docs/current/extend-type-system.html#EXTEND-TYPES-POLYMORPHIC
>> When the return value of a function is declared as a polymorphic type, there 
>> must be at least one argument position that is also >> polymorphic, and the 
>> actual data type(s) supplied for the polymorphic arguments determine the 
>> actual result type for that call.


regression=# select
ranges_lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4),
numrange(5.5,6.6)]);
 ranges_lower
---
 {1.1,3.3,5.5}
(1 row)
regression=# \gdesc
Column|Type
--+
 ranges_lower | numrange[]
(1 row)

I don't think you can cast literal ' {1.1,3.3,5.5}' to numrange[].




Re: pg_stats and range statistics

2023-04-03 Thread Gregory Stark (as CFM)
On Fri, 24 Mar 2023 at 14:48, Egor Rogov  wrote:
>
> Done.

> There is one thing I'm not sure what to do about. This check:
>
>   if (typentry->typtype != TYPTYPE_RANGE)
>   ereport(ERROR,
>   (errcode(ERRCODE_DATATYPE_MISMATCH),
>errmsg("expected array of ranges")));
>
> doesn't work, because the range_get_typcache() call errors out first
> ("type %u is not a range type"). The message doesn't look friendly
> enough for user-faced SQL function. Should we duplicate
> range_get_typcache's logic and replace the error message?

> Okay. I've corrected the examples a bit.

It sounds like you've addressed Tomas's feedback and still have one
open question.

Fwiw I rebased it, it seemed to merge fine automatically.

I've updated the CF entry to Needs Review. But at this late date it
may have to wait until the next release.




-- 
Gregory Stark
As Commitfest Manager
From 87424880f1a970448979681684e6916f33567eeb Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Mon, 3 Apr 2023 17:04:11 -0400
Subject: [PATCH] pg_stats and range statistics

---
 doc/src/sgml/func.sgml|  36 ++
 doc/src/sgml/system-views.sgml|  40 +++
 src/backend/catalog/system_views.sql  |  30 -
 src/backend/utils/adt/rangetypes_typanalyze.c | 107 ++
 src/include/catalog/pg_proc.dat   |  10 ++
 src/test/regress/expected/rangetypes.out  |  22 
 src/test/regress/expected/rules.out   |  34 +-
 src/test/regress/sql/rangetypes.sql   |   7 ++
 8 files changed, 284 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 918a492234..548078a12e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19643,6 +19643,24 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ ranges_lower
+
+ranges_lower ( anyarray )
+anyarray
+   
+   
+Extracts lower bounds of ranges in the array (NULL if
+the range is empty or the lower bound is infinite).
+   
+   
+lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
+{1.1,3.3}
+   
+  
+
   

 
@@ -19661,6 +19679,24 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ ranges_upper
+
+ranges_upper ( anyarray )
+anyarray
+   
+   
+Extracts upper bounds of ranges (NULL if the
+range is empty or the upper bound is infinite).
+   
+   
+upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
+{2.2,4.4}
+   
+  
+
   

 
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index bb1a418450..d7760838ae 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -3784,6 +3784,46 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
non-null elements.  (Null for scalar types.)
   
  
+
+ 
+  
+   range_length_histogram anyarray
+  
+  
+   A histogram of the lengths of non-empty and non-null range values of a
+   range type column. (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_empty_frac float4
+  
+  
+   Fraction of column entries whose values are empty ranges.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_lower_histogram anyarray
+  
+  
+   A histogram of lower bounds of non-empty and non-null range values.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_upper_histogram anyarray
+  
+  
+   A histogram of upper bounds of non-empty and non-null range values.
+   (Null for non-range types.)
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 574cbc2e44..3fb7ee448f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -243,7 +243,35 @@ CREATE VIEW pg_stats WITH (security_barrier) AS
 WHEN stakind3 = 5 THEN stanumbers3
 WHEN stakind4 = 5 THEN stanumbers4
 WHEN stakind5 = 5 THEN stanumbers5
-END AS elem_count_histogram
+END AS elem_count_histogram,
+CASE
+WHEN stakind1 = 6 THEN stanumbers1[1]
+WHEN stakind2 = 6 THEN stanumbers2[1]
+WHEN stakind3 = 6 THEN stanumbers3[1]
+WHEN stakind4 = 6 THEN stanumbers4[1]
+WHEN stakind5 = 6 THEN stanumbers5[1]
+END AS range_empty_frac,
+CASE
+WHEN stakind1 = 6 THEN stavalues1
+WHEN stakind2 = 6 THEN stavalues2
+WHEN stakind3 = 6 THEN stavalues3
+WHEN stakind4 = 6 THEN stavalues4
+WHEN stakind5 = 6 THEN stavalues5
+END AS range_length_histogram,
+CASE
+WHE

Re: pg_stats and range statistics

2023-03-24 Thread Egor Rogov

On 24.03.2023 01:46, Tomas Vondra wrote:



So if you could clean it up a bit, and do something about the two open
items I mentioned (a bunch of tests on different array,



I've added some tests to resgress/sql/rangetypes.sql, based on the same 
dataset that is used to test lower() and upper().




and behavior
consistent with lower/upper),



Done. This required to switch from construct_array(), which doesn't 
support NULLs, to construct_md_array(), which does. A nice side effect 
is that now we also support multidimentional arrays.


I've moved a common part of ranges_lower_bounds() and 
ranges_upper_bounds() to ranges_bounds_common(), following Justin's advice.



There is one thing I'm not sure what to do about. This check:

 if (typentry->typtype != TYPTYPE_RANGE)
 ereport(ERROR,
 (errcode(ERRCODE_DATATYPE_MISMATCH),
  errmsg("expected array of ranges")));

doesn't work, because the range_get_typcache() call errors out first 
("type %u is not a range type"). The message doesn't look friendly 
enough for user-faced SQL function. Should we duplicate 
range_get_typcache's logic and replace the error message?




  that'd be great.


Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray)
functions? This approach is okay with me. Tomas, have you made up your
mind?


I think the function approach is fine, but in my January 22 message I
was wondering why we're not actually naming them simply lower/upper.



I'd expect from lower(anyarray) function to return the lowest element in 
the array. This name doesn't hint that the function takes an array of 
ranges. So, ranges_ prefix seems justified to me.






Do we want to document these functions? They are very
pg_statistic-specific and won't be useful for end users imo.


I don't see why not to document them. Sure, we're using them in a fairly
specific context, but I don't see why not to let people use them too
(which would be hard without docs).



Okay. I've corrected the examples a bit.

The patch is attached.


Thanks,
Egor
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0987eb805a..c4d86e1679 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19446,6 +19446,24 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ ranges_lower
+
+ranges_lower ( anyarray )
+anyarray
+   
+   
+Extracts lower bounds of ranges in the array (NULL 
if
+the range is empty or the lower bound is infinite).
+   
+   
+lower(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
+{1.1,3.3}
+   
+  
+
   

 
@@ -19464,6 +19482,24 @@ SELECT NULLIF(value, '(none)') ...

   
 
+  
+   
+
+ ranges_upper
+
+ranges_upper ( anyarray )
+anyarray
+   
+   
+Extracts upper bounds of ranges (NULL if the
+range is empty or the upper bound is infinite).
+   
+   
+upper(ARRAY[numrange(1.1,2.2),numrange(3.3,4.4)])
+{2.2,4.4}
+   
+  
+
   

 
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 7c8fc3f654..a48d775467 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -3783,6 +3783,46 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
non-null elements.  (Null for scalar types.)
   
  
+
+ 
+  
+   range_length_histogram anyarray
+  
+  
+   A histogram of the lengths of non-empty and non-null range values of a
+   range type column. (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_empty_frac float4
+  
+  
+   Fraction of column entries whose values are empty ranges.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_lower_histogram anyarray
+  
+  
+   A histogram of lower bounds of non-empty and non-null range values.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_upper_histogram anyarray
+  
+  
+   A histogram of upper bounds of non-empty and non-null range values.
+   (Null for non-range types.)
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 8ea159dbde..882344a2d7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -243,7 +243,35 @@ CREATE VIEW pg_stats WITH (security_barrier) AS
 WHEN stakind3 = 5 THEN stanumbers3
 WHEN stakind4 = 5 THEN stanumbers4
 WHEN stakind5 = 5 THEN stanumbers5
-END AS elem_count_histogram
+END AS elem_count_histogram,
+CASE
+WHEN stakind1 = 6 THEN stanumbers1[1]
+WHEN stakind2 = 6 THEN stanumbers2[1]
+WHEN stakind3 = 6 THEN stanumbers3[1]
+W

Re: pg_stats and range statistics

2023-03-23 Thread Tomas Vondra



On 3/20/23 20:54, Egor Rogov wrote:
> On 20.03.2023 22:27, Gregory Stark (as CFM) wrote:
>> On Sun, 22 Jan 2023 at 18:22, Tomas Vondra
>>  wrote:
>>> I wonder if we have other functions doing something similar, i.e.
>>> accepting a polymorphic type and then imposing additional restrictions
>>> on it.
>> Meh, there's things like array comparison functions that require both
>> arguments to be the same kind of arrays. And array_agg that requires
>> the elements to be the same type as the state array (ie, same type as
>> the first element). Not sure there are any taking just one specific
>> type though.
>>
 Shouldn't this add some sql tests ?
>>> Yeah, I guess we should have a couple tests calling these functions on
>>> different range arrays.
>>>
>>> This reminds me lower()/upper() have some extra rules about handling
>>> empty ranges / infinite boundaries etc. These functions should behave
>>> consistently (as if we called lower() in a loop) and I'm pretty sure
>>> that's not the current state.
>> Are we still waiting on these two items? Egor, do you think you'll
>> have a chance to work it for this month?
> 
> 
> I can try to tidy things up, but I'm not sure if we reached a consensus.
> 

We don't have any objections, and that's probably the best consensus we
can get here, I guess ...

So if you could clean it up a bit, and do something about the two open
items I mentioned (a bunch of tests on different array, and behavior
consistent with lower/upper), that'd be great.

> Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray)
> functions? This approach is okay with me. Tomas, have you made up your
> mind?
> 

I think the function approach is fine, but in my January 22 message I
was wondering why we're not actually naming them simply lower/upper.

> Do we want to document these functions? They are very
> pg_statistic-specific and won't be useful for end users imo.
> 

I don't see why not to document them. Sure, we're using them in a fairly
specific context, but I don't see why not to let people use them too
(which would be hard without docs).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_stats and range statistics

2023-03-20 Thread Egor Rogov

On 20.03.2023 22:27, Gregory Stark (as CFM) wrote:

On Sun, 22 Jan 2023 at 18:22, Tomas Vondra
 wrote:

I wonder if we have other functions doing something similar, i.e.
accepting a polymorphic type and then imposing additional restrictions
on it.

Meh, there's things like array comparison functions that require both
arguments to be the same kind of arrays. And array_agg that requires
the elements to be the same type as the state array (ie, same type as
the first element). Not sure there are any taking just one specific
type though.


Shouldn't this add some sql tests ?

Yeah, I guess we should have a couple tests calling these functions on
different range arrays.

This reminds me lower()/upper() have some extra rules about handling
empty ranges / infinite boundaries etc. These functions should behave
consistently (as if we called lower() in a loop) and I'm pretty sure
that's not the current state.

Are we still waiting on these two items? Egor, do you think you'll
have a chance to work it for this month?



I can try to tidy things up, but I'm not sure if we reached a consensus.

Do we stick with the ranges_upper(anyarray) and ranges_lower(anyarray) 
functions? This approach is okay with me. Tomas, have you made up your mind?


Do we want to document these functions? They are very 
pg_statistic-specific and won't be useful for end users imo.







Re: pg_stats and range statistics

2023-03-20 Thread Gregory Stark (as CFM)
On Sun, 22 Jan 2023 at 18:22, Tomas Vondra
 wrote:
>
> I wonder if we have other functions doing something similar, i.e.
> accepting a polymorphic type and then imposing additional restrictions
> on it.

Meh, there's things like array comparison functions that require both
arguments to be the same kind of arrays. And array_agg that requires
the elements to be the same type as the state array (ie, same type as
the first element). Not sure there are any taking just one specific
type though.

> > Shouldn't this add some sql tests ?
>
> Yeah, I guess we should have a couple tests calling these functions on
> different range arrays.
>
> This reminds me lower()/upper() have some extra rules about handling
> empty ranges / infinite boundaries etc. These functions should behave
> consistently (as if we called lower() in a loop) and I'm pretty sure
> that's not the current state.

Are we still waiting on these two items? Egor, do you think you'll
have a chance to work it for this month?

-- 
Gregory Stark
As Commitfest Manager




Re: pg_stats and range statistics

2023-01-23 Thread Egor Rogov

On 23.01.2023 13:01, Egor Rogov wrote:


On 23.01.2023 02:21, Tomas Vondra wrote:

On 1/22/23 22:33, Justin Pryzby wrote:

On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:

On 1/21/23 19:53, Egor Rogov wrote:

Hi Tomas,
On 21.01.2023 00:50, Tomas Vondra wrote:
This simply adds two functions, accepting/producing anyarray - 
one for
lower bounds, one for upper bounds. I don't think it can be done 
with a

plain subquery (or at least I don't know how).
Anyarray is an alien to SQL, so functions are well justified here. 
What

makes me a bit uneasy is two almost identical functions. Should we
consider other options like a function with an additional 
parameter or a

function returning an array of bounds arrays (which is somewhat
wasteful, but probably it doesn't matter much here)?


I thought about that, but I think the alternatives (e.g. a single
function with a parameter determining which boundary to return). But I
don't think it's better.

What about a common function, maybe called like:

ranges_upper_bounds(PG_FUNCTION_ARGS)
{
 AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
 Oid element_type = AARR_ELEMTYPE(array);
 TypeCacheEntry *typentry;

 /* Get information about range type; note column might be a 
domain */

 typentry = range_get_typcache(fcinfo, getBaseType(element_type));

 return ranges_bounds_common(typentry, array, false);
}

That saves 40 LOC.


Thanks, that's better. But I'm still not sure it's a good idea to add
function with anyarray argument, when we need it to be an array of
ranges ...

I wonder if we have other functions doing something similar, i.e.
accepting a polymorphic type and then imposing additional restrictions
on it.



I couldn't find such examples, but adding an adhoc polymorphic type 
just doesn't look right for me. Besides, you'll end up adding not just 
anyrangearray type, but also anymultirangearray, 
anycompatiblerangearray, anycompatiblemultirangearray, and maybe their 
"non"-counterparts like anynonrangearray, and all of these are not of 
much use. And one day you may need an array of arrays or something...


I wonder if it's possible to teach SQL to work with anyarray type - at 
runtime the actual type of anyarray elements is known, right? In fact, 
unnest() alone is enough to eliminate the need of C functions altogether.



When started to look at how we deal with anyarray columns, I came across 
the following comment in parse_coerce.c for 
enforce_generic_type_consistency():


* A special case is that we could see ANYARRAY as an actual_arg_type even
 * when allow_poly is false (this is possible only because pg_statistic has
 * columns shown as anyarray in the catalogs).

It makes me realize how anyarray as-a-real-type is specific to 
pg_statistic. Even if it's possible to somehow postpone type inference 
for this case from parse time to execute time, it clearly doesn't worth 
the effort.


So, I am for the simplest possible approach, that is, the two proposed 
functions ranges_upper(anyarray) and ranges_lower(anyarray). I am not 
even sure if it's worth documenting them, as they are very 
pg_statistic-specific and likely won't be useful for end users.







Shouldn't this add some sql tests ?


Yeah, I guess we should have a couple tests calling these functions on
different range arrays.

This reminds me lower()/upper() have some extra rules about handling
empty ranges / infinite boundaries etc. These functions should behave
consistently (as if we called lower() in a loop) and I'm pretty sure
that's not the current state.



I can try to tidy things up, but first we need to decide on the 
general approach.








Re: pg_stats and range statistics

2023-01-23 Thread Egor Rogov

Hi,

On 23.01.2023 02:21, Tomas Vondra wrote:


On 1/22/23 22:33, Justin Pryzby wrote:

On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:

On 1/21/23 19:53, Egor Rogov wrote:

Hi Tomas,
On 21.01.2023 00:50, Tomas Vondra wrote:

This simply adds two functions, accepting/producing anyarray - one for
lower bounds, one for upper bounds. I don't think it can be done with a
plain subquery (or at least I don't know how).

Anyarray is an alien to SQL, so functions are well justified here. What
makes me a bit uneasy is two almost identical functions. Should we
consider other options like a function with an additional parameter or a
function returning an array of bounds arrays (which is somewhat
wasteful, but probably it doesn't matter much here)?


I thought about that, but I think the alternatives (e.g. a single
function with a parameter determining which boundary to return). But I
don't think it's better.

What about a common function, maybe called like:

ranges_upper_bounds(PG_FUNCTION_ARGS)
{
 AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
 Oid element_type = AARR_ELEMTYPE(array);
 TypeCacheEntry *typentry;

 /* Get information about range type; note column might be a domain */
 typentry = range_get_typcache(fcinfo, getBaseType(element_type));

 return ranges_bounds_common(typentry, array, false);
}

That saves 40 LOC.


Thanks, that's better. But I'm still not sure it's a good idea to add
function with anyarray argument, when we need it to be an array of
ranges ...

I wonder if we have other functions doing something similar, i.e.
accepting a polymorphic type and then imposing additional restrictions
on it.



I couldn't find such examples, but adding an adhoc polymorphic type just 
doesn't look right for me. Besides, you'll end up adding not just 
anyrangearray type, but also anymultirangearray, 
anycompatiblerangearray, anycompatiblemultirangearray, and maybe their 
"non"-counterparts like anynonrangearray, and all of these are not of 
much use. And one day you may need an array of arrays or something...


I wonder if it's possible to teach SQL to work with anyarray type - at 
runtime the actual type of anyarray elements is known, right? In fact, 
unnest() alone is enough to eliminate the need of C functions altogether.




Shouldn't this add some sql tests ?


Yeah, I guess we should have a couple tests calling these functions on
different range arrays.

This reminds me lower()/upper() have some extra rules about handling
empty ranges / infinite boundaries etc. These functions should behave
consistently (as if we called lower() in a loop) and I'm pretty sure
that's not the current state.



I can try to tidy things up, but first we need to decide on the general 
approach.






regards






Re: pg_stats and range statistics

2023-01-22 Thread Tomas Vondra



On 1/22/23 22:33, Justin Pryzby wrote:
> On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:
>> On 1/21/23 19:53, Egor Rogov wrote:
>>> Hi Tomas,
>>> On 21.01.2023 00:50, Tomas Vondra wrote:
 This simply adds two functions, accepting/producing anyarray - one for
 lower bounds, one for upper bounds. I don't think it can be done with a
 plain subquery (or at least I don't know how).
>>>
>>> Anyarray is an alien to SQL, so functions are well justified here. What
>>> makes me a bit uneasy is two almost identical functions. Should we
>>> consider other options like a function with an additional parameter or a
>>> function returning an array of bounds arrays (which is somewhat
>>> wasteful, but probably it doesn't matter much here)?
>>>
>>
>> I thought about that, but I think the alternatives (e.g. a single
>> function with a parameter determining which boundary to return). But I
>> don't think it's better.
> 
> What about a common function, maybe called like:
> 
> ranges_upper_bounds(PG_FUNCTION_ARGS)
> {
> AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
> Oid element_type = AARR_ELEMTYPE(array);
> TypeCacheEntry *typentry;
> 
> /* Get information about range type; note column might be a domain */
> typentry = range_get_typcache(fcinfo, getBaseType(element_type));
> 
> return ranges_bounds_common(typentry, array, false);
> }
> 
> That saves 40 LOC.
> 

Thanks, that's better. But I'm still not sure it's a good idea to add
function with anyarray argument, when we need it to be an array of
ranges ...

I wonder if we have other functions doing something similar, i.e.
accepting a polymorphic type and then imposing additional restrictions
on it.

> Shouldn't this add some sql tests ?
> 

Yeah, I guess we should have a couple tests calling these functions on
different range arrays.

This reminds me lower()/upper() have some extra rules about handling
empty ranges / infinite boundaries etc. These functions should behave
consistently (as if we called lower() in a loop) and I'm pretty sure
that's not the current state.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_stats and range statistics

2023-01-22 Thread Justin Pryzby
On Sun, Jan 22, 2023 at 07:19:41PM +0100, Tomas Vondra wrote:
> On 1/21/23 19:53, Egor Rogov wrote:
> > Hi Tomas,
> > On 21.01.2023 00:50, Tomas Vondra wrote:
> >> This simply adds two functions, accepting/producing anyarray - one for
> >> lower bounds, one for upper bounds. I don't think it can be done with a
> >> plain subquery (or at least I don't know how).
> > 
> > Anyarray is an alien to SQL, so functions are well justified here. What
> > makes me a bit uneasy is two almost identical functions. Should we
> > consider other options like a function with an additional parameter or a
> > function returning an array of bounds arrays (which is somewhat
> > wasteful, but probably it doesn't matter much here)?
> > 
> 
> I thought about that, but I think the alternatives (e.g. a single
> function with a parameter determining which boundary to return). But I
> don't think it's better.

What about a common function, maybe called like:

ranges_upper_bounds(PG_FUNCTION_ARGS)
{
AnyArrayType *array = PG_GETARG_ANY_ARRAY_P(0);
Oid element_type = AARR_ELEMTYPE(array);
TypeCacheEntry *typentry;

/* Get information about range type; note column might be a domain */
typentry = range_get_typcache(fcinfo, getBaseType(element_type));

return ranges_bounds_common(typentry, array, false);
}

That saves 40 LOC.

Shouldn't this add some sql tests ?

-- 
Justin




Re: pg_stats and range statistics

2023-01-22 Thread Tomas Vondra
On 1/21/23 19:53, Egor Rogov wrote:
> Hi Tomas,
> 
> On 21.01.2023 00:50, Tomas Vondra wrote:
>> Hi Egor,
>>
>> While reviewing a patch improving join estimates for ranges [1] I
>> realized we don't show stats for ranges in pg_stats, and I recalled we
>> had this patch.
>>
>> I rebased the v2, and I decided to took a stab at showing separate
>> histograms for lower/upper histogram bounds. I believe it makes it way
>> more readable, which is what pg_stats is about IMHO.
> 
> 
> Thanks for looking into this.
> 
> I have to admit it looks much better this way, so +1.
> 

OK, good to hear.

> 
>> This simply adds two functions, accepting/producing anyarray - one for
>> lower bounds, one for upper bounds. I don't think it can be done with a
>> plain subquery (or at least I don't know how).
> 
> 
> Anyarray is an alien to SQL, so functions are well justified here. What
> makes me a bit uneasy is two almost identical functions. Should we
> consider other options like a function with an additional parameter or a
> function returning an array of bounds arrays (which is somewhat
> wasteful, but probably it doesn't matter much here)?
> 

I thought about that, but I think the alternatives (e.g. a single
function with a parameter determining which boundary to return). But I
don't think it's better.

Moreover, I think this is pretty similar to lower/upper, which already
work on range values. So if we have separate functions for that, we
should do the same thing here.

I renamed the functions to ranges_lower/ranges_upper, but maybe why not
to even call the functions lower/upper too?

The main trouble with the function I can think of is that we only have
anyarray type, not anyrangearray. So the functions will get called for
arbitrary array, and the check that it's array of ranges happens inside.
I'm not sure if that's a good or bad idea, or what would it take to add
a new polymorphic type ...

For now I at least kept "ranges_" to make it less likely.

> 
>> Finally, it renames the empty_range_frac to start with range_, per the
>> earlier discussion. I wonder if the new column names for lower/upper
>> bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are
>> too long ...
> 
> 
> It seems so. The ending -s should be left out since it's a single
> histogram now. And I think that
> range_lower_histogram/range_upper_histogram are descriptive enough.
> 
> I'm adding one more patch to shorten the column names, refresh the docs,
> and make 'make check' happy (unfortunately, we have to edit
> src/regress/expected/rules.out every time pg_stats definition changes).
> 

Thanks. I noticed the docs were added to pg_user_mapping by mistake, not
to pg_stats. So I fixed that, and I also added the new functions.

Finally, I reordered the fields a bit - moved range_empty_frac to keep
the histogram fields together.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom 8049d2b7e5d1636d5fb2b7d421d6b29a39389fb3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 20 Jan 2023 20:50:41 +0100
Subject: [PATCH 1/2] Display length and bounds histograms in pg_stats

Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and
STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these
slot kinds were introduced in 918eee0c49.

This commit adds the missing fields to pg_stats.

TODO: catalog version bump
---
 doc/src/sgml/catalogs.sgml|  40 ++
 src/backend/catalog/system_views.sql  |  30 -
 src/backend/utils/adt/rangetypes_typanalyze.c | 118 ++
 src/include/catalog/pg_proc.dat   |  10 ++
 src/include/catalog/pg_statistic.h|   3 +
 src/test/regress/expected/rules.out   |  34 -
 6 files changed, 233 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..e0851c52b4 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9634,6 +9634,46 @@ SCRAM-SHA-256$:&l
User mapping specific options, as keyword=value strings
   
  
+
+ 
+  
+   range_length_histogram anyarray
+  
+  
+   A histogram of the lengths of non-empty and non-null range values of a
+   range type column. (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_empty_frac float4
+  
+  
+   Fraction of column entries whose values are empty ranges.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_lower_histogram anyarray
+  
+  
+   A histogram of lower bounds of non-empty and non-null range values.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_upper_histogram anyarray
+  
+  
+   A histogram of upper bounds of non-empty and non-null range values.
+   (Null for non-range types.)
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.

Re: pg_stats and range statistics

2023-01-21 Thread Egor Rogov

Hi Tomas,

On 21.01.2023 00:50, Tomas Vondra wrote:

Hi Egor,

While reviewing a patch improving join estimates for ranges [1] I
realized we don't show stats for ranges in pg_stats, and I recalled we
had this patch.

I rebased the v2, and I decided to took a stab at showing separate
histograms for lower/upper histogram bounds. I believe it makes it way
more readable, which is what pg_stats is about IMHO.



Thanks for looking into this.

I have to admit it looks much better this way, so +1.



This simply adds two functions, accepting/producing anyarray - one for
lower bounds, one for upper bounds. I don't think it can be done with a
plain subquery (or at least I don't know how).



Anyarray is an alien to SQL, so functions are well justified here. What 
makes me a bit uneasy is two almost identical functions. Should we 
consider other options like a function with an additional parameter or a 
function returning an array of bounds arrays (which is somewhat 
wasteful, but probably it doesn't matter much here)?




Finally, it renames the empty_range_frac to start with range_, per the
earlier discussion. I wonder if the new column names for lower/upper
bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are
too long ...



It seems so. The ending -s should be left out since it's a single 
histogram now. And I think that 
range_lower_histogram/range_upper_histogram are descriptive enough.


I'm adding one more patch to shorten the column names, refresh the docs, 
and make 'make check' happy (unfortunately, we have to edit 
src/regress/expected/rules.out every time pg_stats definition changes).





regards

[1] https://commitfest.postgresql.org/41/3821/
From b339c0eab5616cec61e9d9e85398034861608d30 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 20 Jan 2023 20:50:41 +0100
Subject: [PATCH 1/3] Display length and bounds histograms in pg_stats

Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and
STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these
slot kinds were introduced in 918eee0c49.

This commit adds the missing fields to pg_stats.

TODO: catalog version bump
---
 doc/src/sgml/catalogs.sgml   | 32 
 src/backend/catalog/system_views.sql | 23 +++-
 src/include/catalog/pg_statistic.h   |  3 +++
 src/test/regress/expected/rules.out  | 26 +-
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054e..c8bd84c56eb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9634,6 +9634,38 @@ SCRAM-SHA-256$:&l
User mapping specific options, as keyword=value strings
   
  
+
+ 
+  
+   range_length_histogram anyarray
+  
+  
+   A histogram of the lengths of non-empty and non-null range values of a
+   range type column. (Null for non-range types.)
+  
+ 
+
+ 
+  
+   empty_range_frac float4
+  
+  
+   Fraction of column entries whose values are empty ranges.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_bounds_histograms anyarray
+  
+  
+   Histograms of lower and upper bounds of non-empty, non-null ranges,
+   combined into a single array of range values. The lower and upper bounds
+   of each value correspond to the histograms of lower and upper bounds
+   respectively. (Null for non-range types.)
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 8608e3fa5b1..ccd6c7ffdb7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -243,7 +243,28 @@ CREATE VIEW pg_stats WITH (security_barrier) AS
 WHEN stakind3 = 5 THEN stanumbers3
 WHEN stakind4 = 5 THEN stanumbers4
 WHEN stakind5 = 5 THEN stanumbers5
-END AS elem_count_histogram
+END AS elem_count_histogram,
+CASE
+WHEN stakind1 = 6 THEN stavalues1
+WHEN stakind2 = 6 THEN stavalues2
+WHEN stakind3 = 6 THEN stavalues3
+WHEN stakind4 = 6 THEN stavalues4
+WHEN stakind5 = 6 THEN stavalues5
+END AS range_length_histogram,
+CASE
+WHEN stakind1 = 6 THEN stanumbers1[1]
+WHEN stakind2 = 6 THEN stanumbers2[1]
+WHEN stakind3 = 6 THEN stanumbers3[1]
+WHEN stakind4 = 6 THEN stanumbers4[1]
+WHEN stakind5 = 6 THEN stanumbers5[1]
+END AS empty_range_frac,
+CASE
+WHEN stakind1 = 7 THEN stavalues1
+WHEN stakind2 = 7 THEN stavalues2
+WHEN stakind3 = 7 THEN stavalues3
+WHEN stakind4 = 7 THEN stavalues4
+WHEN stakind5 = 7 THEN stavalues5
+END AS range_bounds_histograms
 FROM pg_statistic s JOIN pg_class c ON (c.oi

Re: pg_stats and range statistics

2023-01-20 Thread Tomas Vondra
Hi Egor,

While reviewing a patch improving join estimates for ranges [1] I
realized we don't show stats for ranges in pg_stats, and I recalled we
had this patch.

I rebased the v2, and I decided to took a stab at showing separate
histograms for lower/upper histogram bounds. I believe it makes it way
more readable, which is what pg_stats is about IMHO.

This simply adds two functions, accepting/producing anyarray - one for
lower bounds, one for upper bounds. I don't think it can be done with a
plain subquery (or at least I don't know how).

Finally, it renames the empty_range_frac to start with range_, per the
earlier discussion. I wonder if the new column names for lower/upper
bounds (range_lower_bounds_histograms/range_upper_bounds_histograms) are
too long ...

regards

[1] https://commitfest.postgresql.org/41/3821/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL CompanyFrom b339c0eab5616cec61e9d9e85398034861608d30 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 20 Jan 2023 20:50:41 +0100
Subject: [PATCH 1/3] Display length and bounds histograms in pg_stats

Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and
STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these
slot kinds were introduced in 918eee0c49.

This commit adds the missing fields to pg_stats.

TODO: catalog version bump
---
 doc/src/sgml/catalogs.sgml   | 32 
 src/backend/catalog/system_views.sql | 23 +++-
 src/include/catalog/pg_statistic.h   |  3 +++
 src/test/regress/expected/rules.out  | 26 +-
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054e..c8bd84c56eb 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9634,6 +9634,38 @@ SCRAM-SHA-256$:&l
User mapping specific options, as keyword=value strings
   
  
+
+ 
+  
+   range_length_histogram anyarray
+  
+  
+   A histogram of the lengths of non-empty and non-null range values of a
+   range type column. (Null for non-range types.)
+  
+ 
+
+ 
+  
+   empty_range_frac float4
+  
+  
+   Fraction of column entries whose values are empty ranges.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_bounds_histograms anyarray
+  
+  
+   Histograms of lower and upper bounds of non-empty, non-null ranges,
+   combined into a single array of range values. The lower and upper bounds
+   of each value correspond to the histograms of lower and upper bounds
+   respectively. (Null for non-range types.)
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b1..ccd6c7ffdb7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -243,7 +243,28 @@ CREATE VIEW pg_stats WITH (security_barrier) AS
 WHEN stakind3 = 5 THEN stanumbers3
 WHEN stakind4 = 5 THEN stanumbers4
 WHEN stakind5 = 5 THEN stanumbers5
-END AS elem_count_histogram
+END AS elem_count_histogram,
+CASE
+WHEN stakind1 = 6 THEN stavalues1
+WHEN stakind2 = 6 THEN stavalues2
+WHEN stakind3 = 6 THEN stavalues3
+WHEN stakind4 = 6 THEN stavalues4
+WHEN stakind5 = 6 THEN stavalues5
+END AS range_length_histogram,
+CASE
+WHEN stakind1 = 6 THEN stanumbers1[1]
+WHEN stakind2 = 6 THEN stanumbers2[1]
+WHEN stakind3 = 6 THEN stanumbers3[1]
+WHEN stakind4 = 6 THEN stanumbers4[1]
+WHEN stakind5 = 6 THEN stanumbers5[1]
+END AS empty_range_frac,
+CASE
+WHEN stakind1 = 7 THEN stavalues1
+WHEN stakind2 = 7 THEN stavalues2
+WHEN stakind3 = 7 THEN stavalues3
+WHEN stakind4 = 7 THEN stavalues4
+WHEN stakind5 = 7 THEN stavalues5
+END AS range_bounds_histograms
 FROM pg_statistic s JOIN pg_class c ON (c.oid = s.starelid)
  JOIN pg_attribute a ON (c.oid = attrelid AND attnum = s.staattnum)
  LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index 8770c5b4c60..10401dece0d 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -152,6 +152,9 @@ DECLARE_FOREIGN_KEY((starelid, staattnum), pg_attribute, (attrelid, attnum));
  * data "kind" will appear in any particular slot.  Instead, search the
  * stakind fields to see if the desired data is available.  (The standard
  * function get_attstatsslot() may be used for this.)
+ *
+ * Note: The pg_stats view needs to be modified whenever a new slot kind is
+ * added to core.
  */
 
 /*
dif

Re: pg_stats and range statistics

2021-07-23 Thread Egor Rogov

Hi Tomas,

On 12.07.2021 16:04, Tomas Vondra wrote:

On 7/12/21 1:10 PM, Egor Rogov wrote:

Hi,

thanks for the review and corrections.

On 11.07.2021 21:54, Soumyadeep Chakraborty wrote:

Hello,

This should have been added with [1].

Excerpt from the documentation:
"pg_stats is also designed to present the information in a more readable
format than the underlying catalog — at the cost that its schema must
be extended whenever new slot types are defined for pg_statistic." [2]

So, I added a reminder in pg_statistic.h.

Good point.



Attached is v2 of this patch with some cosmetic changes.

I wonder why "TODO: catalog version bump"? This patch doesn't change
catalog structure, or I miss something?


It changes system_views.sql, which is catalog change, as it redefines
the pg_stats system view (it adds 3 more columns). So it changes what
you get after initdb, hence catversion has to be bumped.


Renamed the columns a
bit and updated the docs to be a bit more descriptive.
(range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
range_bounds_histograms)

I intended to make the same prefix ("range_") for all columns concerned
with range types, although I'm fine with the proposed naming.


Yeah, I'd vote to change empty_range_frac -> range_empty_frac.


One question:

We do have the option of representing the histogram of lower bounds
separately
from the histogram of upper bounds, as two separate view columns.
Don't know if
there is much utility though and there is a fair bit of added
complexity: see
below. Thoughts?

I thought about it too, and decided not to transform the underlying data
structure. As far as I can see, pg_stats never employed such
transformations. For example, STATISTIC_KIND_DECHIST is an array
containing the histogram followed by the average in its last element. It
is shown in pg_stats.elem_count_histogram as is, although it arguably
may be splitted into two fields. All in all, I believe pg_stats's job is
to "unpack" stavalues and stanumbers into meaningful fields, and not to
try to go deeper than that.


Not firm opinion, but the pg_stats is meant to be easier to
read/understand for humans. So far the transformation were simple
because all the data was fairly simple, but the range stuff may need
more complex transformation.

For example we do quite a bit more in pg_stats_ext views, because it
deals with multi-column stats.



In pg_stats_ext, yes, but not in pg_stats (at least until now).

Since no one has expressed a strong desire for a more complex 
transformation, should we proceed with the proposed approach (with 
further renaming empty_range_frac -> range_empty_frac as you suggested)? 
Or should we wait more for someone to weigh in?






regards






Re: pg_stats and range statistics

2021-07-12 Thread Tomas Vondra
On 7/12/21 1:10 PM, Egor Rogov wrote:
> Hi,
> 
> thanks for the review and corrections.
> 
> On 11.07.2021 21:54, Soumyadeep Chakraborty wrote:
>> Hello,
>>
>> This should have been added with [1].
>>
>> Excerpt from the documentation:
>> "pg_stats is also designed to present the information in a more readable
>> format than the underlying catalog — at the cost that its schema must
>> be extended whenever new slot types are defined for pg_statistic." [2]
>>
>> So, I added a reminder in pg_statistic.h.
> 
> Good point.
> 
> 
>> Attached is v2 of this patch with some cosmetic changes.
> 
> I wonder why "TODO: catalog version bump"? This patch doesn't change
> catalog structure, or I miss something?
> 

It changes system_views.sql, which is catalog change, as it redefines
the pg_stats system view (it adds 3 more columns). So it changes what
you get after initdb, hence catversion has to be bumped.

> 
>> Renamed the columns a
>> bit and updated the docs to be a bit more descriptive.
>> (range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
>> range_bounds_histograms)
> 
> I intended to make the same prefix ("range_") for all columns concerned
> with range types, although I'm fine with the proposed naming.
> 

Yeah, I'd vote to change empty_range_frac -> range_empty_frac.

> 
>> One question:
>>
>> We do have the option of representing the histogram of lower bounds
>> separately
>> from the histogram of upper bounds, as two separate view columns.
>> Don't know if
>> there is much utility though and there is a fair bit of added
>> complexity: see
>> below. Thoughts?
> 
> I thought about it too, and decided not to transform the underlying data
> structure. As far as I can see, pg_stats never employed such
> transformations. For example, STATISTIC_KIND_DECHIST is an array
> containing the histogram followed by the average in its last element. It
> is shown in pg_stats.elem_count_histogram as is, although it arguably
> may be splitted into two fields. All in all, I believe pg_stats's job is
> to "unpack" stavalues and stanumbers into meaningful fields, and not to
> try to go deeper than that.
> 

Not firm opinion, but the pg_stats is meant to be easier to
read/understand for humans. So far the transformation were simple
because all the data was fairly simple, but the range stuff may need
more complex transformation.

For example we do quite a bit more in pg_stats_ext views, because it
deals with multi-column stats.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_stats and range statistics

2021-07-12 Thread Egor Rogov

Hi,

thanks for the review and corrections.

On 11.07.2021 21:54, Soumyadeep Chakraborty wrote:

Hello,

This should have been added with [1].

Excerpt from the documentation:
"pg_stats is also designed to present the information in a more readable
format than the underlying catalog — at the cost that its schema must
be extended whenever new slot types are defined for pg_statistic." [2]

So, I added a reminder in pg_statistic.h.


Good point.



Attached is v2 of this patch with some cosmetic changes.


I wonder why "TODO: catalog version bump"? This patch doesn't change 
catalog structure, or I miss something?




Renamed the columns a
bit and updated the docs to be a bit more descriptive.
(range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
range_bounds_histograms)


I intended to make the same prefix ("range_") for all columns concerned 
with range types, although I'm fine with the proposed naming.




One question:

We do have the option of representing the histogram of lower bounds separately
from the histogram of upper bounds, as two separate view columns. Don't know if
there is much utility though and there is a fair bit of added complexity: see
below. Thoughts?


I thought about it too, and decided not to transform the underlying data 
structure. As far as I can see, pg_stats never employed such 
transformations. For example, STATISTIC_KIND_DECHIST is an array 
containing the histogram followed by the average in its last element. It 
is shown in pg_stats.elem_count_histogram as is, although it arguably 
may be splitted into two fields. All in all, I believe pg_stats's job is 
to "unpack" stavalues and stanumbers into meaningful fields, and not to 
try to go deeper than that.





My attempts via SQL (unnest -> lower|upper -> array_agg) were futile given
unnest does not play nice with anyarray. For instance:

select unnest(stavalues1) from pg_statistic;
ERROR:  cannot determine element type of "anyarray" argument

Maybe the only option is to write a UDF pg_get_{lower|upper}_bounds_histogram
which can do something similar to what calc_hist_selectivity does:

/*
  * Convert histogram of ranges into histograms of its lower and upper
  * bounds.
  */
nhist = hslot.nvalues;
hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
for (i = 0; i < nhist; i++)
{
bool empty;

range_deserialize(rng_typcache, DatumGetRangeTypeP(hslot.values[i]),
   &hist_lower[i], &hist_upper[i], &empty);
/* The histogram should not contain any empty ranges */
if (empty)
elog(ERROR, "bounds histogram contains an empty range");
}

This is looking good and ready.

[1] 
https://github.com/postgres/postgres/commit/918eee0c497c88260a2e107318843c9b1947bc6f
[2] https://www.postgresql.org/docs/devel/view-pg-stats.html

Regards,
Soumyadeep (VMware)





Re: pg_stats and range statistics

2021-07-11 Thread Soumyadeep Chakraborty
Hello,

This should have been added with [1].

Excerpt from the documentation:
"pg_stats is also designed to present the information in a more readable
format than the underlying catalog — at the cost that its schema must
be extended whenever new slot types are defined for pg_statistic." [2]

So, I added a reminder in pg_statistic.h.

Attached is v2 of this patch with some cosmetic changes. Renamed the columns a
bit and updated the docs to be a bit more descriptive.
(range_length_empty_frac -> empty_range_frac, range_bounds_histogram ->
range_bounds_histograms)

One question:

We do have the option of representing the histogram of lower bounds separately
from the histogram of upper bounds, as two separate view columns. Don't know if
there is much utility though and there is a fair bit of added complexity: see
below. Thoughts?

My attempts via SQL (unnest -> lower|upper -> array_agg) were futile given
unnest does not play nice with anyarray. For instance:

select unnest(stavalues1) from pg_statistic;
ERROR:  cannot determine element type of "anyarray" argument

Maybe the only option is to write a UDF pg_get_{lower|upper}_bounds_histogram
which can do something similar to what calc_hist_selectivity does:

/*
 * Convert histogram of ranges into histograms of its lower and upper
 * bounds.
 */
nhist = hslot.nvalues;
hist_lower = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
hist_upper = (RangeBound *) palloc(sizeof(RangeBound) * nhist);
for (i = 0; i < nhist; i++)
{
bool empty;

range_deserialize(rng_typcache, DatumGetRangeTypeP(hslot.values[i]),
  &hist_lower[i], &hist_upper[i], &empty);
/* The histogram should not contain any empty ranges */
if (empty)
elog(ERROR, "bounds histogram contains an empty range");
}

This is looking good and ready.

[1] 
https://github.com/postgres/postgres/commit/918eee0c497c88260a2e107318843c9b1947bc6f
[2] https://www.postgresql.org/docs/devel/view-pg-stats.html

Regards,
Soumyadeep (VMware)
From 1f2ed0911eb3f7a53b16047cefb499692daf5ef6 Mon Sep 17 00:00:00 2001
From: Egor Rogov 
Date: Sun, 11 Jul 2021 11:09:45 -0700
Subject: [PATCH v2 1/1] Display length and bounds histograms in pg_stats

Values corresponding to STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and
STATISTIC_KIND_BOUNDS_HISTOGRAM were not exposed to pg_stats when these
slot kinds were introduced in 918eee0c49.

This commit adds the missing fields to pg_stats.

TODO: catalog version bump
---
 doc/src/sgml/catalogs.sgml   | 32 
 src/backend/catalog/system_views.sql | 23 +++-
 src/include/catalog/pg_statistic.h   |  3 +++
 src/test/regress/expected/rules.out  | 26 +-
 4 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index f517a7d4af..7168ff9a13 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -12877,6 +12877,38 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
non-null elements.  (Null for scalar types.)
   
  
+
+ 
+  
+   range_length_histogram anyarray
+  
+  
+   A histogram of the lengths of non-empty and non-null range values of a
+   range type column. (Null for non-range types.)
+  
+ 
+
+ 
+  
+   empty_range_frac float4
+  
+  
+   Fraction of column entries whose values are empty ranges.
+   (Null for non-range types.)
+  
+ 
+
+ 
+  
+   range_bounds_histograms anyarray
+  
+  
+   Histograms of lower and upper bounds of non-empty, non-null ranges,
+   combined into a single array of range values. The lower and upper bounds
+   of each value correspond to the histograms of lower and upper bounds
+   respectively. (Null for non-range types.)
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 999d984068..d8bc622ad5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -243,7 +243,28 @@ CREATE VIEW pg_stats WITH (security_barrier) AS
 WHEN stakind3 = 5 THEN stanumbers3
 WHEN stakind4 = 5 THEN stanumbers4
 WHEN stakind5 = 5 THEN stanumbers5
-END AS elem_count_histogram
+END AS elem_count_histogram,
+CASE
+WHEN stakind1 = 6 THEN stavalues1
+WHEN stakind2 = 6 THEN stavalues2
+WHEN stakind3 = 6 THEN stavalues3
+WHEN stakind4 = 6 THEN stavalues4
+WHEN stakind5 = 6 THEN stavalues5
+END AS range_length_histogram,
+CASE
+WHEN stakind1 = 6 THEN stanumbers1[1]
+WHEN stakind2 = 6 THEN stanumbers2[1]
+WHEN stakind3 = 6 THEN stanumbers3[1]
+WHEN stakind4 = 6 THEN stanumbers4[1]
+WHEN stakind5 = 6 THEN stanumbers5[1]
+END AS empty_range_frac,
+CASE
+WHEN stakind1 = 7 THEN stavalues1
+  

Re: pg_stats and range statistics

2021-06-18 Thread Tomas Vondra

On 6/18/21 6:22 PM, Egor Rogov wrote:

Hi,

Statistics for range types are not currently exposed in pg_stats view 
(i.e. STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM and 
STATISTIC_KIND_BOUNDS_HISTOGRAM).


Shouldn't they? If so, here is a patch for adding them.



I think they should be exposed - I don't see why not to do that. I 
noticed this when working on the count-min sketch experiment too, so 
thanks for this patch.


FWIW I've added the patch to the next CF:

https://commitfest.postgresql.org/33/3184/


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company