Doc update for pg_stat_statements normalization

2023-02-24 Thread Imseih (AWS), Sami
Replacing constants in pg_stat_statements is done on a best-effort basis.
On a busy workload with heavy entry deallocation, it is quite possible for
a user to observe queries with their constants intact in pg_stat_statements.

From what I can see, this is because the only time an entry is normalized is
during post_parse_analyze, and the entry may be deallocated by the time query
execution ends. At that point, the original form of the query (with constants)
is stored.
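
For reference, a heavily simplified sketch of the relevant branch in
pgss_store() (paraphrased, not the verbatim source; details vary across
versions):

/* Simplified sketch, paraphrased from pg_stat_statements.c. */
if (jstate != NULL)
{
    /*
     * Post-analyze call: a JumbleState is available, so the text kept
     * for the entry can be normalized (constants become $1, $2, ...).
     */
    norm_query = generate_normalized_query(jstate, query,
                                           query_location, &query_len);
}
else
{
    /*
     * Planner / executor-end / utility call: no JumbleState here.  If
     * the entry created at post-analyze time was deallocated in the
     * meantime, it is re-created from the only text at hand: the
     * original query string, constants included.
     */
}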

It is not clear how prevalent this is in real-world workloads, but it is
easily reproducible on a workload with a high rate of entry deallocation.
Attached are the repro steps on the latest branch.

I think the only thing to do here is to call this out in the docs, with a
suggestion to increase pg_stat_statements.max to reduce the likelihood. I
have attached the suggested doc enhancement as well.

Any thoughts?

Regards,

--
Sami Imseih
Amazon Web Services


### pg_stat_statements.max is set to the minimum allowed value
postgres=# show pg_stat_statements.max ;
 pg_stat_statements.max
------------------------
 100
(1 row)

### create 200 tables
do $$
begin
  for i in 1 .. 200
  loop
    execute 'drop table if exists foo'||i;
    execute 'create table foo'||i||'(id int, c2 text)';
  end loop;
end; $$ ;

### create a pgbench script to insert into the tables.
for (( i=1; i<=200; i++ ))
do
   echo "INSERT INTO foo"$i" (id, c2) values (1, 'somedata');" >> /tmp/pgbench.sql
done

### run pgbench
pgbench -c 50 -f /tmp/pgbench.sql -T 1200


### observe pg_stat_statements
postgres=# select query from pg_stat_statements where query like '%foo%' and query not like '%$1%';
                       query
----------------------------------------------------
 INSERT INTO foo31 (id, c2) values (1, 'somedata')
 INSERT INTO foo20 (id, c2) values (1, 'somedata')
 INSERT INTO foo191 (id, c2) values (1, 'somedata')
 INSERT INTO foo170 (id, c2) values (1, 'somedata')
 INSERT INTO foo167 (id, c2) values (1, 'somedata')
 INSERT INTO foo32 (id, c2) values (1, 'somedata')
 INSERT INTO foo36 (id, c2) values (1, 'somedata')
 INSERT INTO foo43 (id, c2) values (1, 'somedata')
 INSERT INTO foo181 (id, c2) values (1, 'somedata')
 INSERT INTO foo88 (id, c2) values (1, 'somedata')
(10 rows)

postgres=# select query from pg_stat_statements where query like '%foo%' and query like '%$1%' limit 5;
                   query
--------------------------------------------
 INSERT INTO foo33 (id, c2) values ($1, $2)
 INSERT INTO foo59 (id, c2) values ($1, $2)
 INSERT INTO foo50 (id, c2) values ($1, $2)
 INSERT INTO foo42 (id, c2) values ($1, $2)
 INSERT INTO foo91 (id, c2) values ($1, $2)
(5 rows)

### observe the number of deallocations
postgres=# select * from pg_stat_statements_info ;
 dealloc |          stats_reset
---------+-------------------------------
  113371 | 2023-02-24 06:24:21.912275-06
(1 row)


0001-doc-update-regarding-pg_stat_statements-normalizatio.patch
Description: 0001-doc-update-regarding-pg_stat_statements-normalizatio.patch


Re: Doc update for pg_stat_statements normalization

2023-02-24 Thread Michael Paquier
On Fri, Feb 24, 2023 at 08:54:00PM +, Imseih (AWS), Sami wrote:
> I think the only thing to do here is to call this out in the docs with a
> suggestion to increase pg_stat_statements.max to reduce the
> likelihood. I also attached the suggested doc enhancement as well.

Improving the docs about that sounds like a good idea.  This would be
less surprising for users if we documented some details about it.

> Any thoughts?

The risk of deallocation of an entry between the post-analyze hook and
the planner/utility hook (represented by two calls of pgss_store())
means that this will never work correctly as long as we don't know the
shape of the normalized query in the second code path (planner, utility
execution) that updates the entries with the call information, etc.
And we only want to pay the cost of normalization once, after the
post-analyze step where we do the query jumbling.

Could things be done in a more stable way?  For example, imagine that
we have an extra Query field called void *private_data that extensions
can use to store custom data associated with a query ID; then we could
do something like this (see the sketch below the list):
- In the post-analyze hook, check if an entry with the query ID
calculated exists.
-- If the entry exists, grab a copy of the existing query string,
which may be normalized or not, and save it into Query->private_data.
-- If the entry does not exist, normalize the query, store it in
Query->private_data but do not yet create an entry in the hash table.
- In the planner/utility hook, fetch the normalized query from
private_data, then use it if an entry needs to be created in the hash
table.  The entry may have been deallocated since the post-analyze
hook, in which case it is re-created with the normalized copy saved in
the first phase.
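
A rough sketch of what the post-analyze side of this could look like
(hypothetical: Query->private_data and lookup_normalized_text() do not
exist in core or in pg_stat_statements; this only illustrates the idea):

/*
 * Hypothetical sketch of the proposal above; none of the new names exist.
 */
static void
pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
{
    /* a shared lock is enough: we only read the existing entry, if any */
    char   *stored = lookup_normalized_text(query->queryId);

    if (stored != NULL)
        query->private_data = stored;       /* may already be normalized */
    else
    {
        int     query_len = query->stmt_len;

        query->private_data =
            generate_normalized_query(jstate, pstate->p_sourcetext,
                                      query->stmt_location, &query_len);
    }
    /* note: no hash table entry is created at this stage in this scheme */
}

/*
 * Later, in the planner/utility hook: if the entry is missing (it may
 * have been deallocated since post-analyze), re-create it from the
 * normalized text saved in query->private_data rather than from the
 * raw source string.
 */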
--
Michael




Re: Doc update for pg_stat_statements normalization

2023-02-25 Thread Julien Rouhaud
On Sat, Feb 25, 2023 at 02:58:36PM +0900, Michael Paquier wrote:
> On Fri, Feb 24, 2023 at 08:54:00PM +, Imseih (AWS), Sami wrote:
> > I think the only thing to do here is to call this out in the docs with a
> > suggestion to increase pg_stat_statements.max to reduce the
> > likelihood. I also attached the suggested doc enhancement as well.
> 
> Improving the docs about that sounds like a good idea.  This would be
> less surprising for users if we documented some details about it.
> 
> > Any thoughts?
> 
> The risk of deallocation of an entry between the post-analyze hook and
> the planner/utility hook (represented by two calls of pgss_store())
> means that this will never work correctly as long as we don't know the
> shape of the normalized query in the second code path (planner, utility
> execution) that updates the entries with the call information, etc.
> And we only want to pay the cost of normalization once, after the
> post-analyze step where we do the query jumbling.

Note also that this is somewhat intended behavior (evicting queries that
never recorded any planning or execution stats), per
STICKY_DECREASE_FACTOR and the related logic.
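
For context, the sticky-entry logic being referred to, paraphrased from
pg_stat_statements.c (constants as found in recent versions):

/*
 * A "sticky" entry was created at post-analyze time but has recorded no
 * plan or exec calls yet.  entry_dealloc() decays its usage faster, so
 * entries that never completed planning/execution are evicted first.
 */
#define USAGE_DECREASE_FACTOR   (0.99)  /* decayed on every entry_dealloc */
#define STICKY_DECREASE_FACTOR  (0.50)  /* sticky entries decay faster */
#define IS_STICKY(c) \
    ((c.calls[PGSS_PLAN] + c.calls[PGSS_EXEC]) == 0)

/* ... inside entry_dealloc(): */
if (IS_STICKY(entry->counters))
    entry->counters.usage *= STICKY_DECREASE_FACTOR;
else
    entry->counters.usage *= USAGE_DECREASE_FACTOR;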

> Could things be done in a more stable way?  For example, imagine that
> we have an extra Query field called void *private_data that extensions
> can use to store custom data associated with a query ID; then we could
> do something like this:
> - In the post-analyze hook, check if an entry with the query ID
> calculated exists.
> -- If the entry exists, grab a copy of the existing query string,
> which may be normalized or not, and save it into Query->private_data.
> -- If the entry does not exist, normalize the query, store it in
> Query->private_data but do not yet create an entry in the hash table.
> - In the planner/utility hook, fetch the normalized query from
> private_data, then use it if an entry needs to be created in the hash
> table.  The entry may have been deallocated since the post-analyze
> hook, in which case it is re-created with the normalized copy saved in
> the first phase.

I think the idea of a "private_data" like thing has been discussed before
and rejected IIRC, as it could be quite expensive and would also need to
accommodate multiple extensions and so on.

Overall, I think that if the pgss eviction rate is high enough to be a
problem for performance analysis, the performance overhead will be so bad
that simply removing pg_stat_statements will give you a ~2x performance
increase.  I don't see much point in trying to make such a
performance-killer scenario more usable.




Re: Doc update for pg_stat_statements normalization

2023-02-25 Thread Imseih (AWS), Sami
>> Could things be done in a more stable way?  For example, imagine that
>> we have an extra Query field called void *private_data that extensions
>> can use to store custom data associated with a query ID; then we could
>> do something like this:
>> - In the post-analyze hook, check if an entry with the query ID
>> calculated exists.
>> -- If the entry exists, grab a copy of the existing query string,
>> which may be normalized or not, and save it into Query->private_data.
>> -- If the entry does not exist, normalize the query, store it in
>> Query->private_data but do not yet create an entry in the hash table.
>> - In the planner/utility hook, fetch the normalized query from
>> private_data, then use it if an entry needs to be created in the hash
>> table.  The entry may have been deallocated since the post-analyze
>> hook, in which case it is re-created with the normalized copy saved in
>> the first phase.

> I think the idea of a "private_data" like thing has been discussed before
> and rejected IIRC, as it could be quite expensive and would also need to
> accommodate multiple extensions and so on.

The overhead of storing this additional private data for the life of the
query execution may not be desirable. I think we will also need to copy
the private data into QueryDesc to make it available to the
planner/utility/executor hooks.

> Overall, I think that if the pgss eviction rate is high enough to be a
> problem for performance analysis, the performance overhead will be so bad
> that simply removing pg_stat_statements will give you a ~2x performance
> increase.  I don't see much point in trying to make such a
> performance-killer scenario more usable.

In v14, we added a dealloc metric to pg_stat_statements_info, which is helpful.
However, that only covers pgss_hash entry deallocation. I think we should also
add a metric for garbage collection of the query text file.
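
A minimal sketch of what that could look like, mirroring how
entry_dealloc() bumps the existing counter (gc_count is a hypothetical
field; it does not exist in pgssGlobalStats today):

/*
 * Hypothetical: bump a new gc counter at the end of gc_qtexts(), the
 * same way entry_dealloc() bumps stats.dealloc.
 */
{
    volatile pgssSharedState *s = (volatile pgssSharedState *) pgss;

    SpinLockAcquire(&s->mutex);
    s->stats.gc_count += 1;         /* hypothetical pgssGlobalStats field */
    SpinLockRelease(&s->mutex);
}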

Regards

-- 
Sami Imseih
Amazon Web Services



Re: Doc update for pg_stat_statements normalization

2023-02-26 Thread Michael Paquier
On Sat, Feb 25, 2023 at 01:59:04PM +, Imseih (AWS), Sami wrote:
> The overhead of storing this additional private data for the life of the query
> execution may not be desirable.

Okay, but why?

> I think we also will need to copy the
> private data to QueryDesc as well to make it available to planner/utility/exec
> hooks.

This seems like the key point to me here.  If we copy more information
into the Query structures, then we basically have no need for sticky
entries, which could be an advantage on its own as it simplifies the
deallocation and lookup logic.

For a DML or a SELECT, the manipulation of the hash table would still
be a three-step process (post-analyze, planner and execution end), but
the first step would have no need to take an exclusive lock on the hash
table: if an entry already exists, we could just read its normalized
query and copy it over to the Query, meaning that we could actually
relax things a bit?  The cost of this relaxation is the extra memory
used to store more data, so that the insertion can use a proper state
of the Query[Desc] coming from the JumbleState (this extra data need
not be the JumbleState itself, just the results we generate from it,
i.e. the normalized query).

> In v14, we added a dealloc metric to pg_stat_statements_info, which is
> helpful. However, that only covers pgss_hash entry deallocation.
> I think we should also add a metric for garbage collection of the query
> text file.

This sounds like a good idea on its own.
--
Michael




Re: Doc update for pg_stat_statements normalization

2023-02-27 Thread Imseih (AWS), Sami
> On Sat, Feb 25, 2023 at 01:59:04PM +, Imseih (AWS), Sami wrote:
>> The overhead of storing this additional private data for the life of the
>> query execution may not be desirable.

> Okay, but why?

Additional memory to maintain the JumbleState data in other structs, and
the additional copy operations.

> This seems like the key point to me here.  If we copy more information
> into the Query structures, then we basically have no need for sticky
> entries, which could be an advantage on its own as it simplifies the
> deallocation and lookup logic.

Removing the sticky entry logic will be a big plus, and of course we can
eliminate a query not showing up properly normalized.

> For a DML or a SELECT, the manipulation of the hash table would still
> be a three-step process (post-analyze, planner and execution end), but
> the first step would have no need to take an exclusive lock on the hash
> table: if an entry already exists, we could just read its normalized
> query and copy it over to the Query, meaning that we could actually
> relax things a bit?

No lock is held while normalizing, and a shared lock is held while storing,
so there is no apparent benefit from that aspect.

> The cost of this relaxation is the extra memory used to store more
> data, so that the insertion can use a proper state of the Query[Desc]
> coming from the JumbleState (this extra data need not be the
> JumbleState itself, just the results we generate from it, i.e. the
> normalized query).

Wouldn't it be less, in terms of memory usage, to just store the required
JumbleState fields in Query[Desc]? (See the sketch after the list below.)

clocations_count,
highest_extern_param_id,
clocations_locations,
clocations_length
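
Something like this hypothetical container, say (LocationLen already
carries the per-constant location and length, covering the last two
fields above):

/*
 * Hypothetical struct holding just the JumbleState results needed to
 * rebuild the normalized text later; it does not exist in core.
 */
typedef struct pgssNormState
{
    int          clocations_count;          /* number of constants recorded */
    int          highest_extern_param_id;   /* highest $N already in the text */
    LocationLen *clocations;                /* per-constant location + length */
} pgssNormState;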

>> In v14, we added a dealloc metric to pg_stat_statements_info, which is
>> helpful. However, that only covers pgss_hash entry deallocation.
>> I think we should also add a metric for garbage collection of the query
>> text file.

> This sounds like a good idea on its own.

I can create a separate patch for this.

Regards,

--
Sami Imseih
Amazon Web Services



Re: Doc update for pg_stat_statements normalization

2023-02-27 Thread Michael Paquier
On Mon, Feb 27, 2023 at 10:53:26PM +, Imseih (AWS), Sami wrote:
> Wouldn't it be less, in terms of memory usage, to just store the required
> JumbleState fields in Query[Desc]?
> 
> clocations_count,
> highest_extern_param_id,
> clocations_locations,
> clocations_length

Yes, these would be enough to ensure a proper rebuild of the
normalized query in either the 2nd or 3rd call of pgss_store().  With
a high deallocation rate, perhaps we just don't care about bearing
the extra computation to build a normalized query more than once; still,
it could be noticeable?

Anything that gets changed is going to need some serious benchmarking
(based on deallocation rate, string length, etc.) to check that the
cost of this extra memory is worth the correctness gained when storing
the normalization.  FWIW, I am all in if it means code simplifications
with better performance and better correctness of the results.
--
Michael




Re: Doc update for pg_stat_statements normalization

2023-02-27 Thread Michael Paquier
On Fri, Feb 24, 2023 at 08:54:00PM +, Imseih (AWS), Sami wrote:
> I think the only thing to do here is to call this out in the docs with a
> suggestion to increase pg_stat_statements.max to reduce the likelihood.
> I also attached the suggested doc enhancement as well.

+  <para>
+   A query text may be observed with constants in
+   <structname>pg_stat_statements</structname>, especially when there is a
+   high rate of entry deallocations. To reduce the likelihood of this
+   occurring, consider increasing <varname>pg_stat_statements.max</varname>.
+   The <structname>pg_stat_statements_info</structname> view provides entry
+   deallocation statistics.
+  </para>

I am OK with an addition to the documentation to warn that one may
have to increase the maximum number of entries that can be stored if
seeing a non-normalized entry that should have been normalized.

Shouldn't this text make it clear that this concerns only query
strings that can be normalized?  Utility queries can have constants,
for one (well, not for long for most of them with the work I have been
doing lately, but there will still be some exceptions like CREATE
VIEW or utilities with A_Const nodes).
--
Michael




Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Imseih (AWS), Sami
> I am OK with an addition to the documentation to warn that one may
> have to increase the maximum number of entries that can be stored if
> seeing a non-normalized entry that should have been normalized.

I agree. We introduce the concept of a plannable statement in a
previous section, and we can then make this distinction in the new
paragraph.

I also added a link to pg_stat_statements_info, since that is introduced
later on in the doc.


Regards,

-- 
Sami Imseih
Amazon Web Services




0001-doc-update-regarding-pg_stat_statements-normalizatio.patch
Description: 0001-doc-update-regarding-pg_stat_statements-normalizatio.patch


Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Michael Paquier
On Tue, Feb 28, 2023 at 11:11:30PM +, Imseih (AWS), Sami wrote:
> I agree. We introduce the concept of a plannable statement in a
> previous section, and we can then make this distinction in the new
> paragraph.
> 
> I also added a link to pg_stat_statements_info, since that is introduced
> later on in the doc.

I have reworded the paragraph a bit to be more general, so that it would
not need an update once more normalization is applied to utility
queries (I am going to fix the part where we mention that we use the
strings for utilities, which is not the case anymore now):
+  <para>
+   Queries on which normalization can be applied may be observed with constant
+   values in <structname>pg_stat_statements</structname>, especially when there
+   is a high rate of entry deallocations. To reduce the likelihood of this
+   happening, consider increasing <varname>pg_stat_statements.max</varname>.
+   The <structname>pg_stat_statements_info</structname> view, discussed below
+   in <xref linkend="pgstatstatements-pg-stat-statements-info"/>,
+   provides statistics about entry deallocations.
+  </para>

Are you OK with this text?
--
Michael
From dd8938e4ba1b1f29d14b3fa2dc76301f42592cad Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 1 Mar 2023 09:05:08 +0900
Subject: [PATCH v3] doc update regarding pg_stat_statements normalization.

It is quite possible that a query text will not be normalized
(constants replaced with $1, $2, ...) when there is a high rate of
pgss_hash entry deallocation. This commit calls this out in the docs.
---
 doc/src/sgml/pgstatstatements.sgml | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index efc36da602..f1ba78c8cb 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -516,6 +516,16 @@
    <structname>pg_stat_statements</structname> entry.
   </para>
 
+  <para>
+   Queries on which normalization can be applied may be observed with constant
+   values in <structname>pg_stat_statements</structname>, especially when there
+   is a high rate of entry deallocations. To reduce the likelihood of this
+   happening, consider increasing <varname>pg_stat_statements.max</varname>.
+   The <structname>pg_stat_statements_info</structname> view, discussed below
+   in <xref linkend="pgstatstatements-pg-stat-statements-info"/>,
+   provides statistics about entry deallocations.
+  </para>
+
   <para>
    In some cases, queries with visibly different texts might get merged into a
    single <structname>pg_stat_statements</structname> entry.  Normally this will happen
-- 
2.39.2





Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Imseih (AWS), Sami
> +  <para>
> +   Queries on which normalization can be applied may be observed with constant
> +   values in <structname>pg_stat_statements</structname>, especially when there
> +   is a high rate of entry deallocations. To reduce the likelihood of this
> +   happening, consider increasing <varname>pg_stat_statements.max</varname>.
> +   The <structname>pg_stat_statements_info</structname> view, discussed below
> +   in <xref linkend="pgstatstatements-pg-stat-statements-info"/>,
> +   provides statistics about entry deallocations.
> +  </para>

> Are you OK with this text?

Yes, that makes sense.


Regards,

--
Sami Imseih
Amazon Web Services





Re: Doc update for pg_stat_statements normalization

2023-02-28 Thread Michael Paquier
On Wed, Mar 01, 2023 at 12:43:40AM +, Imseih (AWS), Sami wrote:
> Yes, that makes sense.

Okay, thanks.  Done that now on HEAD.
--
Michael

