Re: Parallel Query Stats

2024-02-25 Thread Tomas Vondra
Hi Benoit,

On 4/5/23 15:00, Benoit Lobréau wrote:
> Hi hackers,
> 
> Parallelism has been in core since 9.6, it's a great feature that got
> several
> upgrades since then. However, it lacks metrics to determine if and how
> parallelism is used and help tune parameters related to it.
> 

True.

> Currently, the only information available are pg_stat_activity.backend_type
> and pg_stat_activity.leader_pid. These could be sampled to get statistics
> about the number of queries that are using parallel workers and the
> number of
> workers spawned (globally or per statement), but this is not ideal because:
> 
> * the sampling period would require a high frequency to get stats
>   close enough from reality without missing lots of short duration
>   queries;
> * with sampling we cannot get an accurate count of parallel queries;
> * we don't know how many queries can't get the workers they asked for.
> 
> We thought about several places where we could add some user facing
> metrics, and would
> like some input about the design before working on a larger patch. The
> various chosen
> names are obviously not settled.
> 

I agree just sampling pg_stat_activity is insufficient to get a good
overview and decide whether an adjustment of the parallel workers (or
other GUCs) is needed.

> # Traces
> 
> We could add a GUC "log_parallel_draught": it would add a message in the
> logs when a
> query or utility asks for parallel workers but can't get all of them.
> 
> The message could look like this. It could be issued several times per
> query
> since workers can be requested for different parts of the plan.
> 
>   LOG:  Parallel worker draught detected: worker launched: 0, requested: 2
>   STATEMENT:  explain analyze select * from pq_foo inner join pq_bar
> using(id);
> 
>   LOG:  Parallel worker draught detected: worker launched: 0, requested: 1
>   CONTEXT:  while scanning relation "public.pv_tbl"
>   STATEMENT:  VACUUM (PARALLEL 2, VERBOSE) pv_tbl;
> 
>   LOG:  Parallel worker draught detected: worker launched: 0, requested: 1
>   STATEMENT:  CREATE INDEX ON pi_tbl(i);
> 
> This could be used in tools like pgBadger to aggregate stats
> on statements that didn't get their workers, but we might need additionnal
> information to know why we are lacking workers.
> 
> We have a working PoC patch for this since it seems the most
> straightforward to implement and use.
> 

I commented on this in the separate thread nearby.

> # pg_stat_bgworker view
> 
> I was initially thinking about metrics like:
> * number of parallel queries
> * number of parallel queries that didn't get their workers
> But without a number of eligible queries, it's not very useful.
> 
> Instead, some metrics could be useful:
> * how many workers were requested
> * how many workers were obtained.
> The data would be updated as the workers are spawned
> (or aren't). It would be interesting to have this information per
> background worker type in order to identify which pool is the source of a
> parallel worker draught.
> 
> The view could look like this:
> 
> * bgworker_type: possible values would be: logical replication worker /
> parallel
> worker / parallel maintenance worker / a name given by an extension;
> * datname: the database where the workers were connected if applicable,
> or null
>   otherwise;
> * active: number of currently running workers;
> * requested: number of requested workers ;
> * obtained: number of obtained workers ;
> * duration: the aggregation of all durations; we could update this field
> when a
>   background worker finishes and add the duration from the one still
> running to
>   produce an more accurate number;
> * stats_reset: the reset would be handled the same way other pg_stat* views
>   handle it.
> 
> The parallel maintenance worker type doesn't exist in pg_stat_activity.
> I think
> it would be worthwhile to add it since this kind of parallel worker has
> it's
> own pool.
> 
> This view could be used by monitoring or metrology tools to raise alerts or
> trace graphs of the background worker usage, and determine if, when and
> where
> there is a shortage of workers.
> 
> Tools like pg_activity, check_postgres/check_pgactivity or prometheus
> exporters could use these stats.
> 

I'm not against adding a new statistics view like the one you describe,
but maybe it'd be better to start with just adding something basic to
pg_stat_database?

I think a minimum improvement would be to extend pg_stat_database with
the number of requested and started parallel workers, and perhaps also
the number of running parallel workers (similar to numbackends).

Not sure about the "duration" - it seems pretty different from the
worker counters, and the aggregate for all queries does not seem
particularly useful (especially if not knowing the number of queries).

And we already have this in pg_stat_statements ...

> # pg_stat_statements
> 
> This view is dedicated to per-query statistics. We could add a few metrics
> related to parallelism:
> 

Parallel Query Stats

2023-04-05 Thread Benoit Lobréau

Hi hackers,

Parallelism has been in core since 9.6, it's a great feature that got 
several

upgrades since then. However, it lacks metrics to determine if and how
parallelism is used and help tune parameters related to it.

Currently, the only information available are pg_stat_activity.backend_type
and pg_stat_activity.leader_pid. These could be sampled to get statistics
about the number of queries that are using parallel workers and the 
number of

workers spawned (globally or per statement), but this is not ideal because:

* the sampling period would require a high frequency to get stats
  close enough from reality without missing lots of short duration
  queries;
* with sampling we cannot get an accurate count of parallel queries;
* we don't know how many queries can't get the workers they asked for.

We thought about several places where we could add some user facing 
metrics, and would
like some input about the design before working on a larger patch. The 
various chosen

names are obviously not settled.

# Traces

We could add a GUC "log_parallel_draught": it would add a message in the 
logs when a

query or utility asks for parallel workers but can't get all of them.

The message could look like this. It could be issued several times per query
since workers can be requested for different parts of the plan.

  LOG:  Parallel worker draught detected: worker launched: 0, requested: 2
  STATEMENT:  explain analyze select * from pq_foo inner join pq_bar 
using(id);


  LOG:  Parallel worker draught detected: worker launched: 0, requested: 1
  CONTEXT:  while scanning relation "public.pv_tbl"
  STATEMENT:  VACUUM (PARALLEL 2, VERBOSE) pv_tbl;

  LOG:  Parallel worker draught detected: worker launched: 0, requested: 1
  STATEMENT:  CREATE INDEX ON pi_tbl(i);

This could be used in tools like pgBadger to aggregate stats
on statements that didn't get their workers, but we might need additionnal
information to know why we are lacking workers.

We have a working PoC patch for this since it seems the most
straightforward to implement and use.

# pg_stat_bgworker view

I was initially thinking about metrics like:
* number of parallel queries
* number of parallel queries that didn't get their workers
But without a number of eligible queries, it's not very useful.

Instead, some metrics could be useful:
* how many workers were requested
* how many workers were obtained.
The data would be updated as the workers are spawned
(or aren't). It would be interesting to have this information per
background worker type in order to identify which pool is the source of a
parallel worker draught.

The view could look like this:

* bgworker_type: possible values would be: logical replication worker / 
parallel

worker / parallel maintenance worker / a name given by an extension;
* datname: the database where the workers were connected if applicable, 
or null

  otherwise;
* active: number of currently running workers;
* requested: number of requested workers ;
* obtained: number of obtained workers ;
* duration: the aggregation of all durations; we could update this field 
when a
  background worker finishes and add the duration from the one still 
running to

  produce an more accurate number;
* stats_reset: the reset would be handled the same way other pg_stat* views
  handle it.

The parallel maintenance worker type doesn't exist in pg_stat_activity. 
I think

it would be worthwhile to add it since this kind of parallel worker has it's
own pool.

This view could be used by monitoring or metrology tools to raise alerts or
trace graphs of the background worker usage, and determine if, when and 
where

there is a shortage of workers.

Tools like pg_activity, check_postgres/check_pgactivity or prometheus
exporters could use these stats.

# pg_stat_statements

This view is dedicated to per-query statistics. We could add a few metrics
related to parallelism:

* parallelized_calls: how many executions were planned with parallelism;
* parallelized_draught_calls: how many executions were planned with 
parallelism but

  didn't get all their workers;
* parallel_workers_requested: how many workers were requested for this 
parallel

  statement;
* parallel_workers_total: how many workers were obtained for this 
parallel statement;


The information is useful to detect queries that didn't get their 
workers on a

regular basis. If it's sampled we could know when. It could be used by tools
like POWA to eg. visualize the query runtime depending on the number of
workers, the moment of the day it lacks the requested workers, etc.

The two last could help estimate if a query makes a heavy use of 
parallelism.


Note: I have skimmed throught the thread "Expose Parallelism counters 
planned/execute

in pg_stat_statements" [1] and still need to take a closer look at it.

[1] 
https://www.postgresql.org/message-id/flat/6acbe570-068e-bd8e-95d5-00c737b865e8%40gmail.com


# pg_stat_all_tables and pg_stat_all_indexes

We could add a parallel_seq_sc