Re: How to analyze a short but heavy intermittent slowdown on BIND on a production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 6:55 PM MichaelDBA  wrote:

> You said it's a dedicated server, but pgbouncer is running locally,
> right?  PGBouncer has a small footprint, but is the CPU high for it?
>

There are 4 pgbouncer processes in so_reuseport mode.
I have never seen more than 40% of a single CPU core used by any one pgbouncer
process (most of the time under 20%).
So the issue is unlikely to be the result of pgbouncer being overloaded.
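
For completeness, client-side queueing inside pgbouncer can also be ruled out from its admin console. A minimal sketch, assuming the admin console is enabled and reachable as the virtual "pgbouncer" database (e.g. psql -p 6432 -U pgbouncer pgbouncer); with so_reuseport the console connection lands on one of the 4 processes, so several samples may be needed, and the column names below are those of recent pgbouncer versions:

SHOW POOLS;   -- cl_waiting > 0 / maxwait > 0 mean clients are queued waiting for a server connection
SHOW STATS;   -- avg_wait_time (microseconds) shows time clients recently spent waiting for a server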


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze a short but heavy intermittent slowdown on BIND on a production database (or BIND vs log_lock_waits)

2023-01-01 Thread MichaelDBA
You said it's a dedicated server, but pgbouncer is running locally, 
right?  PGBouncer has a small footprint, but is the CPU high for it?


Maxim Boguk wrote on 1/1/2023 11:51 AM:



On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA wrote:


Hi Maxim,

10-20 active, concurrent connections is way below the point where you
should have any CPU load problem with 48 available vCPUs.
You never explicitly said what the load is, so what is it in terms of
the 1/5/15-minute load averages?


The load average is 10-15 at all times; the servers are heavily 
overprovisioned (2-3x in available CPU resources) because the application 
is quite sensitive to database latency.
And during these latency spikes EXECUTE works without any issues 
(i.e. only PARSE/BIND suck).



--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678




Regards,

Michael Vitale

michael...@sqlexec.com 

703-600-9343






Re: How to analyze a short but heavy intermittent slowdown on BIND on a production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA  wrote:

> Hi Maxim,
>
> 10-20 active, concurrent connections is way below the point where you should
> have any CPU load problem with 48 available vCPUs.
> You never explicitly said what the load is, so what is it in terms of the
> 1/5/15-minute load averages?
>
>
The load average is 10-15 at all times; the servers are heavily overprovisioned
(2-3x in available CPU resources) because the application is quite sensitive to
database latency.
And during these latency spikes EXECUTE works without any issues (i.e.
only PARSE/BIND suck).


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze a short but heavy intermittent slowdown on BIND on a production database (or BIND vs log_lock_waits)

2023-01-01 Thread MichaelDBA

Hi Maxim,

10-20 active, concurrent connections is way below the point where you 
should have any CPU load problem with 48 available vCPUs.
You never explicitly said what the load is, so what is it in terms of 
the 1/5/15-minute load averages?


Maxim Boguk wrote on 1/1/2023 11:30 AM:

1) usual load (i.e. no anomalies)
10-20 concurrent queries running (i.e. the issue isn't related to a load 
spike or similar anomalies)

additionally 5-10 short 'idle in transaction' sessions (a usual amount too)
around 300 connections to the database in total (behind the local 
pgbouncer in transaction mode)



Regards,

Michael Vitale

michael...@sqlexec.com 

703-600-9343






Re: How to analyze a short but heavy intermittent slowdown on BIND on a production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 3:27 PM MichaelDBA  wrote:

> Howdy,
>
> Few additional questions:
>
>1. How many concurrent, active connections are running when these BIND
>problems occur?  select count(*) from pg_stat_activity where state in
>('active','idle in transaction')
>2. Are the queries using gigantic IN () values?
>3. Perhaps unrelated, but is log_temp_files turned on, and if so, do
>you have a lot of logs related to that?
>
> Regards,
> Michael Vitale, just another PG DBA
>

1) usual load (i.e. no anomalies)
10-20 concurrent queries running (i.e. the issue isn't related to a load spike
or similar anomalies)
additionally 5-10 short 'idle in transaction' sessions (a usual amount too)
around 300 connections to the database in total (behind the local pgbouncer
in transaction mode)
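
As a side note, a grouped variant of that pg_stat_activity count (a sketch, using only standard catalog columns available in 14.x) shows the state / wait-event mix in a single snapshot:

-- sketch: breakdown of client backends by state and current wait event
SELECT state, wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY 1, 2, 3
ORDER BY count(*) DESC;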

2) no... long BIND for huge parameter lists is a known issue for me, but in
this case there is nothing like that... just (every?) PARSE/BIND gets stuck
for a short period (including ones which don't require pg_statistic table
access)...
Here are some funny samples from the latest spike:
2023-01-01 15:45:09.151 UTC 2421121  from [local]
[vxid:109/20732521 txid:0] [BIND] LOG:  duration: 338.830 ms  bind
: ROLLBACK
2023-01-01 15:45:09.151 UTC 2365255  from [local] [vxid:41/21277531
txid:2504447286] [PARSE] LOG:  duration: 338.755 ms  parse :
select nextval ('jobs_id_seq')
along with normal select/insert/update/delete operations which got stuck for a
short time too...

3) log_temp_files is on for sure; I found no correlation with temp file usage,
and no correlation between the latency spikes and logged autovacuum actions
either.
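
For reference, one cheap way to cross-check that (a sketch, using only the standard statistics view) is to snapshot the cumulative temp-file counters just before and just after a spike and compare:

-- sketch: cumulative temp file usage per database since the last stats reset
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
WHERE datname IS NOT NULL;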

PS: '[BIND] LOG:  duration: 338.830 ms  bind : ROLLBACK' on a
definitely not overloaded and perfectly healthy server is probably the most
curious log entry of 2022 for me.
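
If the spikes recur, one way to catch what such backends are actually waiting on (a sketch using only pg_stat_activity; run it from psql and repeat it with \watch during a suspected spike) is:

-- sketch: sample active backends that are currently waiting on something
SELECT clock_timestamp() AS sampled_at, pid, state,
       wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL;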

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze a short but heavy intermittent slowdown on BIND on a production database (or BIND vs log_lock_waits)

2023-01-01 Thread MichaelDBA

Howdy,

Few additional questions:

1. How many concurrent, active connections are running when these BIND
   problems occur?  select count(*) from pg_stat_activity where state
   in ('active','idle in transaction')
2. Are the queries using gigantic IN () values?
3. Perhaps unrelated, but is log_temp_files turned on, and if so, do you
   have a lot of logs related to that?

Regards,
Michael Vitale, just another PG DBA









Re: How to analyze a short but heavy intermittent slowdown on BIND on a production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sat, Dec 31, 2022 at 4:32 PM Justin Pryzby  wrote:

> On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> > Hi,
> >
> > When performing post-mortem analysis of some short latency spikes on a
> > heavily loaded database, I found that the reason for the (less than 10
> > second) latency spike wasn't at the EXECUTE stage but at the BIND stage.
> > At the same time, graphical monitoring shows that during this few-second
> > period there were some queries waiting at the BIND stage.
> >
> > Logging setup:
> > log_min_duration_statement=200ms
> > log_lock_waits=on
> > deadlock_timeout=100ms
> > So I expected that every lock wait over 100ms (>deadlock_timeout) should
> > be in the log.
> > But in the log I see only spikes of slow BIND but no lock waits logged.
>
> What version of postgres?  What settings have non-default values?
> What OS/version?  What environment/hardware?  VM/image/provider/...
> What are the queries that are running BIND?  What parameter types?
> Are the slow BINDs failing?  Are their parameters being logged?
> What else is running besides postgres?  Are the DB clients local or
> remote?  It shouldn't matter, but what client library?
>

What version of postgres? - 14.6

What settings have non-default values? - a lot (it's a 48-core Amazon EC2
server with 396GB of RAM)
(i.e. it's a carefully tuned database for a particular workload)
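
For anyone who wants the exact list, a minimal sketch (standard pg_settings view only) that dumps everything not at its built-in default:

-- sketch: list settings whose value does not come from the built-in default
SELECT name, setting, source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;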

What OS/version? - Ubuntu 20.04LTS

What environment/hardware? - a 48-core Amazon EC2 server with 396GB of RAM
and local NVMe storage
(i3en.12xlarge)

What are the queries that are running BIND? - nothing special; during the
problem period a lot of completely different queries become stuck in the
BIND and PARSE stages, but no long-duration (>100ms) EXECUTE calls are found;
in general it feels as if the whole BIND/PARSE machinery locks up for a short
period.
=== LOG SAMPLE ===
2023-01-01 09:07:31.622 UTC 1848286 ** from [local] [vxid:84/20886619
txid:0] [PARSE] LOG:  duration: 235.472 ms  parse : SELECT
COUNT(*) FROM "job_stats_master" WHERE (job_stats_master.created_at >
= '2022-12-31 09:07:31.35') AND (job_stats_master.created_at <
'2023-01-01 09:07:31.35') AND "job_stats_master"."employer_id" = 
AND "job_stats_master"."action" = 2 AND "job_stats_master"."job_board_id" =
 AND "job_stats_master"."ip_matching_id" = *
2023-01-01 09:07:31.622 UTC 1898699 **  from [local] [vxid:158/22054921
txid:0] [BIND] LOG:  duration: 231.274 ms  bind : SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1898699 *** from [local] [vxid:158/22054921
txid:0] [BIND] DETAIL:  parameters: $1 = '0'
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416
txid:0] [BIND] LOG:  duration: 231.024 ms  bind : SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416
txid:0] [BIND] DETAIL:  parameters: $1 = '0'

... 5 pages of BIND/PARSE entries for different queries, unrelated to each
other, logged with over 100ms runtime

2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673
txid:0] [BIND] LOG:  duration: 139.372 ms  bind : SELECT
employers.*, third_party_employer_pixels.facebook_pixel_id AS
facebook_pixel_id, third_party_employer_pixels.google_pixel_id   AS
google_pixel_id, third_party_employer_pixels.google_actionsAS
google_actions, employer_pixel_configurations.solutionAS
tracking_solution, employer_pixel_configurations.domain_name AS
domain, settings.use_multiple_bids   FROM employers LEFT JOIN
third_party_employer_pixels   ON third_party_employer_pixels.employer_id =
employers.id LEFT JOIN employer_pixel_configurations ON
employer_pixel_configurations.id = employers.id LEFT JOIN settings
 ON settings.id = employers.setting_id WHERE employers.id =
$1
2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673
txid:0] [BIND] DETAIL:  parameters: $1 = '*'
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997
txid:0] [BIND] LOG:  duration: 120.237 ms  bind : SELECT
job_boards.*, enterprises.product_type,
feed_settings.use_employer_exported_name as use_employer_exported_name,
integration_job_board_settings.integration_status as integration_status
FROM job_boards LEFT JOIN integration_job_board_settings ON
integration_job_board_settings.id =
job_boards.integration_job_board_setting_id LEFT JOIN enterprises ON
enterprises.id = job_boards.enterprise_id LEFT JOIN feed_settings ON
feed_settings.id = job_boards.feed_setting_id WHERE job_boards.id = $1
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997
txid:0] [BIND] DETAIL:  parameters: $1 = ''
===
What is really curious in the log: every one of the 100+ queries stuck in the
PARSE/BIND stage had been logged (and thus unstuck) at the same exact
moment... which highly likely means that they all had been