Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-09-01 Thread Maxim Boguk
>
> But anyway, PostgreSQL has features to prevent the index bloat from
> becoming too severe of a problem, and you should figure out why they are
> not working for you.  The most common ones I know of are 1) long open
> snapshots preventing clean up, 2) all index scans being bitmap index scans,
> which don't to micro-vacuuming/index hinting the way ordinary btree
> index scans do, and 3) running the queries on a hot-standby, where index
> hint bits must be ignored.  If you could identify and solve this issue,
> then you wouldn't need to twist yourself into knots avoiding non-HOT
> updates.
>

I am not sure that kill bits can be a complete fix for indexes with tens
of millions of dead entries and only a handful of live ones. As I
understand the mechanics of kill bits, they help avoid excessive heap
visibility checks for dead tuples, but entries with the kill bit set still
have to be read from the index first, and with many millions of dead
entries that isn't free.
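
For reference, a quick way to see how bad the bloat actually is on such an
index - a minimal sketch only, assuming the pgstattuple extension is
available (the index name is the partial index from the original report):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- avg_leaf_density and leaf_fragmentation give a rough picture of how much
-- of the index is dead weight before deciding on a REINDEX CONCURRENTLY.
SELECT index_size, avg_leaf_density, leaf_fragmentation
  FROM pgstatindex('tasks_pending_status_created_type_idx');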

PS: ignoring kill bits on hot standby replicas is a source of endless pain
in many cases.

--
Maxim Boguk
Senior Postgresql DBA

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-08-31 Thread Maxim Boguk
> At any moment, there are *around 1000-1500 tasks in pending statuses*
> (Init + InProgress) out of around 500 million tasks.
>
> Now, we have a task monitoring query that will look for all pending tasks
> that have not received any update in the last n minutes.
>
> ```
> SELECT [columns list]
>   FROM tasks
>   WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days' AND
> updated < NOW() - interval '30 minutes'
> ```
>
> Since we are only interested in the pending tasks, I created a partial
> index
>  `*"tasks_pending_status_created_type_idx" btree (status, created,
> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>
> This worked great initially, however this started to get bloated very very
> quickly because, every task starts in pending state, gets multiple updates
> (and many of them are not HOT updates, working on optimizing fill factor
> now), and eventually gets deleted from the index (as status changes to
> success).
>

From my experience I suspect that the problem lies with "of around 500
million tasks."
Autovacuum does clean old dead index entries, but how many such dead
index entries will accumulate on a 500M-row table before autovacuum kicks
in?

With the default autovacuum_vacuum_scale_factor of 0.2 (20% of table size),
the index will accumulate something like 100M outdated/dead index entries
before autovacuum kicks in and cleans them all (in the worst case), and of
course that leads to huge index bloat and awful performance.

Even if you scale autovacuum_vacuum_scale_factor down to an unreasonably
low value like 0.01, the index still bloats to around 5M dead entries
between autovacuum runs, and constant vacuuming of a huge 500M-row table
puts a heavy load on the database server.
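
For completeness, the per-table knobs discussed above look like this - a
hedged sketch only, with illustrative numbers, and as noted it still does
not remove the fundamental problem of vacuuming a 500M-row table:

-- Switch the tasks table from percentage-based to fixed-threshold
-- autovacuum triggering, so cleanup starts after a fixed number of dead
-- tuples rather than after 20% of 500M rows.
ALTER TABLE tasks SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold    = 100000
);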

Unfortunately there is no easy way out of this situation on the database
side. In general I recommend not trying to implement a fast-paced,
queue-like workload inside a huge and constantly growing table; it never
works well, because you cannot keep efficient partial indexes for the queue
in a clean/non-bloated state.

In my opinion the best solution is to keep the list of entries to process
("around 1000-1500 tasks in pending statuses") duplicated in a separate
tiny table (via triggers, or implemented at the application level - a rough
trigger sketch follows below); in that case autovacuum will be able to
clean dead entries from the index quickly.
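
A rough sketch of the trigger-based variant, assuming a simplified
tasks(id, status, updated) layout - the id column and the pending_tasks
table are illustrative, not taken from the original schema:

CREATE TABLE pending_tasks (
    task_id bigint PRIMARY KEY,
    updated timestamptz NOT NULL
);

CREATE OR REPLACE FUNCTION sync_pending_tasks() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        DELETE FROM pending_tasks WHERE task_id = OLD.id;
    ELSIF NEW.status NOT IN (3, 4, 5) THEN
        -- task is still pending: keep its row in the tiny table up to date
        INSERT INTO pending_tasks (task_id, updated)
        VALUES (NEW.id, NEW.updated)
        ON CONFLICT (task_id) DO UPDATE SET updated = EXCLUDED.updated;
    ELSE
        -- task reached a final status: drop it from the tiny table
        DELETE FROM pending_tasks WHERE task_id = NEW.id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tasks_sync_pending
    AFTER INSERT OR UPDATE OR DELETE ON tasks
    FOR EACH ROW EXECUTE FUNCTION sync_pending_tasks();

The monitoring query then runs against pending_tasks only, which stays tiny
and is trivially cheap for autovacuum to keep clean.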

Kind Regards,
Maxim


-- 
Maxim Boguk
Senior Postgresql DBA

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-05 Thread Maxim Boguk
On Thu, Jan 5, 2023 at 1:31 PM MichaelDBA  wrote:

>
> What happens if you take pg_stat_statements out of the picture (remove
> from shared_preload_libraries)?  Does your BIND problem go away?
>

I didn't test this idea, because it requires a restart of the database
(which cannot be done quickly), and without pg_stat_statements there would
be no adequate performance monitoring of the database.
But I'm pretty sure that the issue would go away with pg_stat_statements
disabled.

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-05 Thread Maxim Boguk
On Sat, Dec 31, 2022 at 2:26 PM Maxim Boguk  wrote:

> Hi,
>
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for (less than 10 second
> latency spike) wasn't on the EXECUTE stage but on the BIND stage.
> At the same time graphical monitoring shows that during this few second
> period there were some queries waiting in the BIND stage.
>
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
> So I expected that every lock waiting over 100ms (>deadlock_timeout)
> should be in the log.
> But in the log I see only spikes on slow BIND but not lock waits logged.
> (
> grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration'
> | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
> ...
>   9 2022-12-29 00:12:5
>   2 2022-12-29 00:13:1
>   3 2022-12-29 00:13:5
> !!!  68 2022-12-29 00:14:0
>   5 2022-12-29 00:14:1
>   3 2022-12-29 00:14:2
>   2 2022-12-29 00:14:3
> ).
> But no lock waits on the BIND stage logged during the problem period (and
> no lock waits in general).
> Similar issues happen a few times per day without any visible pattern (but
> on the same tables usually).
> No CPU or IO load/latency spikes found during problem periods.
> No EXECUTE slowdown found in the log during that time.
>


Follow-up research into this issue led me to the following results:
Every logged spike of BIND/PARSE response time correlated with the
corresponding backend waiting on
wait_event_type = LWLock
wait_event = pg_stat_statements
and all of these spikes happen during an increment of the
pg_stat_statements_info.dealloc counter.
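
(The correlation is easy to check by sampling the counter - a minimal
sketch, for PostgreSQL 14 with pg_stat_statements 1.9+ - and matching the
increments against the timestamps of the slow BIND/PARSE log entries:

select dealloc, stats_reset from pg_stat_statements_info;
\watch 1
)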

Some searching on this issue led me to the following blog post about a
similar problem:
https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck.html

However, we already have pg_stat_statements.max=1, so a further increase
of this parameter seems counterproductive (the size of
14/main/pg_stat_tmp/pgss_query_texts.stat is already over 20MB).


Open questions remain:
1) Is it expected behaviour for pg_stat_statements to block every
BIND/PARSE during deallocation of the least-used entries, for the whole
duration of the cleanup?


2) Is there any recommended workaround for this issue for systems with a
strict latency SLA?
(Blocking every database query that uses the extended query protocol for
200-500ms, 50+ times per day at random times, isn't acceptable in our case,
unfortunately.)


3) Why does only BIND/PARSE lock, but not EXECUTE?
(Maybe some difference in the implementation of the plan vs exec
pg_stat_statements counters?)


Kind Regards,
Maxim


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 6:55 PM MichaelDBA  wrote:

> You said it's a dedicated server, but pgbouncer is running locally,
> right?  PGBouncer has a small footprint, but is the CPU high for it?
>

There are 4 pgbouncer processes in so_reuseport mode.
I never saw more than 40% of a single CPU core per pgbouncer process
(most of the time under 20%),
so this is unlikely to be a result of pgbouncer being overloaded.


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA  wrote:

> Hi Maxim,
>
> 10-20 active, concurrent connections is way below any CPU load problem you
> should have with 48 available vCPUs.
> You never explicitly said what the load is, so what is it in the context
> of the 1,5,15?
>
>
Load average is 10-15 at all times; the servers are heavily overprovisioned
(2-3x in available CPU resources) because the application is quite
sensitive to database latency.
And during these latency spikes EXECUTE works without any issues (i.e. only
PARSE/BIND get stuck).


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sun, Jan 1, 2023 at 3:27 PM MichaelDBA  wrote:

> Howdy,
>
> Few additional questions:
>
>1. How many concurrent, active connections are running when these BIND
>problems occur?  select count(*) from pg_stat_activity where state in
>('active','idle in transaction')
>2. Are the queries using gigantic IN () values?
>3. Perhaps unrelated, but is log_temp_files turned on, and if so, do
>you have a lot of logs related to that?
>
> Regards,
> Michael Vitale, just another PG DBA
>

1) Usual load (i.e. no anomalies):
10-20 concurrent queries running (so the issue isn't related to a load
spike or similar anomaly),
plus 5-10 short idle-in-transaction sessions (the usual amount too),
and around 300 total active connections to the database (behind a local
pgbouncer in transaction mode).

2) No... long BINDs for huge parameter lists are a known issue for me, but
in this case there is nothing like that... just (every?) PARSE/BIND stuck
for a short period (including ones which don't require pg_statistic table
access)...
There are some funny samples from the latest spike:
2023-01-01 15:45:09.151 UTC 2421121  from [local]
[vxid:109/20732521 txid:0] [BIND] LOG:  duration: 338.830 ms  bind
: ROLLBACK
2023-01-01 15:45:09.151 UTC 2365255  from [local] [vxid:41/21277531
txid:2504447286] [PARSE] LOG:  duration: 338.755 ms  parse :
select nextval ('jobs_id_seq')
along with normal select/insert/update/delete operations that were stuck
for a short time too...

3) log_temp_files is on for sure; I found no correlation with temp file
usage, nor any correlation between the latency spikes and logged autovacuum
activity.

PS: '[BIND] LOG:  duration: 338.830 ms  bind : ROLLBACK' on a definitely
not overloaded and perfectly healthy server is probably the most curious
log entry of 2022 for me.

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2023-01-01 Thread Maxim Boguk
On Sat, Dec 31, 2022 at 4:32 PM Justin Pryzby  wrote:

> On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> > Hi,
> >
> > When performing post-mortem analysis of some short latency spikes on a
> > heavily loaded database, I found that the reason for (less than 10 second
> > latency spike) wasn't on the EXECUTE stage but on the BIND stage.
> > At the same time graphical monitoring shows that during this few second
> > period there were some queries waiting in the BIND stage.
> >
> > Logging setup:
> > log_min_duration_statement=200ms
> > log_lock_waits=on
> > deadlock_timeout=100ms
> > So I expected that every lock waiting over 100ms (>deadlock_timeout)
> should
> > be in the log.
> > But in the log I see only spikes on slow BIND but not lock waits logged.
>
> What version postgres?  What settings have non-default values ?
> What OS/version?  What environment/hardware?  VM/image/provider/...
> What are the queries that are running BIND ?  What parameter types ?
> Are the slow BINDs failing?  Are their paramters being logged ?
> What else is running besides postgres ?  Are the DB clients local or
> remote ?  It shouldn't matter, but what client library?
>

What version of postgres? - 14.6

What settings have non-default values? - a lot (it's a 48-core Amazon EC2
server with 396GB of RAM)
(i.e. the database is carefully tuned for this particular workload)

What OS/version? - Ubuntu 20.04 LTS

What environment/hardware? - a 48-core Amazon EC2 server with 396GB of RAM
and local NVMe storage
(i3en.12xlarge)

What are the queries that are running BIND? - nothing special; during a
problem period a lot of completely different queries get stuck in the BIND
and PARSE stages, but no long-duration (>100ms) EXECUTE calls are found. In
general it feels as if the whole BIND/PARSE machinery locks up for a short
period.
 LOG SAMPLE ==
2023-01-01 09:07:31.622 UTC 1848286 ** from [local] [vxid:84/20886619
txid:0] [PARSE] LOG:  duration: 235.472 ms  parse : SELECT
COUNT(*) FROM "job_stats_master" WHERE (job_stats_master.created_at >
= '2022-12-31 09:07:31.35') AND (job_stats_master.created_at <
'2023-01-01 09:07:31.35') AND "job_stats_master"."employer_id" = 
AND "job_stats_master"."action" = 2 AND "job_stats_master"."job_board_id" =
 AND "job_stats_master"."ip_matching_id" = *
2023-01-01 09:07:31.622 UTC 1898699 **  from [local] [vxid:158/22054921
txid:0] [BIND] LOG:  duration: 231.274 ms  bind : SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1898699 *** from [local] [vxid:158/22054921
txid:0] [BIND] DETAIL:  parameters: $1 = '0'
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416
txid:0] [BIND] LOG:  duration: 231.024 ms  bind : SELECT id, name
FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416
txid:0] [BIND] DETAIL:  parameters: $1 = '0'

... 5 pages of BIND/PARSE entries for different, unrelated queries, all
logged with over 100ms runtime

2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673
txid:0] [BIND] LOG:  duration: 139.372 ms  bind : SELECT
employers.*, third_party_employer_pixels.facebook_pixel_id AS
facebook_pixel_id, third_party_employer_pixels.google_pixel_id   AS
google_pixel_id, third_party_employer_pixels.google_actionsAS
google_actions, employer_pixel_configurations.solutionAS
tracking_solution, employer_pixel_configurations.domain_name AS
domain, settings.use_multiple_bids   FROM employers LEFT JOIN
third_party_employer_pixels   ON third_party_employer_pixels.employer_id =
employers.id LEFT JOIN employer_pixel_configurations ON
employer_pixel_configurations.id = employers.id LEFT JOIN settings
 ON settings.id = employers.setting_id WHERE employers.id =
$1
2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673
txid:0] [BIND] DETAIL:  parameters: $1 = '*'
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997
txid:0] [BIND] LOG:  duration: 120.237 ms  bind : SELECT
job_boards.*, enterprises.product_type,
feed_settings.use_employer_exported_name as use_employer_exported_name,
integration_job_board_settings.integration_status as integration_status
FROM job_boards LEFT JOIN integration_job_board_settings ON
integration_job_board_settings.id =
job_boards.integration_job_board_setting_id LEFT JOIN enterprises ON
enterprises.id = job_boards.enterprise_id LEFT JOIN feed_settings ON
feed_settings.id = job_boards.feed_setting_id WHERE job_boards.id = $1
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997
txid:0] [BIND] DETAIL:  p

How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)

2022-12-31 Thread Maxim Boguk
Hi,

When performing post-mortem analysis of some short latency spikes on a
heavily loaded database, I found that the cause of a (less than 10 second)
latency spike wasn't in the EXECUTE stage but in the BIND stage.
At the same time, graphical monitoring shows that during this few-second
period there were some queries waiting in the BIND stage.

Logging setup:
log_min_duration_statement=200ms
log_lock_waits=on
deadlock_timeout=100ms
So I expected that every lock wait over 100ms (>deadlock_timeout) should
be in the log.
But in the log I see only the spikes of slow BINDs, with no lock waits
logged.
(
grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration' |
perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
...
  9 2022-12-29 00:12:5
  2 2022-12-29 00:13:1
  3 2022-12-29 00:13:5
!!!  68 2022-12-29 00:14:0
  5 2022-12-29 00:14:1
  3 2022-12-29 00:14:2
  2 2022-12-29 00:14:3
).
But no lock waits in the BIND stage were logged during the problem period
(and no lock waits in general).
Similar issues happen a few times per day without any visible pattern (but
usually on the same tables).
No CPU or IO load/latency spikes were found during the problem periods.
No EXECUTE slowdown was found in the log during that time.

So currently I have two hypotheses under investigation:
1) during the BIND stage not every lock wait gets logged
2) there is some intermittent slowdown of BIND that is not lock-related

I would welcome any ideas on how to debug this issue (the duration of such
a spike is usually under 1s, but given how many TPS the database serves, 1s
is too much and affects end users).


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 
Phone AU: +61  45 218 5678


Re: High insert rate server, unstable insert latency and load peaks with buffer_content and XidGenLock LWlocks with Postgresql 12 version

2020-04-09 Thread Maxim Boguk
On Fri, Apr 10, 2020 at 1:16 AM Justin Pryzby  wrote:

> On Fri, Apr 10, 2020 at 12:51:03AM +1000, Maxim Boguk wrote:
> > With database on dedicated server I encountered unusual load profile:
> > multi thread (200 connections static size pool via pgbouncer) insert only
> > into single table around 15.000 insert/s.
> >
> > Usually insert took 0.025ms and amount active backends (via
> > pg_stat_activity) usually stay in 1-5-10 range.
> > But every so while (few times per minute actually) number of active
> backend
> > go up to all 200 allowed connections.
> > Which lead to serious latency in latency sensitive load.
> >
> > No problem with IO latency or CPU usage found during performance analyze.
> > syncronous_commit = off
>
> Can you share other settings ?  shared_buffers, checkpoint_*, bgwriter_*
> and
> max_wal_size ?  And version()
>


version -  PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0,
64-bit
shared_buffers 140GB
checkpoint_timeout  1h
checkpoint_flush_after 0
checkpoint_completion_target 0.9
bgwriter_delay 10ms
bgwriter_flush_after 0
bgwriter_lru_maxpages 1
bgwriter_lru_multiplier 10
max_wal_size 128GB

Checkpoints happen every 1h, and the lag spikes don't depend on
checkpointer activity.
buffers_checkpoint 92% of writes, buffers_clean 2% of writes,
buffers_backend 6% of writes (over the course of 5 minutes).
Nothing especially suspicious on graphical monitoring of these values
either.


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 
Phone UA: +380 99 143 
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"


High insert rate server, unstable insert latency and load peaks with buffer_content and XidGenLock LWlocks with Postgresql 12 version

2020-04-09 Thread Maxim Boguk
Hi,

I have a performance issue which I have never seen before in my 20+ years
of experience with PostgreSQL.

With the database on a dedicated server I encountered an unusual load
profile: multi-threaded (200 connections, static-size pool via pgbouncer),
insert-only traffic into a single table, around 15,000 inserts/s.

Usually an insert takes 0.025ms and the number of active backends (via
pg_stat_activity) stays in the 1-5-10 range.
But every so often (a few times per minute, actually) the number of active
backends goes up to all 200 allowed connections,
which leads to serious latency in a latency-sensitive workload.

No problem with IO latency or CPU usage was found during performance
analysis.
synchronous_commit = off

To analyze what is going on with locks I ran:
\o ~/tmp/watch_events.log
select wait_event_type,wait_event,count(*) from pg_stat_activity where
state='active' and backend_type='client backend' group by 1,2 order by 3
desc
\watch 0.1

Normal output when all goes well:
 wait_event_type | wait_event | count
-----------------+------------+-------
 Client          | ClientRead |     5
                 |            |     4
(a few processes running queries and a few processes doing network IO)

Bad case (a few times per minute, huge latency peak, some inserts took up
to 100ms to run):
 wait_event_type |   wait_event   | count
-----------------+----------------+-------
 LWLock          | buffer_content |   178
 LWLock          | XidGenLock     |    21
 IO              | SLRUSync       |     1
                 |                |     1

So almost all backends are waiting on the buffer_content lock, some
backends are waiting for XidGenLock,
and there is always one backend in SLRUSync.

If anyone has a good idea of what is going on in this case and how I can
fix it - any ideas are welcome.
So far I am out of ideas.


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 
Phone UA: +380 99 143 
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"


Re: zabbix on postgresql - very slow delete of events

2019-07-24 Thread Maxim Boguk
On Wed, Jul 24, 2019 at 6:12 PM Kristian Ejvind 
wrote:

> Hi Maxim
>
>
>
> Thanks for your advice, and let me start with your second email, which
> I'll copy here:
>
>
>
> =
>
> Hi Kristian,
>
>
>
> After comparing structure of zabbix tables with same in my zabbix
> installation I found one very weird difference.
>
> Why type of events.eventid had been changed from default bigint to numeric?
>
>
>
> I suspect that the difference between events.eventid (numeric) type
> and event_recovery.*_eventid (bigint) types might lead to inability of use
> index during foreign key checks.
>
> Anyway it will be clearly visible on the pg_stat_xact_user_tables results
> (I now expect to see 3 sequential scan on event_recovery and may be on some
> other tables as well).
>
>
>
> Kind Regards,
>
> Maxim
>
> =
>
>
>
> Well spotted! On closer examination it seems that data types are wrong in
> several places. I suspect that this comes
>
> from the time when our Zabbix ran on a MySQL database, which was converted
> over to PostgreSQL a few years
>
> ago. I agree this discrepancy is suspicious and I will continue to examine
> it.
>
>
>
> Regarding your ideas in the email below, I can say that 1) is not valid,
> disk latency is in the range of a few ms.
>
> This is the output from your recommended query, which seems to verify your
> suspicions.
>
>
>
> zabbix_34=# begin; delete from zabbix.events where eventid = 7123123;
> select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order
> by seq_scan+idx_scan desc; rollback;
>
> Time: 0.113 ms
>
> Time: 4798.189 ms (00:04.798)
>
>  relid  | schemaname |    relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd
> --------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------
>   41940 | zabbix     | event_recovery |        3 |     35495224 |        0 |             0 |         0 |         0 |         1 |             0
>   41675 | zabbix     | alerts         |        1 |       544966 |        1 |             0 |         0 |         0 |         0 |             0
>   42573 | zabbix     | problem        |        2 |        13896 |        0 |             0 |         0 |         0 |         0 |             0
>   41943 | zabbix     | event_tag      |        1 |        22004 |        0 |             0 |         0 |         0 |         0 |             0
>   41649 | zabbix     | acknowledges   |        1 |           47 |        0 |             0 |         0 |         0 |         0 |             0
>   41951 | zabbix     | events         |        0 |            0 |        1 |             1 |         0 |         0 |         1 |             0
>  260215 | zabbix     | event_suppress |        1 |            0 |        0 |             0 |         0 |         0 |         0 |             0
>

Hi Kristian,

This result definitely proves that the indexes are not used during the
foreign key checks (see the non-zero seq_scan counters for the linked
tables).
The only plausible reason (IMHO) is the incorrect use of numeric in place
of bigint.
I recommend changing the type of events.eventid (and any other similar
fields) to bigint.
It should resolve your performance issues with deletes on the events table
(as an additional bonus, bigint is a much faster and more compact type than
numeric).
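
A hedged sketch of the type change (column and table names as discussed
above); note that on a table of this size it rewrites the table, takes an
ACCESS EXCLUSIVE lock and re-validates the referencing foreign keys, so it
needs a proper maintenance window:

ALTER TABLE zabbix.events
    ALTER COLUMN eventid TYPE bigint USING eventid::bigint;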

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 
Phone UA: +380 99 143 
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"


Re: zabbix on postgresql - very slow delete of events

2019-07-24 Thread Maxim Boguk
>
>
> All these queries execute well below 1 ms, using indexes.
>
>
>
> Let's delete one row. See explain results here:
> https://explain.depesz.com/s/aycf . 5 seconds to delete a single row,
> wow!
>
> This shows that it is the foreign key constraints on event_recovery and
> alerts that take a lot of time.
>
> But why? I far as I can see, the delete is fully CPU bound during
> execution.
>
>
>
> Deleting the corresponding row directly from event_recovery or alerts
> executes in less than 0.1 ms.
>
>
>
> Any ideas?
>
>
>
> I've observed that alerts and event_recovery tables both have more than
> one foreign key that references events, if that matters.
>
>
>
Hi Kristian,

After comparing the structure of the zabbix tables with the same tables in
my own zabbix installation, I found one very weird difference.
Why has the type of events.eventid been changed from the default bigint to
numeric?

I suspect that the mismatch between the events.eventid type (numeric) and
the event_recovery.*_eventid types (bigint) might make it impossible to use
the index during foreign key checks.
Anyway, it will be clearly visible in the pg_stat_xact_user_tables results
(I now expect to see 3 sequential scans on event_recovery and maybe on some
other tables as well).

Kind Regards,
Maxim


-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 
Phone UA: +380 99 143 
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"


Re: zabbix on postgresql - very slow delete of events

2019-07-24 Thread Maxim Boguk
Hi Kristian,

If you look at the explain analyze results for the delete,
you will see that 99% of the query time is spent on the foreign key
trigger checks.
At the same time, the database has the indexes on the foreign key side in
place.


I recommend trying this:

\timing on
BEGIN;
delete from zabbix.events where eventid = [some testing id];
select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order
by seq_scan+idx_scan desc;
ABORT;

Then provide the result of the last query and how long the delete runs.
It might help us understand what's going on.

Currently I have 3 ideas:
1) a very slow and overloaded IO subsystem
2) a lot of rows being deleted via ON DELETE CASCADE
3) some locking that prevents the foreign key checks from running fast



On Wed, Jul 24, 2019 at 11:12 AM Kristian Ejvind 
wrote:

> Hi.
>
> Well, the events table has both a primary key and foreign keys referencing
> it, which is not possible
> on a partitioned table in postgresql 10. How did you work around this
> issue?
>
> On the other hand, if we can get the deletion of rows from the events
> table run at normal speed, I
> can't imagine we would have a problem with it in a long time. After all,
> although our Zabbix installation
> definitely is larger than "small", it's still far from "large".
>
> I think I would need assistance with debugging why postgresql behaves like
> it does.
> Is there a defect with deleting data from a table that has multiple
> foreign keys referencing it from a  certain table?
> Is there a problem with the query optimizer that chooses the wrong plan
> when working on the foreign key constraints?
> How do I inspect how the db works on the deletion of rows from the
> referencing tables?
>
> Regards
> Kristian
>
>
>
> On 2019-07-23, 16:33, "Kenneth Marshall"  wrote:
>
> On Tue, Jul 23, 2019 at 01:41:53PM +, Kristian Ejvind wrote:
> > Thanks Kenneth. In fact we've already partitioned the largest
> history* and trends* tables
> > and that has been running fine for a year. Performance was vastly
> improved. But since you
> > can't have a unique index on a partitioned table in postgres 10, we
> haven't worked on that.
> >
> > Regards
> > Kristian
>
> Hi Kristian,
>
> Why are you not partitioning the events and alerts tables as well? That
> would eliminate this problem and you already have the infrastructure in
> place to support the management since you are using it for the history
> and trends tables.
>
> Regards,
> Ken
>
>
>
>
>
>
> Resurs Bank AB
> Kristian Ejvind
> Linux System Administrator
> IT Operations | Technical Operations
>
> Ekslingan 8
> Box 222 09, SE-25467 Helsingborg
>
> Direkt Tfn:
> Mobil: +46 728571483
> Vxl: +46 42 382000
> Fax:
> E-post: kristian.ejv...@resurs.se
> Webb: http://www.resursbank.se
>
>
>
>

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7  985 433 
Phone UA: +380 99 143 
Phone AU: +61  45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно
когда я так делаю ещё раз?"