Re: Index bloat and REINDEX/VACUUM optimization for partial index
> But anyway, PostgreSQL has features to prevent the index bloat from
> becoming too severe of a problem, and you should figure out why they are
> not working for you. The most common ones I know of are 1) long open
> snapshots preventing clean up, 2) all index scans being bitmap index scans,
> which don't do micro-vacuuming/index hinting the way ordinary btree
> index scans do, and 3) running the queries on a hot-standby, where index
> hint bits must be ignored. If you could identify and solve this issue,
> then you wouldn't need to twist yourself into knots avoiding non-HOT
> updates.

I am not sure that kill bits can be a complete fix for indexes with tens of millions of dead entries and only a handful of live ones. As I understand the mechanics of kill bits, they help avoid excessive heap visibility checks for dead tuples, but tuples with the kill bit set still have to be read from the index first, and with many millions of dead entries that isn't free.

PS: ignoring kill bits on hot standby replicas is a source of endless pain in many cases.

--
Maxim Boguk
Senior Postgresql DBA

Phone UA: +380 99 143
Phone AU: +61 45 218 5678
Re: Index bloat and REINDEX/VACUUM optimization for partial index
> At any moment, there are *around 1000-1500 tasks in pending statuses*
> (Init + InProgress) out of around 500 million tasks.
>
> Now, we have a task monitoring query that will look for all pending tasks
> that have not received any update in the last n minutes.
>
> ```
> SELECT [columns list]
> FROM tasks
> WHERE status NOT IN (3,4,5)
>   AND created > NOW() - INTERVAL '30 days'
>   AND updated < NOW() - INTERVAL '30 minutes'
> ```
>
> Since we are only interested in the pending tasks, I created a partial
> index
> `*"tasks_pending_status_created_type_idx" btree (status, created,
> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>
> This worked great initially, however this started to get bloated very very
> quickly because every task starts in pending state, gets multiple updates
> (and many of them are not HOT updates, working on optimizing fill factor
> now), and eventually gets deleted from the index (as status changes to
> success).

From my experience, I suspect the problem is the "around 500 million tasks" part. Autovacuum does clean old dead index entries, but how many dead index entries will accumulate on a 500M-row table before autovacuum kicks in? With the default autovacuum_vacuum_scale_factor (0.2, i.e. 20% of table size), the index can collect something like 100M outdated/dead entries before autovacuum kicks in and cleans them all (in the worst case), and of course that leads to huge index bloat and awful performance. Even if you scale autovacuum_vacuum_scale_factor down to an unreasonably low value like 0.01, the index still bloats to around 5M dead entries between autovacuum runs, and constant vacuuming of a huge 500M-row table puts a serious load on the database server.
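For reference, the scale-factor arithmetic above maps onto per-table storage parameters, which can be lowered for this one table without touching the cluster-wide defaults (the values below are purely illustrative; as discussed, on a 500M-row table even aggressive settings only reduce, not eliminate, the bloat):

```sql
-- Trigger autovacuum on this table by an absolute dead-tuple count instead
-- of a fraction of the 500M rows; numbers are illustrative, not a tuned
-- recommendation.
ALTER TABLE tasks SET (
    autovacuum_vacuum_scale_factor = 0,      -- disable the fractional threshold
    autovacuum_vacuum_threshold    = 100000  -- vacuum after ~100k dead tuples
);
```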
Unfortunately there is no easy way out of this situation on the database side. In general I recommend not trying to implement a fast-paced, queue-like workload inside a huge and constantly growing table; it never works well, because you cannot keep efficient partial indexes for the queue in a clean/non-bloated state.

In my opinion the best solution is to keep the list of entries to process ("*around 1000-1500 tasks in pending statuses*") duplicated in a separate tiny table (via triggers, or implemented at the application level); in that case autovacuum will be able to quickly clean dead entries from the index.

Kind Regards,
Maxim
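A minimal sketch of the trigger-maintained side table suggested above (the tasks/status/updated names and the 3/4/5 status codes come from the quoted schema; everything else is an assumption, not code from the thread):

```sql
-- Hypothetical side table: holds only the ~1000-1500 pending tasks,
-- so autovacuum can keep it (and its index) clean cheaply.
CREATE TABLE pending_tasks (
    task_id bigint PRIMARY KEY,
    updated timestamptz NOT NULL
);

CREATE FUNCTION sync_pending_tasks() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        DELETE FROM pending_tasks WHERE task_id = OLD.id;
    ELSIF NEW.status NOT IN (3, 4, 5) THEN
        -- task is (still) pending: upsert its last-update time
        INSERT INTO pending_tasks (task_id, updated)
        VALUES (NEW.id, NEW.updated)
        ON CONFLICT (task_id) DO UPDATE SET updated = EXCLUDED.updated;
    ELSE
        -- task reached a final status: drop it from the queue table
        DELETE FROM pending_tasks WHERE task_id = NEW.id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tasks_sync_pending
    AFTER INSERT OR UPDATE OR DELETE ON tasks
    FOR EACH ROW EXECUTE FUNCTION sync_pending_tasks();
```

The monitoring query would then scan only pending_tasks and join back to tasks, instead of relying on the bloated partial index.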
Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
On Thu, Jan 5, 2023 at 1:31 PM MichaelDBA wrote:
> What happens if you take pg_stat_statements out of the picture (remove
> from shared_preload_libraries)? Does your BIND problem go away?

I didn't test this idea, because it requires a restart of the database (which cannot be done quickly), and without pg_stat_statements there would be no adequate performance monitoring of the database. But I'm pretty sure the issue would go away with pg_stat_statements disabled.

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143
Phone AU: +61 45 218 5678
Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
On Sat, Dec 31, 2022 at 2:26 PM Maxim Boguk wrote:
> Hi,
>
> When performing post-mortem analysis of some short latency spikes on a
> heavily loaded database, I found that the reason for (less than 10 second)
> latency spikes wasn't on the EXECUTE stage but on the BIND stage.
> At the same time graphical monitoring shows that during this few-second
> period there were some queries waiting in the BIND stage.
>
> Logging setup:
> log_min_duration_statement=200ms
> log_lock_waits=on
> deadlock_timeout=100ms
>
> So I expected that every lock wait over 100ms (>deadlock_timeout)
> should be in the log.
> But in the log I see only spikes of slow BINDs but no lock waits logged.
> (
> grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration'
> | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
> ...
>       9 2022-12-29 00:12:5
>       2 2022-12-29 00:13:1
>       3 2022-12-29 00:13:5
> !!!  68 2022-12-29 00:14:0
>       5 2022-12-29 00:14:1
>       3 2022-12-29 00:14:2
>       2 2022-12-29 00:14:3
> ).
> But no lock waits on the BIND stage were logged during the problem period
> (and no lock waits in general).
> Similar issues happen a few times per day without any visible pattern
> (but usually on the same tables).
> No CPU or IO load/latency spikes found during problem periods.
> No EXECUTE slowdown found in the log during that time.

Follow-up research into this issue led me to the following results: every logged spike of BIND/PARSE response time correlates with the corresponding backend waiting on wait_event_type = LWLock, wait_event = pg_stat_statements, and all of these spikes happen during an increment of the pg_stat_statements_info.dealloc counter.

Some searching about this issue led me to the following blog post about a similar problem:
https://yhuelf.github.io/2021/09/30/pg_stat_statements_bottleneck.html

However, we already have pg_stat_statements.max=1 so a further increase of this parameter seems counterproductive (the size of 14/main/pg_stat_tmp/pgss_query_texts.stat is already over 20MB).
Open questions remain:

1) Is it expected behaviour of pg_stat_statements to block every BIND/PARSE during deallocation of the least used entries, for the whole duration of the cleanup?

2) Is there any recommended workaround for this issue on systems with a strict latency SLA? Blocking every database query (that uses the extended query protocol) for 200-500ms, 50+ times per day at random times, isn't acceptable in our case, unfortunately.

3) Why are only BIND/PARSE blocked but not EXECUTE? (Maybe some difference in the implementation of the plan vs exec pg_stat_statements counters?)

Kind Regards,
Maxim
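For anyone wanting to reproduce the correlation described above, the deallocation counter can be polled directly (the pg_stat_statements_info view exists in PostgreSQL 14+ with pg_stat_statements installed):

```sql
-- A jump in "dealloc" at the same moment as a BIND/PARSE latency spike
-- points at pg_stat_statements entry eviction as the culprit.
SELECT dealloc, stats_reset
FROM pg_stat_statements_info;
```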
Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
On Sun, Jan 1, 2023 at 6:55 PM MichaelDBA wrote:
> You said it's a dedicated server, but pgbouncer is running locally,
> right? PGBouncer has a small footprint, but is the CPU high for it?

There are 4 pgbouncer processes in so_reuseport mode. I never saw more than 40% of a single CPU core per pgbouncer process (most of the time under 20%). So this is unlikely to be a result of pgbouncer being overloaded.
Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
On Sun, Jan 1, 2023 at 6:43 PM MichaelDBA wrote:
> Hi Maxim,
>
> 10-20 active, concurrent connections is way below any CPU load problem you
> should have with 48 available vCPUs.
> You never explicitly said what the load is, so what is it in the context
> of the 1,5,15?

Load average is 10-15 at all times; the servers are heavily overprovisioned (2-3x the required CPU resources) because the application is quite sensitive to database latency. And during these latency spikes EXECUTE works without any issues (i.e. only PARSE/BIND suffer).
Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
On Sun, Jan 1, 2023 at 3:27 PM MichaelDBA wrote:
> Howdy,
>
> Few additional questions:
>
> 1. How many concurrent, active connections are running when these BIND
>    problems occur? select count(*) from pg_stat_activity where state in
>    ('active','idle in transaction')
> 2. Are the queries using gigantic IN () values?
> 3. Perhaps unrelated, but is log_temp_files turned on, and if so, do
>    you have a lot of logs related to that?
>
> Regards,
> Michael Vitale, just another PG DBA

1) Usual load (i.e. no anomalies): 10-20 concurrent queries running (so the issue isn't related to load spikes or similar anomalies), plus 5-10 short "idle in transaction" sessions (a usual amount too), and around 300 total active connections to the database (behind a local pgbouncer in transaction mode).

2) No. A long BIND for huge parameter lists is a known issue for me, but there is nothing like that here; just (every?) PARSE/BIND gets stuck for a short period (including ones which don't require pg_statistic table access). There are some funny samples from the latest spike:

2023-01-01 15:45:09.151 UTC 2421121 from [local] [vxid:109/20732521 txid:0] [BIND] LOG: duration: 338.830 ms bind : ROLLBACK
2023-01-01 15:45:09.151 UTC 2365255 from [local] [vxid:41/21277531 txid:2504447286] [PARSE] LOG: duration: 338.755 ms parse : select nextval ('jobs_id_seq')

along with normal select/insert/update/delete operations stuck for a short time too.

3) log_temp_files is on for sure; I found no correlation with temp file usage, and likewise no correlation between latency spikes and logged autovacuum actions.

PS: '[BIND] LOG: duration: 338.830 ms bind : ROLLBACK' on a definitely not overloaded and perfectly healthy server is probably the most curious log entry of 2022 for me.
Re: How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
On Sat, Dec 31, 2022 at 4:32 PM Justin Pryzby wrote:
> On Sat, Dec 31, 2022 at 02:26:08PM +0200, Maxim Boguk wrote:
> > Hi,
> >
> > When performing post-mortem analysis of some short latency spikes on a
> > heavily loaded database, I found that the reason for (less than 10 second
> > latency spike) wasn't on the EXECUTE stage but on the BIND stage.
> > At the same time graphical monitoring shows that during this few second
> > period there were some queries waiting in the BIND stage.
> >
> > Logging setup:
> > log_min_duration_statement=200ms
> > log_lock_waits=on
> > deadlock_timeout=100ms
> > So I expected that every lock waiting over 100ms (>deadlock_timeout) should
> > be in the log.
> > But in the log I see only spikes on slow BIND but not lock waits logged.
>
> What version postgres? What settings have non-default values ?
> What OS/version? What environment/hardware? VM/image/provider/...
> What are the queries that are running BIND ? What parameter types ?
> Are the slow BINDs failing? Are their parameters being logged ?
> What else is running besides postgres ? Are the DB clients local or
> remote ? It shouldn't matter, but what client library?

What version of postgres? - 14.6
What settings have non-default values? - a lot (it's a database carefully tuned for this particular workload)
What OS/version? - Ubuntu 20.04 LTS
What environment/hardware? - a 48-core Amazon EC2 server with 396GB of RAM and local NVMe storage (i3en.12xlarge)
What are the queries that are running BIND? - nothing special, e.g.
during the problem period a lot of completely different queries become stuck in the BIND and PARSE stages, but no long-duration (>100ms) EXECUTE calls are found; in general it feels like the whole BIND/PARSE machinery locks up for a short period.

LOG SAMPLE
==========
2023-01-01 09:07:31.622 UTC 1848286 ** from [local] [vxid:84/20886619 txid:0] [PARSE] LOG: duration: 235.472 ms parse : SELECT COUNT(*) FROM "job_stats_master" WHERE (job_stats_master.created_at >= '2022-12-31 09:07:31.35') AND (job_stats_master.created_at < '2023-01-01 09:07:31.35') AND "job_stats_master"."employer_id" = AND "job_stats_master"."action" = 2 AND "job_stats_master"."job_board_id" = AND "job_stats_master"."ip_matching_id" = *
2023-01-01 09:07:31.622 UTC 1898699 ** from [local] [vxid:158/22054921 txid:0] [BIND] LOG: duration: 231.274 ms bind : SELECT id, name FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1898699 *** from [local] [vxid:158/22054921 txid:0] [BIND] DETAIL: parameters: $1 = '0'
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416 txid:0] [BIND] LOG: duration: 231.024 ms bind : SELECT id, name FROM job_types WHERE id IN ($1)
2023-01-01 09:07:31.622 UTC 1794756 *** from [local] [vxid:281/10515416 txid:0] [BIND] DETAIL: parameters: $1 = '0'
...
5 pages of BIND/PARSE of different, unrelated-to-each-other queries logged with over 100ms runtime:

2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673 txid:0] [BIND] LOG: duration: 139.372 ms bind : SELECT employers.*, third_party_employer_pixels.facebook_pixel_id AS facebook_pixel_id, third_party_employer_pixels.google_pixel_id AS google_pixel_id, third_party_employer_pixels.google_actions AS google_actions, employer_pixel_configurations.solution AS tracking_solution, employer_pixel_configurations.domain_name AS domain, settings.use_multiple_bids FROM employers LEFT JOIN third_party_employer_pixels ON third_party_employer_pixels.employer_id = employers.id LEFT JOIN employer_pixel_configurations ON employer_pixel_configurations.id = employers.id LEFT JOIN settings ON settings.id = employers.setting_id WHERE employers.id = $1
2023-01-01 09:07:31.623 UTC 1806315 *** from [local] [vxid:231/17406673 txid:0] [BIND] DETAIL: parameters: $1 = '*'
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997 txid:0] [BIND] LOG: duration: 120.237 ms bind : SELECT job_boards.*, enterprises.product_type, feed_settings.use_employer_exported_name as use_employer_exported_name, integration_job_board_settings.integration_status as integration_status FROM job_boards LEFT JOIN integration_job_board_settings ON integration_job_board_settings.id = job_boards.integration_job_board_setting_id LEFT JOIN enterprises ON enterprises.id = job_boards.enterprise_id LEFT JOIN feed_settings ON feed_settings.id = job_boards.feed_setting_id WHERE job_boards.id = $1
2023-01-01 09:07:31.624 UTC 1806321 *** from [local] [vxid:176/21846997 txid:0] [BIND] DETAIL: p
How to analyze of short but heavy intermittent slowdown on BIND on production database (or BIND vs log_lock_waits)
Hi,

When performing post-mortem analysis of some short latency spikes on a heavily loaded database, I found that the reason for a (less than 10 second) latency spike wasn't on the EXECUTE stage but on the BIND stage. At the same time, graphical monitoring shows that during this few-second period there were some queries waiting in the BIND stage.

Logging setup:
log_min_duration_statement=200ms
log_lock_waits=on
deadlock_timeout=100ms

So I expected that every lock wait over 100ms (>deadlock_timeout) should be in the log. But in the log I see only spikes of slow BINDs, with no lock waits logged:

grep BIND /var/log/postgresql/postgresql-2022-12-29.log | grep 'duration' | perl -pe 's/^(2022-12-29 \d\d:\d\d:\d).*$/$1/' | sort | uniq -c | less
...
      9 2022-12-29 00:12:5
      2 2022-12-29 00:13:1
      3 2022-12-29 00:13:5
!!!  68 2022-12-29 00:14:0
      5 2022-12-29 00:14:1
      3 2022-12-29 00:14:2
      2 2022-12-29 00:14:3

But no lock waits on the BIND stage were logged during the problem period (and no lock waits in general). Similar issues happen a few times per day without any visible pattern (but usually on the same tables). No CPU or IO load/latency spikes found during problem periods. No EXECUTE slowdown found in the log during that time.

So currently I have two hypotheses under research:
1) during the BIND stage, not every lock wait is logged
2) there is some intermittent, non-lock-related slowdown of BIND

I welcome any ideas on how to debug this issue (the duration of such a spike is usually under 1s, but given how many TPS the database serves, 1s is too much and affects end users).
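One note on hypothesis 1: log_lock_waits/deadlock_timeout only report heavyweight lock waits, so LWLock waits never reach the log no matter how long they last. A high-frequency sample of pg_stat_activity can catch them instead (a generic sketch, not output from the system above):

```sql
-- Run repeatedly (e.g. under psql's \watch 0.1): shows what the stuck
-- BIND-stage backends are actually waiting on, including LWLocks.
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
  AND backend_type = 'client backend'
GROUP BY 1, 2
ORDER BY 3 DESC;
```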
Re: High insert rate server, unstable insert latency and load peaks with buffer_content and XidGenLock LWlocks with Postgresql 12 version
On Fri, Apr 10, 2020 at 1:16 AM Justin Pryzby wrote:
> On Fri, Apr 10, 2020 at 12:51:03AM +1000, Maxim Boguk wrote:
> > With database on dedicated server I encountered unusual load profile:
> > multi thread (200 connections static size pool via pgbouncer) insert only
> > into single table around 15.000 insert/s.
> >
> > Usually insert took 0.025ms and amount active backends (via
> > pg_stat_activity) usually stay in 1-5-10 range.
> > But every so while (few times per minute actually) number of active backend
> > go up to all 200 allowed connections.
> > Which lead to serious latency in latency sensitive load.
> >
> > No problem with IO latency or CPU usage found during performance analyze.
> > syncronous_commit = off
>
> Can you share other settings ? shared_buffers, checkpoint_*, bgwriter_* and
> max_wal_size ? And version()

version - PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

shared_buffers                140GB
checkpoint_timeout            1h
checkpoint_flush_after        0
checkpoint_completion_target  0.9
bgwriter_delay                10ms
bgwriter_flush_after          0
bgwriter_lru_maxpages         1
bgwriter_lru_multiplier       10
max_wal_size                  128GB

Checkpoints happen every 1h, and the lag spikes don't depend on checkpointer activity. buffers_checkpoint accounts for 92% of writes, buffers_clean 2%, buffers_backend 6% (over the course of 5 minutes). Nothing especially suspicious on graphical monitoring of these values either.

--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone RU: +7 985 433
Phone UA: +380 99 143
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk

"Doctor, you advised me not to do that, but why does it still hurt when I do it again?"
High insert rate server, unstable insert latency and load peaks with buffer_content and XidGenLock LWlocks with Postgresql 12 version
Hi,

I have performance issues which I have never seen before in my 20+ years of experience with PostgreSQL.

On a database on a dedicated server I encountered an unusual load profile: multi-threaded (200-connection static-size pool via pgbouncer) insert-only load into a single table, around 15,000 inserts/s. Usually an insert takes 0.025ms and the number of active backends (via pg_stat_activity) stays in the 1-5-10 range. But every once in a while (a few times per minute, actually) the number of active backends goes up to all 200 allowed connections, which leads to serious latency in a latency-sensitive load.

No problem with IO latency or CPU usage found during performance analysis.
synchronous_commit = off

To analyze what is going on with locks, I run:

\o ~/tmp/watch_events.log
select wait_event_type,wait_event,count(*) from pg_stat_activity where state='active' and backend_type='client backend' group by 1,2 order by 3 desc
\watch 0.1

Normal output when all goes well:

 wait_event_type | wait_event | count
-----------------+------------+-------
 Client          | ClientRead |     5
                 |            |     4

(a few processes running queries and a few processes doing network IO)

Bad case (a few times per minute, huge latency peak, some inserts take up to 100ms to run):

 wait_event_type |   wait_event   | count
-----------------+----------------+-------
 LWLock          | buffer_content |   178
 LWLock          | XidGenLock     |    21
 IO              | SLRUSync       |     1
                 |                |     1

So almost all backends are waiting on the buffer_content lock, some backends are waiting for XidGenLock, and there is always one backend in SLRUSync.

If anyone has any good idea what's going on in this case and how I can fix it - any ideas welcome. So far I am out of ideas.
Re: zabbix on postgresql - very slow delete of events
On Wed, Jul 24, 2019 at 6:12 PM Kristian Ejvind wrote:
> Hi Maxim
>
> Thanks for your advice, and let me start with your second email, which
> I'll copy here:
>
> =
> Hi Kristian,
>
> After comparing the structure of the zabbix tables with the same tables in
> my zabbix installation I found one very weird difference.
> Why has the type of events.eventid been changed from the default bigint to numeric?
>
> I suspect that the difference between the events.eventid (numeric) type
> and the event_recovery.*_eventid (bigint) types might lead to the inability
> to use an index during foreign key checks.
> Anyway, it will be clearly visible in the pg_stat_xact_user_tables results
> (I now expect to see 3 sequential scans on event_recovery and maybe on some
> other tables as well).
>
> Kind Regards,
> Maxim
> =
>
> Well spotted! On closer examination it seems that data types are wrong in
> several places. I suspect that this comes from the time when our Zabbix ran
> on a MySQL database, which was converted over to PostgreSQL a few years ago.
> I agree this discrepancy is suspicious and I will continue to examine it.
>
> Regarding your ideas in the email below, I can say that 1) is not valid,
> disk latency is in the range of a few ms.
>
> This is the output from your recommended query, which seems to verify your
> suspicions.
> zabbix_34=# begin; delete from zabbix.events where eventid = 7123123;
> select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order
> by seq_scan+idx_scan desc; rollback;
> Time: 0.113 ms
> Time: 4798.189 ms (00:04.798)
>
>  relid  | schemaname |    relname     | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_tup_hot_upd
> --------+------------+----------------+----------+--------------+----------+---------------+-----------+-----------+-----------+---------------
>   41940 | zabbix     | event_recovery |        3 |     35495224 |        0 |             0 |         0 |         0 |         1 |             0
>   41675 | zabbix     | alerts         |        1 |       544966 |        1 |             0 |         0 |         0 |         0 |             0
>   42573 | zabbix     | problem        |        2 |        13896 |        0 |             0 |         0 |         0 |         0 |             0
>   41943 | zabbix     | event_tag      |        1 |        22004 |        0 |             0 |         0 |         0 |         0 |             0
>   41649 | zabbix     | acknowledges   |        1 |           47 |        0 |             0 |         0 |         0 |         0 |             0
>   41951 | zabbix     | events         |        0 |            0 |        1 |             1 |         0 |         0 |         1 |             0
>  260215 | zabbix     | event_suppress |        1 |            0 |        0 |             0 |         0 |         0 |         0 |             0

Hi Kristian,

This result definitively proves that indexes are not used during the foreign key checks (see the non-zero seq_scan counters for the linked tables). The only possible reason (IMHO) is the incorrect use of numeric in place of bigint.

I recommend changing the types of events.eventid (and any other similar fields) to bigint. It should resolve your performance issues with deletes on the events table (and as an additional bonus, bigint is a much faster and more compact type than numeric).
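A sketch of the recommended type change (assumes a maintenance window: ALTER COLUMN ... TYPE takes an ACCESS EXCLUSIVE lock, rewrites the table, and revalidates the foreign keys that reference the column, which can take a long time on a large events table):

```sql
-- numeric -> bigint has an implicit assignment cast, so no USING clause
-- is needed; dependent foreign key constraints are rebuilt automatically.
ALTER TABLE zabbix.events
    ALTER COLUMN eventid TYPE bigint;
```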
Re: zabbix on postgresql - very slow delete of events
> All these queries execute well below 1 ms, using indexes.
>
> Let's delete one row. See explain results here:
> https://explain.depesz.com/s/aycf . 5 seconds to delete a single row, wow!
> This shows that it is the foreign key constraints on event_recovery and
> alerts that take a lot of time.
> But why? As far as I can see, the delete is fully CPU bound during
> execution.
>
> Deleting the corresponding row directly from event_recovery or alerts
> executes in less than 0.1 ms.
>
> Any ideas?
>
> I've observed that alerts and event_recovery tables both have more than
> one foreign key that references events, if that matters.

Hi Kristian,

After comparing the structure of the zabbix tables with the same tables in my own zabbix installation, I found one very weird difference: why has the type of events.eventid been changed from the default bigint to numeric?

I suspect that the difference between the events.eventid (numeric) type and the event_recovery.*_eventid (bigint) types might lead to the inability to use an index during foreign key checks. Anyway, it will be clearly visible in the pg_stat_xact_user_tables results (I now expect to see 3 sequential scans on event_recovery and maybe on some other tables as well).

Kind Regards,
Maxim
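A catalog query along these lines can list every foreign key whose referencing and referenced column types disagree (a hypothetical helper written for this reply, not something shipped with Zabbix):

```sql
-- Each row is one column pair of a foreign key where the two sides have
-- different types (e.g. bigint referencing numeric), which defeats the
-- index lookup during FK checks.
SELECT con.conname,
       con.conrelid::regclass  AS referencing_table,
       a.attname               AS referencing_column,
       a.atttypid::regtype     AS referencing_type,
       con.confrelid::regclass AS referenced_table,
       ra.attname              AS referenced_column,
       ra.atttypid::regtype    AS referenced_type
FROM pg_constraint con
CROSS JOIN LATERAL unnest(con.conkey, con.confkey) AS k(attnum, rattnum)
JOIN pg_attribute a  ON a.attrelid  = con.conrelid  AND a.attnum  = k.attnum
JOIN pg_attribute ra ON ra.attrelid = con.confrelid AND ra.attnum = k.rattnum
WHERE con.contype = 'f'
  AND a.atttypid <> ra.atttypid;
```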
Re: zabbix on postgresql - very slow delete of events
Hi Kristian,

If you look at the EXPLAIN ANALYZE results for the delete, you will see that 99% of the query time is spent in the foreign key trigger checks, while at the same time the database has the indexes on the foreign key side in place.

I recommend trying this:

\timing on
BEGIN;
delete from zabbix.events where eventid = [some testing id];
select * from pg_stat_xact_user_tables where seq_scan>0 or idx_scan>0 order by seq_scan+idx_scan desc;
ABORT;

And provide the result of the last query and how long the delete runs. It might help us understand what's going on.

Currently I have 3 ideas:
1) a very slow and overloaded IO subsystem
2) a lot of stuff being deleted by ON DELETE CASCADE
3) some locking preventing the foreign key checks from running fast

On Wed, Jul 24, 2019 at 11:12 AM Kristian Ejvind wrote:
> Hi.
>
> Well, the events table has both a primary key and foreign keys referencing
> it, which is not possible on a partitioned table in postgresql 10. How did
> you work around this issue?
>
> On the other hand, if we can get the deletion of rows from the events
> table to run at normal speed, I can't imagine we would have a problem with
> it for a long time. After all, although our Zabbix installation definitely
> is larger than "small", it's still far from "large".
>
> I think I would need assistance with debugging why postgresql behaves like
> it does.
> Is there a defect with deleting data from a table that has multiple
> foreign keys referencing it from a certain table?
> Is there a problem with the query optimizer that chooses the wrong plan
> when working on the foreign key constraints?
> How do I inspect how the db works on the deletion of rows from the
> referencing tables?
>
> Regards
> Kristian
>
> On 2019-07-23, 16:33, "Kenneth Marshall" wrote:
>
>     On Tue, Jul 23, 2019 at 01:41:53PM +, Kristian Ejvind wrote:
>     > Thanks Kenneth. In fact we've already partitioned the largest
>     > history* and trends* tables and that has been running fine for a
>     > year. Performance was vastly improved. But since you can't have a
>     > unique index on a partitioned table in postgres 10, we haven't
>     > worked on that.
>     >
>     > Regards
>     > Kristian
>
>     Hi Kristian,
>
>     Why are you not partitioning the events and alerts tables as well? That
>     would eliminate this problem and you already have the infrastructure in
>     place to support the management since you are using it for the history
>     and trends tables.
>
>     Regards,
>     Ken
>
> Resurs Bank AB
> Kristian Ejvind
> Linux System Administrator
> IT Operations | Technical Operations
>
> Ekslingan 8
> Box 222 09, SE-25467 Helsingborg
>
> Direkt Tfn:
> Mobil: +46 728571483
> Vxl: +46 42 382000
> Fax:
> E-post: kristian.ejv...@resurs.se
> Webb: http://www.resursbank.se
But since you > > can't have a unique index on a partitioned table in postgres 10, we > haven't worked on that. > > > > Regards > > Kristian > > Hi Kristian, > > Why are you not partitioning the events and alerts tables as well? That > would eliminate this problem and you already have the infrastructure in > place to support the management since you are using it for the history > and trends tables. > > Regards, > Ken > > > > > > > Resurs Bank AB > Kristian Ejvind > Linux System Administrator > IT Operations | Technical Operations > > Ekslingan 8 > Box 222 09, SE-25467 Helsingborg > > Direkt Tfn: > Mobil: +46 728571483 > Vxl: +46 42 382000 > Fax: > E-post: kristian.ejv...@resurs.se > Webb: http://www.resursbank.se > > > > -- Maxim Boguk Senior Postgresql DBA https://dataegret.com/ Phone RU: +7 985 433 Phone UA: +380 99 143 Phone AU: +61 45 218 5678 LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b Skype: maxim.boguk "Доктор, вы мне советовали так не делать, но почему мне по-прежнему больно когда я так делаю ещё раз?"