On Tue, Jan 17, 2023 at 4:52 PM Ashwin Agrawal <ashwins...@gmail.com> wrote:
> We recently saw many backends (close to the max_connections limit) get
> stalled in 'startup' in one of the production environments for Greenplum
> (fork of PostgreSQL). Tracing the reason, it was found that all the tuples
> created by bootstrap (xmin=1) in pg_attribute were at very high block
> numbers (for example beyond 30,000). The reason for the backend startup
> stall exactly matched Tom's reasoning in [1]. Stalls became much longer in
> the presence of sub-transaction overflow or long-running transactions, as
> tuple visibility checks took longer. That thread ruled out the possibility
> of system catalog rows being present at higher block numbers instead of at
> the front of pg_attribute.
>
> This thread provides a simple reproduction on the latest version of
> PostgreSQL and an RCA for how bootstrap catalog entries can move to higher
> blocks and, as a result, cause stalls for backend starts. A simple fix to
> avoid the issue is provided at the end.
>
> The cause is syncscan triggering during VACUUM FULL. VACUUM FULL rewrites
> the table by performing a seqscan as well, and
> heapam_relation_copy_for_cluster() conveys that it is free to use syncscan.
> Hence the scan may start not from block 0 but from some other block already
> in the cache, which opens the possibility of moving the bootstrap tuples
> anywhere else in the table.
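To quickly confirm whether pg_attribute on a given system is large enough for
syncscan to engage at all (the quarter-of-shared_buffers condition the repro
below checks by eye), a query along these lines should do. This is just a
sketch and assumes pg_size_bytes() is available (PostgreSQL 9.6 and later):

    -- true means pg_attribute exceeds shared_buffers/4, so the seqscan done
    -- by VACUUM FULL is eligible to start from a block other than 0
    select pg_relation_size('pg_attribute') >
           pg_size_bytes(current_setting('shared_buffers')) / 4
           as syncscan_possible;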
> ------------------------------------------------------------------
> Repro
> ------------------------------------------------------------------
> -- create database to play
> drop database if exists test;
> create database test;
> \c test
>
> -- function just to create many tables to increase pg_attribute size
> -- (ideally a table with many columns might do the job more easily)
> CREATE OR REPLACE FUNCTION public.f(id integer)
>  RETURNS void
>  LANGUAGE plpgsql
>  STRICT
> AS $function$
> declare
>     sql text;
>     i int;
> begin
>     for i in id..id+9999 loop
>         sql='create table if not exists tbl'||i||' (id int)';
>         execute sql;
>     end loop;
> end;
> $function$;
>
> select f(10000);
> select f(20000);
> select f(30000);
> select f(40000);
>
> -- validate pg_attribute size is greater than 1/4 of shared_buffers
> -- for syncscan to trigger
> show shared_buffers;
> select pg_size_pretty(pg_relation_size('pg_attribute'));
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;
>
> -- perform a seq scan of pg_attribute to page past the bootstrap tuples
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
>
> -- this will internally use syncscan starting with a block after the
> -- bootstrap tuples and hence move the bootstrap tuples last, to higher
> -- block numbers
> vacuum full pg_attribute;
>
> ------------------------------------------------------------------
> Sample run
> ------------------------------------------------------------------
> show shared_buffers;
>  shared_buffers
> ----------------
>  128MB
> (1 row)
>
> select pg_size_pretty(pg_relation_size('pg_attribute'));
>  pg_size_pretty
> ----------------
>  40 MB
> (1 row)
>
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;
>  ctid  | xmin | attrelid |   attname
> -------+------+----------+--------------
>  (0,1) |    1 |     1255 | oid
>  (0,2) |    1 |     1255 | proname
>  (0,3) |    1 |     1255 | pronamespace
>  (0,4) |    1 |     1255 | proowner
>  (0,5) |    1 |     1255 | prolang
> (5 rows)
>
> copy (select * from pg_attribute limit 2000) to '/tmp/p';
> COPY 2000
> vacuum full pg_attribute;
> VACUUM
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;
>    ctid    | xmin | attrelid |   attname
> -----------+------+----------+--------------
>  (5115,14) |    1 |     1255 | oid
>  (5115,15) |    1 |     1255 | proname
>  (5115,16) |    1 |     1255 | pronamespace
>  (5115,17) |    1 |     1255 | proowner
>  (5115,18) |    1 |     1255 | prolang
> (5 rows)
>
> Note:
> -- used the logic causing the problem to fix it as well on the system :-)
> -- scan till the block where the bootstrap tuples are located
> select ctid, xmin, attrelid, attname from pg_attribute where xmin = 1 limit 5;
> -- now, due to syncscan triggering, it will pick the blocks with the
> -- bootstrap tuples first and help to bring them back to the front
> vacuum full pg_attribute;
>
> ------------------------------------------------------------------
> Patch to avoid the problem:
> ------------------------------------------------------------------
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index a3414a76e8..4c031914a3 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -757,7 +757,17 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
>   pgstat_progress_update_param(PROGRESS_CLUSTER_PHASE,
>                                PROGRESS_CLUSTER_PHASE_SEQ_SCAN_HEAP);
>
> - tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
> + /*
> +  * For system catalog tables avoid syncscan, so that the scan always
> +  * starts from block 0 during rewrite and helps retain bootstrap
> +  * tuples in the initial pages only. If using syncscan, then bootstrap
> +  * tuples may move to higher blocks, which will lead to degraded
> +  * performance for relcache initialization during connection starts.
> +  */
> + if (is_system_catalog)
> +     tableScan = table_beginscan_strat(OldHeap, SnapshotAny, 0, (ScanKey) NULL, true, false);
> + else
> +     tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL);
>   heapScan = (HeapScanDesc) tableScan;
>   indexScan = NULL;
> ------------------------------------------------------------------
>
> [1] https://www.postgresql.org/message-id/27844.1338148415%40sss.pgh.pa.us

I did not receive any comment/reply to the earlier email on
pgsql-hackers@lists.postgresql.org, hence trying via
pgsql-hack...@postgresql.org this time (not sure whether it was missed or
there was simply no interest).

Also, I wish to add more scenarios where the problem manifests. During
RelationCacheInitializePhase3(), load_critical_index() performs a sequential
search of pg_class for the tuples of ClassOidIndexId,
AttributeRelidNumIndexId, IndexRelidIndexId, OpclassOidIndexId,
AccessMethodProcedureIndexId, RewriteRelRulenameIndexId and
TriggerRelidNameIndexId. We found systems on which the tuples corresponding
to these indexes are not always present in the starting blocks of pg_class.
In particular, pg_opclass_oid_index, pg_rewrite_rel_rulename_index,
pg_amproc_fam_proc_index, pg_trigger_tgrelid_tgname_index and
pg_index_indexrelid_index were often found at block numbers beyond 2000. We
are not fully sure of the reason for this; perhaps REINDEX moves them to
higher block numbers. In any situation where tuple visibility checks slow
down (say, due to sub-transaction overflow) and the relcache is invalidated,
a lot of backends were seen stalled in the "startup" phase.

--
Ashwin Agrawal (VMware)
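P.S. For anyone who wants to check where these critical-index rows currently
sit, a rough query along these lines should do (the relname list below is my
mapping of the index OID constants mentioned above to their index names; high
ctid block numbers are the problematic case):

    -- the first component of ctid is the block number holding each pg_class row
    select ctid, relname
    from pg_class
    where relname in ('pg_class_oid_index',
                      'pg_attribute_relid_attnum_index',
                      'pg_index_indexrelid_index',
                      'pg_opclass_oid_index',
                      'pg_amproc_fam_proc_index',
                      'pg_rewrite_rel_rulename_index',
                      'pg_trigger_tgrelid_tgname_index');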