Hi, On 2021-08-16 17:42:48 -0400, Álvaro Herrera wrote: > On 2021-Aug-16, Álvaro Herrera wrote: > > > Here's the reversal patch for the 14 branch. (It applies cleanly to > > master, but the unused member of PgStat_StatTabEntry needs to be > > removed and catversion bumped). > > I have pushed this to both branches. (I did not remove the item from > the release notes in the 14 branch.) > > It upsets me to have reverted it, but after spending so much time trying > to correct the problems, I believe it just wasn't salvageable within the > beta-period code freeze constraints.
:( > I described the issues I ran into > in earlier messages; I think a good starting point to re-develop this is > to revert the reversal commit, then apply my patch at > https://postgr.es/m/0794d7ca-5183-486b-9c5e-6d434867c...@www.fastmail.com > then do something about the remaining problems that were complained > about. (Maybe: add an "ancestor OID" member to PgStat_StatTabEntry so > that the collector knows to propagate counts from children to ancestors > when the upd/ins/del counts are received. My suspicion is that it'd be a lot easier to implement this efficiently if there were no propagation done outside of actually analyzing tables. I.e. have do_autovacuum() build a hashtable of (parent_table_id, count) and use that to make the analyze decisions. And then only propagate up the costs to parents of tables when a child is analyzed (and thus looses its changes_since_analyze) value. Then we can use hashtable_value + changes_since_analyze for partitioning decisions of partitioned tables. I've prototyped this, and it does seem to make do_autovacuum() cheaper. I've attached that prototype, but note it's in a rough state. However, unless we change the way inheritance parents are stored, it still requires repetitive get_partition_ancestors() (or get_partition_parent()) calls in do_autovacuum(), which I think is problematic due to the index scans you pointed out as well. The obvious way to address that would be to store parent oids in pg_class - I suspect duplicating parents in pg_class is the best way out, but pretty it is not. > However, consider developing it as follow-up to Horiguchi-san's shmem > pgstat rather than current pgstat implementation.) +1 It might be worth to first tackle reusing samples from a relation's children when building inheritance stats. Either by storing the samples somewhere (not cheap) and reusing them, or by at least updating a partition's stats when analyzing the parent. Greetings, Andres Freund
commit ec796bd8ee2970e2eae3b3839e1bb96696393dc7 Author: Andres Freund <and...@anarazel.de> Date: 2021-07-30 17:20:21 -0700 tmp Author: Reviewed-By: Discussion: https://postgr.es/m/ Backpatch: diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 0c9591415e4..df021215281 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -320,12 +320,12 @@ do_analyze_rel(Relation onerel, VacuumParams *params, PgStat_Counter startwritetime = 0; if (inh) - ereport(elevel, + ereport(LOG, (errmsg("analyzing \"%s.%s\" inheritance tree", get_namespace_name(RelationGetNamespace(onerel)), RelationGetRelationName(onerel)))); else - ereport(elevel, + ereport(LOG, (errmsg("analyzing \"%s.%s\"", get_namespace_name(RelationGetNamespace(onerel)), RelationGetRelationName(onerel)))); @@ -682,6 +682,18 @@ do_analyze_rel(Relation onerel, VacuumParams *params, in_outer_xact); } + /* + * Let the collector know about the ancestor tables of this partition. + */ + if (!inh && + (va_cols == NIL) && + onerel->rd_rel->relispartition && + onerel->rd_rel->relkind == RELKIND_RELATION && + onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT) + { + pgstat_report_anl_ancestors(RelationGetRelid(onerel)); + } + /* * Now report ANALYZE to the stats collector. For regular tables, we do * it only if not doing inherited stats. For partitioned tables, we only @@ -695,22 +707,6 @@ do_analyze_rel(Relation onerel, VacuumParams *params, pgstat_report_analyze(onerel, totalrows, totaldeadrows, (va_cols == NIL)); - /* - * If this is a manual analyze of all columns of a permanent leaf - * partition, and not doing inherited stats, also let the collector know - * about the ancestor tables of this partition. Autovacuum does the - * equivalent of this at the start of its run, so there's no reason to do - * it there. - */ - if (!inh && !IsAutoVacuumWorkerProcess() && - (va_cols == NIL) && - onerel->rd_rel->relispartition && - onerel->rd_rel->relkind == RELKIND_RELATION && - onerel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT) - { - pgstat_report_anl_ancestors(RelationGetRelid(onerel)); - } - /* * If this isn't part of VACUUM ANALYZE, let index AMs do cleanup. * @@ -1183,6 +1179,8 @@ acquire_sample_rows(Relation onerel, int elevel, BlockSamplerData prefetch_bs; #endif + elog(LOG, "acquiring %d sample rows for %s", targrows, + NameStr(onerel->rd_rel->relname)); Assert(targrows > 0); totalblocks = RelationGetNumberOfBlocks(onerel); diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index 912ef9cb54c..73a872371d1 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -74,6 +74,7 @@ #include "access/xact.h" #include "catalog/dependency.h" #include "catalog/namespace.h" +#include "catalog/partition.h" #include "catalog/pg_database.h" #include "catalog/pg_inherits.h" #include "commands/dbcommands.h" @@ -327,16 +328,21 @@ static void do_autovacuum(void); static void FreeWorkerInfo(int code, Datum arg); static autovac_table *table_recheck_autovac(Oid relid, HTAB *table_toast_map, + HTAB *table_partition_info_map, TupleDesc pg_class_desc, - int effective_multixact_freeze_max_age); -static void recheck_relation_needs_vacanalyze(Oid relid, AutoVacOpts *avopts, + int effective_multixact_freeze_max_age, + PgStat_Counter additional_analyze_rows); +static void recheck_relation_needs_vacanalyze(Oid relid, + AutoVacOpts *avopts, Form_pg_class classForm, int effective_multixact_freeze_max_age, + PgStat_Counter additional_analyze_rows, bool *dovacuum, bool *doanalyze, bool *wraparound); static void relation_needs_vacanalyze(Oid relid, AutoVacOpts *relopts, Form_pg_class classForm, PgStat_StatTabEntry *tabentry, int effective_multixact_freeze_max_age, + PgStat_Counter additional_analyze_rows, bool *dovacuum, bool *doanalyze, bool *wraparound); static void autovacuum_do_vac_analyze(autovac_table *tab, @@ -1944,6 +1950,12 @@ get_database_list(void) return dblist; } +typedef struct AutovacPartitionInfo +{ + Oid partitioned_table_oid; + PgStat_Counter changes_since_analyze; +} AutovacPartitionInfo; + /* * Process a database table-by-table * @@ -1961,16 +1973,15 @@ do_autovacuum(void) List *orphan_oids = NIL; HASHCTL ctl; HTAB *table_toast_map; + HTAB *partition_info_map; ListCell *volatile cell; PgStat_StatDBEntry *shared; PgStat_StatDBEntry *dbentry; BufferAccessStrategy bstrategy; - ScanKeyData key; TupleDesc pg_class_desc; int effective_multixact_freeze_max_age; bool did_vacuum = false; bool found_concurrent_worker = false; - bool updated = false; int i; /* @@ -2053,19 +2064,26 @@ do_autovacuum(void) &ctl, HASH_ELEM | HASH_BLOBS); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(AutovacPartitionInfo); + + partition_info_map = hash_create("Autovacuum Partitioned Table Rowcount Hash", + 100, + &ctl, + HASH_ELEM | HASH_BLOBS); + + /* * Scan pg_class to determine which tables to vacuum. * - * We do this in three passes: First we let pgstat collector know about - * the partitioned table ancestors of all partitions that have recently - * acquired rows for analyze. This informs the second pass about the - * total number of tuple count in partitioning hierarchies. + * We do this in two passes: * - * On the second pass, we collect the list of plain relations, - * materialized views and partitioned tables. On the third one we collect - * TOAST tables. + * On the first pass, we collect the list of plain relations, materialized + * views and partitioned tables. For partitions we sum up the sum of the + * changes in partitioned tables. On the second pass one we collect TOAST + * and partitioned tables tables. * - * The reason for doing the third pass is that during it we want to use + * The reason for doing the second pass is that during it we want to use * the main relation's pg_class.reloptions entry if the TOAST table does * not have any, and we cannot obtain it unless we know beforehand what's * the main table OID. @@ -2077,44 +2095,7 @@ do_autovacuum(void) relScan = table_beginscan_catalog(classRel, 0, NULL); /* - * First pass: before collecting the list of tables to vacuum, let stat - * collector know about partitioned-table ancestors of each partition. - */ - while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL) - { - Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple); - Oid relid = classForm->oid; - PgStat_StatTabEntry *tabentry; - - /* Only consider permanent leaf partitions */ - if (!classForm->relispartition || - classForm->relkind == RELKIND_PARTITIONED_TABLE || - classForm->relpersistence == RELPERSISTENCE_TEMP) - continue; - - /* - * No need to do this for partitions that haven't acquired any rows. - */ - tabentry = pgstat_fetch_stat_tabentry(relid); - if (tabentry && - tabentry->changes_since_analyze - - tabentry->changes_since_analyze_reported > 0) - { - pgstat_report_anl_ancestors(relid); - updated = true; - } - } - - /* Acquire fresh stats for the next passes, if needed */ - if (updated) - { - autovac_refresh_stats(); - dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId); - shared = pgstat_fetch_stat_dbentry(InvalidOid); - } - - /* - * On the second pass, we collect main tables to vacuum, and also the main + * On the first pass, we collect main tables to vacuum, and also the main * table relid to TOAST relid mapping. */ while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL) @@ -2128,8 +2109,7 @@ do_autovacuum(void) bool wraparound; if (classForm->relkind != RELKIND_RELATION && - classForm->relkind != RELKIND_MATVIEW && - classForm->relkind != RELKIND_PARTITIONED_TABLE) + classForm->relkind != RELKIND_MATVIEW) continue; relid = classForm->oid; @@ -2167,6 +2147,7 @@ do_autovacuum(void) /* Check if it needs vacuum or analyze */ relation_needs_vacanalyze(relid, relopts, classForm, tabentry, effective_multixact_freeze_max_age, + 0, &dovacuum, &doanalyze, &wraparound); /* Relations that need work are added to table_oids */ @@ -2200,17 +2181,40 @@ do_autovacuum(void) } } } + + /* Sum up pending changes to leaf partitions in parent partition */ + if (classForm->relispartition && + classForm->relpersistence != RELPERSISTENCE_TEMP && + (tabentry && tabentry->changes_since_analyze > 0)) + { + List *ancestors = get_partition_ancestors(classForm->oid); + ListCell *lc; + + foreach(lc, ancestors) + { + Oid ancestor_oid = lfirst_oid(lc); + AutovacPartitionInfo *ancestor; + bool found; + + ancestor = hash_search(partition_info_map, + &ancestor_oid, + HASH_ENTER, &found); + if (!found) + ancestor->changes_since_analyze = 0; + ancestor->changes_since_analyze += tabentry->changes_since_analyze; + + elog(LOG, "reporting up %lu changes from %u to ancestor %u", + tabentry->changes_since_analyze, relid, ancestor_oid); + } + + continue; + } } table_endscan(relScan); - /* third pass: check TOAST tables */ - ScanKeyInit(&key, - Anum_pg_class_relkind, - BTEqualStrategyNumber, F_CHAREQ, - CharGetDatum(RELKIND_TOASTVALUE)); - - relScan = table_beginscan_catalog(classRel, 1, &key); + /* second pass: check TOAST and partitioned tables */ + relScan = table_beginscan_catalog(classRel, 0, NULL); while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL) { Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple); @@ -2221,6 +2225,10 @@ do_autovacuum(void) bool doanalyze; bool wraparound; + if (classForm->relkind != RELKIND_TOASTVALUE && + classForm->relkind != RELKIND_PARTITIONED_TABLE) + continue; + /* * We cannot safely process other backends' temp tables, so skip 'em. */ @@ -2228,33 +2236,63 @@ do_autovacuum(void) continue; relid = classForm->oid; - - /* - * fetch reloptions -- if this toast table does not have them, try the - * main rel - */ relopts = extract_autovac_opts(tuple, pg_class_desc); - if (relopts == NULL) + + if (classForm->relkind == RELKIND_TOASTVALUE) { - av_relation *hentry; - bool found; + /* + * If this toast table does not have a reloption, try the main rel + */ + if (relopts == NULL) + { + av_relation *hentry; + bool found; - hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found); - if (found && hentry->ar_hasrelopts) - relopts = &hentry->ar_reloptions; + hentry = hash_search(table_toast_map, &relid, HASH_FIND, &found); + if (found && hentry->ar_hasrelopts) + relopts = &hentry->ar_reloptions; + } + + /* Fetch the pgstat entry for this table */ + tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared, + shared, dbentry); + + relation_needs_vacanalyze(relid, relopts, classForm, tabentry, + effective_multixact_freeze_max_age, + 0, + &dovacuum, &doanalyze, &wraparound); + + /* ignore analyze for toast tables */ + if (dovacuum) + table_oids = lappend_oid(table_oids, relid); } + else if (classForm->relkind == RELKIND_PARTITIONED_TABLE) + { + AutovacPartitionInfo *partition_info; + PgStat_Counter additional_analyze_rows = 0; + bool found; - /* Fetch the pgstat entry for this table */ - tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared, - shared, dbentry); + tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared, + shared, dbentry); - relation_needs_vacanalyze(relid, relopts, classForm, tabentry, - effective_multixact_freeze_max_age, - &dovacuum, &doanalyze, &wraparound); + partition_info = hash_search(partition_info_map, &relid, HASH_FIND, &found); + if (found) + { + elog(LOG, "additional changes to %u: %llu", + relid, (long long unsigned) partition_info->changes_since_analyze); + additional_analyze_rows = partition_info->changes_since_analyze; + } - /* ignore analyze for toast tables */ - if (dovacuum) - table_oids = lappend_oid(table_oids, relid); + relation_needs_vacanalyze(relid, relopts, classForm, tabentry, + effective_multixact_freeze_max_age, + additional_analyze_rows, + &dovacuum, &doanalyze, &wraparound); + + Assert(!dovacuum && !wraparound); + + if (doanalyze) + table_oids = lappend_oid(table_oids, relid); + } } table_endscan(relScan); @@ -2369,12 +2407,14 @@ do_autovacuum(void) { Oid relid = lfirst_oid(cell); HeapTuple classTup; + Form_pg_class classForm; autovac_table *tab; bool isshared; bool skipit; double stdVacuumCostDelay; int stdVacuumCostLimit; dlist_iter iter; + PgStat_Counter additional_analyze_rows = 0; CHECK_FOR_INTERRUPTS(); @@ -2405,7 +2445,8 @@ do_autovacuum(void) classTup = SearchSysCache1(RELOID, ObjectIdGetDatum(relid)); if (!HeapTupleIsValid(classTup)) continue; /* somebody deleted the rel, forget it */ - isshared = ((Form_pg_class) GETSTRUCT(classTup))->relisshared; + classForm = (Form_pg_class) GETSTRUCT(classTup); + isshared = classForm->relisshared; ReleaseSysCache(classTup); /* @@ -2467,9 +2508,27 @@ do_autovacuum(void) * that somebody just finished vacuuming this table. The window to * the race condition is not closed but it is very small. */ + + if (classForm->relkind == RELKIND_PARTITIONED_TABLE) + { + AutovacPartitionInfo *partition_info; + bool found; + + partition_info = hash_search(partition_info_map, &relid, HASH_FIND, &found); + if (found) + { + elog(LOG, "additional changes to %u: %llu", + relid, (long long unsigned) partition_info->changes_since_analyze); + additional_analyze_rows = partition_info->changes_since_analyze; + } + } + MemoryContextSwitchTo(AutovacMemCxt); - tab = table_recheck_autovac(relid, table_toast_map, pg_class_desc, - effective_multixact_freeze_max_age); + tab = table_recheck_autovac(relid, + table_toast_map, partition_info_map, + pg_class_desc, + effective_multixact_freeze_max_age, + additional_analyze_rows); if (tab == NULL) { /* someone else vacuumed the table, or it went away */ @@ -2845,8 +2904,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared, */ static autovac_table * table_recheck_autovac(Oid relid, HTAB *table_toast_map, + HTAB *partition_info_map, TupleDesc pg_class_desc, - int effective_multixact_freeze_max_age) + int effective_multixact_freeze_max_age, + PgStat_Counter additional_analyze_rows) { Form_pg_class classForm; HeapTuple classTup; @@ -2895,6 +2956,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map, { recheck_relation_needs_vacanalyze(relid, avopts, classForm, effective_multixact_freeze_max_age, + additional_analyze_rows, &dovacuum, &doanalyze, &wraparound); /* Quick exit if a relation doesn't need to be vacuumed or analyzed */ @@ -2910,6 +2972,7 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map, recheck_relation_needs_vacanalyze(relid, avopts, classForm, effective_multixact_freeze_max_age, + additional_analyze_rows, &dovacuum, &doanalyze, &wraparound); /* OK, it needs something done */ @@ -3039,6 +3102,7 @@ recheck_relation_needs_vacanalyze(Oid relid, AutoVacOpts *avopts, Form_pg_class classForm, int effective_multixact_freeze_max_age, + PgStat_Counter additional_analyze_rows, bool *dovacuum, bool *doanalyze, bool *wraparound) @@ -3058,6 +3122,7 @@ recheck_relation_needs_vacanalyze(Oid relid, relation_needs_vacanalyze(relid, avopts, classForm, tabentry, effective_multixact_freeze_max_age, + additional_analyze_rows, dovacuum, doanalyze, wraparound); /* ignore ANALYZE for toast tables */ @@ -3108,6 +3173,7 @@ relation_needs_vacanalyze(Oid relid, Form_pg_class classForm, PgStat_StatTabEntry *tabentry, int effective_multixact_freeze_max_age, + PgStat_Counter additional_analyze_rows, /* output params below */ bool *dovacuum, bool *doanalyze, @@ -3223,7 +3289,7 @@ relation_needs_vacanalyze(Oid relid, reltuples = classForm->reltuples; vactuples = tabentry->n_dead_tuples; instuples = tabentry->inserts_since_vacuum; - anltuples = tabentry->changes_since_analyze; + anltuples = tabentry->changes_since_analyze + additional_analyze_rows; /* If the table hasn't yet been vacuumed, take reltuples as zero */ if (reltuples < 0) @@ -3252,6 +3318,14 @@ relation_needs_vacanalyze(Oid relid, (vac_ins_base_thresh >= 0 && instuples > vacinsthresh); *doanalyze = (anltuples > anlthresh); } + else if (additional_analyze_rows && AutoVacuumingActive()) + { + anlthresh = (float4) anl_base_thresh; + anltuples = additional_analyze_rows; + *dovacuum = false; + *wraparound = false; + *doanalyze = (anltuples > anlthresh); + } else { /* diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 11702f2a804..d1e9fd3f75c 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -4870,6 +4870,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len) tabentry->n_live_tuples = tabmsg->t_counts.t_delta_live_tuples; tabentry->n_dead_tuples = tabmsg->t_counts.t_delta_dead_tuples; tabentry->changes_since_analyze = tabmsg->t_counts.t_changed_tuples; + //tabentry->changes_since_analyze_reported = tabmsg->t_counts.t_changed_tuples; tabentry->changes_since_analyze_reported = 0; tabentry->inserts_since_vacuum = tabmsg->t_counts.t_tuples_inserted; tabentry->blocks_fetched = tabmsg->t_counts.t_blocks_fetched; @@ -5269,8 +5270,11 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len) */ if (msg->m_resetcounter) { + elog(LOG, "resetting change_since_analyze of %u, was %lu/%lu", + msg->m_tableoid, tabentry->changes_since_analyze, + tabentry->changes_since_analyze_reported); + tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze; tabentry->changes_since_analyze = 0; - tabentry->changes_since_analyze_reported = 0; } if (msg->m_autovacuum) @@ -5301,12 +5305,10 @@ pgstat_recv_anl_ancestors(PgStat_MsgAnlAncestors *msg, int len) PgStat_StatTabEntry *ancestor; ancestor = pgstat_get_tab_entry(dbentry, ancestor_relid, true); - ancestor->changes_since_analyze += - tabentry->changes_since_analyze - tabentry->changes_since_analyze_reported; + elog(LOG, "anl increase to %u of %lu", + ancestor_relid, tabentry->changes_since_analyze); + ancestor->changes_since_analyze += tabentry->changes_since_analyze; } - - tabentry->changes_since_analyze_reported = tabentry->changes_since_analyze; - } /* ----------