On Fri, Oct 2, 2015 at 4:27 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> ... do you want to produce an updated patch?  I'm not planning to look at
> it until tomorrow anyway.
The attached revised patch deals with the issues around removing the query text file when garbage collection encounters trouble. There is no reason to be optimistic about any error within gc_qtexts() not recurring during a future garbage collection. OOM might be an exception, but probably not, since gc_qtexts() is only reached when a new entry is created with a new query text, which in general makes OOM progressively more likely.

Thanks for accommodating me here.
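For reference, the failure path in gc_qtexts() now ends roughly like this (condensed from the gc_fail hunk in the attached patch; the pre-existing cleanup of the load buffer and the invalidation of each entry's stored text are elided):

gc_fail:
    /* ... existing cleanup; all entries' query texts marked invalid ... */

    /* Start over with an empty query texts file rather than retrying forever */
    unlink(PGSS_TEXT_FILE);
    pgss->mean_query_len = ASSUMED_LENGTH_INIT;
    pgss->extent = 0;

    /* Recreate an empty query text file */
    qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
    if (qfile == NULL)
        ereport(LOG,
                (errcode_for_file_access(),
                 errmsg("could not write new pg_stat_statement file \"%s\": %m",
                        PGSS_TEXT_FILE)));
    else
        FreeFile(qfile);

    /* Bump gc_count so readers and writers notice the new file generation */
    record_gc_qtexts();

-- 
Peter Geoghegan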
From 9295ecac575d6bed76e3b66a3b4cc1577517f2fc Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Tue, 29 Sep 2015 15:32:51 -0700
Subject: [PATCH] Fix pg_stat_statements garbage collection bugs

Previously, pg_stat_statements garbage collection could encounter
problems when many large, technically distinct queries were executed
without ever reaching the end of execution due to some error.  Garbage
collection could enter a state where it would never complete
successfully, with the external query text file left to grow
indefinitely.  For example, this could happen when a data integration
process fails repeatedly before finally succeeding (perhaps recurring
intermittently over a fairly long period).  The problematic queries
might have referenced technically distinct tables each time, as when a
CREATE TABLE is performed as part of the data integration operation,
giving each execution a unique queryid.

A spuriously high mean query length was previously possible because
failed queries' sticky entries artificially inflated the mean during
hash table eviction.  The mechanism by which garbage collection is
throttled to prevent thrashing could therefore put it off for far too
long.  By the time a garbage collection actually occurred, the file may
have become so big that it was no longer possible to allocate memory
for all query texts.  Worse still, once the quasi-arbitrary
MaxAllocSize threshold was crossed, it became impossible for the
system to recover even with sufficient memory to allocate a buffer for
all query texts.

To fix these issues, pg_stat_statements no longer considers sticky
entries when calculating the mean query length, which should prevent
garbage collection from getting an inaccurate idea of typical query
length.  MaxAllocSize no longer caps the size of the buffer that is
used to store query texts in memory during garbage collection; the cap
is changed to MaxAllocHugeSize.  Garbage collection no longer
tolerates OOMs (or exceeding the new MaxAllocHugeSize cap) when
allocating a buffer to store all existing query texts.  Garbage
collection now unlink()s the query text file at the first sign of
trouble, giving it a clean slate, rather than assuming the issue will
somehow resolve itself later, which seemed over-optimistic in general.
As a precaution, pg_stat_statements may now avoid storing very large
query texts after parse analysis but ahead of query execution (entries
for which no statistics exist yet).

Backpatch to 9.4, where external query text storage was introduced.
---
 contrib/pg_stat_statements/pg_stat_statements.c | 84 ++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 9 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..3e26213 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -103,6 +103,7 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 #define USAGE_INIT              (1.0)   /* including initial planning */
 #define ASSUMED_MEDIAN_INIT     (10.0)  /* initial assumed median usage */
 #define ASSUMED_LENGTH_INIT     1024    /* initial assumed mean query length */
+#define LENGTH_OUTLIER_FACTOR   5       /* mean query length multiplier */
 #define USAGE_DECREASE_FACTOR   (0.99)  /* decreased every entry_dealloc */
 #define STICKY_DECREASE_FACTOR  (0.50)  /* factor for sticky entries */
 #define USAGE_DEALLOC_PERCENT   5       /* free this % of entries at once */
@@ -1111,6 +1112,10 @@ pgss_hash_string(const char *str)
  * If jstate is not NULL then we're trying to create an entry for which
  * we have no statistics as yet; we just want to record the normalized
  * query string.  total_time, rows, bufusage are ignored in this case.
+ * A query text may not actually end up being stored in this case
+ * (storing oversized texts is sometimes avoided), and in any event is
+ * not guaranteed to still be around if and when actual results need to
+ * be stored in a subsequent pgss_store() call.
  */
 static void
 pgss_store(const char *query, uint32 queryId,
@@ -1159,7 +1164,27 @@ pgss_store(const char *query, uint32 queryId,
         */
        if (jstate)
        {
+           Size        mean_query_len = Max(ASSUMED_LENGTH_INIT,
+                                            pgss->mean_query_len);
+
            LWLockRelease(pgss->lock);
+
+           /*
+            * When creating an entry for which we have no statistics yet,
+            * avoid storing very large query texts.  This prevents bloating of
+            * the query text file in certain scenarios.
+            *
+            * For example, excessive churn might otherwise happen when a user
+            * repeatedly runs a data integration transaction that fails after
+            * we create sticky entries that are bound to never "unstick".
+            * This might be due to transactions creating a new table for each
+            * failed attempt; a new relation OID is involved each time, making
+            * each failed attempt create distinct entries with large query
+            * texts.
+            */
+           if (query_len > mean_query_len * LENGTH_OUTLIER_FACTOR)
+               return;
+
            norm_query = generate_normalized_query(jstate, query,
                                                   &query_len,
                                                   encoding);
@@ -1733,11 +1758,20 @@ entry_dealloc(void)
        entries[i++] = entry;
        /* "Sticky" entries get a different usage decay rate. */
        if (entry->counters.calls == 0)
+       {
            entry->counters.usage *= STICKY_DECREASE_FACTOR;
+           /* Accumulate generic placeholder length into total size. */
+           totlen += pgss->mean_query_len;
+       }
        else
+       {
            entry->counters.usage *= USAGE_DECREASE_FACTOR;
-       /* Accumulate total size, too. */
-       totlen += entry->query_len + 1;
+           /* Accumulate total size (use placeholder for dropped texts). */
+           if (entry->query_len >= 0)
+               totlen += entry->query_len + 1;
+           else
+               totlen += pgss->mean_query_len;
+       }
    }
 
    qsort(entries, i, sizeof(pgssEntry *), entry_cmp);
@@ -1892,7 +1926,7 @@ qtext_load_file(Size *buffer_size)
    }
 
    /* Allocate buffer; beware that off_t might be wider than size_t */
-   if (stat.st_size <= MaxAllocSize)
+   if (stat.st_size <= MaxAllocHugeSize)
        buf = (char *) malloc(stat.st_size);
    else
        buf = NULL;
@@ -1984,6 +2018,10 @@ need_gc_qtexts(void)
     * for file's large size.  We go to the trouble of maintaining the mean
     * query length in order to prevent garbage collection from thrashing
     * uselessly.
+    *
+    * Note that sticky entries do not contribute to mean query length.
+    * Large query texts from a succession of slightly distinct queries that
+    * all fail to finish execution will not stall this process.
     */
    if (extent < pgss->mean_query_len * pgss_max * 2)
        return false;
@@ -2001,14 +2039,17 @@ need_gc_qtexts(void)
  * becomes unreasonably large, with no other method of compaction likely to
  * occur in the foreseeable future.
  *
- * The caller must hold an exclusive lock on pgss->lock.
+ * The caller must hold an exclusive lock on pgss->lock.  At the first sign of
+ * trouble we unlink() the query text file to get a clean slate (although the
+ * existing statistics themselves are retained), rather than risk thrashing by
+ * allowing the same problem case to recur indefinitely.
  */
 static void
 gc_qtexts(void)
 {
    char       *qbuffer;
    Size        qbuffer_size;
-   FILE       *qfile;
+   FILE       *qfile = NULL;
    HASH_SEQ_STATUS hash_seq;
    pgssEntry  *entry;
    Size        extent;
@@ -2023,12 +2064,12 @@ gc_qtexts(void)
        return;
 
    /*
-    * Load the old texts file.  If we fail (out of memory, for instance) just
-    * skip the garbage collection.
+    * Load the old texts file.  If we fail (out of memory, for instance),
+    * invalidate query texts.  This should be rare.
     */
    qbuffer = qtext_load_file(&qbuffer_size);
    if (qbuffer == NULL)
-       return;
+       goto gc_fail;
 
    /*
     * We overwrite the query texts file in place, so as to reduce the risk of
@@ -2147,7 +2188,32 @@ gc_fail:
        entry->query_len = -1;
    }
 
-   /* Seems like a good idea to bump the GC count even though we failed */
+   /* Unlink query text file to get a clean slate and prevent thrashing */
+   unlink(PGSS_TEXT_FILE);
+   pgss->mean_query_len = ASSUMED_LENGTH_INIT;
+   pgss->extent = 0;
+   /* Allocate new query text temp file */
+   qfile = AllocateFile(PGSS_TEXT_FILE, PG_BINARY_W);
+   if (qfile == NULL)
+       ereport(LOG,
+               (errcode_for_file_access(),
+                errmsg("could not write new pg_stat_statement file \"%s\": %m",
+                       PGSS_TEXT_FILE)));
+   else
+       FreeFile(qfile);
+
+   /*
+    * Bump the GC count even though we failed.
+    *
+    * This is needed to make concurrent readers of file without any lock on
+    * pgss->lock notice existence of new version of file.  Once readers
+    * subsequently observe a change in GC count with pgss->lock held, that
+    * forces a safe reopen of file.  Writers also require that we bump
+    * here, of course.  (As required by locking protocol, readers and
+    * writers don't trust earlier file contents until gc_count is found
+    * unchanged after pgss->lock acquired in shared or exclusive mode
+    * respectively.)
+    */
    record_gc_qtexts();
 }
-- 
1.9.1
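To make the throttling arithmetic described in the commit message concrete, here is a minimal standalone sketch (not part of the patch). The 200kB "inflated" mean query length is an assumed example figure, pg_stat_statements.max is taken at its default of 5000, and MaxAllocSize mirrors PostgreSQL's 1GB-1 allocation limit:

#include <stdio.h>
#include <stddef.h>

int
main(void)
{
    const size_t max_alloc_size = 0x3fffffff;   /* MaxAllocSize, ~1GB */
    const size_t pgss_max = 5000;               /* default pg_stat_statements.max */
    const size_t mean_query_len = 200 * 1024;   /* assumed, inflated by sticky entries */

    /* need_gc_qtexts() only fires once extent >= mean_query_len * pgss_max * 2 */
    size_t      gc_threshold = mean_query_len * pgss_max * 2;

    printf("GC deferred until the file reaches %zu bytes\n", gc_threshold);
    printf("qtext_load_file() previously refused anything above %zu bytes\n",
           max_alloc_size);
    return 0;
}

With those numbers the file grows to roughly 2GB before a garbage collection is even attempted, which is already past the old MaxAllocSize cutoff in qtext_load_file(), so every attempt failed and the file kept growing; hence both the exclusion of sticky entries from the mean and the switch to MaxAllocHugeSize.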