This is a follow-up to the thread at http://www.postgresql.org/message-id/4eb5fa1b.1090...@2ndquadrant.com
A quick summary: that thread proposed adding a relation_free_space() function to the pageinspect extension. Various review comments were received, among which was the suggestion that the code belonged in pg_stattuple as a faster way to calculate free_percent. === I've attached an extension that produces largely pgstattuple-compatible numbers for a table without doing a full-table scan. It scans through the table, skipping blocks that have their visibility map bit set. For such pages, it gets the free space from the free space map, and assumes that all remaining space on the page is taken by live tuples. It scans other pages tuple-by-tuple and counts live and dead tuples and free space. Here's a comparison of fastbloat and pgstattuple output on a 50-million row table with some holes created with a single big DELETE statement: ams=# select * from fastbloat('x'); table_len | scanned_percent | approx_tuple_count | approx_tuple_len | approx_tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-----------------+--------------------+------------------+----------------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 17 | 41318301 | 5483815648 | 81.67 | 8681708 | 1111258624 | 16.55 | 80972912 | 1.21 (1 row) Time: 639.455 ms ams=# select * from pgstattuple('x'); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 41318292 | 5288741376 | 78.76 | 8681708 | 1111258624 | 16.55 | 91810372 | 1.37 (1 row) Time: 15610.651 ms In the above, the table_len is nblocks*BLCKSZ, and the dead_tuple_count, dead_tuple_len, dead_tuple_percent, free_space, and free_percent are all exact. scanned_percent shows the percentage of pages that were scanned tuple-by-tuple (the others having been skipped based on the VM bit). The live tuple count, size, and percentage are all estimates. The approx_tuple_count is calculated using vac_estimate_reltuples based on the pages/tuples that were scanned. The approx_tuple_len is the exact size of the live tuples on scanned pages, plus the approximate size from skipped pages (BLCKSZ-GetRecordedFreeSpace()). This is an overestimate, because it's counting the line pointer array as live tuple space. Even in the worst case, when every page has dead tuples, fastbloat is marginally faster than pgstattuple. The same table as the first example, but with every even-numbered row deleted: ams=# select * from fastbloat('x'); table_len | scanned_percent | approx_tuple_count | approx_tuple_len | approx_tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-----------------+--------------------+------------------+----------------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 100 | 20659150 | 2644371200 | 39.38 | 20659142 | 2644370176 | 39.38 | 1203068996 | 17.92 (1 row) Time: 8924.511 ms ams=# select * from pgstattuple('x'); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 20659150 | 2644371200 | 39.38 | 20659142 | 2644370176 | 39.38 | 1203068996 | 17.92 (1 row) Time: 13338.712 ms Since the code depends on the visibility map to determine which pages to skip, it does not support indexes (which have no visibility map). (Just drop the attached files into contrib/fastbloat, and "make install" should just work. Then just "create extension fastbloat".) Questions and suggestions welcome. -- Abhijit
/* * contrib/fastbloat/fastbloat.c * * Abhijit Menon-Sen <a...@2ndquadrant.com> * Portions Copyright (c) 2001,2002 Tatsuo Ishii (from pg_stattuple) * * Permission to use, copy, modify, and distribute this software and * its documentation for any purpose, without fee, and without a * written agreement is hereby granted, provided that the above * copyright notice and this paragraph and the following two * paragraphs appear in all copies. * * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT, * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED * OF THE POSSIBILITY OF SUCH DAMAGE. * * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR * A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS * IS" BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. */ #include "postgres.h" #include "access/visibilitymap.h" #include "access/transam.h" #include "access/xact.h" #include "access/multixact.h" #include "access/htup_details.h" #include "catalog/namespace.h" #include "funcapi.h" #include "miscadmin.h" #include "storage/bufmgr.h" #include "storage/freespace.h" #include "storage/procarray.h" #include "storage/lmgr.h" #include "utils/builtins.h" #include "utils/tqual.h" #include "commands/vacuum.h" PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(fastbloat); PG_FUNCTION_INFO_V1(fastbloatbyid); extern Datum fastbloat(PG_FUNCTION_ARGS); extern Datum fastbloatbyid(PG_FUNCTION_ARGS); /* * tuple_percent, dead_tuple_percent and free_percent are computable, * so not defined here. */ typedef struct fastbloat_output_type { uint64 table_len; uint64 tuple_count; uint64 tuple_len; uint64 dead_tuple_count; uint64 dead_tuple_len; uint64 free_space; uint64 total_pages; uint64 scanned_pages; } fastbloat_output_type; static Datum build_output_type(fastbloat_output_type *stat, FunctionCallInfo fcinfo); static Datum fbstat_relation(Relation rel, FunctionCallInfo fcinfo); static Datum fbstat_heap(Relation rel, FunctionCallInfo fcinfo); static HTSV_Result HeapTupleSatisfiesVacuumNoHint(HeapTuple htup, TransactionId OldestXmin); /* * build a fastbloat_output_type tuple */ static Datum build_output_type(fastbloat_output_type *stat, FunctionCallInfo fcinfo) { #define NCOLUMNS 10 #define NCHARS 32 HeapTuple tuple; char *values[NCOLUMNS]; char values_buf[NCOLUMNS][NCHARS]; int i; double tuple_percent; double dead_tuple_percent; double free_percent; /* free/reusable space in % */ double scanned_percent; TupleDesc tupdesc; AttInMetadata *attinmeta; /* Build a tuple descriptor for our result type */ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE) elog(ERROR, "return type must be a row type"); /* * Generate attribute metadata needed later to produce tuples from raw C * strings */ attinmeta = TupleDescGetAttInMetadata(tupdesc); if (stat->table_len == 0) { tuple_percent = 0.0; dead_tuple_percent = 0.0; free_percent = 0.0; } else { tuple_percent = 100.0 * stat->tuple_len / stat->table_len; dead_tuple_percent = 100.0 * stat->dead_tuple_len / stat->table_len; free_percent = 100.0 * stat->free_space / stat->table_len; } scanned_percent = 0.0; if (stat->total_pages != 0) { scanned_percent = 100 * stat->scanned_pages / stat->total_pages; } for (i = 0; i < NCOLUMNS; i++) values[i] = values_buf[i]; i = 0; snprintf(values[i++], NCHARS, INT64_FORMAT, stat->table_len); snprintf(values[i++], NCHARS, "%.2f", scanned_percent); snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_count); snprintf(values[i++], NCHARS, INT64_FORMAT, stat->tuple_len); snprintf(values[i++], NCHARS, "%.2f", tuple_percent); snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_count); snprintf(values[i++], NCHARS, INT64_FORMAT, stat->dead_tuple_len); snprintf(values[i++], NCHARS, "%.2f", dead_tuple_percent); snprintf(values[i++], NCHARS, INT64_FORMAT, stat->free_space); snprintf(values[i++], NCHARS, "%.2f", free_percent); tuple = BuildTupleFromCStrings(attinmeta, values); return HeapTupleGetDatum(tuple); } /* Returns a tuple with live/dead tuple statistics for the named table. */ Datum fastbloat(PG_FUNCTION_ARGS) { text *relname = PG_GETARG_TEXT_P(0); RangeVar *relrv; Relation rel; if (!superuser()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to use fastbloat functions")))); relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname)); rel = relation_openrv(relrv, AccessShareLock); PG_RETURN_DATUM(fbstat_relation(rel, fcinfo)); } /* As above, but takes a reloid instead of a relation name. */ Datum fastbloatbyid(PG_FUNCTION_ARGS) { Oid relid = PG_GETARG_OID(0); Relation rel; if (!superuser()) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to use fastbloat functions")))); rel = relation_open(relid, AccessShareLock); PG_RETURN_DATUM(fbstat_relation(rel, fcinfo)); } /* * A helper function to reject unsupported relation types. We depend on * the visibility map to decide which pages we can skip, so we can't * support indexes, for example, which don't have a VM. */ static Datum fbstat_relation(Relation rel, FunctionCallInfo fcinfo) { const char *err; /* * Reject attempts to read non-local temporary relations; we would be * likely to get wrong data since we have no visibility into the owning * session's local buffers. */ if (RELATION_IS_OTHER_TEMP(rel)) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot access temporary tables of other sessions"))); switch (rel->rd_rel->relkind) { case RELKIND_RELATION: case RELKIND_MATVIEW: return fbstat_heap(rel, fcinfo); case RELKIND_TOASTVALUE: err = "toast value"; break; case RELKIND_SEQUENCE: err = "sequence"; break; case RELKIND_INDEX: err = "index"; break; case RELKIND_VIEW: err = "view"; break; case RELKIND_COMPOSITE_TYPE: err = "composite type"; break; case RELKIND_FOREIGN_TABLE: err = "foreign table"; break; default: err = "unknown"; break; } ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" (%s) is not supported", RelationGetRelationName(rel), err))); return 0; } /* * This function takes an already open relation and scans its pages, * skipping those that have the corresponding visibility map bit set. * For pages we skip, we find the free space from the free space map * and approximate tuple_len on that basis. For the others, we count * the exact number of dead tuples etc. * * This scan is loosely based on vacuumlazy.c:lazy_scan_heap(), but * we do not try to avoid skipping single pages. */ static Datum fbstat_heap(Relation rel, FunctionCallInfo fcinfo) { BlockNumber scanned, nblocks, blkno; Buffer vmbuffer = InvalidBuffer; HeapTupleData tuple; fastbloat_output_type stat = {0}; BufferAccessStrategy bstrategy; TransactionId OldestXmin; OldestXmin = GetOldestXmin(rel, true); bstrategy = GetAccessStrategy(BAS_BULKREAD); scanned = 0; nblocks = RelationGetNumberOfBlocks(rel); for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; Page page; OffsetNumber offnum, maxoff; Size freespace; /* * If the page has only visible tuples, then we can find out the * free space from the FSM and move on. The remainder of space * on the page (including that used by line pointers) */ if (visibilitymap_test(rel, blkno, &vmbuffer)) { freespace = GetRecordedFreeSpace(rel, blkno); stat.tuple_len += BLCKSZ - freespace; stat.free_space += freespace; continue; } buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, bstrategy); page = BufferGetPage(buf); if (PageIsNew(page)) { ReleaseBuffer(buf); continue; } scanned++; stat.free_space += PageGetHeapFreeSpace(page); /* * Look at each tuple on the page and decide whether it's live * or dead, then count it and its size. Unlike lazy_scan_heap, * we can afford to ignore problems and special cases. We also * do not need to hold a content lock on the buffer because of * our non-hint-bit-setting copy of HeapTupleSatisfiesVacuum. */ maxoff = PageGetMaxOffsetNumber(page); for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { ItemId itemid; itemid = PageGetItemId(page, offnum); if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid) || ItemIdIsDead(itemid)) { continue; } Assert(ItemIdIsNormal(itemid)); ItemPointerSet(&(tuple.t_self), blkno, offnum); tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid); tuple.t_len = ItemIdGetLength(itemid); tuple.t_tableOid = RelationGetRelid(rel); switch (HeapTupleSatisfiesVacuumNoHint(&tuple, OldestXmin)) { case HEAPTUPLE_DEAD: case HEAPTUPLE_RECENTLY_DEAD: stat.dead_tuple_len += tuple.t_len; stat.dead_tuple_count++; break; case HEAPTUPLE_LIVE: stat.tuple_len += tuple.t_len; stat.tuple_count++; break; case HEAPTUPLE_INSERT_IN_PROGRESS: case HEAPTUPLE_DELETE_IN_PROGRESS: break; default: elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result"); break; } } ReleaseBuffer(buf); } stat.table_len = (uint64) nblocks * BLCKSZ; stat.tuple_count = vac_estimate_reltuples(rel, false, nblocks, scanned, stat.tuple_count); stat.total_pages = nblocks; stat.scanned_pages = scanned; if (BufferIsValid(vmbuffer)) { ReleaseBuffer(vmbuffer); vmbuffer = InvalidBuffer; } relation_close(rel, AccessShareLock); return build_output_type(&stat, fcinfo); } /* * This is a copy of HeapTupleSatisfiesVacuum with all the hint-bit * setting code removed (and thus, some tests reversed). We use this * so that we don't need to hold any content locks on the buffer while * scanning for dead tuples. (We could use HeapTupleIsSurelyDead(), but * that's unnecessarily conservative.) * * See HTSV for comments. */ static HTSV_Result HeapTupleSatisfiesVacuumNoHint(HeapTuple htup, TransactionId OldestXmin) { HeapTupleHeader tuple = htup->t_data; Assert(ItemPointerIsValid(&htup->t_self)); Assert(htup->t_tableOid != InvalidOid); if (!HeapTupleHeaderXminCommitted(tuple)) { if (HeapTupleHeaderXminInvalid(tuple)) return HEAPTUPLE_DEAD; else if (tuple->t_infomask & HEAP_MOVED_OFF) { TransactionId xvac = HeapTupleHeaderGetXvac(tuple); if (TransactionIdIsCurrentTransactionId(xvac)) return HEAPTUPLE_DELETE_IN_PROGRESS; if (TransactionIdIsInProgress(xvac)) return HEAPTUPLE_DELETE_IN_PROGRESS; if (TransactionIdDidCommit(xvac)) { return HEAPTUPLE_DEAD; } } else if (tuple->t_infomask & HEAP_MOVED_IN) { TransactionId xvac = HeapTupleHeaderGetXvac(tuple); if (TransactionIdIsCurrentTransactionId(xvac)) return HEAPTUPLE_INSERT_IN_PROGRESS; if (TransactionIdIsInProgress(xvac)) return HEAPTUPLE_INSERT_IN_PROGRESS; if (!TransactionIdDidCommit(xvac)) { return HEAPTUPLE_DEAD; } } else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple))) { if (tuple->t_infomask & HEAP_XMAX_INVALID) return HEAPTUPLE_INSERT_IN_PROGRESS; if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) || HeapTupleHeaderIsOnlyLocked(tuple)) return HEAPTUPLE_INSERT_IN_PROGRESS; return HEAPTUPLE_DELETE_IN_PROGRESS; } else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) { return HEAPTUPLE_DEAD; } } if (tuple->t_infomask & HEAP_XMAX_INVALID) return HEAPTUPLE_LIVE; if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)) { if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED)) { if (tuple->t_infomask & HEAP_XMAX_IS_MULTI) { if ((tuple->t_infomask & (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_KEYSHR_LOCK)) && MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple))) return HEAPTUPLE_LIVE; } else { if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuple))) return HEAPTUPLE_LIVE; } } return HEAPTUPLE_LIVE; } if (tuple->t_infomask & HEAP_XMAX_IS_MULTI) { TransactionId xmax; if (MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple))) { Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask)); xmax = HeapTupleGetUpdateXid(tuple); Assert(TransactionIdIsValid(xmax)); if (TransactionIdIsInProgress(xmax)) return HEAPTUPLE_DELETE_IN_PROGRESS; else if (TransactionIdDidCommit(xmax)) return HEAPTUPLE_RECENTLY_DEAD; return HEAPTUPLE_LIVE; } Assert(!(tuple->t_infomask & HEAP_XMAX_COMMITTED)); xmax = HeapTupleGetUpdateXid(tuple); Assert(TransactionIdIsValid(xmax)); Assert(!TransactionIdIsInProgress(xmax)); if (TransactionIdDidCommit(xmax)) { if (!TransactionIdPrecedes(xmax, OldestXmin)) return HEAPTUPLE_RECENTLY_DEAD; else return HEAPTUPLE_DEAD; } return HEAPTUPLE_LIVE; } if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED)) { if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuple))) return HEAPTUPLE_DELETE_IN_PROGRESS; else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple))) { return HEAPTUPLE_LIVE; } } if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuple), OldestXmin)) return HEAPTUPLE_RECENTLY_DEAD; return HEAPTUPLE_DEAD; }
fastbloat--1.0.sql
Description: application/sql
# fastbloat extension comment = 'calculate bloat statistics without a full-table scan' default_version = '1.0' module_pathname = '$libdir/fastbloat' relocatable = true
# contrib/fastbloat/Makefile MODULE_big = fastbloat OBJS = fastbloat.o EXTENSION = fastbloat DATA = fastbloat--1.0.sql REGRESS = fastbloat ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) include $(PGXS) else subdir = contrib/fastbloat top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif
fastbloat --------- This extension attempts to measure table bloat without incurring the cost of a full-table scan. The pgstattuple extension performs a full-table scan and returns an exact count of live and dead tuples (and their sizes) and free space. This code returns the same dead tuple and free space statistics along with an estimate for the number and size of live tuples. It does this by skipping pages that have only visible tuples according to the visibility map (if a page has the corresponding VM bit set, then it contains no dead tuples). For such pages, it derives the free space value from the free space map, and assumes that the rest of the space on the page is taken up by live tuples. For pages that cannot be skipped, it scans each tuple, recording its presence and size in the appropriate counters, and adding up the free space on the page. At the end, it estimates the total number of live tuples based on the number of pages and tuples scanned (in the same way that VACUUM estimates pg_class.reltuples). Here is an example comparing the output of fastbloat and pgstattuple for the same table: ams=# select * from fastbloat('x'); table_len | scanned_percent | approx_tuple_count | approx_tuple_len | approx_tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-----------------+--------------------+------------------+----------------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 17 | 41318301 | 5483815648 | 81.67 | 8681708 | 1111258624 | 16.55 | 80972912 | 1.21 (1 row) Time: 639.455 ms ams=# select * from pgstattuple('x'); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 41318292 | 5288741376 | 78.76 | 8681708 | 1111258624 | 16.55 | 91810372 | 1.37 (1 row) Time: 15610.651 ms In the above figures, table_len is exact (being based only on the number of pages), as are dead_tuple_count, dead_tuple_len, dead_tuple_percent, free_space, and free_percent. In the fastbloat output, scanned_percent is the percentage of pages that were scanned tuple-by-tuple (the remainder having been skipped, thanks to the visibility map). The approximate tuple count is quite close to the actual value (as shown by pgstattuple), but the approx_tuple_len overestimates the size of live tuples (and thus approx_tuple_percent is also higher than pgstattuple's calculation). The larger scanned_percent is, the better the estimates are. fastbloat is several times faster in the example above because it does less work (by skipping pages and by not needing to hold content locks on buffers as it scans other pages—see the code for details). The less bloated the table, the less time it takes; but even in the worst case, if all pages have dead tuples, it's marginally faster than pgstattuple. ams=# select * from fastbloat('x'); table_len | scanned_percent | approx_tuple_count | approx_tuple_len | approx_tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-----------------+--------------------+------------------+----------------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 100 | 20659150 | 2644371200 | 39.38 | 20659142 | 2644370176 | 39.38 | 1203068996 | 17.92 (1 row) Time: 8924.511 ms ams=# select * from pgstattuple('x'); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ------------+-------------+------------+---------------+------------------+----------------+--------------------+------------+-------------- 6714761216 | 20659150 | 2644371200 | 39.38 | 20659142 | 2644370176 | 39.38 | 1203068996 | 17.92 (1 row) Time: 13338.712 ms (Note that the statistics from fastbloat are exact in this case, because every page was scanned according to scanned_percent.) Author: Abhijit Menon-Sen <a...@2ndquadrant.com>
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers