Hi all (Long time!)
I recently ran into an issue where a monitoring component was calling
pg_catalog.pg_database_size() at a ~15s interval on a large
schema-sharded database. The call took so long that the query was
timing out before the next 15s scrape interval started. The resulting
filesystem metadata activity was also hammering the inode cache etc.,
affecting normal operations.

================
pg_database_size() is expensive
================

The pg_database_size() function will stat every fork of every
relation, which is pretty expensive on a big DB with a lot of extents
and high relation counts.

In my toy database:

test=# select pg_database_size('craig');
 pg_database_size
------------------
        338084643
(1 row)

Attaching strace to the backend:

# strace -c -p 1009203
[...running...]

then repeating the query:

test=# select pg_database_size('craig');
 pg_database_size
------------------
        338084643
(1 row)

gives this strace output:

strace: Process 1009203 attached
^Cstrace: Process 1009203 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.18    0.001148           2       393           newfstatat
  3.42    0.000043          14         3           sendto
  1.59    0.000020           6         3         1 openat
  1.35    0.000017           4         4           getdents64
  1.27    0.000016           8         2           close
  0.95    0.000012           6         2         1 recvfrom
  0.24    0.000003           1         2           fstat
  0.00    0.000000           0         1           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00    0.001259           3       410         2 total

# find base/16385 | wc -l
394

So it's fstatat()ing (almost) every file: all forks, indexes, extents,
etc. No surprise it was taking forever on a very complex DB.

================
Docs for pg_database_size() don't mention performance impact
================

When I checked the docs to verify my recollection that
pg_database_size() would stat() every relfilenode (all extents, all
forks including maps, etc.), I was surprised to see that they don't
say much about it at all:

https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT

> pg_database_size ( oid ) → bigint
>
> Computes the total disk space used by the database with the specified
> name or OID. To use this function, you must have CONNECT privilege on
> the specified database (which is granted by default) or have
> privileges of the pg_read_all_stats role.

It strikes me that we should probably warn about the cost of this
function on large DBs in the docs.

================
Lower-impact alternatives?
================

Which leads to... what's the alternative, then, if we're going to tell
users not to poll pg_database_size() from monitoring systems?

FS-based sizing isn't really enough
----------------

Asking users to monitor at the filesystem level kind-of works, but in
simple installs it will lead to confusion due to WAL and temp files.
To get decent results users will need a separate dedicated volume for
pg_wal. Which temp files get counted will also differ: IIRC
pg_database_size() does not count extents created by an in-progress
REINDEX etc., but DOES count temp table sizes, for example.

FS-based monitoring will also include things like reorder buffers
spilled under pg_replslot, which can be considerable and aren't
reasonably considered part of the "database size" or included in
pg_database_size(). And of course it can see only the sum of all
database sizes on a multi-database postgres instance, unless the user
has one volume per database using distinct tablespaces.

So filesystem-based monitoring is not really a proper replacement.
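For what it's worth, if someone does go the filesystem route anyway,
the base/ subdirectories can at least be attributed to databases. A
minimal sketch (just a catalog query, nothing clever):

select oid, datname from pg_database;

A du over base/<oid> then gives a per-database figure for the default
tablespace, though du re-does essentially the same per-file stat()
walk that makes pg_database_size() expensive in the first place, so
it's no cheaper.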
Using relpages from pg_class is closer, but problematic
-----------------

Should the user aggregate relpages from pg_class, multiply by the page
size, and call it kind-of-good-enough? That seems awfully crude, and
user-unfriendly at that, though it works in trivial cases:

test=# select pg_size_pretty(pg_database_size('test'));
 pg_size_pretty
----------------
 309 MB
(1 row)

test=# select pg_size_pretty(sum(relpages)*(1024*8)) from pg_class;
 pg_size_pretty
----------------
 308 MB
(1 row)

However, it breaks down pretty easily with anything less trivial. For
one thing, relpages lags behind VACUUM, so quickly growing or changing
tables and temp tables generally won't be captured. E.g. run

CREATE TEMPORARY TABLE bar AS
SELECT repeat('foofoo',x) AS filler1,
       repeat('morefillerhere',x) AS filler2
FROM generate_series(1,10000) x;

... then pg_database_size() will reflect the temp table, but summing
relpages won't, even in the session that created the temp table: the
relpages for bar stays 0 until after a VACUUM.

For example, I got a pg_database_size() of 322MB, an initial
relpages-sum estimate of 308MB (unchanged from before the temp table
was created), and a relpages-sum estimate of 319MB after VACUUM. The
on-disk size of the temp table is 11MB.

Meanwhile the actual size of base/16385 is 323MB according to du,
so... good enough. But the overall size of the whole datadir is 1093MB
at the moment due to WAL; excluding WAL it's 356MB, mostly due to
other DBs in the same instance's base/.

==============
Any other alternatives?
==============

I'm interested in any other options anyone might suggest for better
ways to track DB size. Obviously there's going to be a trade-off
between freshness and cost: pg_database_size() is expensive because
it's immediate and current, whereas using relpages is inaccurate
because it waits for vacuum. Any suggestions for a middle ground?

If using relpages is considered good-enough, would anyone be
interested if I were to cook up a patch for a more user-friendly
interface like pg_database_size_cached() to present it to users? A
rough sketch of the idea follows.
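To make that concrete, here's a minimal sketch of the sort of thing I
mean, written as a plain SQL function over relpages. The name and the
SQL-level implementation are purely illustrative; a real
pg_database_size_cached() would presumably live in the backend and
might instead cache the result of the expensive file walk:

-- Illustrative sketch only: estimate the current database's size from
-- relpages instead of stat()ing every relation file. The result is
-- only as fresh as the last VACUUM/ANALYZE of each relation, and,
-- unlike pg_database_size(oid), it can only report on the database
-- it's run in, since relpages lives in that database's pg_class.
CREATE FUNCTION pg_database_size_estimate() RETURNS bigint
LANGUAGE sql STABLE
AS $$
    SELECT sum(relpages)::bigint * current_setting('block_size')::bigint
    FROM pg_class;
$$;

Using current_setting('block_size') rather than hard-coding 8192 keeps
it correct on builds with a non-default block size.

Whew. /novel.

--
Craig Ringer
EnterpriseDB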