On Wed, Mar 03, 2021 at 11:36:26AM +0000, Tharakan, Robins wrote:
> While reviewing a failed upgrade from Postgres v9.5 (to v9.6) I saw that the
> instance had ~200 million (in-use) Large Objects. I was able to reproduce
> this on a test instance which too fails with a similar error.
If pg_upgrade can't handle millions of objects/transactions/XIDs, that seems
like a legitimate complaint, since apparently the system is working okay
otherwise. But it also seems like you're using it outside the range of its
intended use (see also [1]).

I'm guessing that not many people are going to spend time running tests of
pg_upgrade, each of which takes 25hr, not to mention some multiple of 128GB
RAM+swap. Creating millions of large objects was too slow for me to test
like this:

| time { echo 'begin;'; for a in `seq 1 99999`; do echo '\lo_import /dev/null'; done; echo 'commit;'; } |psql -qh /tmp postgres&

This seems to be enough for what's needed:

| ALTER SYSTEM SET fsync=no; ALTER SYSTEM SET full_page_writes=no; SELECT pg_reload_conf();
| INSERT INTO pg_largeobject_metadata SELECT a, 0 FROM generate_series(100000, 200111222)a;

Even so, my test pg_upgrade was killed after running for 100min and using
60GB RAM, so you might say that's a problem too. I converted getBlobs() to
use a cursor, like dumpBlobs(), but it was still killed. I think a test
case, and a way to exercise this failure with a more reasonable amount of
time and resources, might be a prerequisite for a patch to fix it.

pg_upgrade is meant for "immediate" upgrades, frequently allowing an upgrade
in minutes where pg_dump |pg_restore might take hours or days. There are two
components to consider: the catalog/metadata part and the data part. If the
data is large (say, more than 100GB), then pg_upgrade is expected to be an
improvement over the "dump and restore" process, which is usually infeasible
for large DBs measured in TB. But if the *catalog* part is large, pg_upgrade
still has to run pg_dump and pg_restore, and the time to do that might
dominate over the data part.

Our own customers' DBs are 100s of GB to 10TB. For large customers,
pg_upgrade takes 45min. In the past, we had tables with many column
defaults, which caused the dump+restore step to be slow for a larger
fraction of customers.
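For a sense of scale, the peak memory above works out to a few hundred bytes
of pg_dump client memory per large object. This is only a back-of-the-envelope
figure derived from the 60GB / ~200 million numbers above, not a measured
per-object cost:

```shell
# Rough per-object client memory in getBlobs(): each blob gets a BlobInfo
# plus several pg_strdup'ed strings. 60 GB resident for the 200,111,222
# entries inserted into pg_largeobject_metadata implies approximately:
awk 'BEGIN { printf "%.0f bytes/object\n", 60 * 2^30 / 200111222 }'
# prints "322 bytes/object"
```

That per-object overhead is why a cursor alone doesn't save the day: the
BlobInfo array still has to hold every object at once.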
If it were me, in an EOL situation, I would look at either:
1) find a way to do dump+restore rather than pg_upgrade; and/or,
2) separately pg_dump the large objects, drop as many as you can, then
pg_upgrade the DB, then restore the large objects. (And find a better way
to store them in the future.)

I was able to hack pg_upgrade to call pg_restore --single (with a separate
invocation to handle --create). That passes tests... but I can't say much
beyond that.

Regarding your existing patch: "make check" only tests SQL features. For
development, you'll want to configure like:

| ./configure --enable-debug --enable-cassert --enable-tap-tests

And then use "make check-world", and in particular:

| time make check -C src/bin/pg_resetwal
| time make check -C src/bin/pg_upgrade

I don't think pg_restore needs a user-facing option for XIDs. I think it
should "just work", since a user is as likely to shoot themselves in the
foot with a commandline option as to make an upgrade succeed that would
otherwise fail. pg_upgrade has a --check mode; if that passes, the upgrade
is intended to work, not to fail halfway through, between the schema dump
and restore, with the expectation that the user knows to rerun with some
commandline flags.

If you pursue the patch with setting a different XID threshold, maybe you
could count the number of objects to be created, or transactions to be
used, and use that as the argument to resetxlog? I'm not sure, but
pg_restore -l might be a good place to start looking.

I think a goal for this patch should be to allow an increased number of
objects to be handled by pg_upgrade. Large objects may be a special case;
increasing the number of other objects that can be restored into the
hundreds of millions might be unimportant.
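To count the objects ahead of time as suggested, the archive TOC could be
grepped for BLOB entries. A minimal sketch, using a hypothetical saved
listing (the sample lines mimic typical "pg_restore -l" output, one line
per archive entry; the file and dump names are made up):

```shell
# Hypothetical excerpt of "pg_restore -l dump.custom" output; each large
# object appears as its own "BLOB" TOC entry:
cat > /tmp/toc.txt <<'EOF'
3343; 2613 16384 BLOB 16384 postgres
3344; 2613 16385 BLOB 16385 postgres
215; 1259 24576 TABLE public t postgres
EOF
# Counting the BLOB entries approximates how many objects/transactions
# the restore step would need:
grep -c ' BLOB ' /tmp/toc.txt
# prints "2"
```

In a real run you would pipe "pg_restore -l" straight into grep rather than
going through a temp file.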
-- 
Justin

[1] https://www.postgresql.org/message-id/502641.1606334432%40sss.pgh.pa.us
| Does pg_dump really have sane performance for that situation, or
| are we soon going to be fielding requests to make it not be O(N^2)
| in the number of listed tables?
From cfc7400bb021659d49170e8b17d067c8e1b9fa33 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryz...@telsasoft.com>
Date: Tue, 9 Mar 2021 14:06:17 -0600
Subject: [PATCH] pg_dump: use a cursor in getBlobs..

..to mitigate huge memory use in the case of millions of large objects
---
 src/bin/pg_dump/pg_dump.c | 96 +++++++++++++++++++++------------------
 1 file changed, 52 insertions(+), 44 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index aa02ada079..3fd7f48605 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3333,7 +3333,7 @@ getBlobs(Archive *fout)
     BlobInfo *binfo;
     DumpableObject *bdata;
     PGresult *res;
-    int        ntups;
+    int        ntups, total = 0;
     int        i;
     int        i_oid;
     int        i_lomowner;
@@ -3341,9 +3341,12 @@ getBlobs(Archive *fout)
     int        i_rlomacl;
     int        i_initlomacl;
     int        i_initrlomacl;
+    const char *blobFetchQry = "FETCH 1000 IN blob";
 
     pg_log_info("reading large objects");
 
+    appendPQExpBuffer(blobQry, "DECLARE blob CURSOR FOR ");
+
     /* Fetch BLOB OIDs, and owner/ACL data if >= 9.0 */
     if (fout->remoteVersion >= 90600)
     {
@@ -3393,58 +3396,66 @@ getBlobs(Archive *fout)
                           "NULL::oid AS initrlomacl "
                           " FROM pg_largeobject");
 
-    res = ExecuteSqlQuery(fout, blobQry->data, PGRES_TUPLES_OK);
+    ExecuteSqlStatement(fout, blobQry->data);
+    destroyPQExpBuffer(blobQry);
 
-    i_oid = PQfnumber(res, "oid");
-    i_lomowner = PQfnumber(res, "rolname");
-    i_lomacl = PQfnumber(res, "lomacl");
-    i_rlomacl = PQfnumber(res, "rlomacl");
-    i_initlomacl = PQfnumber(res, "initlomacl");
-    i_initrlomacl = PQfnumber(res, "initrlomacl");
+    do {
+        res = ExecuteSqlQuery(fout, blobFetchQry, PGRES_TUPLES_OK);
 
-    ntups = PQntuples(res);
+        i_oid = PQfnumber(res, "oid");
+        i_lomowner = PQfnumber(res, "rolname");
+        i_lomacl = PQfnumber(res, "lomacl");
+        i_rlomacl = PQfnumber(res, "rlomacl");
+        i_initlomacl = PQfnumber(res, "initlomacl");
+        i_initrlomacl = PQfnumber(res, "initrlomacl");
 
-    /*
-     * Each large object has its own BLOB archive entry.
-     */
-    binfo = (BlobInfo *) pg_malloc(ntups * sizeof(BlobInfo));
+        ntups = PQntuples(res);
+        total += ntups;
 
-    for (i = 0; i < ntups; i++)
-    {
-        binfo[i].dobj.objType = DO_BLOB;
-        binfo[i].dobj.catId.tableoid = LargeObjectRelationId;
-        binfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid));
-        AssignDumpId(&binfo[i].dobj);
+        /*
+         * Each large object has its own BLOB archive entry.
+         */
+        binfo = (BlobInfo *) pg_malloc(ntups * sizeof(BlobInfo));
 
-        binfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid));
-        binfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_lomowner));
-        binfo[i].blobacl = pg_strdup(PQgetvalue(res, i, i_lomacl));
-        binfo[i].rblobacl = pg_strdup(PQgetvalue(res, i, i_rlomacl));
-        binfo[i].initblobacl = pg_strdup(PQgetvalue(res, i, i_initlomacl));
-        binfo[i].initrblobacl = pg_strdup(PQgetvalue(res, i, i_initrlomacl));
+        for (i = 0; i < ntups; i++)
+        {
+            binfo[i].dobj.objType = DO_BLOB;
+            binfo[i].dobj.catId.tableoid = LargeObjectRelationId;
+            binfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid));
+            AssignDumpId(&binfo[i].dobj);
+
+            binfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid));
+            binfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_lomowner));
+            binfo[i].blobacl = pg_strdup(PQgetvalue(res, i, i_lomacl));
+            binfo[i].rblobacl = pg_strdup(PQgetvalue(res, i, i_rlomacl));
+            binfo[i].initblobacl = pg_strdup(PQgetvalue(res, i, i_initlomacl));
+            binfo[i].initrblobacl = pg_strdup(PQgetvalue(res, i, i_initrlomacl));
+
+            if (PQgetisnull(res, i, i_lomacl) &&
+                PQgetisnull(res, i, i_rlomacl) &&
+                PQgetisnull(res, i, i_initlomacl) &&
+                PQgetisnull(res, i, i_initrlomacl))
+                binfo[i].dobj.dump &= ~DUMP_COMPONENT_ACL;
 
-        if (PQgetisnull(res, i, i_lomacl) &&
-            PQgetisnull(res, i, i_rlomacl) &&
-            PQgetisnull(res, i, i_initlomacl) &&
-            PQgetisnull(res, i, i_initrlomacl))
-            binfo[i].dobj.dump &= ~DUMP_COMPONENT_ACL;
+            /*
+             * In binary-upgrade mode for blobs, we do *not* dump out the blob
+             * data, as it will be copied by pg_upgrade, which simply copies the
+             * pg_largeobject table. We *do* however dump out anything but the
+             * data, as pg_upgrade copies just pg_largeobject, but not
+             * pg_largeobject_metadata, after the dump is restored.
+             */
+            if (dopt->binary_upgrade)
+                binfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA;
+        }
 
-        /*
-         * In binary-upgrade mode for blobs, we do *not* dump out the blob
-         * data, as it will be copied by pg_upgrade, which simply copies the
-         * pg_largeobject table. We *do* however dump out anything but the
-         * data, as pg_upgrade copies just pg_largeobject, but not
-         * pg_largeobject_metadata, after the dump is restored.
-         */
-        if (dopt->binary_upgrade)
-            binfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA;
-    }
+        PQclear(res);
+    } while (ntups != 0);
 
     /*
     * If we have any large objects, a "BLOBS" archive entry is needed. This
     * is just a placeholder for sorting; it carries no data now.
     */
-    if (ntups > 0)
+    if (total > 0)
     {
         bdata = (DumpableObject *) pg_malloc(sizeof(DumpableObject));
         bdata->objType = DO_BLOB_DATA;
@@ -3452,9 +3463,6 @@ getBlobs(Archive *fout)
         AssignDumpId(bdata);
         bdata->name = pg_strdup("BLOBS");
     }
-
-    PQclear(res);
-    destroyPQExpBuffer(blobQry);
 }
 
 /*
-- 
2.17.0