On Wed, Mar 03, 2021 at 11:36:26AM +0000, Tharakan, Robins wrote:
> While reviewing a failed upgrade from Postgres v9.5 (to v9.6) I saw that the
> instance had ~200 million (in-use) Large Objects. I was able to reproduce
> this on a test instance which too fails with a similar error.

If pg_upgrade can't handle millions of objects/transactions/XIDs, that seems
like a legitimate complaint, since apparently the system is working okay
otherwise.

But it also seems like you're using it outside the range of its intended use
(see also [1]).  I'm guessing that not many people are going to spend time
running tests of pg_upgrade, each of which takes 25hr, not to mention requiring
some multiple of 128GB of RAM+swap.

Creating millions of large objects like this was too slow to be usable for testing:
| time { echo 'begin;'; for a in `seq 1 99999`; do echo '\lo_import /dev/null'; done;
|   echo 'commit;'; } | psql -qh /tmp postgres &

This seems to be enough for what's needed:
| ALTER SYSTEM SET fsync=no; ALTER SYSTEM SET full_page_writes=no; SELECT pg_reload_conf();
| INSERT INTO pg_largeobject_metadata SELECT a, 0 FROM generate_series(100000, 200111222) a;

Now, my test pg_upgrade was killed after running for 100min and using 60GB of
RAM, so you might say that's a problem too.  I converted getBlobs() to use a
cursor, like dumpBlobs(), but it was still killed.  I think a test case and a
way to exercise this failure with a more reasonable amount of time and
resources might be a prerequisite for a patch to fix it.

pg_upgrade is meant for "immediate" upgrades, frequently allowing an upgrade in
minutes where pg_dump | pg_restore might take hours or days.  There are two
components to consider: the catalog/metadata part, and the data part.  If the
data is large (let's say more than 100GB), then pg_upgrade is expected to be an
improvement over the "dump and restore" process, which is usually infeasible
for large DBs measured in TB.

But if the *catalog* part is large, pg_upgrade still has to run pg_dump and
pg_restore, and the time to do that might dominate over the data part.  Our own
customers' DBs are 100s of GB to 10TB.  For our larger customers, pg_upgrade
takes 45min.  In the past, we had tables with many column defaults, which
caused the dump+restore to be slow for a larger fraction of customers.

If it were me, in an EOL situation, I would look at either: 1) finding a way to
do dump+restore rather than pg_upgrade; and/or 2) separately dumping the large
objects, dropping as many as you can, then running pg_upgrade on the DB, then
restoring the large objects (a rough sketch follows).  (And finding a better
way to store them in the future.)
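
For (2), a rough, untested sketch of the idea, using psql's \lo_export/\lo_import
and the server-side lo_unlink() function (database names and the /backup/lobs
path here are made up):
| # export each large object to a file named after its OID
| psql -Atc "SELECT oid FROM pg_largeobject_metadata" olddb |
| while read loid; do psql -qc "\lo_export $loid /backup/lobs/$loid" olddb; done
| # unlink them so pg_upgrade has (almost) no large objects left to handle
| psql -c "SELECT lo_unlink(oid) FROM pg_largeobject_metadata" olddb
| # ... run pg_upgrade ...
| # re-import into the new cluster (note: \lo_import assigns new OIDs)
| for f in /backup/lobs/*; do psql -qc "\lo_import $f" newdb; done
Note that plain \lo_import assigns new OIDs, so this only helps if nothing
depends on the old OIDs; the server-side lo_import(path, oid) function can
preserve them, at the cost of reading the files from the server's filesystem.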

I was able to hack pg_upgrade to call pg_restore --single-transaction (with a
separate invocation to handle --create).  That passes tests... but I can't say
much beyond that.
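
Roughly, the restore side of that hack amounts to something like the following
(a hypothetical sketch, not the actual pg_upgrade code; db.dump, db.toc, and
the database names are made up):
| # 1) restore only the dump's CREATE DATABASE entry
| pg_restore -l db.dump | grep ' DATABASE ' > db.toc
| pg_restore --create --dbname template1 -L db.toc db.dump
| # 2) restore everything else inside a single transaction
| pg_restore --single-transaction --exit-on-error --dbname mydb db.dump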

Regarding your existing patch: "make check" only tests SQL features.
For development, you'll want to configure like:
| ./configure --enable-debug --enable-cassert --enable-tap-tests
And then use "make check-world", and in particular:
| time make check -C src/bin/pg_resetwal
| time make check -C src/bin/pg_upgrade

I don't think pg_restore needs a user-facing option for XIDs.  I think it
should "just work", since a user might be as likely to shoot themselves in the
foot with a commandline option as they are to make an upgrade succeed that
would otherwise fail.  pg_upgrade has a --check mode, and if that passes, the
upgrade is intended to work, and not fail halfway through, between the schema
dump and restore, with the expectation that the user knows to rerun with some
commandline flags.  If you pursue the patch with setting a different XID
threshold, maybe you could count the number of objects to be created, or
transactions to be used, and use that as the argument to pg_resetwal?  I'm not
sure, but pg_restore -l might be a good place to start looking.
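
For example, something like this (a hedged sketch against a hypothetical
custom-format dump named db.dump; large objects show up as "BLOB" TOC entries):
| pg_restore -l db.dump | grep -c ' BLOB '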

I think a goal for this patch should be to allow an increased number of
objects to be handled by pg_upgrade.  Large objects may be a special case, and
increasing the number of other objects to be restored to the 100s of millions
might be unimportant.

-- 
Justin

[1] https://www.postgresql.org/message-id/502641.1606334432%40sss.pgh.pa.us
| Does pg_dump really have sane performance for that situation, or
| are we soon going to be fielding requests to make it not be O(N^2)
| in the number of listed tables?
From cfc7400bb021659d49170e8b17d067c8e1b9fa33 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryz...@telsasoft.com>
Date: Tue, 9 Mar 2021 14:06:17 -0600
Subject: [PATCH] pg_dump: use a cursor in getBlobs..

..to mitigate huge memory use in the case of millions of large objects
---
 src/bin/pg_dump/pg_dump.c | 96 +++++++++++++++++++++------------------
 1 file changed, 52 insertions(+), 44 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index aa02ada079..3fd7f48605 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3333,7 +3333,7 @@ getBlobs(Archive *fout)
 	BlobInfo   *binfo;
 	DumpableObject *bdata;
 	PGresult   *res;
-	int			ntups;
+	int			ntups, total = 0;
 	int			i;
 	int			i_oid;
 	int			i_lomowner;
@@ -3341,9 +3341,12 @@ getBlobs(Archive *fout)
 	int			i_rlomacl;
 	int			i_initlomacl;
 	int			i_initrlomacl;
+	const char		*blobFetchQry = "FETCH 1000 IN blob";
 
 	pg_log_info("reading large objects");
 
+	appendPQExpBuffer(blobQry, "DECLARE blob CURSOR FOR ");
+
 	/* Fetch BLOB OIDs, and owner/ACL data if >= 9.0 */
 	if (fout->remoteVersion >= 90600)
 	{
@@ -3393,58 +3396,66 @@ getBlobs(Archive *fout)
 							 "NULL::oid AS initrlomacl "
 							 " FROM pg_largeobject");
 
-	res = ExecuteSqlQuery(fout, blobQry->data, PGRES_TUPLES_OK);
+	ExecuteSqlStatement(fout, blobQry->data);
+	destroyPQExpBuffer(blobQry);
 
-	i_oid = PQfnumber(res, "oid");
-	i_lomowner = PQfnumber(res, "rolname");
-	i_lomacl = PQfnumber(res, "lomacl");
-	i_rlomacl = PQfnumber(res, "rlomacl");
-	i_initlomacl = PQfnumber(res, "initlomacl");
-	i_initrlomacl = PQfnumber(res, "initrlomacl");
+	do {
+		res = ExecuteSqlQuery(fout, blobFetchQry, PGRES_TUPLES_OK);
 
-	ntups = PQntuples(res);
+		i_oid = PQfnumber(res, "oid");
+		i_lomowner = PQfnumber(res, "rolname");
+		i_lomacl = PQfnumber(res, "lomacl");
+		i_rlomacl = PQfnumber(res, "rlomacl");
+		i_initlomacl = PQfnumber(res, "initlomacl");
+		i_initrlomacl = PQfnumber(res, "initrlomacl");
 
-	/*
-	 * Each large object has its own BLOB archive entry.
-	 */
-	binfo = (BlobInfo *) pg_malloc(ntups * sizeof(BlobInfo));
+		ntups = PQntuples(res);
+		total += ntups;
 
-	for (i = 0; i < ntups; i++)
-	{
-		binfo[i].dobj.objType = DO_BLOB;
-		binfo[i].dobj.catId.tableoid = LargeObjectRelationId;
-		binfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid));
-		AssignDumpId(&binfo[i].dobj);
+		/*
+		 * Each large object has its own BLOB archive entry.
+		 */
+		binfo = (BlobInfo *) pg_malloc(ntups * sizeof(BlobInfo));
 
-		binfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid));
-		binfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_lomowner));
-		binfo[i].blobacl = pg_strdup(PQgetvalue(res, i, i_lomacl));
-		binfo[i].rblobacl = pg_strdup(PQgetvalue(res, i, i_rlomacl));
-		binfo[i].initblobacl = pg_strdup(PQgetvalue(res, i, i_initlomacl));
-		binfo[i].initrblobacl = pg_strdup(PQgetvalue(res, i, i_initrlomacl));
+		for (i = 0; i < ntups; i++)
+		{
+			binfo[i].dobj.objType = DO_BLOB;
+			binfo[i].dobj.catId.tableoid = LargeObjectRelationId;
+			binfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid));
+			AssignDumpId(&binfo[i].dobj);
+
+			binfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid));
+			binfo[i].rolname = pg_strdup(PQgetvalue(res, i, i_lomowner));
+			binfo[i].blobacl = pg_strdup(PQgetvalue(res, i, i_lomacl));
+			binfo[i].rblobacl = pg_strdup(PQgetvalue(res, i, i_rlomacl));
+			binfo[i].initblobacl = pg_strdup(PQgetvalue(res, i, i_initlomacl));
+			binfo[i].initrblobacl = pg_strdup(PQgetvalue(res, i, i_initrlomacl));
+
+			if (PQgetisnull(res, i, i_lomacl) &&
+				PQgetisnull(res, i, i_rlomacl) &&
+				PQgetisnull(res, i, i_initlomacl) &&
+				PQgetisnull(res, i, i_initrlomacl))
+				binfo[i].dobj.dump &= ~DUMP_COMPONENT_ACL;
 
-		if (PQgetisnull(res, i, i_lomacl) &&
-			PQgetisnull(res, i, i_rlomacl) &&
-			PQgetisnull(res, i, i_initlomacl) &&
-			PQgetisnull(res, i, i_initrlomacl))
-			binfo[i].dobj.dump &= ~DUMP_COMPONENT_ACL;
+			/*
+			 * In binary-upgrade mode for blobs, we do *not* dump out the blob
+			 * data, as it will be copied by pg_upgrade, which simply copies the
+			 * pg_largeobject table. We *do* however dump out anything but the
+			 * data, as pg_upgrade copies just pg_largeobject, but not
+			 * pg_largeobject_metadata, after the dump is restored.
+			 */
+			if (dopt->binary_upgrade)
+				binfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA;
+		}
 
-		/*
-		 * In binary-upgrade mode for blobs, we do *not* dump out the blob
-		 * data, as it will be copied by pg_upgrade, which simply copies the
-		 * pg_largeobject table. We *do* however dump out anything but the
-		 * data, as pg_upgrade copies just pg_largeobject, but not
-		 * pg_largeobject_metadata, after the dump is restored.
-		 */
-		if (dopt->binary_upgrade)
-			binfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA;
-	}
+		PQclear(res);
+	} while (ntups != 0);
 
 	/*
 	 * If we have any large objects, a "BLOBS" archive entry is needed. This
 	 * is just a placeholder for sorting; it carries no data now.
 	 */
-	if (ntups > 0)
+	if (total > 0)
 	{
 		bdata = (DumpableObject *) pg_malloc(sizeof(DumpableObject));
 		bdata->objType = DO_BLOB_DATA;
@@ -3452,9 +3463,6 @@ getBlobs(Archive *fout)
 		AssignDumpId(bdata);
 		bdata->name = pg_strdup("BLOBS");
 	}
-
-	PQclear(res);
-	destroyPQExpBuffer(blobQry);
 }
 
 /*
-- 
2.17.0
