v13 has added a proper test comparing original and restored table data


On Tue, Jan 27, 2026 at 11:43 PM Hannu Krosing <[email protected]> wrote:
>
> Hi David
>
> Thanks for reviewing.
>
> Please hold back reviewing this v12 patch until I have verified that
> it passes CfBot and until I have finished my testing with 17TB table.
>
> I would appreciate pointers or help on turning the data-correctness
> checks into real tests in the TAP suite.
> For now I put them as DO $$ ... $$ blocks in the parallel test .pl and
> check manually that the table data checksums match.
> If we add such tests we should add them to the other pg_dump tests as
> well: currently we only test that dump and restore do not fail, not
> that the restored data is correct.
>
> On Fri, Jan 23, 2026 at 3:15 AM David Rowley <[email protected]> wrote:
> >
> > On Fri, 23 Jan 2026 at 06:05, Hannu Krosing <[email protected]> wrote:
> > >
> > > Fixing all the warnings
> >
> > I think overall this needs significantly more care and precision than
> > what you've given it so far. For example, you have:
> >
> > +    if(dopt->max_table_segment_pages != InvalidBlockNumber)
> > +        appendPQExpBufferStr(query,
> > "pg_relation_size(c.oid)/current_setting('block_size')::int AS
> > relpages, ");
> > +    else
> > +        appendPQExpBufferStr(query, "c.relpages, ");
> >
> > Note that pg_class.relpages is "int". Later the code in master does:
> >
> > tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
>
> I have now fixed the base issue by changing the data type of
> TableInfo.relpages to BlockNumber, and also changed the way we get it
> by
>
> 1. converting it to unsigned int (c.relpages::oid) in the query
> 2. reading it from the result using strtoul()
>
> (Technically it should have been enough to just use strtoul(), since
> it already wraps negative input around to the corresponding unsigned
> value, but doing the conversion in the query seems cleaner.)
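>
> For illustration, here is a minimal standalone sketch of that wrap
> behavior (the BlockNumber typedef is inlined here; in the tree it
> comes from storage/block.h), assuming a 64-bit unsigned long:
>
>     #include <stdio.h>
>     #include <stdlib.h>
>
>     typedef unsigned int BlockNumber;
>
>     int
>     main(void)
>     {
>         /* a wrapped pg_class.relpages prints as a negative int;
>          * strtoul() negates it in the unsigned return type, and
>          * truncating to 32 bits recovers the intended page count */
>         BlockNumber wrapped = (BlockNumber) strtoul("-2147483648", NULL, 10);
>         BlockNumber direct = (BlockNumber) strtoul("2147483648", NULL, 10);
>
>         printf("%u %u\n", wrapped, direct);   /* both print 2147483648 */
>         return 0;
>     }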
>
> This allowed removing the (BlockNumber) casts everywhere .relpages
> was used.
>
> Functionally, the value was only ever used for ordering parallel dump
> jobs, and even that only loosely, which explains why patch v10 did
> not break anything.
>
> I also changed the data type of TocEntry.dataLength from pgoff_t to
> uint64. The current code clearly had an overflow when off_t was 32
> bits and the sum of relpages from the heap and TOAST was larger than
> fits in it. (Even 2 x UINT32_MAX fits easily in uint64.)
>
> > If you look in vacuum.c, you'll see "pgcform->relpages = (int32)
> > num_pages;" that the value stored in relpages will be negative when
> > the table is >= 16TB (assuming 8k pages). Your pg_relation_size
> > expression is not going to produce an INT. It'll produce a BIGINT, per
> > "select pg_typeof(pg_relation_size('pg_class') /
> > current_setting('block_size')::int);". So the atoi() can receive a
> > string of digits representing an integer larger than INT_MAX in this
> > case. Looking at [1], I see:
>
> As said above, this should be fixed now by using the correct type in
> the struct and strtoul().
> To be sure, I have now created a 17TB table and am running some tests
> on it as well.
> Will let you know here when done.
>
> > "7.22.1 Numeric conversion functions 1 The functions atof, atoi, atol,
> > and atoll need not affect the value of the integer expression errno on
> > an error. If the value of the result cannot be represented, *the
> > behavior is undefined.*"
> >
> > And testing locally, I see that my Microsoft compiler will just return
> > INT_MAX on overflow, whereas I see gcc does nothing to prevent
> > overflows and just continues to multiply by 10 regardless of what
> > overflows occur, which I think would just make the code work by
> > accident.
>
> As .relpages was only ever used for ordering parallel copies, the
> code did work, just not optimally.
>
> The old code had a similar overflow/wraparound when off_t is a 32-bit
> int and the sum of relpages from the heap and TOAST tables is above
> INT_MAX.
>
> I have removed the code that partially fixed this for the case where
> one of them was > 0x7fffffff (i.e. negative) by pinning dataLength to
> INT_MAX.
>
> > Aside from that, nothing in the documentation mentions that this is
> > for "heap" tables only. That should be mentioned as it'll just result
> > in people posting questions about why it's not working for some other
> > table access method. There's also not much care for white space.
> > You've introduced a bunch of whitespace changes unrelated to code
> > changes you've made, plus there's not much regard for following
> > project standard. For example, you commonly do "if(" and don't
> > consistently follow the bracing rules, e.g:
> >
> > + for(chkptr = optarg; *chkptr != '\0'; chkptr++)
> > +     if(*chkptr == '-')
>
> I assumed that this was the classical "single statement -- no braces"
> case.
>
> Do we have a writeup of our coding standards somewhere?
>
> Now this specific case is rewritten using while(), so it should be OK.
>
> > Things like the following help convey the level of care that's gone into 
> > this:
> >
> > +/*
> > + * option_parse_int
> > + *
> > + * Parse integer value for an option.  If the parsing is successful, 
> > returns
> > + * true and stores the result in *result if that's given; if parsing fails,
> > + * returns false.
> > + */
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> >
> > i.e zero effort gone in to modify the comments after pasting them from
> > option_parse_int().
> >
> > Another example:
> >
> > + pg_log_error("%s musst be in range %lu..%lu",
> >
> > Also, I have no comprehension of why you'd use uint64 for the valid
> > range when the function is for processing uint32 types in:
>
> The uint64 there I picked up from the referenced unsigned long usage
> in pg_resetwal, after I managed to get pg_log_warning to print -1 for
> format %u and did not want to go off and debug why that happens.
>
> I have now made all the arguments uint32.
>
> > +bool
> > +option_parse_uint32(const char *optarg, const char *optname,
> > + uint64 min_range, uint64 max_range,
> > + uint32 *result)
> >
> > In its current state, it's quite hard to take this patch seriously.
> > Please spend longer self-reviewing it before posting. You could
> > temporarily hard-code something for testing which makes at least 1
> > table appear to be larger than 16TB and ensure your code works. What
> > you have is visually broken and depends on whatever the atoi
> > implementation opts to do in the overflow case. These are all things
> > diligent commiters will be testing and it's sad to see how little
> > effort you're putting into this. How do you expect this community to
> > scale with this quality level of patch submissions? You've been around
> > long enough and should know and do better.  Are you just expecting the
> > committer to fix these things for you? That work does not get done via
> > magic wand. Being on v10 already, I'd have expected the patch to be
> > far beyond proof of concept grade. If you're withholding investing
> > time on this until you see more community buy-in, then I'd suggest you
> > write that and withhold further revisions until you're happy with the
> > level of buy-in.
>
> > I'm also still not liking your de-normalised TableInfo representation
> > for "is_segment".
> > IMO, InvalidBlockNumber should be used to represent
> > open bounded ranges, and if there's no chunking, then startPage and
> > endPage will both be InvalidBlockNumber.
>
> That's what I ended up doing.
>
> I switched to using startPage = InvalidBlockNumber to indicate that no
> chunking is in effect.
>
> This is safe because when chunking is in use I always set the lower
> bound, and I set the chunk end page wherever possible.
>
> Only for the last segment is endPage left as InvalidBlockNumber, to
> mark an open-ended range.
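>
> As a standalone sketch (SegmentBounds and segment_kind are
> hypothetical stand-ins for the patch's TableDataInfo fields, not
> patch code), the three states the sentinels encode are:
>
>     typedef unsigned int BlockNumber;
>     #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
>
>     typedef struct { BlockNumber startPage, endPage; } SegmentBounds;
>
>     static const char *
>     segment_kind(const SegmentBounds *b)
>     {
>         if (b->startPage == InvalidBlockNumber)
>             return "whole table (no chunking)";
>         if (b->endPage == InvalidBlockNumber)
>             return "final segment (open-ended page range)";
>         return "bounded segment (startPage..endPage)";
>     }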
>
> > IMO, what you have now
> > needlessly allows invalid states where is_segment == true and
> > startPage, endPage are not set correctly. If you want to keep the code
> > simple, hide the complexity in a macro or an inline function. There's
> > just no performance reason to materialise the more complex condition
> > into a dedicated boolean flag.
From e598191f7464ca2ecfa9779a823d1aa8a409cdf7 Mon Sep 17 00:00:00 2001
From: Hannu Krosing <[email protected]>
Date: Wed, 28 Jan 2026 18:24:19 +0100
Subject: [PATCH v13] pg_dump: add --max-table-segment-pages for segmented
 table data dumps

* changed flag name to max-table-segment-pages
* added check for amname = "heap"
* added simple chunked dump and restore test
* changed the data type of TableInfo.relpages to BlockNumber
  * select it using relpages::oid to get an unsigned int out
  * read it from the query result using strtoul()
  * removed a bunch of casts of .relpages to (BlockNumber)
* changed the data type of TocEntry.dataLength to uint64; the previous
  pgoff_t certainly had an overflow in the 32-bit case when heap
  relpages + toast relpages > INT_MAX
* switched to pg_relation_size(c.oid)/current_setting('block_size')::int
  when --max-table-segment-pages is set
* added documentation
* added option_parse_uint32(...) to be used for the full range of page
  numbers

* TESTS: added a test to compare original and restored table contents
---
 doc/src/sgml/ref/pg_dump.sgml             |  24 +++
 src/bin/pg_dump/pg_backup.h               |   2 +
 src/bin/pg_dump/pg_backup_archiver.c      |   2 +
 src/bin/pg_dump/pg_backup_archiver.h      |   2 +-
 src/bin/pg_dump/pg_dump.c                 | 169 +++++++++++++++++-----
 src/bin/pg_dump/pg_dump.h                 |  22 ++-
 src/bin/pg_dump/t/004_pg_dump_parallel.pl |  31 ++++
 src/fe_utils/option_utils.c               |  55 +++++++
 src/include/fe_utils/option_utils.h       |   3 +
 9 files changed, 268 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 688e23c0e90..1811c67d141 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1088,6 +1088,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--max-table-segment-pages=<replaceable class="parameter">npages</replaceable></option></term>
+      <listitem>
+       <para>
+        Dump table data in segments of at most
+        <replaceable class="parameter">npages</replaceable> pages of the
+        main relation.  If a table's main relation has more pages than
+        <replaceable class="parameter">npages</replaceable>, its data is
+        split into multiple segments, which can be dumped in parallel.
+        Only tables using the <literal>heap</literal> table access method
+        are split; other tables are dumped as a whole.
+       </para>
+
+       <note>
+        <para>
+         The limit applies only to pages of the table's main heap; if the
+         table has a large <acronym>TOAST</acronym> table, take that into
+         account when choosing the number of pages.  In the extreme case a
+         single 8kB heap page can hold roughly 200 TOAST pointers, each
+         referencing up to 1GB of data, so if that data is also
+         incompressible, a single-page segment can produce a dump file of
+         around 200GB.
+        </para>
+       </note>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--no-comments</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index d9041dad720..b63ae05d895 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -27,6 +27,7 @@
 #include "common/file_utils.h"
 #include "fe_utils/simple_list.h"
 #include "libpq-fe.h"
+#include "storage/block.h"
 
 
 typedef enum trivalue
@@ -178,6 +179,7 @@ typedef struct _dumpOptions
 	bool		aclsSkip;
 	const char *lockWaitTimeout;
 	int			dump_inserts;	/* 0 = COPY, otherwise rows per INSERT */
+	BlockNumber	max_table_segment_pages; /* chunk when relpages is above this */
 
 	/* flags for various command-line long options */
 	int			disable_dollar_quoting;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 4a63f7392ae..ed1913d66bc 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -44,6 +44,7 @@
 #include "pg_backup_archiver.h"
 #include "pg_backup_db.h"
 #include "pg_backup_utils.h"
+#include "storage/block.h"
 
 #define TEXT_DUMP_HEADER "--\n-- PostgreSQL database dump\n--\n\n"
 #define TEXT_DUMPALL_HEADER "--\n-- PostgreSQL database cluster dump\n--\n\n"
@@ -154,6 +155,7 @@ InitDumpOptions(DumpOptions *opts)
 	opts->dumpSchema = true;
 	opts->dumpData = true;
 	opts->dumpStatistics = false;
+	opts->max_table_segment_pages = InvalidBlockNumber;
 }
 
 /*
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 325b53fc9bd..b6a9f16a122 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -377,7 +377,7 @@ struct _tocEntry
 	size_t		defnLen;		/* length of dumped definition */
 
 	/* working state while dumping/restoring */
-	pgoff_t		dataLength;		/* item's data size; 0 if none or unknown */
+	uint64		dataLength;		/* item's data size; 0 if none or unknown */
 	int			reqs;			/* do we need schema and/or data of object
 								 * (REQ_* bit mask) */
 	bool		created;		/* set for DATA member if TABLE was created */
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 687dc98e46d..0badb245b55 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -539,6 +539,7 @@ main(int argc, char **argv)
 		{"exclude-extension", required_argument, NULL, 17},
 		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 		{"restrict-key", required_argument, NULL, 25},
+		{"max-table-segment-pages", required_argument, NULL, 26},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,6 +804,13 @@ main(int argc, char **argv)
 				dopt.restrict_key = pg_strdup(optarg);
 				break;
 
+			case 26:
+				if (!option_parse_uint32(optarg, "--max-table-segment-pages",
+										 1, MaxBlockNumber,
+										 &dopt.max_table_segment_pages))
+					exit_nicely(1);
+				pg_log_debug("chunking: max_table_segment_pages set to %u",
+							 dopt.max_table_segment_pages);
+				break;
+
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -1372,6 +1380,9 @@ help(const char *progname)
 	printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
 	printf(_("  --filter=FILENAME            include or exclude objects and data from dump\n"
 			 "                               based on expressions in FILENAME\n"));
+	printf(_("  --max-table-segment-pages=NUMPAGES\n"
+		     "                               Number of main table pages above which data is \n"
+			 "                               copied out in chunks, also determines the chunk size\n"));
 	printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));
 	printf(_("  --include-foreign-data=PATTERN\n"
 			 "                               include data of foreign tables on foreign\n"
@@ -2412,7 +2423,7 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 	 * a filter condition was specified.  For other cases a simple COPY
 	 * suffices.
 	 */
-	if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
+	if (tdinfo->filtercond || is_segment(tdinfo) || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
 	{
 		/* Temporary allows to access to foreign tables to dump data */
 		if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)
@@ -2428,9 +2439,23 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 		else
 			appendPQExpBufferStr(q, "* ");
 
-		appendPQExpBuffer(q, "FROM %s %s) TO stdout;",
+		appendPQExpBuffer(q, "FROM %s %s",
 						  fmtQualifiedDumpable(tbinfo),
 						  tdinfo->filtercond ? tdinfo->filtercond : "");
+		if (is_segment(tdinfo))
+		{
+			appendPQExpBufferStr(q, tdinfo->filtercond ? " AND " : " WHERE ");
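+			/*
+			 * An offset of 32000 is comfortably above MaxHeapTuplesPerPage
+			 * (291 for 8kB pages), so '(page,32000)' acts as an inclusive
+			 * upper bound for every tuple on 'page'.
+			 */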
+			if (tdinfo->startPage == 0)
+				appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
+			else if (tdinfo->endPage != InvalidBlockNumber)
+				appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
+								  tdinfo->startPage, tdinfo->endPage);
+			else
+				appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
+			pg_log_debug("chunking: pages [%u:%u]", tdinfo->startPage, tdinfo->endPage);
+		}
+
+		appendPQExpBuffer(q, ") TO stdout;");
 	}
 	else
 	{
@@ -2438,6 +2463,9 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
 						  fmtQualifiedDumpable(tbinfo),
 						  column_list);
 	}
+
+	pg_log_warning("CHUNKING: data query: %s", q->data);
+	
 	res = ExecuteSqlQuery(fout, q->data, PGRES_COPY_OUT);
 	PQclear(res);
 	destroyPQExpBuffer(clistBuf);
@@ -2933,42 +2961,95 @@ dumpTableData(Archive *fout, const TableDataInfo *tdinfo)
 	{
 		TocEntry   *te;
 
-		te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
-						  ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
-									   .namespace = tbinfo->dobj.namespace->dobj.name,
-									   .owner = tbinfo->rolname,
-									   .description = "TABLE DATA",
-									   .section = SECTION_DATA,
-									   .createStmt = tdDefn,
-									   .copyStmt = copyStmt,
-									   .deps = &(tbinfo->dobj.dumpId),
-									   .nDeps = 1,
-									   .dumpFn = dumpFn,
-									   .dumpArg = tdinfo));
-
-		/*
-		 * Set the TocEntry's dataLength in case we are doing a parallel dump
-		 * and want to order dump jobs by table size.  We choose to measure
-		 * dataLength in table pages (including TOAST pages) during dump, so
-		 * no scaling is needed.
-		 *
-		 * However, relpages is declared as "integer" in pg_class, and hence
-		 * also in TableInfo, but it's really BlockNumber a/k/a unsigned int.
-		 * Cast so that we get the right interpretation of table sizes
-		 * exceeding INT_MAX pages.
+		/*
+		 * Data chunking works off relpages, which is computed exactly using
+		 * pg_relation_size() when --max-table-segment-pages is set.
+		 *
+		 * We also don't chunk if the table access method is not "heap".
+		 * TODO: we may add chunking for other access methods later, maybe
+		 * based on primary-key ranges.
+		 */
-		te->dataLength = (BlockNumber) tbinfo->relpages;
-		te->dataLength += (BlockNumber) tbinfo->toastpages;
+		if (tbinfo->relpages <= dopt->max_table_segment_pages ||
+			strcmp(tbinfo->amname, "heap") != 0)
+		{
+			te = ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = "TABLE DATA",
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = tdinfo));
 
-		/*
-		 * If pgoff_t is only 32 bits wide, the above refinement is useless,
-		 * and instead we'd better worry about integer overflow.  Clamp to
-		 * INT_MAX if the correct result exceeds that.
-		 */
-		if (sizeof(te->dataLength) == 4 &&
-			(tbinfo->relpages < 0 || tbinfo->toastpages < 0 ||
-			 te->dataLength < 0))
-			te->dataLength = INT_MAX;
+			/*
+			 * Set the TocEntry's dataLength in case we are doing a parallel dump
+			 * and want to order dump jobs by table size.  We choose to measure
+			 * dataLength in table pages (including TOAST pages) during dump, so
+			 * no scaling is needed.
+			 *
+			 * While pg_class.relpages, which holds a BlockNumber a/k/a unsigned
+			 * int, is declared as "integer", we convert it back and store it as
+			 * BlockNumber in TableInfo.  And dataLength is now uint64, so even
+			 * 2 x UINT32_MAX cannot overflow it.
+			 */
+			te->dataLength = tbinfo->relpages;
+			te->dataLength += tbinfo->toastpages;
+		}
+		else
+		{
+			uint64 current_chunk_start = 0;
+			PQExpBuffer chunk_desc = createPQExpBuffer();
+
+			pg_log_debug("chunking: toc for chunked relpages %u", tbinfo->relpages);
+
+			/* current_chunk_start is uint64, so the addition below cannot wrap */
+			while (current_chunk_start < tbinfo->relpages)
+			{
+				TableDataInfo *chunk_tdinfo = (TableDataInfo *) pg_malloc(sizeof(TableDataInfo));
+
+				memcpy(chunk_tdinfo, tdinfo, sizeof(TableDataInfo));
+				AssignDumpId(&chunk_tdinfo->dobj);
+				/* XXX do we need addObjectDependency() on the table here? */
+				chunk_tdinfo->startPage = (BlockNumber) current_chunk_start;
+				chunk_tdinfo->endPage = chunk_tdinfo->startPage + dopt->max_table_segment_pages - 1;
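+				/* endPage may point past the table's last page here; the
+				 * final chunk is switched to an open-ended range below */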
+
+				pg_log_warning("CHUNKING: toc for pages [%u:%u]",chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+				
+				current_chunk_start += dopt->max_table_segment_pages;
+				if (current_chunk_start >= tbinfo->relpages)
+					chunk_tdinfo->endPage = InvalidBlockNumber; /* last chunk is for "all the rest" */
+
+				if (chunk_tdinfo->endPage == InvalidBlockNumber)
+					printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:end)", chunk_tdinfo->startPage);
+				else
+					printfPQExpBuffer(chunk_desc, "TABLE DATA (pages %u:%u)", chunk_tdinfo->startPage, chunk_tdinfo->endPage);
+
+				te = ArchiveEntry(fout, chunk_tdinfo->dobj.catId, chunk_tdinfo->dobj.dumpId,
+							ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
+										.namespace = tbinfo->dobj.namespace->dobj.name,
+										.owner = tbinfo->rolname,
+										.description = chunk_desc->data,
+										.section = SECTION_DATA,
+										.createStmt = tdDefn,
+										.copyStmt = copyStmt,
+										.deps = &(tbinfo->dobj.dumpId),
+										.nDeps = 1,
+										.dumpFn = dumpFn,
+										.dumpArg = chunk_tdinfo));
+
+				if (chunk_tdinfo->endPage == InvalidBlockNumber)
+					te->dataLength = tbinfo->relpages - chunk_tdinfo->startPage;
+				else
+					te->dataLength = dopt->max_table_segment_pages;
+				/* assume TOAST pages distribute evenly among chunks */
+				if (tbinfo->relpages)
+					te->dataLength += te->dataLength * tbinfo->toastpages / tbinfo->relpages;
+			}
+
+			destroyPQExpBuffer(chunk_desc);
+		}
 	}
 
 	destroyPQExpBuffer(copyBuf);
@@ -3092,6 +3173,8 @@ makeTableDataInfo(DumpOptions *dopt, TableInfo *tbinfo)
 	tdinfo->dobj.namespace = tbinfo->dobj.namespace;
 	tdinfo->tdtable = tbinfo;
 	tdinfo->filtercond = NULL;	/* might get set later */
+	tdinfo->startPage = InvalidBlockNumber; /* indicates no chunking */
+	tdinfo->endPage = InvalidBlockNumber;
 	addObjectDependency(&tdinfo->dobj, tbinfo->dobj.dumpId);
 
 	/* A TableDataInfo contains data, of course */
@@ -7254,8 +7337,16 @@ getTables(Archive *fout, int *numTables)
 						 "c.relnamespace, c.relkind, c.reltype, "
 						 "c.relowner, "
 						 "c.relchecks, "
-						 "c.relhasindex, c.relhasrules, c.relpages, "
-						 "c.reltuples, c.relallvisible, ");
+						 "c.relhasindex, c.relhasrules, ");
+
+	/* fetch the current relation size if chunking is requested */
+	if (dopt->max_table_segment_pages != InvalidBlockNumber)
+		appendPQExpBufferStr(query, "pg_relation_size(c.oid)/current_setting('block_size')::int AS relpages, ");
+	else
+		/* pg_class.relpages holds a BlockNumber (uint32) in an int column; cast to oid to read it as unsigned */
+		appendPQExpBufferStr(query, "c.relpages::oid, ");
+
+	appendPQExpBufferStr(query, "c.reltuples, c.relallvisible, ");
 
 	if (fout->remoteVersion >= 180000)
 		appendPQExpBufferStr(query, "c.relallfrozen, ");
@@ -7495,7 +7586,7 @@ getTables(Archive *fout, int *numTables)
 		tblinfo[i].ncheck = atoi(PQgetvalue(res, i, i_relchecks));
 		tblinfo[i].hasindex = (strcmp(PQgetvalue(res, i, i_relhasindex), "t") == 0);
 		tblinfo[i].hasrules = (strcmp(PQgetvalue(res, i, i_relhasrules), "t") == 0);
-		tblinfo[i].relpages = atoi(PQgetvalue(res, i, i_relpages));
+		tblinfo[i].relpages = strtoul(PQgetvalue(res, i, i_relpages), NULL, 10);
 		if (PQgetisnull(res, i, i_toastpages))
 			tblinfo[i].toastpages = 0;
 		else
diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h
index 4c4b14e5fc7..be71661ac41 100644
--- a/src/bin/pg_dump/pg_dump.h
+++ b/src/bin/pg_dump/pg_dump.h
@@ -16,6 +16,7 @@
 
 #include "pg_backup.h"
 #include "catalog/pg_publication_d.h"
+#include "storage/block.h"
 
 
 #define oidcmp(x,y) ( ((x) < (y) ? -1 : ((x) > (y)) ?  1 : 0) )
@@ -335,7 +336,11 @@ typedef struct _tableInfo
 	Oid			owning_tab;		/* OID of table owning sequence */
 	int			owning_col;		/* attr # of column owning sequence */
 	bool		is_identity_sequence;
-	int32		relpages;		/* table's size in pages (from pg_class) */
+	BlockNumber	relpages;		/* table's size in pages: pg_class.relpages
+								 * read as unsigned, or computed from
+								 * pg_relation_size() when
+								 * --max-table-segment-pages is set */
 	int			toastpages;		/* toast table's size in pages, if any */
 
 	bool		interesting;	/* true if need to collect more data */
@@ -413,8 +418,21 @@ typedef struct _tableDataInfo
 	DumpableObject dobj;
 	TableInfo  *tdtable;		/* link to table to dump */
 	char	   *filtercond;		/* WHERE condition to limit rows dumped */
+	/* startPage and endPage support segmented (chunked) data dumps */
+	BlockNumber	startPage;		/* first page of the segment.  The lower
+								 * bound is always known when chunking, so
+								 * InvalidBlockNumber here means "no
+								 * chunking"; 0 (the first segment) lets us
+								 * emit a one-sided range query */
+	BlockNumber	endPage;		/* last page of the segment, normally
+								 * startPage + max_table_segment_pages - 1;
+								 * InvalidBlockNumber for the final segment
+								 * to indicate an open-ended range */
 } TableDataInfo;
 
+#define is_segment(tdiptr) ((tdiptr)->startPage != InvalidBlockNumber)
+
 typedef struct _indxInfo
 {
 	DumpableObject dobj;
@@ -448,7 +466,7 @@ typedef struct _indexAttachInfo
 typedef struct _relStatsInfo
 {
 	DumpableObject dobj;
-	int32		relpages;
+	BlockNumber	relpages;
 	char	   *reltuples;
 	int32		relallvisible;
 	int32		relallfrozen;
diff --git a/src/bin/pg_dump/t/004_pg_dump_parallel.pl b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
index 738f34b1c1b..4f35aeed9b9 100644
--- a/src/bin/pg_dump/t/004_pg_dump_parallel.pl
+++ b/src/bin/pg_dump/t/004_pg_dump_parallel.pl
@@ -11,6 +11,7 @@ use Test::More;
 my $dbname1 = 'regression_src';
 my $dbname2 = 'regression_dest1';
 my $dbname3 = 'regression_dest2';
+my $dbname4 = 'regression_dest3';
 
 my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
@@ -21,6 +22,7 @@ my $backupdir = $node->backup_dir;
 $node->run_log([ 'createdb', $dbname1 ]);
 $node->run_log([ 'createdb', $dbname2 ]);
 $node->run_log([ 'createdb', $dbname3 ]);
+$node->run_log([ 'createdb', $dbname4 ]);
 
 $node->safe_psql(
 	$dbname1,
@@ -87,4 +89,33 @@ $node->command_ok(
 	],
 	'parallel restore as inserts');
 
+$node->command_ok(
+	[
+		'pg_dump',
+		'--format' => 'directory',
+		'--max-table-segment-pages' => 2,
+		'--no-sync',
+		'--jobs' => 2,
+		'--file' => "$backupdir/dump3",
+		$node->connstr($dbname1),
+	],
+	'parallel dump with chunks of two heap pages');
+
+$node->command_ok(
+	[
+		'pg_restore', '--verbose',
+		'--dbname' => $node->connstr($dbname4),
+		'--jobs' => 3,
+		"$backupdir/dump3",
+	],
+	'parallel restore with chunks of two heap pages');
+
+my $table = 'tplain';
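+# sum(hashtext(row::text)) is order-independent, so together with count(*)
+# it serves as a cheap whole-table content checksum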
+my $tablehash_query = "SELECT '$table', sum(hashtext(t::text)), count(*) FROM $table AS t";
+
+my $result_1 = $node->safe_psql($dbname1, $tablehash_query);
+my $result_4 = $node->safe_psql($dbname4, $tablehash_query);
+
+is($result_4, $result_1, "Hash check for $table: restored db ($result_4) vs original db ($result_1)");
+
 done_testing();
diff --git a/src/fe_utils/option_utils.c b/src/fe_utils/option_utils.c
index cc483ae176c..aff1fbd31a3 100644
--- a/src/fe_utils/option_utils.c
+++ b/src/fe_utils/option_utils.c
@@ -83,6 +83,61 @@ option_parse_int(const char *optarg, const char *optname,
 	return true;
 }
 
+/*
+ * option_parse_uint32
+ *
+ * Parse unsigned integer value for an option.  If the parsing is successful,
+ * returns true and stores the result in *result if that's given;
+ * if parsing fails, returns false.
+ */
+bool
+option_parse_uint32(const char *optarg, const char *optname,
+					uint32 min_range, uint32 max_range,
+					uint32 *result)
+{
+	char	   *endptr;
+	unsigned long val;
+
+	/*
+	 * strtoul() would silently accept and negate a value with a leading
+	 * minus sign, so reject negative input explicitly.
+	 */
+	while (isspace((unsigned char) *optarg))
+		optarg++;
+	if (*optarg == '-')
+	{
+		pg_log_error("value \"%s\" for option %s cannot be negative",
+					 optarg, optname);
+		return false;
+	}
+
+	errno = 0;
+	val = strtoul(optarg, &endptr, 10);
+
+	/*
+	 * Skip any trailing whitespace; if anything but whitespace remains before
+	 * the terminating character, fail.
+	 */
+	while (*endptr != '\0' && isspace((unsigned char) *endptr))
+		endptr++;
+
+	if (*endptr != '\0')
+	{
+		pg_log_error("invalid value \"%s\" for option %s",
+					 optarg, optname);
+		return false;
+	}
+
+	/*
+	 * Since min_range and max_range are uint32, this range check also
+	 * catches unsigned long values outside the 32-bit range.
+	 */
+	if (errno == ERANGE || val < min_range || val > max_range)
+	{
+		pg_log_error("%s must be in range %u..%u", optname, min_range, max_range);
+		return false;
+	}
+
+	if (result)
+		*result = (uint32) val;
+	return true;
+}
+
 /*
  * Provide strictly harmonized handling of the --sync-method option.
  */
diff --git a/src/include/fe_utils/option_utils.h b/src/include/fe_utils/option_utils.h
index 0db6e3b6e91..c74cd1fb595 100644
--- a/src/include/fe_utils/option_utils.h
+++ b/src/include/fe_utils/option_utils.h
@@ -22,6 +22,9 @@ extern void handle_help_version_opts(int argc, char *argv[],
 extern bool option_parse_int(const char *optarg, const char *optname,
 							 int min_range, int max_range,
 							 int *result);
+extern bool option_parse_uint32(const char *optarg, const char *optname,
+							 uint32 min_range, uint32 max_range,
+							 uint32 *result);
 extern bool parse_sync_method(const char *optarg,
 							  DataDirSyncMethod *sync_method);
 
-- 
2.43.0
