On 10/28/25 21:54, Gregory Smith wrote:
> On Sun, Oct 26, 2025 at 5:52 PM Tomas Vondra <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     I like (a) more, because it's more consistent with how I understand
>     m_w_m. It's weird
>     to say "use up to 20GB of memory" and then the system overrides that
>     with "1GB". 
>     I don't think it affects performance, though.
> 
> 
> There wasn't really that much gain from 1GB -> 20GB, I was using that
> setting for QA purposes more than measured performance.  During the
> early parts of an OSM build, you need to have a big Node Cache to hit
> max speed, 1/2 or more of a ~90GB file.  Once that part finishes,
> the 45GB+ cache block frees up and index building starts.  I just looked
> at how much was just freed and thought "ehhh...split it in half and
> maybe 20GB maintenance mem?"  Results seemed a little better than the
> 1GB setting I started at, so I've ran with that 20GB setting since.  
> 
> That was back in PG14 and so many bottlenecks have moved around.  Since
> reporting this bug I've done a set of PG18 tests with m_w_m=256MB, and
> one of them just broke my previous record time running PG17.  So even
> that size setting seems fine.
> 

Right, that matches my observations from testing the fixes.

I'd attribute this to caching effects when the accumulated GIN entries
fit into L3.

>     I also wonder how far are we from hitting the uint32 limits. FAICS with
>     m_w_m=24GB we might end up with too many elements, even with serial
>     index builds. It'd have to be a quite weird data set, though.
> 
> 
> Since I'm starting to doubt I ever really needed even 20GB, I wouldn't
> stress about supporting that much being important.  I'll see if I can
> trigger an overflow with a test case though, maybe it's worth protecting
> against even if it's not a functional setting.
> 

Yeah, I definitely want to protect against this. I believe similar
failures can happen even with much lower m_w_m values (possibly ~2-3GB),
although only with weird/skewed data sets. AFAICS a constant
single-element array would trigger this, but I haven't tested that.
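
Something like this is what I have in mind, although I haven't tried it
and the exact row count / settings are a guess:

  CREATE TABLE t (a int[]);
  INSERT INTO t SELECT ARRAY[1] FROM generate_series(1, 200000000);
  SET maintenance_work_mem = '3GB';
  SET max_parallel_maintenance_workers = 4;
  CREATE INDEX ON t USING gin (a);

With every row having the same single-element array, all TIDs accumulate
under one GIN key, so merging the per-worker lists in the leader could
cross MaxAllocSize (at 6 bytes per TID, ~180M TIDs is enough). Hitting
the uint32 element count is much harder - that takes about 2^32 TIDs,
i.e. 24GB of ItemPointerData for a single key.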

Serial builds can fail with large maintenance_work_mem too, like this:

  ERROR: posting list is too long
  HINT: Reduce "maintenance_work_mem".

but it's deterministic, and it's actually a proper error message, not
just some weird "invalid alloc size".

Attached is a v3 of the patch series. 0001 and 0002 were already posted,
and I believe either of those would address the issue. 0003 is more of
an optimization, further reducing the memory usage.

I'm putting this through additional testing, which takes time. But it
seems there's still some loose end in 0001, as I just got the "invalid
alloc request" failure with it applied ... I'll take a look tomorrow.


regards

-- 
Tomas Vondra
From a287cd4e711fba07029d873c9049010599616b6d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH v3 1/3] Allow parallel GIN builds to allocate large chunks

The parallel GIN builds used palloc/repalloc to maintain TID lists,
which with high maintenance_work_mem values can lead to failures like

    ERROR:  invalid memory alloc request size 1113001620

The reason is that while merging intermediate worker data, we call
GinBufferStoreTuple() which coalesces the TID lists, and the result
may not fit into MaxAllocSize.
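
For scale, 1113001620 bytes is about 185 million ItemPointerData
entries, i.e. just over the 1GB MaxAllocSize limit.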

Fixed by allowing huge allocations when merging TID lists, including
an existing palloc call in ginMergeItemPointers().

Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.

Reported-by: Gregory Smith <[email protected]>
Discussion: https://postgr.es/m/cahljucwdwn-pe2bmze4kux7x5wwt_6rowta0muqffedlez6...@mail.gmail.com
Backpatch-through: 18
---
 src/backend/access/gin/gininsert.c      | 7 ++++---
 src/backend/access/gin/ginpostinglist.c | 3 ++-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..2355b96b351 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1496,10 +1496,11 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 		 * still pass 0 as number of elements in that array though.
 		 */
 		if (buffer->items == NULL)
-			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+			buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+											MCXT_ALLOC_HUGE);
 		else
-			buffer->items = repalloc(buffer->items,
-									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+			buffer->items = repalloc_huge(buffer->items,
+										  (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
 
 		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
 								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 48eadec87b0..4fe46135238 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -381,7 +381,8 @@ ginMergeItemPointers(ItemPointerData *a, uint32 na,
 {
 	ItemPointerData *dst;
 
-	dst = (ItemPointer) palloc((na + nb) * sizeof(ItemPointerData));
+	dst = (ItemPointer) palloc_extended((na + nb) * sizeof(ItemPointerData),
+										MCXT_ALLOC_HUGE);
 
 	/*
 	 * If the argument arrays don't overlap, we can just append them to each
-- 
2.51.0

From dfe964ae6b9d8bbb58143ed3ebd74ccb44cb340e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sun, 26 Oct 2025 21:23:37 +0100
Subject: [PATCH v3 2/3] Split TID lists during parallel GIN build

When building intermediate TID lists during parallel GIN builds, split
the sorted lists into smaller chunks, to limit the amount of memory
needed when merging the chunks later.

The leader may need to keep in memory up to one chunk per worker, and
possibly one extra chunk (before evicting some of the data). We limit
the chunk size so that memory usage does not exceed MaxAllocSize (1GB).

This is desirable even with huge allocations allowed. Larger chunks do
not improve performance, so the increased memory usage would simply be
wasted.

Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.

Reported-by: Gregory Smith <[email protected]>
Discussion: https://postgr.es/m/cahljucwdwn-pe2bmze4kux7x5wwt_6rowta0muqffedlez6...@mail.gmail.com
Backpatch-through: 18
---
 src/backend/access/gin/gininsert.c | 48 +++++++++++++++++++++++-------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2355b96b351..d15e9a0cb0b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -152,7 +152,9 @@ typedef struct
 	 * only in the leader process.
 	 */
 	GinLeader  *bs_leader;
-	int			bs_worker_id;
+
+	/* number of participating workers (including leader) */
+	int			bs_num_workers;
 
 	/* used to pass information from workers to leader */
 	double		bs_numtuples;
@@ -483,6 +485,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 /*
  * ginFlushBuildState
  *		Write all data from BuildAccumulator into the tuplesort.
+ *
+ * The number of TIDs written to the tuplesort at once is limited, to reduce
+ * the amount of memory needed when merging the intermediate results later.
+ * The leader will see up to two chunks per worker, so calculate the limit to
+ * not need more than MaxAllocSize overall.
  */
 static void
 ginFlushBuildState(GinBuildState *buildstate, Relation index)
@@ -493,6 +500,11 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 	uint32		nlist;
 	OffsetNumber attnum;
 	TupleDesc	tdesc = RelationGetDescr(index);
+	uint32		maxlen;
+
+	/* maximum number of TIDs per chunk (two chunks per worker) */
+	maxlen = MaxAllocSize / sizeof(ItemPointerData);
+	maxlen /= (2 * buildstate->bs_num_workers);
 
 	ginBeginBAScan(&buildstate->accum);
 	while ((list = ginGetBAEntry(&buildstate->accum,
@@ -501,20 +513,31 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 		/* information about the key */
 		CompactAttribute *attr = TupleDescCompactAttr(tdesc, (attnum - 1));
 
-		/* GIN tuple and tuple length */
-		GinTuple   *tup;
-		Size		tuplen;
+		/* start of the chunk */
+		uint32		offset = 0;
 
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
+		/* split the entry into smaller chunks with up to maxlen items */
+		while (offset < nlist)
+		{
+			/* GIN tuple and tuple length */
+			GinTuple   *tup;
+			Size		tuplen;
+			uint32		len = Min(maxlen, nlist - offset);
 
-		tup = _gin_build_tuple(attnum, category,
-							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   &list[offset], len,
+								   &tuplen);
+
+			offset += maxlen;
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
-		pfree(tup);
+			pfree(tup);
+		}
 	}
 
 	MemoryContextReset(buildstate->tmpCtx);
@@ -2018,6 +2041,9 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* remember how much space is allowed for the accumulated entries */
 	state->work_mem = (sortmem / 2);
 
+	/* remember how many workers participate in the build */
+	state->bs_num_workers = ginshared->scantuplesortstates;
+
 	/* Begin "partial" tuplesort */
 	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
 													state->work_mem,
-- 
2.51.0

From af9327b95d1a2d016dc207a532e6d668d944b49c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 27 Oct 2025 23:58:10 +0100
Subject: [PATCH v3 3/3] Trim TIDs during parallel GIN builds more eagerly

The code froze the beginning of TID lists only when merging the lists,
which means we'd only trim the list when adding the next chunk. But we
can do better - we can update the number of frozen items earlier.

This is not expected to make a huge difference, but it can't hurt and
it's virtually free.

Discussion: https://postgr.es/m/cahljucwdwn-pe2bmze4kux7x5wwt_6rowta0muqffedlez6...@mail.gmail.com
---
 src/backend/access/gin/gininsert.c | 129 ++++++++++++++---------------
 1 file changed, 64 insertions(+), 65 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d15e9a0cb0b..37d72811b9a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1401,6 +1401,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 static bool
 GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 {
+	/*
+	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now find the last TID we know to be frozen, i.e. the last TID right
+	 * before the new GIN tuple.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher. If we already know the whole list is frozen
+	 * (i.e. nfrozen == nitems), this does nothing.
+	 *
+	 * XXX This might do a binary search for sufficiently long lists, but it
+	 * does not seem worth the complexity. Overlapping lists should be rare,
+	 * TID comparisons are cheap, and we should quickly freeze most of the
+	 * list.
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* not enough TIDs to trim (1024 is somewhat arbitrary number) */
 	if (buffer->nfrozen < 1024)
 		return false;
@@ -1445,6 +1487,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 	ItemPointerData *items;
 	Datum		key;
 
+	int			nnew;
+	ItemPointer new;
+
 	AssertCheckGinBuffer(buffer);
 
 	key = _gin_parse_tuple_key(tup);
@@ -1466,80 +1511,34 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/*
-	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
-	 * the mergesort. We can do that with TIDs before the first TID in the new
-	 * tuple we're about to add into the buffer.
-	 *
-	 * We do this incrementally when adding data into the in-memory buffer,
-	 * and not later (e.g. when hitting a memory limit), because it allows us
-	 * to skip the frozen data during the mergesort, making it cheaper.
-	 */
-
-	/*
-	 * Check if the last TID in the current list is frozen. This is the case
-	 * when merging non-overlapping lists, e.g. in each parallel worker.
-	 */
-	if ((buffer->nitems > 0) &&
-		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
-							GinTupleGetFirst(tup)) == 0))
-		buffer->nfrozen = buffer->nitems;
+	/* add the new TIDs into the buffer, combine using merge-sort */
 
 	/*
-	 * Now find the last TID we know to be frozen, i.e. the last TID right
-	 * before the new GIN tuple.
-	 *
-	 * Start with the first not-yet-frozen tuple, and walk until we find the
-	 * first TID that's higher. If we already know the whole list is frozen
-	 * (i.e. nfrozen == nitems), this does nothing.
-	 *
-	 * XXX This might do a binary search for sufficiently long lists, but it
-	 * does not seem worth the complexity. Overlapping lists should be rare
-	 * common, TID comparisons are cheap, and we should quickly freeze most of
-	 * the list.
+	 * Resize the array - we do this first, because we'll dereference the
+	 * first unfrozen TID, which would fail if the array is NULL. We'll still
+	 * pass 0 as number of elements in that array though.
 	 */
-	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
-	{
-		/* Is the TID after the first TID of the new tuple? Can't freeze. */
-		if (ItemPointerCompare(&buffer->items[i],
-							   GinTupleGetFirst(tup)) > 0)
-			break;
-
-		buffer->nfrozen++;
-	}
-
-	/* add the new TIDs into the buffer, combine using merge-sort */
-	{
-		int			nnew;
-		ItemPointer new;
-
-		/*
-		 * Resize the array - we do this first, because we'll dereference the
-		 * first unfrozen TID, which would fail if the array is NULL. We'll
-		 * still pass 0 as number of elements in that array though.
-		 */
-		if (buffer->items == NULL)
-			buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
-											MCXT_ALLOC_HUGE);
-		else
-			buffer->items = repalloc_huge(buffer->items,
-										  (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+	if (buffer->items == NULL)
+		buffer->items = palloc_extended((buffer->nitems + tup->nitems) * sizeof(ItemPointerData),
+										MCXT_ALLOC_HUGE);
+	else
+		buffer->items = repalloc_huge(buffer->items,
+									  (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
 
-		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
-								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
-								   items, tup->nitems, &nnew);
+	new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+							   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
+							   items, tup->nitems, &nnew);
 
-		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+	Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
-		memcpy(&buffer->items[buffer->nfrozen], new,
-			   nnew * sizeof(ItemPointerData));
+	memcpy(&buffer->items[buffer->nfrozen], new,
+		   nnew * sizeof(ItemPointerData));
 
-		pfree(new);
+	pfree(new);
 
-		buffer->nitems += tup->nitems;
+	buffer->nitems += tup->nitems;
 
-		AssertCheckItemPointers(buffer);
-	}
+	AssertCheckItemPointers(buffer);
 
 	/* free the decompressed TID list */
 	pfree(items);
-- 
2.51.0
