On 10/26/25 16:16, Tomas Vondra wrote:
> On 10/24/25 22:22, Gregory Smith wrote:
>> On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <[email protected]> wrote:
>>
>> ...
>>
>> Can you show the contents of buffer and tup? I'm especially interested
>> in these fields:
>>
>> buffer->nitems
>> buffer->maxitems
>> buffer->nfrozen
>>
>> tup->nitems
>>
>> I'll see if I can grab that data at the crash point.
>>
>> FYI for anyone who wants to replicate this: if you have a system with
>> 128GB+ of RAM you could probably recreate the test case. Just have to
>> download the Planet file and run osm2pgsql with the overly tweaked
>> settings I use. I've published all the details of how I run this
>> regression test now.
>>
>> Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
>> Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
>> Test runner: https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
>> Parse results: https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse
>>
> 
> I did reproduce this using OSM, although I used different settings, but
> that only affects loading. Setting maintenance_work_mem=20GB is more
> than enough to trigger the error during parallel index build.
> 
> So I don't need the data.
> 
>> If I'm right, I think there are two ways to fix this:
>> (1) apply the trimming earlier, i.e. try to freeze + flush before
>> actually merging the data (essentially, update nfrozen earlier)
>> (2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
>> Or we probably should do both.
>>
>> Sounds like (2) is probably mandatory and (1) is good hygiene.
>>
> 
> Yes, (2) is mandatory to fix this, and it's also sufficient. See the
> attached fix. I'll clean this up and push soon.
> 
> AFAICS (1) is not really needed. I was concerned we might end up with
> each worker producing a TID buffer close to maintenance_work_mem, and
> then the leader would have to use twice as much memory when merging. But
> it turns out I already thought about that, and the workers use a fair
> share of maintenance_work_mem, not a new limit. So they produce smaller
> chunks, and those should not exceed maintenance_work_mem when merging.
> 
> I tried "freezing" the existing buffer more eagerly (before merging the
> tuple), but that made no difference. The workers produce data with a lot
> of overlaps (simply because that's how the parallel builds divide data),
> and the amount of trimmed data is tiny. Something like 10k TIDs from a
> buffer of 1M TIDs. So a tiny difference, and it'd still fail.
> 
> I'm not against maybe experimenting with this, but it's going to be a
> master-only thing, not for backpatching.
> 
> Maybe we should split the data into smaller chunks while building tuples
> in ginFlushBuildState. That'd probably allow enforcing the memory limit
> more strictly, because we sometimes hold multiple copies of the TIDs
> arrays. But that's for master too.
>
I spoke too soon, apparently :-( (2) is not actually a fix. It does fix
some cases of the invalid alloc size failures, but the following call to
ginMergeItemPointers() can hit the same limit, because it does a palloc()
internally. I didn't notice this before because of the other experimental
changes, and because it seems to depend on which of the OSM indexes is
being built, with how many workers, etc.

I was a bit puzzled why we don't hit this with serial builds too, because
those call ginMergeItemPointers() as well. I guess that's just luck,
because with serial builds we're likely flushing the TID list in smaller
chunks, appending to an existing tuple, and it seems unlikely any of those
would cross the alloc limit. But for parallel builds we're pretty much
guaranteed to see all TIDs for a key at once.

I see two ways to fix this:

a) Do the (re)palloc_huge change, but then also change the palloc call in
ginMergeItemPointers. I'm not sure if we want to change the existing
function, or create a static copy in gininsert.c with this tweak (it
doesn't need anything else, so it's not that bad).

b) Do the data splitting in ginFlushBuildState, so that workers don't
generate chunks larger than MaxAllocSize/nworkers (for any key). The
leader then merges at most one chunk per worker at a time, so it still
fits into the alloc limit.

Both seem to work. I like (a) more, because it's more consistent with how
I understand m_w_m. It's weird to say "use up to 20GB of memory" and then
have the system override that with "1GB". I don't think it affects
performance, though.

I'll experiment with this a bit more; I just wanted to mention that the
fix I posted earlier does not actually fix the issue.

I also wonder how far we are from hitting the uint32 limits. AFAICS with
m_w_m=24GB we might end up with too many elements, even with serial index
builds. It'd have to be a quite weird data set, though.

regards

-- 
Tomas Vondra
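To put rough numbers on the two limits discussed above, here is a
back-of-the-envelope sketch (not taken from the patches; it assumes the
usual 6-byte ItemPointerData and MaxAllocSize of 0x3fffffff). A plain
palloc() tops out at roughly 179M TIDs, while a single key needs on the
order of 24GB worth of TIDs before a 32-bit item count could wrap, which
is where the m_w_m=24GB figure comes from:

/*
 * Back-of-the-envelope check of the limits mentioned above (a sketch,
 * not from the patches).  Assumes a 6-byte ItemPointerData and
 * MaxAllocSize of 0x3fffffff.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t max_alloc_size = 0x3fffffff; /* ~1GB - 1 */
    const uint64_t tid_size = 6;                /* sizeof(ItemPointerData) */

    /* largest TID array a plain palloc()/repalloc() can hold */
    printf("TIDs per palloc: %llu\n",
           (unsigned long long) (max_alloc_size / tid_size));

    /* memory needed before a 32-bit item count can wrap around */
    printf("bytes at UINT32_MAX TIDs: %llu (~24GB)\n",
           (unsigned long long) (tid_size * (uint64_t) UINT32_MAX));

    return 0;
}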
From be3417e947a1f17e5fa4137481668779be532bc5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH 1/3] a: use huge allocations

---
 src/backend/access/gin/gininsert.c | 81 +++++++++++++++++++++++++++---
 1 file changed, 75 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..085c85718cc 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -198,6 +198,10 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
                                   ItemPointerData *items, uint32 nitems,
                                   Size *len);
 
+static ItemPointer mergeItemPointers(ItemPointerData *a, uint32 na,
+                                     ItemPointerData *b, uint32 nb,
+                                     int *nmerged);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -1496,14 +1500,15 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
      * still pass 0 as number of elements in that array though.
      */
     if (buffer->items == NULL)
-        buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+        buffer->items = MemoryContextAllocHuge(CurrentMemoryContext,
+                                               (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
     else
-        buffer->items = repalloc(buffer->items,
-                                 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+        buffer->items = repalloc_huge(buffer->items,
+                                      (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
 
-    new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
-                               (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
-                               items, tup->nitems, &nnew);
+    new = mergeItemPointers(&buffer->items[buffer->nfrozen],    /* first unfrozen */
+                            (buffer->nitems - buffer->nfrozen), /* num of unfrozen */
+                            items, tup->nitems, &nnew);
 
     Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
@@ -2441,3 +2446,67 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
     return ItemPointerCompare(GinTupleGetFirst(a),
                               GinTupleGetFirst(b));
 }
+
+
+/*
+ * local copy, doing palloc_huge
+ */
+static ItemPointer
+mergeItemPointers(ItemPointerData *a, uint32 na,
+                  ItemPointerData *b, uint32 nb,
+                  int *nmerged)
+{
+    ItemPointerData *dst;
+
+    dst = (ItemPointer) MemoryContextAllocHuge(CurrentMemoryContext,
+                                               (na + nb) * sizeof(ItemPointerData));
+
+    /*
+     * If the argument arrays don't overlap, we can just append them to each
+     * other.
+     */
+    if (na == 0 || nb == 0 || ginCompareItemPointers(&a[na - 1], &b[0]) < 0)
+    {
+        memcpy(dst, a, na * sizeof(ItemPointerData));
+        memcpy(&dst[na], b, nb * sizeof(ItemPointerData));
+        *nmerged = na + nb;
+    }
+    else if (ginCompareItemPointers(&b[nb - 1], &a[0]) < 0)
+    {
+        memcpy(dst, b, nb * sizeof(ItemPointerData));
+        memcpy(&dst[nb], a, na * sizeof(ItemPointerData));
+        *nmerged = na + nb;
+    }
+    else
+    {
+        ItemPointerData *dptr = dst;
+        ItemPointerData *aptr = a;
+        ItemPointerData *bptr = b;
+
+        while (aptr - a < na && bptr - b < nb)
+        {
+            int         cmp = ginCompareItemPointers(aptr, bptr);
+
+            if (cmp > 0)
+                *dptr++ = *bptr++;
+            else if (cmp == 0)
+            {
+                /* only keep one copy of the identical items */
+                *dptr++ = *bptr++;
+                aptr++;
+            }
+            else
+                *dptr++ = *aptr++;
+        }
+
+        while (aptr - a < na)
+            *dptr++ = *aptr++;
+
+        while (bptr - b < nb)
+            *dptr++ = *bptr++;
+
+        *nmerged = dptr - dst;
+    }
+
+    return dst;
+}
-- 
2.51.0
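On the uint32 worry raised above: if it ever becomes reachable, a cheap
guard right before the huge allocation in GinBufferStoreTuple() could turn
a counter overflow into a clean error rather than a silent wraparound.
This is only a sketch, not part of the attached patch, and the error code
and message are illustrative:

    /*
     * Hypothetical guard (not in the patch): error out cleanly if the
     * combined item count would exceed the 32-bit counters, instead of
     * overflowing them.
     */
    if ((uint64) buffer->nitems + tup->nitems > PG_UINT32_MAX)
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("too many TIDs for a single GIN key"),
                 errhint("Consider reducing maintenance_work_mem.")));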
From 1c47f44483939a0a7a47073a838164a3296a51c3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sun, 26 Oct 2025 21:14:44 +0100
Subject: [PATCH] b: split TID lists when flushing

---
 src/backend/access/gin/gininsert.c | 37 +++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..eff78cc622d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -152,7 +152,9 @@ typedef struct
      * only in the leader process.
      */
     GinLeader  *bs_leader;
-    int         bs_worker_id;
+
+    /* number of participating workers (including leader) */
+    int         bs_num_workers;
 
     /* used to pass information from workers to leader */
     double      bs_numtuples;
@@ -494,27 +496,39 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
     OffsetNumber attnum;
     TupleDesc   tdesc = RelationGetDescr(index);
 
+    /* how many TIDs fit into a regular allocation (50% to keep slack) */
+    uint32      maxsize = (MaxAllocSize / sizeof(ItemPointerData) / 2);
+    maxsize /= buildstate->bs_num_workers; /* fair share per worker */
+
     ginBeginBAScan(&buildstate->accum);
     while ((list = ginGetBAEntry(&buildstate->accum,
                                  &attnum, &key, &category, &nlist)) != NULL)
     {
         /* information about the key */
         CompactAttribute *attr = TupleDescCompactAttr(tdesc, (attnum - 1));
+        uint32      start = 0;
 
-        /* GIN tuple and tuple length */
-        GinTuple   *tup;
-        Size        tuplen;
+        /* split the list into smaller parts with up to maxsize items */
+        while (start < nlist)
+        {
+            /* GIN tuple and tuple length */
+            GinTuple   *tup;
+            Size        tuplen;
 
-        /* there could be many entries, so be willing to abort here */
-        CHECK_FOR_INTERRUPTS();
+            /* there could be many entries, so be willing to abort here */
+            CHECK_FOR_INTERRUPTS();
 
-        tup = _gin_build_tuple(attnum, category,
-                               key, attr->attlen, attr->attbyval,
-                               list, nlist, &tuplen);
+            tup = _gin_build_tuple(attnum, category,
+                                   key, attr->attlen, attr->attbyval,
+                                   &list[start], Min(maxsize, nlist - start),
+                                   &tuplen);
 
-        tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+            start += maxsize;
 
-        pfree(tup);
+            tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+            pfree(tup);
+        }
     }
 
     MemoryContextReset(buildstate->tmpCtx);
@@ -2016,6 +2030,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 
     /* remember how much space is allowed for the accumulated entries */
     state->work_mem = (sortmem / 2);
+    state->bs_num_workers = ginshared->scantuplesortstates;
 
     /* Begin "partial" tuplesort */
     state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
-- 
2.51.0
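For a sense of scale on the splitting approach: with the 50% slack factor
and, say, four participating workers, the per-key chunk cap works out to
roughly 22 million TIDs (on the order of 128MB), so even the worst case of
the leader merging one chunk from every worker at once stays around half
of MaxAllocSize. A rough sketch of that arithmetic, assuming a 6-byte
ItemPointerData and a worker count of four (the worker count is just an
example, not a value from the patch):

/*
 * Rough arithmetic behind the maxsize cap in ginFlushBuildState.
 * The worker count of 4 is an assumption for illustration only.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t max_alloc_size = 0x3fffffff; /* MaxAllocSize, ~1GB - 1 */
    const uint64_t tid_size = 6;                /* sizeof(ItemPointerData) */
    const uint64_t nworkers = 4;                /* example, incl. leader */

    /* per-worker chunk cap, as computed in ginFlushBuildState */
    uint64_t    maxsize = max_alloc_size / tid_size / 2 / nworkers;

    printf("per-chunk cap: %llu TIDs (~%llu MB)\n",
           (unsigned long long) maxsize,
           (unsigned long long) (maxsize * tid_size / (1024 * 1024)));

    /* leader merging one chunk from each worker at once */
    printf("leader merge worst case: ~%llu MB\n",
           (unsigned long long) (maxsize * tid_size * nworkers / (1024 * 1024)));

    return 0;
}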
