Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Jeff,

* Jeff Janes (jeff.ja...@gmail.com) wrote:
> I was going to add another item to make nodeHash.c use the new huge
> allocator, but after looking at it just now it was not clear to me that
> it even has such a limitation.  nbatch is limited by MaxAllocSize, but
> nbuckets doesn't seem to be.

nodeHash.c:ExecHashTableCreate() allocates ->buckets using:

	palloc(nbuckets * sizeof(HashJoinTuple))

(where HashJoinTuple is actually just a pointer), and reallocates same in
ExecHashTableReset().  That limits the current implementation to only
about 134M buckets, no?

Now, what I was really suggesting wasn't so much changing those specific
calls; my point was really that there's a ton of stuff in the HashJoin
code that uses 32-bit integers for things which, these days, might be too
small (nbuckets being one example, imv).  There's a lot of code there,
though, and you'd have to really consider which things make sense to have
as int64's.

	Thanks,

		Stephen
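[For illustration -- not part of the thread -- here is where the ~134M
figure comes from, assuming a 64-bit build where MaxAllocSize is
0x3fffffff as in memutils.h and HashJoinTuple is pointer-sized:]

	#include <stdio.h>

	/* Stand-in for PostgreSQL's MaxAllocSize: 1 GiB - 1. */
	#define MaxAllocSize ((size_t) 0x3fffffff)

	int main(void)
	{
		/* Each bucket slot holds one HashJoinTuple pointer: 8 bytes here. */
		size_t max_buckets = MaxAllocSize / sizeof(void *);

		printf("largest bucket array: %zu slots (~%zuM)\n",
			   max_buckets, max_buckets / 1000000);	/* ~134M */
		return 0;
	}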
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Wed, Jun 26, 2013 at 03:48:23PM -0700, Jeff Janes wrote:
> On Mon, May 13, 2013 at 7:26 AM, Noah Misch <n...@leadboat.com> wrote:
> > This patch introduces MemoryContextAllocHuge() and repalloc_huge()
> > that check a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks
> > don't bother recording whether they were allocated as huge; one can
> > start with palloc() and then repalloc_huge() to grow the value.
>
> Since it doesn't record the size, I assume the non-use as a varlena is
> enforced only by coder discipline and not by the system?

We will rely on coder discipline, yes.  The allocator actually does
record a size.  I was referring to the fact that it can't distinguish the
result of repalloc(p, 7) from the result of repalloc_huge(p, 7).

> What is likely to happen if I accidentally let a pointer to huge memory
> escape to someone who then passes it to a varlena constructor without
> me knowing it?  (I tried sabotaging the code to make this happen, but I
> could not figure out how to.)  Is there a place we can put an Assert to
> catch this mistake under enable-cassert builds?

Passing a too-large value gives a modulo effect.  We could inject an
AssertMacro() into SET_VARSIZE().  But it's a hot path, and I don't think
this mistake is too likely.

> The only danger I can think of is that it could sometimes make some
> sorts slower, as using more memory than is necessary can sometimes slow
> down an external sort (because the heap is then too big for the fastest
> CPU cache).  If you use more tapes, but not enough more to reduce the
> number of passes needed, then you can get a slowdown.

Interesting point, though I don't fully understand it.  The fastest CPU
cache will be a tiny L1 data cache; surely that's not the relevant
parameter here?

> I can't imagine that it would make things worse on average, though, as
> the benefit of doing more sorts as quicksorts rather than merge sorts,
> or doing mergesort with fewer passes, would outweigh sometimes doing a
> slower mergesort.  If someone has a pathological use pattern for which
> the averages don't work out favorably for them, they could probably
> play with work_mem to correct the problem.  Whereas without the patch,
> people who want more memory have no options.

Agreed.

> People have mentioned additional things that could be done in this
> area, but I don't think that applying this patch will make those things
> harder, or back us into a corner.  Taking an incremental approach seems
> suitable.

Committed with some cosmetic tweaks discussed upthread.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
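[A sketch of the AssertMacro() injection described above -- hypothetical,
not committed code; SET_VARSIZE_CHECKED is an invented name, while
AssertMacro(), SET_VARSIZE_4B(), Size, and MaxAllocSize are existing
PostgreSQL symbols:]

	/*
	 * Debug-only guard against the modulo effect: SET_VARSIZE_4B() keeps
	 * only the low-order length bits, so an over-limit length silently
	 * wraps.  AssertMacro() compiles to nothing without --enable-cassert,
	 * which is why the hot-path cost was the concern.
	 */
	#define SET_VARSIZE_CHECKED(PTR, len) \
		(AssertMacro((Size) (len) <= MaxAllocSize), SET_VARSIZE_4B(PTR, len))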
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Sat, Jun 22, 2013 at 12:46 AM, Stephen Frost <sfr...@snowman.net> wrote:
> Noah,
>
> * Noah Misch (n...@leadboat.com) wrote:
> > This patch introduces MemoryContextAllocHuge() and repalloc_huge()
> > that check a higher MaxAllocHugeSize limit of SIZE_MAX/2.
>
> Nice!  I've complained about this limit a few different times and just
> never got around to addressing it.
>
> > This was made easier by tuplesort growth algorithm improvements in
> > commit 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come
> > up before (TODO item "Allow sorts to use more available memory"), and
> > Tom floated the idea[1] behind the approach I've used.  The next limit
> > faced by sorts is INT_MAX concurrent tuples in memory, which limits
> > helpful work_mem to about 150 GiB when sorting int4.
>
> That's frustratingly small. :(

I've added a TODO item to remove that limit from sorts as well.

I was going to add another item to make nodeHash.c use the new huge
allocator, but after looking at it just now it was not clear to me that
it even has such a limitation.  nbatch is limited by MaxAllocSize, but
nbuckets doesn't seem to be.

Cheers,

Jeff
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Mon, May 13, 2013 at 7:26 AM, Noah Misch <n...@leadboat.com> wrote:
> A memory chunk allocated through the existing palloc.h interfaces is
> limited to MaxAllocSize (~1 GiB).  This is best for most callers;
> SET_VARSIZE() need not check its own 1 GiB limit, and algorithms that
> grow a buffer by doubling need not check for overflow.  However, a
> handful of callers are quite happy to navigate those hazards in
> exchange for the ability to allocate a larger chunk.
>
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> check a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't
> bother recording whether they were allocated as huge; one can start
> with palloc() and then repalloc_huge() to grow the value.

Since it doesn't record the size, I assume the non-use as a varlena is
enforced only by coder discipline and not by the system?

> ! * represented in a varlena header.  Callers that never use the allocation as
> ! * a varlena can access the higher limit with MemoryContextAllocHuge().  Both
> ! * limits permit code to assume that it may compute (in size_t math) twice an
> ! * allocation's size without overflow.

What is likely to happen if I accidentally let a pointer to huge memory
escape to someone who then passes it to a varlena constructor without me
knowing it?  (I tried sabotaging the code to make this happen, but I
could not figure out how to.)  Is there a place we can put an Assert to
catch this mistake under enable-cassert builds?

I have not yet done a detailed code review, but this applies and builds
cleanly, passes make check with and without enable-cassert, it does what
it says (and gives performance improvements when it does kick in), and we
want this.  No doc changes should be needed, we probably don't want to
run an automatic regression test of the size needed to usefully test
this, and as far as I know there is no infrastructure for big-memory-only
tests.

The only danger I can think of is that it could sometimes make some sorts
slower, as using more memory than is necessary can sometimes slow down an
external sort (because the heap is then too big for the fastest CPU
cache).  If you use more tapes, but not enough more to reduce the number
of passes needed, then you can get a slowdown.

I can't imagine that it would make things worse on average, though, as
the benefit of doing more sorts as quicksorts rather than merge sorts, or
doing mergesort with fewer passes, would outweigh sometimes doing a
slower mergesort.  If someone has a pathological use pattern for which
the averages don't work out favorably for them, they could probably play
with work_mem to correct the problem.  Whereas without the patch, people
who want more memory have no options.

People have mentioned additional things that could be done in this area,
but I don't think that applying this patch will make those things harder,
or back us into a corner.  Taking an incremental approach seems suitable.

Cheers,

Jeff
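[For illustration -- not from the patch -- a minimal sketch of the
doubling idiom the quoted text says MaxAllocSize protects.  With chunks
capped at ~2^30 bytes, "alloced * 2" can never overflow size_t, so no
overflow check is written; MaxAllocHugeSize = SIZE_MAX/2 preserves
exactly that guarantee, since doubling a maximal huge chunk still fits
in size_t.  malloc/realloc stand in for palloc/repalloc:]

	#include <stdlib.h>

	int main(void)
	{
		size_t used, alloced = 1024;
		char   *buf = malloc(alloced);	/* stand-in for palloc() */

		if (buf == NULL)
			return 1;
		for (used = 0; used < 1000000; used++)
		{
			if (used == alloced)
			{
				alloced *= 2;				/* no overflow check needed */
				buf = realloc(buf, alloced);	/* stand-in for repalloc() */
				if (buf == NULL)
					return 1;
			}
			buf[used] = 0;
		}
		free(buf);
		return 0;
	}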
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
> * Noah Misch (n...@leadboat.com) wrote:
> > The next limit faced by sorts is INT_MAX concurrent tuples in memory,
> > which limits helpful work_mem to about 150 GiB when sorting int4.
>
> That's frustratingly small. :(

I could appreciate a desire to remove that limit.  The way to do that is
to audit all uses of "int" variables in tuplesort.c and tuplestore.c,
changing them to Size where they can be used as indexes into the
memtuples array.  Nonetheless, this new limit is about 50x the current
limit; you need an (unpartitioned) table of 2B+ rows to encounter it.
I'm happy with that.

> ! 	if (memtupsize * grow_ratio < INT_MAX)
> ! 		newmemtupsize = (int) (memtupsize * grow_ratio);
> ! 	else
> ! 		newmemtupsize = INT_MAX;
>
>   	/* We won't make any further enlargement attempts */
>   	state->growmemtuples = false;
>
> I'm not a huge fan of moving directly to INT_MAX.  Are we confident
> that everything can handle that cleanly..?  I feel like it might be a
> bit safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's
> overly paranoid, but there's an awful lot of callers and some loop
> which +2's and then overflows would suck, eg:

Where are you seeing an awful lot of callers?  The code that needs to be
correct with respect to the INT_MAX limit is all in
tuplesort.c/tuplestore.c.  Consequently, I chose to verify that code
rather than add a safety factor.  (I did add an unrelated safety factor
to repalloc_huge() itself.)

> Also, could this be used to support hashing larger sets..?  If we
> change NTUP_PER_BUCKET to one, we could end up wanting to create a hash
> table larger than INT_MAX since, with 8-byte pointers, that'd only be
> around 134M tuples.

The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
consumers of the huge allocation APIs are only subject to that limit if
they find reasons to enforce it on themselves.  (Incidentally, the
internal limit in question is INT_MAX tuples, not INT_MAX bytes.)

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
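[An illustrative back-of-envelope check of the 150 GiB figure -- not from
the thread.  The 75 bytes/tuple here is an assumption reverse-engineered
from that figure (SortTuple bookkeeping plus the palloc'd tuple itself);
the exact overhead varies by build:]

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		double per_tuple = 75.0;	/* assumed bytes per in-memory int4 tuple */
		double bytes = (double) INT_MAX * per_tuple;

		/* INT_MAX tuples at ~75 bytes apiece comes to about 150 GiB. */
		printf("%.0f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
		return 0;
	}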
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Sat, Jun 22, 2013 at 11:36:45AM +0100, Simon Riggs wrote:
> On 13 May 2013 15:26, Noah Misch <n...@leadboat.com> wrote:
>
> I'm concerned that people will accidentally use MaxAllocSize.  Can we
> put in a runtime warning if someone tests AllocSizeIsValid() with a
> larger value?

I don't see how we could.  To preempt a repalloc() failure, you test with
AllocSizeIsValid(); testing a larger value is not a programming error.

> > To demonstrate, I put this to use in tuplesort.c; the patch also
> > updates tuplestore.c to keep them similar.  Here's the trace_sort
> > from building the pgbench_accounts primary key at scale factor 7500,
> > maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
> >
> > LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec
> > elapsed 391.21 sec
> >
> > Compare:
> >
> > LOG:  external sort ended, 1832846 disk blocks used: CPU
> > 77.45s/988.11u sec elapsed 1146.05 sec
>
> Cool.
>
> I'd like to put in an explicit test for this somewhere.  Obviously not
> part of normal regression, but somewhere, at least, so we have
> automated testing that we all agree on.  (Yes, I know we don't have
> that for replication/recovery yet, but that's why I don't want to
> repeat that mistake.)

Probably the easiest way to test from nothing is to run "pgbench -i -s
7500" under a high work_mem.  I agree that an automated test suite
dedicated to coverage of scale-dependent matters would be valuable,
though I'm disinclined to start one in conjunction with this particular
patch.

> > The comment at MaxAllocSize said that aset.c expects doubling the
> > size of an arbitrary allocation to never overflow, but I couldn't
> > find the code in question.  AllocSetAlloc() does double sizes of
> > blocks used to aggregate small allocations, so maxBlockSize had
> > better stay under SIZE_MAX/2.  Nonetheless, that expectation does
> > apply to dozens of repalloc() users outside aset.c, and I preserved
> > it for repalloc_huge().  64-bit builds will never notice, and I won't
> > cry for the resulting 2 GiB limit on 32-bit.
>
> Agreed.  Can we document this for the relevant parameters?

I attempted to cover most of that in the comment above MaxAllocHugeSize,
but I did not mention the maxBlockSize constraint.  I'll add an
Assert(AllocHugeSizeIsValid(maxBlockSize)) and a comment to
AllocSetContextCreate().  Did I miss documenting anything else notable?

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
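[An illustrative, standalone demonstration of the constraint being
discussed -- not the committed code.  AllocHugeSizeIsValid() is modelled
on the patch's definition; the block-doubling loop mimics what aset.c
does with nextBlockSize:]

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Stand-in modelled on the patch: huge sizes run up to SIZE_MAX/2. */
	#define AllocHugeSizeIsValid(sz)  ((size_t) (sz) <= SIZE_MAX / 2)

	int main(void)
	{
		size_t maxBlockSize = (size_t) 8 * 1024 * 1024;	/* typical 8 MB cap */
		size_t blksize = 8192;

		/*
		 * The guard proposed for AllocSetContextCreate(): block sizes grow
		 * by doubling, so maxBlockSize must stay at or under SIZE_MAX/2
		 * for "blksize <<= 1" to be overflow-free.
		 */
		assert(AllocHugeSizeIsValid(maxBlockSize));

		while (blksize < maxBlockSize)
			blksize <<= 1;		/* cannot overflow, per the assertion */

		printf("final block size: %zu\n", blksize);
		return 0;
	}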
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Monday, June 24, 2013, Noah Misch wrote:
> On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
> > * Noah Misch (n...@leadboat.com) wrote:
> > > The next limit faced by sorts is INT_MAX concurrent tuples in
> > > memory, which limits helpful work_mem to about 150 GiB when sorting
> > > int4.
> >
> > That's frustratingly small. :(
>
> I could appreciate a desire to remove that limit.  The way to do that
> is to audit all uses of "int" variables in tuplesort.c and
> tuplestore.c, changing them to Size where they can be used as indexes
> into the memtuples array.

Right, that's about what I figured would need to be done.

> Nonetheless, this new limit is about 50x the current limit; you need an
> (unpartitioned) table of 2B+ rows to encounter it.  I'm happy with
> that.

Definitely better, but I could see cases with that many tuples in the
not-too-distant future, esp. when used with MinMax indexes...

> > ! 	if (memtupsize * grow_ratio < INT_MAX)
> > ! 		newmemtupsize = (int) (memtupsize * grow_ratio);
> > ! 	else
> > ! 		newmemtupsize = INT_MAX;
> >
> >   	/* We won't make any further enlargement attempts */
> >   	state->growmemtuples = false;
> >
> > I'm not a huge fan of moving directly to INT_MAX.  Are we confident
> > that everything can handle that cleanly..?  I feel like it might be a
> > bit safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's
> > overly paranoid, but there's an awful lot of callers and some loop
> > which +2's and then overflows would suck, eg:
>
> Where are you seeing an awful lot of callers?  The code that needs to
> be correct with respect to the INT_MAX limit is all in
> tuplesort.c/tuplestore.c.  Consequently, I chose to verify that code
> rather than add a safety factor.  (I did add an unrelated safety factor
> to repalloc_huge() itself.)

Ok, I was thinking this code was used beyond tuplesort (I was thinking it
was actually associated with palloc).  Apologies for the confusion. :)

> > Also, could this be used to support hashing larger sets..?  If we
> > change NTUP_PER_BUCKET to one, we could end up wanting to create a
> > hash table larger than INT_MAX since, with 8-byte pointers, that'd
> > only be around 134M tuples.
>
> The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
> consumers of the huge allocation APIs are only subject to that limit if
> they find reasons to enforce it on themselves.  (Incidentally, the
> internal limit in question is INT_MAX tuples, not INT_MAX bytes.)

There are other places where we use integers for indexes into arrays of
tuples (at least hashing is another area..) and those are then also
subject to INT_MAX, which was really what I was getting at.  We might
move the hashing code to use the _huge functions and would then need to
adjust that code to use Size for the index into the hash table array of
pointers.

	Thanks,

		Stephen
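[A hypothetical sketch of the kind of change described above -- nothing
like this was committed.  calloc() stands in for the huge allocator, and
the HashJoinTuple typedef mirrors the real pointer type; the point is
that once nbuckets can exceed MaxAllocSize / sizeof(pointer) (~134M),
the count and any index must be Size/size_t, not int:]

	#include <stdlib.h>

	typedef struct HashJoinTupleData *HashJoinTuple;	/* a pointer type */

	int main(void)
	{
		/* More buckets than an int index could safely address the bytes of. */
		size_t nbuckets = (size_t) 300 * 1000 * 1000;
		HashJoinTuple *buckets =
			calloc(nbuckets, sizeof(HashJoinTuple));	/* ~2.4 GB on 64-bit */

		if (buckets == NULL)
			return 1;
		/* ... build and probe phases would go here ... */
		free(buckets);
		return 0;
	}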
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
Noah,

* Noah Misch (n...@leadboat.com) wrote:
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> check a higher MaxAllocHugeSize limit of SIZE_MAX/2.

Nice!  I've complained about this limit a few different times and just
never got around to addressing it.

> This was made easier by tuplesort growth algorithm improvements in
> commit 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come
> up before (TODO item "Allow sorts to use more available memory"), and
> Tom floated the idea[1] behind the approach I've used.  The next limit
> faced by sorts is INT_MAX concurrent tuples in memory, which limits
> helpful work_mem to about 150 GiB when sorting int4.

That's frustratingly small. :(

[...]
> --- 1024,1041 ----
>   	 * new array elements even if no other memory were currently used.
>   	 *
>   	 * We do the arithmetic in float8, because otherwise the product of
> ! 	 * memtupsize and allowedMem could overflow.  Any inaccuracy in the
> ! 	 * result should be insignificant; but even if we computed a
> ! 	 * completely insane result, the checks below will prevent anything
> ! 	 * really bad from happening.
>   	 */
>   	double		grow_ratio;
>
>   	grow_ratio = (double) state->allowedMem / (double) memNowUsed;
> ! 	if (memtupsize * grow_ratio < INT_MAX)
> ! 		newmemtupsize = (int) (memtupsize * grow_ratio);
> ! 	else
> ! 		newmemtupsize = INT_MAX;
>
>   	/* We won't make any further enlargement attempts */
>   	state->growmemtuples = false;

I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
everything can handle that cleanly..?  I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:

	int x = INT_MAX;
	for (x-1; (x-1) < INT_MAX; x += 2)
	{
		myarray[x] = 5;
	}

Also, could this be used to support hashing larger sets..?  If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than INT_MAX since, with 8-byte pointers, that'd only be around
134M tuples.

Haven't had a chance to review the rest, but +1 on the overall idea. :)

	Thanks!

		Stephen
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On 13 May 2013 15:26, Noah Misch <n...@leadboat.com> wrote:
> A memory chunk allocated through the existing palloc.h interfaces is
> limited to MaxAllocSize (~1 GiB).  This is best for most callers;
> SET_VARSIZE() need not check its own 1 GiB limit, and algorithms that
> grow a buffer by doubling need not check for overflow.  However, a
> handful of callers are quite happy to navigate those hazards in
> exchange for the ability to allocate a larger chunk.
>
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> check a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't
> bother recording whether they were allocated as huge; one can start
> with palloc() and then repalloc_huge() to grow the value.

I like the design and think it's workable.

I'm concerned that people will accidentally use MaxAllocSize.  Can we put
in a runtime warning if someone tests AllocSizeIsValid() with a larger
value?

> To demonstrate, I put this to use in tuplesort.c; the patch also
> updates tuplestore.c to keep them similar.  Here's the trace_sort from
> building the pgbench_accounts primary key at scale factor 7500,
> maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
>
> LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec
> elapsed 391.21 sec
>
> Compare:
>
> LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u
> sec elapsed 1146.05 sec

Cool.

I'd like to put in an explicit test for this somewhere.  Obviously not
part of normal regression, but somewhere, at least, so we have automated
testing that we all agree on.  (Yes, I know we don't have that for
replication/recovery yet, but that's why I don't want to repeat that
mistake.)

> This was made easier by tuplesort growth algorithm improvements in
> commit 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come
> up before (TODO item "Allow sorts to use more available memory"), and
> Tom floated the idea[1] behind the approach I've used.  The next limit
> faced by sorts is INT_MAX concurrent tuples in memory, which limits
> helpful work_mem to about 150 GiB when sorting int4.
>
> I have not added variants like palloc_huge() and palloc0_huge(), and I
> have not added to the frontend palloc.h interface.  There's no
> particular barrier to doing any of that.  I don't expect more than a
> dozen or so callers, so most of the variations might go unused.
>
> The comment at MaxAllocSize said that aset.c expects doubling the size
> of an arbitrary allocation to never overflow, but I couldn't find the
> code in question.  AllocSetAlloc() does double sizes of blocks used to
> aggregate small allocations, so maxBlockSize had better stay under
> SIZE_MAX/2.  Nonetheless, that expectation does apply to dozens of
> repalloc() users outside aset.c, and I preserved it for
> repalloc_huge().  64-bit builds will never notice, and I won't cry for
> the resulting 2 GiB limit on 32-bit.

Agreed.  Can we document this for the relevant parameters?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On 22 June 2013 08:46, Stephen Frost <sfr...@snowman.net> wrote:
> > The next limit faced by sorts is INT_MAX concurrent tuples in memory,
> > which limits helpful work_mem to about 150 GiB when sorting int4.
>
> That's frustratingly small. :(

But that has nothing to do with this patch, right?  And is easily fixed,
yes?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
* Simon Riggs (si...@2ndquadrant.com) wrote:
> On 22 June 2013 08:46, Stephen Frost <sfr...@snowman.net> wrote:
> > > The next limit faced by sorts is INT_MAX concurrent tuples in
> > > memory, which limits helpful work_mem to about 150 GiB when sorting
> > > int4.
> >
> > That's frustratingly small. :(
>
> But that has nothing to do with this patch, right?  And is easily
> fixed, yes?

I don't know about "easily fixed" (consider supporting a HashJoin of 2B
records), but I do agree that dealing with places in the code where we
are using an int4 to keep track of the number of objects in memory is
outside the scope of this patch.

Hopefully we are properly range-checking and limiting ourselves to only
what a given node can support, and not solely depending on MaxAllocSize
to keep us from overflowing some int4 which we're using as an index for
an array or as a count of how many objects we've currently got in
memory.  Still, we'll want to consider carefully what happens with such
large sets as we add support for these huge allocations into nodes
(along with the recent change to allow 1TB work_mem, which may encourage
users with systems large enough to actually try to set it that high... :)

	Thanks,

		Stephen
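[An illustrative, standalone example of the explicit per-node range check
described above -- a sketch, not existing executor code; "ntuples" and
the error text are invented, and real backend code would ereport(ERROR)
instead of exiting:]

	#include <limits.h>
	#include <stdio.h>
	#include <stdlib.h>

	static int ntuples = 0;		/* stand-in for an int-typed node counter */

	static void add_tuple(void)
	{
		/* Check the node's own limit directly, rather than relying on
		 * MaxAllocSize to fail first somewhere downstream. */
		if (ntuples == INT_MAX)
		{
			fprintf(stderr, "in-memory tuple count would exceed %d\n", INT_MAX);
			exit(1);
		}
		ntuples++;
	}

	int main(void)
	{
		add_tuple();
		printf("ntuples = %d\n", ntuples);
		return 0;
	}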
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
On Sat, Jun 22, 2013 at 3:46 AM, Stephen Frost <sfr...@snowman.net> wrote:
> I'm not a huge fan of moving directly to INT_MAX.  Are we confident
> that everything can handle that cleanly..?  I feel like it might be a
> bit safer to shy a bit short of INT_MAX (say, by 1K).

Maybe it would be better to stick with INT_MAX and fix any bugs we find.
If there are magic numbers short of INT_MAX that cause problems, it would
likely be better to find out about those problems and adjust the relevant
code, rather than trying to dodge them.  We'll have to confront all of
those problems eventually as we come to support larger and larger sorts;
I don't see much value in putting it off.  Especially since we're early
in the release cycle.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
A memory chunk allocated through the existing palloc.h interfaces is
limited to MaxAllocSize (~1 GiB).  This is best for most callers;
SET_VARSIZE() need not check its own 1 GiB limit, and algorithms that
grow a buffer by doubling need not check for overflow.  However, a
handful of callers are quite happy to navigate those hazards in exchange
for the ability to allocate a larger chunk.

This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
check a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't bother
recording whether they were allocated as huge; one can start with
palloc() and then repalloc_huge() to grow the value.

To demonstrate, I put this to use in tuplesort.c; the patch also updates
tuplestore.c to keep them similar.  Here's the trace_sort from building
the pgbench_accounts primary key at scale factor 7500,
maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:

LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 391.21 sec

Compare:

LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec elapsed 1146.05 sec

This was made easier by tuplesort growth algorithm improvements in commit
8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
(TODO item "Allow sorts to use more available memory"), and Tom floated
the idea[1] behind the approach I've used.  The next limit faced by sorts
is INT_MAX concurrent tuples in memory, which limits helpful work_mem to
about 150 GiB when sorting int4.

I have not added variants like palloc_huge() and palloc0_huge(), and I
have not added to the frontend palloc.h interface.  There's no particular
barrier to doing any of that.  I don't expect more than a dozen or so
callers, so most of the variations might go unused.

The comment at MaxAllocSize said that aset.c expects doubling the size of
an arbitrary allocation to never overflow, but I couldn't find the code
in question.  AllocSetAlloc() does double sizes of blocks used to
aggregate small allocations, so maxBlockSize had better stay under
SIZE_MAX/2.  Nonetheless, that expectation does apply to dozens of
repalloc() users outside aset.c, and I preserved it for repalloc_huge().
64-bit builds will never notice, and I won't cry for the resulting 2 GiB
limit on 32-bit.

Thanks,
nm

[1] http://www.postgresql.org/message-id/19908.1297696...@sss.pgh.pa.us

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***************
*** 557,562 **** AllocSetDelete(MemoryContext context)
--- 557,566 ----
   * AllocSetAlloc
   *		Returns pointer to allocated memory of given size; memory is added
   *		to the set.
+  *
+  * No request may exceed:
+  *		MAXALIGN_DOWN(SIZE_MAX) - ALLOC_BLOCKHDRSZ - ALLOC_CHUNKHDRSZ
+  * All callers use a much-lower limit.
   */
  static void *
  AllocSetAlloc(MemoryContext context, Size size)
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***************
*** 451,464 **** MemoryContextContains(MemoryContext context, void *pointer)
  	header = (StandardChunkHeader *)
  		((char *) pointer - STANDARDCHUNKHEADERSIZE);
  
! 	/*
! 	 * If the context link doesn't match then we certainly have a non-member
! 	 * chunk.  Also check for a reasonable-looking size as extra guard against
! 	 * being fooled by bogus pointers.
! 	 */
! 	if (header->context == context && AllocSizeIsValid(header->size))
! 		return true;
! 	return false;
  }
  
  /*
--- 451,457 ----
  	header = (StandardChunkHeader *)
  		((char *) pointer - STANDARDCHUNKHEADERSIZE);
  
! 	return header->context == context;
  }
  
  /*
***************
*** 735,740 **** repalloc(void *pointer, Size size)
--- 728,790 ----
  }
  
  /*
+  * MemoryContextAllocHuge
+  *		Allocate (possibly-expansive) space within the specified context.
+  *
+  * See considerations in comment at MaxAllocHugeSize.
+  */
+ void *
+ MemoryContextAllocHuge(MemoryContext context, Size size)
+ {
+ 	AssertArg(MemoryContextIsValid(context));
+ 
+ 	if (!AllocHugeSizeIsValid(size))
+ 		elog(ERROR, "invalid memory alloc request size %lu",
+ 			 (unsigned long) size);
+ 
+ 	context->isReset = false;
+ 
+ 	return (*context->methods->alloc) (context, size);
+ }
+ 
+ /*
+  * repalloc_huge
+  *		Adjust the size of a previously allocated chunk, permitting a large
+  *		value.  The previous allocation need not have been huge.
+  */
+ void *
+ repalloc_huge(void *pointer, Size size)
+ {
+ 	StandardChunkHeader *header;
+ 
+ 	/*
+ 	 * Try to detect bogus pointers handed to us, poorly though we can.
+ 	 * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
+ 	 * allocated chunk.
+ 	 */
+ 
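[The patch is truncated in the archive at this point.  For reference, a
minimal usage sketch of the new interfaces -- an illustrative backend
fragment, not compilable standalone; the variable names and sizes are
invented, but the palloc-then-repalloc_huge progression is the pattern
the patch enables:]

	/* Start under the ordinary 1 GiB palloc limit... */
	Size		n = 100000;
	SortTuple  *memtuples = (SortTuple *) palloc(n * sizeof(SortTuple));

	/*
	 * ...then grow well past MaxAllocSize.  No "huge" flag is recorded on
	 * the chunk, so an ordinary palloc'd pointer may be grown this way.
	 */
	n = (Size) 200000000;		/* roughly 4.8 GB of SortTuples on 64-bit */
	memtuples = (SortTuple *) repalloc_huge(memtuples, n * sizeof(SortTuple));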
Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize
+1

Pavel

On 13.5.2013 16:29, Noah Misch <n...@leadboat.com> wrote:
> A memory chunk allocated through the existing palloc.h interfaces is
> limited to MaxAllocSize (~1 GiB).  This is best for most callers;
> SET_VARSIZE() need not check its own 1 GiB limit, and algorithms that
> grow a buffer by doubling need not check for overflow.  However, a
> handful of callers are quite happy to navigate those hazards in
> exchange for the ability to allocate a larger chunk.
>
> This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
> check a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't
> bother recording whether they were allocated as huge; one can start
> with palloc() and then repalloc_huge() to grow the value.
>
> To demonstrate, I put this to use in tuplesort.c; the patch also
> updates tuplestore.c to keep them similar.  Here's the trace_sort from
> building the pgbench_accounts primary key at scale factor 7500,
> maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
>
> LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec
> elapsed 391.21 sec
>
> Compare:
>
> LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u
> sec elapsed 1146.05 sec
>
> This was made easier by tuplesort growth algorithm improvements in
> commit 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come
> up before (TODO item "Allow sorts to use more available memory"), and
> Tom floated the idea[1] behind the approach I've used.  The next limit
> faced by sorts is INT_MAX concurrent tuples in memory, which limits
> helpful work_mem to about 150 GiB when sorting int4.
>
> I have not added variants like palloc_huge() and palloc0_huge(), and I
> have not added to the frontend palloc.h interface.  There's no
> particular barrier to doing any of that.  I don't expect more than a
> dozen or so callers, so most of the variations might go unused.
>
> The comment at MaxAllocSize said that aset.c expects doubling the size
> of an arbitrary allocation to never overflow, but I couldn't find the
> code in question.  AllocSetAlloc() does double sizes of blocks used to
> aggregate small allocations, so maxBlockSize had better stay under
> SIZE_MAX/2.  Nonetheless, that expectation does apply to dozens of
> repalloc() users outside aset.c, and I preserved it for
> repalloc_huge().  64-bit builds will never notice, and I won't cry for
> the resulting 2 GiB limit on 32-bit.
>
> Thanks,
> nm
>
> [1] http://www.postgresql.org/message-id/19908.1297696...@sss.pgh.pa.us
>
> --
> Noah Misch
> EnterpriseDB                                 http://www.enterprisedb.com