Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-07-06 Thread Stephen Frost
Jeff,

* Jeff Janes (jeff.ja...@gmail.com) wrote:
 I was going to add another item to make nodeHash.c use the new huge
 allocator, but after looking at it just now it was not clear to me that it
 even has such a limitation.  nbatch is limited by MaxAllocSize, but
 nbuckets doesn't seem to be.

nodeHash.c:ExecHashTableCreate() allocates ->buckets using:

palloc(nbuckets * sizeof(HashJoinTuple)) 

(where HashJoinTuple is actually just a pointer), and reallocates same
in ExecHashTableReset().  That limits the current implementation to only
about 134M buckets, no?
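For what it's worth, that figure falls straight out of the allocation
ceiling; here's a minimal standalone sketch of the arithmetic (assuming
MaxAllocSize is 0x3fffffff, as in memutils.h, and 8-byte pointers):

/* Illustrative only: derive the ~134M figure.  Assumes MaxAllocSize is
 * 0x3fffffff (memutils.h) and that HashJoinTuple is an 8-byte pointer. */
#include <stdio.h>
#include <stddef.h>

int
main(void)
{
	const size_t max_alloc_size = 0x3fffffff;	/* ~1 GiB palloc() ceiling */
	const size_t ptr_size = 8;			/* sizeof(HashJoinTuple) */

	/* nbuckets * sizeof(HashJoinTuple) must stay under MaxAllocSize. */
	printf("max nbuckets ~ %zu\n", max_alloc_size / ptr_size);	/* ~134M */
	return 0;
}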

Now, what I was really suggesting wasn't so much changing those specific
calls; my point was really that there's a ton of stuff in the HashJoin
code that uses 32-bit integers for things which, these days, might be too
small (nbuckets being one example, imv).  There's a lot of code there
though and you'd have to really consider which things make sense to have
as int64's.

Thanks,

Stephen




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-27 Thread Noah Misch
On Wed, Jun 26, 2013 at 03:48:23PM -0700, Jeff Janes wrote:
 On Mon, May 13, 2013 at 7:26 AM, Noah Misch n...@leadboat.com wrote:
  This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
  check
  a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't bother
  recording
  whether they were allocated as huge; one can start with palloc() and then
  repalloc_huge() to grow the value.
 
 
 Since it doesn't record the size, I assume the non-use as a varlena is
 enforced only by coder discipline and not by the system?

We will rely on coder discipline, yes.  The allocator actually does record a
size.  I was referring to the fact that it can't distinguish the result of
repalloc(p, 7) from the result of repalloc_huge(p, 7).
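
For concreteness, the intended pattern from the original post (start with
palloc() and grow with repalloc_huge()) might look roughly like this
backend-style sketch; the buffer and size variables are invented for
illustration:

/* Illustrative sketch only (variable names invented): a chunk that starts as
 * an ordinary palloc() allocation and is later grown past MaxAllocSize. */
Size	cap = 1024;
char   *buf = palloc(cap);

while (needed > cap)
{
	cap *= 2;			/* may exceed MaxAllocSize ... */
	buf = repalloc_huge(buf, cap);	/* ... which repalloc_huge() permits */
}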

 What is likely to happen if I accidentally let a pointer to huge memory
 escape to someone who then passes it to varlena constructor without me
 knowing it?  (I tried sabotaging the code to make this happen, but I could
 not figure out how to).   Is there a place we can put an Assert to catch
 this mistake under enable-cassert builds?

Passing a too-large value gives a modulo effect.  We could inject an
AssertMacro() into SET_VARSIZE().  But it's a hot path, and I don't think this
mistake is too likely.
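
As a standalone illustration of that modulo effect (assuming the 4-byte
varlena header keeps the length in 30 bits, so an oversized value wraps
modulo 2^30):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t	requested = ((uint64_t) 1 << 30) + 100;		/* 1 GiB + 100 bytes */
	uint32_t	stored = (uint32_t) (requested & 0x3FFFFFFF);	/* 30-bit length field */

	/* The header silently records the size modulo 2^30: just 100 here. */
	printf("requested %llu, header would record %u\n",
		   (unsigned long long) requested, stored);
	return 0;
}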

 The only danger I can think of is that it could sometimes make some sorts
 slower, as using more memory than is necessary can sometimes slow down an
 external sort (because the heap is then too big for the fastest CPU
 cache).  If you use more tapes, but not enough more to reduce the number of
 passes needed, then you can get a slowdown.

Interesting point, though I don't fully understand it.  The fastest CPU cache
will be a tiny L1 data cache; surely that's not the relevant parameter here?

 I can't imagine that it would make things worse on average, though, as the
 benefit of doing more sorts as quicksorts rather than merge sorts, or doing
 a mergesort with fewer passes, would outweigh sometimes doing a
 slower mergesort.  If someone has a pathological use pattern for which the
 averages don't work out favorably for them, they could probably play with
 work_mem to correct the problem.  Whereas without the patch, people who
 want more memory have no options.

Agreed.

 People have mentioned additional things that could be done in this area,
 but I don't think that applying this patch will make those things harder,
 or back us into a corner.  Taking an incremental approach seems suitable.

Committed with some cosmetic tweaks discussed upthread.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-27 Thread Jeff Janes
On Sat, Jun 22, 2013 at 12:46 AM, Stephen Frost sfr...@snowman.net wrote:

 Noah,

 * Noah Misch (n...@leadboat.com) wrote:
  This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
 check
  a higher MaxAllocHugeSize limit of SIZE_MAX/2.

 Nice!  I've complained about this limit a few different times and just
 never got around to addressing it.

  This was made easier by tuplesort growth algorithm improvements in commit
  8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
  (TODO item "Allow sorts to use more available memory"), and Tom floated the
  idea[1] behind the approach I've used.  The next limit faced by sorts is
  INT_MAX concurrent tuples in memory, which limits helpful work_mem to
 about
  150 GiB when sorting int4.

 That's frustratingly small. :(


I've added a ToDo item to remove that limit from sorts as well.

I was going to add another item to make nodeHash.c use the new huge
allocator, but after looking at it just now it was not clear to me that it
even has such a limitation.  nbatch is limited by MaxAllocSize, but
nbuckets doesn't seem to be.

Cheers,

Jeff


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-26 Thread Jeff Janes
On Mon, May 13, 2013 at 7:26 AM, Noah Misch n...@leadboat.com wrote:

 A memory chunk allocated through the existing palloc.h interfaces is
 limited
 to MaxAllocSize (~1 GiB).  This is best for most callers; SET_VARSIZE()
 need
 not check its own 1 GiB limit, and algorithms that grow a buffer by
 doubling
 need not check for overflow.  However, a handful of callers are quite
 happy to
 navigate those hazards in exchange for the ability to allocate a larger
 chunk.

 This patch introduces MemoryContextAllocHuge() and repalloc_huge() that
 check
 a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't bother
 recording
 whether they were allocated as huge; one can start with palloc() and then
 repalloc_huge() to grow the value.


Since it doesn't record the size, I assume the non-use as a varlena is
enforced only by coder discipline and not by the system?

!  * represented in a varlena header.  Callers that never use the allocation as
!  * a varlena can access the higher limit with MemoryContextAllocHuge().  Both
!  * limits permit code to assume that it may compute (in size_t math) twice an
!  * allocation's size without overflow.

What is likely to happen if I accidentally let a pointer to huge memory
escape to someone who then passes it to varlena constructor without me
knowing it?  (I tried sabotaging the code to make this happen, but I could
not figure out how to).   Is there a place we can put an Assert to catch
this mistake under enable-cassert builds?

I have not yet done a detailed code review, but this applies and builds
cleanly, passes make check with and without enable-cassert, it does what it
says (and gives performance improvements when it does kick in), and we want
this.  No doc changes should be needed, we probably don't want to run an
automatic regression test of the size needed to usefully test this, and as
far as I know there is no infrastructure for big-memory-only tests.

The only danger I can think of is that it could sometimes make some sorts
slower, as using more memory than is necessary can sometimes slow down an
external sort (because the heap is then too big for the fastest CPU
cache).  If you use more tapes, but not enough more to reduce the number of
passes needed, then you can get a slowdown.

I can't imagine that it would make things worse on average, though, as the
benefit of doing more sorts as quicksorts rather than merge sorts, or doing
a mergesort with fewer passes, would outweigh sometimes doing a
slower mergesort.  If someone has a pathological use pattern for which the
averages don't work out favorably for them, they could probably play with
work_mem to correct the problem.  Whereas without the patch, people who
want more memory have no options.

People have mentioned additional things that could be done in this area,
but I don't think that applying this patch will make those things harder,
or back us into a corner.  Taking an incremental approach seems suitable.

Cheers,

Jeff


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-24 Thread Noah Misch
On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
 * Noah Misch (n...@leadboat.com) wrote:
  The next limit faced by sorts is
  INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
  150 GiB when sorting int4.
 
 That's frustratingly small. :(

I could appreciate a desire to remove that limit.  The way to do that is to
audit all uses of int variables in tuplesort.c and tuplestore.c, changing
them to Size where they can be used as indexes into the memtuples array.
Nonetheless, this new limit is about 50x the current limit; you need an
(unpartitioned) table of 2B+ rows to encounter it.  I'm happy with that.
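
As a quick sanity check on the 50x figure (assuming MaxAllocSize of ~1 GiB and
a 24-byte SortTuple on a 64-bit build):

#include <stdio.h>

int
main(void)
{
	const double max_alloc = 1073741824.0;		/* old cap on the memtuples array */
	const double sort_tuple = 24.0;			/* assumed sizeof(SortTuple) */
	const double old_limit = max_alloc / sort_tuple;	/* ~44.7M tuples */
	const double new_limit = 2147483647.0;		/* INT_MAX tuples */

	printf("old ~%.1fM tuples, new ~%.2fB tuples, ratio ~%.0fx\n",
		   old_limit / 1e6, new_limit / 1e9, new_limit / old_limit);
	return 0;
}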

  !   if (memtupsize * grow_ratio < INT_MAX)
  !   newmemtupsize = (int) (memtupsize * grow_ratio);
  !   else
  !   newmemtupsize = INT_MAX;

  /* We won't make any further enlargement attempts */
  state->growmemtuples = false;
 
 I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
 everything can handle that cleanly..?  I feel like it might be a bit
 safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's overly
 paranoid, but there's an awful lot of callers and some loop which +2's
 and then overflows would suck, eg:

Where are you seeing "an awful lot of callers"?  The code that needs to be
correct with respect to the INT_MAX limit is all in tuplesort.c/tuplestore.c.
Consequently, I chose to verify that code rather than add a safety factor.  (I
did add an unrelated safety factor to repalloc_huge() itself.)

 Also, could this be used to support hashing larger sets..?  If we change
 NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
 larger than INT_MAX since, with 8-byte pointers, that'd only be around
 134M tuples.

The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
consumers of the huge allocation APIs are only subject to that limit if they
find reasons to enforce it on themselves.  (Incidentally, the internal limit
in question is INT_MAX tuples, not INT_MAX bytes.)

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-24 Thread Noah Misch
On Sat, Jun 22, 2013 at 11:36:45AM +0100, Simon Riggs wrote:
 On 13 May 2013 15:26, Noah Misch n...@leadboat.com wrote:

 I'm concerned that people will accidentally use MaxAllocSize. Can we
 put in a runtime warning if someone tests AllocSizeIsValid() with a
 larger value?

I don't see how we could.  To preempt a repalloc() failure, you test with
AllocSizeIsValid(); testing a larger value is not a programming error.
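
A standalone sketch of that caller pattern; the two macros below mirror what
memutils.h defines, so treat the exact values as assumptions here:

#include <stdio.h>
#include <stddef.h>

typedef size_t Size;

#define MaxAllocSize		((Size) 0x3fffffff)		/* ~1 GiB */
#define AllocSizeIsValid(s)	((Size) (s) <= MaxAllocSize)

int
main(void)
{
	Size	request = (Size) 3 * 1024 * 1024 * 1024;	/* 3 GiB */

	/* Testing an over-limit value is normal; the caller just takes another path. */
	if (!AllocSizeIsValid(request))
		printf("too big for repalloc(); would need repalloc_huge()\n");
	return 0;
}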

   To demonstrate, I put this to use in
  tuplesort.c; the patch also updates tuplestore.c to keep them similar.  
  Here's
  the trace_sort from building the pgbench_accounts primary key at scale 
  factor
  7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:
 
  LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 
  391.21 sec
 
  Compare:
 
  LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec 
  elapsed 1146.05 sec
 
 Cool.
 
 I'd like to put in an explicit test for this somewhere. Obviously not
 part of normal regression, but somewhere, at least, so we have
 automated testing that we all agree on. (yes, I know we don't have
  that for replication/recovery yet, but that's why I don't want to
 repeat that mistake).

Probably the easiest way to test from nothing is to run pgbench -i -s 7500
under a high work_mem.  I agree that an automated test suite dedicated to
coverage of scale-dependent matters would be valuable, though I'm disinclined
to start one in conjunction with this particular patch.

  The comment at MaxAllocSize said that aset.c expects doubling the size of an
  arbitrary allocation to never overflow, but I couldn't find the code in
  question.  AllocSetAlloc() does double sizes of blocks used to aggregate 
  small
  allocations, so maxBlockSize had better stay under SIZE_MAX/2.  Nonetheless,
  that expectation does apply to dozens of repalloc() users outside aset.c, 
  and
  I preserved it for repalloc_huge().  64-bit builds will never notice, and I
  won't cry for the resulting 2 GiB limit on 32-bit.
 
 Agreed. Can we document this for the relevant parameters?

I attempted to cover most of that in the comment above MaxAllocHugeSize, but I
did not mention the maxBlockSize constraint.  I'll add an
Assert(AllocHugeSizeIsValid(maxBlockSize)) and a comment to
AllocSetContextCreate().  Did I miss documenting anything else notable?

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-24 Thread Stephen Frost
On Monday, June 24, 2013, Noah Misch wrote:

 On Sat, Jun 22, 2013 at 03:46:49AM -0400, Stephen Frost wrote:
  * Noah Misch (n...@leadboat.com) wrote:
   The next limit faced by sorts is
   INT_MAX concurrent tuples in memory, which limits helpful work_mem to
 about
   150 GiB when sorting int4.
 
  That's frustratingly small. :(

 I could appreciate a desire to remove that limit.  The way to do that is to
 audit all uses of int variables in tuplesort.c and tuplestore.c, changing
 them to Size where they can be used as indexes into the memtuples array.


Right, that's about what I figured would need to be done.


 Nonetheless, this new limit is about 50x the current limit; you need an
 (unpartitioned) table of 2B+ rows to encounter it.  I'm happy with that.


Definitely better but I could see cases with that many tuples in the
not-too-distant future, esp. when used with MinMax indexes...


   !   if (memtupsize * grow_ratio < INT_MAX)
   !   newmemtupsize = (int) (memtupsize * grow_ratio);
   !   else
   !   newmemtupsize = INT_MAX;
  
   /* We won't make any further enlargement attempts */
   state->growmemtuples = false;
 
  I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
  everything can handle that cleanly..?  I feel like it might be a bit
  safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's overly
  paranoid, but there's an awful lot of callers and some loop which +2's
  and then overflows would suck, eg:

 Where are you seeing "an awful lot of callers"?  The code that needs to be
 correct with respect to the INT_MAX limit is all in
 tuplesort.c/tuplestore.c.
 Consequently, I chose to verify that code rather than add a safety factor.
  (I
 did add an unrelated safety factor to repalloc_huge() itself.)


Ok, I was thinking this code was used beyond tuplesort (I was thinking it
was actually associated with palloc). Apologies for the confusion. :)


  Also, could this be used to support hashing larger sets..?  If we change
  NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
  larger than INT_MAX since, with 8-byte pointers, that'd only be around
  134M tuples.

 The INT_MAX limit is an internal limit of tuplesort/tuplestore; other
 consumers of the huge allocation APIs are only subject to that limit if
 they
 find reasons to enforce it on themselves.  (Incidentally, the internal
 limit
 in question is INT_MAX tuples, not INT_MAX bytes.)


There's other places where we use integers for indexes into arrays of
tuples (at least hashing is another area..) and those are then also subject
to INT_MAX, which was really what I was getting at.  We might move the
hashing code to use the _huge functions and would then need to adjust that
code to use Size for the index into the hash table array of pointers.
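
Roughly the kind of adjustment I have in mind, as a hypothetical fragment (the
context and helper names are invented, and it isn't meant to compile on its
own):

/* Hypothetical fragment: widen the bucket index and use the huge allocator so
 * nbuckets is no longer capped by MaxAllocSize / sizeof(pointer). */
Size		nbuckets = choose_nbuckets();	/* was a plain int; helper is invented */
HashJoinTuple *buckets;

buckets = (HashJoinTuple *)
	MemoryContextAllocHuge(hashCxt, nbuckets * sizeof(HashJoinTuple));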

Thanks,

Stephen


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Stephen Frost
Noah,

* Noah Misch (n...@leadboat.com) wrote:
 This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
 a higher MaxAllocHugeSize limit of SIZE_MAX/2.  

Nice!  I've complained about this limit a few different times and just
never got around to addressing it.

 This was made easier by tuplesort growth algorithm improvements in commit
 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
 (TODO item "Allow sorts to use more available memory"), and Tom floated the
 idea[1] behind the approach I've used.  The next limit faced by sorts is
 INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
 150 GiB when sorting int4.

That's frustratingly small. :(

[...]
 --- 1024,1041 
 	* new array elements even if no other memory were currently used.
 	*
 	* We do the arithmetic in float8, because otherwise the product of
 !	* memtupsize and allowedMem could overflow.  Any inaccuracy in the
 !	* result should be insignificant; but even if we computed a
 !	* completely insane result, the checks below will prevent anything
 !	* really bad from happening.
 	*/
 	double  grow_ratio;
 
 	grow_ratio = (double) state->allowedMem / (double) memNowUsed;
 !	if (memtupsize * grow_ratio < INT_MAX)
 !		newmemtupsize = (int) (memtupsize * grow_ratio);
 !	else
 !		newmemtupsize = INT_MAX;
 
 	/* We won't make any further enlargement attempts */
 	state->growmemtuples = false;

I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
everything can handle that cleanly..?  I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:

int x = INT_MAX;
for (x-1; (x-1) < INT_MAX; x += 2) {
myarray[x] = 5;
}

Also, could this be used to support hashing larger sets..?  If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than INT_MAX since, with 8-byte pointers, that'd only be around
134M tuples.

Haven't had a chance to review the rest, but +1 on the overall idea. :)

Thanks!

Stephen




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Simon Riggs
On 13 May 2013 15:26, Noah Misch n...@leadboat.com wrote:
 A memory chunk allocated through the existing palloc.h interfaces is limited
 to MaxAllocSize (~1 GiB).  This is best for most callers; SET_VARSIZE() need
 not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
 need not check for overflow.  However, a handful of callers are quite happy to
 navigate those hazards in exchange for the ability to allocate a larger chunk.

 This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
 a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't bother recording
 whether they were allocated as huge; one can start with palloc() and then
 repalloc_huge() to grow the value.

I like the design and think it's workable.

I'm concerned that people will accidentally use MaxAllocSize. Can we
put in a runtime warning if someone tests AllocSizeIsValid() with a
larger value?

  To demonstrate, I put this to use in
 tuplesort.c; the patch also updates tuplestore.c to keep them similar.  Here's
 the trace_sort from building the pgbench_accounts primary key at scale factor
 7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:

 LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 
 391.21 sec

 Compare:

 LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec 
 elapsed 1146.05 sec

Cool.

I'd like to put in an explicit test for this somewhere. Obviously not
part of normal regression, but somewhere, at least, so we have
automated testing that we all agree on. (yes, I know we don't have
that for replication/recovery yet, but that's why I don't want to
repeat that mistake).

 This was made easier by tuplesort growth algorithm improvements in commit
 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
 (TODO item "Allow sorts to use more available memory"), and Tom floated the
 idea[1] behind the approach I've used.  The next limit faced by sorts is
 INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
 150 GiB when sorting int4.

 I have not added variants like palloc_huge() and palloc0_huge(), and I have
 not added to the frontend palloc.h interface.  There's no particular barrier
 to doing any of that.  I don't expect more than a dozen or so callers, so most
 of the variations might go unused.

 The comment at MaxAllocSize said that aset.c expects doubling the size of an
 arbitrary allocation to never overflow, but I couldn't find the code in
 question.  AllocSetAlloc() does double sizes of blocks used to aggregate small
 allocations, so maxBlockSize had better stay under SIZE_MAX/2.  Nonetheless,
 that expectation does apply to dozens of repalloc() users outside aset.c, and
 I preserved it for repalloc_huge().  64-bit builds will never notice, and I
 won't cry for the resulting 2 GiB limit on 32-bit.

Agreed. Can we document this for the relevant parameters?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Simon Riggs
On 22 June 2013 08:46, Stephen Frost sfr...@snowman.net wrote:

The next limit faced by sorts is
 INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
 150 GiB when sorting int4.

 That's frustratingly small. :(


But that has nothing to do with this patch, right? And is easily fixed, yes?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
 On 22 June 2013 08:46, Stephen Frost sfr...@snowman.net wrote:
 The next limit faced by sorts is
  INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
  150 GiB when sorting int4.
 
  That's frustratingly small. :(
 
 But that has nothing to do with this patch, right? And is easily fixed, yes?

I don't know about 'easily fixed' (consider supporting a HashJoin of 2B
records) but I do agree that dealing with places in the code where we are
using an int4 to keep track of the number of objects in memory is outside
the scope of this patch.

Hopefully we are properly range-checking and limiting ourselves to only
what a given node can support, and not solely depending on MaxAllocSize
to keep us from overflowing some int4 which we're using as an index for
an array or as a count of how many objects we've currently got in
memory.  Still, we'll want to consider carefully what happens with such
large sets as we're adding support into nodes for these Huge
allocations (along with the recent change to allow 1TB work_mem, which
may encourage users with systems large enough to actually try to set it
that high... :)

Thanks,

Stephen




Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 3:46 AM, Stephen Frost sfr...@snowman.net wrote:
 I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
 everything can handle that cleanly..?  I feel like it might be a bit
 safer to shy a bit short of INT_MAX (say, by 1K).

Maybe it would be better to stick with INT_MAX and fix any bugs we
find.  If there are magic numbers short of INT_MAX that cause
problems, it would likely be better to find out about those problems
and adjust the relevant code, rather than trying to dodge them.  We'll
have to confront all of those problems eventually as we come to
support larger and larger sorts; I don't see much value in putting it
off.

Especially since we're early in the release cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-05-13 Thread Noah Misch
A memory chunk allocated through the existing palloc.h interfaces is limited
to MaxAllocSize (~1 GiB).  This is best for most callers; SET_VARSIZE() need
not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
need not check for overflow.  However, a handful of callers are quite happy to
navigate those hazards in exchange for the ability to allocate a larger chunk.

This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't bother recording
whether they were allocated as huge; one can start with palloc() and then
repalloc_huge() to grow the value.  To demonstrate, I put this to use in
tuplesort.c; the patch also updates tuplestore.c to keep them similar.  Here's
the trace_sort from building the pgbench_accounts primary key at scale factor
7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:

LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 
391.21 sec

Compare:

LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec 
elapsed 1146.05 sec

This was made easier by tuplesort growth algorithm improvements in commit
8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
(TODO item "Allow sorts to use more available memory"), and Tom floated the
idea[1] behind the approach I've used.  The next limit faced by sorts is
INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
150 GiB when sorting int4.

I have not added variants like palloc_huge() and palloc0_huge(), and I have
not added to the frontend palloc.h interface.  There's no particular barrier
to doing any of that.  I don't expect more than a dozen or so callers, so most
of the variations might go unused.

The comment at MaxAllocSize said that aset.c expects doubling the size of an
arbitrary allocation to never overflow, but I couldn't find the code in
question.  AllocSetAlloc() does double sizes of blocks used to aggregate small
allocations, so maxBlockSize had better stay under SIZE_MAX/2.  Nonetheless,
that expectation does apply to dozens of repalloc() users outside aset.c, and
I preserved it for repalloc_huge().  64-bit builds will never notice, and I
won't cry for the resulting 2 GiB limit on 32-bit.
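
As a standalone illustration of why the SIZE_MAX/2 ceiling keeps the usual
doubling idiom overflow-free (not code from the patch):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

int
main(void)
{
	size_t	max_huge = SIZE_MAX / 2;	/* MaxAllocHugeSize in the patch */

	/* Doubling any request up to the limit stays within size_t, so the usual
	 * grow-by-doubling repalloc() pattern needs no overflow check. */
	printf("2 * largest request = %zu, SIZE_MAX = %zu\n",
		   2 * max_huge, SIZE_MAX);
	return 0;
}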

Thanks,
nm

[1] http://www.postgresql.org/message-id/19908.1297696...@sss.pgh.pa.us

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com
*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 557,562  AllocSetDelete(MemoryContext context)
--- 557,566 
   * AllocSetAlloc
   *Returns pointer to allocated memory of given size; memory is added
   *to the set.
+  *
+  * No request may exceed:
+  *MAXALIGN_DOWN(SIZE_MAX) - ALLOC_BLOCKHDRSZ - ALLOC_CHUNKHDRSZ
+  * All callers use a much-lower limit.
   */
  static void *
  AllocSetAlloc(MemoryContext context, Size size)
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***
*** 451,464  MemoryContextContains(MemoryContext context, void *pointer)
header = (StandardChunkHeader *)
((char *) pointer - STANDARDCHUNKHEADERSIZE);
  
!   /*
!* If the context link doesn't match then we certainly have a non-member
!* chunk.  Also check for a reasonable-looking size as extra guard against
!* being fooled by bogus pointers.
!*/
!   if (header->context == context && AllocSizeIsValid(header->size))
!   return true;
!   return false;
  }
  
  /*
--- 451,457 
header = (StandardChunkHeader *)
((char *) pointer - STANDARDCHUNKHEADERSIZE);
  
!   return header->context == context;
  }
  
  /*
***
*** 735,740  repalloc(void *pointer, Size size)
--- 728,790 
  }
  
  /*
+  * MemoryContextAllocHuge
+  *Allocate (possibly-expansive) space within the specified context.
+  *
+  * See considerations in comment at MaxAllocHugeSize.
+  */
+ void *
+ MemoryContextAllocHuge(MemoryContext context, Size size)
+ {
+   AssertArg(MemoryContextIsValid(context));
+ 
+   if (!AllocHugeSizeIsValid(size))
+   elog(ERROR, "invalid memory alloc request size %lu",
+(unsigned long) size);
+ 
+   context->isReset = false;
+ 
+   return (*context->methods->alloc) (context, size);
+ }
+ 
+ /*
+  * repalloc_huge
+  *Adjust the size of a previously allocated chunk, permitting a large
+  *value.  The previous allocation need not have been huge.
+  */
+ void *
+ repalloc_huge(void *pointer, Size size)
+ {
+   StandardChunkHeader *header;
+ 
+   /*
+* Try to detect bogus pointers handed to us, poorly though we can.
+* Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
+* allocated chunk.
+*/
+   

Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-05-13 Thread Pavel Stehule
+1

Pavel