Hi,

On 23.12.2014 10:16, Jeff Davis wrote:
> It seems that these two patches are being reviewed together. Should
> I just combine them into one? My understanding was that some wanted
> to review the memory accounting patch separately.
> 
> On Sun, 2014-12-21 at 20:19 +0100, Tomas Vondra wrote:
>> That's the only conflict, and after fixing it it compiles OK.
>> However, I got a segfault on the very first query I tried :-(
> 
> If lookup_hash_entry doesn't find the group, and there's not enough 
> memory to create it, then it returns NULL; but the caller wasn't 
> checking for NULL. My apologies for such a trivial mistake, I was
> doing most of my testing using DISTINCT. My fix here was done
> quickly, so I'll take a closer look later to make sure I didn't miss
> something else.
> 
> New patch attached (rebased, as well).

I did a review today, using these two patches:

    * memory-accounting-v9.patch (submitted on December 2)
    * hashagg-disk-20141222.patch

I started with some basic performance measurements, comparing hashagg
queries with and without the patches (while you compared hashagg and
sort). That's IMHO an interesting comparison, especially when no
batching is necessary - in the optimal case users should not see any
slowdown (we shouldn't make them pay for the batching unless it's
actually needed).

So I did this:

    drop table if exists test_hash_agg;

    create table test_hash_agg as
        select
            i AS a,
            mod(i,1000000) AS b,
            mod(i,100000) AS c,
            mod(i,10000) AS d,
            mod(i,1000) AS e,
            mod(i,100) AS f
        from generate_series(1,10000000) s(i);

    vacuum (full, analyze) test_hash_agg;

i.e. a ~500MB table with 10M rows, and columns with different
cardinalities. And then queries like this:

   select count(*) from (select a, count(a) from test_hash_agg
                         group by a) foo;

   -- 10M groups (OOM)
   select count(*) from (select a, array_agg(a) from test_hash_agg
                         group by a) foo;

   -- 100 groups
   select count(*) from (select f, array_agg(f) from test_hash_agg
                         group by f) foo;

which performed quite well, i.e. I saw absolutely no slowdown. In the
array_agg case that is quite suspicious, because every lookup_hash_entry
call has to do MemoryContextMemAllocated() over 10M contexts, and I
really doubt that can be done in ~0 time.
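
I haven't quoted the v9 implementation here, but a recursive "total
allocated" walk over the context tree necessarily has to visit every
child context, so conceptually it boils down to something like this
(a rough sketch only - the names are illustrative, not the actual
patch code):

    /* rough sketch, not the actual v9 code */
    static Size
    mem_allocated_recurse(MemoryContext context)
    {
        /* hypothetical per-context counter */
        Size          total = context_self_allocated(context);
        MemoryContext child;

        for (child = context->firstchild; child != NULL;
             child = child->nextchild)
            total += mem_allocated_recurse(child);

        return total;
    }

With 10M per-group contexts, that's 10M iterations for every single
input tuple.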

So I started digging in the code and I noticed this:

    hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);

which is IMHO obviously wrong, because that accounts only for the
hashtable itself. It might be correct for aggregates with state passed
by value, but it's incorrect for state passed by reference (e.g.
Numeric, arrays etc.), because initialize_aggregates does this:

    oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
    pergroupstate->transValue = datumCopy(peraggstate->initValue,
                                        peraggstate->transtypeByVal,
                                        peraggstate->transtypeLen);
    MemoryContextSwitchTo(oldContext);

and it's also wrong for all the user-defined aggregates that have no
access to hashcontext at all and only get a reference to aggcontext
using AggCheckCallContext(). array_agg() is a prime example.
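
Just to illustrate the pattern (a minimal hypothetical transition
function, not the actual array_agg code):

    #include "postgres.h"
    #include "fmgr.h"

    /* hypothetical by-reference transition state */
    typedef struct MyAggState
    {
        int64   count;
        /* ... */
    } MyAggState;

    Datum
    my_agg_transfn(PG_FUNCTION_ARGS)
    {
        MemoryContext aggcontext;
        MyAggState   *state;

        if (!AggCheckCallContext(fcinfo, &aggcontext))
            elog(ERROR, "my_agg_transfn called in non-aggregate context");

        if (PG_ARGISNULL(0))
        {
            /*
             * First call for this group - the state lives in aggcontext
             * (array_agg even creates a per-group child context there),
             * so accounting rooted at hashcontext never sees it.
             */
            state = (MyAggState *) MemoryContextAllocZero(aggcontext,
                                                          sizeof(MyAggState));
        }
        else
            state = (MyAggState *) PG_GETARG_POINTER(0);

        /* ... accumulate the new value into the state ... */

        PG_RETURN_POINTER(state);
    }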

In those cases the patch actually does no memory accounting at all, and
as hashcontext has no child contexts, there's also no accounting
overhead.
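
The actual fix is in the attached 0003-context-fix.patch; roughly, it
just points both MemoryContextMemAllocated() calls in nodeAgg.c at
aggcontext instead of hashcontext:

    /* in build_hash_table() */
    aggstate->hash_mem_min = MemoryContextMemAllocated(
        aggstate->aggcontext, true);

    /* in lookup_hash_entry() */
    hash_mem = MemoryContextMemAllocated(aggstate->aggcontext, true);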

After fixing this bug the performance drops dramatically. For the query
with 100 groups (which requires no batching at all) I see this:

test=# explain analyze select count(x) from (select f, array_agg(1) AS x
from test_hash_agg group by f) foo;

                        QUERY PLAN (original patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=2539.156..2539.156 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=2492.264..2539.012 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 24kB  Disk Usage:0kB
         ->  Seq Scan on test_hash_agg
                     (cost=0.00..163693.71 rows=9999871 width=4)
                     (actual time=0.022..547.379 rows=10000000 loops=1)
 Planning time: 0.039 ms
 Execution time: 2542.932 ms
(7 rows)

                        QUERY PLAN (fixed patch)
------------------------------------------------------------------------
 Aggregate  (cost=213695.57..213695.58 rows=1 width=32)
            (actual time=5670.885..5670.885 rows=1 loops=1)
   ->  HashAggregate  (cost=213693.07..213694.32 rows=100 width=4)
                      (actual time=5623.254..5670.803 rows=100 loops=1)
         Group Key: test_hash_agg.f
         Batches: 1  Memory Usage: 117642kB  Disk Usage:0kB
         ->  Seq Scan on test_hash_agg
                     (cost=0.00..163693.71 rows=9999871 width=4)
                     (actual time=0.014..577.924 rows=10000000 loops=1)
 Planning time: 0.103 ms
 Execution time: 5677.187 ms
(7 rows)

So the performance drops 2x. With more groups, the performance impact is
even worse. For example with the first query (with 10M groups), this is
what I get in perf:

explain analyze select count(x) from (select a, array_agg(1) AS x from
test_hash_agg group by a) foo;


   PerfTop:    4671 irqs/sec  kernel:11.2%  exact:  0.0%
------------------------------------------------------------------------

    87.07%  postgres                 [.] MemoryContextMemAllocated
     1.63%  [zcommon]                [k] fletcher_4_native
     1.60%  [kernel]                 [k] acpi_processor_ffh_cstate_enter
     0.68%  [kernel]                 [k] xhci_irq
     0.30%  ld-2.19.so               [.] _dl_sysinfo_int80
     0.30%  [kernel]                 [k] memmove
     0.26%  [kernel]                 [k] ia32_syscall
     0.16%  libglib-2.0.so.0.3800.2  [.] 0x000000000008c52a
     0.15%  libpthread-2.19.so       [.] pthread_mutex_lock


and it runs indefinitely (I gave up after a few minutes). I believe this
renders the proposed memory accounting approach dead.

Granted, 10M groups is a somewhat extreme example, but the query with
100 groups certainly is not.

I understand the effort to avoid the 2% regression measured by Robert on
a PowerPC machine, but I don't think that's a sufficient reason to cause
so much trouble to everyone using array_agg() or user-defined aggregates
based on the same 'create subcontext' pattern.

Especially when the reindex case can often be improved by using more
maintenance_work_mem, while there's nothing you can do to improve this
one (if there's no batching, increasing work_mem does nothing).

The array_agg() patch I submitted to this CF would fix this particular
query, as it removes the child contexts (so there's no need for
recursion in MemoryContextMemAllocated), but it does nothing for the
user-defined aggregates out there. And it's not committed yet.

Also, ISTM this makes it rather unusable as a general accounting
approach. If a mere 100 subcontexts result in a 2x slowdown, then even a
handful of subcontexts will have a measurable impact if we decide to use
this somewhere else (and not just in hash aggregate).

I was curious how the accounting mechanism I proposed (a parallel
MemoryAccounting hierarchy next to MemoryContext) would handle this, so
I used it instead of memory-accounting-v9.patch. I measured no
difference compared to master (no slowdown at all).

I also tried array_agg() queries that actually require batching, e.g.

  select a, array_agg(1) x from test_hash_agg group by a;

which produces 10M groups, each using a separate 8kB context, i.e.
~80GB in total. With work_mem=1GB this should proceed just fine with
~80 batches. In practice it runs indefinitely (again, I lost patience
after a few minutes), and I see this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00    0,00  281,00     0,00 140784,00  1002,02   143,25  512,57    0,00  512,57   3,56 100,00

tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ du -s
374128  .
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | wc -l
130
tomas@rimmer ~/tmp/pg-hashagg/base/pgsql_tmp $ ls -l | head
celkem 372568
-rw------- 1 tomas users 2999480  7. led 19.46 pgsql_tmp23267.2637
-rw------- 1 tomas users 2973520  7. led 19.46 pgsql_tmp23267.2638
-rw------- 1 tomas users 2978800  7. led 19.46 pgsql_tmp23267.2639
-rw------- 1 tomas users 2959880  7. led 19.46 pgsql_tmp23267.2640
-rw------- 1 tomas users 3010040  7. led 19.46 pgsql_tmp23267.2641
-rw------- 1 tomas users 3083520  7. led 19.46 pgsql_tmp23267.2642
-rw------- 1 tomas users 3053160  7. led 19.46 pgsql_tmp23267.2643
-rw------- 1 tomas users 3044360  7. led 19.46 pgsql_tmp23267.2644
-rw------- 1 tomas users 3014000  7. led 19.46 pgsql_tmp23267.2645

That is, there are ~130 files, each ~3MB, ~370MB in total. Yet the
system keeps doing ~140MB/s of writes the whole time, and the table
itself is only ~500MB.

So either I'm missing something, or there's some sort of bug.

Attached you can find a bunch of files:

  * 0001-memory-accounting.patch (my memory accounting)
  * 0002-hashagg-patch.patch (jeff's patch, for completeness)
  * 0003-context-fix.patch (fix hashcontext -> aggcontext)
  * test.sql (data / queries I used for testing)

regards
Tomas
From 71a3f8e16d62ac0f4d51d17e4e0435d73f9bd925 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <t...@fuzzy.cz>
Date: Wed, 7 Jan 2015 18:15:46 +0100
Subject: [PATCH 1/3] memory accounting

---
 src/backend/executor/nodeAgg.c |  5 ++-
 src/backend/utils/mmgr/aset.c  | 88 ++++++++++++++++++++++++++++++++++++++++--
 src/backend/utils/mmgr/mcxt.c  | 46 +++++++++++++++++++++-
 src/include/nodes/memnodes.h   | 17 +++++++-
 src/include/utils/memutils.h   | 10 ++++-
 5 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 08088ea..c93b915 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1535,11 +1535,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	 * recover no-longer-wanted space.
 	 */
 	aggstate->aggcontext =
-		AllocSetContextCreate(CurrentMemoryContext,
+		AllocSetContextCreateTracked(CurrentMemoryContext,
 							  "AggContext",
 							  ALLOCSET_DEFAULT_MINSIZE,
 							  ALLOCSET_DEFAULT_INITSIZE,
-							  ALLOCSET_DEFAULT_MAXSIZE);
+							  ALLOCSET_DEFAULT_MAXSIZE,
+							  true);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 85b3c9a..5441e9e 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -242,6 +242,8 @@ typedef struct AllocChunkData
 #define AllocChunkGetPointer(chk)	\
 					((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
 
+static void update_allocation(MemoryContext context, int64 size);
+
 /*
  * These functions implement the MemoryContext API for AllocSet contexts.
  */
@@ -250,7 +252,7 @@ static void AllocSetFree(MemoryContext context, void *pointer);
 static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
 static void AllocSetInit(MemoryContext context);
 static void AllocSetReset(MemoryContext context);
-static void AllocSetDelete(MemoryContext context);
+static void AllocSetDelete(MemoryContext context, MemoryContext parent);
 static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
 static bool AllocSetIsEmpty(MemoryContext context);
 static void AllocSetStats(MemoryContext context, int level);
@@ -430,6 +432,9 @@ randomize_mem(char *ptr, size_t size)
  * minContextSize: minimum context size
  * initBlockSize: initial allocation block size
  * maxBlockSize: maximum allocation block size
+ *
+ * The flag determining whether this context tracks memory usage is inherited
+ * from the parent context.
  */
 MemoryContext
 AllocSetContextCreate(MemoryContext parent,
@@ -438,6 +443,26 @@ AllocSetContextCreate(MemoryContext parent,
 					  Size initBlockSize,
 					  Size maxBlockSize)
 {
+	return AllocSetContextCreateTracked(
+		parent, name, minContextSize, initBlockSize, maxBlockSize,
+		false);
+}
+
+/*
+ * AllocSetContextCreateTracked
+ *		Create a new AllocSet context.
+ *
+ * Implementation for AllocSetContextCreate, but also allows the caller to
+ * specify whether memory usage should be tracked or not.
+ */
+MemoryContext
+AllocSetContextCreateTracked(MemoryContext parent,
+							 const char *name,
+							 Size minContextSize,
+							 Size initBlockSize,
+							 Size maxBlockSize,
+							 bool track_mem)
+{
 	AllocSet	context;
 
 	/* Do the type-independent part of context creation */
@@ -445,7 +470,8 @@ AllocSetContextCreate(MemoryContext parent,
 											 sizeof(AllocSetContext),
 											 &AllocSetMethods,
 											 parent,
-											 name);
+											 name,
+											 track_mem);
 
 	/*
 	 * Make sure alloc parameters are reasonable, and save them.
@@ -500,6 +526,9 @@ AllocSetContextCreate(MemoryContext parent,
 					 errdetail("Failed while creating memory context \"%s\".",
 							   name)));
 		}
+
+		update_allocation((MemoryContext) context, blksize);
+
 		block->aset = context;
 		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
 		block->endptr = ((char *) block) + blksize;
@@ -590,6 +619,7 @@ AllocSetReset(MemoryContext context)
 		else
 		{
 			/* Normal case, release the block */
+			update_allocation(context, -(block->endptr - ((char*) block)));
 #ifdef CLOBBER_FREED_MEMORY
 			wipe_mem(block, block->freeptr - ((char *) block));
 #endif
@@ -611,11 +641,13 @@ AllocSetReset(MemoryContext context)
  * But note we are not responsible for deleting the context node itself.
  */
 static void
-AllocSetDelete(MemoryContext context)
+AllocSetDelete(MemoryContext context, MemoryContext parent)
 {
 	AllocSet	set = (AllocSet) context;
 	AllocBlock	block = set->blocks;
 
+	MemoryAccounting accounting;
+
 	AssertArg(AllocSetIsValid(set));
 
 #ifdef MEMORY_CONTEXT_CHECKING
@@ -623,6 +655,17 @@ AllocSetDelete(MemoryContext context)
 	AllocSetCheck(context);
 #endif
 
+	if (context->accounting != NULL) {
+
+		accounting = context->accounting->parent;
+
+		while (accounting != NULL)
+		{
+			accounting->total_allocated -= context->accounting->total_allocated;
+			accounting = accounting->parent;
+		}
+	}
+
 	/* Make it look empty, just in case... */
 	MemSetAligned(set->freelist, 0, sizeof(set->freelist));
 	set->blocks = NULL;
@@ -678,6 +721,9 @@ AllocSetAlloc(MemoryContext context, Size size)
 					 errmsg("out of memory"),
 					 errdetail("Failed on request of size %zu.", size)));
 		}
+
+		update_allocation(context, blksize);
+
 		block->aset = set;
 		block->freeptr = block->endptr = ((char *) block) + blksize;
 
@@ -873,6 +919,8 @@ AllocSetAlloc(MemoryContext context, Size size)
 					 errdetail("Failed on request of size %zu.", size)));
 		}
 
+		update_allocation(context, blksize);
+
 		block->aset = set;
 		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
 		block->endptr = ((char *) block) + blksize;
@@ -976,6 +1024,7 @@ AllocSetFree(MemoryContext context, void *pointer)
 			set->blocks = block->next;
 		else
 			prevblock->next = block->next;
+		update_allocation(context, -(block->endptr - ((char*) block)));
 #ifdef CLOBBER_FREED_MEMORY
 		wipe_mem(block, block->freeptr - ((char *) block));
 #endif
@@ -1088,6 +1137,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
 		AllocBlock	prevblock = NULL;
 		Size		chksize;
 		Size		blksize;
+		Size		oldblksize;
 
 		while (block != NULL)
 		{
@@ -1105,6 +1155,8 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
 		/* Do the realloc */
 		chksize = MAXALIGN(size);
 		blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
+		oldblksize = block->endptr - ((char *)block);
+
 		block = (AllocBlock) realloc(block, blksize);
 		if (block == NULL)
 		{
@@ -1114,6 +1166,7 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
 					 errmsg("out of memory"),
 					 errdetail("Failed on request of size %zu.", size)));
 		}
+		update_allocation(context, blksize - oldblksize);
 		block->freeptr = block->endptr = ((char *) block) + blksize;
 
 		/* Update pointers since block has likely been moved */
@@ -1277,6 +1330,35 @@ AllocSetStats(MemoryContext context, int level)
 }
 
 
+/*
+ * update_allocation
+ *
+ * Track newly-allocated or newly-freed memory (freed memory should be
+ * negative).
+ */
+static void
+update_allocation(MemoryContext context, int64 size)
+{
+	MemoryAccounting accounting = context->accounting;
+
+	if (accounting == NULL)
+		return;
+
+	accounting->self_allocated += size;
+
+	while (accounting != NULL) {
+
+		accounting->total_allocated += size;
+
+		Assert(accounting->self_allocated >= 0);
+		Assert(accounting->total_allocated >= accounting->self_allocated);
+
+		accounting = accounting->parent;
+
+	}
+
+}
+
 #ifdef MEMORY_CONTEXT_CHECKING
 
 /*
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index aa0d458..5e4e404 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -187,6 +187,8 @@ MemoryContextResetChildren(MemoryContext context)
 void
 MemoryContextDelete(MemoryContext context)
 {
+	MemoryContext parent = context->parent;
+
 	AssertArg(MemoryContextIsValid(context));
 	/* We had better not be deleting TopMemoryContext ... */
 	Assert(context != TopMemoryContext);
@@ -202,7 +204,12 @@ MemoryContextDelete(MemoryContext context)
 	 */
 	MemoryContextSetParent(context, NULL);
 
-	(*context->methods->delete_context) (context);
+	/* pass the parent in case it's needed, however */
+	(*context->methods->delete_context) (context, parent);
+
+	if (context->track_mem)
+		pfree(context->accounting);
+	
 	VALGRIND_DESTROY_MEMPOOL(context);
 	pfree(context);
 }
@@ -324,6 +331,26 @@ MemoryContextAllowInCriticalSection(MemoryContext context, bool allow)
 }
 
 /*
+ * MemoryContextMemAllocated
+ *
+ * Return memory allocated by the system to this context. If total is true,
+ * include child contexts. Context must have track_mem set.
+ */
+Size
+MemoryContextMemAllocated(MemoryContext context, bool recursive)
+{
+	Assert(context->track_mem);
+
+	if (! context->track_mem)
+		return 0;
+
+	if (recursive)
+		return context->accounting->total_allocated;
+	else
+		return context->accounting->self_allocated;
+}
+
+/*
  * GetMemoryChunkSpace
  *		Given a currently-allocated chunk, determine the total space
  *		it occupies (including all memory-allocation overhead).
@@ -546,7 +573,8 @@ MemoryContext
 MemoryContextCreate(NodeTag tag, Size size,
 					MemoryContextMethods *methods,
 					MemoryContext parent,
-					const char *name)
+					const char *name,
+					bool track_mem)
 {
 	MemoryContext node;
 	Size		needed = size + strlen(name) + 1;
@@ -576,6 +604,8 @@ MemoryContextCreate(NodeTag tag, Size size,
 	node->firstchild = NULL;
 	node->nextchild = NULL;
 	node->isReset = true;
+	node->track_mem = track_mem;
+	node->accounting = NULL;
 	node->name = ((char *) node) + size;
 	strcpy(node->name, name);
 
@@ -596,6 +626,18 @@ MemoryContextCreate(NodeTag tag, Size size,
 #endif
 	}
 
+	/* if we want to do tracking, just allocate MemoryAccountingData */
+	if (track_mem)
+	{
+		node->accounting = (MemoryAccounting)MemoryContextAllocZero(TopMemoryContext,
+												sizeof(MemoryAccountingData));
+		if (parent)
+			node->accounting->parent = parent->accounting;
+	} else if (parent) {
+		
+		node->accounting = parent->accounting;
+	}
+
 	VALGRIND_CREATE_MEMPOOL(node, 0, false);
 
 	/* Return to type-specific creation routine to finish up */
diff --git a/src/include/nodes/memnodes.h b/src/include/nodes/memnodes.h
index ca9c3de..03c63c9 100644
--- a/src/include/nodes/memnodes.h
+++ b/src/include/nodes/memnodes.h
@@ -41,7 +41,8 @@ typedef struct MemoryContextMethods
 	void	   *(*realloc) (MemoryContext context, void *pointer, Size size);
 	void		(*init) (MemoryContext context);
 	void		(*reset) (MemoryContext context);
-	void		(*delete_context) (MemoryContext context);
+	void		(*delete_context) (MemoryContext context,
+								   MemoryContext parent);
 	Size		(*get_chunk_space) (MemoryContext context, void *pointer);
 	bool		(*is_empty) (MemoryContext context);
 	void		(*stats) (MemoryContext context, int level);
@@ -50,6 +51,18 @@ typedef struct MemoryContextMethods
 #endif
 } MemoryContextMethods;
 
+typedef struct MemoryAccountingData {
+
+	Size	total_allocated; /* including child contexts */
+	Size	self_allocated;  /* not including child contexts */
+
+	/* parent accounting (not parent context) */
+	struct MemoryAccountingData * parent;
+
+} MemoryAccountingData;
+
+typedef MemoryAccountingData * MemoryAccounting;
+
 
 typedef struct MemoryContextData
 {
@@ -60,6 +73,8 @@ typedef struct MemoryContextData
 	MemoryContext nextchild;	/* next child of same parent */
 	char	   *name;			/* context name (just for debugging) */
 	bool		isReset;		/* T = no space alloced since last reset */
+	bool		track_mem;		/* whether to track memory usage */
+	MemoryAccounting	accounting;
 #ifdef USE_ASSERT_CHECKING
 	bool		allowInCritSection;	/* allow palloc in critical section */
 #endif
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index 85aba7a..f9fea27 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -96,6 +96,7 @@ extern void MemoryContextDeleteChildren(MemoryContext context);
 extern void MemoryContextResetAndDeleteChildren(MemoryContext context);
 extern void MemoryContextSetParent(MemoryContext context,
 					   MemoryContext new_parent);
+extern Size MemoryContextMemAllocated(MemoryContext context, bool recursive);
 extern Size GetMemoryChunkSpace(void *pointer);
 extern MemoryContext GetMemoryChunkContext(void *pointer);
 extern MemoryContext MemoryContextGetParent(MemoryContext context);
@@ -117,7 +118,8 @@ extern bool MemoryContextContains(MemoryContext context, void *pointer);
 extern MemoryContext MemoryContextCreate(NodeTag tag, Size size,
 					MemoryContextMethods *methods,
 					MemoryContext parent,
-					const char *name);
+					const char *name,
+					bool track_mem);
 
 
 /*
@@ -130,6 +132,12 @@ extern MemoryContext AllocSetContextCreate(MemoryContext parent,
 					  Size minContextSize,
 					  Size initBlockSize,
 					  Size maxBlockSize);
+extern MemoryContext AllocSetContextCreateTracked(MemoryContext parent,
+					  const char *name,
+					  Size minContextSize,
+					  Size initBlockSize,
+					  Size maxBlockSize,
+					  bool track_mem);
 
 /*
  * Recommended default alloc parameters, suitable for "ordinary" contexts
-- 
2.0.5

From 2efd34056d32dbaee3883941b5d9a60c1c2b635a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <t...@fuzzy.cz>
Date: Wed, 7 Jan 2015 18:16:04 +0100
Subject: [PATCH 2/3] jeff's hashagg patch

---
 doc/src/sgml/config.sgml                      |  15 +
 src/backend/commands/explain.c                |  38 +++
 src/backend/executor/execGrouping.c           |  59 +++-
 src/backend/executor/nodeAgg.c                | 454 +++++++++++++++++++++++---
 src/backend/optimizer/path/costsize.c         |  37 ++-
 src/backend/optimizer/plan/createplan.c       |   4 +
 src/backend/optimizer/plan/planagg.c          |   2 +-
 src/backend/optimizer/plan/planner.c          |  12 +-
 src/backend/optimizer/prep/prepunion.c        |   2 +-
 src/backend/optimizer/util/pathnode.c         |   2 +-
 src/backend/utils/hash/dynahash.c             |  12 +-
 src/backend/utils/misc/guc.c                  |   9 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/executor/executor.h               |   6 +
 src/include/executor/nodeAgg.h                |   1 +
 src/include/nodes/execnodes.h                 |   7 +
 src/include/nodes/plannodes.h                 |   2 +
 src/include/optimizer/cost.h                  |   3 +-
 src/include/utils/hsearch.h                   |   3 +
 src/test/regress/expected/rangefuncs.out      |   3 +-
 20 files changed, 614 insertions(+), 58 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6bcb106..e241a13 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3045,6 +3045,21 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-hashagg-disk" xreflabel="enable_hashagg_disk">
+      <term><varname>enable_hashagg_disk</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_hashagg_disk</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of hashed aggregation plan
+        types when the planner expects the hash table size to exceed
+        <varname>work_mem</varname>. The default is <literal>on</>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-hashjoin" xreflabel="enable_hashjoin">
       <term><varname>enable_hashjoin</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8a0be5d..0c8ec04 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -86,6 +86,7 @@ static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
 					 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
 					ExplainState *es);
 static void show_instrumentation_count(const char *qlabel, int which,
@@ -1422,6 +1423,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Agg:
 			show_agg_keys((AggState *) planstate, ancestors, es);
 			show_upper_qual(plan->qual, "Filter", planstate, ancestors, es);
+			show_hashagg_info((AggState *) planstate, es);
 			if (plan->qual)
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
@@ -1912,6 +1914,42 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 }
 
 /*
+ * Show information on hash aggregate buckets and batches
+ */
+static void
+show_hashagg_info(AggState *aggstate, ExplainState *es)
+{
+	Agg *agg = (Agg *)aggstate->ss.ps.plan;
+
+	Assert(IsA(aggstate, AggState));
+
+	if (agg->aggstrategy != AGG_HASHED)
+		return;
+
+	if (!aggstate->hash_init_state)
+	{
+		long	memPeakKb = (aggstate->hash_mem_peak + 1023) / 1024;
+		long	diskKb	  = (aggstate->hash_disk + 1023) / 1024;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			appendStringInfoSpaces(es->str, es->indent * 2);
+			appendStringInfo(
+				es->str,
+				"Batches: %d  Memory Usage: %ldkB  Disk Usage:%ldkB\n",
+				aggstate->hash_num_batches, memPeakKb, diskKb);
+		}
+		else
+		{
+			ExplainPropertyLong("HashAgg Batches",
+								aggstate->hash_num_batches, es);
+			ExplainPropertyLong("Peak Memory Usage", memPeakKb, es);
+			ExplainPropertyLong("Disk Usage", diskKb, es);
+		}
+	}
+}
+
+/*
  * Show information on hash buckets/batches.
  */
 static void
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 183115f..84bdf6d 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -310,7 +310,8 @@ BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
 	hash_ctl.hcxt = tablecxt;
 	hashtable->hashtab = hash_create("TupleHashTable", nbuckets,
 									 &hash_ctl,
-					HASH_ELEM | HASH_FUNCTION | HASH_COMPARE | HASH_CONTEXT);
+									 HASH_ELEM | HASH_FUNCTION | HASH_COMPARE |
+									 HASH_CONTEXT | HASH_NOCHILDCXT);
 
 	return hashtable;
 }
@@ -331,6 +332,55 @@ TupleHashEntry
 LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
 					 bool *isnew)
 {
+	uint32 hashvalue;
+
+	hashvalue = TupleHashEntryHash(hashtable, slot);
+	return LookupTupleHashEntryHash(hashtable, slot, hashvalue, isnew);
+}
+
+/*
+ * TupleHashEntryHash
+ *
+ * Calculate the hash value of the tuple.
+ */
+uint32
+TupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot)
+{
+	TupleHashEntryData	dummy;
+	TupleHashTable		saveCurHT;
+	uint32				hashvalue;
+
+	/*
+	 * Set up data needed by hash function.
+	 *
+	 * We save and restore CurTupleHashTable just in case someone manages to
+	 * invoke this code re-entrantly.
+	 */
+	hashtable->inputslot = slot;
+	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+	hashtable->cur_eq_funcs = hashtable->tab_eq_funcs;
+
+	saveCurHT = CurTupleHashTable;
+	CurTupleHashTable = hashtable;
+
+	dummy.firstTuple = NULL;	/* flag to reference inputslot */
+	hashvalue = TupleHashTableHash(&dummy, sizeof(TupleHashEntryData));
+
+	CurTupleHashTable = saveCurHT;
+
+	return hashvalue;
+}
+
+/*
+ * LookupTupleHashEntryHash
+ *
+ * Like LookupTupleHashEntry, but allows the caller to specify the tuple's
+ * hash value, to avoid recalculating it.
+ */
+TupleHashEntry
+LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+						 uint32 hashvalue, bool *isnew)
+{
 	TupleHashEntry entry;
 	MemoryContext oldContext;
 	TupleHashTable saveCurHT;
@@ -371,10 +421,9 @@ LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
 
 	/* Search the hash table */
 	dummy.firstTuple = NULL;	/* flag to reference inputslot */
-	entry = (TupleHashEntry) hash_search(hashtable->hashtab,
-										 &dummy,
-										 isnew ? HASH_ENTER : HASH_FIND,
-										 &found);
+	entry = (TupleHashEntry) hash_search_with_hash_value(
+		hashtable->hashtab, &dummy, hashvalue, isnew ? HASH_ENTER : HASH_FIND,
+		&found);
 
 	if (isnew)
 	{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index c93b915..3b9f4cc 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -96,6 +96,8 @@
 
 #include "postgres.h"
 
+#include <math.h>
+
 #include "access/htup_details.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_aggregate.h"
@@ -108,14 +110,18 @@
 #include "optimizer/tlist.h"
 #include "parser/parse_agg.h"
 #include "parser/parse_coerce.h"
+#include "storage/buffile.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
 #include "utils/datum.h"
 
+#define HASH_DISK_MIN_PARTITIONS		1
+#define HASH_DISK_MAX_PARTITIONS		256
 
 /*
  * AggStatePerAggData - per-aggregate working state for the Agg scan
@@ -301,6 +307,17 @@ typedef struct AggHashEntryData
 	AggStatePerGroupData pergroup[1];	/* VARIABLE LENGTH ARRAY */
 }	AggHashEntryData;	/* VARIABLE LENGTH STRUCT */
 
+typedef struct HashWork
+{
+	BufFile		 *input_file;	/* input partition, NULL for outer plan */
+	int			  input_bits;	/* number of bits for input partition mask */
+	int64		  input_groups; /* estimated number of input groups */
+
+	int			  n_output_partitions; /* number of output partitions */
+	BufFile		**output_partitions; /* output partition files */
+	int64		 *output_ntuples; /* number of tuples in each partition */
+	int			  output_bits; /* log2(n_output_partitions) + input_bits */
+} HashWork;
 
 static void initialize_aggregates(AggState *aggstate,
 					  AggStatePerAgg peragg,
@@ -321,11 +338,15 @@ static void finalize_aggregate(AggState *aggstate,
 				   Datum *resultVal, bool *resultIsNull);
 static Bitmapset *find_unaggregated_cols(AggState *aggstate);
 static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
-static void build_hash_table(AggState *aggstate);
-static AggHashEntry lookup_hash_entry(AggState *aggstate,
-				  TupleTableSlot *inputslot);
+static void build_hash_table(AggState *aggstate, long nbuckets);
+static AggHashEntry lookup_hash_entry(AggState *aggstate, HashWork *work,
+				   TupleTableSlot *inputslot, uint32 hashvalue);
+static HashWork *hash_work(BufFile *input_file, int64 input_groups,
+						   int input_bits);
+static void save_tuple(AggState *aggstate, HashWork *work,
+					   TupleTableSlot *slot, uint32 hashvalue);
 static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_fill_hash_table(AggState *aggstate);
+static bool agg_fill_hash_table(AggState *aggstate);
 static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
 static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
 
@@ -923,20 +944,46 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
 }
 
 /*
+ * Estimate all memory used by a group in the hash table.
+ */
+Size
+hash_group_size(int numAggs, int inputWidth, Size transitionSpace)
+{
+	Size size;
+
+	/* tuple overhead */
+	size = MAXALIGN(sizeof(MinimalTupleData));
+	/* group key */
+	size += MAXALIGN(inputWidth);
+	/* hash table overhead */
+	size += hash_agg_entry_size(numAggs);
+	/* by-ref transition space */
+	size += transitionSpace;
+
+	return size;
+}
+
+/*
  * Initialize the hash table to empty.
  *
  * The hash table always lives in the aggcontext memory context.
  */
 static void
-build_hash_table(AggState *aggstate)
+build_hash_table(AggState *aggstate, long nbuckets)
 {
 	Agg		   *node = (Agg *) aggstate->ss.ps.plan;
 	MemoryContext tmpmem = aggstate->tmpcontext->ecxt_per_tuple_memory;
 	Size		entrysize;
+	Size		hash_group_mem = hash_group_size(aggstate->numaggs,
+												 node->plan_width,
+												 node->transitionSpace);
 
 	Assert(node->aggstrategy == AGG_HASHED);
 	Assert(node->numGroups > 0);
 
+	/* don't exceed work_mem */
+	nbuckets = Min(nbuckets, (long) ((work_mem * 1024L) / hash_group_mem));
+
 	entrysize = sizeof(AggHashEntryData) +
 		(aggstate->numaggs - 1) * sizeof(AggStatePerGroupData);
 
@@ -944,10 +991,16 @@ build_hash_table(AggState *aggstate)
 											  node->grpColIdx,
 											  aggstate->eqfunctions,
 											  aggstate->hashfunctions,
-											  node->numGroups,
+											  nbuckets,
 											  entrysize,
-											  aggstate->aggcontext,
+											  aggstate->hashcontext,
 											  tmpmem);
+
+	aggstate->hash_mem_min = MemoryContextMemAllocated(
+		aggstate->hashcontext, true);
+
+	if (aggstate->hash_mem_min > aggstate->hash_mem_peak)
+		aggstate->hash_mem_peak = aggstate->hash_mem_min;
 }
 
 /*
@@ -1021,15 +1074,21 @@ hash_agg_entry_size(int numAggs)
  * Find or create a hashtable entry for the tuple group containing the
  * given tuple.
  *
+ * If the group doesn't exist, and there's not enough memory to create it,
+ * save it for a later batch and return NULL.
+ *
  * When called, CurrentMemoryContext should be the per-query context.
  */
 static AggHashEntry
-lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
+lookup_hash_entry(AggState *aggstate, HashWork *work,
+				  TupleTableSlot *inputslot, uint32 hashvalue)
 {
 	TupleTableSlot *hashslot = aggstate->hashslot;
 	ListCell   *l;
 	AggHashEntry entry;
-	bool		isnew;
+	int64		hash_mem;
+	bool		isnew = false;
+	bool	   *p_isnew;
 
 	/* if first time through, initialize hashslot by cloning input slot */
 	if (hashslot->tts_tupleDescriptor == NULL)
@@ -1049,10 +1108,20 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 		hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
 	}
 
+	hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
+	if (hash_mem > aggstate->hash_mem_peak)
+		aggstate->hash_mem_peak = hash_mem;
+
+	if (hash_mem <= aggstate->hash_mem_min ||
+		hash_mem < work_mem * 1024L)
+		p_isnew = &isnew;
+	else
+		p_isnew = NULL;
+
 	/* find or create the hashtable entry using the filtered tuple */
-	entry = (AggHashEntry) LookupTupleHashEntry(aggstate->hashtable,
-												hashslot,
-												&isnew);
+	entry = (AggHashEntry) LookupTupleHashEntryHash(aggstate->hashtable,
+													hashslot, hashvalue,
+													p_isnew);
 
 	if (isnew)
 	{
@@ -1060,9 +1129,166 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 		initialize_aggregates(aggstate, aggstate->peragg, entry->pergroup);
 	}
 
+	if (entry == NULL)
+		save_tuple(aggstate, work, inputslot, hashvalue);
+
 	return entry;
 }
 
+
+/*
+ * hash_work
+ *
+ * Construct a HashWork item, which represents one iteration of HashAgg to be
+ * done. Should be called in the aggregate's memory context.
+ */
+static HashWork *
+hash_work(BufFile *input_file, int64 input_groups, int input_bits)
+{
+	HashWork *work = palloc(sizeof(HashWork));
+
+	work->input_file = input_file;
+	work->input_bits = input_bits;
+	work->input_groups = input_groups;
+
+	/*
+	 * Will be set only if we run out of memory and need to partition an
+	 * additional level.
+	 */
+	work->n_output_partitions = 0;
+	work->output_partitions = NULL;
+	work->output_ntuples = NULL;
+	work->output_bits = 0;
+
+	return work;
+}
+
+/*
+ * save_tuple
+ *
+ * Not enough memory to add tuple as new entry in hash table. Save for later
+ * in the appropriate partition.
+ */
+static void
+save_tuple(AggState *aggstate, HashWork *work, TupleTableSlot *slot,
+		   uint32 hashvalue)
+{
+	int					 partition;
+	MinimalTuple		 tuple;
+	BufFile				*file;
+	int					 written;
+
+	if (work->output_partitions == NULL)
+	{
+		Agg		*agg = (Agg *) aggstate->ss.ps.plan;
+		Size	 group_size = hash_group_size(aggstate->numaggs,
+											  agg->plan_width,
+											  agg->transitionSpace);
+		double	 total_size = group_size * work->input_groups;
+		int		 npartitions;
+		int		 partition_bits;
+
+		/*
+		 * Try to make enough partitions so that each one fits in work_mem,
+		 * with a little slop.
+		 */
+		npartitions = ceil ( (1.5 * total_size) / (work_mem * 1024L) );
+
+		if (npartitions < HASH_DISK_MIN_PARTITIONS)
+			npartitions = HASH_DISK_MIN_PARTITIONS;
+		if (npartitions > HASH_DISK_MAX_PARTITIONS)
+			npartitions = HASH_DISK_MAX_PARTITIONS;
+
+		partition_bits = my_log2(npartitions);
+
+		/* make sure that we don't exhaust the hash bits */
+		if (partition_bits + work->input_bits >= 32)
+			partition_bits = 32 - work->input_bits;
+
+		/* number of partitions will be a power of two */
+		npartitions = 1L << partition_bits;
+
+		work->output_bits = partition_bits;
+		work->n_output_partitions = npartitions;
+		work->output_partitions = palloc0(sizeof(BufFile *) * npartitions);
+		work->output_ntuples = palloc0(sizeof(int64) * npartitions);
+	}
+
+	if (work->output_bits == 0)
+		partition = 0;
+	else
+		partition = (hashvalue << work->input_bits) >>
+			(32 - work->output_bits);
+
+	work->output_ntuples[partition]++;
+
+	if (work->output_partitions[partition] == NULL)
+		work->output_partitions[partition] = BufFileCreateTemp(false);
+	file = work->output_partitions[partition];
+
+	tuple = ExecFetchSlotMinimalTuple(slot);
+
+	written = BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));
+	if (written != sizeof(uint32))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to HashAgg temporary file: %m")));
+	aggstate->hash_disk += written;
+
+	written = BufFileWrite(file, (void *) tuple, tuple->t_len);
+	if (written != tuple->t_len)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to HashAgg temporary file: %m")));
+	aggstate->hash_disk += written;
+}
+
+
+/*
+ * read_saved_tuple
+ *		read the next tuple from a batch file.  Return NULL if no more.
+ *
+ * On success, *hashvalue is set to the tuple's hash value, and the tuple
+ * itself is stored in the given slot.
+ *
+ * Copied with minor modifications from ExecHashJoinGetSavedTuple.
+ */
+static TupleTableSlot *
+read_saved_tuple(BufFile *file, uint32 *hashvalue, TupleTableSlot *tupleSlot)
+{
+	uint32		header[2];
+	size_t		nread;
+	MinimalTuple tuple;
+
+	/*
+	 * Since both the hash value and the MinimalTuple length word are uint32,
+	 * we can read them both in one BufFileRead() call without any type
+	 * cheating.
+	 */
+	nread = BufFileRead(file, (void *) header, sizeof(header));
+	if (nread == 0)				/* end of file */
+	{
+		ExecClearTuple(tupleSlot);
+		return NULL;
+	}
+	if (nread != sizeof(header))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from HashAgg temporary file: %m")));
+	*hashvalue = header[0];
+	tuple = (MinimalTuple) palloc(header[1]);
+	tuple->t_len = header[1];
+	nread = BufFileRead(file,
+						(void *) ((char *) tuple + sizeof(uint32)),
+						header[1] - sizeof(uint32));
+	if (nread != header[1] - sizeof(uint32))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read from HashAgg temporary file: %m")));
+	return ExecStoreMinimalTuple(tuple, tupleSlot, true);
+}
+
+
 /*
  * ExecAgg -
  *
@@ -1107,9 +1333,16 @@ ExecAgg(AggState *node)
 	/* Dispatch based on strategy */
 	if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
 	{
-		if (!node->table_filled)
-			agg_fill_hash_table(node);
-		return agg_retrieve_hash_table(node);
+		TupleTableSlot *slot = NULL;
+
+		while (slot == NULL)
+		{
+			if (!node->table_filled)
+				if (!agg_fill_hash_table(node))
+					break;
+			slot = agg_retrieve_hash_table(node);
+		}
+		return slot;
 	}
 	else
 		return agg_retrieve_direct(node);
@@ -1325,13 +1558,15 @@ agg_retrieve_direct(AggState *aggstate)
 /*
  * ExecAgg for hashed case: phase 1, read input and build hash table
  */
-static void
+static bool
 agg_fill_hash_table(AggState *aggstate)
 {
 	PlanState  *outerPlan;
 	ExprContext *tmpcontext;
 	AggHashEntry entry;
 	TupleTableSlot *outerslot;
+	HashWork	*work;
+	int			 i;
 
 	/*
 	 * get state info from node
@@ -1340,20 +1575,73 @@ agg_fill_hash_table(AggState *aggstate)
 	/* tmpcontext is the per-input-tuple expression context */
 	tmpcontext = aggstate->tmpcontext;
 
+	if (aggstate->hash_work == NIL)
+	{
+		aggstate->agg_done = true;
+		return false;
+	}
+
+	work = linitial(aggstate->hash_work);
+	aggstate->hash_work = list_delete_first(aggstate->hash_work);
+
+	/* if not the first time through, reinitialize */
+	if (!aggstate->hash_init_state)
+	{
+		long	 nbuckets;
+		Agg		*node = (Agg *) aggstate->ss.ps.plan;
+
+		MemoryContextResetAndDeleteChildren(aggstate->hashcontext);
+
+		/*
+		 * If this table will hold only a partition of the input, then use a
+		 * proportionally smaller estimate for nbuckets.
+		 */
+		nbuckets = node->numGroups >> work->input_bits;
+
+		build_hash_table(aggstate, nbuckets);
+	}
+
+	aggstate->hash_init_state = false;
+
 	/*
 	 * Process each outer-plan tuple, and then fetch the next one, until we
 	 * exhaust the outer plan.
 	 */
 	for (;;)
 	{
-		outerslot = ExecProcNode(outerPlan);
-		if (TupIsNull(outerslot))
-			break;
+		uint32 hashvalue;
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (work->input_file == NULL)
+		{
+			outerslot = ExecProcNode(outerPlan);
+			if (TupIsNull(outerslot))
+				break;
+
+			hashvalue = TupleHashEntryHash(aggstate->hashtable, outerslot);
+		}
+		else
+		{
+			outerslot = read_saved_tuple(work->input_file, &hashvalue,
+										 aggstate->hashslot);
+			if (TupIsNull(outerslot))
+			{
+				BufFileClose(work->input_file);
+				work->input_file = NULL;
+				break;
+			}
+		}
+
 		/* set up for advance_aggregates call */
 		tmpcontext->ecxt_outertuple = outerslot;
 
 		/* Find or build hashtable entry for this tuple's group */
-		entry = lookup_hash_entry(aggstate, outerslot);
+		entry = lookup_hash_entry(aggstate, work, outerslot, hashvalue);
+
+		/* Tuple may have been saved for later processing */
+		if (entry == NULL)
+			continue;
 
 		/* Advance the aggregates */
 		advance_aggregates(aggstate, entry->pergroup);
@@ -1362,9 +1650,55 @@ agg_fill_hash_table(AggState *aggstate)
 		ResetExprContext(tmpcontext);
 	}
 
+	if (work->input_file)
+		BufFileClose(work->input_file);
+
+	/* add each output partition as a new work item */
+	for (i = 0; i < work->n_output_partitions; i++)
+	{
+		BufFile			*file = work->output_partitions[i];
+		MemoryContext	 oldContext;
+		HashWork		*new_work;
+		int64			 input_ngroups;
+
+		/* partition is empty */
+		if (file == NULL)
+			continue;
+
+		/* rewind file for reading */
+		if (BufFileSeek(file, 0, 0L, SEEK_SET))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rewind HashAgg temporary file: %m")));
+
+		/*
+		 * Estimate the number of input groups for this new work item as the
+		 * total number of tuples in its input file. Although that's a worst
+		 * case, it's not bad here for two reasons: (1) overestimating is
+		 * better than underestimating; and (2) we've already scanned the
+		 * relation once, so it's likely that we've already finalized many of
+		 * the common values.
+		 */
+		input_ngroups = work->output_ntuples[i];
+
+		oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+		new_work = hash_work(file,
+							 input_ngroups,
+							 work->output_bits + work->input_bits);
+		aggstate->hash_work = lappend(
+			aggstate->hash_work,
+			new_work);
+		aggstate->hash_num_batches++;
+		MemoryContextSwitchTo(oldContext);
+	}
+
+	pfree(work);
+
 	aggstate->table_filled = true;
 	/* Initialize to walk the hash table */
 	ResetTupleHashIterator(aggstate->hashtable, &aggstate->hashiter);
+
+	return true;
 }
 
 /*
@@ -1396,16 +1730,18 @@ agg_retrieve_hash_table(AggState *aggstate)
 	 * We loop retrieving groups until we find one satisfying
 	 * aggstate->ss.ps.qual
 	 */
-	while (!aggstate->agg_done)
+	for (;;)
 	{
+		CHECK_FOR_INTERRUPTS();
+
 		/*
 		 * Find the next entry in the hash table
 		 */
 		entry = (AggHashEntry) ScanTupleHashTable(&aggstate->hashiter);
 		if (entry == NULL)
 		{
-			/* No more entries in hashtable, so done */
-			aggstate->agg_done = TRUE;
+			/* No more entries in hashtable, so done with this batch */
+			aggstate->table_filled = false;
 			return NULL;
 		}
 
@@ -1637,10 +1973,33 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 
 	if (node->aggstrategy == AGG_HASHED)
 	{
-		build_hash_table(aggstate);
+		MemoryContext oldContext;
+
+		aggstate->hash_mem_min = 0;
+		aggstate->hash_mem_peak = 0;
+		aggstate->hash_num_batches = 0;
+		aggstate->hash_init_state = true;
 		aggstate->table_filled = false;
+		aggstate->hash_disk = 0;
+
+		aggstate->hashcontext =
+			AllocSetContextCreate(aggstate->aggcontext,
+								  "HashAgg Hash Table Context",
+								  ALLOCSET_DEFAULT_MINSIZE,
+								  ALLOCSET_DEFAULT_INITSIZE,
+								  ALLOCSET_DEFAULT_MAXSIZE);
+
+		build_hash_table(aggstate, node->numGroups);
+
 		/* Compute the columns we actually need to hash on */
 		aggstate->hash_needed = find_hash_columns(aggstate);
+
+		/* prime with initial work item to read from outer plan */
+		oldContext = MemoryContextSwitchTo(aggstate->aggcontext);
+		aggstate->hash_work = lappend(aggstate->hash_work,
+									  hash_work(NULL, node->numGroups, 0));
+		aggstate->hash_num_batches++;
+		MemoryContextSwitchTo(oldContext);
 	}
 	else
 	{
@@ -2049,32 +2408,34 @@ ExecEndAgg(AggState *node)
 void
 ExecReScanAgg(AggState *node)
 {
+	Agg			*agg = (Agg *) node->ss.ps.plan;
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
-	int			aggno;
+	int			 aggno;
 
 	node->agg_done = false;
 
 	node->ss.ps.ps_TupFromTlist = false;
 
-	if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
+	if (agg->aggstrategy == AGG_HASHED)
 	{
 		/*
-		 * In the hashed case, if we haven't yet built the hash table then we
-		 * can just return; nothing done yet, so nothing to undo. If subnode's
-		 * chgParam is not NULL then it will be re-scanned by ExecProcNode,
-		 * else no reason to re-scan it at all.
+		 * In the hashed case, if we haven't done any execution work yet, we
+		 * can just return; nothing to undo. If subnode's chgParam is not NULL
+		 * then it will be re-scanned by ExecProcNode, else no reason to
+		 * re-scan it at all.
 		 */
-		if (!node->table_filled)
+		if (node->hash_init_state)
 			return;
 
 		/*
-		 * If we do have the hash table and the subplan does not have any
-		 * parameter changes, then we can just rescan the existing hash table;
-		 * no need to build it again.
+		 * If we do have the hash table, it never went to disk, and the
+		 * subplan does not have any parameter changes, then we can just
+		 * rescan the existing hash table; no need to build it again.
 		 */
-		if (node->ss.ps.lefttree->chgParam == NULL)
+		if (node->ss.ps.lefttree->chgParam == NULL && node->hash_disk == 0)
 		{
 			ResetTupleHashIterator(node->hashtable, &node->hashiter);
+			node->table_filled = true;
 			return;
 		}
 	}
@@ -2111,11 +2472,30 @@ ExecReScanAgg(AggState *node)
 	 */
 	MemoryContextResetAndDeleteChildren(node->aggcontext);
 
-	if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
+	if (agg->aggstrategy == AGG_HASHED)
 	{
+		MemoryContext oldContext;
+
+		node->hashcontext =
+			AllocSetContextCreate(node->aggcontext,
+								  "HashAgg Hash Table Context",
+								  ALLOCSET_DEFAULT_MINSIZE,
+								  ALLOCSET_DEFAULT_INITSIZE,
+								  ALLOCSET_DEFAULT_MAXSIZE);
+
 		/* Rebuild an empty hash table */
-		build_hash_table(node);
+		build_hash_table(node, agg->numGroups);
+		node->hash_init_state = true;
 		node->table_filled = false;
+		node->hash_disk = 0;
+		node->hash_work = NIL;
+
+		/* prime with initial work item to read from outer plan */
+		oldContext = MemoryContextSwitchTo(node->aggcontext);
+		node->hash_work = lappend(node->hash_work,
+								  hash_work(NULL, agg->numGroups, 0));
+		node->hash_num_batches++;
+		MemoryContextSwitchTo(oldContext);
 	}
 	else
 	{
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 020558b..ae18ea5 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -75,6 +75,7 @@
 
 #include "access/htup_details.h"
 #include "executor/executor.h"
+#include "executor/nodeAgg.h"
 #include "executor/nodeHash.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
@@ -113,6 +114,7 @@ bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
 bool		enable_hashagg = true;
+bool		enable_hashagg_disk = true;
 bool		enable_nestloop = true;
 bool		enable_material = true;
 bool		enable_mergejoin = true;
@@ -1468,7 +1470,7 @@ cost_agg(Path *path, PlannerInfo *root,
 		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
 		 int numGroupCols, double numGroups,
 		 Cost input_startup_cost, Cost input_total_cost,
-		 double input_tuples)
+		 int input_width, double input_tuples)
 {
 	double		output_tuples;
 	Cost		startup_cost;
@@ -1531,6 +1533,10 @@ cost_agg(Path *path, PlannerInfo *root,
 	else
 	{
 		/* must be AGG_HASHED */
+		double	group_size = hash_group_size(aggcosts->numAggs,
+											 input_width,
+											 aggcosts->transitionSpace);
+
 		startup_cost = input_total_cost;
 		startup_cost += aggcosts->transCost.startup;
 		startup_cost += aggcosts->transCost.per_tuple * input_tuples;
@@ -1538,6 +1544,35 @@ cost_agg(Path *path, PlannerInfo *root,
 		total_cost = startup_cost;
 		total_cost += aggcosts->finalCost * numGroups;
 		total_cost += cpu_tuple_cost * numGroups;
+
+		if (group_size * numGroups > (work_mem * 1024L))
+		{
+			double groups_per_batch = (work_mem * 1024L) / group_size;
+
+			/* first batch doesn't go to disk */
+			double groups_disk = numGroups - groups_per_batch;
+
+			/*
+			 * Assume that the groups that go to disk are of an average number
+			 * of tuples. This is pessimistic -- the largest groups are more
+			 * likely to be processed in the first pass and never go to disk.
+			 */
+			double tuples_disk = groups_disk * (input_tuples / numGroups);
+
+			int tuple_size = sizeof(uint32) /* stored hash value */
+				+ MAXALIGN(sizeof(MinimalTupleData))
+				+ MAXALIGN(input_width);
+			double pages_to_disk = (tuples_disk * tuple_size) / BLCKSZ;
+
+			/*
+			 * Write and then read back the data that's not processed in the
+			 * first pass. Data could be read and written more times than that
+			 * if not enough partitions are created, but the depth will be a
+			 * very small number even for a very large amount of data, so
+			 * ignore it here.
+			 */
+			total_cost += seq_page_cost * 2 * pages_to_disk;
+		}
 		output_tuples = numGroups;
 	}
 
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 655be81..e0abd2c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -4370,6 +4370,9 @@ make_agg(PlannerInfo *root, List *tlist, List *qual,
 	node->grpColIdx = grpColIdx;
 	node->grpOperators = grpOperators;
 	node->numGroups = numGroups;
+	if (aggcosts != NULL)
+		node->transitionSpace = aggcosts->transitionSpace;
+	node->plan_width = lefttree->plan_width;
 
 	copy_plan_costsize(plan, lefttree); /* only care about copying size */
 	cost_agg(&agg_path, root,
@@ -4377,6 +4380,7 @@ make_agg(PlannerInfo *root, List *tlist, List *qual,
 			 numGroupCols, numGroups,
 			 lefttree->startup_cost,
 			 lefttree->total_cost,
+			 lefttree->plan_width,
 			 lefttree->plan_rows);
 	plan->startup_cost = agg_path.startup_cost;
 	plan->total_cost = agg_path.total_cost;
diff --git a/src/backend/optimizer/plan/planagg.c b/src/backend/optimizer/plan/planagg.c
index b90c2ef..f85e1f2 100644
--- a/src/backend/optimizer/plan/planagg.c
+++ b/src/backend/optimizer/plan/planagg.c
@@ -234,7 +234,7 @@ optimize_minmax_aggregates(PlannerInfo *root, List *tlist,
 	cost_agg(&agg_p, root, AGG_PLAIN, aggcosts,
 			 0, 0,
 			 best_path->startup_cost, best_path->total_cost,
-			 best_path->parent->rows);
+			 best_path->parent->width, best_path->parent->rows);
 
 	if (total_cost > agg_p.total_cost)
 		return NULL;			/* too expensive */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 9cbbcfb..be76a2d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2744,7 +2744,8 @@ choose_hashed_grouping(PlannerInfo *root,
 	/* plus the per-hash-entry overhead */
 	hashentrysize += hash_agg_entry_size(agg_costs->numAggs);
 
-	if (hashentrysize * dNumGroups > work_mem * 1024L)
+	if (!enable_hashagg_disk &&
+		hashentrysize * dNumGroups > work_mem * 1024L)
 		return false;
 
 	/*
@@ -2779,7 +2780,7 @@ choose_hashed_grouping(PlannerInfo *root,
 	cost_agg(&hashed_p, root, AGG_HASHED, agg_costs,
 			 numGroupCols, dNumGroups,
 			 cheapest_path->startup_cost, cheapest_path->total_cost,
-			 path_rows);
+			 path_width, path_rows);
 	/* Result of hashed agg is always unsorted */
 	if (target_pathkeys)
 		cost_sort(&hashed_p, root, target_pathkeys, hashed_p.total_cost,
@@ -2810,7 +2811,7 @@ choose_hashed_grouping(PlannerInfo *root,
 		cost_agg(&sorted_p, root, AGG_SORTED, agg_costs,
 				 numGroupCols, dNumGroups,
 				 sorted_p.startup_cost, sorted_p.total_cost,
-				 path_rows);
+				 path_width, path_rows);
 	else
 		cost_group(&sorted_p, root, numGroupCols, dNumGroups,
 				   sorted_p.startup_cost, sorted_p.total_cost,
@@ -2910,7 +2911,8 @@ choose_hashed_distinct(PlannerInfo *root,
 	/* plus the per-hash-entry overhead */
 	hashentrysize += hash_agg_entry_size(0);
 
-	if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
+	if (!enable_hashagg_disk &&
+		hashentrysize * dNumDistinctRows > work_mem * 1024L)
 		return false;
 
 	/*
@@ -2929,7 +2931,7 @@ choose_hashed_distinct(PlannerInfo *root,
 	cost_agg(&hashed_p, root, AGG_HASHED, NULL,
 			 numDistinctCols, dNumDistinctRows,
 			 cheapest_startup_cost, cheapest_total_cost,
-			 path_rows);
+			 path_width, path_rows);
 
 	/*
 	 * Result of hashed agg is always unsorted, so if ORDER BY is present we
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 05f601e..5637a13 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -851,7 +851,7 @@ choose_hashed_setop(PlannerInfo *root, List *groupClauses,
 	cost_agg(&hashed_p, root, AGG_HASHED, NULL,
 			 numGroupCols, dNumGroups,
 			 input_plan->startup_cost, input_plan->total_cost,
-			 input_plan->plan_rows);
+			 input_plan->plan_width, input_plan->plan_rows);
 
 	/*
 	 * Now for the sorted case.  Note that the input is *always* unsorted,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 1395a21..fcfc10c 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1379,7 +1379,7 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
 					 numCols, pathnode->path.rows,
 					 subpath->startup_cost,
 					 subpath->total_cost,
-					 rel->rows);
+					 rel->width, rel->rows);
 	}
 
 	if (all_btree && all_hash)
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 27580af..0b36e14 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -305,11 +305,13 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 			CurrentDynaHashCxt = info->hcxt;
 		else
 			CurrentDynaHashCxt = TopMemoryContext;
-		CurrentDynaHashCxt = AllocSetContextCreate(CurrentDynaHashCxt,
-												   tabname,
-												   ALLOCSET_DEFAULT_MINSIZE,
-												   ALLOCSET_DEFAULT_INITSIZE,
-												   ALLOCSET_DEFAULT_MAXSIZE);
+
+		if ((flags & HASH_NOCHILDCXT) == 0)
+			CurrentDynaHashCxt = AllocSetContextCreate(CurrentDynaHashCxt,
+													   tabname,
+													   ALLOCSET_DEFAULT_MINSIZE,
+													   ALLOCSET_DEFAULT_INITSIZE,
+													   ALLOCSET_DEFAULT_MAXSIZE);
 	}
 
 	/* Initialize the hash header, plus a copy of the table name */
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f6df077..44f8888 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -771,6 +771,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hashagg_disk", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of disk-based hashed aggregation plans."),
+			NULL
+		},
+		&enable_hashagg_disk,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"enable_material", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of materialization."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..a1f3fa1 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -270,6 +270,7 @@
 
 #enable_bitmapscan = on
 #enable_hashagg = on
+#enable_hashagg_disk = on
 #enable_hashjoin = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 40fde83..77c2a03 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -147,6 +147,12 @@ extern TupleHashTable BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
 extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
 					 TupleTableSlot *slot,
 					 bool *isnew);
+extern uint32 TupleHashEntryHash(TupleHashTable hashtable,
+					 TupleTableSlot *slot);
+extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
+					 TupleTableSlot *slot,
+					 uint32 hashvalue,
+					 bool *isnew);
 extern TupleHashEntry FindTupleHashEntry(TupleHashTable hashtable,
 				   TupleTableSlot *slot,
 				   FmgrInfo *eqfunctions,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index fe3b81a..4370c26 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -22,6 +22,7 @@ extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
 extern Size hash_agg_entry_size(int numAggs);
+extern Size hash_group_size(int numAggs, int inputWidth, Size transitionSpace);
 
 extern Datum aggregate_dummy(PG_FUNCTION_ARGS);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 41288ed..b7ddacd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1759,11 +1759,18 @@ typedef struct AggState
 	AggStatePerGroup pergroup;	/* per-Aggref-per-group working state */
 	HeapTuple	grp_firstTuple; /* copy of first tuple of current group */
 	/* these fields are used in AGG_HASHED mode: */
+	MemoryContext hashcontext;	/* subcontext to use for hash table */
 	TupleHashTable hashtable;	/* hash table with one entry per group */
 	TupleTableSlot *hashslot;	/* slot for loading hash table */
 	List	   *hash_needed;	/* list of columns needed in hash table */
+	bool		hash_init_state; /* in initial state before execution? */
 	bool		table_filled;	/* hash table filled yet? */
+	int64		hash_disk;		/* bytes of disk space used */
+	uint64		hash_mem_min;	/* memory used by empty hash table */
+	uint64		hash_mem_peak;	/* memory used at peak of execution */
+	int			hash_num_batches; /* total number of batches created */
 	TupleHashIterator hashiter; /* for iterating through hash table */
+	List	   *hash_work;		/* remaining work to be done */
 } AggState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 316c9ce..4dee074 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -666,6 +666,8 @@ typedef struct Agg
 	AttrNumber *grpColIdx;		/* their indexes in the target list */
 	Oid		   *grpOperators;	/* equality operators to compare with */
 	long		numGroups;		/* estimated number of groups in input */
+	Size		transitionSpace; /* estimated size of by-ref transition val */
+	int			plan_width;		/* input plan width */
 } Agg;
 
 /* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 9c2000b..a5dc906 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -57,6 +57,7 @@ extern bool enable_bitmapscan;
 extern bool enable_tidscan;
 extern bool enable_sort;
 extern bool enable_hashagg;
+extern bool enable_hashagg_disk;
 extern bool enable_nestloop;
 extern bool enable_material;
 extern bool enable_mergejoin;
@@ -102,7 +103,7 @@ extern void cost_agg(Path *path, PlannerInfo *root,
 		 AggStrategy aggstrategy, const AggClauseCosts *aggcosts,
 		 int numGroupCols, double numGroups,
 		 Cost input_startup_cost, Cost input_total_cost,
-		 double input_tuples);
+		 int input_width, double input_tuples);
 extern void cost_windowagg(Path *path, PlannerInfo *root,
 			   List *windowFuncs, int numPartCols, int numOrderCols,
 			   Cost input_startup_cost, Cost input_total_cost,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 8c2432c..d185bf1 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -94,6 +94,9 @@ typedef struct HASHCTL
 #define HASH_SHARED_MEM 0x0800	/* Hashtable is in shared memory */
 #define HASH_ATTACH		0x1000	/* Do not initialize hctl */
 #define HASH_FIXED_SIZE 0x2000	/* Initial size is a hard limit */
+#define HASH_NOCHILDCXT 0x4000	/* Don't create a child context. Warning:
+								 * hash_destroy will delete the memory context
+								 * specified by the caller. */
 
 
 /* max_dsize value to indicate expansible directory */
diff --git a/src/test/regress/expected/rangefuncs.out b/src/test/regress/expected/rangefuncs.out
index 7991e99..d61a438 100644
--- a/src/test/regress/expected/rangefuncs.out
+++ b/src/test/regress/expected/rangefuncs.out
@@ -3,6 +3,7 @@ SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
 ----------------------+---------
  enable_bitmapscan    | on
  enable_hashagg       | on
+ enable_hashagg_disk  | on
  enable_hashjoin      | on
  enable_indexonlyscan | on
  enable_indexscan     | on
@@ -12,7 +13,7 @@ SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
  enable_seqscan       | on
  enable_sort          | on
  enable_tidscan       | on
-(11 rows)
+(12 rows)
 
 CREATE TABLE foo2(fooid int, f2 int);
 INSERT INTO foo2 VALUES(1, 11);
-- 
2.0.5
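
A quick illustration of the new HASH_NOCHILDCXT flag from the dynahash
hunk above: when it is passed, hash_create() puts the table directly
into the context supplied through the HASHCTL's hcxt field instead of
creating a private child AllocSet (and, per the warning added to
hsearch.h, hash_destroy() will then delete that caller-supplied
context). A minimal caller sketch, purely illustrative - the entry
struct, table name and sizes are made up:

    HASHCTL     hash_ctl;
    HTAB       *htab;

    MemSet(&hash_ctl, 0, sizeof(hash_ctl));
    hash_ctl.keysize = sizeof(uint32);
    hash_ctl.entrysize = sizeof(MyEntry);   /* hypothetical entry struct */
    hash_ctl.hash = tag_hash;
    hash_ctl.hcxt = aggstate->aggcontext;   /* allocate directly in here */

    htab = hash_create("my grouping table", 256, &hash_ctl,
                       HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT |
                       HASH_NOCHILDCXT);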

From 4021282c6504137ba40407ecb31502d789bd1000 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <t...@fuzzy.cz>
Date: Wed, 7 Jan 2015 18:18:51 +0100
Subject: [PATCH 3/3] fix hashcontext -> aggcontext in agg node

---
 src/backend/executor/nodeAgg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3b9f4cc..d7cd899 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -997,7 +997,7 @@ build_hash_table(AggState *aggstate, long nbuckets)
 											  tmpmem);
 
 	aggstate->hash_mem_min = MemoryContextMemAllocated(
-		aggstate->hashcontext, true);
+		aggstate->aggcontext, true);
 
 	if (aggstate->hash_mem_min > aggstate->hash_mem_peak)
 		aggstate->hash_mem_peak = aggstate->hash_mem_min;
@@ -1108,7 +1108,7 @@ lookup_hash_entry(AggState *aggstate, HashWork *work,
 		hashslot->tts_isnull[varNumber] = inputslot->tts_isnull[varNumber];
 	}
 
-	hash_mem = MemoryContextMemAllocated(aggstate->hashcontext, true);
+	hash_mem = MemoryContextMemAllocated(aggstate->aggcontext, true);
 	if (hash_mem > aggstate->hash_mem_peak)
 		aggstate->hash_mem_peak = hash_mem;
 
-- 
2.0.5
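
And to make the effect of this fix easier to see, the bookkeeping the
two hunks touch boils down to roughly the following (a sketch only; I'm
assuming the spill threshold is work_mem * 1024L, the same limit the
planner checks use earlier in the patch - the actual executor logic
lives in lookup_hash_entry in the main hashagg patch):

    uint64      hash_mem;

    /* everything allocated under aggcontext, including child contexts */
    hash_mem = MemoryContextMemAllocated(aggstate->aggcontext, true);
    if (hash_mem > aggstate->hash_mem_peak)
        aggstate->hash_mem_peak = hash_mem;

    /* hypothetical spill condition, mirroring the planner's threshold */
    if (hash_mem > work_mem * 1024L)
    {
        /*
         * Stop creating new groups; further tuples for missing groups go
         * to a batch file (tracked via hash_disk / hash_num_batches).
         */
    }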

Attachment: test.sql
Description: application/sql
