On 26.6.2014 23:48, Tomas Vondra wrote:
> On 26.6.2014 20:43, Tomas Vondra wrote:
>> Attached is v2 of the patch, with some cleanups / minor improvements:
>>
>> * there's a single FIXME, related to counting tuples in the
> 
> Meh, I couldn't resist resolving this FIXME, so attached is v3 of the
> patch. This just adds a proper 'batch tuples' counter to the hash table.
> 
> All comments, measurements on different queries etc. welcome. We'll
> certainly do a lot of testing, because this was a big issue for us.

Attached is v4 of the patch, with a few minor improvements. The only
thing worth mentioning is overflow protection, similar to what's done in
the ExecChooseHashTableSize() function. Otherwise it's mostly about
improving comments.
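
In case someone wants the gist without reading the whole diff: the insert
path now counts the tuples in the current batch, and once the average per
bucket exceeds NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT it doubles the
bucket array and relinks the tuples. Here's a stripped-down standalone
sketch of that idea, using made-up toy types (Table, Tuple, insert_tuple)
instead of the executor structures; it's an illustration of the approach,
not the patch code itself (that lives in ExecHashTableInsert and the new
ExecHashIncreaseNumBuckets below):

    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTUP_PER_BUCKET        10
    #define NTUP_GROW_COEFFICIENT  1.333
    #define NTUP_GROW_THRESHOLD    (NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)

    typedef struct Tuple
    {
        struct Tuple *next;
        uint32_t      hashvalue;
    } Tuple;

    typedef struct Table
    {
        int       nbuckets;     /* always a power of two */
        int64_t   ntuples;      /* tuples currently in memory */
        Tuple   **buckets;
    } Table;

    /* double the bucket array and relink all tuples (error handling omitted) */
    static void
    increase_num_buckets(Table *t)
    {
        int     oldnbuckets = t->nbuckets;
        Tuple **oldbuckets = t->buckets;
        int     i;

        /* overflow guard, same spirit as the check in the patch */
        if (oldnbuckets >= INT_MAX / 2)
            return;

        t->nbuckets *= 2;
        t->buckets = calloc(t->nbuckets, sizeof(Tuple *));

        for (i = 0; i < oldnbuckets; i++)
        {
            Tuple *tuple = oldbuckets[i];

            while (tuple != NULL)
            {
                Tuple  *next = tuple->next;
                int     bucketno = tuple->hashvalue & (t->nbuckets - 1);

                tuple->next = t->buckets[bucketno];
                t->buckets[bucketno] = tuple;
                tuple = next;
            }
        }
        free(oldbuckets);
    }

    static void
    insert_tuple(Table *t, Tuple *tuple)
    {
        int bucketno = tuple->hashvalue & (t->nbuckets - 1);

        tuple->next = t->buckets[bucketno];
        t->buckets[bucketno] = tuple;
        t->ntuples++;

        /* the new trigger: grow once the average tuples per bucket gets too high */
        if (t->ntuples > t->nbuckets * NTUP_GROW_THRESHOLD)
            increase_num_buckets(t);
    }

    int
    main(void)
    {
        Table   t = {16, 0, calloc(16, sizeof(Tuple *))};
        int     i;

        for (i = 0; i < 100000; i++)
        {
            Tuple *tup = malloc(sizeof(Tuple));

            tup->hashvalue = (uint32_t) i * 2654435761u;    /* toy hash */
            insert_tuple(&t, tup);
        }
        printf("nbuckets = %d for %lld tuples\n", t.nbuckets, (long long) t.ntuples);
        return 0;
    }

The real code differs mainly in that it allocates the new bucket array in
the batch memory context and computes the bucket number via
ExecHashGetBucketAndBatch.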

Also attached is a v4 with a GUC, making it easier to compare the effect
of the patch by simply setting "enable_hashjoin_bucket" to "off" (original
behaviour) or "on" (new behaviour).

And finally there's an SQL script demonstrating the effect of the patch
with various work_mem settings. For example, this is what I see on my
desktop (averages from 3 runs):

===== SMALL WORK MEM (2MB) =====
          no dynamic buckets     dynamic buckets
query A   5945 ms                5969 ms
query B   6080 ms                5943 ms
query C   6531 ms                6822 ms
query D   6962 ms                6618 ms

===== MEDIUM WORK MEM (16MB) =====
          no dynamic buckets     dynamic buckets
query A   7955 ms                7944 ms
query B   9970 ms                7569 ms
query C   8643 ms                8560 ms
query D  33908 ms                7700 ms

===== LARGE WORK MEM (64MB) =====
          no dynamic buckets     dynamic buckets
query A   10235 ms               10233 ms
query B   32229 ms                9318 ms
query C   14500 ms               10554 ms
query D  213344 ms                9145 ms

Where "A" is "exactly estimated" and the other queries suffer by various
underestimates. My observations from this are:

(1) For small work_mem values it does not really matter, thanks to the
    caching effects (the whole hash table fits into L2 CPU cache).

(2) For medium work_mem values (not really huge, but exceeding CPU
    caches), the differences are negligible, except for the last query,
    which has the most severe underestimate. In that case the new
    behaviour is much faster.

(3) For large work_mem values, the speedup is pretty obvious and depends
    on the severity of the underestimate.

The question is why one would choose large work_mem values when the
smaller values actually perform better here. Well, the example tables are
not perfectly representative. If the outer table is much larger and does
not fit into RAM as easily (as is the case with large fact tables or
joins), the rescans (caused by more batches) become more expensive and
outweigh the caching benefits.

Also, work_mem is shared with other nodes, e.g. aggregates, and decreasing
it just because of hash joins would hurt them.

regards
Tomas
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..96fdd68 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -39,6 +39,7 @@
 
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +272,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +287,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->batchTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,6 +389,23 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ *
+ * Once the load factor (in NTUP_PER_BUCKET multiples) exceeds this
+ * threshold, we double nbuckets. We want the load factor to oscillate
+ * around 1.0, so with 'a' the load after growth and 'b' the threshold:
+ *
+ *   b = 2a         (doubling the buckets halves the load)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ *
+ * which gives b = 4/3 = 1.3333. A higher growth factor would mean a
+ * higher threshold: growing to (4*nbuckets) means b = 4a, i.e. 1.6,
+ * and (8*nbuckets) gives b = 1.7777, etc.
+ *
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -651,6 +671,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* prevtuple doesn't change */
 				hashtable->spaceUsed -=
 					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
+				hashtable->batchTuples -= 1;
 				pfree(tuple);
 				nfreed++;
 			}
@@ -682,6 +703,111 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int         ntuples = 0;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+
+	/*
+	 * Safety check to avoid overflow. This should only happen with very large
+	 * work_mem values, because (INT_MAX/2) means ~8GB only for the buckets.
+	 * With tuples, the hash table would require tens of GBs of work_mem.
+	 * 
+	 * XXX Technically there's also a limit on buckets fitting into work_mem
+	 * (with NTUP_PER_BUCKET tuples each), but that can't really be exceeded,
+	 * because once work_mem fills up the number of batches is doubled (so
+	 * the number of in-memory tuples drops and fewer buckets are needed).
+	 * 
+	 * That is, something like this will be enforced implicitly:
+	 * 
+	 *    work_mem * 1024L >= (nbuckets * tupsize * NTUP_GROW_THRESHOLD)
+	 * 
+	 * So it's enough to check only the overflow here.
+	 */
+	if (oldnbuckets >= (INT_MAX/2))
+		return;
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	/* update the hashtable info, so that we can compute buckets etc. */
+	hashtable->log2_nbuckets += 1;
+	hashtable->nbuckets *= 2;
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *) palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, moving the tuples into the new buckets */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save the next link; tuple->next is overwritten below */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+
+			ntuples++;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, (float)ntuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -734,12 +860,29 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* increase the number of tuples in the batch */
+		hashtable->batchTuples += 1;
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
+		/* If the average number of tuples per bucket exceeds the threshold,
+		 * double the number of buckets. */
+		if (hashtable->batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets to %d because average per bucket = %.1f\n",
+				   hashtable->nbuckets * 2, (float)hashtable->batchTuples / hashtable->nbuckets);
+#endif
+
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		}
+
 	}
 	else
 	{
@@ -1066,6 +1209,7 @@ ExecHashTableReset(HashJoinTable hashtable)
 		palloc0(nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
+	hashtable->batchTuples = 0;
 
 	MemoryContextSwitchTo(oldcxt);
 }
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..2858f82 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
@@ -129,6 +130,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	int64		batchTuples;	/* # tuples in the current batch */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0d9663c..db3a953 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+								hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-							 hashtable->nbuckets, hashtable->nbatch,
-							 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 hashtable->nbuckets, hashtable->nbuckets_original,
+							 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..4cf6454 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hashjoin_bucket = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 					  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -271,6 +273,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_original = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +288,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->batchTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -386,6 +390,23 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 /* Target bucket loading (tuples per bucket) */
 #define NTUP_PER_BUCKET			10
 
+/* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
+ *
+ * Once the load factor (in NTUP_PER_BUCKET multiples) exceeds this
+ * threshold, we double nbuckets. We want the load factor to oscillate
+ * around 1.0, so with 'a' the load after growth and 'b' the threshold:
+ *
+ *   b = 2a         (doubling the buckets halves the load)
+ *   (a + b)/2 = 1  (oscillate around NTUP_PER_BUCKET)
+ *
+ * which gives b = 4/3 = 1.3333. A higher growth factor would mean a
+ * higher threshold: growing to (4*nbuckets) means b = 4a, i.e. 1.6,
+ * and (8*nbuckets) gives b = 1.7777, etc.
+ *
+ * Let's start with doubling the bucket count, i.e. 1.333. */
+#define NTUP_GROW_COEFFICIENT	1.333
+#define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
+
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
@@ -651,6 +672,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 				/* prevtuple doesn't change */
 				hashtable->spaceUsed -=
 					HJTUPLE_OVERHEAD + HJTUPLE_MINTUPLE(tuple)->t_len;
+				hashtable->batchTuples -= 1;
 				pfree(tuple);
 				nfreed++;
 			}
@@ -682,6 +704,111 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 }
 
 /*
+ * ExecHashIncreaseNumBuckets
+ *		increase the original number of buckets in order to reduce
+ *		number of tuples per bucket
+ */
+static void
+ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
+{
+	int			i;
+	int         ntuples = 0;
+	int			oldnbuckets = hashtable->nbuckets;
+	HashJoinTuple  *oldbuckets = hashtable->buckets;
+	MemoryContext   oldcxt;
+
+	/*
+	 * Safety check to avoid overflow. This should only happen with very large
+	 * work_mem values, because (INT_MAX/2) means ~8GB only for the buckets.
+	 * With tuples, the hash table would require tens of GBs of work_mem.
+	 * 
+	 * XXX Technically there's also a limit on buckets fitting into work_mem
+	 * (with NTUP_PER_BUCKET tuples each), but that can't really be exceeded,
+	 * because once work_mem fills up the number of batches is doubled (so
+	 * the number of in-memory tuples drops and fewer buckets are needed).
+	 * 
+	 * That is, something like this will be enforced implicitly:
+	 * 
+	 *    work_mem * 1024L >= (nbuckets * tupsize * NTUP_GROW_THRESHOLD)
+	 * 
+	 * So it's enough to check only the overflow here.
+	 */
+	if (oldnbuckets >= (INT_MAX/2))
+		return;
+
+	/* XXX Not sure if we should update the info about used space here.
+	 * The code seems to ignore the space used for 'buckets' and we're not
+	 * allocating more space for tuples (just shuffling them to the new
+	 * buckets). And the amount of memory used for buckets is quite small
+	 * (just an array of pointers, thus ~8kB per 1k buckets on 64-bit). */
+
+	/* XXX Should we disable growth if (nbuckets * NTUP_PER_BUCKET)
+	 * reaches work_mem (or something like that)? We shouldn't really
+	 * get into such position (should be handled by increasing the
+	 * number of batches, which is called right before this). */
+
+	/* XXX Maybe adding info into hashjoin explain output (e.g. initial
+	 * nbuckets, time spent growing the table) would be appropriate. */
+
+	/* update the hashtable info, so that we can compute buckets etc. */
+	hashtable->log2_nbuckets += 1;
+	hashtable->nbuckets *= 2;
+
+	Assert(hashtable->nbuckets > 1);
+	Assert(hashtable->nbuckets == (1 << hashtable->log2_nbuckets));
+
+#ifdef HJDEBUG
+	printf("Increasing nbuckets to %d\n", hashtable->nbuckets);
+#endif
+
+	/* TODO Maybe it'd be better to resize the buckets in place (should be possible,
+	 * but when I tried it I always ended up with a strange infinite loop). */
+
+	/* allocate a new bucket list (use the batch context as before) */
+	oldcxt = MemoryContextSwitchTo(hashtable->batchCxt);
+
+	hashtable->buckets = (HashJoinTuple *) palloc0(hashtable->nbuckets * sizeof(HashJoinTuple));
+
+	MemoryContextSwitchTo(oldcxt);
+
+	/* walk through the old buckets, moving the tuples into the new buckets */
+	for (i = 0; i < oldnbuckets; i++)
+	{
+
+		HashJoinTuple tuple = oldbuckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save the next link; tuple->next is overwritten below */
+			HashJoinTuple nexttuple = tuple->next;
+			int			bucketno;
+			int			batchno;
+
+			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+									  &bucketno, &batchno);
+
+			/* move it to the correct bucket */
+			tuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = tuple;
+
+			/* process the next tuple */
+			tuple = nexttuple;
+
+			ntuples++;
+		}
+	}
+
+	pfree(oldbuckets);
+
+#ifdef HJDEBUG
+	printf("Nbuckets increased to %d, average items per bucket %.1f\n",
+		   hashtable->nbuckets, (float)ntuples / hashtable->nbuckets);
+#endif
+
+}
+
+
+/*
  * ExecHashTableInsert
  *		insert a tuple into the hash table depending on the hash value
  *		it may just go to a temp file for later batches
@@ -734,12 +861,30 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		/* increase the number of tuples in the batch */
+		hashtable->batchTuples += 1;
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed > hashtable->spaceAllowed)
 			ExecHashIncreaseNumBatches(hashtable);
+
+		/* If the average number of tuples per bucket exceeds the threshold,
+		 * double the number of buckets (unless nbuckets growth is disabled). */
+		if (enable_hashjoin_bucket &&
+			(hashtable->batchTuples > (hashtable->nbuckets * NTUP_GROW_THRESHOLD))) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets to %d because average per bucket = %.1f\n",
+				   hashtable->nbuckets * 2, (float)hashtable->batchTuples / hashtable->nbuckets);
+#endif
+
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		}
+
 	}
 	else
 	{
@@ -1066,6 +1211,7 @@ ExecHashTableReset(HashJoinTable hashtable)
 		palloc0(nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
+	hashtable->batchTuples = 0;
 
 	MemoryContextSwitchTo(oldcxt);
 }
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 3a31a75..c92cc26 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -788,6 +788,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"enable_hashjoin_bucket", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables dynamic increase of hash buckets."),
+			NULL
+		},
+		&enable_hashjoin_bucket,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
 			gettext_noop("Enables genetic query optimization."),
 			gettext_noop("This algorithm attempts to do planning without "
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..2858f82 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -106,6 +106,7 @@ typedef struct HashSkewBucket
 typedef struct HashJoinTableData
 {
 	int			nbuckets;		/* # buckets in the in-memory hash table */
+	int			nbuckets_original;	/* # buckets when starting the first hash */
 	int			log2_nbuckets;	/* its log2 (nbuckets must be a power of 2) */
 
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
@@ -129,6 +130,7 @@ typedef struct HashJoinTableData
 	bool		growEnabled;	/* flag to shut off nbatch increases */
 
 	double		totalTuples;	/* # tuples obtained from inner plan */
+	int64		batchTuples;	/* # tuples in the current batch */
 
 	/*
 	 * These arrays are allocated for the life of the hash join, but only if
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75be5bd..15604cb 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -50,4 +50,6 @@ extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *num_skew_mcvs);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);
 
+extern bool enable_hashjoin_bucket;
+
 #endif   /* NODEHASH_H */
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 75e2afb..60b8da8 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -62,6 +62,7 @@ extern bool enable_material;
 extern bool enable_mergejoin;
 extern bool enable_hashjoin;
 extern int	constraint_exclusion;
+extern bool enable_hashjoin_bucket;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,

Attachment: hashjoin.sql
Description: application/sql
