On 9.7.2014 16:07, Robert Haas wrote:
> On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra <[email protected]> wrote:
>> Thinking about this a bit more, do we really need to build the hash
>> table on the first pass? Why not to do this:
>>
>> (1) batching
>> - read the tuples, stuff them into a simple list
>> - don't build the hash table yet
>>
>> (2) building the hash table
>> - we have all the tuples in a simple list, batching is done
>> - we know exact row count, can size the table properly
>> - build the table
>
> We could do this, and in fact we could save quite a bit of memory if
> we allocated say 1MB chunks and packed the tuples in tightly instead
> of palloc-ing each one separately. But I worry that rescanning the
> data to build the hash table would slow things down too much.
I did a quick test of how much memory we could save by this. The
attached patch densely packs the tuples into 32kB chunks (1MB seems way
too much because of small work_mem values, but I guess this might be
tuned based on number of tuples / work_mem size ...).
Tested on query like this (see the first message in this thread how to
generate the tables):
QUERY PLAN
-----------------------------------------------------------------------
Aggregate (cost=2014697.64..2014697.65 rows=1 width=33) (actual
time=63796.270..63796.271 rows=1 loops=1)
-> Hash Left Join (cost=318458.14..1889697.60 rows=50000016
width=33) (actual time=2865.656..61778.592 rows=50000000 loops=1)
Hash Cond: (o.id = i.id)
-> Seq Scan on outer_table o (cost=0.00..721239.16
rows=50000016 width=4) (actual time=0.033..2676.234 rows=50000000 loops=1)
-> Hash (cost=193458.06..193458.06 rows=10000006 width=37)
(actual time=2855.408..2855.408 rows=10000000 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 703125kB
-> Seq Scan on inner_table i (cost=0.00..193458.06
rows=10000006 width=37) (actual time=0.044..952.802 rows=10000000 loops=1)
Planning time: 1.139 ms
Execution time: 63889.056 ms
(9 rows)
I.e. it creates a single batch with ~700 MB of tuples. Without the
patch, top shows this:
VIRT RES SHR S %CPU %MEM COMMAND
2540356 1,356g 5936 R 100,0 17,6 postgres: EXPLAIN
and the MemoryContextStats added to MultiExecHash shows this:
HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11
chunks); 1448394448 used
So yeah, the overhead is pretty huge in this case - basicaly 100%
overhead, because the inner table row width is only ~40B. With wider
rows the overhead will be lower.
Now, with the patch it looks like this:
VIRT RES SHR S %CPU %MEM COMMAND
1835332 720200 6096 R 100,0 8,9 postgres: EXPLAIN
HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks);
729651520 used
So, pretty much no overhead at all. It was slightly faster too (~5%) but
I haven't done much testing so it might be measurement error.
This patch is pretty independent of the other changes discussed here
(tweaking NTUP_PER_BUCKET / nbuckets) so I'll keep it separate.
regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..18fd4a9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
int bucketNumber);
static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
/* ----------------------------------------------------------------
* ExecHash
@@ -130,6 +131,9 @@ MultiExecHash(HashState *node)
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, hashtable->totalTuples);
+ /* print some context stats */
+ MemoryContextStats(hashtable->batchCxt);
+
/*
* We do not return the hash table directly because it's not a subtype of
* Node, and so would violate the MultiExecProcNode API. Instead, our
@@ -223,6 +227,8 @@ ExecEndHash(HashState *node)
ExecEndNode(outerPlan);
}
+/* 32kB chunks by default */
+#define CHUNK_SIZE (32*1024L)
/* ----------------------------------------------------------------
* ExecHashTableCreate
@@ -294,6 +300,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
hashtable->spaceAllowedSkew =
hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
+ hashtable->chunk_data = NULL;
+ hashtable->chunk_used = 0;
+ hashtable->chunk_length = 0;
+
/*
* Get info about the hash functions to be used for each hash key. Also
* remember whether the join operators are strict.
@@ -717,8 +727,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
/* Create the HashJoinTuple */
hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
- hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
- hashTupleSize);
+ hashTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
@@ -1068,6 +1078,13 @@ ExecHashTableReset(HashJoinTable hashtable)
hashtable->spaceUsed = 0;
MemoryContextSwitchTo(oldcxt);
+
+ /* reset the chunks too (the memory was allocated within batchCxt, so it's
+ * already freed) */
+ hashtable->chunk_data = NULL;
+ hashtable->chunk_length = 0;
+ hashtable->chunk_used = 0;
+
}
/*
@@ -1318,6 +1335,31 @@ ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue)
return INVALID_SKEW_BUCKET_NO;
}
+static
+char * chunk_alloc(HashJoinTable hashtable, int tupleSize) {
+
+ /* if tuple size is greater than of chunk size, just use MemoryContextAlloc directly */
+ /* TODO maybe using ~20% of chunk size would be more appropriate here */
+ if (tupleSize > CHUNK_SIZE)
+ return MemoryContextAlloc(hashtable->batchCxt, tupleSize);
+
+ /* ok, it's within chunk size, let's see if we have enough space for it in the current
+ * chunk => if not, allocate a new chunk (works for the first call because the NULL
+ * chunk has length=used=0) */
+ if ((hashtable->chunk_length - hashtable->chunk_used) < tupleSize) {
+ hashtable->chunk_data = MemoryContextAlloc(hashtable->batchCxt, CHUNK_SIZE);
+ hashtable->chunk_used = 0;
+ hashtable->chunk_length = CHUNK_SIZE;
+ }
+
+ /* OK, we have enough space in the chunk, let's add the tuple */
+ hashtable->chunk_used += tupleSize;
+
+ /* allocate pointer to the start of the tuple memory */
+ return hashtable->chunk_data + (hashtable->chunk_used - tupleSize);
+
+}
+
/*
* ExecHashSkewTableInsert
*
@@ -1338,8 +1380,8 @@ ExecHashSkewTableInsert(HashJoinTable hashtable,
/* Create the HashJoinTuple */
hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
- hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
- hashTupleSize);
+ hashTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 3beae40..3e63175 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -157,6 +157,11 @@ typedef struct HashJoinTableData
MemoryContext hashCxt; /* context for whole-hash-join storage */
MemoryContext batchCxt; /* context for this-batch-only storage */
+
+ char *chunk_data; /* memory for dense-packing tuples */
+ Size chunk_length; /* size of the chunk */
+ Size chunk_used; /* currently-allocated memory in the chunk */
+
} HashJoinTableData;
#endif /* HASHJOIN_H */
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers