Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-08-20 Thread Heikki Linnakangas

On 07/20/2014 07:17 PM, Tomas Vondra wrote:

On 19.7.2014 20:24, Tomas Vondra wrote:

On 13.7.2014 21:32, Tomas Vondra wrote:

The current patch only implements this for tuples in the main
hash table, not for skew buckets. I plan to do that, but it will
require separate chunks for each skew bucket (so we can remove it
without messing with all of them). The chunks for skew buckets
should be smaller (I'm thinking about ~4kB), because there'll be
more of them, but OTOH those buckets are for frequent values so the
chunks should not remain empty.


I've looked into extending the dense allocation to the skew buckets,
and I think we shouldn't do that. I got about 50% of the changes and
then just threw it out because it turned out quite pointless.

The amount of memory for skew buckets is limited to 2% of work mem,
so even with 100% overhead it'll use ~4% of work mem. So there's
pretty much nothing to gain here. So the additional complexity
introduced by the dense allocation seems pretty pointless.

I'm not entirely happy with the code allocating some memory densely
and some using traditional palloc, but it certainly seems cleaner
than the code I had.

So I think the patch is mostly OK as is.


Attached is v4 of the patch, mostly with minor improvements and cleanups
(removed MemoryContextStats, code related to skew buckets).


Can you remind me what kind of queries this is supposed to help with? 
Could you do some performance testing to demonstrate the effect? And 
also a worst-case scenario.


At a quick glance, I think you're missing a fairly obvious trick in 
ExecHashIncreaseNumBatches: if a chunk contains only a single tuple, you 
can avoid copying it to a new chunk and just link the old chunk to the 
new list instead. Not sure if ExecHashIncreaseNumBatches is called often 
enough for that to be noticeable, but at least it should help in extreme 
cases where you push around huge >100MB tuples.


- Heikki




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-08-20 Thread Tomas Vondra
On 20 August 2014, 14:05, Heikki Linnakangas wrote:
 On 07/20/2014 07:17 PM, Tomas Vondra wrote:
 On 19.7.2014 20:24, Tomas Vondra wrote:
 On 13.7.2014 21:32, Tomas Vondra wrote:
 The current patch only implements this for tuples in the main
 hash table, not for skew buckets. I plan to do that, but it will
 require separate chunks for each skew bucket (so we can remove it
 without messing with all of them). The chunks for skew buckets
 should be smaller (I'm thinking about ~4kB), because there'll be
 more of them, but OTOH those buckets are for frequent values so the
 chunks should not remain empty.

 I've looked into extending the dense allocation to the skew buckets,
 and I think we shouldn't do that. I got about 50% of the changes and
 then just threw it out because it turned out quite pointless.

 The amount of memory for skew buckets is limited to 2% of work mem,
 so even with 100% overhead it'll use ~4% of work mem. So there's
 pretty much nothing to gain here. So the additional complexity
 introduced by the dense allocation seems pretty pointless.

 I'm not entirely happy with the code allocating some memory densely
 and some using traditional palloc, but it certainly seems cleaner
 than the code I had.

 So I think the patch is mostly OK as is.

 Attached is v4 of the patch, mostly with minor improvements and cleanups
 (removed MemoryContextStats, code related to skew buckets).

 Can you remind me what kind of queries this is supposed to help with?
 Could you do some performance testing to demonstrate the effect? And
 also a worst-case scenario.

The dense allocation? That was not really directed at any specific
queries, it was about lowering hashjoin memory requirements in general.

First to make it more correct with respect to work_mem (hashjoin does not
account for the palloc overhead, so the actual memory consumption might be
much higher than work_mem).

Second to compensate for the memory for more buckets due to
NTUP_PER_BUCKET, which is tweaked in the 'hashjoin dynamic nbuckets'
patch.

There are some numbers / more detailed analysis in
http://www.postgresql.org/message-id/a17d6217fe0c9e459cb45cb764ad727c.squir...@sq.gransy.com


 At a quick glance, I think you're missing a fairly obvious trick in
 ExecHashIncreaseNumBatches: if a chunk contains only a single tuple, you
 can avoid copying it to a new chunk and just link the old chunk to the
 new list instead. Not sure if ExecHashIncreaseNumBatches is called often
 enough for that to be noticeable, but at least it should help in extreme
 cases where you push around huge >100MB tuples.

Yeah, I thought about that too, but it seemed like a rare corner case. But
maybe you're right - it will also eliminate the 'fluctuation' (allocating
a 100MB chunk, which might get us over work_mem, ...). Not sure if this is
going to help, but it's easy enough to fix, I guess.
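
For illustration, here is a rough sketch of that shortcut, placed at the top
of the chunk-walking loop in ExecHashIncreaseNumBatches. It assumes a
hypothetical ntuples counter in HashChunkData (the v4 patch does not track
one), and it omits bookkeeping such as ninmemory:

	while (oldchunks != NULL)
	{
		HashChunk	nextchunk = oldchunks->next;

		if (oldchunks->ntuples == 1)
		{
			/* single tuple: maybe we can keep the whole chunk as-is */
			HashJoinTuple hashTuple = (HashJoinTuple) oldchunks->data;
			int			bucketno;
			int			batchno;

			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
									  &bucketno, &batchno);

			if (batchno == curbatch)
			{
				/* relink the old chunk into the new list, no copy needed */
				oldchunks->next = hashtable->chunks_batch;
				hashtable->chunks_batch = oldchunks;

				/* and put the tuple back into its bucket */
				hashTuple->next = hashtable->buckets[bucketno];
				hashtable->buckets[bucketno] = hashTuple;

				oldchunks = nextchunk;
				continue;
			}
		}

		/* ... otherwise fall through to the existing copy-or-dump code ... */
	}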

The other optimization I'm thinking about is that when increasing the number
of batches, the expectation is that only about 1/2 of the chunks will be
necessary. So instead of freeing a chunk, we might keep it and reuse it
later. That might lower the palloc overhead a bit (but it's already very
low, so I don't think it's a measurable difference).
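
A rough sketch of that reuse idea, assuming a hypothetical chunks_free list
on HashJoinTableData (nothing like this exists in the posted patch; newChunk
and chunkSize are placeholder names):

	/* in ExecHashIncreaseNumBatches, instead of pfree-ing an emptied chunk: */
	oldchunks->used = 0;
	oldchunks->next = hashtable->chunks_free;
	hashtable->chunks_free = oldchunks;

	/* and in chunk_alloc, before falling back to MemoryContextAlloc: */
	if (hashtable->chunks_free != NULL)
	{
		newChunk = hashtable->chunks_free;
		hashtable->chunks_free = newChunk->next;
	}
	else
		newChunk = (HashChunk) MemoryContextAlloc(hashtable->batchCxt,
								offsetof(HashChunkData, data) + chunkSize);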

regards
Tomas







Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-20 Thread Tomas Vondra
On 20.7.2014 00:12, Tomas Vondra wrote:
 On 19.7.2014 23:07, Tomas Vondra wrote:
 On 19.7.2014 20:28, Tomas Vondra wrote:
 For the first case, a WARNING at the end of estimate_hash_bucketsize
 says this:

 WARNING:  nbuckets=8388608.00 estfract=0.01
 WARNING:  nbuckets=65536.00 estfract=0.000267

 There are 4.3M rows in the big_table, so it says ~4 tuples/bucket (selectivity
 = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).

 For the second case, I get this:

 WARNING:  nbuckets=16777216.00 estfract=0.01
 WARNING:  nbuckets=262144.00 estfract=0.000100

 i.e. ~11 tuples/bucket for big_table and ~20 tuples/bucket for small_table.

 That sounds really strange, because small_table in the second test case
 is almost perfectly unique. And in the first test case, there are only
 very few keys with 2 occurrences.
 
 FWIW, the problem seems to be this:
 
  /*
   * Adjust estimated bucketsize upward to account for skewed
   * distribution.
   */
  if (avgfreq > 0.0 && mcvfreq > avgfreq)
      estfract *= mcvfreq / avgfreq;
 
 Which is explained in the estimate_hash_bucketsize comment like this:
 
 * be nonuniformly occupied.  If the other relation in the join has a key
 * distribution similar to this one's, then the most-loaded buckets are
 * exactly those that will be probed most often.  Therefore, the average
 * bucket size for costing purposes should really be taken as something
 * close to the worst case bucket size.  We try to estimate this by
 * adjusting the fraction if there are too few distinct data values, and
 * then scaling up by the ratio of the most common value's frequency to
 * the average frequency.

After thinking about this a bit more, I'm starting to question the logic
behind the adjustment. The thing is, with NTUP_PER_BUCKET=1, all the
tuples in a bucket should have the same hash value and (hash collisions
aside) the same join key value. That means we're not really searching
through the bucket, we're merely iterating and returning all the tuples.

With NTUP_PER_BUCKET=10 it maybe made sense, because it was expected to
see multiple hash values in a single bucket. With NTUP_PER_BUCKET=1
we shouldn't really expect that. So I think we should rip the MCV
adjustment out of estimate_hash_bucketsize (and indeed, it fixed both
the test cases).

The only case where I think this might play a role is when we can't use
the optimal number of buckets (achieving NTUP_PER_BUCKET=1), but in that
case the patched ExecChooseHashTableSize prefers to increase the number of
batches. So that seems safe to me.

Or in what scenario (with NTUP_PER_BUCKET=1) does this still make sense?


Two more comments:

(a) Isn't the minimum selectivity enforced in estimate_hash_bucketsize
(1.0e-6) too high? With 1GB work_mem I can get ~20M tuples
into the hash table, and in that case it's off by an order of
magnitude. Maybe 1e-9 would be more appropriate? (See the clamp
sketched right after this list.)

(b) Currently, ExecChooseHashTableSize sizes the hash table (nbuckets)
using ntuples, but maybe ndistinct (if available) would be better?
It's tricky, though, because while both values are initially
estimates, ndistinct is notoriously less reliable. Also, we
currently don't have means to count ndistinct while building the
table (as opposed to ntuples, which is trivial), which is needed
when resizing the hash table in case the initial estimate was off.
So currently the code optimizes for ndistinct==ntuples which may
produce larger nbuckets values, but ISTM it's the right choice.
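
For reference, the clamp mentioned in (a) sits at the end of
estimate_hash_bucketsize in selfuncs.c and looks roughly like this (quoted
from memory, so treat the comment wording as approximate); the proposal
amounts to lowering the first constant:

	/*
	 * Clamp bucketsize to sane range (the above adjustment could easily
	 * produce an out-of-range result).  We set the lower bound a little
	 * above zero, since zero isn't a very sane result.
	 */
	if (estfract < 1.0e-6)
		estfract = 1.0e-6;
	else if (estfract > 1.0)
		estfract = 1.0;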

regards
Tomas






Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-20 Thread Tomas Vondra
On 19.7.2014 20:24, Tomas Vondra wrote:
 On 13.7.2014 21:32, Tomas Vondra wrote:
 The current patch only implements this for tuples in the main
 hash table, not for skew buckets. I plan to do that, but it will
 require separate chunks for each skew bucket (so we can remove it
 without messing with all of them). The chunks for skew buckets
 should be smaller (I'm thinking about ~4kB), because there'll be
 more of them, but OTOH those buckets are for frequent values so the
 chunks should not remain empty.
 
 I've looked into extending the dense allocation to the skew buckets, 
 and I think we shouldn't do that. I got about 50% of the changes and 
 then just threw it out because it turned out quite pointless.
 
 The amount of memory for skew buckets is limited to 2% of work mem, 
 so even with 100% overhead it'll use ~4% of work mem. So there's 
 pretty much nothing to gain here. So the additional complexity 
 introduced by the dense allocation seems pretty pointless.
 
 I'm not entirely happy with the code allocating some memory densely 
 and some using traditional palloc, but it certainly seems cleaner 
 than the code I had.
 
 So I think the patch is mostly OK as is.

Attached is v4 of the patch, mostly with minor improvements and cleanups
(removed MemoryContextStats, code related to skew buckets).

regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..6a9d956 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* 
  *		ExecHash
@@ -129,7 +130,7 @@ MultiExecHash(HashState *node)
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +224,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* 
  *		ExecHashTableCreate
  *
@@ -294,6 +294,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +558,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable-nbatch;
 	int			curbatch = hashtable-curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -612,51 +614,70 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
-	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
+	/* we're going to scan through the chunks directly, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there - just set them to NULL */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+	hashtable->chunks_batch = NULL;
 
-		while (tuple != NULL)
-		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL) {
+
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		HashChunk nextchunk = oldchunks->next;
+
+		/* position within the buffer (up to oldchunks->used) */
+		size_t idx = 0;	
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
 
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 	  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch)
 			{
-/* keep tuple */
-prevtuple = tuple;
+/* keep tuple - this means we need to copy it into the new chunks */
+HashJoinTuple copyTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+			

Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 14.7.2014 06:29, Stephen Frost wrote:
 Tomas,
 
 * Tomas Vondra (t...@fuzzy.cz) wrote:
 On 6.7.2014 17:57, Stephen Frost wrote:
 * Tomas Vondra (t...@fuzzy.cz) wrote:
 I can't find the thread / test cases in the archives. I've found this
 thread in hackers:

 http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com

 Can you point me to the right one, please?

 This:

 http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net

 Sadly, the link to the SQL does not work :-(

   http://snowman.net/~sfrost/test_case.sql = 404

 and I think there may have been other test cases on that thread
 somewhere...  I certainly recall having at least a couple of them that I
 was playing with.

 I can probably find them around here somewhere too, if necessary.

 If you could look for them, that'd be great.
 
 cc'ing me directly helps me notice these...
 
 I've added back test_case.sql and test_case2.sql.  Please check them out
 and let me know- they're real-world cases we've run into.

Hi Stephen,

I've reviewed the two test cases mentioned here, and sadly there's
nothing that can be 'fixed' by this patch. The problem here lies in the
planning stage, which decides to hash the large table - we can't fix
that in the executor.

I did three runs - with master, with the 'dense allocation' patch
applied, and with 'dense allocation + ntup patch' applied.

As the estimates are pretty exact here, the ntup patch effectively only
sets NTUP_PER_BUCKET=1 (and does not do any dynamic resize or so).

For the first testcase, it behaves like this:

### master (no patches) 

 QUERY PLAN
-
 Hash Join  (cost=115030.10..116671.32 rows=41176 width=24) (actual
time=980.363..1015.623 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24)
(actual time=0.015..2.563 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=979.399..979.399 rows=4272271 loops=1)
 Buckets: 524288  Batches: 1  Memory Usage: 150198kB
 ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271
width=4) (actual time=0.026..281.151 rows=4272271 loops=1)
 Planning time: 0.196 ms
 Execution time: 1034.931 ms
(8 rows)

# without explain analyze

Time: 841,198 ms
Time: 825,000 ms

### master + dense allocation 

 QUERY PLAN
-
 Hash Join  (cost=115030.10..116671.32 rows=41176 width=24) (actual
time=778.116..815.254 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24)
(actual time=0.006..2.423 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=775.803..775.803 rows=4272271 loops=1)
 Buckets: 524288  Batches: 1  Memory Usage: 150198kB
 ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271
width=4) (actual time=0.005..227.274 rows=4272271 loops=1)
 Planning time: 0.061 ms
 Execution time: 826.346 ms
(8 rows)

# without explain analyze

Time: 758,393 ms
Time: 734,329 ms

### master + dense allocation + ntup=1 patch 

 QUERY PLAN
-
 Hash Join  (cost=115030.10..116465.44 rows=41176 width=24) (actual
time=1020.288..1033.539 rows=13731 loops=1)
   Hash Cond: (small_table.id_short = big_table.id_short)
   ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176 width=24)
(actual time=0.015..2.214 rows=41176 loops=1)
   ->  Hash  (cost=61626.71..61626.71 rows=4272271 width=4) (actual
time=953.890..953.890 rows=4272271 loops=1)
 Buckets: 8388608  Batches: 1  Memory Usage: 215734kB
 ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271
width=4) (actual time=0.014..241.223 rows=4272271 loops=1)
 Planning time: 0.193 ms
 Execution time: 1055.351 ms
(8 rows)

# without explain analyze

Time: 836,050 ms
Time: 822,818 ms

#

For the second test case, the times are like this:

# master
Time: 2326,579 ms
Time: 2320,232 ms

# master + dense allocation
Time: 2207,254 ms
Time: 2251,691 ms

# master + dense allocation + ntup=1
Time: 2374,278 ms
Time: 2377,409 ms

So while the dense allocation shaves off ~5%, the time with both patches
is slightly higher (again, a ~5% difference). After spending some time
profiling the code, I believe it's because of a worse cache hit ratio when
accessing the buckets. By decreasing NTUP_PER_BUCKET, the second test
case now uses ~130MB of buckets, while originally it used only ~16MB.
That means a slightly lower cache hit ratio when accessing
hashtable->buckets (e.g. in ExecHashTableInsert).

The 

Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 I've reviewed the two test cases mentioned here, and sadly there's
 nothing that can be 'fixed' by this patch. The problem here lies in the
 planning stage, which decides to hash the large table - we can't fix
 that in the executor.

We've heard a couple reports before of the planner deciding to hash a
larger table rather than a smaller one.  The only reason I can think of
for that is if the smaller table has many more duplicates, so that the
planner thinks the executor might end up traversing long hash chains.
The planner's estimates could easily be off in this area, of course.
estimate_hash_bucketsize() is the likely culprit if it's wrong.

Which test case are you seeing this in, exactly?

regards, tom lane




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 13.7.2014 21:32, Tomas Vondra wrote:
 The current patch only implements this for tuples in the main hash 
 table, not for skew buckets. I plan to do that, but it will require 
 separate chunks for each skew bucket (so we can remove it without 
 messing with all of them). The chunks for skew buckets should be
 smaller (I'm thinking about ~4kB), because there'll be more of them,
 but OTOH those buckets are for frequent values so the chunks should
 not remain empty.

I've looked into extending the dense allocation to the skew buckets, and
I think we shouldn't do that. I got about 50% of the changes and then
just threw it out because it turned out quite pointless.

The amount of memory for skew buckets is limited to 2% of work mem, so
even with 100% overhead it'll use ~4% of work mem. So there's pretty
much nothing to gain here. So the additional complexity introduced by
the dense allocation seems pretty pointless.

I'm not entirely happy with the code allocating some memory densely and
some using traditional palloc, but it certainly seems cleaner than the
code I had.

So I think the patch is mostly OK as is.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 19.7.2014 20:24, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 I've reviewed the two test cases mentioned here, and sadly there's
 nothing that can be 'fixed' by this patch. The problem here lies in the
 planning stage, which decides to hash the large table - we can't fix
 that in the executor.
 
 We've heard a couple reports before of the planner deciding to hash a
 larger table rather than a smaller one.  The only reason I can think of
 for that is if the smaller table has many more duplicates, so that the
 planner thinks the executor might end up traversing long hash chains.
 The planner's estimates could easily be off in this area, of course.
 estimate_hash_bucketsize() is the likely culprit if it's wrong.
 
 Which test case are you seeing this in, exactly?

The two reported by Stephen here:


http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net

Just download this (I had to gunzip it):

  http://snowman.net/~sfrost/test_case.sql
  http://snowman.net/~sfrost/test_case2.sql

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 19.7.2014 20:28, Tomas Vondra wrote:
 On 19.7.2014 20:24, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 I've reviewed the two test cases mentioned here, and sadly there's
 nothing that can be 'fixed' by this patch. The problem here lies in the
 planning stage, which decides to hash the large table - we can't fix
 that in the executor.

 We've heard a couple reports before of the planner deciding to hash a
 larger table rather than a smaller one.  The only reason I can think of
 for that is if the smaller table has many more duplicates, so that the
 planner thinks the executor might end up traversing long hash chains.
 The planner's estimates could easily be off in this area, of course.
 estimate_hash_bucketsize() is the likely culprit if it's wrong.

 Which test case are you seeing this in, exactly?
 
 The two reported by Stephen here:
 
 
 http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net
 
 Just download this (I had to gunzip it):
 
   http://snowman.net/~sfrost/test_case.sql
   http://snowman.net/~sfrost/test_case2.sql

Anyway, looking a bit at the distributions, I don't think they are
significantly skewed. In the first test case, I get this:

test=# select cnt, count(*) from (select id_short, count(*) AS cnt from
small_table group by 1) foo group by cnt order by cnt;
 cnt | count
-----+-------
   1 |    23
   2 | 18795
   3 |    13
   4 |   726
   5 |     4
   6 |    93
   8 |     4
  10 |     1
(8 rows)

and in the second one I get this:

test=# select cnt, count(*) from (select id_short, count(*) AS cnt from
small_table group by 1) foo group by cnt order by cnt;
 cnt | count
-----+--------
   1 | 182161
   2 |   5582
   3 |    371
   4 |     28
   5 |      2
   6 |      3
   9 |      1
(7 rows)

So while #1 contains most values twice, #2 is almost perfectly distinct.
In the big_table, the values are unique.

For the first case, a WARNING at the end of estimate_hash_bucketsize
says this:

WARNING:  nbuckets=8388608.00 estfract=0.01
WARNING:  nbuckets=65536.00 estfract=0.000267

There are 4.3M rows in the big_table, so it says ~4 tuples/bucket (selectivity
= 1e-6), and ~10 tuples/bucket for the small_table (42k rows).

For the second case, I get this:

WARNING:  nbuckets=16777216.00 estfract=0.01
WARNING:  nbuckets=262144.00 estfract=0.000100

i.e. ~11 tuples/bucket for big_table and ~20 tuples/bucket for small_table.

That sounds really strange, because small_table in the second test case
is almost perfectly unique. And in the first test case, there are only
very few keys with 2 occurrences.


By explicitly setting the selectivity in estimate_hash_bucketsize to
1e-6, I got these plans:

Test case #1

  QUERY PLAN
-
 Hash Join  (cost=1229.46..79288.95 rows=41176 width=24) (actual
time=50.397..634.986 rows=13731 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..61626.71 rows=4272271 width=4)
(actual time=0.018..229.597 rows=4272271 loops=1)
   ->  Hash  (cost=714.76..714.76 rows=41176 width=24) (actual
time=31.653..31.653 rows=41176 loops=1)
 Buckets: 65536  Batches: 1  Memory Usage: 3086kB
 ->  Seq Scan on small_table  (cost=0.00..714.76 rows=41176
width=24) (actual time=0.012..13.341 rows=41176 loops=1)
 Planning time: 0.597 ms
 Execution time: 635.498 ms
(8 rows)

Without explain analyze: 486 ms (original plan: 850ms).

Test case #2

  QUERY PLAN
-
 Hash Join  (cost=5240.21..220838.44 rows=194587 width=4) (actual
time=88.252..2143.457 rows=149540 loops=1)
   Hash Cond: (big_table.id_short = small_table.id_short)
   ->  Seq Scan on big_table  (cost=0.00..169569.72 rows=11755372
width=4) (actual time=0.018..533.955 rows=11755475 loops=1)
   ->  Hash  (cost=2807.87..2807.87 rows=194587 width=4) (actual
time=87.548..87.548 rows=194587 loops=1)
 Buckets: 262144  Batches: 1  Memory Usage: 8889kB
 ->  Seq Scan on small_table  (cost=0.00..2807.87 rows=194587
width=4) (actual time=0.012..29.929 rows=194587 loops=1)
 Planning time: 0.438 ms
 Execution time: 2146.818 ms
(8 rows)

Without explain analyze: 1600 ms (original plan: 2300ms)

The differences are fairly small - well, it's 30-40% faster, but it's
less than 1s in both cases. I'm wondering whether we could get into
similar problems with much longer queries and still get 30-40% improvement.

Tomas





Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-19 Thread Tomas Vondra
On 19.7.2014 23:07, Tomas Vondra wrote:
 On 19.7.2014 20:28, Tomas Vondra wrote:
 For the first case, a WARNING at the end of estimate_hash_bucketsize
 says this:
 
 WARNING:  nbuckets=8388608.00 estfract=0.01
 WARNING:  nbuckets=65536.00 estfract=0.000267
 
 There are 4.3M rows in the big_table, so it says ~4 tuples/bucket (selectivity
 = 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
 
 For the second case, I get this:
 
 WARNING:  nbuckets=16777216.00 estfract=0.01
 WARNING:  nbuckets=262144.00 estfract=0.000100
 
 i.e. ~11 tuples/bucket for big_table and ~20 tuples/bucket for small_table.
 
 That sounds really strange, because small_table in the second test case
 is almost perfectly unique. And in the first test case, there are only
 very few keys with 2 occurrences.

FWIW, the problem seems to be this:

  /*
   * Adjust estimated bucketsize upward to account for skewed
   * distribution.
   */
  if (avgfreq > 0.0 && mcvfreq > avgfreq)
      estfract *= mcvfreq / avgfreq;

Which is explained in the estimate_hash_bucketsize comment like this:

 * be nonuniformly occupied.  If the other relation in the join has a key
 * distribution similar to this one's, then the most-loaded buckets are
 * exactly those that will be probed most often.  Therefore, the average
 * bucket size for costing purposes should really be taken as something
 * close to the worst case bucket size.  We try to estimate this by
 * adjusting the fraction if there are too few distinct data values, and
 * then scaling up by the ratio of the most common value's frequency to
 * the average frequency.

The problem is that the two test cases don't behave like this, i.e. the
other relation does not have a similar distribution (because the join key
there is unique). Actually, I wonder how frequently that happens, and I
wouldn't be surprised if it was quite rare.

So maybe we shouldn't scale it the way we scale it now. For example we
could do this:

  if (avgfreq > 0.0 && mcvfreq > avgfreq)
      estfract *= sqrt(mcvfreq / avgfreq);

Or instead of using the first MCV frequency (i.e. the frequency of the
most common value, which causes the unexpectedly high tuple/bucket
values), use an average or median of the MCV frequencies.
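
Just to illustrate the last variant: assuming the MCV frequencies are
already available in a numbers[] array of length nnumbers (as
estimate_hash_bucketsize fetches them before using the first entry), the
adjustment might average them instead of using only numbers[0]. This is a
sketch of the idea only, not something tested:

  if (nnumbers > 0)
  {
      double	sumfreq = 0.0;
      int		i;

      for (i = 0; i < nnumbers; i++)
          sumfreq += numbers[i];

      mcvfreq = sumfreq / nnumbers;	/* average MCV frequency */
  }

  if (avgfreq > 0.0 && mcvfreq > avgfreq)
      estfract *= mcvfreq / avgfreq;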

Either of these three changes fixed the first test case, and some fixed
the second one. However, it's pure guesswork, and I have no idea how many
plans would be hit negatively by these changes.

What I think might work better is lowering the cost of searching small
hash tables, when the hash table fits into L1/L2... caches. The table I
posted in the first message in this thread illustrates that:

      kB  NTUP=1  NTUP=2  NTUP=4  NTUP=5  NTUP=9  NTUP=10
    1407    6753    7218    7890    9324    9488    12574
    7032    9417   11586   17417   15820   25732    25402
   35157   13179   17101   27225   24763   43323    43175
   70313   14203   18475   29354   26690   46840    46676
  175782   14319   17458   25325   34542   37342    60949
  351563   16297   19928   29124   43364   43232    70014
  703125   19027   23125   33610   50283   50165    81398

So a small hash table (~1.4MB), fitting into L2 (measured on CPU with
~4MB cache) is influenced much less than large tables. I don't know
whether we should detect the CPU cache size somehow, or whether we
should assume some conservative size (say, 4MB sounds OK), or what. But
I think something like this might work:

  if (hash_size < 4MB) {
     /* take in only 10% of the difference */
     estfract += (mcvfreq / avgfreq) * estfract * 0.1;
  } else if (hash_size > 16MB) {
     estfract *= (mcvfreq / avgfreq);
  } else {
     /* linear approximation between 10 and 100% */
     estfract += (mcvfreq / avgfreq) * estfract
                     * (0.1 + 0.9 * (hash_size - 4MB) / 12MB);
  }

Or maybe not, I'm not really sure. The problem is that the two test
cases really don't match the assumption that the outer relation has the
same distribution.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Simon Riggs
On 12 July 2014 12:43, Tomas Vondra t...@fuzzy.cz wrote:

 So lets just get this change done and then do more later.

 There's no way back, sadly. The dense allocation turned into a
 challenge. I like challenges. I have to solve it or I won't be able to
 sleep.

I admire your tenacity, but how about we solve the first challenge
first and the second one second?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Tomas Vondra
On 13.7.2014 12:27, Simon Riggs wrote:
 On 12 July 2014 12:43, Tomas Vondra t...@fuzzy.cz wrote:
 
 So lets just get this change done and then do more later.

 There's no way back, sadly. The dense allocation turned into a 
 challenge. I like challenges. I have to solve it or I won't be able
 to sleep.
 
 I admire your tenacity, but how about we solve the first challenge 
 first and the second one second?

Yeah, I know. However the two patches deal with the same part of the
code, so it feels natural to work on both at the same time. And I
already had a solution in my head, and only needed to code it.

Also, I think the question of allocated memory / memory pressure is
important, and I'd like to avoid "this will be solved by the other
patch" only to find out that the other patch does not work.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Tomas Vondra
On 11.7.2014 19:25, Tomas Vondra wrote:
 2) walking through the tuples sequentially
 --
 
 The other option is not to walk the tuples through buckets, but by
 walking through the chunks - we know the tuples are stored as
 HashJoinTuple/MinimalTuple, so it shouldn't be that difficult. And by
 walking the tuples in the order they're stored, we may just rearrange
 the tuples in already allocated memory. And maybe it could be even
 faster than the current code, because of the sequential access to the
 memory (as opposed to the random access through buckets) and CPU caches.
 So I like this approach - it's simple, clean and shouldn't add any
 overhead (memory or CPU).

So, attached is a patch that implements this. It's not very complicated
and the resulting code is surprisingly readable (IMHO comparable to the
previous code).

Basics of the patch:

(1) adds HashChunkData structure - a buffer used for dense allocation
of tuples, 16kB by default

(2) adds 'chunks_batch' into HashJoinTableData, which is a linked list
of HashChunkData

(3) the allocation itself is done in chunk_alloc - in short, it either
allocates the tuple in the current chunk, or adds a new chunk if
needed (a rough sketch of chunk_alloc follows right after this list)

(4) ExecHashTableInsert calls chunk_alloc (instead of the regular
MemoryContextAlloc)

(5) the main change is in ExecHashIncreaseNumBatches, which now can't
do pfree (does not work with dense allocation), so it reads the
tuples from chunks directly and simply rebuilds the buckets
from scratch (thanks to this we only need one more chunk of memory,
not a complete copy, and we don't need to duplicate buckets at all)
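
To make (3) a bit more concrete, a minimal sketch of what chunk_alloc can
look like is below. The field names are guesses based on the description
above and may not match the patch exactly; alignment and oversized tuples
are ignored here:

typedef struct HashChunkData
{
	struct HashChunkData *next;		/* chunks form a simple linked list */
	size_t		maxlen;				/* size of the data[] buffer */
	size_t		used;				/* bytes already handed out from data[] */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} HashChunkData;

typedef struct HashChunkData *HashChunk;

#define HASH_CHUNK_SIZE		(16 * 1024L)

static char *
chunk_alloc(HashJoinTable hashtable, int size)
{
	HashChunk	chunk = hashtable->chunks_batch;

	/* start a new chunk if there is none yet, or the current one is full */
	if (chunk == NULL || chunk->used + size > chunk->maxlen)
	{
		chunk = (HashChunk) MemoryContextAlloc(hashtable->batchCxt,
							offsetof(HashChunkData, data) + HASH_CHUNK_SIZE);
		chunk->maxlen = HASH_CHUNK_SIZE;
		chunk->used = 0;
		chunk->next = hashtable->chunks_batch;
		hashtable->chunks_batch = chunk;
	}

	/* hand out the next 'size' bytes of the current chunk */
	chunk->used += size;
	return chunk->data + (chunk->used - size);
}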

From the tests I've done so far this seems to work perfectly - it still
saves the memory overhead, and I've seen no difference in performance.

The current patch only implements this for tuples in the main hash
table, not for skew buckets. I plan to do that, but it will require
separate chunks for each skew bucket (so we can remove it without
messing with all of them). The chunks for skew buckets should be smaller
(I'm thinking about ~4kB), because there'll be more of them, but OTOH
those buckets are for frequent values so the chunks should not remain empty.

But let's see if the current patch misses something important first.


regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..5f9f2df 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* 
  *		ExecHash
@@ -130,6 +131,9 @@ MultiExecHash(HashState *node)
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
 
+	/* print some context stats */
+	MemoryContextStats(hashtable->batchCxt);
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +227,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* 
  *		ExecHashTableCreate
  *
@@ -294,6 +297,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +561,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable-nbatch;
 	int			curbatch = hashtable-curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -612,51 +617,70 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
-	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
+	/* we're going to scan through the chunks directly, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there - just set them to NULL */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+	hashtable->chunks_batch = NULL;
 
-		while (tuple != NULL)
-		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL) {
+
+		/* pointer to the 

Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-13 Thread Stephen Frost
Tomas,

* Tomas Vondra (t...@fuzzy.cz) wrote:
 On 6.7.2014 17:57, Stephen Frost wrote:
  * Tomas Vondra (t...@fuzzy.cz) wrote:
  I can't find the thread / test cases in the archives. I've found this
  thread in hackers:
 
  http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com
 
  Can you point me to the right one, please?
  
  This:
  
  http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net
 
 Sadly, the link to the SQL does not work :-(
 
   http://snowman.net/~sfrost/test_case.sql = 404
 
  and I think there may have been other test cases on that thread
  somewhere...  I certainly recall having at least a couple of them that I
  was playing with.
  
  I can probably find them around here somewhere too, if necessary.
 
 If you could look for them, that'd be great.

cc'ing me directly helps me notice these...

I've added back test_case.sql and test_case2.sql.  Please check them out
and let me know- they're real-world cases we've run into.

Thanks!

Stephen




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-12 Thread Simon Riggs
On 11 July 2014 18:25, Tomas Vondra t...@fuzzy.cz wrote:

 Turns out getting this working properly will be quite complicated.

Lets keep this patch simple then. Later research can be another patch.

In terms of memory pressure, having larger joins go x4 faster has a
much more significant reducing effect on memory pressure than anything
else. So my earlier concerns seem less of a concern.

So lets just get this change done and then do more later.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-12 Thread Tomas Vondra
On 12.7.2014 11:39, Simon Riggs wrote:
 On 11 July 2014 18:25, Tomas Vondra t...@fuzzy.cz wrote:
 
 Turns out getting this working properly will be quite complicated.
 
 Lets keep this patch simple then. Later research can be another patch.

Well, the dense allocation is independent to the NTUP_PER_BUCKET
changes, and only happened to be discussed here because it's related to
hash joins. My plan was to keep it as a separate patch, thus not making
the NTUP patch any more complex.

 In terms of memory pressure, having larger joins go x4 faster has a 
 much more significant reducing effect on memory pressure than
 anything else. So my earlier concerns seem less of a concern.

OK.

 So lets just get this change done and then do more later.

There's no way back, sadly. The dense allocation turned into a
challenge. I like challenges. I have to solve it or I won't be able to
sleep.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-11 Thread Simon Riggs
On 9 July 2014 18:54, Tomas Vondra t...@fuzzy.cz wrote:

 (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
 of batches this requires)

If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
it save to recalculate the hash bucket at 10 instead?
Resizing sounds like it will only be useful if we only just overflow our limit.

If we release next version with this as a hardcoded change, my
understanding is that memory usage for hash joins will leap upwards,
even if the run time of queries reduces. It sounds like we need some
kind of parameter to control this. "We made it faster" might not be
true if we run this on servers that are already experiencing high
memory pressure.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-11 Thread Tomas Vondra
On 11 July 2014, 9:27, Simon Riggs wrote:
 On 9 July 2014 18:54, Tomas Vondra t...@fuzzy.cz wrote:

 (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
 of batches this requires)

 If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
 it save to recalculate the hash bucket at 10 instead?
 Resizing sounds like it will only be useful if we only just overflow our
 limit.

 If we release next version with this as a hardcoded change, my
 understanding is that memory usage for hash joins will leap upwards,
 even if the run time of queries reduces. It sounds like we need some
 kind of parameter to control this. "We made it faster" might not be
 true if we run this on servers that are already experiencing high
 memory pressure.

Sure. We certainly don't want to make things worse for environments with
memory pressure.

The current implementation has two issues regarding memory:

(1) It does not include buckets into used memory, i.e. it's not included
into work_mem (so we may overflow work_mem). I plan to fix this, to make
work_mem a bit more correct, as it's important for cases with
NTUP_PER_BUCKET=1.

(2) There's a significant palloc overhead, because of allocating each
tuple separately - see my message from yesterday, where I observed the
batch memory context to get 1.4GB memory for 700MB of tuple data. By
densely packing the tuples, I got down to ~700MB (i.e. pretty much no
overhead).

The palloc overhead seems to be 20B (on x86_64) per tuple, and eliminating
this more than compensates for the ~8B per tuple required for
NTUP_PER_BUCKET=1. And fixing (1) makes it more correct / predictable.

It also improves the issue that palloc overhead is not counted into
work_mem at all (that's why I got ~1.4GB batch context with work_mem=1GB).

So in the end this should give us much lower memory usage for hash joins,
even if we switch to NTUP_PER_BUCKET=1 (although that's pretty much
independent change). Does that seem reasonable?

Regarding the tunable to control this - I certainly don't want another GUC
no one really knows how to set right. And I think it's unnecessary thanks
to the palloc overhead / work_mem accounting fix, described above.

The one thing I'm not sure about is what to do in case of reaching the
work_mem limit (which should only happen with underestimated row count /
row width) - how to decide whether to shrink the hash table or increment
the number of batches. But this is not exclusive to NTUP_PER_BUCKET=1, it
may happen with whatever NTUP_PER_BUCKET value you choose.

The current code does not support resizing at all, so it always increments
the number of batches, but I feel an interleaved approach might be more
appropriate (nbuckets/2, nbatches*2, nbuckets/2, nbatches*2, ...). It'd be
nice to have some cost estimates ('how expensive is a rescan' vs. 'how
expensive is a resize'), but I'm not sure how to get that.
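
A sketch of what the interleaved policy might look like at the point where
the hash table exceeds the limit (ExecHashShrinkBuckets and lastGrewBatches
are made-up names, not existing code):

	if (hashtable->spaceUsed > hashtable->spaceAllowed)
	{
		/* alternate between the two remedies */
		if (hashtable->lastGrewBatches)
			ExecHashShrinkBuckets(hashtable);		/* hypothetical: nbuckets /= 2 */
		else
			ExecHashIncreaseNumBatches(hashtable);	/* existing: nbatch *= 2 */

		hashtable->lastGrewBatches = !hashtable->lastGrewBatches;
	}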

regards
Tomas





Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-11 Thread Simon Riggs
On 11 July 2014 10:23, Tomas Vondra t...@fuzzy.cz wrote:
 On 11 July 2014, 9:27, Simon Riggs wrote:
 On 9 July 2014 18:54, Tomas Vondra t...@fuzzy.cz wrote:

 (1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
 of batches this requires)

 If we start off by assuming NTUP_PER_BUCKET = 1, how much memory does
 it save to recalculate the hash bucket at 10 instead?
 Resizing sounds like it will only be useful if we only just overflow our
 limit.

 If we release next version with this as a hardcoded change, my
 understanding is that memory usage for hash joins will leap upwards,
 even if the run time of queries reduces. It sounds like we need some
 kind of parameter to control this. "We made it faster" might not be
 true if we run this on servers that are already experiencing high
 memory pressure.

 Sure. We certainly don't want to make things worse for environments with
 memory pressure.

 The current implementation has two issues regarding memory:

 (1) It does not include buckets into used memory, i.e. it's not included
 into work_mem (so we may overflow work_mem). I plan to fix this, to make
 work_mem a bit more correct, as it's important for cases with
 NTUP_PER_BUCKET=1.

 (2) There's a significant palloc overhead, because of allocating each
 tuple separately - see my message from yesterday, where I observed the
 batch memory context to get 1.4GB memory for 700MB of tuple data. By
 densely packing the tuples, I got down to ~700MB (i.e. pretty much no
 overhead).

 The palloc overhead seems to be 20B (on x86_64) per tuple, and eliminating
 this more than compensates for the ~8B per tuple required for
 NTUP_PER_BUCKET=1. And fixing (1) makes it more correct / predictable.

 It also improves the issue that palloc overhead is not counted into
 work_mem at all (that's why I got ~1.4GB batch context with work_mem=1GB).

 So in the end this should give us much lower memory usage for hash joins,
 even if we switch to NTUP_PER_BUCKET=1 (although that's pretty much
 independent change). Does that seem reasonable?

Yes, that alleviates my concern. Thanks.

 Regarding the tunable to control this - I certainly don't want another GUC
 no one really knows how to set right. And I think it's unnecessary thanks
 to the palloc overhead / work_mem accounting fix, described above.

Agreed, nor does anyone.

 The one thing I'm not sure about is what to do in case of reaching the
 work_mem limit (which should only happen with underestimated row count /
 row width) - how to decide whether to shrink the hash table or increment
 the number of batches. But this is not exclusive to NTUP_PER_BUCKET=1, it
 may happen with whatever NTUP_PER_BUCKET value you choose.

 The current code does not support resizing at all, so it always increments
 the number of batches, but I feel an interleaved approach might be more
 appropriate (nbuckets/2, nbatches*2, nbuckets/2, nbatches*2, ...). It'd be
 nice to have some cost estimates ('how expensive is a rescan' vs. 'how
 expensive is a resize'), but I'm not sure how to get that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-11 Thread Tomas Vondra
On 10.7.2014 21:33, Tomas Vondra wrote:
 On 9.7.2014 16:07, Robert Haas wrote:
 On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Thinking about this a bit more, do we really need to build the hash
 table on the first pass? Why not to do this:

 (1) batching
 - read the tuples, stuff them into a simple list
 - don't build the hash table yet

 (2) building the hash table
 - we have all the tuples in a simple list, batching is done
 - we know exact row count, can size the table properly
 - build the table

 We could do this, and in fact we could save quite a bit of memory if
 we allocated say 1MB chunks and packed the tuples in tightly instead
 of palloc-ing each one separately.  But I worry that rescanning the
 data to build the hash table would slow things down too much.
 
 I did a quick test of how much memory we could save by this. The 
 attached patch densely packs the tuples into 32kB chunks (1MB seems
 way too much because of small work_mem values, but I guess this might
 be tuned based on number of tuples / work_mem size ...).

Turns out getting this working properly will be quite complicated. The
patch was only a quick attempt to see how much overhead there is, and
didn't solve one important details - batching.

The problem is that when increasing the number of batches, we need to
get rid of the tuples from one of them. Which means calling pfree() on
the tuples written to a temporary file, and that's not possible with the
dense allocation.


1) copy into new chunks (dead end)
--

The first idea I had was to just copy everything into new chunks and
then throw away the old ones, but this way we might end up using 150% of
work_mem on average (when the two batches are about 1/2 the data each),
and in an extreme case up to 200% of work_mem (all tuples having the
same key and thus falling into the same batch). So I don't think that's
really a good approach.


2) walking through the tuples sequentially
--

The other option is not to walk the tuples through buckets, but by
walking through the chunks - we know the tuples are stored as
HashJoinTuple/MinimalTuple, so it shouldn't be that difficult. And by
walking the tuples in the order they're stored, we may just rearrange
the tuples in already allocated memory. And maybe it could be even
faster than the current code, because of the sequential access to the
memory (as opposed to the random access through buckets) and CPU caches.
So I like this approach - it's simple, clean and shouldn't add any
overhead (memory or CPU).


3) batch-aware chunks
-

I also think a batch-aware chunks might work. Say we're starting with
nbatch=N. Instead of allocating everything in a single chunk, we'll
allocate the tuples from the chunks according to a hypothetical batch
number - what batch would the tuple belong to if we had (nbatch=N*2).
So we'd have two chunks (or sequences of chunks), and we'd allocate the
tuples into them.

Then if we actually need to increase the number of batches, we know we
can just free one of the chunks, because all of the tuples are from the
'wrong' batch, and redistribute the remaining tuples into the new
chunks (according to the new hypothetical batch numbers).

I'm not sure how much overhead this would be. I think it can be done
quite efficiently, though, and it shouldn't have any impact at all if we
don't do any additional batching (i.e. if the initial estimates are ok).

Any other ideas how to tackle this?

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-10 Thread Tomas Vondra
On 9.7.2014 16:07, Robert Haas wrote:
 On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Thinking about this a bit more, do we really need to build the hash
 table on the first pass? Why not to do this:

 (1) batching
 - read the tuples, stuff them into a simple list
 - don't build the hash table yet

 (2) building the hash table
 - we have all the tuples in a simple list, batching is done
 - we know exact row count, can size the table properly
 - build the table
 
 We could do this, and in fact we could save quite a bit of memory if
 we allocated say 1MB chunks and packed the tuples in tightly instead
 of palloc-ing each one separately.  But I worry that rescanning the
 data to build the hash table would slow things down too much.

I did a quick test of how much memory we could save by this. The
attached patch densely packs the tuples into 32kB chunks (1MB seems way
too much because of small work_mem values, but I guess this might be
tuned based on number of tuples / work_mem size ...).

Tested on query like this (see the first message in this thread how to
generate the tables):

 QUERY PLAN
---
 Aggregate  (cost=2014697.64..2014697.65 rows=1 width=33) (actual
time=63796.270..63796.271 rows=1 loops=1)
   ->  Hash Left Join  (cost=318458.14..1889697.60 rows=5016
width=33) (actual time=2865.656..61778.592 rows=5000 loops=1)
 Hash Cond: (o.id = i.id)
 ->  Seq Scan on outer_table o  (cost=0.00..721239.16
rows=5016 width=4) (actual time=0.033..2676.234 rows=5000 loops=1)
 ->  Hash  (cost=193458.06..193458.06 rows=1006 width=37)
(actual time=2855.408..2855.408 rows=1000 loops=1)
   Buckets: 1048576  Batches: 1  Memory Usage: 703125kB
   ->  Seq Scan on inner_table i  (cost=0.00..193458.06
rows=1006 width=37) (actual time=0.044..952.802 rows=1000 loops=1)
 Planning time: 1.139 ms
 Execution time: 63889.056 ms
(9 rows)

I.e. it creates a single batch with ~700 MB of tuples. Without the
patch, top shows this:

    VIRT    RES   SHR S  %CPU %MEM   COMMAND
 2540356 1,356g  5936 R 100,0 17,6   postgres: EXPLAIN

and the MemoryContextStats added to MultiExecHash shows this:

HashBatchContext: 1451221040 total in 182 blocks; 2826592 free (11
chunks); 1448394448 used

So yeah, the overhead is pretty huge in this case - basically 100%
overhead, because the inner table row width is only ~40B. With wider
rows the overhead will be lower.

Now, with the patch it looks like this:

    VIRT    RES   SHR S  %CPU %MEM   COMMAND
 1835332 720200  6096 R 100,0  8,9   postgres: EXPLAIN

HashBatchContext: 729651520 total in 21980 blocks; 0 free (0 chunks);
729651520 used

So, pretty much no overhead at all. It was slightly faster too (~5%) but
I haven't done much testing so it might be measurement error.

This patch is pretty independent of the other changes discussed here
(tweaking NTUP_PER_BUCKET / nbuckets) so I'll keep it separate.

regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..18fd4a9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* 
  *		ExecHash
@@ -130,6 +131,9 @@ MultiExecHash(HashState *node)
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
 
+	/* print some context stats */
+	MemoryContextStats(hashtable->batchCxt);
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,6 +227,8 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
+/* 32kB chunks by default */
+#define CHUNK_SIZE	(32*1024L)
 
 /* 
  *		ExecHashTableCreate
@@ -294,6 +300,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunk_data = NULL;
+	hashtable->chunk_used = 0;
+	hashtable->chunk_length = 0;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -717,8 +727,8 @@ ExecHashTableInsert(HashJoinTable hashtable,
 
 		/* Create the HashJoinTuple */
 		hashTupleSize = HJTUPLE_OVERHEAD + tuple->t_len;
-		hashTuple = (HashJoinTuple) MemoryContextAlloc(hashtable->batchCxt,
-	   hashTupleSize);
+		hashTuple = (HashJoinTuple) chunk_alloc(hashtable, hashTupleSize);
+
 		hashTuple->hashvalue = hashvalue;
 		

Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-09 Thread Robert Haas
On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Thinking about this a bit more, do we really need to build the hash
 table on the first pass? Why not to do this:

 (1) batching
 - read the tuples, stuff them into a simple list
 - don't build the hash table yet

 (2) building the hash table
 - we have all the tuples in a simple list, batching is done
 - we know exact row count, can size the table properly
 - build the table

We could do this, and in fact we could save quite a bit of memory if
we allocated say 1MB chunks and packed the tuples in tightly instead
of palloc-ing each one separately.  But I worry that rescanning the
data to build the hash table would slow things down too much.

 Also, maybe we could use a regular linear hash table [1], instead of
 using the current implementation with NTUP_PER_BUCKET=1. (Although,
 that'd be absolutely awful with duplicates.)

Linear probing is pretty awful unless your load factor is < 1.  You'd
probably want NTUP_PER_BUCKET=0.25, or something like that, which
would eat up a lot of memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-09 Thread Tomas Vondra
On 9.7.2014 16:07, Robert Haas wrote:
 On Tue, Jul 8, 2014 at 5:16 PM, Tomas Vondra t...@fuzzy.cz wrote:
 Thinking about this a bit more, do we really need to build the
 hash table on the first pass? Why not to do this:
 
 (1) batching - read the tuples, stuff them into a simple list -
 don't build the hash table yet
 
 (2) building the hash table - we have all the tuples in a simple
 list, batching is done - we know exact row count, can size the
 table properly - build the table
 
 We could do this, and in fact we could save quite a bit of memory if 
 we allocated say 1MB chunks and packed the tuples in tightly instead 
 of palloc-ing each one separately.  But I worry that rescanning the 
 data to build the hash table would slow things down too much.

OK. I think we should shoot for no negative impact on well estimated
plans, and the rescans might violate that. I need to think about this
for some time, but my current plan is this:

(1) size the buckets for NTUP_PER_BUCKET=1 (and use whatever number
of batches this requires)

(2) build the batches as in the current patch

(3) if the hash table is too small, resize it

Which is not really different from the current patch.

 Also, maybe we could use a regular linear hash table [1], instead
 of using the current implementation with NTUP_PER_BUCKET=1.
 (Although, that'd be absolutely awful with duplicates.)
 
 Linear probing is pretty awful unless your load factor is < 1.
 You'd probably want NTUP_PER_BUCKET=0.25, or something like that,
 which would eat up a lot of memory.

My experience is that 0.5-0.75 load factor is quite OK, and
NTUP_PER_BUCKET=1 gives you ~0.75 load on average, so it's not that
different. Anyway, the inability to handle duplicates reasonably is
probably a show-stopper.
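
To make that concrete, here is a tiny standalone linear-probing sketch (purely
illustrative, unrelated to the patches): a frequent value fills a run of consecutive
slots, and a probe for any key whose home slot falls inside that run has to walk the
entire cluster.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSLOTS  64              /* power of two */
#define EMPTY   UINT32_MAX      /* sentinel marking a free slot */

static uint32_t slots[NSLOTS];

/* identity "hash" keeps the demo easy to follow */
static uint32_t
home_slot(uint32_t key)
{
    return key & (NSLOTS - 1);
}

static void
lp_insert(uint32_t key)
{
    uint32_t    pos = home_slot(key);

    while (slots[pos] != EMPTY)             /* every duplicate takes a slot */
        pos = (pos + 1) & (NSLOTS - 1);
    slots[pos] = key;
}

/* number of slots examined to decide whether key is present */
static int
lp_probes(uint32_t key)
{
    uint32_t    pos = home_slot(key);
    int         probes = 1;

    while (slots[pos] != EMPTY && slots[pos] != key)
    {
        pos = (pos + 1) & (NSLOTS - 1);
        probes++;
    }
    return probes;
}

int
main(void)
{
    int         i;

    memset(slots, 0xff, sizeof(slots));     /* mark every slot EMPTY */

    /* a hash join has to keep every matching tuple, so 16 duplicates of
     * key 7 occupy 16 consecutive slots ... */
    for (i = 0; i < 16; i++)
        lp_insert(7);

    /* ... and an unrelated key whose home slot lies inside that run pays
     * for the whole cluster: this prints 12 probes instead of 1 */
    printf("probes for key 12: %d\n", lp_probes(12));
    return 0;
}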

Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Robert Haas
On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
 once the table is built and there's free space in work_mem. The patch
 mentioned above makes implementing this possible / rather simple.

Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
the common case is that work_mem will be large enough that everything
fits; and if you do it that way, then you save yourself the trouble of
rehashing later, which as you point out might lose if there are only a
few probes.  If it turns out that you run short of memory, you can
merge pairs of buckets up to three times, effectively doubling
NTUP_PER_BUCKET each time.
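
For a chained table with a power-of-two bucket count and bucket = hash & (nbuckets - 1),
merging pairs of buckets is just chain concatenation: buckets i and i + nbuckets/2
collide under the halved mask. A standalone sketch of that step (illustrative only, not
PostgreSQL code):

#include <stdio.h>
#include <stdlib.h>

typedef struct Tuple
{
    struct Tuple *next;
    unsigned int hashvalue;
} Tuple;

static Tuple  **buckets;
static unsigned int nbuckets = 8;       /* power of two */

static unsigned int
bucket_of(unsigned int hashvalue)
{
    return hashvalue & (nbuckets - 1);
}

/* Under the halved mask, bucket i and bucket i + nbuckets/2 map to the
 * same slot, so "merging a pair" is just concatenating the two chains.
 * No tuple has to be copied or re-hashed. */
static void
merge_bucket_pairs(void)
{
    unsigned int half = nbuckets / 2;
    unsigned int i;

    for (i = 0; i < half; i++)
    {
        Tuple     **tail = &buckets[i];

        while (*tail != NULL)               /* find the end of chain i */
            tail = &(*tail)->next;
        *tail = buckets[i + half];          /* append chain i + half */
        buckets[i + half] = NULL;
    }
    nbuckets = half;
}

int
main(void)
{
    unsigned int h;
    Tuple      *t;

    buckets = calloc(nbuckets, sizeof(Tuple *));

    /* load 16 tuples with hash values 0..15 (two per bucket) */
    for (h = 0; h < 16; h++)
    {
        t = malloc(sizeof(Tuple));
        t->hashvalue = h;
        t->next = buckets[bucket_of(h)];
        buckets[bucket_of(h)] = t;
    }

    merge_bucket_pairs();                   /* 8 buckets -> 4 */

    /* every tuple is still found under the bucket its hash now maps to */
    for (h = 0; h < 16; h++)
    {
        int         found = 0;

        for (t = buckets[bucket_of(h)]; t != NULL; t = t->next)
            if (t->hashvalue == h)
                found = 1;
        if (!found)
            printf("hash %u missing!\n", h);
    }
    printf("merge done: %u buckets\n", nbuckets);
    return 0;
}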

Yet another idea is to stick with your scheme, but do the dynamic
bucket splits lazily.  Set a flag on each bucket indicating whether or
not it needs a lazy split.  When someone actually probes the hash
table, if the flag is set for a particular bucket, move any tuples
that don't belong as we scan the bucket.  If we reach the end of the
bucket chain, clear the flag.

Not sure either of these are better (though I kind of like the first
one) but I thought I'd throw them out there...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Tomas Vondra
On 8 Červenec 2014, 14:49, Robert Haas wrote:
 On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
 once the table is built and there's free space in work_mem. The patch
 mentioned above makes implementing this possible / rather simple.

 Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
 run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
 the common case is that work_mem will be large enough that everything
 fits; and if you do it that way, then you save yourself the trouble of
 rehashing later, which as you point out might lose if there are only a
 few probes.  If it turns out that you run short of memory, you can
 merge pairs of buckets up to three times, effectively doubling
 NTUP_PER_BUCKET each time.

Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
relations it may be way cheaper to use higher NTUP_PER_BUCKET values
instead of increasing the number of batches (resulting in repeated scans
of the outer table). I think it's important (but difficult) to keep these
things somehow balanced.

With large work_mem values the amount of memory for buckets may be quite
significant. E.g. 800MB work_mem may easily give ~120MB of memory taken by
buckets with NTUP_PER_BUCKET=1. With NTUP_PER_BUCKET=4 it's just ~30MB.
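
The arithmetic behind those numbers, as a quick standalone sketch (it assumes 8-byte
bucket pointers and ~50-byte tuples, which are assumptions for illustration; it lands in
the same ballpark as the ~120MB and ~30MB above, printing ~128MB and ~32MB):

#include <stdio.h>

int
main(void)
{
    double      work_mem = 800.0 * 1024 * 1024;     /* 800MB, as above */
    double      tuple_size = 50.0;                  /* assumed bytes per tuple */
    double      ntuples = work_mem / tuple_size;    /* ~16.8M tuples */
    int         ntup;

    for (ntup = 1; ntup <= 4; ntup *= 4)            /* NTUP_PER_BUCKET 1 and 4 */
    {
        double      nbuckets = ntuples / ntup;
        double      bytes = nbuckets * sizeof(void *);  /* one pointer per bucket */

        printf("NTUP_PER_BUCKET=%d: ~%.0f MB of bucket pointers\n",
               ntup, bytes / (1024 * 1024));
    }
    return 0;
}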


 Yet another idea is to stick with your scheme, but do the dynamic
 bucket splits lazily.  Set a flag on each bucket indicating whether or
 not it needs a lazy split.  When someone actually probes the hash
 table, if the flag is set for a particular bucket, move any tuples
 that don't belong as we scan the bucket.  If we reach the end of the
 bucket chain, clear the flag.

I don't think a lazy scheme makes much sense here. There are two clearly
separated phases - loading the table and search.

Also, it seems to me it can't work as described. Say we build a table and
then find out we need a table 4x the size. If I understand your proposal
correctly, you'd just resize buckets array, copy the existing buckets
(first 1/4 buckets) and set 'needs_split' flags. Correct?

Well, the problem is that while scanning a bucket you already need to know
which tuples belong there. But with the lazy approach, the tuples might
still be in the old (not yet split) buckets. So you'd have to search the
current bucket and all buckets that were not split yet. Or did I miss
something?

Tomas





Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Robert Haas
On Tue, Jul 8, 2014 at 9:35 AM, Tomas Vondra t...@fuzzy.cz wrote:
 On 8 Červenec 2014, 14:49, Robert Haas wrote:
 On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
 I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
 once the table is built and there's free space in work_mem. The patch
 mentioned above makes implementing this possible / rather simple.

 Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
 run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
 the common case is that work_mem will be large enough that everything
 fits; and if you do it that way, then you save yourself the trouble of
 rehashing later, which as you point out might lose if there are only a
 few probes.  If it turns out that you run short of memory, you can
 merge pairs of buckets up to three times, effectively doubling
 NTUP_PER_BUCKET each time.

 Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
 relations it may be way cheaper to use higher NTUP_PER_BUCKET values
 instead of increasing the number of batches (resulting in repeated scans
 of the outer table). I think it's important (but difficult) to keep these
 things somehow balanced.

Right, I think that's clear.  I'm just pointing out that you get to
decide: you can either start with a larger NTUP_PER_BUCKET and then
reduce it if you have enough memory left, or you can start with a smaller
NTUP_PER_BUCKET and then increase it if you run short of memory.

 Yet another idea is to stick with your scheme, but do the dynamic
 bucket splits lazily.  Set a flag on each bucket indicating whether or
 not it needs a lazy split.  When someone actually probes the hash
 table, if the flag is set for a particular bucket, move any tuples
 that don't belong as we scan the bucket.  If we reach the end of the
 bucket chain, clear the flag.

 I don't think a lazy scheme makes much sense here. There are two clearly
 separated phases - loading the table and search.

 Also, it seems to me it can't work as described. Say we build a table and
 then find out we need a table 4x the size. If I understand your proposal
 correctly, you'd just resize buckets array, copy the existing buckets
 (first 1/4 buckets) and set 'needs_split' flags. Correct?

 Well, the problem is that while scanning a bucket you already need to know
 which tuples belong there. But with the lazy approach, the tuples might
 still be in the old (not yet split) buckets. So you'd have to search the
 current bucket and all buckets that were not split yet. Or did I miss
 something?

If you have, say, bucket 0..63 and you decide to change it to buckets
0..255, then any tuple that isn't in bucket N has to be in bucket N %
64.  More generally, any time you expand or contract by a power of two
it's pretty easy to figure out where tuples have to go.

But even if you do something that's not a power of two, like say a 10x
compaction of the buckets, there's still only two buckets that can
contain any particular hash value.  If the hash value is H, the old
number of buckets is O, and the new number of buckets is N, then the
tuple has got to be in bucket H % O or bucket H % N.  If bucket H % O
doesn't have the needs-split flag set, then it must be in H % N (if it
exists at all).
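
Sketched out as standalone code (illustrative only; needs_split mirrors the flag
described upthread, nothing here is from an actual patch), the probe path would look
roughly like this:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct Tuple
{
    struct Tuple *next;
    unsigned int hashvalue;
} Tuple;

static Tuple   **buckets;
static bool     *needs_split;
static unsigned int old_nbuckets = 4;   /* O: bucket count before resizing */
static unsigned int nbuckets = 16;      /* N: bucket count after resizing */

/* First probe of a flagged bucket moves its tuples to their H % N
 * buckets and clears the flag; because O divides N, none of them can
 * land in a different not-yet-split bucket. */
static void
split_bucket(unsigned int oldbucket)
{
    Tuple      *t = buckets[oldbucket];

    buckets[oldbucket] = NULL;
    while (t != NULL)
    {
        Tuple      *next = t->next;
        unsigned int b = t->hashvalue % nbuckets;

        t->next = buckets[b];
        buckets[b] = t;
        t = next;
    }
    needs_split[oldbucket] = false;
}

/* A tuple with hash H is either already in bucket H % N, or still in
 * bucket H % O if that bucket has not been split yet. */
static bool
probe(unsigned int hashvalue)
{
    unsigned int oldbucket = hashvalue % old_nbuckets;
    Tuple      *t;

    if (needs_split[oldbucket])
        split_bucket(oldbucket);            /* the lazy work happens here */

    for (t = buckets[hashvalue % nbuckets]; t != NULL; t = t->next)
        if (t->hashvalue == hashvalue)
            return true;
    return false;
}

int
main(void)
{
    unsigned int h;

    buckets = calloc(nbuckets, sizeof(Tuple *));
    needs_split = calloc(old_nbuckets, sizeof(bool));

    /* load 32 tuples as if the table still had old_nbuckets buckets,
     * then pretend we resized and flagged every old bucket */
    for (h = 0; h < 32; h++)
    {
        Tuple      *t = malloc(sizeof(Tuple));

        t->hashvalue = h;
        t->next = buckets[h % old_nbuckets];
        buckets[h % old_nbuckets] = t;
    }
    for (h = 0; h < old_nbuckets; h++)
        needs_split[h] = true;

    for (h = 0; h < 32; h++)
        if (!probe(h))
            printf("hash %u missing!\n", h);
    printf("all probes done\n");
    return 0;
}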

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Tomas Vondra
On 8 Červenec 2014, 16:16, Robert Haas wrote:
 On Tue, Jul 8, 2014 at 9:35 AM, Tomas Vondra t...@fuzzy.cz wrote:
 Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
 relations it may be way cheaper to use higher NTUP_PER_BUCKET values
 instead of increasing the number of batches (resulting in repeated scans
 of the outer table). I think it's important (but difficult) to keep
 these
 things somehow balanced.

 Right, I think that's clear.  I'm just pointing out that you get to
 decide: you can either start with a larger NTUP_PER_BUCKET and then
 reduce it if you have enough memory left, or you can start with a smaller
 NTUP_PER_BUCKET and then increase it if you run short of memory.

I don't think those two approaches are equal.

With the approach I took, I can use a compromise value (NTUP=4) initially,
and only resize the hash table once at the end (while keeping the amount
of memory under work_mem all the time).

With the NTUP=1 and increase in case of memory pressure you have to
shrink the table immediately (to fit into work_mem), and if the hash table
gets split into multiple batches you're suddenly in a situation with lower
memory pressure and you may need to increase it again.

 Yet another idea is to stick with your scheme, but do the dynamic
 bucket splits lazily.  Set a flag on each bucket indicating whether or
 not it needs a lazy split.  When someone actually probes the hash
 table, if the flag is set for a particular bucket, move any tuples
 that don't belong as we scan the bucket.  If we reach the end of the
 bucket chain, clear the flag.

 I don't think a lazy scheme makes much sense here. There are two clearly
 separated phases - loading the table and search.

 Also, it seems to me it can't work as described. Say we build a table
 and
 then find out we need a table 4x the size. If I understand your proposal
 correctly, you'd just resize buckets array, copy the existing buckets
 (first 1/4 buckets) and set 'needs_split' flags. Correct?

 Well, the problem is that while scanning a bucket you already need to
 know
 which tuples belong there. But with the lazy approach, the tuples might
 still be in the old (not yet split) buckets. So you'd have to search the
 current bucket and all buckets that were not split yet. Or did I miss
 something?

 If you have, say, bucket 0..63 and you decide to change it to buckets
 0..255, then any tuple that isn't in bucket N has to be in bucket N %
 64.  More generally, any time you expand or contract by a power of two
 it's pretty easy to figure out where tuples have to go.

OK.

 But even if you do something that's not a power of two, like say a 10x
 compaction of the buckets, there's still only two buckets that can
 contain any particular hash value.  If the hash value is H, the old
 number of buckets is O, and the new number of buckets is N, then the
 tuple has got to be in bucket H % O or bucket H % N.  If bucket H % O
 doesn't have the needs-split flag set, then it must be in H % N (if it
 exists at all).

I wonder if this is really worth the effort - my guess is it's efficient
only if a large portion of the buckets is not visited (and thus does not need to
be split) at all. Not sure how common that is (our workloads certainly are
not like that).

regards
Tomas





Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Robert Haas
On Tue, Jul 8, 2014 at 12:06 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 8 Červenec 2014, 16:16, Robert Haas wrote:
 On Tue, Jul 8, 2014 at 9:35 AM, Tomas Vondra t...@fuzzy.cz wrote:
 Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
 relations it may be way cheaper to use higher NTUP_PER_BUCKET values
 instead of increasing the number of batches (resulting in repeated scans
 of the outer table). I think it's important (but difficult) to keep
 these
 things somehow balanced.

 Right, I think that's clear.  I'm just pointing out that you get to
 decide: you can either start with a larger NTUP_PER_BUCKET and then
 reduce it if you have enough memory left, or you can start with a smaller
 NTUP_PER_BUCKET and then increase it if you run short of memory.

 I don't think those two approaches are equal.

 With the approach I took, I can use a compromise value (NTUP=4) initially,
 and only resize the hash table once at the end (while keeping the amount
 of memory under work_mem all the time).

 With the NTUP=1 and increase in case of memory pressure you have to
 shrink the table immediately (to fit into work_mem), and if the hash table
 gets split into multiple batches you're suddenly in a situation with lower
 memory pressure and you may need to increase it again.

True.  On the other hand, this really only comes into play when the
estimates are wrong.  If you know at the start how many tuples you're
going to need to store and how big they are, you determine whether
NTUP_PER_BUCKET=1 is going to work before you even start building the
hash table.  If it isn't, then you use fewer buckets right from the
start.  If we start by estimating a small value for NTUP_PER_BUCKET
and then let it float upward if we turn out to have more tuples than
expected, we're optimizing for the case where our statistics are
right.  If we start by estimating a larger value for NTUP_PER_BUCKET
than what we think we need to fit within work_mem, we're basically
guessing that our statistics are more likely to be wrong than to be
right.  I think.

 I wonder if this is really worth the effort - my guess is it's efficient
 only if a large portion of the buckets is not visited (and thus does not need to
 be split) at all. Not sure how common that is (our workloads certainly are
 not like that).

Yeah.  It may be a bad idea.  I threw it out there as a possible way
of trying to mitigate the worst case, which is when you go to the trouble of
building the hash table and then make very few probes.  But that may not
be worth the complexity that this would introduce.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Tomas Vondra
On 8.7.2014 19:00, Robert Haas wrote:
 On Tue, Jul 8, 2014 at 12:06 PM, Tomas Vondra t...@fuzzy.cz wrote:
 On 8 Červenec 2014, 16:16, Robert Haas wrote:

 Right, I think that's clear. I'm just pointing out that you get
 to decide: you can either start with a larger NTUP_PER_BUCKET and
 then reduce it if you have enough memory left, or you can start with a
 smaller NTUP_PER_BUCKET and then increase it if you run short of
 memory.

 I don't think those two approaches are equal.

 With the approach I took, I can use a compromise value (NTUP=4) 
 initially, and only resize the hash table once at the end (while
 keeping the amount of memory under work_mem all the time).

 With the NTUP=1 and increase in case of memory pressure you have 
 to shrink the table immediately (to fit into work_mem), and if the 
 hash table gets split into multiple batches you're suddenly in a
 situation with lower memory pressure and you may need to increase
 it again.
 
 True. On the other hand, this really only comes into play when the 
 estimates are wrong. If you know at the start how many tuples you're 
 going to need to store and how big they are, you determine whether 
 NTUP_PER_BUCKET=1 is going to work before you even start building
 the hash table. If it isn't, then you use fewer buckets right from
 the start. If we start by estimating a small value for
 NTUP_PER_BUCKET and then let it float upward if we turn out to have
 more tuples than expected, we're optimizing for the case where our
 statistics are right. If we start by estimating a larger value for
 NTUP_PER_BUCKET than what we think we need to fit within work_mem,
 we're basically guessing that our statistics are more likely to be
 wrong than to be right. I think.

Good point. The first patch was targeted exactly at the wrongly
estimated queries. This patch attempts to apply the rehash to all plans,
and maybe there's a better way.

If the estimates are correct / not too off, we can use this information
to do the sizing 'right' at the beginning (without facing rehashes later).

Over-estimates are not a problem, because it won't make the hash table
slower (it'll be sized for more tuples) and we can't change the number
of batches anyway.

With under-estimates we have to decide whether to resize the hash or
increase the number of batches.

In both cases that matter (correct estimates and under-estimates) we
have to decide whether to increase the number of buckets or batches. I'm
not sure how to do that.

 I wonder if this is really worth the effort - my guess is it's 
 efficient only if a large portion of the buckets is not visited (and
 thus does not need to be split) at all. Not sure how common that is
 (our workloads certainly are not like that).
 
 Yeah. It may be a bad idea. I threw it out there as a possible way of
 trying to mitigate the worst case, which is when you go to the trouble of
 building the hash table and then make very few probes. But that may not be
 worth the complexity that this would introduce.

Let's keep it simple for now. I think the sizing question (explained
above) is more important and needs to be solved first.

regards
Tomas






Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Jeff Janes
On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra t...@fuzzy.cz wrote:

 On 8 Červenec 2014, 14:49, Robert Haas wrote:
  On Wed, Jul 2, 2014 at 8:13 PM, Tomas Vondra t...@fuzzy.cz wrote:
  I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
  once the table is built and there's free space in work_mem. The patch
  mentioned above makes implementing this possible / rather simple.
 
  Another idea would be to start with NTUP_PER_BUCKET=1 and then, if we
  run out of memory, increase NTUP_PER_BUCKET.  I'd like to think that
  the common case is that work_mem will be large enough that everything
  fits; and if you do it that way, then you save yourself the trouble of
  rehashing later, which as you point out might lose if there are only a
  few probes.  If it turns out that you run short of memory, you can
  merge pairs of buckets up to three times, effectively doubling
  NTUP_PER_BUCKET each time.

 Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large outer
 relations it may be way cheaper to use higher NTUP_PER_BUCKET values
 instead of increasing the number of batches (resulting in repeated scans
 of the outer table). I think it's important (but difficult) to keep these
 things somehow balanced.

 With large work_mem values the amount of memory for buckets may be quite
 significant. E.g. 800MB work_mem may easily give ~120MB of memory taken by
 buckets with NTUP_PER_BUCKET=1. With NTUP_PER_BUCKET=4 it's just ~30MB.



That extra 90MB is memory well spent, in my book.  Versus having to walk a
4-deep linked list which is scattered all over main memory just to find one
entry we care about in it.

It might cause some things that were very close to the edge to tip over
into multi-pass hash joins, but even that is not necessarily a bad thing.
 (When I test with work_mem in the 20 to 50 MB range, I often find
batches=2 to be ~30% faster than batches=1, I think because partitioning into
main memory using sequential writes is not much of a burden, and building
and walking two hash tables that both fit in L3 cache is much faster than
building 1 hash table in main memory, and more than makes up for the work
of partitioning.  Presumably that situation doesn't apply to work_mem >
900MB, but isn't NUMA about the same thing as L4 cache, in effect?).

And if someone had a whole bunch of hash joins which were right in that
anti-sweet spot, all they have to do is increase work_mem by (at most) 15%
to get out of it.  work_mem is basically impossible to tune, so I doubt
anyone exists who has a reasonable setting for it which can't be increased
by 15% and still be reasonable.  And if someone does have it tuned so
tightly, they probably won't be upgrading to new major versions without
expecting to do some re-tuning.

Cheers,

Jeff


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Tomas Vondra
On 8.7.2014 21:53, Jeff Janes wrote:
 On Tue, Jul 8, 2014 at 6:35 AM, Tomas Vondra t...@fuzzy.cz wrote:
 
 Maybe. I'm not against setting NTUP_PER_BUCKET=1, but with large
 outer relations it may be way cheaper to use higher NTUP_PER_BUCKET
 values instead of increasing the number of batches (resulting in
 repeated scans of the outer table). I think it's important (but
 difficult) to keep these things somehow balanced.
 
 With large work_mem values the amount of memory for buckets may be
 quite significant. E.g. 800MB work_mem may easily give ~120MB of
 memory taken by buckets with NTUP_PER_BUCKET=1. With
 NTUP_PER_BUCKET=4 it's just ~30MB.
 
 
 That extra 90MB is memory well spent, in my book. Versus having to
 walk a 4-deep linked list which is scattered all over main memory
 just to find one entry we care about in it.

Yes, I share this view, although it really depends on how expensive it's
to rescan the outer relation (due to more batches) vs. lookup in a
deeper hash table.

The other thing is that this memory is currently unaccounted for, i.e.
the space for buckets is not counted within the work_mem limit (unless
I'm missing something in the code). I'm not sure why, and I believe it
should be, so I changed this in the patch.

 It might cause some things that were very close to the edge to tip
 over into multi-pass hash joins, but even that is not necessarily a
 bad thing. (When I test with work_mem in the 20 to 50 MB range, I
 often find batches=2 to be ~30% faster than batches=1, I think because 
 partitioning into main memory using sequential writes is not much of
 a burden, and building and walking two hash tables that both fit in
 L3 cache is much faster than building 1 hash table in main memory,
 and more than makes up for the work of partitioning. Presumably that
 situation doesn't apply to work_mem > 900MB, but isn't NUMA about the
 same thing as L4 cache, in effect?).

Yes, I've seen cases where plans with (nbatch > 1) were faster than a plan
with (nbatch=1). I'm not entirely sure why :-( but I have two hypotheses
so far:

(a) Caching within CPU (current CPUs have multiple MBs of L2 cache),
which may make a difference, especially in the size range you've
mentioned.

(b) Lower tuple/bucket ratio - this may easily happen for example if
the estimates are slightly lower than reality (either row count or
row width) and narrowly exceed work_mem, thus forcing batching.
The resulting hash table has ~50% tuple/bucket on average, and
thus is faster.

 And if someone had a whole bunch of hash joins which were right in
 that anti-sweet spot, all they have to do is increase work_mem by (at
 most) 15% to get out of it. work_mem is basically impossible to tune,
 so I doubt anyone exists who has a reasonable setting for it which
 can't be increased by 15% and still be reasonable. And if someone does
 have it tuned so tightly, they probably won't be upgrading to new
 major versions without expecting to do some re-tuning.

Right. Still, it's not really nice to get slower hash joins after
upgrading to a new version - I somehow expect to get the same or better
performance, if possible. So I'd like to make it as efficient as
possible, within the given memory limit.

Sadly, the increase may be needed anyway because of counting the bucket
memory into work_mem, as mentioned above. If committed, this should
probably be mentioned in the release notes or something.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-08 Thread Tomas Vondra
Hi,

Thinking about this a bit more, do we really need to build the hash
table on the first pass? Why not to do this:

(1) batching
- read the tuples, stuff them into a simple list
- don't build the hash table yet

(2) building the hash table
- we have all the tuples in a simple list, batching is done
- we know exact row count, can size the table properly
- build the table

Also, maybe we could use a regular linear hash table [1], instead of
using the current implementation with NTUP_PER_BUCKET=1. (Although,
that'd be absolutely awful with duplicates.)

regards
Tomas

[1] http://en.wikipedia.org/wiki/Linear_probing




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-06 Thread Tomas Vondra
On 6.7.2014 06:47, Stephen Frost wrote:
 * Greg Stark (st...@mit.edu) wrote:
 Last time was we wanted to use bloom filters in hash joins to
 filter out tuples that won't match any of the future hash batches
 to reduce the amount of tuples that need to be spilled to disk.
 However the problem was that it was unclear for a given amount of
 memory usage how to pick the right size bloom filter and how to
 model how much it would save versus how much it would cost in
 reduced hash table size.
 
 Right. There's only one hash function available, really, (we don't 
 currently support multiple hash functions..), unless we want to try
 and re-hash the 32bit hash value that we get back (not against trying
 that, but it isn't what I'd start with) and it would hopefully be
 sufficient for this.

I don't really see a need to add more hashing functions. Using the hash
value itself as an input to the bloom filter seems just fine to me.

According to [1] the expected number of collisions is

   n - k + k * math.pow((1 - 1/k), n)

where 'k' is the number of possible values (2^32 in our case), and 'n'
is the number of inserts (i.e. values inserted into the table). With a
1GB work_mem and ~50B per row, that's ~20M rows. According to the
formula, that's ~50k collisions. In other words, 0.25% of false
positives. This seems more than sufficient for the bloom filter, and if
0.25% of false positives causes it to be inefficient I'd say the gains
are not that great anyway. (Unless my calculations are somehow flawed,
of course).

[1] https://www.cs.duke.edu/courses/cps102/spring10/Lectures/L-18.pdf
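
A quick numerical check of the formula above (a standalone sketch; exp/log1p keep
(1 - 1/k)^n numerically stable for k = 2^32):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      k = 4294967296.0;   /* 2^32 possible 32-bit hash values */
    double      n = 20e6;           /* ~20M rows (1GB work_mem / ~50B rows) */
    double      collisions;

    /* n - k + k * (1 - 1/k)^n, with (1 - 1/k)^n = exp(n * log1p(-1/k)) */
    collisions = n - k + k * exp(n * log1p(-1.0 / k));

    printf("expected collisions: %.0f (%.3f%% of the rows)\n",
           collisions, 100.0 * collisions / n);
    return 0;
}

This prints roughly 46-47k expected collisions, i.e. about 0.23% of the 20M rows,
consistent with the ~50k / 0.25% figures above.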

 I think it just required some good empirical tests and hash join
 heavy workloads to come up with some reasonable guesses. We don't
 need a perfect model just some reasonable bloom filter size that
 we're pretty sure will usually help more than it hurts.
 
 This would help out a lot of things, really.. Perhaps what Tomas is 
 developing regarding test cases would help here also.

The test cases might be reused I guess. However I still don't have a
clear idea of how exactly the bloom filter would be implemented. In the
threads I found it was suggested to use per-bucket filters, which seems
really strange to me. I see two options:

(a) global filter

- bloom filter sized according to estimates from planner
- ... so we're screwed if the estimates are significantly wrong
- create the filter in ExecHashTableCreate
- insert all tuples in ExecHashTableInsert (all batches)

(b) per-batch filter

- may be built after the batching is completed
- we have reliable numbers at this point (not just estimates)
- per-batch filters may be smaller (only fraction of tuples)

So the per-batch filter seems to be a better approach. The question is
the sizing of the bloom filter. According to [2] there are three rules
of thumb for sizing bloom filters:

   (1) 1B per item, to get 2% error rate
   (2) 0.7 * bits per item hash functions (~5 when using 1B / item)
   (3) number of hashes dominates the performance (more to compute,
   more memory accesses)

When working with 20M rows, this means ~20MB and 5 hash functions. Not
that bad, although we don't know how expensive it is to build the filter.

[2] http://corte.si/posts/code/bloom-filter-rules-of-thumb/index.html
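
Those rules of thumb follow from the textbook bloom filter formulas
m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions (standard formulas,
not something specific to the linked post). A standalone sketch:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double      n = 20e6;           /* distinct keys expected in the filter */
    double      p = 0.02;           /* target false positive rate: 2% */
    double      ln2 = log(2.0);
    double      m = -n * log(p) / (ln2 * ln2);      /* bits in the filter */
    double      k = (m / n) * ln2;                  /* number of hash functions */

    printf("bits per item: %.2f (~%.0f MB total), hash functions: %.1f\n",
           m / n, m / 8 / (1024 * 1024), k);
    return 0;
}

For n = 20M and p = 2% this gives ~8.1 bits (about 1 byte) per item, roughly 19-20MB in
total, and 5-6 hash functions, matching the numbers above.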

However this is somewhat flawed because this assumes the 20M hashed
values are distinct, so if the inner table contains duplicate keys, we
might use a smaller bloom filter. But we don't know that value (ndistinct
estimates are rather unreliable, but we might use a hyperloglog counter to
do this).

Also, there are cases when we know there will be no misses (e.g. a join
on a FK), and in that case it's pointless to build the bloom filter.
Would be nice to get some flag from the planner whether to build it, a
threshold when to build it (e.g. if rowcount is > 10k, build the filter).

Doing something similar for the two patches I posted (tweaking the
nbuckets dynamically) might also benefit from such planning information,
although it's not that critical for them.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-06 Thread Stephen Frost
Tomas,

* Tomas Vondra (t...@fuzzy.cz) wrote:
 I can't find the thread / test cases in the archives. I've found this
 thread in hackers:
 
 http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com
 
 Can you point me to the right one, please?

This:

http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net

and I think there may have been other test cases on that thread
somewhere...  I certainly recall having at least a couple of them that I
was playing with.

I can probably find them around here somewhere too, if necessary.

Thanks,

Stephen




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-06 Thread Tomas Vondra
On 6.7.2014 17:57, Stephen Frost wrote:
 Tomas,
 
 * Tomas Vondra (t...@fuzzy.cz) wrote:
 I can't find the thread / test cases in the archives. I've found this
 thread in hackers:

 http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com

 Can you point me to the right one, please?
 
 This:
 
 http://www.postgresql.org/message-id/20130328235627.gv4...@tamriel.snowman.net

Sadly, the link to the SQL does not work :-(

  http://snowman.net/~sfrost/test_case.sql => 404

 and I think there may have been other test cases on that thread
 somewhere...  I certainly recall having at least a couple of them that I
 was playing with.
 
 I can probably find them around here somewhere too, if necessary.

If you could look for them, that'd be great.

regards
Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-05 Thread Stephen Frost
* Greg Stark (st...@mit.edu) wrote:
 On Thu, Jul 3, 2014 at 11:40 AM, Atri Sharma atri.j...@gmail.com wrote:
  IIRC, last time when we tried doing bloom filters, I was short of some real
  world useful hash functions that we could use for building the bloom filter.
 
 Last time was we wanted to use bloom filters in hash joins to filter
 out tuples that won't match any of the future hash batches to reduce
 the amount of tuples that need to be spilled to disk. However the
 problem was that it was unclear for a given amount of memory usage how
 to pick the right size bloom filter and how to model how much it would
 save versus how much it would cost in reduced hash table size.

Right.  There's only one hash function available, really, (we don't
currently support multiple hash functions..), unless we want to try and
re-hash the 32bit hash value that we get back (not against trying that,
but it isn't what I'd start with) and it would hopefully be sufficient
for this.

 I think it just required some good empirical tests and hash join heavy
 workloads to come up with some reasonable guesses. We don't need a
 perfect model just some reasonable bloom filter size that we're pretty
 sure will usually help more than it hurts.

This would help out a lot of things, really..  Perhaps what Tomas is
developing regarding test cases would help here also.

Thanks,

Stephen




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Tomas Vondra
On 3.7.2014 02:13, Tomas Vondra wrote:
 Hi,
 
 while hacking on the 'dynamic nbucket' patch, scheduled for the next CF
 (https://commitfest.postgresql.org/action/patch_view?id=1494) I was
 repeatedly stumbling over NTUP_PER_BUCKET. I'd like to propose a change
 in how we handle it.
 
 
 TL;DR; version
 --
 
 I propose dynamic increase of the nbuckets (up to NTUP_PER_BUCKET=1)
 once the table is built and there's free space in work_mem. The patch
 mentioned above makes implementing this possible / rather simple.

Attached is v1 of this experimental patch. It's supposed to be applied
on top of v7 of this patch

   http://www.postgresql.org/message-id/53b59498.3000...@fuzzy.cz

as it shared most of the implementation. So apply it like this:

   patch -p1 < hashjoin-nbuckets-growth-v7.patch
   patch -p1 < hashjoin-dynamic-ntup-v1.patch

It implements the ideas outlined in the previous message, most importantly:

   (a) decreases NTUP_PER_BUCKET to 4

   (b) checks free work_mem when deciding whether to add batch

   (c) after building the batches, increases the number of buckets as
   much as possible, i.e. up to the number of batch tuples, but not
   exceeding work_mem

The improvements for the queries I tried so far are quite significant
(partially due to the NTUP_PER_BUCKET decrease, partially due to the
additional bucket count increase).

The rebuild is quite fast - the patch measures and reports this as a
WARNING, and the timings I've seen are ~12ms per 7MB (i.e. ~120ms for
70MB and ~1200ms for 700MB). Of course, this only makes sense when
compared to how much time it saved, but for the queries I tested so far
this was a good investment.

However it's likely there are queries where this may not be the case,
i.e. where rebuilding the hash table is not worth it. Let me know if you
can construct such query (I wasn't).

regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 53642d1..a4623dc 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -58,7 +58,7 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
  */
 
 /* Target bucket loading (tuples per bucket) */
-#define NTUP_PER_BUCKET			10
+#define NTUP_PER_BUCKET			4
 
 /* Multiple of NTUP_PER_BUCKET triggering the increase of nbuckets.
  * 
@@ -77,6 +77,8 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 #define NTUP_GROW_COEFFICIENT	1.333
 #define NTUP_GROW_THRESHOLD		(NTUP_PER_BUCKET * NTUP_GROW_COEFFICIENT)
 
+#define SPACE_USED(hashtable, nbuckets) ((hashtable)->spaceUsed + (nbuckets) * sizeof(void*))
+
 /* 
  *		ExecHash
  *
@@ -156,21 +158,33 @@ MultiExecHash(HashState *node)
 		}
 	}
 
-	/* If average number of tuples per bucket is over the defined threshold,
-	 * increase the number of buckets to get below it. */
+/*
+ * Consider resizing the hash table (number of buckets) for better
+ * lookup performance. The code in ExecHashTableInsert guarantees
+ * we have enough memory to reach NTUP_PER_BUCKET, but maybe we can
+ * do better - getting lower number of tuples per bucket (up to
+ * NTUP_PER_BUCKET=1).
+ */
 	if (enable_hashjoin_bucket) {
 
-		/* consider only tuples in the non-skew buckets */
-		double nonSkewTuples = (hashtable->totalTuples - hashtable->skewTuples);
-
-		if ((nonSkewTuples / hashtable->nbatch) > (hashtable->nbuckets * NTUP_GROW_THRESHOLD)) {
+instr_time start_time, end_time;
 
 #ifdef HJDEBUG
-			printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
-   nbuckets,  (batchTuples / hashtable->nbuckets));
+		printf("Increasing nbucket to %d (average per bucket = %.1f)\n",
+nbuckets,  (batchTuples / hashtable->nbuckets));
 #endif
-			ExecHashIncreaseNumBuckets(hashtable);
-		}
+
+elog(WARNING, "hash resize (start) : nbuckets=%d", hashtable->nbuckets);
+
+INSTR_TIME_SET_CURRENT(start_time);
+
+ExecHashIncreaseNumBuckets(hashtable);
+
+INSTR_TIME_SET_CURRENT(end_time);
+INSTR_TIME_SUBTRACT(end_time, start_time);
+
+elog(WARNING, "hash resize (end) : nbuckets=%d duration=%.3f", hashtable->nbuckets, INSTR_TIME_GET_MILLISEC(end_time));
+
 	}
 
 	/* must provide our own instrumentation support */
@@ -738,35 +752,34 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	int			oldnbuckets = hashtable->nbuckets;
 	HashJoinTuple  *oldbuckets = hashtable->buckets;
 	MemoryContext   oldcxt;
-	double		batchTuples = (hashtable->totalTuples / hashtable->nbatch);
+
+/* average number of tuples per batch */
+double  batchTuples = (hashtable->totalTuples - hashtable->skewTuples) / hashtable->nbatch;
+
+/* memory available for buckets */
+Size    freeMemory = (hashtable->spaceAllowed - hashtable->spaceUsed);
 
 	/*
-	 * Determine the proper number of buckets, i.e. stop once the average
-	 * per bucket gets below the threshold (1.33 * NTUP_PER_BUCKET).
-	
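
(The diff is cut off above. For reference, the core of such a bucket-count increase for
a chained table is just allocating a larger bucket array and relinking each tuple by its
stored hash value. Below is a standalone sketch under assumed data structures;
increase_num_buckets is only a stand-in for ExecHashIncreaseNumBuckets, not the patch
itself.)

#include <stdio.h>
#include <stdlib.h>

typedef struct Tuple
{
    struct Tuple *next;
    unsigned int hashvalue;     /* stored hash, so no re-hashing is needed */
} Tuple;

static Tuple  **buckets;
static unsigned int nbuckets = 4;

/* Allocate a larger bucket array (new_nbuckets is a power of two) and
 * relink every tuple into the bucket its stored hash now maps to. */
static void
increase_num_buckets(unsigned int new_nbuckets)
{
    Tuple     **newbuckets = calloc(new_nbuckets, sizeof(Tuple *));
    unsigned int i;

    for (i = 0; i < nbuckets; i++)
    {
        Tuple      *t = buckets[i];

        while (t != NULL)
        {
            Tuple      *next = t->next;
            unsigned int b = t->hashvalue & (new_nbuckets - 1);

            t->next = newbuckets[b];
            newbuckets[b] = t;
            t = next;
        }
    }
    free(buckets);
    buckets = newbuckets;
    nbuckets = new_nbuckets;
}

int
main(void)
{
    unsigned int h;

    buckets = calloc(nbuckets, sizeof(Tuple *));

    /* load 64 tuples: 16 per bucket, i.e. the table is badly undersized */
    for (h = 0; h < 64; h++)
    {
        Tuple      *t = malloc(sizeof(Tuple));

        t->hashvalue = h;
        t->next = buckets[h & (nbuckets - 1)];
        buckets[h & (nbuckets - 1)] = t;
    }

    increase_num_buckets(64);       /* now one tuple per bucket */

    for (h = 0; h < 64; h++)
    {
        Tuple      *t;
        int         found = 0;

        for (t = buckets[h & (nbuckets - 1)]; t != NULL; t = t->next)
            if (t->hashvalue == h)
                found = 1;
        if (!found)
            printf("hash %u missing!\n", h);
    }
    printf("resize done: %u buckets\n", nbuckets);
    return 0;
}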

Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Stephen Frost
Tomas,

* Tomas Vondra (t...@fuzzy.cz) wrote:
 However it's likely there are queries where this may not be the case,
 i.e. where rebuilding the hash table is not worth it. Let me know if you
 can construct such query (I wasn't).

Thanks for working on this!  I've been thinking on this for a while and
this seems like it may be a good approach.  Have you considered a bloom
filter over the buckets..?  Also, I'd suggest you check the archives
from about this time last year for test cases that I was using which
showed cases where hashing the larger table was a better choice- those
same cases may also show regression here (or at least would be something
good to test).

Have you tried to work out what a 'worst case' regression for this
change would look like?  Also, how does the planning around this change?
Are we more likely now to hash the smaller table (I'd guess 'yes' just
based on the reduction in NTUP_PER_BUCKET, but did you make any changes
due to the rehashing cost?)?

Thanks,

Stephen




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Atri Sharma
On Thu, Jul 3, 2014 at 11:40 PM, Stephen Frost sfr...@snowman.net wrote:

 Tomas,

 * Tomas Vondra (t...@fuzzy.cz) wrote:
  However it's likely there are queries where this may not be the case,
  i.e. where rebuilding the hash table is not worth it. Let me know if you
  can construct such query (I wasn't).

 Thanks for working on this!  I've been thinking on this for a while and
 this seems like it may be a good approach.  Have you considered a bloom
 filter?


IIRC, last time when we tried doing bloom filters, I was short of some real
world useful hash functions that we could use for building the bloom filter.

If we are restarting experiments on this, I would be glad to assist.

Regards,

Atri


Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Tomas Vondra
Hi Stephen,

On 3.7.2014 20:10, Stephen Frost wrote:
 Tomas,
 
 * Tomas Vondra (t...@fuzzy.cz) wrote:
 However it's likely there are queries where this may not be the case,
 i.e. where rebuilding the hash table is not worth it. Let me know if you
 can construct such query (I wasn't).
 
 Thanks for working on this! I've been thinking on this for a while
 and this seems like it may be a good approach. Have you considered a
 bloom filter over the buckets..? Also, I'd suggest you check the

I know you've experimented with it, but I haven't looked into that yet.

 archives from about this time last year for test cases that I was
 using which showed cases where hashing the larger table was a better
 choice- those same cases may also show regression here (or at least
 would be something good to test).

Good idea, I'll look at the test cases - thanks.

 Have you tried to work out what a 'worst case' regression for this 
 change would look like? Also, how does the planning around this
 change? Are we more likely now to hash the smaller table (I'd guess
 'yes' just based on the reduction in NTUP_PER_BUCKET, but did you
 make any changes due to the rehashing cost?)?

The case I was thinking about is underestimated cardinality of the inner
table and a small outer table. That'd lead to a large hash table and
very few lookups (thus making the rehash inefficient). I.e. something
like this:

  Hash Join
    ->  Seq Scan on small_table (rows=100) (actual rows=100)
    ->  Hash
          ->  Seq Scan on bad_estimate (rows=100) (actual rows=10)
                Filter: ((a < 100) AND (b < 100))

But I wasn't able to reproduce this reasonably, because in practice
that'd lead to a nested loop or something like that (which is a planning
issue, impossible to fix in hashjoin code).

Tomas




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Greg Stark
On Thu, Jul 3, 2014 at 11:40 AM, Atri Sharma atri.j...@gmail.com wrote:
 IIRC, last time when we tried doing bloom filters, I was short of some real
 world useful hash functions that we could use for building the bloom filter.

Last time was we wanted to use bloom filters in hash joins to filter
out tuples that won't match any of the future hash batches to reduce
the amount of tuples that need to be spilled to disk. However the
problem was that it was unclear for a given amount of memory usage how
to pick the right size bloom filter and how to model how much it would
save versus how much it would cost in reduced hash table size.

I think it just required some good empirical tests and hash join heavy
workloads to come up with some reasonable guesses. We don't need a
perfect model just some reasonable bloom filter size that we're pretty
sure will usually help more than it hurts.


-- 
greg




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-03 Thread Tomas Vondra
On 3.7.2014 20:50, Tomas Vondra wrote:
 Hi Stephen,
 
 On 3.7.2014 20:10, Stephen Frost wrote:
 Tomas,

 * Tomas Vondra (t...@fuzzy.cz) wrote:
 However it's likely there are queries where this may not be the case,
 i.e. where rebuilding the hash table is not worth it. Let me know if you
 can construct such query (I wasn't).

 Thanks for working on this! I've been thinking on this for a while
 and this seems like it may be a good approach. Have you considered a
 bloom filter over the buckets..? Also, I'd suggest you check the
 archives from about this time last year for test cases that I was
 using which showed cases where hashing the larger table was a better
 choice- those same cases may also show regression here (or at least
 would be something good to test).
 
 Good idea, I'll look at the test cases - thanks.

I can't find the thread / test cases in the archives. I've found this
thread in hackers:

http://www.postgresql.org/message-id/caoezvif-r-ilf966weipk5by-khzvloqpwqurpak3p5fyw-...@mail.gmail.com

Can you point me to the right one, please?

regards
Tomas

