On Sun, Oct 28, 2018 at 1:22 AM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 10/27/2018 01:51 PM, Antonin Houska wrote:
> > Antonin Houska <a...@cybertec.at> wrote:
> >> Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> >>> On Sat, Oct 27, 2018 at 12:30 AM Antonin Houska <a...@cybertec.at> wrote:
> >>> Are you saying there is a bug in this logic (which is nbuckets * 0.75
> >>> expressed as integer maths), or saying that 0.75 is not a good maximum
> >>> load factor?  I looked around at a couple of general purpose hash
> >>> tables and saw that some used 0.75 and some used 1.0, as a wasted
> >>> space-vs-collision trade-off.  If I have my maths right[1], with 0.75
> >>> you expect to have 75 entries in ~53 buckets, but with 1.0 you expect
> >>> to have 100 entries in ~64 buckets.
> >>
> >> I don't know how exactly you apply the [1] formula (what is "n" and what is
> >> "N" in your case?), but my consideration was much simpler: For example, if
> >> BUCKETS_PER_PARTITION returns 8 (power of 2 is expected here and also more
> >> convenient), then MAX_COUNT_PER_PARTITION returns 8 / 2 + 8 / 4 = 6. Thus the
> >> hashtable gets resized if we're going to add the 7th entry to the partition,
> >> i.e. we the number of entries in the partition is lower than the number of
> >> buckets. Is that o.k.?
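(As a side note on the integer maths: for the power-of-two bucket counts
expected here, nbuckets / 2 + nbuckets / 4 comes out to exactly
nbuckets * 3 / 4, so the 6-of-8 threshold really is the 0.75 maximum load
factor with the floating point taken out.  A quick sanity check in SQL,
using a few illustrative bucket counts:

select nbuckets,
       nbuckets / 2 + nbuckets / 4 as max_count,   -- the integer-maths threshold
       nbuckets * 3 / 4 as three_quarters          -- 0.75 * nbuckets, still in integers
  from (values (8), (64), (1024)) as t(nbuckets);  -- illustrative powers of two

The two computed columns agree on every row: 6, 48 and 768 respectively.)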
n balls = the keys being hashed and inserted
N bins = the hash table buckets

Unless we know of some special properties of the hash function and input
data, I believe we have to treat the hashes like uniformly distributed
random numbers when making predictions, hence the balls-into-bins
probability stuff that can tell you about the expected distribution.

The expected number of occupied bins for 1 million balls into 1 million
bins:

select 1000000.0 * (1.0 - pow(1.0 - (1.0 / 1000000), 1000000.0));
-> 632120.7427683549057142050000000

The same thing by counting distinct positive random numbers modulo 1
million:

select count(distinct (random() * 200000000)::int % 1000000)
  from generate_series(1, 1000000) ss;
-> 632373, 632246, 631954, ...

Distinct hashes of the first 1 million integers (arbitrary keys) modulo 1
million (buckets):

select count(distinct abs(hashoid(n)) % 1000000)
  from generate_series(1, 1000000) ss(n);
-> 632115

(For a moment I was confused about getting a higher number until I
realised I need abs() to avoid doubling the effective number of
buckets...)

> > Well, it may be o.k. I've just checked what the fill factor means in hash
> > index and it's also the number of entries divided by the number of buckets.
> >
>
> Using load factor ~0.75 is definitely the right thing to do. One way to
> interpret it is "average chain length" (or whatever is the approach in
> that particular hash table implementations) and one of the main points
> of hash tables is to eliminate linear scans. We could pick a value
> closer to 1.0, but experience says it's not worth it - it's easy to get
> a annoyingly long chains due to hash collisions, in exchange for fairly
> minimal space savings.

Using the hashes of the first 1 million integers, let's try putting them
into different numbers of buckets, and see how many buckets we get with
each chain length:

WITH entries AS (SELECT abs(hashoid(generate_series(1, 1000000))) % NBUCKETS AS bucket),
     lengths AS (SELECT bucket, COUNT(*) AS chain_length FROM entries GROUP BY bucket)
SELECT chain_length, COUNT(*) AS count
  FROM lengths
 GROUP BY chain_length
 ORDER BY 2 DESC;

NBUCKETS = 1000000 (load factor 1.0)

 chain_length | count
--------------+--------
            1 | 367537
            2 | 184580
            3 |  61109
            4 |  15197
            5 |   3079
            6 |    509
            7 |     95
            8 |      7
            9 |      2

NBUCKETS = 1333333 (load factor 0.75)

 chain_length | count
--------------+--------
            1 | 472012
            2 | 177174
            3 |  44348
            4 |   8361
            5 |   1243
            6 |    143
            7 |      9
            8 |      2

NBUCKETS = 2000000 (load factor 0.5)

 chain_length | count
--------------+--------
            1 | 606451
            2 | 151741
            3 |  25226
            4 |   3148
            5 |    323
            6 |     28
            7 |      2

> That being said, I wonder if we should tweak NTUP_PER_BUCKET=1 in hash
> joins to a lower value. IIRC we ended up with 1.0 because originally it
> was set to 10.0, and we reduced it to 1.0 in 9.5 which gave us nice
> speedup. But I don't recall if we tried using even lower value (probably
> not). Maybe we don't need to do that because we only build the hash
> table at the very end, when we exactly know how many entries will it
> contain, so we don't need to do lookups and inserts at the same time,
> and we don't need to grow the hash table (at least in the non-parallel
> case). And we end up with 0.75 load factor on average, due to the
> doubling (the sizes are essentially uniformly distributed between
> 0.5+epsilon and 1.0-epsilon).

Yeah, it would indeed be interesting to experiment with load factors lower
than 1.0.  Every link in the chain is a cache miss.
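For what it's worth, here's a rough way to put a number on that cost using
the distributions above (only a sketch: it counts entries scanned, not
actual cache misses).  If every inner tuple is equally likely to be probed
for, a probe that lands in the bucket of some existing entry scans the
whole chain, and the average chain length seen by such a probe is
sum(chain_length^2) / sum(chain_length), e.g. for the load factor 0.75
case:

WITH entries AS (SELECT abs(hashoid(generate_series(1, 1000000))) % 1333333 AS bucket),
     lengths AS (SELECT bucket, COUNT(*) AS chain_length FROM entries GROUP BY bucket)
-- average entries scanned by a probe that hashes to an occupied bucket,
-- weighting each bucket by the number of entries already in it
SELECT sum(chain_length * chain_length)::numeric / sum(chain_length) AS entries_scanned
  FROM lengths;

Plugging the three tables above into the same formula gives roughly 2.0,
1.75 and 1.5 entries scanned per matching probe for load factors 1.0, 0.75
and 0.5.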
In future we should work on mitigating that by prefetching during both
building and probing (see nearby thread), but I suspect you can only do
that effectively for the bucket header and perhaps the first tuple in the
chain; if I'm right then longer chains will become even more expensive
relative to short ones.

By the way, aside from wasting memory, when you add extra buckets you make
full/right outer hash joins slower because they scan all the buckets.

-- 
Thomas Munro
http://www.enterprisedb.com