On Wed Sep 21, 2022 at 8:21 PM CEST, NRK wrote:
> On Wed, Sep 21, 2022 at 04:24:28PM +0000, Hadrien Lacour wrote:
> > The other "interesting" one is genhtab, an alternative to gperf that is much
> > simpler to use and produces smaller binaries for big tables (but slower, cf
> > genhtab_bench) using a simple chained hashing table (with some tricks due to
> > being built AOT) using fnv1a; I'll probably try with XXH3 soon, to see if the
> > speed issue can be mitigated without too much binary bloating.
>
> 	const uint32_t bkt = hash & (ht.len - 1);
>
> Due to mul being the last step in fnv1a, the entropy is stored in the
> high bits, so you want to be masking the high bits and not the low ones.

Huh, makes sense; I was somehow counting on it having a good distribution.
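Something like this, I suppose (just a quick sketch for my own understanding,
not what genhtab actually emits; assuming ht.len is a power of two, lg_len =
log2(ht.len) with 1 <= lg_len <= 31, and key is the string being hashed):

	#include <stdint.h>

	/* 32-bit FNV-1a; the multiply is the last step of every round,
	 * so the well-mixed bits end up at the top of the hash */
	static uint32_t fnv1a32(const char *s)
	{
		uint32_t h = 0x811c9dc5u;	/* offset basis */

		for (; *s; ++s) {
			h ^= (unsigned char)*s;
			h *= 0x01000193u;	/* FNV prime */
		}
		return h;
	}

	/* take the bucket from the high bits instead of
	 * `hash & (ht.len - 1)` */
	const uint32_t bkt = fnv1a32(key) >> (32 - lg_len);

The shift looks like the smallest change to the generated code.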
> And for statically generated hash-tables (as well as in general) I tend
> to prefer open-addressing rather than chaining. For probing I prefer
> linear probe with robin-hood insertion[0] for statically generated
> tables.

Never liked the conceptual complexity of open addressing (and how, in
read-write MT scenarios, you lose the ability to lock on a single bucket),
but it could be better, indeed.
Another thing I like about chained hashing is how you can just change the
bucket type according to your read/write performance needs (vector,
skiplist, avl tree, etc.).

> As for why: linear probing nicely exploits spatial locality[1] and
> robin-hood can reduce the max probe size by alleviating clustering[2].
> This leads to both faster worst-case lookups as well as less memory
> usage.

In this specific case, there's not much cache improvement since the bucket
contents are stored sequentially, without pointers except for the keys.

> For run-time tables where you don't know the inputs, double-hashing
> might be a better approach[3]. But for statically generated ones, I
> wouldn't use double-hashing since you can easily change the hash
> function until it's producing good results (e.g. for fnv1a changing the
> starting state or the multiplier, or both[4]).
>
> Also one more trick that can reduce memory usage is to eliminate pointer
> overhead and use an array of arrays, i.e. using `char [][n]` instead of
> `char *[]`. This isn't _guaranteed_ to reduce mem usage though, so
> you'll have to do some calculation to figure it out, i.e:
>
> 	/* approximate mem usage for `char *[]` */
> 	for (sum = 0, i = 0; i < LEN(input); ++i)
> 		sum += (strlen(input[i]) + 1) + sizeof (char *);
> 	/* vs char[][max_input_strlen + 1] */
> 	sum = LEN(input) * (max_input_strlen + 1);

That's a very special case, but worth thinking about, indeed. To be honest,
performance wasn't the focus, since I can't imagine beating gperf -lC.
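To check that I understand the linear probe + robin-hood suggestion, here is
roughly how I picture the lookup side (made-up names, just a sketch, not
genhtab code):

	#include <stdint.h>
	#include <string.h>

	struct slot {
		const char *key;	/* NULL = empty slot */
		uint32_t hash;
	};

	/* len is a power of two; max_probe is the largest displacement left
	 * by robin-hood insertion, so a lookup scans at most max_probe + 1
	 * contiguous slots and can give up early on a miss */
	static const char *rh_lookup(const struct slot *tab, uint32_t len,
	                             uint32_t max_probe, const char *key,
	                             uint32_t hash)
	{
		uint32_t i, b = hash & (len - 1);

		for (i = 0; i <= max_probe; ++i) {
			const struct slot *s = &tab[(b + i) & (len - 1)];

			if (!s->key)
				return NULL;	/* empty slot: key absent */
			if (s->hash == hash && !strcmp(s->key, key))
				return s->key;
		}
		return NULL;	/* probed past the bound: key absent */
	}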
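The "change the hash function until it's producing good results" part is easy
enough to picture at generation time; a rough sketch (made-up names, nothing
genhtab does today, assuming nkeys fits in a table of 1 << lg_len buckets and
1 <= lg_len <= 31):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* try FNV-1a starting states until every key lands in its own
	 * bucket, then bake the winning seed into the generated table;
	 * loops forever if no such seed exists for this table size */
	static uint32_t find_seed(const char **keys, size_t nkeys, unsigned lg_len)
	{
		uint32_t seed = 0x811c9dc5u;	/* standard offset basis */

		for (;; ++seed) {
			unsigned char used[1u << lg_len];	/* C99 VLA */
			size_t i;

			memset(used, 0, sizeof used);
			for (i = 0; i < nkeys; ++i) {
				uint32_t h = seed;
				const char *s;

				for (s = keys[i]; *s; ++s) {
					h ^= (unsigned char)*s;
					h *= 0x01000193u;
				}
				if (used[h >> (32 - lg_len)]++)
					break;	/* collision, try the next seed */
			}
			if (i == nkeys)
				return seed;
		}
	}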
