On 11/3/23 12:36:30, Rob Landley wrote:
> > As to the hash code, the only part I've taken from Python's
> > implementation is the probe sequence
>
> So... NOT the "linked lists descending from each hash bucket" approach
> then.
>
> > probe = (probe * 5 + 1 + (perturb >>= 5)) & hashmask
> > where hashmask is 2**n-1 (for table size 2**n) and perturb is
> > initially the entire original key hash.
>
> I had a class on this in college 20 years ago, and literally have not used
> it since, because "the hash table can entirely fill up" and "once the hash
> table is more than half full everything starts colliding just from
> displacements", and "picking the right hash algorithm is now _extra_
> magic"...
The chained hash table (linked lists from buckets) can grow indefinitely but
will slow down as the chains get longer, unless there's code to rebuild it
with more buckets. The other approach, "open addressing" (named in 1957),
instead makes an initial probe and then subsequent probes as needed until an
open slot is found. The table size is either prime or a power of 2. Power of
2 is nice because you can do the initial probe by masking the original hash,
'and'ing it with (2**n-1). Another advantage is that when growing it you can
just double it, while growing a prime-size table means finding a suitable
larger prime (though now that I think of it, maybe a smallish table of
primes would do fine and allow growing by smaller increments. I'll have to
try that...)

The probe sequence can be j, j+1 mod m, j+2 mod m, ... (for table size m),
but that's terrible. In 1968 Guy de Balbine came up with double hashing,
where the sequence is j+k mod m, j+2k mod m, ... and k is some different
hash of the same key, but it must also be non-zero and relatively prime to
m. For a prime table size, that's anything from 1 to m-1. For a power-of-two
table, that's any odd number from 1 to m-1. (Which is, I think, the main
reason table sizes are always prime or 2**n.)

> > This is brilliant and probably due to Tim Peters
>
> "probably"? (I'm happy to cite a primary source...)

I tried for an hour or two to find a definitive statement that Tim came up
with the perturb idea. The long comment in the Python source says he tested
various shifts for perturb. A commit by Tim
https://github.com/python/cpython/commit/81673fb5ca25e17a1c1b806c06b046669a566a88
is the best evidence I have that j = (j * 5 + 1 + (perturb >>= 5)) mod m is
all Tim's. Lots of good info there too.

> > and already copied in
> > other hash table implementations (including vim), so I doubt anyone
> > will mind using the same probe sequence.
>
> And this is better than a constant prime number displacement because
> you're guaranteed to hit every entry in 2*sizeof(table) checks instead of
> some multiple with the prime, and also adjacent displacement chains are
> less likely to connect up?

The j = (j * 5 + 1) mod m part alone (without perturb) will hit all the
slots, as will any constant odd (not necessarily prime) stride. Whether the
constant is 1 or 37 or whatever, it will still be real slow. The j*5+1
recurrence works but the 5 is not unique: for a 2**n table size, j*a+c works
if c is odd and (a-1) is a multiple of 4 (the Hull-Dobell theorem, in Knuth
vol. 2). So j*33+1 or j*17+3 also work... I ran some tests.

[Digression: My tsort counts all the input symbols and uses that count to
set the table size. Worst case is all symbols unique, so I need the table to
be at least 1 larger than that. But generally the symbols will appear at
least twice in real input, as predecessors and successors, so the table will
be less than half full and nearly any decent probe scheme will be fast. I
tweaked the sizing code to:

  for (k = 2; k < (nsym < 128 ? 128 : nsym + nsym / 128); k *= 2);

Now 8323580 unique symbols will make a table of 2**23 which will be .992
full, which should stress the probe logic to the max. (8323582 will double
the table size, so 8323580 is the max for a table of 2**23.)]

So I made a file of 8323580 unique symbols of 1 to 12 alphanumeric
characters each, and another file with 8323582 symbols so it "unstressed"
the 2**24 table.
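(For concreteness, here's a minimal sketch of the kind of open-addressing
lookup/insert loop these timings exercise. It is not the actual tsort or
toybox code; the slot struct and names are made up, and it assumes a 2**n
table that always keeps at least one empty slot so the loop terminates.)

  #include <string.h>

  struct slot { char *key; };          /* NULL key means an empty slot */

  char *find_or_add(struct slot *tb, unsigned long mask,
                    unsigned long hash, char *key)
  {
    unsigned long j = hash & mask, perturb = hash;

    for (;;) {
      if (!tb[j].key) return tb[j].key = key;         /* empty: claim it */
      if (!strcmp(tb[j].key, key)) return tb[j].key;  /* already present */
      perturb >>= 5;               /* feed in more of the original hash */
      j = (j * 5 + 1 + perturb) & mask;
    }
  }

The rows in the table below just swap out those last two lines: different
multiplier/increment, a different (or no) perturb shift, or a constant
second-hash stride computed once up front.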
Load %:      .992        .496
With         8323580     8323582    probe sequence
========     =======     ==============
18.10sec     4.26        j * 5 + 1 + (perturb >>= 5)  (per Python)
18.38        5.06        j * 33 + 1 ^ (perturb >>= 5)
15.98        4.52        j * 33 + 1 ^ (perturb >>= 4)
1m35.90      6.09        j * 5 + 1
1m39.20      4.56        j * 33 + 1
1m37.67      4.48        j * 17 + 3
8.30         8.04        j + ((hash >> 8) | 1)
didn't try   2m51.72     j + ((hash >> 23) | 1)
9.06         6.01        j + ((hash * stringlength) | 1)
didn't try   3m15.25     j + 37 (constant stride)
didn't try   83m27.71    j + 1 (classic linear probe, constant stride)

With perturb shifting, the j*a+1 variants are about the same whether a is 5
or 33, the perturb shift is 5 or 4, or the perturb is mixed in with + or ^.
Note that j*5+1, j*33+1, and j*17+3 (no perturb) are about equal, and bad
with a full table. Using the original hash shifted 1 byte right works pretty
effectively as a second hash (de Balbine, Knuth's algorithm D); hash >> 23
was not good. Similarly, I tried hash*strlen(s) as a secondary hash and got
pretty good results as well. The constant stride of 37 did badly and the
linear probe (stride 1) was just terrible. (Yet linear probing is supposed
to be decent up to 50% full, per Knuth? Maybe I need to investigate
further.)

> With >>5 you need a table size over a million to more than 4 nonlinear
> displacements in a given chain. It's not THAT much protection. Still seems
> input data dependent. A hash table without a good hash function goes
> septic quick, and "good" is contextual. (...)

The point of the Python approach was to work well with a "bad" hash
function, as Python's is bad from the usual viewpoint. E.g. integers are
their own hash: h(1) is 1, h(2) is 2, etc. So the best approach for Python
may not be best with a different hash function. I'm using D. Bernstein's
djb2a (djb2 modified). (BTW, it looks like pending/diff.c tried to use that
but there's a transposition in the initial constant: they have 5831, it
should be 5381.)

> I didn't make it all the way through the 200+ lines of description at the
> top, not because it's hard to read but because they thought it NEEDED that
> much explanation.
>
> Remember my gripe at the end of
> https://landley.net/notes-2023.html#12-09-2023 where I was complaining
> that I'd felt the need to write a long explanation of the code as a
> comment in the source file? That meant (to me) that the code was still
> wrong.

See https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01701.html
for a different opinion on that, and on that comment in particular.

> Supporting deletion in-place in the linked list approach was reasonably
> straightforward. I remember that deleting in the table walk version was
> fiddly but I don't remember the solution.
> [...]
> Hang on, I think I asked the grad student about it. You could either read
> _backwards_ through displacements (not feasible when your displacements
> aren't constant) or having whiteouts as in a "there was something here so
> keep reading" marker commemorating removed entries so the table could
> never actually go back to a clean state. (Because with a variable stride
> even reading through a full-to-end whiteout list with no hits doesn't mean
> you can REMOVE the whiteouts, because you essentially have cross-linked
> lists. You have to garbage collect your table, and generally you allocate
> a BIGGER table and re-hash every entry into the new one to get a clean
> table once you've hit a certain pathological size, and I went "I'll let
> the library do this for me"...)

Yes, that's pretty much it. Mark the deleted entries, rebuild a bigger table
when needed.
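(In sketch form, reusing the made-up slot struct and probe loop from above:
a deleted entry gets a "whiteout" marker instead of going back to empty, so
later probes keep walking past it, and the markers only go away when the
table is rebuilt. Again, this is illustrative, not the actual toybox code.)

  #define DELETED ((char *)-1)   /* whiteout: something used to be here */

  long find_slot(struct slot *tb, unsigned long mask,
                 unsigned long hash, char *key)
  {
    unsigned long j = hash & mask, perturb = hash;

    for (;;) {
      if (!tb[j].key) return -1;            /* truly empty: not present */
      if (tb[j].key != DELETED && !strcmp(tb[j].key, key)) return j;
      perturb >>= 5;
      j = (j * 5 + 1 + perturb) & mask;
    }
  }

  void delete_key(struct slot *tb, unsigned long mask,
                  unsigned long hash, char *key)
  {
    long j = find_slot(tb, mask, hash, key);

    if (j >= 0) tb[j].key = DELETED;    /* keep later probe chains intact */
  }

An insert can reuse the first DELETED slot it passed once it knows the key
isn't already in the table, but the slot can never become truly empty again
without a rebuild.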
Most systems expand when the table is about 60%-80% full, counting deleted
slots.

> I'm happy to do a proper version in lib/ if we decide we need this. As I
> said, tar could also use this and I'm pretty sure I've got a couple more
> in the todo heap. (It's just "tsort go fast" that seemed a strange entry
> point to this round of optimization...)

It's just kind of hard to get a decent generalized hash table system that
handles expansion and deletions without having a lot of code. My forthcoming
awk hash table uses the code from my expanding sequential lists and my
dynamic ref-counted strings, so I doubt it's general enough to be useful in
your lib.

> Hmmm... a 1024 entry table that garbage collects to a 32k entry table
> which garbage collects to a million table entry... That adds 5 bits each
> time so the perturb thingy gets 1 extra stride meaning the collisions get
> re-jiggered and at least the first 2 garbage collections would be
> negligible latency spikes (and if you gc from a million that's on you).

Not sure if you're getting what the perturb thingy is. It's just the full
original hash; some of its upper bits get mixed into the stride with each
step, but it eventually shifts down to 0, so the remaining steps are just
the j*5+1 mod m part, which (as shown in my table above) does pretty well
for a half-full table and badly for a full table. But Python never lets its
tables get past about 2/3 full.

Some other interesting links:

This guy https://www.andreinc.net/2021/11/08/a-tale-of-java-hash-tables did
a benchmark of Java's chained-hashing HashMap against a bunch of different
open-addressing versions he implemented, and the original HashMap came out
on top. I need to look into that. Then the Hacker News guys commented on it
(https://news.ycombinator.com/item?id=29319151) and I think mostly preferred
open addressing.

There's a good comparison of hashing algos (not the table handling, but the
string hashing) at
https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed.

Ray

_______________________________________________
Toybox mailing list
Toybox@lists.landley.net
http://lists.landley.net/listinfo.cgi/toybox-landley.net