clustering: https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html
careful hash functions often treat short inputs specially. iterated shift-xor alone is weak in expanding the "changed bits(s)" signal, at least by comparison to a) large prime multiply, b) good s-boxes, c) introduction of keyed material. collisions will happen, and no general hash will attain the full period possibility. yes a perfect hash is possible but not to my knowledge for general strings. strings tend to be very non-random (a kind of zipf's law for word spelling/product names/text...) so there is a difficult demand for the hash function. in spirit you want to encrypt and carefully fold down to tablesize. utf-8/16/32 strings are somewhat sparse...so you don't have as many bits of input material as you may want. (http://www.unicode.org/faq/utf_bom.html ... 21 bits is enough) searching through parameter space works very well for many hash and similar goals. important to know how to model the input and measure output quality. On Fri, Feb 8, 2019 at 1:56 AM Serhat Sevki Dincer <jfcga...@gmail.com> wrote: > 8 Şub 2019 Cum 11:57 tarihinde Michael Jones <michael.jo...@gmail.com> > şunu yazdı: > >> ...says that in one particular test condition (8 character strings, 1M >> random strings, all possible shift values) >> and under one particular metric of virtue... >> >> x = x<<15 ^ x>>33 >> >> ...gives the closest overall approximation to a random result. you might >> try that in your application to see how well it works for you. that figure >> of merit (44.608) is not particularly high but not terrible. A modification >> of FNV1a can get to 22 (almost: 2.200000000000000355e+01) here. >> > > Hm, so overlapping shifts help get better mixing. As a hash function gets > better at mixing bits, it will almost surely have a collision (due to > birthday paradox) with two inputs with lengths <= 4. > > Can we artificially guarantee a minimum sum of colliding (utf8 string) > inputs that is > 8 ? > > I think such a hash needs to satisfy the following: > > 1) Number of valid utf8 strings with length <= 8 is less than (255^9 - > 1)/254 which is less than 2^64 > So if the hash maps these strings to distinct uint64 values, then minimum > sum of colliding input lengths is at least 9. Can it be > 9 ? > > 2) Be non-trivial. For example, this hash satisfies 1): > if len(input) < 8, pad with zeros up to total 8 bytes and return > otherwise return first 8 bytes of input > > this also satisfies 1): > Treat input as integer in base 255, return input mod 2^64 > > 3) have equidistribution for overall inputs > > Instead of statistical methods, we can maybe brute search among parametric > simple hash functions for one that satisfies these 3 properties? Or > explicitly design one? > >> -- *Michael T. jonesmichael.jo...@gmail.com <michael.jo...@gmail.com>* -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.