On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita....@gmail.com> wrote:
> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan <bogdan.and...@intel.com> > wrote: > >> Hi Guys, >> >> I would like to propose a small change into the DJBX33A hash function >> algorithm which will make easier the key matching validations in hash >> lookup functions. >> >> The change addresses the modulo 8 tailing bytes of the key. For these >> bytes we can use an 8 bit shift instead of a 5 bit shift; we also need to >> replace the ADD by XOR, in order to avoid byte level overflows. This change >> ensures the uniqueness of the hash function transformation for the tailing >> bytes: supposing two strings have same partial hash value for the first Nx8 >> bytes, different combinations of tailing characters (with the same tail >> size) will always generate different keys. >> We have the following consequences: >> If two strings have: >> - same hash value, >> - same length, >> - same bytes for the first Nx8 positions, >> then they are equal, and the tailing bytes can be skipped during >> comparison. >> >> There is a visible performance gain if we apply this approach as we can >> use a lightweight memcmp() implementation based on longs comparison and >> completely free of the complexity incurred by tailing bytes. For Mediawiki >> I have a 1.7% performance gain while Wordpress reports 1.2% speedup on >> Haswell-EP. >> >> Let’s take a small example: >> Suppose we have a key=”this_is_a_key_value”. >> The hash function for the first N x 8 byes are computed in the original >> way; suppose “this_is_a_key_va” (16bytes) will return a partial hash value >> h1; the final hash value will be computed by the following sequence: >> h = ((h1<<8) ^ h1) ^ ‘l’; >> h = ((h<<8) ^ h) ^ ‘u’; >> h = ((h<<8) ^ h) ^ ‘e’; >> or, in only one operation: >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ >> (‘l’^’u’^‘e’) >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ (‘l’^’u’^‘e’) cannot be >> obtained by any other 3 characters long tail. The statement is not true if >> we use ADD instead of XOR, as extended ASCII characters might generate >> overflows affecting the LSB of the higher byte in the hash value. >> >> I pushed a pull request here: https://github.com/php/php-src/pull/1793. >> Unfortunately it does not pass the travis tests because “htmlspecialchars >> etc use a generated table that assumes the current hash function” as >> noticed by Nikita. >> >> Let me know your thoughts on this idea. >> > > Hey Bogdan, > > This looks like an interesting idea! I'm somewhat apprehensive about > coupling this to a change of the hash function, for two reasons: > a) This will make it more problematic if we want to change the hash > function in the future, e.g. if we want to switch to SipHash. > b) The quality of the new hash distribution is not immediately clear, but > likely non-trivially weaker. > > So I'm wondering if we can keep the concept of using a zend_ulong aligned > memcmp while leaving the hash function alone: The zend_string allocation > policy already allocates the string data aligned and padded to zend_ulong > boundaries. If we were to additionally explicitly zero out the last byte > (to avoid valgrind warnings) we should be able to compare the character > data of two zend_strings using a zend_ulong memcmp. This would have the > additional benefit that it works for normal string comparisons (unrelated > to hashtables) as well. On the other hand, this is only possible for > zend_string to zend_string comparisons, not for comparisons with static > strings. > s/zero out the last byte/zero out the last zend_ulong I'd like to add another issue with relying on the hash for this which I just remembered: We currently always set the top bit of the hash for strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), in order to ensure that hashes are never zero. This makes the hash non-unique. Nikita