> From: Nikita Popov [mailto:nikita....@gmail.com] > Sent: Tuesday, March 08, 2016 3:43 PM > To: Andone, Bogdan > Cc: internals@lists.php.net > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita....@gmail.com> wrote: > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan > <bogdan.and...@intel.com> wrote: > >> Hi Guys, > >> > >> I would like to propose a small change into the DJBX33A hash function > algorithm which will make easier the key matching validations in hash > lookup functions. > >> > >> The change addresses the modulo 8 tailing bytes of the key. For these > bytes we can use an 8 bit shift instead of a 5 bit shift; we also need > to replace the ADD by XOR, in order to avoid byte level overflows. This > change ensures the uniqueness of the hash function transformation for > the tailing bytes: supposing two strings have same partial hash value > for the first Nx8 bytes, different combinations of tailing characters > (with the same tail size) will always generate different keys. > >> We have the following consequences: > >> If two strings have: > >> - same hash value, > >> - same length, > >> - same bytes for the first Nx8 positions, > >> then they are equal, and the tailing bytes can be skipped during > comparison. > >> > >> There is a visible performance gain if we apply this approach as we > can use a lightweight memcmp() implementation based on longs comparison > and completely free of the complexity incurred by tailing bytes. For > Mediawiki I have a 1.7% performance gain while Wordpress reports 1.2% > speedup on Haswell-EP. > >> > >> Let’s take a small example: > >> Suppose we have a key=”this_is_a_key_value”. > >> The hash function for the first N x 8 byes are computed in the > original way; suppose “this_is_a_key_va” (16bytes) will return a partial > hash value h1; the final hash value will be computed by the following > sequence: > >> h = ((h1<<8) ^ h1) ^ ‘l’; > >> h = ((h<<8) ^ h) ^ ‘u’; > >> h = ((h<<8) ^ h) ^ ‘e’; > >> or, in only one operation: > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ > (‘l’^’u’^‘e’) > >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ (‘l’^’u’^‘e’) cannot > be obtained by any other 3 characters long tail. The statement is not > true if we use ADD instead of XOR, as extended ASCII characters might > generate overflows affecting the LSB of the higher byte in the hash > value. > >> > >> I pushed a pull request here: https://github.com/php/php- > src/pull/1793. Unfortunately it does not pass the travis tests because > “htmlspecialchars etc use a generated table that assumes the current > hash function” as noticed by Nikita. > >> > >> Let me know your thoughts on this idea. > > > > Hey Bogdan, > > This looks like an interesting idea! I'm somewhat apprehensive about > coupling this to a change of the hash function, for two reasons: > > a) This will make it more problematic if we want to change the hash > function in the future, e.g. if we want to switch to SipHash. > > b) The quality of the new hash distribution is not immediately clear, > but likely non-trivially weaker. > > So I'm wondering if we can keep the concept of using a zend_ulong > aligned memcmp while leaving the hash function alone: The zend_string > allocation policy already allocates the string data aligned and padded > to zend_ulong boundaries. If we were to additionally explicitly zero out > the last byte (to avoid valgrind warnings) we should be able to compare > the character data of two zend_strings using a zend_ulong memcmp. This > would have the additional benefit that it works for normal string > comparisons (unrelated to hashtables) as well. On the other hand, this > is only possible for zend_string to zend_string comparisons, not for > comparisons with static strings. > > Regards, > s/zero out the last byte/zero out the last zend_ulong > I'd like to add another issue with relying on the hash for this which I > just remembered: We currently always set the top bit of the hash for > strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), > in order to ensure that hashes are never zero. This makes the hash non- > unique. > Nikita Yeah... I missed the top bit set, but I think it could be solved somehow. Imho the strongest reason against the patch is the dependency between the hash function and the key validation method but I have the impression that this is not be the most dirty hack deployed ever in PHP; and I don't know very well what is the degree of compromise acceptable for 1% speedup :-) On the other hand, your idea to zero the last zend_ulong in the zend_string looks nice. There could be an additional potential gain also from lightweight implementation of memcmp(). I will give it a try to see if it gets better results than the current proposal. I assume all zend_string allocation functions are located in zend_string.h, correct?
Thanks, Bogdan