RE: [PHP-DEV] Lazy keys comparison during hash lookups

Andone, Bogdan Tue, 08 Mar 2016 06:19:09 -0800

> From: Nikita Popov [mailto:[email protected]] 
> Sent: Tuesday, March 08, 2016 3:43 PM
> To: Andone, Bogdan
> Cc: [email protected]
> Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <[email protected]> wrote:
> >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> <[email protected]> wrote:
> >> Hi Guys,
> >>
> >> I would like to propose a small change into the DJBX33A hash function
> algorithm which will make easier the key matching validations in hash
> lookup functions.
> >>
> >> The change addresses the modulo 8 tailing bytes of the key. For these
> bytes we can use an 8 bit shift instead of a 5 bit shift; we also need
> to replace the ADD by XOR, in order to avoid byte level overflows. This
> change ensures the uniqueness of the hash function transformation for
> the tailing bytes: supposing two strings have same partial hash value
> for the first Nx8 bytes, different combinations of tailing characters
> (with the same tail size) will always generate different keys.
> >> We have the following consequences:
> >> If two strings have:
> >> - same hash value,
> >> - same length,
> >> - same bytes for the first Nx8 positions,
> >> then they are equal, and the tailing bytes can be skipped during
> comparison.
> >>
> >> There is a visible performance gain if we apply this approach as we
> can use a lightweight memcmp() implementation based on longs comparison
> and completely free of the complexity incurred by tailing bytes. For
> Mediawiki I have a 1.7%  performance gain while Wordpress reports 1.2%
> speedup on Haswell-EP.
> >>
> >> Let’s take a small example:
> >> Suppose we have a key=”this_is_a_key_value”.
> >> The hash function for the  first N x 8 byes are computed in the
> original way; suppose “this_is_a_key_va” (16bytes) will return a partial
> hash value h1; the final hash value will be computed by the following
> sequence:
> >> h = ((h1<<8) ^ h1) ^ ‘l’;
> >> h = ((h<<8) ^ h) ^ ‘u’;
> >> h = ((h<<8) ^ h) ^ ‘e’;
> >> or, in only one operation:
> >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
> (‘l’^’u’^‘e’)
> >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ (‘l’^’u’^‘e’)  cannot
> be obtained by any other 3 characters long tail. The statement is not
> true if we use ADD instead of XOR, as extended ASCII characters might
> generate overflows affecting the LSB of the higher byte in the hash
> value.
> >>
> >> I pushed a pull request here: https://github.com/php/php-
> src/pull/1793. Unfortunately it does not pass the travis tests because
> “htmlspecialchars etc use a generated table that assumes the current
> hash function” as noticed by Nikita.
> >>
> >> Let me know your thoughts on this idea.
> >
> > Hey Bogdan,
> > This looks like an interesting idea! I'm somewhat apprehensive about
> coupling this to a change of the hash function, for two reasons:
> > a) This will make it more problematic if we want to change the hash
> function in the future, e.g. if we want to switch to SipHash.
> > b) The quality of the new hash distribution is not immediately clear,
> but likely non-trivially weaker.
> > So I'm wondering if we can keep the concept of using a zend_ulong
> aligned memcmp while leaving the hash function alone: The zend_string
> allocation policy already allocates the string data aligned and padded
> to zend_ulong boundaries. If we were to additionally explicitly zero out
> the last byte (to avoid valgrind warnings) we should be able to compare
> the character data of two zend_strings using a zend_ulong memcmp. This
> would have the additional benefit that it works for normal string
> comparisons (unrelated to hashtables) as well. On the other hand, this
> is only possible for zend_string to zend_string comparisons, not for
> comparisons with static strings.
> > Regards,
> s/zero out the last byte/zero out the last zend_ulong
> I'd like to add another issue with relying on the hash for this which I
> just remembered: We currently always set the top bit of the hash for
> strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351),
> in order to ensure that hashes are never zero. This makes the hash non-
> unique.
> Nikita
Yeah... I missed the top bit set, but I think it could be solved somehow. Imho 
the strongest reason against the patch is the dependency between the hash 
function and the key validation method but I have the impression that this is 
not be the most dirty hack deployed ever in PHP; and I don't know very well 
what is the degree of compromise acceptable for 1% speedup :-) 
On the other hand, your idea to zero the last zend_ulong in the zend_string 
looks nice. There could be an additional potential gain also from lightweight 
implementation of memcmp(). I will give it a try to see if it gets better 
results than the current proposal.
I assume all zend_string allocation functions are located in zend_string.h, 
correct?


Thanks,
Bogdan

RE: [PHP-DEV] Lazy keys comparison during hash lookups

Reply via email to