On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita....@gmail.com> wrote:

> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan <bogdan.and...@intel.com>
> wrote:
>
>> Hi Guys,
>>
>> I would like to propose a small change to the DJBX33A hash function
>> which will make key-matching validation in the hash lookup functions
>> easier.
>>
>> The change addresses the trailing (length modulo 8) bytes of the key. For
>> these bytes we can use an 8-bit shift instead of the 5-bit shift, and we
>> also need to replace the ADD with XOR in order to avoid byte-level
>> overflows. This change makes the hash transformation for the trailing
>> bytes unique: supposing two strings have the same partial hash value for
>> the first Nx8 bytes, different combinations of trailing characters (with
>> the same tail size) will always produce different final hash values.
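>>
>> To make this concrete, here is a rough sketch of the changed hashing
>> (illustrative only, loosely modeled on zend_inline_hash_func; not the
>> exact patch, and unsigned long stands in for zend_ulong):
>>
>> #include <stddef.h>
>>
>> static unsigned long hash_with_xor_tail(const char *str, size_t len)
>> {
>>     unsigned long hash = 5381;
>>     size_t i = 0;
>>
>>     /* First N x 8 bytes: unchanged DJBX33A step (5-bit shift + ADD). */
>>     for (; i + 8 <= len; i += 8) {
>>         for (size_t j = 0; j < 8; j++) {
>>             hash = ((hash << 5) + hash) + (unsigned char) str[i + j];
>>         }
>>     }
>>
>>     /* Trailing len % 8 bytes: 8-bit shift + XOR.  XOR avoids byte-level
>>      * carries, so equal final hashes with the same tail length imply
>>      * equal tail bytes. */
>>     for (; i < len; i++) {
>>         hash = ((hash << 8) ^ hash) ^ (unsigned char) str[i];
>>     }
>>     return hash;
>> }
>>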
>> We have the following consequences:
>> If two strings have:
>> - same hash value,
>> - same length,
>> - same bytes for the first Nx8 positions,
>> then they are equal, and the trailing bytes can be skipped during the
>> comparison.
>>
>> There is a visible performance gain from this approach, as we can use a
>> lightweight memcmp() implementation based on long-sized word comparisons
>> and completely free of the complexity incurred by the trailing bytes. For
>> MediaWiki I measured a 1.7% performance gain, while WordPress shows a
>> 1.2% speedup on Haswell-EP.
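>>
>> The corresponding key comparison in the hash lookup could then look
>> roughly like this (a sketch only, with made-up names; it assumes the hash
>> and the length have already been found equal, which is why the trailing
>> bytes can be skipped):
>>
>> #include <stddef.h>
>> #include <stdint.h>
>> #include <string.h>
>>
>> static int keys_equal_first_nx8(const char *s1, const char *s2, size_t len)
>> {
>>     size_t words = len / 8;   /* the trailing len % 8 bytes are skipped */
>>
>>     for (size_t i = 0; i < words; i++) {
>>         uint64_t w1, w2;
>>         memcpy(&w1, s1 + i * 8, 8);   /* word loads via memcpy, so no */
>>         memcpy(&w2, s2 + i * 8, 8);   /* alignment/aliasing concerns  */
>>         if (w1 != w2) {
>>             return 0;
>>         }
>>     }
>>     return 1;   /* equal, given same hash, length and first N x 8 bytes */
>> }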
>>
>> Let’s take a small example:
>> Suppose we have a key="this_is_a_key_value".
>> The hash for the first N x 8 bytes is computed in the original way;
>> suppose "this_is_a_key_va" (16 bytes) returns a partial hash value h1.
>> The final hash value is then computed by the following sequence:
>> h = ((h1<<8) ^ h1) ^ 'l';
>> h = ((h<<8) ^ h) ^ 'u';
>> h = ((h<<8) ^ h) ^ 'e';
>> or, in a single expression:
>> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ ('l'<<16) ^ ('u'<<8) ^
>> ('l'^'u'^'e')
>> We can see that the tail contribution ht = ('l'<<16) ^ ('u'<<8) ^
>> ('l'^'u'^'e') cannot be obtained from any other 3-character tail: 'l' and
>> 'u' can be read back directly from bytes 2 and 1 of ht, and 'e' then
>> follows from byte 0, so the mapping from the tail to ht is invertible.
>> The statement would not hold if we used ADD instead of XOR, as extended
>> ASCII characters can generate carries from one byte into the least
>> significant bit of the next higher byte of the hash value.
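>>
>> A few lines of C to sanity-check that the single expression matches the
>> three sequential steps (purely illustrative):
>>
>> #include <assert.h>
>>
>> void check_tail_folding(unsigned long h1)
>> {
>>     unsigned long h = ((h1 << 8) ^ h1) ^ 'l';
>>     h = ((h << 8) ^ h) ^ 'u';
>>     h = ((h << 8) ^ h) ^ 'e';
>>
>>     unsigned long folded = (h1 << 24) ^ (h1 << 16) ^ (h1 << 8) ^ h1
>>                            ^ ((unsigned long) 'l' << 16)
>>                            ^ ((unsigned long) 'u' << 8)
>>                            ^ ('l' ^ 'u' ^ 'e');
>>     assert(h == folded);   /* holds for any h1 */
>> }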
>>
>> I pushed a pull request here: https://github.com/php/php-src/pull/1793.
>> Unfortunately it does not pass the Travis tests because “htmlspecialchars
>> etc use a generated table that assumes the current hash function”, as
>> noticed by Nikita.
>>
>> Let me know your thoughts on this idea.
>>
>
> Hey Bogdan,
>
> This looks like an interesting idea! I'm somewhat apprehensive about
> coupling this to a change of the hash function, for two reasons:
> a) This will make it more problematic if we want to change the hash
> function in the future, e.g. if we want to switch to SipHash.
> b) The quality of the new hash distribution is not immediately clear, but
> likely non-trivially weaker.
>
> So I'm wondering if we can keep the concept of using a zend_ulong aligned
> memcmp while leaving the hash function alone: The zend_string allocation
> policy already allocates the string data aligned and padded to zend_ulong
> boundaries. If we were to additionally explicitly zero out the last byte
> (to avoid valgrind warnings) we should be able to compare the character
> data of two zend_strings using a zend_ulong memcmp. This would have the
> additional benefit that it works for normal string comparisons (unrelated
> to hashtables) as well. On the other hand, this is only possible for
> zend_string to zend_string comparisons, not for comparisons with static
> strings.
>

s/zero out the last byte/zero out the last zend_ulong
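
Roughly what I have in mind, as a sketch only (illustrative names, not the
actual Zend API; it assumes both strings have the same length, are padded to
a zend_ulong boundary, and have their padding bytes zeroed):

#include <stddef.h>
#include <string.h>

/* unsigned long stands in for zend_ulong here. */
static int padded_strings_equal(const char *s1, const char *s2, size_t len)
{
    const size_t word = sizeof(unsigned long);
    /* Round len up to a word boundary; the bytes between len and that
     * boundary are zero in both strings by assumption, so there is no
     * tail-byte special casing at all. */
    size_t padded = (len + word - 1) & ~(word - 1);

    for (size_t i = 0; i < padded; i += word) {
        unsigned long w1, w2;
        memcpy(&w1, s1 + i, word);   /* the real thing could load      */
        memcpy(&w2, s2 + i, word);   /* zend_ulongs directly (aligned) */
        if (w1 != w2) {
            return 0;
        }
    }
    return 1;
}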

I'd like to add another issue with relying on the hash for this which I
just remembered: We currently always set the top bit of the hash for
strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), in
order to ensure that hashes are never zero. This makes the hash non-unique.
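
For illustration only (not the exact code behind that link; 64-bit build
shown), the effect is roughly:

#include <stdint.h>

/* Forcing the top bit keeps the stored hash from ever being zero, but it
 * also makes it lossy: computed values that differ only in the top bit
 * are stored identically. */
static uint64_t stored_hash(uint64_t computed)
{
    return computed | UINT64_C(0x8000000000000000);
}

So equal stored hashes do not imply equal computed hashes, which matters if
the scheme relies on the hash encoding the tail bytes exactly.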

Nikita
