Hi Guys,

I would like to propose a small change into the DJBX33A hash function algorithm 
which will make easier the key matching validations in hash lookup functions. 

The change addresses the modulo 8 tailing bytes of the key. For these bytes we 
can use an 8 bit shift instead of a 5 bit shift; we also need to replace the 
ADD by XOR, in order to avoid byte level overflows. This change ensures the 
uniqueness of the hash function transformation for the tailing bytes: supposing 
two strings have same partial hash value for the first Nx8 bytes, different 
combinations of tailing characters (with the same tail size) will always 
generate different keys. 
We have the following consequences:
If two strings have:
- same hash value, 
- same length,
- same bytes for the first Nx8 positions,
then they are equal, and the tailing bytes can be skipped during comparison.

There is a visible performance gain if we apply this approach as we can use a 
lightweight memcmp() implementation based on longs comparison and completely 
free of the complexity incurred by tailing bytes. For Mediawiki I have a 1.7%  
performance gain while Wordpress reports 1.2% speedup on Haswell-EP.

Let’s take a small example:
Suppose we have a key=”this_is_a_key_value”. 
The hash function for the  first N x 8 byes are computed in the original way; 
suppose “this_is_a_key_va” (16bytes) will return a partial hash value h1; the 
final hash value will be computed by the following sequence:
h = ((h1<<8) ^ h1) ^ ‘l’;
h = ((h<<8) ^ h) ^ ‘u’;
h = ((h<<8) ^ h) ^ ‘e’;
or, in only one operation:
h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ 
(‘l’^’u’^‘e’)
We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ (‘l’^’u’^‘e’)  cannot be 
obtained by any other 3 characters long tail. The statement is not true if we 
use ADD instead of XOR, as extended ASCII characters might generate overflows 
affecting the LSB of the higher byte in the hash value.

I pushed a pull request here: https://github.com/php/php-src/pull/1793. 
Unfortunately it does not pass the travis tests because “htmlspecialchars etc 
use a generated table that assumes the current hash function” as noticed by 
Nikita.

Let me know your thoughts on this idea.
Thanks,
Bogdan

Reply via email to