RE: [PHP-DEV] Lazy keys comparison during hash lookups

2016-04-29 Thread Andone, Bogdan


> -Original Message-
> From: Matt Wilmas [mailto:php_li...@realplain.com]
> Sent: Wednesday, April 27, 2016 5:44 PM
> To: Andone, Bogdan <bogdan.and...@intel.com>; 'Nikita Popov'
> <nikita@gmail.com>
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> 
> Hi Bogdan, all,
> 
> - Original Message -
> From: "Andone, Bogdan"
> Sent: Monday, April 25, 2016
> 
> >> -Original Message-
> >> From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> >> Sent: Friday, March 18, 2016 11:58 AM
> >> To: 'Nikita Popov' <nikita....@gmail.com>
> >> Cc: internals@lists.php.net
> >> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >>
> >> > -Original Message-
> >> > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> >> > Sent: Wednesday, March 09, 2016 9:33 AM
> >> > To: 'Nikita Popov'
> >> > Cc: internals@lists.php.net
> >> > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> >
> >> >
> >> >
> >> > > -Original Message-
> >> > > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> >> > > Sent: Tuesday, March 08, 2016 4:19 PM
> >> > > To: 'Nikita Popov'
> >> > > Cc: internals@lists.php.net
> >> > > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > >
> >> > > > From: Nikita Popov [mailto:nikita@gmail.com]
> >> > > > Sent: Tuesday, March 08, 2016 3:43 PM
> >> > > > To: Andone, Bogdan
> >> > > > Cc: internals@lists.php.net
> >> > > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> >> > > > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov
> >> > > > ><nikita@gmail.com>
> >> > > wrote:
> >> > > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> >> > > > <bogdan.and...@intel.com> wrote:
> >> > > > >> Hi Guys,
> >> > > > >>
> >> > > > >> I would like to propose a small change into the DJBX33A hash
> >> > > function
> >> > > > algorithm which will make easier the key matching validations
> >> > > > in
> >> > hash
> >> > > > lookup functions.
> >> > > > >>
> >> > > > >> The change addresses the modulo 8 tailing bytes of the key.
> >> > > > >> For
> >> > > these
> >> > > > bytes we can use an 8 bit shift instead of a 5 bit shift; we
> >> > > > also
> >> > need
> >> > > > to replace the ADD by XOR, in order to avoid byte level overflows.
> >> > > This
> >> > > > change ensures the uniqueness of the hash function
> >> > > > transformation
> >> > for
> >> > > > the tailing bytes: supposing two strings have same partial hash
> >> > value
> >> > > > for the first Nx8 bytes, different combinations of tailing
> >> > characters
> >> > > > (with the same tail size) will always generate different keys.
> >> > > > >> We have the following consequences:
> >> > > > >> If two strings have:
> >> > > > >> - same hash value,
> >> > > > >> - same length,
> >> > > > >> - same bytes for the first Nx8 positions, then they are
> >> > > > >> equal, and the tailing bytes can be skipped during
> >> > > > comparison.
> >> > > > >>
> >> > > > >> There is a visible performance gain if we apply this
> >> > > > >> approach as
> >> > we
> >> > > > can use a lightweight memcmp() implementation based on longs
> >> > > comparison
> >> > > > and completely free of the complexity incurred by tailing bytes.
> >> > > > For Mediawiki I have a 1.7% performance gain while Wordpress
> >> > > > reports
> >> > 1.2%
> >> > > > speedup on Haswell-EP.
> >> > > > >>
> >> > > > >> Let's take a small example:
> >> > > > >> Suppose we have a key="this_is_a_key_value".
> >> > &

Re: [PHP-DEV] Lazy keys comparison during hash lookups

2016-04-27 Thread Matt Wilmas

Hi Bogdan, all,

- Original Message -
From: "Andone, Bogdan"
Sent: Monday, April 25, 2016


-Original Message-
From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
Sent: Friday, March 18, 2016 11:58 AM
To: 'Nikita Popov' <nikita@gmail.com>
Cc: internals@lists.php.net
Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups

> -Original Message-
> From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> Sent: Wednesday, March 09, 2016 9:33 AM
> To: 'Nikita Popov'
> Cc: internals@lists.php.net
> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
>
>
>
> > -Original Message-
> > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> > Sent: Tuesday, March 08, 2016 4:19 PM
> > To: 'Nikita Popov'
> > Cc: internals@lists.php.net
> > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >
> > > From: Nikita Popov [mailto:nikita@gmail.com]
> > > Sent: Tuesday, March 08, 2016 3:43 PM
> > > To: Andone, Bogdan
> > > Cc: internals@lists.php.net
> > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> > > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov
> > > ><nikita@gmail.com>
> > wrote:
> > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> > > <bogdan.and...@intel.com> wrote:
> > > >> Hi Guys,
> > > >>
> > > >> I would like to propose a small change into the DJBX33A hash
> > function
> > > algorithm which will make easier the key matching validations in
> hash
> > > lookup functions.
> > > >>
> > > >> The change addresses the modulo 8 tailing bytes of the key. For
> > these
> > > bytes we can use an 8 bit shift instead of a 5 bit shift; we also
> need
> > > to replace the ADD by XOR, in order to avoid byte level overflows.
> > This
> > > change ensures the uniqueness of the hash function transformation
> for
> > > the tailing bytes: supposing two strings have same partial hash
> value
> > > for the first Nx8 bytes, different combinations of tailing
> characters
> > > (with the same tail size) will always generate different keys.
> > > >> We have the following consequences:
> > > >> If two strings have:
> > > >> - same hash value,
> > > >> - same length,
> > > >> - same bytes for the first Nx8 positions, then they are equal,
> > > >> and the tailing bytes can be skipped during
> > > comparison.
> > > >>
> > > >> There is a visible performance gain if we apply this approach
> > > >> as
> we
> > > can use a lightweight memcmp() implementation based on longs
> > comparison
> > > and completely free of the complexity incurred by tailing bytes.
> > > For Mediawiki I have a 1.7% performance gain while Wordpress
> > > reports
> 1.2%
> > > speedup on Haswell-EP.
> > > >>
> > > >> Let’s take a small example:
> > > >> Suppose we have a key=”this_is_a_key_value”.
> > > >> The hash function for the first N x 8 byes are computed in the
> > > original way; suppose “this_is_a_key_va” (16bytes) will return a
> > partial
> > > hash value h1; the final hash value will be computed by the
> following
> > > sequence:
> > > >> h = ((h1<<8) ^ h1) ^ ‘l’;
> > > >> h = ((h<<8) ^ h) ^ ‘u’;
> > > >> h = ((h<<8) ^ h) ^ ‘e’;
> > > >> or, in only one operation:
> > > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^
> ((‘l’^‘u’)<<8)
> > ^
> > > (‘l’^’u’^‘e’)
> > > >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
> > (‘l’^’u’^‘e’) cannot
> > > be obtained by any other 3 characters long tail. The statement is
> not
> > > true if we use ADD instead of XOR, as extended ASCII characters
> might
> > > generate overflows affecting the LSB of the higher byte in the
> > > hash value.
> > > >>
> > > >> I pushed a pull request here: https://github.com/php/php-
> > > src/pull/1793. Unfortunately it does not pass the travis tests
> because
> > > “htmlspecialchars etc use a generated table that assumes the
> > > current hash function” as noticed by Nikita.
> > > >>
> > > >> Let me know your thoughts on this idea.
> > > >
> > > > Hey Bogdan,
> > > > This l

RE: [PHP-DEV] Lazy keys comparison during hash lookups

2016-04-25 Thread Andone, Bogdan


> -Original Message-
> From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> Sent: Friday, March 18, 2016 11:58 AM
> To: 'Nikita Popov' <nikita@gmail.com>
> Cc: internals@lists.php.net
> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> 
> > -Original Message-
> > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> > Sent: Wednesday, March 09, 2016 9:33 AM
> > To: 'Nikita Popov'
> > Cc: internals@lists.php.net
> > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >
> >
> >
> > > -Original Message-
> > > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> > > Sent: Tuesday, March 08, 2016 4:19 PM
> > > To: 'Nikita Popov'
> > > Cc: internals@lists.php.net
> > > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> > >
> > > > From: Nikita Popov [mailto:nikita@gmail.com]
> > > > Sent: Tuesday, March 08, 2016 3:43 PM
> > > > To: Andone, Bogdan
> > > > Cc: internals@lists.php.net
> > > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> > > > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov
> > > > ><nikita@gmail.com>
> > > wrote:
> > > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> > > > <bogdan.and...@intel.com> wrote:
> > > > >> Hi Guys,
> > > > >>
> > > > >> I would like to propose a small change into the DJBX33A hash
> > > function
> > > > algorithm which will make easier the key matching validations in
> > hash
> > > > lookup functions.
> > > > >>
> > > > >> The change addresses the modulo 8 tailing bytes of the key. For
> > > these
> > > > bytes we can use an 8 bit shift instead of a 5 bit shift; we also
> > need
> > > > to replace the ADD by XOR, in order to avoid byte level overflows.
> > > This
> > > > change ensures the uniqueness of the hash function transformation
> > for
> > > > the tailing bytes: supposing two strings have same partial hash
> > value
> > > > for the first Nx8 bytes, different combinations of tailing
> > characters
> > > > (with the same tail size) will always generate different keys.
> > > > >> We have the following consequences:
> > > > >> If two strings have:
> > > > >> - same hash value,
> > > > >> - same length,
> > > > >> - same bytes for the first Nx8 positions, then they are equal,
> > > > >> and the tailing bytes can be skipped during
> > > > comparison.
> > > > >>
> > > > >> There is a visible performance gain if we apply this approach
> > > > >> as
> > we
> > > > can use a lightweight memcmp() implementation based on longs
> > > comparison
> > > > and completely free of the complexity incurred by tailing bytes.
> > > > For Mediawiki I have a 1.7%  performance gain while Wordpress
> > > > reports
> > 1.2%
> > > > speedup on Haswell-EP.
> > > > >>
> > > > >> Let’s take a small example:
> > > > >> Suppose we have a key=”this_is_a_key_value”.
> > > > >> The hash function for the  first N x 8 byes are computed in the
> > > > original way; suppose “this_is_a_key_va” (16bytes) will return a
> > > partial
> > > > hash value h1; the final hash value will be computed by the
> > following
> > > > sequence:
> > > > >> h = ((h1<<8) ^ h1) ^ ‘l’;
> > > > >> h = ((h<<8) ^ h) ^ ‘u’;
> > > > >> h = ((h<<8) ^ h) ^ ‘e’;
> > > > >> or, in only one operation:
> > > > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^
> > ((‘l’^‘u’)<<8)
> > > ^
> > > > (‘l’^’u’^‘e’)
> > > > >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
> > > (‘l’^’u’^‘e’)  cannot
> > > > be obtained by any other 3 characters long tail. The statement is
> > not
> > > > true if we use ADD instead of XOR, as extended ASCII characters
> > might
> > > > generate overflows affecting the LSB of the higher byte in the
> > > > hash value.
> > > > >>
> > > > >> I pushed a pull request here: https://github.com/php/php-
> > &g

RE: [PHP-DEV] Lazy keys comparison during hash lookups

2016-03-18 Thread Andone, Bogdan
> -Original Message-
> From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> Sent: Wednesday, March 09, 2016 9:33 AM
> To: 'Nikita Popov'
> Cc: internals@lists.php.net
> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> 
> 
> 
> > -Original Message-
> > From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> > Sent: Tuesday, March 08, 2016 4:19 PM
> > To: 'Nikita Popov'
> > Cc: internals@lists.php.net
> > Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> >
> > > From: Nikita Popov [mailto:nikita@gmail.com]
> > > Sent: Tuesday, March 08, 2016 3:43 PM
> > > To: Andone, Bogdan
> > > Cc: internals@lists.php.net
> > > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> > > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita@gmail.com>
> > wrote:
> > > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> > > <bogdan.and...@intel.com> wrote:
> > > >> Hi Guys,
> > > >>
> > > >> I would like to propose a small change into the DJBX33A hash
> > function
> > > algorithm which will make easier the key matching validations in
> hash
> > > lookup functions.
> > > >>
> > > >> The change addresses the modulo 8 tailing bytes of the key. For
> > these
> > > bytes we can use an 8 bit shift instead of a 5 bit shift; we also
> need
> > > to replace the ADD by XOR, in order to avoid byte level overflows.
> > This
> > > change ensures the uniqueness of the hash function transformation
> for
> > > the tailing bytes: supposing two strings have same partial hash
> value
> > > for the first Nx8 bytes, different combinations of tailing
> characters
> > > (with the same tail size) will always generate different keys.
> > > >> We have the following consequences:
> > > >> If two strings have:
> > > >> - same hash value,
> > > >> - same length,
> > > >> - same bytes for the first Nx8 positions,
> > > >> then they are equal, and the tailing bytes can be skipped during
> > > comparison.
> > > >>
> > > >> There is a visible performance gain if we apply this approach as
> we
> > > can use a lightweight memcmp() implementation based on longs
> > comparison
> > > and completely free of the complexity incurred by tailing bytes. For
> > > Mediawiki I have a 1.7%  performance gain while Wordpress reports
> 1.2%
> > > speedup on Haswell-EP.
> > > >>
> > > >> Let’s take a small example:
> > > >> Suppose we have a key=”this_is_a_key_value”.
> > > >> The hash function for the  first N x 8 byes are computed in the
> > > original way; suppose “this_is_a_key_va” (16bytes) will return a
> > partial
> > > hash value h1; the final hash value will be computed by the
> following
> > > sequence:
> > > >> h = ((h1<<8) ^ h1) ^ ‘l’;
> > > >> h = ((h<<8) ^ h) ^ ‘u’;
> > > >> h = ((h<<8) ^ h) ^ ‘e’;
> > > >> or, in only one operation:
> > > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^
> ((‘l’^‘u’)<<8)
> > ^
> > > (‘l’^’u’^‘e’)
> > > >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
> > (‘l’^’u’^‘e’)  cannot
> > > be obtained by any other 3 characters long tail. The statement is
> not
> > > true if we use ADD instead of XOR, as extended ASCII characters
> might
> > > generate overflows affecting the LSB of the higher byte in the hash
> > > value.
> > > >>
> > > >> I pushed a pull request here: https://github.com/php/php-
> > > src/pull/1793. Unfortunately it does not pass the travis tests
> because
> > > “htmlspecialchars etc use a generated table that assumes the current
> > > hash function” as noticed by Nikita.
> > > >>
> > > >> Let me know your thoughts on this idea.
> > > >
> > > > Hey Bogdan,
> > > > This looks like an interesting idea! I'm somewhat apprehensive
> about
> > > coupling this to a change of the hash function, for two reasons:
> > > > a) This will make it more problematic if we want to change the
> hash
> > > function in the future, e.g. if we want to switch to SipHash.
> > > > b) The quality of the new hash distribution is not immediately
> > clear,
> > &

RE: [PHP-DEV] Lazy keys comparison during hash lookups

2016-03-08 Thread Andone, Bogdan


> -Original Message-
> From: Andone, Bogdan [mailto:bogdan.and...@intel.com]
> Sent: Tuesday, March 08, 2016 4:19 PM
> To: 'Nikita Popov'
> Cc: internals@lists.php.net
> Subject: RE: [PHP-DEV] Lazy keys comparison during hash lookups
> 
> > From: Nikita Popov [mailto:nikita@gmail.com]
> > Sent: Tuesday, March 08, 2016 3:43 PM
> > To: Andone, Bogdan
> > Cc: internals@lists.php.net
> > Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> > >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita@gmail.com>
> wrote:
> > >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> > <bogdan.and...@intel.com> wrote:
> > >> Hi Guys,
> > >>
> > >> I would like to propose a small change into the DJBX33A hash
> function
> > algorithm which will make easier the key matching validations in hash
> > lookup functions.
> > >>
> > >> The change addresses the modulo 8 tailing bytes of the key. For
> these
> > bytes we can use an 8 bit shift instead of a 5 bit shift; we also need
> > to replace the ADD by XOR, in order to avoid byte level overflows.
> This
> > change ensures the uniqueness of the hash function transformation for
> > the tailing bytes: supposing two strings have same partial hash value
> > for the first Nx8 bytes, different combinations of tailing characters
> > (with the same tail size) will always generate different keys.
> > >> We have the following consequences:
> > >> If two strings have:
> > >> - same hash value,
> > >> - same length,
> > >> - same bytes for the first Nx8 positions,
> > >> then they are equal, and the tailing bytes can be skipped during
> > comparison.
> > >>
> > >> There is a visible performance gain if we apply this approach as we
> > can use a lightweight memcmp() implementation based on longs
> comparison
> > and completely free of the complexity incurred by tailing bytes. For
> > Mediawiki I have a 1.7%  performance gain while Wordpress reports 1.2%
> > speedup on Haswell-EP.
> > >>
> > >> Let’s take a small example:
> > >> Suppose we have a key=”this_is_a_key_value”.
> > >> The hash function for the  first N x 8 byes are computed in the
> > original way; suppose “this_is_a_key_va” (16bytes) will return a
> partial
> > hash value h1; the final hash value will be computed by the following
> > sequence:
> > >> h = ((h1<<8) ^ h1) ^ ‘l’;
> > >> h = ((h<<8) ^ h) ^ ‘u’;
> > >> h = ((h<<8) ^ h) ^ ‘e’;
> > >> or, in only one operation:
> > >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8)
> ^
> > (‘l’^’u’^‘e’)
> > >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
> (‘l’^’u’^‘e’)  cannot
> > be obtained by any other 3 characters long tail. The statement is not
> > true if we use ADD instead of XOR, as extended ASCII characters might
> > generate overflows affecting the LSB of the higher byte in the hash
> > value.
> > >>
> > >> I pushed a pull request here: https://github.com/php/php-
> > src/pull/1793. Unfortunately it does not pass the travis tests because
> > “htmlspecialchars etc use a generated table that assumes the current
> > hash function” as noticed by Nikita.
> > >>
> > >> Let me know your thoughts on this idea.
> > >
> > > Hey Bogdan,
> > > This looks like an interesting idea! I'm somewhat apprehensive about
> > coupling this to a change of the hash function, for two reasons:
> > > a) This will make it more problematic if we want to change the hash
> > function in the future, e.g. if we want to switch to SipHash.
> > > b) The quality of the new hash distribution is not immediately
> clear,
> > but likely non-trivially weaker.
> > > So I'm wondering if we can keep the concept of using a zend_ulong
> > aligned memcmp while leaving the hash function alone: The zend_string
> > allocation policy already allocates the string data aligned and padded
> > to zend_ulong boundaries. If we were to additionally explicitly zero
> out
> > the last byte (to avoid valgrind warnings) we should be able to
> compare
> > the character data of two zend_strings using a zend_ulong memcmp. This
> > would have the additional benefit that it works for normal string
> > comparisons (unrelated to hashtables) as well. On the other hand, this
> > is only possible for zend_string to zend_string comparisons, 

RE: [PHP-DEV] Lazy keys comparison during hash lookups

2016-03-08 Thread Andone, Bogdan
> From: Nikita Popov [mailto:nikita@gmail.com] 
> Sent: Tuesday, March 08, 2016 3:43 PM
> To: Andone, Bogdan
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] Lazy keys comparison during hash lookups
> >On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov <nikita@gmail.com> wrote:
> >> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan
> <bogdan.and...@intel.com> wrote:
> >> Hi Guys,
> >>
> >> I would like to propose a small change into the DJBX33A hash function
> algorithm which will make easier the key matching validations in hash
> lookup functions.
> >>
> >> The change addresses the modulo 8 tailing bytes of the key. For these
> bytes we can use an 8 bit shift instead of a 5 bit shift; we also need
> to replace the ADD by XOR, in order to avoid byte level overflows. This
> change ensures the uniqueness of the hash function transformation for
> the tailing bytes: supposing two strings have same partial hash value
> for the first Nx8 bytes, different combinations of tailing characters
> (with the same tail size) will always generate different keys.
> >> We have the following consequences:
> >> If two strings have:
> >> - same hash value,
> >> - same length,
> >> - same bytes for the first Nx8 positions,
> >> then they are equal, and the tailing bytes can be skipped during
> comparison.
> >>
> >> There is a visible performance gain if we apply this approach as we
> can use a lightweight memcmp() implementation based on longs comparison
> and completely free of the complexity incurred by tailing bytes. For
> Mediawiki I have a 1.7%  performance gain while Wordpress reports 1.2%
> speedup on Haswell-EP.
> >>
> >> Let’s take a small example:
> >> Suppose we have a key=”this_is_a_key_value”.
> >> The hash function for the  first N x 8 byes are computed in the
> original way; suppose “this_is_a_key_va” (16bytes) will return a partial
> hash value h1; the final hash value will be computed by the following
> sequence:
> >> h = ((h1<<8) ^ h1) ^ ‘l’;
> >> h = ((h<<8) ^ h) ^ ‘u’;
> >> h = ((h<<8) ^ h) ^ ‘e’;
> >> or, in only one operation:
> >> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
> (‘l’^’u’^‘e’)
> >> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ (‘l’^’u’^‘e’)  cannot
> be obtained by any other 3 characters long tail. The statement is not
> true if we use ADD instead of XOR, as extended ASCII characters might
> generate overflows affecting the LSB of the higher byte in the hash
> value.
> >>
> >> I pushed a pull request here: https://github.com/php/php-
> src/pull/1793. Unfortunately it does not pass the travis tests because
> “htmlspecialchars etc use a generated table that assumes the current
> hash function” as noticed by Nikita.
> >>
> >> Let me know your thoughts on this idea.
> >
> > Hey Bogdan,
> > This looks like an interesting idea! I'm somewhat apprehensive about
> coupling this to a change of the hash function, for two reasons:
> > a) This will make it more problematic if we want to change the hash
> function in the future, e.g. if we want to switch to SipHash.
> > b) The quality of the new hash distribution is not immediately clear,
> but likely non-trivially weaker.
> > So I'm wondering if we can keep the concept of using a zend_ulong
> aligned memcmp while leaving the hash function alone: The zend_string
> allocation policy already allocates the string data aligned and padded
> to zend_ulong boundaries. If we were to additionally explicitly zero out
> the last byte (to avoid valgrind warnings) we should be able to compare
> the character data of two zend_strings using a zend_ulong memcmp. This
> would have the additional benefit that it works for normal string
> comparisons (unrelated to hashtables) as well. On the other hand, this
> is only possible for zend_string to zend_string comparisons, not for
> comparisons with static strings.
> > Regards,
> s/zero out the last byte/zero out the last zend_ulong
> I'd like to add another issue with relying on the hash for this which I
> just remembered: We currently always set the top bit of the hash for
> strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351),
> in order to ensure that hashes are never zero. This makes the hash non-
> unique.
> Nikita
Yeah... I missed the top bit set, but I think it could be solved somehow. Imho 
the strongest reason against the patch is the dependency between the hash 
function and the key validation method but I have the impression that this is 
not be the most dirty hack deployed ever in PHP; and I don't know very well 
what is the degree of compromise acceptable for 1% speedup :-) 
On the other hand, your idea to zero the last zend_ulong in the zend_string 
looks nice. There could be an additional potential gain also from lightweight 
implementation of memcmp(). I will give it a try to see if it gets better 
results than the current proposal.
I assume all zend_string allocation functions are located in zend_string.h, 
correct?

Thanks,
Bogdan


Re: [PHP-DEV] Lazy keys comparison during hash lookups

2016-03-08 Thread Nikita Popov
On Tue, Mar 8, 2016 at 2:18 PM, Nikita Popov  wrote:

> On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan 
> wrote:
>
>> Hi Guys,
>>
>> I would like to propose a small change into the DJBX33A hash function
>> algorithm which will make easier the key matching validations in hash
>> lookup functions.
>>
>> The change addresses the modulo 8 tailing bytes of the key. For these
>> bytes we can use an 8 bit shift instead of a 5 bit shift; we also need to
>> replace the ADD by XOR, in order to avoid byte level overflows. This change
>> ensures the uniqueness of the hash function transformation for the tailing
>> bytes: supposing two strings have same partial hash value for the first Nx8
>> bytes, different combinations of tailing characters (with the same tail
>> size) will always generate different keys.
>> We have the following consequences:
>> If two strings have:
>> - same hash value,
>> - same length,
>> - same bytes for the first Nx8 positions,
>> then they are equal, and the tailing bytes can be skipped during
>> comparison.
>>
>> There is a visible performance gain if we apply this approach as we can
>> use a lightweight memcmp() implementation based on longs comparison and
>> completely free of the complexity incurred by tailing bytes. For Mediawiki
>> I have a 1.7%  performance gain while Wordpress reports 1.2% speedup on
>> Haswell-EP.
>>
>> Let’s take a small example:
>> Suppose we have a key=”this_is_a_key_value”.
>> The hash function for the  first N x 8 byes are computed in the original
>> way; suppose “this_is_a_key_va” (16bytes) will return a partial hash value
>> h1; the final hash value will be computed by the following sequence:
>> h = ((h1<<8) ^ h1) ^ ‘l’;
>> h = ((h<<8) ^ h) ^ ‘u’;
>> h = ((h<<8) ^ h) ^ ‘e’;
>> or, in only one operation:
>> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
>> (‘l’^’u’^‘e’)
>> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ (‘l’^’u’^‘e’)  cannot be
>> obtained by any other 3 characters long tail. The statement is not true if
>> we use ADD instead of XOR, as extended ASCII characters might generate
>> overflows affecting the LSB of the higher byte in the hash value.
>>
>> I pushed a pull request here: https://github.com/php/php-src/pull/1793.
>> Unfortunately it does not pass the travis tests because “htmlspecialchars
>> etc use a generated table that assumes the current hash function” as
>> noticed by Nikita.
>>
>> Let me know your thoughts on this idea.
>>
>
> Hey Bogdan,
>
> This looks like an interesting idea! I'm somewhat apprehensive about
> coupling this to a change of the hash function, for two reasons:
> a) This will make it more problematic if we want to change the hash
> function in the future, e.g. if we want to switch to SipHash.
> b) The quality of the new hash distribution is not immediately clear, but
> likely non-trivially weaker.
>
> So I'm wondering if we can keep the concept of using a zend_ulong aligned
> memcmp while leaving the hash function alone: The zend_string allocation
> policy already allocates the string data aligned and padded to zend_ulong
> boundaries. If we were to additionally explicitly zero out the last byte
> (to avoid valgrind warnings) we should be able to compare the character
> data of two zend_strings using a zend_ulong memcmp. This would have the
> additional benefit that it works for normal string comparisons (unrelated
> to hashtables) as well. On the other hand, this is only possible for
> zend_string to zend_string comparisons, not for comparisons with static
> strings.
>

s/zero out the last byte/zero out the last zend_ulong

I'd like to add another issue with relying on the hash for this which I
just remembered: We currently always set the top bit of the hash for
strings (see http://lxr.php.net/xref/PHP_TRUNK/Zend/zend_string.h#351), in
order to ensure that hashes are never zero. This makes the hash non-unique.

Nikita


Re: [PHP-DEV] Lazy keys comparison during hash lookups

2016-03-08 Thread Nikita Popov
On Tue, Mar 8, 2016 at 2:01 PM, Andone, Bogdan 
wrote:

> Hi Guys,
>
> I would like to propose a small change into the DJBX33A hash function
> algorithm which will make easier the key matching validations in hash
> lookup functions.
>
> The change addresses the modulo 8 tailing bytes of the key. For these
> bytes we can use an 8 bit shift instead of a 5 bit shift; we also need to
> replace the ADD by XOR, in order to avoid byte level overflows. This change
> ensures the uniqueness of the hash function transformation for the tailing
> bytes: supposing two strings have same partial hash value for the first Nx8
> bytes, different combinations of tailing characters (with the same tail
> size) will always generate different keys.
> We have the following consequences:
> If two strings have:
> - same hash value,
> - same length,
> - same bytes for the first Nx8 positions,
> then they are equal, and the tailing bytes can be skipped during
> comparison.
>
> There is a visible performance gain if we apply this approach as we can
> use a lightweight memcmp() implementation based on longs comparison and
> completely free of the complexity incurred by tailing bytes. For Mediawiki
> I have a 1.7%  performance gain while Wordpress reports 1.2% speedup on
> Haswell-EP.
>
> Let’s take a small example:
> Suppose we have a key=”this_is_a_key_value”.
> The hash function for the  first N x 8 byes are computed in the original
> way; suppose “this_is_a_key_va” (16bytes) will return a partial hash value
> h1; the final hash value will be computed by the following sequence:
> h = ((h1<<8) ^ h1) ^ ‘l’;
> h = ((h<<8) ^ h) ^ ‘u’;
> h = ((h<<8) ^ h) ^ ‘e’;
> or, in only one operation:
> h = (h1<<24) ^ (h1<<16) ^ (h1<<8) ^ h1 ^ (‘l’<<16) ^ ((‘l’^‘u’)<<8) ^
> (‘l’^’u’^‘e’)
> We can see that ht=(‘l’<<16) ^ ((‘l’^‘u’)<<8) ^ (‘l’^’u’^‘e’)  cannot be
> obtained by any other 3 characters long tail. The statement is not true if
> we use ADD instead of XOR, as extended ASCII characters might generate
> overflows affecting the LSB of the higher byte in the hash value.
>
> I pushed a pull request here: https://github.com/php/php-src/pull/1793.
> Unfortunately it does not pass the travis tests because “htmlspecialchars
> etc use a generated table that assumes the current hash function” as
> noticed by Nikita.
>
> Let me know your thoughts on this idea.
>

Hey Bogdan,

This looks like an interesting idea! I'm somewhat apprehensive about
coupling this to a change of the hash function, for two reasons:
a) This will make it more problematic if we want to change the hash
function in the future, e.g. if we want to switch to SipHash.
b) The quality of the new hash distribution is not immediately clear, but
likely non-trivially weaker.

So I'm wondering if we can keep the concept of using a zend_ulong aligned
memcmp while leaving the hash function alone: The zend_string allocation
policy already allocates the string data aligned and padded to zend_ulong
boundaries. If we were to additionally explicitly zero out the last byte
(to avoid valgrind warnings) we should be able to compare the character
data of two zend_strings using a zend_ulong memcmp. This would have the
additional benefit that it works for normal string comparisons (unrelated
to hashtables) as well. On the other hand, this is only possible for
zend_string to zend_string comparisons, not for comparisons with static
strings.

Regards,
Nikita