Re: Hash computation and TFB

Stefan Bidi Tue, 06 Aug 2013 06:32:13 -0700

I copied the hash algorithm straight out of -base, so they should match.  I
remember that a few months ago Richard was playing around with hash
functions, and this might be causing some issues now.

I just looked it up: the changes were made in rev 36344.

There is another issue: -base allows UTF-8 strings, whose bytes will not
hash to the same value as the equivalent UTF-16 data.  In my opinion,
allowing UTF-8 string literals is not a good idea and base should revert to
Latin1 as the default C string encoding.  I'm actually debating adding a
UTF-16 string literals configure option for corebase.  I believe using
UTF-16 internally is the only sane solution to non-ASCII encodings.

I've tried experimenting with other hash functions that are not
one-at-a-time, but unfortunately have not found anything that will work on
both ASCII and Unicode strings consistently.  It would be really nice to be
able to work with 32- or 64-bit integers directly instead of 8- or 16-bit
characters.  If we could use UTF-16 across the board, this wouldn't be a
problem.
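
To make the 8-bit versus 16-bit point concrete, here is a minimal sketch of
a one-at-a-time style hash that always mixes in 16-bit code units, so that
Latin1/ASCII storage and UTF-16 storage of the same text hash identically.
This is not the code in corebase or -base: the function names are
hypothetical, and feeding whole 16-bit units (rather than bytes) into the
mixing step is an assumption made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Mix one 16-bit code unit into a running one-at-a-time style hash.
   Hypothetical helper, not taken from corebase or -base. */
static uint32_t oaat_add_unit(uint32_t h, uint16_t unit)
{
    h += unit;
    h += h << 10;
    h ^= h >> 6;
    return h;
}

/* Final avalanche step of the one-at-a-time hash. */
static uint32_t oaat_finish(uint32_t h)
{
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Hash a buffer that is already stored as UTF-16 code units. */
uint32_t hash_utf16(const uint16_t *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = oaat_add_unit(h, s[i]);
    return oaat_finish(h);
}

/* Hash an 8-bit Latin1 (or ASCII) buffer by widening each byte to the
   code unit it represents, so the result matches hash_utf16() for the
   same text. This would not be correct for UTF-8 input. */
uint32_t hash_latin1(const unsigned char *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = oaat_add_unit(h, (uint16_t)s[i]);
    return oaat_finish(h);
}
```

With this, hash_latin1() on the bytes "str" and hash_utf16() on the units
{ 's', 't', 'r' } return the same value, independent of byte order, because
neither function ever looks at the raw storage bytes.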

Anyway, those are my thoughts.

On Tue, Aug 6, 2013 at 8:14 AM, Luboš Doležel <[email protected]> wrote:

> Hello,
>
> hash computation with Toll-Free Bridging is a tricky subject. Do it wrong
> and you'll get all sorts of trouble, especially with dictionaries, which
> use hashes a lot.
>
> The code in corebase currently dispatches all CFHash() calls on ObjC
> objects to -hash, which is bad. The following expectation breaks due to
> this dispatch:
>
> CFHash(@"string") == CFHash(CFSTR("string"))
>
> because NSString uses a different hashing algorithm than CFString.
> My suggestion is to do away with the ObjC dispatch in CFHash() and alter
> all the CF*Hash() functions to support ObjC types.
>
> While looking at CFStringHash(), I've also noticed that either 8-bit or
> 16-bit raw character data is used for hashing based on what is available. I
> believe this breaks the following case:
>
> ===
> CFStringRef str1 = CFSTR("str");
> CFStringRef str2 = CFStringCreateWithCharacters(NULL, (UniChar*)
> "s\0t\0r\0", 3); // "str" in UTF-16
>
> CFHash(str1) == CFHash(str2);
> ===
>
> While the two strings are obviously identical, different bytes are used to
> generate the hash in both cases.
>
> This problem can be solved by converting the character data to Unicode
> first, which has a performance impact, but only once for every CFString.
>
> The situation with CFHash() calls on NSStrings is worse, since corebase
> has nowhere to save the calculated hash, so it must be recalculated every
> time. But I think it's better to be slow than to be wrong. Please review
> the attached patch and let me know if you have any observations.
>
> --
> Luboš Doležel
>

_______________________________________________
Gnustep-dev mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/gnustep-dev
