I copied the hash algorithm straight out of -base, so they should match. I remember that a few months ago Richard was playing around with hash functions, and that might be what is causing the issues now.
I just looked it up: the changes were made in rev 36344.

There is another issue: -base allows UTF-8 strings, which will not hash to the same value as the equivalent UTF-16 string. In my opinion, allowing UTF-8 string literals is not a good idea, and base should revert to Latin1 as the default C string encoding. I'm actually debating adding a UTF-16 string literals configure option for corebase. I believe using UTF-16 internally is the only sane solution to non-ASCII encodings.

I've tried experimenting with hash functions other than one-at-a-time, but unfortunately have not found anything that works consistently on both ASCII and Unicode strings. It would be really nice to be able to work with 32- or 64-bit integers directly instead of 8- or 16-bit characters. If we could use UTF-16 across the board, this wouldn't be a problem.

Anyway, those are my thoughts.

On Tue, Aug 6, 2013 at 8:14 AM, Luboš Doležel <[email protected]> wrote:
> Hello,
>
> hash computation with Toll-Free Bridging is a tricky subject. Do it wrong
> and you'll get all sorts of trouble, especially with dictionaries, which
> use hashes a lot.
>
> The code in corebase currently dispatches all CFHash() calls on ObjC
> objects to -hash, which is bad. The following expectation breaks due to
> this dispatch:
>
> CFHash(@"string") == CFHash(CFSTR("string"))
>
> because NSString uses a different hashing algorithm than CFString.
> My suggestion is to do away with the ObjC dispatch in CFHash() and alter
> all the CF*Hash() functions to support ObjC types.
>
> While looking at CFStringHash(), I've also noticed that either 8-bit or
> 16-bit raw character data is used for hashing based on what is available. I
> believe this breaks the following case:
>
> ===
> CFStringRef str1 = CFSTR("str");
> CFStringRef str2 = CFStringCreateWithCharacters(NULL, (UniChar*)
>     "s\0t\0r\0", 3); // "str" in UTF-16
>
> CFHash(str1) == CFHash(str2);
> ===
>
> While the two strings are obviously identical, different bytes are used to
> generate the hash in both cases.
>
> This problem can be solved by converting the character data to Unicode
> first, which has a performance impact, but only once for every CFString.
>
> The situation with CFHash() calls on NSStrings is worse, since corebase
> has nowhere to save the calculated hash, so it must be recalculated every
> time. But I think it's better to be slow than to be wrong. Please review
> the attached patch and let me know if you have any observations.
>
> --
> Luboš Doležel
>
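To make the 8-bit versus 16-bit mismatch from the quoted message concrete, here is a minimal C sketch. It is not the code from -base or corebase. It assumes a Jenkins-style one-at-a-time hash fed one UTF-16 code unit per step, and the names `unichar16`, `hash_utf16`, `hash_latin1` and `hash_raw_bytes` are hypothetical. The idea is that 8-bit storage is widened to code units before hashing, so the same text hashes identically however it is stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UniChar: one UTF-16 code unit. */
typedef uint16_t unichar16;

/* One mixing step of a Jenkins-style one-at-a-time hash, consuming a
 * 16-bit code unit instead of a byte (an assumption for this sketch). */
static uint32_t oaat_step(uint32_t h, uint16_t unit)
{
    h += unit;
    h += h << 10;
    h ^= h >> 6;
    return h;
}

/* Final avalanche of the one-at-a-time hash. */
static uint32_t oaat_finish(uint32_t h)
{
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Hash a string stored as UTF-16 code units. */
uint32_t hash_utf16(const unichar16 *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = oaat_step(h, s[i]);
    return oaat_finish(h);
}

/* Hash a string stored as 8-bit Latin-1 by widening each byte to the
 * code unit it represents. This is only valid for Latin-1 (and ASCII);
 * UTF-8 input would have to be decoded to UTF-16 first. */
uint32_t hash_latin1(const unsigned char *s, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = oaat_step(h, (uint16_t)s[i]);
    return oaat_finish(h);
}

/* For contrast: hash whatever bytes happen to be in memory. Six bytes
 * of UTF-16 storage and three bytes of Latin-1 storage are different
 * inputs here, which is the problem described in the quoted message. */
uint32_t hash_raw_bytes(const void *p, size_t nbytes)
{
    const unsigned char *b = p;
    uint32_t h = 0;
    for (size_t i = 0; i < nbytes; i++)
        h = oaat_step(h, b[i]);
    return oaat_finish(h);
}

/* Self-check: the same text must hash identically in both storage forms. */
void hash_sketch_self_check(void)
{
    static const unsigned char narrow[] = { 's', 't', 'r' };
    static const unichar16 wide[] = { 's', 't', 'r' };
    assert(hash_latin1(narrow, 3) == hash_utf16(wide, 3));
}
```

With this arrangement `hash_latin1` and `hash_utf16` agree for any Latin-1 text, whereas `hash_raw_bytes` over the two storage forms is fed different byte sequences and will generally disagree. The cost is the widening pass over 8-bit strings, which matches the "slow but correct" trade-off argued for above.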
_______________________________________________
Gnustep-dev mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/gnustep-dev
