On 5 Apr 2018, at 20:09, Stefan Bidigaray <stefanb...@gmail.com> wrote:
>
> I know this is probably going to be rejected, but how about making constant
> strings either ASCII or UTF-16 only? Scratching UTF-8 altogether? I know this
> would increase the byte count for most European languages using Latin
> characters, but I don't see the point of maintaining both UTF-8 and UTF-16
> encoding. Everything that can be done with UTF-16 can be encoded in UTF-8
> (and vice versa), so how would the compiler pick between the two?
> Additionally, wouldn't sticking to just 1 of the 2 encodings simplify the code
> significantly?
I am leaning in this direction. The APIs all want UTF-16 code units. In ASCII, each character is exactly one UTF-16 code unit. In UTF-16, every two-byte value is a UTF-16 code unit. In UTF-8, a single UTF-16 code unit corresponds to somewhere between one and three bytes, and the mapping is complicated. It’s a shame that in the 64-bit transition Apple didn’t make unichar 32 bits and make it a Unicode code point, so we’re stuck in the same situation as Windows: a hasty s/UCS2/UTF-16/ and an attempt to keep the existing APIs working.

My current plan is to make the format support ASCII, UTF-8, UTF-16, and UTF-32, but to generate only ASCII and UTF-16 in the compiler, and then decide later whether we want to support generating UTF-8 and UTF-32. I also won’t initialise the hash in the compiler initially, until we’ve decided a bit more what the hash should be.

David
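To make the UTF-8 point concrete: answering "how many UTF-16 code units is this string?" over UTF-8 storage requires decoding every byte, whereas ASCII and UTF-16 storage give the answer directly from the byte count. The sketch below is purely illustrative (not taken from clang or libobjc2, and the function name is made up); it assumes well-formed input:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative only: count the UTF-16 code units needed to represent a
 * NUL-terminated UTF-8 string.  BMP characters (1- to 3-byte UTF-8
 * sequences) need one code unit; characters outside the BMP (4-byte
 * UTF-8 sequences) need a surrogate pair, i.e. two code units.
 * Assumes well-formed UTF-8 and does no validation.
 */
static size_t
utf16_units_for_utf8(const char *str)
{
	const uint8_t *s = (const uint8_t *)str;
	size_t units = 0;
	while (*s != 0)
	{
		if (*s < 0x80)      { s += 1; units += 1; } /* ASCII               */
		else if (*s < 0xE0) { s += 2; units += 1; } /* 2-byte sequence     */
		else if (*s < 0xF0) { s += 3; units += 1; } /* 3-byte sequence     */
		else                { s += 4; units += 2; } /* 4-byte -> surrogate pair */
	}
	return units;
}

int main(void)
{
	/* "héllo" plus an emoji: mixes 1-, 2- and 4-byte sequences; prints 7. */
	printf("%zu\n", utf16_units_for_utf8("h\xC3\xA9llo\xF0\x9F\x98\x80"));
	return 0;
}

The same scan would be needed for any index-based access in UTF-16 terms (think -length or -characterAtIndex: on a constant string), rather than the constant-time lookup that ASCII or UTF-16 storage allows, which is presumably the cost being weighed against the smaller byte count of UTF-8.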