On 5 Apr 2018, at 20:09, Stefan Bidigaray <stefanb...@gmail.com> wrote:
> 
> I know this is probably going to be rejected, but how about making constant 
> strings either ASCII or UTF-16 only, scrapping UTF-8 altogether? I know this 
> would increase the byte count for most European languages using Latin 
> characters, but I don't see the point of maintaining both UTF-8 and UTF-16 
> encodings. Everything that can be done with UTF-16 can be encoded in UTF-8 
> (and vice versa), so how would the compiler pick between the two? 
> Additionally, wouldn't sticking to just one of the two encodings simplify the 
> code significantly?

I am leaning in this direction.  The APIs all want UTF-16 code units.  In 
ASCII, each character is precisely one UTF-16 code unit.  In UTF-16, every 
two-byte value is a UTF-16 code unit.  In UTF-8, each UTF-16 code unit maps to 
somewhere between one and three bytes, and the mapping is complicated.  It’s 
a shame that in the 64-bit transition Apple didn’t make unichar 32 bits and 
make it a Unicode code point, so we’re stuck in the same situation as Windows, 
with a hasty s/UCS2/UTF-16/ and an attempt to keep the existing APIs working.
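
As a rough illustration of why the UTF-8 case is awkward (the function below is 
a sketch with an invented name, not code from the compiler or the runtime): 
even finding out how many UTF-16 code units a valid UTF-8 buffer needs means 
decoding every lead byte, where ASCII and UTF-16 storage give the answer from 
the byte count alone.

#include <stddef.h>
#include <stdint.h>

/* Sketch only, and it assumes the input is already valid UTF-8: count how
 * many UTF-16 code units are needed to represent a UTF-8 buffer.  For ASCII
 * this is just the byte count, and for UTF-16 storage it is half the byte
 * count, but for UTF-8 we have to walk the string and decode each lead byte. */
static size_t utf16_units_for_utf8(const uint8_t *buf, size_t len)
{
	size_t units = 0;
	size_t i = 0;
	while (i < len)
	{
		uint8_t lead = buf[i];
		if (lead < 0x80)      { i += 1; units += 1; } /* 1 byte, BMP */
		else if (lead < 0xe0) { i += 2; units += 1; } /* 2 bytes, BMP */
		else if (lead < 0xf0) { i += 3; units += 1; } /* 3 bytes, BMP */
		else                  { i += 4; units += 2; } /* 4 bytes, surrogate pair */
	}
	return units;
}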

My current plan is to make the format support ASCII, UTF-8, UTF-16, and UTF-32, 
but to have the compiler generate only ASCII and UTF-16, and then decide later 
whether we want to support generating UTF-8 and UTF-32 as well.  I also won’t 
initialise the hash in the compiler until we’ve decided more firmly what the 
hash function should be.
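
As a rough sketch of the sort of layout I mean (the structure name, field 
order, and flag values below are invented for illustration and are not the 
final format):

#include <stdint.h>

/* Purely illustrative layout.  The format reserves room for four encodings
 * and a hash, but the compiler would only emit ASCII and UTF-16 data for now
 * and would leave the hash field as 0 until the hash function is decided. */
enum
{
	ConstStringASCII = 0, /* generated by the compiler */
	ConstStringUTF8  = 1, /* understood by the runtime, not generated yet */
	ConstStringUTF16 = 2, /* generated by the compiler */
	ConstStringUTF32 = 3  /* understood by the runtime, not generated yet */
};

struct HypotheticalConstString
{
	void       *isa;    /* class pointer, as in any constant string */
	uint32_t    flags;  /* low two bits select one of the encodings above */
	uint32_t    length; /* length, in code units of the chosen encoding */
	uint32_t    hash;   /* 0 until we settle on the hash function */
	const void *data;   /* pointer to the character data */
};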

David

