On 26 Aug 2025, at 23:53, Phil Smith III <li...@akphs.com> wrote:
> 
> Without commenting on UTF-EBCDIC, I think I can answer:
>> What need does UTF-8 address?
> 
> Fitting the BMP (plus) into as little space as possible. Now, in this modern 
> world of large storage devices and high bandwidth, it's not clear that UTF-8 
> is worth the hassle--but it's entrenched, which makes it important. Or at 
> least here to stay.

UTF-8 is critically important outside the EBCDIC enclave since its first 128 
code points are encoded byte-for-byte identically to 7-bit US-ASCII. 
Compatibility with decades of existing code is critical.
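
As a minimal illustration (my own Java sketch, not part of Phil's point): pure 
ASCII text produces exactly the same bytes under US-ASCII and UTF-8, which is 
why byte-oriented code written for ASCII keeps working unchanged.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class AsciiCompat {
        public static void main(String[] args) {
            String s = "Hello, world!";  // pure 7-bit ASCII text
            byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
            byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);
            // For code points U+0000..U+007F the two encodings are identical.
            System.out.println(Arrays.equals(ascii, utf8));  // true
        }
    }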

> Personally, I think UTF-16 would make life easier in many, many cases.

Just as ASCII and EBCDIC are too US-centric, UTF-16 is too 
old-European-centric. I rarely find software claiming UTF-16 support that 
correctly handles characters above U+FFFF, which UTF-16 must encode as 
surrogate pairs. Very simply, when I see UTF-16, I assume the software involved 
is broken.
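
A hypothetical Java sketch of the usual failure mode (the string is my own 
example): code that treats each 16-bit unit as one character works for the BMP 
and silently corrupts text containing surrogate pairs.

    public class SurrogateTrap {
        public static void main(String[] args) {
            String s = "z/OS \uD83D\uDE00";  // "z/OS " + U+1F600, stored as a surrogate pair
            // "Truncate to 6 characters" done per 16-bit unit, as much BMP-era code does:
            String truncated = s.substring(0, 6);
            // The cut lands between the high and low surrogate, leaving malformed UTF-16.
            System.out.println(Character.isHighSurrogate(truncated.charAt(5)));  // true: dangling surrogate
        }
    }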

With UTF-32 there is no question about at least accepting the full range of 
Unicode characters. And UTF-32 is fixed-width, so counting code points is easy, 
unlike UTF-8 and UTF-16. Since unlimited storage and bandwidth are now 
available, why bother with UTF-16? ¡Just use UTF-32! But if one believes in 
limits, UTF-8 is almost always more compact than UTF-16 and UTF-32 while being 
a compatible superset of 7-bit US-ASCII.
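
To make the size trade-off concrete, here is a small sketch of my own (the 
UTF-32 charset is available via Charset.forName in mainstream JDKs, though it 
is not one of the charsets the Java spec guarantees):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Sizes {
        static void show(String label, String s) {
            System.out.printf("%-7s UTF-8=%d  UTF-16=%d  UTF-32=%d%n",
                    label,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length,   // BE variant avoids the BOM
                    s.getBytes(Charset.forName("UTF-32BE")).length);
        }
        public static void main(String[] args) {
            show("ASCII:", "hello world");   // 11 / 22 / 44
            show("Greek:", "καλημέρα");      // 16 / 16 / 32
            show("Emoji:", "\uD83D\uDE00");  //  4 /  4 /  4 (U+1F600)
        }
    }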

Bringing this discussion back to the z/Architecture instruction set, it once 
seemed unnecessary to me that there were instructions for handling UTF-16, such 
as CUTFU. But I later realized IBM added many of those instructions specifically 
for their JVM. Java strings are UTF-16. Basic things like determining the 
number of characters in a Java string require special processing, so I expect 
most Java applications incorrectly handle characters above U+FFFF, such as the 
characters now common to ALL modern text: emoji 😀 (U+1F600).
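
For instance (my own minimal sketch, not IBM's code), counting what most people 
would call characters already requires the code-point-aware APIs Java added 
later:

    public class CountChars {
        public static void main(String[] args) {
            String s = "\uD83D\uDE00";  // the single code point U+1F600
            System.out.println(s.length());                       // 2: length() counts UTF-16 code units
            System.out.println(s.codePointCount(0, s.length()));  // 1: the actual number of code points
            // Iterating correctly also needs the code-point APIs:
            s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));  // U+1F600
        }
    }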

David

P.S. I am not dumping on Java as a language. All human and programming 
languages have their quirks. I have coded in well over 20 languages, and find 
Java to be far from the worst. The original design choice to use UCS-2 for Java 
strings and the inability to move past UTF-16 is by far my biggest criticism. 
But I get how hard it is to change these design choices. Python 2 is still 
heavily used despite being insecure and unsupported, simply because Python 3 
changed the default string type from byte strings to Unicode and some people 
really don't like change.
