On 26 Aug 2025, at 23:53, Phil Smith III <li...@akphs.com> wrote:
>
> Without commenting on UTF-EBCDIC, I think I can answer:
>> What need does UTF-8 address?
>
> Fitting the BMP (plus) into as little space as possible. Now, in this modern
> world of large storage devices and high bandwidth, it's not clear that UTF-8
> is worth the hassle--but it's entrenched, which makes it important. Or at
> least here to stay.
UTF-8 is critically important outside of the EBCDIC enclave, since its first 128 characters are identical to US-ASCII-7. Compatibility with decades of code is critical.

> Personally, I think UTF-16 would make life easier in many, many cases.

Just as ASCII and EBCDIC are too US-centric, UTF-16 is too old-European-centric. I rarely find software claiming UTF-16 support that correctly handles UTF-16-encoded characters above U+FFFF. Very simply, when I see UTF-16, I assume the software involved is broken.

With UTF-32 there is no question about at least accepting the full range of Unicode characters. And UTF-32 is fixed-width, so counting characters is easy, unlike UTF-8 and UTF-16. Since unlimited storage and bandwidth are now available, why bother with UTF-16? ¡Just use UTF-32! But if one believes in limits, UTF-8 is almost always more compact than UTF-16 and UTF-32 while remaining a compatible superset of US-ASCII-7.

Bringing this discussion back to the z/Architecture instruction set, it once seemed unnecessary to me that there were instructions for handling UTF-16, such as CUTFU. But I later realized IBM added many of those instructions specifically for their JVM. Java strings are UTF-16, so basic things like determining the number of characters in a string require special processing; I expect most Java applications incorrectly handle characters above U+FFFF, such as the characters common to ALL modern scripts: emoji. 😀 (U+1F600) See the short example at the end of this note.

David

P.S. I am not dumping on Java as a language. All human and programming languages have their quirks; I have coded in well over 20 languages and find Java far from the worst. The original design choice to use UCS-2 for Java strings, and the inability to move beyond UTF-16, is by far my biggest criticism. But I get how hard it is to change these design choices. Python 2 is still heavily used despite being insecure and unsupported, simply because Python 3 changed strings from bytes to Unicode and some people really don’t like change.
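P.P.S. To make the Java point concrete, here is a minimal sketch using only the plain JDK (the class name and string literal are mine, purely illustrative). It shows that String.length() counts UTF-16 code units rather than characters, and compares the encoded sizes of the same text under the three encodings. One caveat: Charset.forName("UTF-32BE") works on the common JDKs I have used, but UTF-32 is not among the guaranteed StandardCharsets.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CodePointDemo {
        public static void main(String[] args) {
            // "Hello " plus U+1F600 (😀), which lies outside the BMP
            String s = "Hello \uD83D\uDE00";

            // length() counts UTF-16 code units, not characters:
            // the emoji is a surrogate pair, so it counts twice.
            System.out.println(s.length());                      // 8
            System.out.println(s.codePointCount(0, s.length())); // 7

            // Encoded sizes (big-endian variants, to avoid a BOM):
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 10
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 16
            System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 28
        }
    }

For mostly-ASCII text like this, UTF-8 is the smallest by a wide margin; text dominated by CJK scripts narrows the gap against UTF-16 (3 bytes per character in UTF-8 versus 2 in UTF-16).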