UTF-8 is great for datastreams but a PITA to deal with in a language or an 
application program.

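UTF-16 is the worst of both worlds -- for mostly-ASCII text it uses roughly 
double the space of UTF-8, but you still can't quite deal with the characters 
as though they were fixed size, because anything outside the Basic Multilingual 
Plane takes two 16-bit units (a surrogate pair). Worse, if you do pretend to 
deal with them as fixed size, it mostly works.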
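
To make the "mostly works" trap concrete, here is a minimal Go sketch. The helper names utf16Units and codePoints are mine, purely for illustration. It compares the number of code points in a string with the number of 16-bit units UTF-16 needs for it; the two counts agree until a character outside the BMP shows up.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Units returns how many 16-bit code units s occupies when encoded as UTF-16.
// (Hypothetical helper, for illustration only.)
func utf16Units(s string) int {
	return len(utf16.Encode([]rune(s)))
}

// codePoints returns how many Unicode code points s contains.
// (Hypothetical helper, for illustration only.)
func codePoints(s string) int {
	return len([]rune(s))
}

func main() {
	samples := []string{
		"hello",           // ASCII only
		"h\u00e9llo",      // U+00E9, still inside the BMP
		"h\U0001F600llo",  // U+1F600, outside the BMP: needs a surrogate pair
	}
	for _, s := range samples {
		fmt.Printf("%q: %d code points, %d UTF-16 units, %d UTF-8 bytes\n",
			s, codePoints(s), utf16Units(s), len(s))
	}
}
```

Only the last sample shows a mismatch between code points and UTF-16 units, which is exactly why code that assumes one unit per character can survive a long time before it breaks.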

What about a language concept where data was externalized as UTF-8 but 
presented to the program logic internally as UTF-32? With automatic, 
transparent re-encoding back-and-forth for externalization?
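
A rough sketch of what that could look like, written in Go only because its []rune type is effectively a UTF-32 array (a rune is a 32-bit integer). The function names internalize, externalize and reverse are made up for this example; this is not how any existing language implements the idea, just one way the boundary could be drawn.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// internalize decodes external UTF-8 bytes into fixed-width code points.
// Malformed input is rejected at the boundary instead of being silently
// replaced. (Hypothetical name, for illustration.)
func internalize(external []byte) ([]rune, error) {
	if !utf8.Valid(external) {
		return nil, errors.New("input is not valid UTF-8")
	}
	return []rune(string(external)), nil
}

// externalize re-encodes code points as UTF-8 bytes for storage or transmission.
// Note: Go converts any invalid code point to U+FFFD during this conversion.
func externalize(internal []rune) []byte {
	return []byte(string(internal))
}

// reverse stands in for "program logic": it indexes by code point and never
// has to think about how many bytes each character occupies externally.
func reverse(r []rune) []rune {
	out := make([]rune, len(r))
	for i, c := range r {
		out[len(r)-1-i] = c
	}
	return out
}

func main() {
	external := []byte("na\u00efve \U0001F600") // UTF-8 as read from a file or socket
	internal, err := internalize(external)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%d bytes externally, %d code points internally\n",
		len(external), len(internal))
	fmt.Printf("reversed: %s\n", externalize(reverse(internal)))
}
```

The cost is the one already noted for UTF-32: four bytes per character while the data is in memory, plus a conversion pass at each boundary. Note also that a code point is still not always a user-visible "character" (combining marks, for example), so this fixes indexing by code point, nothing more.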

Charles


-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf 
Of John McKown
Sent: Monday, May 8, 2017 7:08 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Interesting article on UNICODE.

This may be old hat to many/most here. But I found it quite interesting.
And perhaps more important now that IBM is emphasizing z/OS in the "mobile"
world.

http://reedbeta.com/blog/programmers-intro-to-unicode/

IMO, something that IBM languages could find really useful is Google Go's 
concept of a "rune". A "rune" is what Go uses instead of the longer "UNICODE 
Code Point". It is basically the _concept_ of a "letter" without any specific 
bit encoding. E.g. U+0061 is "LATIN SMALL LETTER A". But U+0061 is NOT the 
same thing as 0x61 (UTF-8) or 0x0061 (UTF-16BE) or 0x6100 (UTF-16LE); those 
are just encodings of it.
Unfortunately the language COBOL only has a PIC X. Which is "one 8 bit byte". 
There's not even a _concept_ of a UNICODE Code Point in COBOL. And, honestly, I 
don't really see how to implement such in COBOL, unless they just do the "easy" 
thing and use UTF-32. Which really "wastes" bytes in memory and on disk. Uh, 
unless your DASD array does some sort of transparent compression on the back 
end.

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
