Re: Future of locale, will there be POSIX.utf-8, what will it bring?

Steffen Nurpmeso via austin-group-l at The Open Group Fri, 07 Jan 2022 10:46:45 -0800

Hello.

shwaresyst wrote in
 <1494661216.220561.1641574109...@mail.yahoo.com>:
[I re-sort a bit]
 |  On Thu, Jan 6, 2022 at 3:40 PM, Steffen Nurpmeso via austin-group-l \
 |  at The Open Group<austin-group-l@opengroup.org> wrote:   Hello!
 |
 |I wonder about POSIX.utf-?8, i tried to remember any statement
 |i had read, and Mantis did not show up results.
 |
 |In particular i am interested in whether LC_CTYPE results will
 |bring true Unicode support or not, the reason i am asking is that
 |the upcoming version of my work-box GNU LibC-based (2.34) Linux
 |distribution will provide it like
 |
 |  localedef -i POSIX -f UTF-8 $PKG/usr/lib/locale/C.UTF-8 2> /dev/null \
 ||| true
 |
 |and then this thing is detected as an UTF-8 locale, but causes
 |three test failures of the MUA i maintain because character set
 |conversion behaves differently.
 |
 |My personal opinion was that POSIX.utf8 will bring the complete
 |range of Unicode characters to at least LC_CTYPE, i wonder about
 |LC_COLLATE, as language matching is, hm, very language specific.
 |The rest not (maybe LC_MESSAGES going for UTF-8 though).
 |
 |Is that approximately correct?
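
To make the question concrete, here is a minimal sketch of how a program
might probe what a given LC_CTYPE locale actually provides. The helper
name probe_ctype is hypothetical, and the sketch assumes that wchar_t
holds Unicode code points (true for GNU LibC, not guaranteed by POSIX)
and that the codeset name is spelled exactly "UTF-8".

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical helper: returns -1 if the locale cannot be selected,
 * otherwise a bit mask:
 *   bit 0 = nl_langinfo(CODESET) reports "UTF-8"
 *   bit 1 = U+00E4 (a with diaeresis) is classified as alphabetic
 * Assumes wchar_t values are Unicode code points. */
int probe_ctype(const char *name)
{
    char saved[256] = "";
    const char *cur = setlocale(LC_CTYPE, NULL);
    int result = 0;

    /* Remember the current LC_CTYPE so it can be restored afterwards. */
    if (cur != NULL)
        snprintf(saved, sizeof saved, "%s", cur);

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;

    if (strcmp(nl_langinfo(CODESET), "UTF-8") == 0)
        result |= 1;
    if (iswalpha((wint_t)0x00E4))
        result |= 2;

    if (saved[0] != '\0')
        setlocale(LC_CTYPE, saved);
    return result;
}
```

Calling probe_ctype("C.UTF-8") and probe_ctype("en_US.UTF-8") and
comparing the two masks would show whether a locale that announces
UTF-8 also classifies characters beyond ASCII; which locale names exist
is system-specific.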


 |The first Issue 8 draft is focusing, afaik, on adding the C1x changes \
 |and Mantis Issue 8 tagged items. The changes to XBD 6, 7, etc., that \
 |will formally add a POSIX UTF8 locale are to be part of the second, \
 |maybe third, draft. This is why you don't see them yet.
 |For maximum compatibility with existing practice the required base \
 |repertoire for this will likely be some subset of UCS-2, plus ISO-6429 \

I do not see 16-bit characters in POSIX.  Going that route would
make implementations impossible which use specific bit patterns in
wchar_t, which, if I recall correctly from 2014 or whenever I was
looking into the issue, is done by at least the Citrus
implementation of the mb* and w* series for at least some Asian
languages.  And more .. but that is not the issue I am concerned
about at the moment anyhow.  I personally would assume 8-bit aka
UTF-8 character strings to be predominant on Unix-based systems;
they surely are on the predominant ones.  (Even though, I have to
say, UTF-16 aka 16-bit characters do have their value for the
majority of the massively declining number of human languages, and
the older I get the more I think using that as a base is a good
decision.)

 |in full, not the complete range. I've hopes this will be significantly \
 |more than the minimal repertoire of C2x, but it may not as a matter \

That made me look for and download a 2020 draft of ISO C2X; I had
not had a look at it until now.

 |of deferral to the C standard. It should be left up to implementations \
 |still, in my opinion, how much of the range beyond this base they want \
 |to support as extensions, including UTF16 as an encoding. How the LC_* \
 |categories will be extended to fully support that base repertoire accord\
 |ing to the Unicode requirements hasn't been determined yet either, \
 |but this is the nominal goal. 

And at a glance I do not see anything there regarding
Unicode-enabled locales.  UTF-16 specifically I do not see ... as
you will have to convert on input and on output in order to use it
in your program, and then you can just as well convert to the
transparent wchar_t, or use the wide I/O series which gives it to
you.  Minimizing the tremendous deficiency that many traditional
Unix programs have to face, because the historic string interfaces
do not provide proper functionality to deal with human languages,
is out of scope, is it?
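
The conversion on input into wchar_t could look roughly like the
following sketch, which uses only the standard mbrtowc() interface.
The helper name decode_multibyte is hypothetical; the input is taken
to be in the encoding of the current LC_CTYPE locale, whatever that is.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: decode the multibyte string `s` (encoded
 * according to the current LC_CTYPE locale) into at most `cap` wide
 * characters.  Returns the number of wide characters stored, or -1 on
 * an invalid or incomplete sequence. */
long decode_multibyte(const char *s, wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t left = strlen(s);
    long n = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (left > 0 && (size_t)n < cap) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, left, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return -1;                  /* invalid or truncated input */
        if (r == 0)
            break;                      /* decoded a terminating NUL */
        out[n++] = wc;
        s += r;
        left -= r;
    }
    return n;
}
```

A program would call setlocale(LC_CTYPE, "") once at startup and then
pass its input through such a helper; the way back on output goes
through wcrtomb().  What the resulting wchar_t values mean is still up
to the implementation, which is the point made above.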

At least it seems as if ISO C2X introduces support for UTF-8 as
a native string representation ...  In practice it seems Unix
people use GNU libunistring (which explicitly supports
UTF-(32|16|8), I think) as well as ICU (which I think used UTF-16
internally but offered improved UTF-8 interface performance by
then), so the ISO standard people were able to simply ignore their
responsibility and focused on mysterious s..t decisions, and POSIX
has to follow ISO C's suit for one, and then simply did not have
the resources to define an entire Unicode string interface by
themselves ... and so practice has created its own Genesis.

Thank you.  And ciao from Germany,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
