[ast-developers] ksh93 support for Unicode outside BMP (=basic multilingual plane) ...

Roland Mainz Mon, 25 Jun 2012 08:24:09 -0700

Hi!

----


Two weeks ago I tested ksh93's GB18030 and Unicode support and hit a
very bad issue with printf '%q\n'. (GB18030 is relevant here because
the standard requires support for characters _outside_ the BMP, i.e.
all GB18030-conforming applications must support Unicode code points >
0xFFFF without problems.)

In theory (the remainder of the text assumes a *.UTF-8 locale), if a
character is not printable (i.e. |iswprint()| returns |0|), then
ksh93's printf '%q' quoting support should use "\u<hex unicode code
point>".
The problem with that is...
1. ... characters beyond Unicode code point 0xffff do not work because
the implementation somehow doesn't get it right on SuSE 12.1
Linux/AMD64. For example, this prints garbage instead of a valid
Unicode character encoded in UTF-8:
$ LC_ALL=en_US.UTF-8 ksh -c 'printf "\u[1F640]\n"'
2. ... characters beyond Unicode code point 0xffff would
(theoretically) generate ambiguous \u sequences because there is no way
to tell whether the code point has four, five or six hexadecimal
digits, e.g. this code in src/cmd/ksh93/sh/string.c ...
-- snip --
#if SHOPT_MULTIBYTE
        if(!iswprint(c) || (c>=256 && iswspace(c)))
        {
                sfprintf(staksp,"\\u%.4x",c);
                continue;
        }
#else
-- snip --
How should a reader distinguish Unicode code point 0xabcd followed by
the German word "affe" (= Engl. "ape") from a single, longer code
point? The current implementation would print this as "\uabcdaffe",
and it is unclear how the reader should parse it.

3. How \u<hex unicode code point> sequences should work is pretty
much disputed and implementation-specific (except "\u[<hex unicode code
point>]", see below) ... which completely ruins interoperability. For
example, many old applications assume that all Unicode values fit into
four hexadecimal digits. That is wrong: the so-called "Basic
Multilingual Plane" (="BMP") covers the Unicode code points
0x0000-0xFFFF, but there are Unicode code points outside the BMP (see
http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Supplementary_Multilingual_Plane).
In those applications, support for values outside the BMP is somehow
"tacked on" via other escape sequences. Other implementations just
say: "... \u is followed by four hex digits. Values outside BMP add
another \u to cover the remaining _bytes_ ..." (no... this is not
going to work properly either).
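
To make the ambiguity from point 2 concrete, here is a minimal C
sketch. It is not ksh93 code: read_hex() is a hypothetical helper
invented for this illustration, and the digit limits passed to it are
just three plausible guesses a reader might make. It reads the hex
digits that follow "\u" and reports how many it consumed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of ksh93): read at most max_digits
 * hexadecimal digits from s. Returns the accumulated value and stores
 * the number of digits consumed in *used.
 */
static unsigned long read_hex(const char *s, size_t max_digits, size_t *used)
{
    unsigned long value = 0;
    size_t n = 0;

    assert(s != NULL && used != NULL);
    while (n < max_digits && isxdigit((unsigned char)s[n])) {
        int c = tolower((unsigned char)s[n]);
        /* '0'..'9' map to 0..9, 'a'..'f' map to 10..15 */
        value = value * 16 + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
        n++;
    }
    *used = n;
    return value;
}
```

For the text "abcdaffe" (what follows "\u" in "\uabcdaffe"), a limit
of 4 yields 0xabcd and leaves "affe" as literal text, a limit of 6
yields 0xabcdaf and leaves "fe", and a limit of 8 swallows everything
as 0xabcdaffe. Three readers, three different results for the same
input.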

After digging out this issue I went to the Unicode people and asked
for advice. After a longer discussion this was the result:
1. Output: printf '%q' should _ALWAYS_ use the form "\u[<hex unicode
code point>]", even for four-digit values. This eliminates all the
interoperability issues (see [3] above) and kills the parsing issues
for Unicode code points outside the BMP, e.g. code points like U+1D538
(see
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols#Latin_letters).
Basically the code in src/cmd/ksh93/sh/string.c should look like this
(i.e. same as the old one... only with '[' and ']' added around the
hex digits after "\u"):
-- snip --
#if SHOPT_MULTIBYTE
        if(!iswprint(c) || (c>=256 && iswspace(c)))
        {
                sfprintf(staksp,"\\u[%.4x]",c);
                continue;
        }
#else
-- snip --
2. Input: Reading such values (or using $'...'-style literals)
should allow both "\u[<unicode hex codepoint>]" and "\u[U+<unicode hex
codepoint>]". The old form \uXXXX should only be accepted for legacy
purposes as _input_ (NOT output) and should expect exactly four
hexadecimal characters.
Please never ever output \uXXXX ever again. It only generates trouble.
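
The output and input rules above can be sketched in plain C roughly
like this. This is only an illustration under stated assumptions, not
the ksh93 implementation: format_u_escape() and parse_u_escape() are
hypothetical names, standard snprintf() stands in for sfprintf(), and
the limit of eight hex digits inside the brackets plus the check
against 0x10FFFF are my own choices.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Output rule: always emit "\u[<hex>]", at least four hex digits. */
static int format_u_escape(char *buf, size_t size, unsigned long cp)
{
    assert(buf != NULL);
    return snprintf(buf, size, "\\u[%.4lx]", cp);
}

/*
 * Input rule: accept "\u[<hex>]", "\u[U+<hex>]" and, for legacy input
 * only, "\uXXXX" with exactly four hex digits. Returns the number of
 * characters consumed, or 0 if s does not start with a valid escape.
 */
static size_t parse_u_escape(const char *s, unsigned long *cp)
{
    size_t i = 2, digits = 0, max_digits;
    unsigned long value = 0;
    int bracketed;

    assert(s != NULL && cp != NULL);
    if (strncmp(s, "\\u", 2) != 0)
        return 0;
    bracketed = (s[i] == '[');
    if (bracketed) {
        i++;
        if (s[i] == 'U' && s[i + 1] == '+')   /* optional "U+" prefix */
            i += 2;
    }
    /* Assumption: up to 8 digits in brackets; legacy form is 4. */
    max_digits = bracketed ? 8 : 4;
    while (digits < max_digits && isxdigit((unsigned char)s[i])) {
        int c = tolower((unsigned char)s[i]);
        value = value * 16 + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
        i++;
        digits++;
    }
    if (bracketed) {
        /* Need at least one digit, a closing ']' and a valid code point. */
        if (digits == 0 || s[i] != ']' || value > 0x10FFFFUL)
            return 0;
        i++;
    } else if (digits != 4) {
        return 0;
    }
    *cp = value;
    return i;
}
```

With this reader, "\u[abcd]affe" parses as code point 0xabcd followed
by the literal text "affe", and "\u[U+1D538]" yields 0x1d538. The
legacy "\uabcdaffe" stops after exactly four digits, so it also gives
0xabcd, but only because the reader refuses to look further.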

----

Bye,
Roland

P.S.: David... where in src/cmd/ksh93/ is the code which _reads_
$'...'-style string literals ?

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)

_______________________________________________
ast-developers mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-developers
