Re: [ast-developers] ksh93 support for Unicode outside BMP (=basic multilingual plane) ...

Glenn Fowler Mon, 25 Jun 2012 10:01:57 -0700

On Mon, 25 Jun 2012 18:15:22 +0200 Roland Mainz wrote:
> On Mon, Jun 25, 2012 at 5:46 PM, Dan Shelton
> <[email protected]> wrote:
> > On 25 June 2012 17:23, Roland Mainz <[email protected]> wrote:
> >> Hi!
> >>
> >> ----
> >>
> >> I've been testing ksh93's GB18030 (GB18030 is related since this
> >> standard requires support for characters _outside_ the BMP, e.g. all
> >> GB18030-conforming applications must support unicode code points >
> >> 0xFFFF without problems) and Unicode support two weeks ago and hit a
> >> very bad issue with printf '%q\n'.
> >>
> >> In theory (the remainder of the text assumes a *.UTF-8 locale) if a
> >> character is not printable (e.g. |iswprint()| returns |0|) then
> >> ksh93's printf '%q' quoting support should use "\u<hex unicode code
> >> point>".
> >> The problem with that is...
> >> 1. ... characters beyond unicode code point 0xffff do not work because
> >> the implementation somehow doesn't get it right on SuSE 12.1
> >> Linux/AMD64. For example:
> >> $ LC_ALL=en_US.UTF-8 ksh -c 'printf "\u[1F640]\n"'
> >> # prints garbage instead of a valid unicode character encoded in UTF-8
> >> 2. ... characters beyond unicode code point 0xffff would
> >> (theoretically) generate ambiguous \u sequences because there is no way
> >> to distinguish whether the code point has four, five or six hexadecimal
> >> digits, e.g. this code in src/cmd/ksh93/sh/string.c ...
> >> -- snip --
> >> #if SHOPT_MULTIBYTE
> >>                 if(!iswprint(c) || (c>=256 && iswspace(c)))
> >>                 {
> >>                         sfprintf(staksp,"\\u%.4x",c);
> >>                         continue;
> >>                 }
> >> #else
> >> -- snip --
> >> How should a reader parse unicode code point 0xabcd followed by the
> >> German word "affe" (=engl. "ape")? The current implementation would
> >> print this as "\uabcdaffe", and it is unclear where the code point
> >> ends and the following text begins.
> >>
> >> 3. How \u<hex unicode code point> sequences should work is pretty
> >> much disputed and implementation-specific (except "\u[<hex unicode code
> >> point>]", see below) ... which completely ruins interoperability. For
> >> example many old applications assume that all unicode values fit into
> >> four hexadecimal digits (which is wrong... the so-called "Basic
> >> Multilingual Plane" (="BMP") covers the unicode code points from
> >> 0x0000-0xFFFF but there are unicode code points outside the BMP, e.g.
> >> see 
> >> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Supplementary_Multilingual_Plane)
> >> and support for values outside BMP is somehow "tacked on" by other
> >> escape sequences. Other implementations just say: "... \u is followed
> >> by four hex digits. Values outside BMP add another \u to cover the
> >> remaining _bytes_ ..." (no... this is not going to work properly
> >> either)
> >>
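To make item 1 concrete: a code point above 0xFFFF needs a four-byte UTF-8 sequence. The following is only a minimal sketch of the standard UTF-8 encoding rules, not the ksh93 or libast code; the function name utf8_encode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from ksh93/libast): encode one Unicode
 * code point as UTF-8. Returns the number of bytes written to out (1..4),
 * or 0 if cp is a surrogate or lies above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                 /* 3 bytes: rest of the BMP */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 4 bytes: outside the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under these rules, the code point 0x1F640 from the example above should come out as the four bytes F0 9F 99 80.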
> >> After digging out this issue I went to the Unicode people and asked
> >> for advice. After a longer discussion this was the result:
> >> 1. Output: printf '%q' should _ALWAYS_ use the form "\u[<hex unicode
> >> code point>]", even for four-digit values. This eliminates all the
> >> interoperability issues (see [3] above) and kills the parsing issues for
> >> unicode code points outside the BMP (e.g. code points like U+1D538
> >> (see 
> >> http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols#Latin_letters)).
> >> Basically the code in src/cmd/ksh93/sh/string.c should look like this
> >> (i.e. same as the old one... only with '[' and ']' added after "\u"):
> >> -- snip --
> >> #if SHOPT_MULTIBYTE
> >>                 if(!iswprint(c) || (c>=256 && iswspace(c)))
> >>                 {
> >>                         sfprintf(staksp,"\\u[%.4x]",c);
> >>                         continue;
> >>                 }
> >> #else
> >> -- snip --
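The difference between the two output forms can be tried outside ksh93 with plain snprintf(). This is only an illustrative sketch: quote_codepoint and its parameters are hypothetical names, and it uses stdio instead of the sfio call quoted above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the ksh93 code): write the quoted form of code
 * point cp followed by the literal text rest into buf.
 * bracketed == 0 gives the old "\uXXXX" form, otherwise "\u[XXXX]".
 * Returns what snprintf() returns (length of the full output).
 */
int quote_codepoint(char *buf, size_t size, unsigned long cp,
                    const char *rest, int bracketed)
{
    assert(buf != NULL && rest != NULL);
    if (bracketed)
        return snprintf(buf, size, "\\u[%.4lx]%s", cp, rest);
    return snprintf(buf, size, "\\u%.4lx%s", cp, rest);
}
```

With cp 0xabcd and rest "affe", the old form yields \uabcdaffe, where the end of the code point cannot be recovered, while the bracketed form yields \u[abcd]affe. A five-digit code point such as 0x1d538 becomes \u[1d538] because %.4lx is a minimum, not a maximum, number of digits.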
> >
> > Did I get it right that David only has to change this single line and
> > add the [] to make the people in the Unicode consortium happy?


> Erm... to cover the "output" part... yes. The other request was that
> "\u[U+XXXXX]" should work (together with "\u[XXXXXX]" and "\uXXXX")
> because almost all Unicode documentation uses "U+<hex value>" to refer
> to unicode code points.

I just changed src/lib/libast/string/chresc.c for the read-side fix
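
For readers who want to experiment with the three input forms mentioned above ("\uXXXX", "\u[XXXXXX]" and "\u[U+XXXXX]"), here is a rough standalone sketch. It is not the chresc.c change; parse_u_escape and hexval are hypothetical names, and the limits (exactly four digits without brackets, one to six digits inside brackets) are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if ch is not a hex digit (assumes ASCII). */
static int hexval(int ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/*
 * Hypothetical parser (not the libast code): s points at a backslash.
 * Accepts \uXXXX, \u[X...] and \u[U+X...]. Returns the code point and,
 * if end is non-NULL, stores the position after the escape in *end.
 * Returns -1 on malformed input.
 */
long parse_u_escape(const char *s, const char **end)
{
    long cp = 0;
    int n = 0, d;

    assert(s != NULL);
    if (s[0] != '\\' || s[1] != 'u')
        return -1;
    s += 2;
    if (*s == '[') {
        s++;
        if (s[0] == 'U' && s[1] == '+')     /* optional "U+" prefix */
            s += 2;
        while ((d = hexval((unsigned char)*s)) >= 0) {
            if (++n > 6)                    /* assumed digit limit */
                return -1;
            cp = cp * 16 + d;
            s++;
        }
        if (n == 0 || *s != ']' || cp > 0x10FFFF)
            return -1;
        s++;                                /* skip the closing ']' */
    } else {
        for (n = 0; n < 4; n++) {           /* exactly four digits */
            if ((d = hexval((unsigned char)s[n])) < 0)
                return -1;
            cp = cp * 16 + d;
        }
        s += 4;
    }
    if (end != NULL)
        *end = s;
    return cp;
}
```

In this sketch "\uabcdaffe" parses as 0xabcd with "affe" left over only because the unbracketed form is fixed at four digits; the bracketed forms need no such convention.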

> ----

> Bye,
> Roland

> P.S.: The 3rd issue was that an application (in this case ksh93)
> should consider unicode code points not assigned (e.g. see
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Unassigned_planes)
> by the current Unicode (current version is 6.2beta) standard as
> not-printable... but that is IMO the job of the wide-char support in the
> locale which implements |iswprint()| ... IMO we should not litter the
> libast/ksh93 code with lots of |if()|-codes just to do that (well...
> the only disputed idea may be the "private use planes" (see
> http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Private_Use_Area_planes)
> because some operating systems stuff printable characters there which
> are not printable on other platforms... but I leave that horror
> happily out until either a) someone complains or b) we have a BugZilla
> to discuss whether this makes sense or not...).

_______________________________________________
ast-developers mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-developers
