[ksh93-integration-discuss] Re: [ast-users] Two problems with Unicode characters in ksh93

Glenn Fowler Mon, 28 Aug 2006 08:56:36 -0400 (EDT)

On Mon, 28 Aug 2006 11:27:27 +0200 Roland Mainz wrote:
> Glenn Fowler wrote:
> > On Fri, 25 Aug 2006 16:03:38 +0200 I. Szczesniak wrote:
> [snip]
> > note that \u takes up to 8 hex digits, so the 2nd "\u9836" using decimal
> > will treat 9836 as hex, and ...


> I just took a look at the code - it seems to support something like
> "\u[<value>]", too - wouldn't it be better to document this as the
> preferred way to specify unicode values ? IMO this may be less
> error-prone compared to something like $ (s="escape" ; printf "\u360$s")
> # (which would AFAIK be interpreted as "\u360e" instead of the
> (intended) "\u360") ...

ha -- I just looked at the code -- these support existing practice:
        \u  4 hex digits
        \U  8 hex digits
        \x  no limit hex digits

I added the \u{...} and \u[...] forms (for \U and \x too) as ast
extensions, esp. to handle the \x form, which otherwise may defy all
attempts to code some hex char combinations
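
To make the digit limits above concrete, here is a minimal sketch of a
parser for these escapes. It is not the ast code: hex_digit() and
parse_hex_escape() are hypothetical names, and the only rules assumed
are the ones listed above (4 digits for \u, 8 for \U, no limit for \x,
and a {...} or [...] pair that ends the digits explicitly).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from ast): value of one hex digit, or -1. */
static int hex_digit(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Parse the text after the backslash of a \u, \U or \x escape.
 * s points at the 'u', 'U' or 'x'. On success the value is stored in
 * *out and the number of characters consumed (letter included) is
 * returned; 0 is returned for malformed input and *out is left alone.
 * Digits beyond the width of unsigned long are silently shifted out.
 */
static size_t parse_hex_escape(const char *s, unsigned long *out)
{
    size_t i = 1;          /* index of the next character to read */
    size_t limit;          /* maximum digit count, 0 = no limit */
    size_t n = 0;          /* digits read so far */
    unsigned long v = 0;
    char close = 0;        /* expected closing delimiter, if any */

    assert(s != NULL && out != NULL);

    switch (s[0]) {
    case 'u': limit = 4; break;
    case 'U': limit = 8; break;
    case 'x': limit = 0; break;
    default:  return 0;
    }

    /* Delimited form: the closing delimiter ends the digits. */
    if (s[1] == '{') close = '}';
    else if (s[1] == '[') close = ']';
    if (close) {
        i = 2;
        limit = 0;
    }

    while ((limit == 0 || n < limit) && hex_digit(s[i]) >= 0) {
        v = (v << 4) | (unsigned long)hex_digit(s[i]);
        i++;
        n++;
    }
    if (n == 0)
        return 0;
    if (close) {
        if (s[i] != close)
            return 0;
        i++;
    }
    *out = v;
    return i;
}
```

With this sketch "u360e" parses as the single value 0x360e, while
"u[360]e" yields 0x360 and leaves the "e" alone, which is the
ambiguity the delimited forms are meant to remove.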

note that ansi C's solution is to use string literal catenation
to tame \x:

        "a\xff" "b"

which in ast can be specified

        "a\x{ff}b"
        "a\x[ff]b"
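
For the ansi C side, a small sketch of why the catenation trick works:
escapes are translated before adjacent literals are joined, so the \x
cannot run on into the "b". The names joined and check_joined() are
made up for this illustration.

```c
#include <assert.h>
#include <string.h>

/* The \x escape ends where the first literal ends. */
static const char joined[] = "a\xff" "b";

/* Check that the joined literal is exactly 'a', 0xff, 'b', NUL. */
static void check_joined(void)
{
    static const unsigned char expect[] = { 'a', 0xff, 'b', '\0' };

    assert(sizeof joined == sizeof expect);
    assert(memcmp(joined, expect, sizeof expect) == 0);
}
```

Written as one literal, the same characters would read as a single
escape with the hex digits ffb, which does not fit in a char.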

> > > How do I specify Unicode outside the Basic Multilingual Plane (BMP,
> > > which uses values larger than 2^16) in ksh93?
> > 
> > will handle values up to 2^32

so my comment about \u taking 8 hex digits was wrong:
\uXXXX without {}/[] covers values below 2^16,
\UXXXXXXXX without {}/[] covers values below 2^32

> What about the idea with "\w" to have a way to specify a widechar value
> (CC:'ing Ienup and i18n-discuss at opensolaris.org for feedback. The idea
> was to add "\wXXXX" to allow someone to specify a locale-specific
> widechar value in a similar way to how "\uXXXX" can be used to specify a
> unicode value in ksh93's "printf" command (see
> https://mailman.research.att.com/pipermail/ast-users/2006q3/001220.html)
> - which may be useful for locales like *.GB18030, ja_JP.PCK etc.) ?

I'm not sure about the distinction between a unicode wchar_t and a
locale-specific wchar_t

upon parsing a wchar_t from \u... or \U... one has a unicode value
and would call wctomb() to get a mb value that could be catenated
in a string
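
A minimal sketch of that wctomb() step, assuming the wchar_t is already
in hand. append_wchar() is a hypothetical helper, not an ast function.
Note that wctomb() interprets its argument in the current LC_CTYPE
locale; the value is only guaranteed to be a Unicode code point on
platforms that define __STDC_ISO_10646__.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Append the multibyte form of wc to the NUL-terminated string in buf,
 * which has room for cap bytes in total. Returns the number of bytes
 * appended, or -1 if wc has no representation in the current locale
 * or the buffer is too small (buf is left unchanged in both cases).
 */
static int append_wchar(char *buf, size_t cap, wchar_t wc)
{
    char mb[MB_LEN_MAX];   /* large enough for any single character */
    size_t len;
    int n;

    assert(buf != NULL);

    len = strlen(buf);
    n = wctomb(mb, wc);
    if (n < 0)
        return -1;
    if (len + (size_t)n + 1 > cap)
        return -1;
    memcpy(buf + len, mb, (size_t)n);
    buf[len + (size_t)n] = '\0';
    return n;
}
```

A caller would run setlocale(LC_CTYPE, "") once at startup so that the
conversion follows the user's locale rather than the default "C" one.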

would you do the same parse as \u... or \U... for \w... to get a wchar_t?

once that \w... wchar_t was in hand what function would you call
to get an mb value that could be catenated in a string?

-- Glenn Fowler -- AT&T Research, Florham Park NJ --
