I tried the tr example with en_GB.UTF-8 on S10_u6. The source for
/usr/bin/tr is the same between S10_u5 and S10_u6, so the tr behavior
should be the same.
I get:
SunOS XXX 5.10 Generic_137138-09 i86pc i386 i86pc
(a.)
% echo A | env LANG=en_GB.UTF-8 tr A '\301' | od -xc
0000000 000a
\n
0000001
If LC_CTYPE is set to a locale other than en_GB.UTF-8 when the above
command runs, tr's call to setlocale() could fail, leaving tr operating
in the C locale. If that happened, the result would look like:
% echo A | env LANG=C tr A '\301' | od -xc
0000000 0ac1
301 \n
0000002
Perhaps this explains the claim from the original report.
To double-check that a call to setlocale() would behave as expected,
you could run:
% env LANG=en_GB.UTF-8 locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=
This shows what locale could report if LC_CTYPE and LC_ALL were set
to C (not what the user might expect):
% env LANG=en_GB.UTF-8 locale
LANG=en_GB.UTF-8
LC_CTYPE=C
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C
With the above settings, tr would operate in the C locale, since LANG
has lower precedence than LC_ALL. The user should unset LC_ALL and
LC_CTYPE before executing the command (a.).
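
The precedence rule above can be sketched as a small shell function.
This is only an illustration: effective_ctype is a hypothetical helper,
not part of tr or the system, and it only inspects the environment. It
does not check whether setlocale() would actually succeed for the named
locale, and the final fallback to "C" is an assumption (the default is
implementation-defined).

```shell
# effective_ctype (hypothetical helper): print the locale name that the
# LC_CTYPE category would resolve to, using the usual precedence
# LC_ALL > LC_CTYPE > LANG. Empty values are treated as unset.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        echo "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo "C"    # assumed default when nothing is set
    fi
}

# LC_ALL wins over LANG, so this reports C:
( LC_ALL=C; LANG=en_GB.UTF-8; effective_ctype )

# With LC_ALL and LC_CTYPE unset, LANG takes effect:
( unset LC_ALL LC_CTYPE; LANG=en_GB.UTF-8; effective_ctype )
```

Running the checks in subshells keeps the caller's own locale settings
untouched.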
Regardless of how the old tr behaves, the new tr ignores invalid
characters, which accounts for the differing behavior between the new
and old tr. We may want to revisit how tr should handle invalid
characters; CR 6778537 has been filed as a placeholder.
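
For reference, here is a rough sketch of why '\301' counts as an invalid
character in a UTF-8 locale. Octal 301 is byte 0xC1, which can never
appear in well-formed UTF-8 (as a lead byte it could only start an
overlong encoding). classify_utf8_byte is a hypothetical helper written
for this message; it is not how tr itself decides.

```shell
# classify_utf8_byte N (hypothetical helper): classify a single byte,
# given as a decimal value 0-255, by the role it can play in UTF-8.
classify_utf8_byte() {
    b=$1
    if [ "$b" -le 127 ]; then
        echo "ascii"            # 0x00-0x7F
    elif [ "$b" -le 191 ]; then
        echo "continuation"     # 0x80-0xBF
    elif [ "$b" -le 193 ]; then
        echo "invalid"          # 0xC0, 0xC1: overlong lead bytes
    elif [ "$b" -le 223 ]; then
        echo "lead-2"           # 0xC2-0xDF
    elif [ "$b" -le 239 ]; then
        echo "lead-3"           # 0xE0-0xEF
    elif [ "$b" -le 244 ]; then
        echo "lead-4"           # 0xF0-0xF4
    else
        echo "invalid"          # 0xF5-0xFF
    fi
}

# Octal 301 is decimal 193 (0xC1); a leading 0 means octal in $(( )).
classify_utf8_byte $((0301))
```

So in the C locale '\301' is just a byte, while in en_GB.UTF-8 it cannot
be decoded as a character at all.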
> Date: Sun, 30 Nov 2008 21:38:56 +0100
> From: Roland Mainz <roland.mainz at nrubsig.org>
>
> April Chin wrote:
> > > On Mon, 20 Oct 2008 19:16:20 +0100 Chris Ridd wrote:
> > > > On 20 Oct 2008, at 18:58, Glenn Fowler wrote:
> > >
> > > > > On Mon, 20 Oct 2008 18:39:48 +0100 Chris Ridd wrote:
> > > > >> On 20 Oct 2008, at 14:08, Glenn Fowler wrote:
> > > > >>> can you truss the bad machine to see the tr read and write calls
> > > > >> It isn't very illuminating I'm afraid:
> > > > >
> > > > > it does implicate tr
> > > > > it looks like it gets the literal args 'A' '\301'
> > > > > it reads " A\n" and writes " A\n"
> > >
> > > > I called a little C program (instead of tr) to print out argv[][]
> > > > carefully, and argv[1] was the characters "A" and NUL, and argv[2] was
> > > > the characters "\", "3", "0", "1" and NUL.
> > >
> > > > > what are your locale env var settings { LANG LC_* } ?
> > >
> > > > No LC_* variables are set, but LANG is "en_GB.UTF-8"
> > >
> > > > So if I unset LANG, tr writes "\301\n".
> >
> > > my guess is there is an interaction in /usr/bin/tr between some UTF-8
> > > locale
> > > and the invalid UTF-8 character '\301' which voids the A => '\301' map
> > > and simply maps A => A
> > >
> > > it would be nice to verify this
> > >
> > > in any case, the original code snippet should be run with LC_ALL=C and/or
> > > LANG=C
> >
> > Robbin Kawabata, the Sun engineer who has worked with tr and
> > UTF-8 locales, is looking into this and will be replying to this thread
> > on what she finds.
>
> Robbin/April:
> Any news on the issue yet ?
>
> ----
>
> Bye,
> Roland
>
> --
> __ . . __
> (o.\ \/ /.o) roland.mainz at nrubsig.org
> \__\/\/__/ MPEG specialist, C&&JAVA&&Sun&&Unix programmer
> /O /==\ O\ TEL +49 641 3992797
> (;O/ \/ \O;)