Re: What's in a wchar_t string on unix?

Frank Yung-Fong Tang Tue, 02 Mar 2004 09:30:10 -0800


Antoine Leca wrote on 3/2/2004, 5:50 AM:


 > Rick Cameron asked:
 >
 > > If the locale is set to be Unicode,
 >
 > That part is highly suspect.
 > Since you write that, you already know the wchar_t encoding (as well as
 > the char one) depends on the locale setting.

No, not true. The wchar_t encoding depends on the COMPILER and C LIB
implementation, not on the locale setting.

For example, wchar_t in MS Windows is defined by Microsoft (again, MS is
the one who defines the compiler and C lib on that platform) as UCS-2. On
Windows, wchar_t always holds UCS-2 regardless of what locale you set. But
that is because whoever designed the compiler and C lib defined it so.

Also, in GNU's gcc and libc implementation, wchar_t is defined to be 4
bytes and always holds UTF-32 regardless of which locale you are set to.
But again, that is defined by whoever wrote gcc and the GNU version of libc.

It is compiler and C lib implementation dependent.
It is NOT locale dependent (unless a particular C lib implementation
defines it so).
It is NOT application implementation dependent.
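
A minimal sketch of the compile-time part of that claim, in C. It only
demonstrates that the width of wchar_t is fixed by the compiler and C lib
headers and does not move when the locale changes; it says nothing about
what the stored values mean, which is still a matter for the
implementation's documentation. The helper names wchar_bits and
width_survives_setlocale are made up for this illustration.

```c
#include <limits.h>
#include <locale.h>
#include <stddef.h>

/* Width of wchar_t in bits. sizeof is evaluated at compile time, so this
   value is baked into the binary by the compiler and C library headers. */
static size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: returns 1 if the width is the same before and
   after a setlocale() call. The call may fail (return NULL) when the
   named locale is not installed; the width is unaffected either way. */
static int width_survives_setlocale(const char *locale_name)
{
    size_t before = wchar_bits();
    (void)setlocale(LC_ALL, locale_name);
    return before == wchar_bits();
}
```

Calling width_survives_setlocale("") or passing any other locale name
should return 1 on every implementation, because nothing about the type
is decided at run time.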


 > Few people have this right. So you then also know that "wchar_t is
 > implementation defined" in all the relevant standards (ANSI, C99, POSIX,
 > SUS). In other words, this says the answer is in the documentation for
 > YOUR implementation.

Be careful here. The so-called implementation in those standards refers to
the implementation of the C compiler and C library code. It does not refer
to the application implementation.

 >
 > Now, we can try to guess. But there are only guesses.
 >
 > > what's in a wchar_t string? Is it UTF-32, or UTF-16 with the code units
 > > zero-extended to 4 bytes?
 >
 > The latter is a heresy. Nobody should be fool enough to have this. UCS-2
 > with the code units zero-extended to 4 bytes might be an option, but if an
 > implementor has support for UTF-16, why would she store extended UTF-16
 > (in whatever form, i.e. split or joined, 4 or 8 bytes) in wchar_t? Any
 > evidence of this would be a severe bug, IMHO.

Again, the C lib and C compiler implementation (again, not the
application implementation) is free to choose what it does. So the
implementors may choose whatever design is least likely, and you won't be
able to guess it right.

 >
 > Back to your original question, and assuming "the locale is set to be
 > Unicode", there is as much possibility to encounter UTF-32 values (which
 > would mean the implementation does have Unicode 3.1 support) as
 > zero-extended UCS-2 (case of a pre-3.1 Unicode implementation). Other
 > values would be very strange, IMHO.

Not strange at all if the developer of the C lib and C compiler
implementation intentionally wants to make it opaque so no one can easily
find out the answer and do the wrong thing. Of course, eventually people
can still find it out. If I implemented one today, I would probably do a
UTF-32 XOR with 0x1BADBEEF (hmm... that may not work, since I may need to
make sure ASCII 0x00 - 0x7f maps to 0x00000000 - 0x0000007f. I think
[not 100%] that is mandated by ANSI C for wchar_t).
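
To make the XOR objection concrete, here is a rough sketch in C. It
assumes, as suspected above, that characters of the basic character set
must keep the same value in wchar_t as in char. (As far as I know C11 also
has a macro, __STDC_MB_MIGHT_NEQ_WC__, for implementations where that does
not hold.) The functions scramble and basic_chars_match are hypothetical
names used only for this illustration.

```c
#include <stddef.h>

/* Hypothetical scrambling scheme from the text: XOR a code point with a
   fixed mask. unsigned long keeps the arithmetic well defined. */
static unsigned long scramble(unsigned long code_point)
{
    return code_point ^ 0x1BADBEEFUL;
}

/* Returns 1 if every character in a small sample of the basic character
   set has the same value as a wide character and as a plain char. */
static int basic_chars_match(void)
{
    static const char    narrow[] = "AZaz09 +-";
    static const wchar_t wide[]   = L"AZaz09 +-";
    size_t i;

    for (i = 0; narrow[i] != '\0'; i++) {
        if ((long)wide[i] != (long)narrow[i])
            return 0;
    }
    return 1;
}
```

With this mask, scramble(0x41) gives 0x1BADBEAE rather than 0x41, so a
wchar_t built that way could not satisfy the ASCII requirement, while
basic_chars_match() is expected to return 1 on the usual compilers.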


 >
 > Recent standards have a feature test macro, __STDC_ISO_10646__, that if
 > defined will tell you the answer: defined to be greater than 1999xxL will
 > mean UTF-32 values. Defined but less than 1999xxL will probably mean no
 > surrogate support, hence zero-extended UCS-2. Undefined does not tell you
 > anything.
 > Unfortunately, this is also the most current setup.

As soon as you start to guess the value of wchar_t, you are on the wrong
path for the ANSI C wchar_t.
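
The alternative to guessing is to ask the implementation what it promises.
A minimal sketch in C, relying only on the __STDC_ISO_10646__ macro from
C99; the function names iso10646_version and wchar_is_documented_iso10646
are made up here, and a result of 0 means only that the implementation
makes no promise, not that wchar_t holds something else.

```c
/* Returns the yyyymm value of __STDC_ISO_10646__ when the implementation
   defines it, or 0 when it makes no such promise. */
static long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Returns 1 only when the implementation documents that wchar_t values
   are ISO 10646 code points; 0 means "consult the documentation". */
static int wchar_is_documented_iso10646(void)
{
    return iso10646_version() > 0L;
}
```

Portable code that needs known values would branch on
wchar_is_documented_iso10646() and otherwise treat wchar_t as opaque,
converting through the C lib functions instead of inspecting the numbers.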

I don't like how ANSI C defines wchar_t, and there is definitely a need
for a data type which holds the wide char and also lets us know what the
value means, but for sure that data type is not wchar_t. It is wchar_t on
Win32 only because MS adds an additional definition to it.
