Re: string vs. char [was Re: Java and Unicode]

Antoine Leca Mon, 20 Nov 2000 03:40:28 -0800
Marco Cimarosti wrote:
> 
> Actually, C does have different types for characters within strings and for
> characters in isolation.

I see it differently.
A character constant such as 'H' is a special case: it has type int rather
than char, for backward compatibility (for instance, the first versions of C
could not deal correctly with arguments that were subject to promotion).
Similarly, a number of (old) functions take int for their character arguments.
Then there is the case where int represents _either_ a valid character _or_
an error indication (EOF). This is why fgetc returns int.

Apart from these cases, a string is clearly an array of characters, and
characters are stored using the type char (or its signed or unsigned variant).
As a result, you can write 'H' either as such, or as "Hello, world!\n"[0].
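
A minimal sketch of the two points above, assuming a C (not C++) compiler;
the function names are mine and purely illustrative. It shows that a character
constant is int-sized while a string element is char-sized, and why the
variable receiving fgetc's result has to be int.

```c
#include <stdio.h>

/* In C, a character constant has type int, so this is sizeof(int).
   (In C++ it would be sizeof(char); this sketch is about C only.) */
size_t char_constant_size(void)
{
    return sizeof('H');
}

/* An element of a string literal has type char, so this is always 1. */
size_t string_element_size(void)
{
    return sizeof("Hello, world!\n"[0]);
}

/* Counts the characters left in a stream. The variable is int, not char,
   so that EOF can be told apart from every valid character value. */
long count_chars(FILE *stream)
{
    long n = 0;
    int c;

    while ((c = fgetc(stream)) != EOF)
        n++;
    return n;
}
```

The values compare equal even though the types differ:
"Hello, world!\n"[0] == 'H' holds after the usual promotion of char to int.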


> The type of a string literal (e.g. "Hello world!\n") is "array of char",
> while the type of a character literal (e.g. 'H') is "int".
>
> This distinction is generally reflected also in the C library, so that you
> don't get compiler warnings when passing character constants to functions.

You would not get them anyway, since C considers characters to be (small)
integers, which eases the passing of arguments. This is unrelated to the issue.

 
> This distinction has been retained also in the newer "wide character
> library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
> wide equivalent of "int".

Not exactly. The wide versions have the same distinction as the narrow ones
for the second case above (signalling errors), but not for the first one
(promotion).

 
> The wide version of the examples above is:
> 
> int fputws(const wchar_t * c, FILE * stream);
> wint_t fputwc(wint_t c, FILE * stream);
                ^^^^^^
Instead, we have
  int fputws(const wchar_t * s, FILE * stream);
  wint_t fputwc(wchar_t c, FILE *stream);

It shows clearly that c cannot hold the WEOF value. OTOH, the returned value
_can_ be the error indication WEOF, so the type is wint_t.

Similarly, the type of L'H' is wchar_t. You gave other examples in your "But".
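
To illustrate, here is a small sketch built on the standard prototypes quoted
above. The helper names put_wide and wide_constant_is_wchar_sized are
hypothetical, chosen only for this example. The argument is a plain wchar_t,
while the result is kept in a wint_t so that it can be compared with WEOF.

```c
#include <stdio.h>
#include <wchar.h>

/* Writes one wide character to the stream.
   Returns 1 on success, 0 if fputwc reported an error through WEOF. */
int put_wide(wchar_t c, FILE *stream)
{
    wint_t r = fputwc(c, stream);  /* wint_t: may be a character or WEOF */

    return r != WEOF;
}

/* In C, a wide character constant such as L'H' has type wchar_t,
   so its size matches sizeof(wchar_t). */
int wide_constant_is_wchar_sized(void)
{
    return sizeof(L'H') == sizeof(wchar_t);
}
```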

 
> int iswalpha(wint_t c);

Here, iswalpha is intended to be able to test valid characters as well as
the error indication, so the parameter type is wint_t; WEOF is specifically
allowed as an argument.
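
As a sketch of what this permits (the wrapper name is_wide_alpha is made up
for the example): the result of a wide read can be handed to iswalpha directly,
without first checking for WEOF. I assume here that the implementation reports
WEOF as non-alphabetic, which is what I would expect since WEOF is not a
character.

```c
#include <wchar.h>
#include <wctype.h>

/* Normalizes iswalpha's result to 0 or 1. The parameter is wint_t, so the
   caller may pass any valid wide character or WEOF itself. */
int is_wide_alpha(wint_t c)
{
    return iswalpha(c) != 0;
}
```

With this, something like is_wide_alpha(fgetwc(stream)) is a legal call even
at end of file.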


> In an Unicode implementation of the "wide character library" (wchar.h and
> wctype.h), this difference may be exploited to use different UTF's for
> strings and characters:

Ah, now we get to the interesting part.
Please note that I left UTF-16 aside, because it is not clear to me whether
a 16-bit wchar_t is adequate to hold UTF-16 (in other words, whether wchar_t
may use a multi-unit encoding).

 
> typedef unsigned short wchar_t;
> /* UTF-16 character, used within string. */
> 
> typedef unsigned long  wint_t;
> /* UTF-32 character, used for handling isolated characters. */

So far, no problem.

 
> But, unluckily, there is a "but". Type "wchar_t" is *also* used for isolated
> character in a couple of stupid APIs:

See above for another example: fputwc...

 
> But I think that changing those "wchar_t c" to "wint_t c" is a smaller
> "violence" to the standards than changing them to "const wchar_t * c".

;-)

> And you can also implement it in an elegant, quasi-standard way:
<corrected> 
> wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
> wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
> size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);

What is the point? You cannot pass these functions anything other than values
between WCHAR_MIN (0 here) and WCHAR_MAX anyway. And there is no really
"interesting" way to extend the meaning of these functions outside
this range.
Or am I missing something?

 
> #ifdef PEDANTIC_STANDARD
> wchar_t * wcschr(const wchar_t * s, wchar_t c);
> wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
> size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
> #else
> #define wcschr  _wcschr_32
> #define wcsrchr _wcsrchr_32
> #define wcrtomb _wcrtomb_32
> #endif