Tim Greenwood wrote:

> In my interpretation of the C standard (which I am reading from
> http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n843.pdf) UTF-8 is not a
> valid wchar_t encoding if your execution character set contains
> characters outside the C0 controls and Basic Latin range, and
> UTF-16 is not a valid wchar_t encoding if your execution character
> set has characters outside the BMP. In other words whatever you
> consider to be a character (which may be a combining character)
> must be encoded in one wchar_t code unit.
>
> The relevant passage is
>
> 11 A wide character constant has type wchar_t, an integer
> type defined in the <stddef.h> header. The value of a wide character
> constant containing a single multibyte character that maps to a
> member of the extended execution character set is the wide
> character (code) corresponding to that multibyte character, as
> defined by the mbtowc function, with an implementation-defined
> current locale. The value of a wide character constant containing
> more than one multibyte character, or containing a multibyte
> character or escape sequence not represented in the extended
> execution character set, is implementation-defined.

I don't know. I thought a bit about this, and I think that your restrictive interpretation is not necessarily correct.

After all, the C Standard just says that a "wide character" and a "multibyte character" are whatever the <mbtowc> function defines them to be. And it is quite easy to show that the <mbtowc> function could, in turn, define them to be whatever the <mbrtowc> function defines them to be:

    // My hypothetical "mbtowc.c"
    #include <wchar.h>

    // (See ISO/IEC 9899:1999 - 7.20.7.2 "The mbtowc function")
    int mbtowc (wchar_t * pwc, const char * s, size_t n)
    {
        int retval;
        static mbstate_t internal;

        if (s == NULL) {
            // yes: we are stateful (or pretend we are)
            return 1;
        }
        // delegate to mbrtowc, keeping the conversion state in "internal"
        retval = (int)mbrtowc(pwc, s, n, &internal);
        if (retval < 0) {
            // both (size_t)-1 and (size_t)-2 are reported as -1
            retval = -1;
        }
        return retval;
    }

As the definition of multibyte characters and wide characters is now completely up to <mbrtowc>, we could well adopt the convention (or call it a "trick", if you prefer) of pretending that a 4-byte UTF-8 multibyte sequence is actually a sequence of *two* 2-byte multibyte sequences.

Technically, the trick is possible because:

a) returning 2 twice rather than 4 once guarantees the correct advance while scanning a string;

b) we can actually map both our fake 2-byte multibyte sequences to an actual "wide character": the high and low surrogates;

c) the <mbstate_t> object can be used to store the relevant data across the two calls.

Legally, the trick is possible because of the purposely vague wording of the C Standard, which leaves the definition of wide and multibyte characters completely up to the implementation.

Here is what I mean:

    // Excerpt from my hypothetical <wchar.h> for UTF-16 wide characters
    // ...
    // (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
    // (unsigned, so that the surrogate values 0xD800..0xDFFF fit)
    typedef unsigned short wchar_t;
    // ...
    // (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character utilities <wchar.h>")
    typedef wchar_t mbstate_t;
    // ...

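To make point (b) concrete, here is a minimal sketch of the surrogate arithmetic such an implementation would need. The names utf16_unit, encode_surrogates and decode_surrogates are made up for this illustration and belong to no real library; the 16-bit unit type merely stands in for a UTF-16 wchar_t.

```c
#include <assert.h>

/* Hypothetical 16-bit code unit, standing in for a UTF-16 wchar_t. */
typedef unsigned short utf16_unit;

/* Split a supplementary code point (U+10000..U+10FFFF) into a
   high surrogate (0xD800..0xDBFF) and a low surrogate (0xDC00..0xDFFF). */
void encode_surrogates(unsigned long c32, utf16_unit *hi16, utf16_unit *lo16)
{
    assert(c32 >= 0x10000UL && c32 <= 0x10FFFFUL);
    c32 -= 0x10000UL;                                  /* 20 significant bits remain */
    *hi16 = (utf16_unit)(0xD800UL + (c32 >> 10));      /* upper 10 bits */
    *lo16 = (utf16_unit)(0xDC00UL + (c32 & 0x3FFUL));  /* lower 10 bits */
}

/* Inverse mapping: rebuild the code point from a surrogate pair. */
unsigned long decode_surrogates(utf16_unit hi16, utf16_unit lo16)
{
    return 0x10000UL
         + (((unsigned long)hi16 - 0xD800UL) << 10)
         + ((unsigned long)lo16 - 0xDC00UL);
}
```

For example, U+1D11E splits into 0xD834 and 0xDD1E. Since a low surrogate is never zero, it can be kept in an integer state object where zero means "nothing pending".
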
    // My hypothetical "mbrtowc.c" for UTF-16 wide characters
    #include <wchar.h>

    // (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
    size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
    {
        // Hypothetical helpers, defined elsewhere. _MyDecodeUtf8 is assumed
        // to return the number of bytes consumed, or a negative value on failure.
        extern int _MyDecodeUtf8 (const char * s, size_t n, long * c32);
        extern void _MyEncodeUtf16 (long c32, wchar_t * hi16, wchar_t * lo16);
        static mbstate_t internal = 0;
        long c32;
        wchar_t hi16;
        int retval;

        if (ps == NULL) {
            ps = &internal;
        }
        if (s == NULL) {
            pwc = NULL;
            s = "";
            n = 1;
        }
        if (*ps != 0) {
            if (pwc != NULL) {
                // output second surrogate saved in previous call
                *pwc = *ps;
            }
            // clear saved surrogate
            *ps = 0;
            // return fake multibyte length
            return 2;
        }
        retval = _MyDecodeUtf8(s, n, &c32);
        if (retval == 4) {
            // output first surrogate and save second surrogate for next call
            // (go through a local, because pwc may be NULL)
            _MyEncodeUtf16(c32, &hi16, ps);
            if (pwc != NULL) {
                *pwc = hi16;
            }
            // return fake multibyte length
            retval = 2;
        }
        else if (retval >= 0 && pwc != NULL) {
            *pwc = (wchar_t)c32;
        }
        // a negative retval becomes the corresponding (size_t) error code
        return (size_t)retval;
    }

While the above UTF-16 implementation could perhaps look relatively "smart", a UTF-8 implementation would definitely look very silly. However, if we agree that defining what a "multibyte character" and a "wide character" are is the exclusive task of <mbtowc> (and hence of <mbrtowc>), then the implementation below, silly as it is, could well be 100% compliant with C99:

    // Excerpt from my hypothetical <wchar.h> for UTF-8 (or DBCS, or SBCS,
    // or any byte-oriented encoding) wide characters
    // ...
    // (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
    typedef char wchar_t;
    // ...
    // (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character utilities <wchar.h>")
    typedef wchar_t mbstate_t;
    // ...

    // My hypothetical "mbrtowc.c" with UTF-8 (or DBCS, or ...) wide characters
    #include <wchar.h>

    // (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
    size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
    {
        (void)ps; // (ps is unused)

        if (s == NULL) {
            pwc = NULL;
            s = "";
            n = 1;
        }
        if (n < 1) {
            // no bytes available: report an incomplete character
            return (size_t)-2;
        }
        if (pwc != NULL) {
            // every byte is its own "wide character"
            *pwc = *s;
        }
        return (*s == 0) ? 0 : 1;
    }

_ Marco