[OT?] The C standard library and UTF's (was RE: Text Editors and Canonical Equivalence (was Coloured diacritics))

Marco Cimarosti Fri, 12 Dec 2003 06:27:02 -0800

Tim Greenwood wrote:
> In my interpretation of the C standard (which I am reading from 
> http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n843.pdf) UTF-8 is not a 
> valid wchar_t encoding if your execution character set contains 
> characters outside the C0 controls and Basic Latin range, and 
> UTF-16 is not a valid wchar_t encoding if your execution character
> set has characters outside the BMP. In other words whatever you 
> consider to be a character (which may be a combining character)
> must be encoded in one wchar_t code unit.
> 
> The relevant passage is
> 
> 11 A wide character constant has type wchar_t, an integer 
> type defined in the <stddef.h> header. The value of a wide character
> constant containing a single multibyte character that maps to a
> member of the extended execution character set is the wide
> character (code) corresponding to that multibyte character, as
> defined by the mbtowc function, with an implementation-defined
> current locale. The value of a wide character constant containing
> more than one multibyte character, or containing a multibyte
> character or escape sequence not represented in the extended
> execution character set, is implementation-defined.


I don't know. I thought a bit about this, and I think that your restrictive
interpretation is not necessarily correct.

After all, all the C Standard says is that a "wide character" and a
"multibyte character" are whatever the <mbtowc> function defines them to be.

And it is quite easy to show that the <mbtowc> function could, in turn,
define them to be whatever the <mbrtowc> function defines them to be:

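   // My hypothetical "mbtowc.c"
   #include <string.h>
   #include <wchar.h>
   // (See ISO/IEC 9899:1999 - 7.20.7.2 "The mbtowc function")
   int mbtowc (wchar_t * pwc, const char * s, size_t n)
   {
      size_t retval;
      static mbstate_t internal;
      if (s == NULL)
      {
         // return to the initial conversion state, as 7.20.7 requires
         memset(&internal, 0, sizeof internal);
         // yes: we are stateful (or pretend we are)
         return 1;
      }
      retval = mbrtowc(pwc, s, n, &internal);
      if (retval == (size_t)-1 || retval == (size_t)-2)
      {
         // mbtowc has a single error value for invalid and incomplete input
         return -1;
      }
      return (int)retval;
   }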
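
To make the role of the return value concrete, here is a minimal sketch of how a caller typically drives <mbtowc>: the returned length is the only thing it uses to advance through the string. The function name count_wide_chars is mine, purely for illustration, and the loop assumes the default "C" locale.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the wide characters that mbtowc
   reports for the string s in the current locale. */
size_t count_wide_chars(const char *s)
{
    size_t count = 0;
    size_t left = strlen(s);
    wchar_t wc;
    int len;

    /* A null s puts mbtowc back into its initial conversion state. */
    (void)mbtowc(NULL, NULL, 0);

    /* Advance by whatever length mbtowc returns; stop on the null
       character (0) or on an error (-1). */
    while (left > 0 && (len = mbtowc(&wc, s, left)) > 0)
    {
        s += len;
        left -= (size_t)len;
        count++;
    }
    return count;
}
```

Since the caller never looks inside the bytes it skips, it cannot tell whether an advance of 4 came from one call or from two calls returning 2 each.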

As the definition of multibyte characters and wide characters is now
completely up to <mbrtowc>, we could well adopt the convention (or call
it a "trick", if you prefer) of pretending that a 4-byte UTF-8 multibyte
sequence is actually a sequence of *two* 2-byte multibyte sequences.

Technically, the trick is possible because:

        a) returning 2 twice instead of 4 once guarantees the correct
advance while scanning a string;
        b) we can actually map both our fake 2-byte multibyte sequences to
an actual "wide character": the high and low surrogates;
        c) the <mbstate_t> object can be used to store the relevant data
across the two calls.

Legally, the trick is possible because of the purposely vague wording of the
C Standard, which leaves the definition of wide and multibyte characters
completely up to the implementation.

Here is what I mean:

   // Excerpt from my hypothetical <wchar.h> for UTF-16 wide characters
   // ...
   // (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
   typedef unsigned short wchar_t; // unsigned, so that 0x8000..0xFFFF (surrogates included) fit
   // ...
   // (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character
   // utilities <wchar.h>")
   typedef wchar_t mbstate_t;
   // ...

   // My hypothetical "mbrtowc.c" for UTF-16 wide characters
   #include <wchar.h>
   // (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
   size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
   {
      extern int _MyDecodeUtf8 (const char * s, size_t n, long * c32);
      extern void _MyEncodeUtf16 (long c32, wchar_t * hi16, wchar_t * lo16);
      static mbstate_t internal = 0;
      long c32;
      int retval;
      if (ps == NULL)
      {
         ps = &internal;
      }
      if (s == NULL)
      {
         pwc = NULL;
         s = "";
         n = 1;
      }
      if (*ps != 0)
      {
         if (pwc != NULL)
         {
            // output second surrogate saved in previous call
            *pwc = *ps;
         }
         // clear saved surrogate
         *ps = 0;
         // return fake multibyte length
         return 2;
      }
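      retval = _MyDecodeUtf8(s, n, &c32);
      if (retval == 4)
      {
         // output first surrogate and save second surrogate for next call
         // (go through a scratch variable, because pwc may be NULL)
         wchar_t hi16;
         _MyEncodeUtf16(c32, &hi16, ps);
         if (pwc != NULL)
         {
            *pwc = hi16;
         }
         // return fake multibyte length
         retval = 2;
      }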
      else if (retval >= 0 && pwc != NULL)
      {
         *pwc = (wchar_t)c32;
      }
      return retval;
   }
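
The two helpers declared extern above are not shown, so here is a rough sketch of what they might look like. The names my_decode_utf8 and my_encode_utf16, the type my_wchar_t, and the return convention (length, 0 for the null character, -1 for malformed input, -2 for incomplete input) are all my assumptions. The decoder is deliberately simplified: it does not reject overlong 2- and 3-byte forms or encoded surrogates.

```c
#include <stddef.h>

/* Hypothetical stand-in for the 16-bit wchar_t of the excerpt above. */
typedef unsigned short my_wchar_t;

/* Decode one UTF-8 sequence from s (at most n bytes) into *c32.
   Returns the sequence length (1..4), 0 for the null character,
   -2 if the sequence is incomplete, -1 if it is malformed. */
int my_decode_utf8(const char *s, size_t n, long *c32)
{
    const unsigned char *p = (const unsigned char *)s;
    int len, i;
    long c;

    if (n == 0) return -2;
    if (p[0] < 0x80)                { len = 1; c = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; c = p[0] & 0x07; }
    else return -1;                 /* stray trail byte or invalid lead byte */

    if ((size_t)len > n) return -2;
    for (i = 1; i < len; i++)
    {
        if ((p[i] & 0xC0) != 0x80) return -1;
        c = (c << 6) | (p[i] & 0x3F);
    }
    /* A 4-byte sequence must encode U+10000..U+10FFFF. */
    if (len == 4 && (c < 0x10000L || c > 0x10FFFFL)) return -1;

    *c32 = c;
    return (c == 0) ? 0 : len;
}

/* Split a supplementary code point (U+10000..U+10FFFF) into its
   high and low surrogates. */
void my_encode_utf16(long c32, my_wchar_t *hi16, my_wchar_t *lo16)
{
    long v = c32 - 0x10000L;                     /* 20 significant bits */
    *hi16 = (my_wchar_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *lo16 = (my_wchar_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}
```

For instance, the 4-byte sequence F0 9F 98 80 decodes to U+1F600, which splits into the surrogates D83D and DE00: the first goes out through pwc at once, the second waits in the <mbstate_t> object until the next call.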

The above UTF-16 implementation could perhaps look relatively "smart", but a
UTF-8 implementation would definitely look very silly.

However, if we agree that defining what a "multibyte character" and a
"wide character" are is the exclusive task of <mbtowc> (and hence of
<mbrtowc>), then the implementation below, silly as it is, could well be
100% compliant with C99:

   // Excerpt from my hypothetical <wchar.h> for UTF-8 (or DBCS, or SBCS,
   // or any byte-oriented encoding) wide characters
   // ...
   // (See ISO/IEC 9899:1999 - 7.17 "Common definitions <stddef.h>")
   typedef char wchar_t;
   // ...
   // (See ISO/IEC 9899:1999 - 7.24 "Extended multibyte and wide character
   // utilities <wchar.h>")
   typedef wchar_t mbstate_t;
   // ...

   // My hypothetical "mbrtowc.c" with UTF-8 (or DBCS, or ...) wide
   // characters
   #include <wchar.h>
   // (See ISO/IEC 9899:1999 - 7.24.6.3.2 "The mbrtowc function")
   size_t mbrtowc (wchar_t * pwc, const char * s, size_t n, mbstate_t * ps)
   {
      if (s == NULL)
      {
         pwc = NULL;
         s = "";
         n = 1;
      }
      (void)ps; // the conversion state is unused
      if (n < 1)
      {
         // no bytes to look at: incomplete input, not an encoding error
         return (size_t)-2;
      }
      if (pwc != NULL)
      {
         *pwc = *s;
      }
      return (*s == 0) ? 0 : 1;
   }
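
To see what a caller would observe with this byte-wise convention, here is a small self-contained sketch. The names my_wchar8_t, my_mbstate8_t, my_mbrtowc8 and my_count_units8 are hypothetical, chosen only so that the example compiles next to a real C library.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the byte-sized wchar_t and mbstate_t. */
typedef char my_wchar8_t;
typedef my_wchar8_t my_mbstate8_t;

/* Same logic as the byte-wise conversion above, under another name. */
size_t my_mbrtowc8(my_wchar8_t *pwc, const char *s, size_t n,
                   my_mbstate8_t *ps)
{
    (void)ps;                       /* no state is needed */
    if (s == NULL) { pwc = NULL; s = ""; n = 1; }
    if (n < 1) return (size_t)-2;   /* nothing to look at */
    if (pwc != NULL) *pwc = *s;
    return (*s == 0) ? 0 : 1;
}

/* Count the "wide characters" found in at most n bytes of s,
   stopping at the null character or at the first error. */
size_t my_count_units8(const char *s, size_t n)
{
    my_mbstate8_t st = 0;
    my_wchar8_t wc;
    size_t count = 0;
    size_t len;

    while (n > 0)
    {
        len = my_mbrtowc8(&wc, s, n, &st);
        if (len == 0 || len == (size_t)-1 || len == (size_t)-2) break;
        s += len;
        n -= len;
        count++;
    }
    return count;
}
```

Scanning the three bytes 68 C3 A9 ("h" followed by U+00E9 in UTF-8) yields three "wide characters", one per byte, which is exactly the silly but arguably conforming behaviour described above.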

_ Marco
