RE: string vs. char [was Re: Java and Unicode]

addison Fri, 17 Nov 2000 10:30:38 -0800
Well... I think you're right. I knew that char and string units weren't
really the same thing. My concern was how to make it easy on developers to
use the Unicode API using their "native intelligence".

More thought makes me less certain of my approach. Specifically, as Mark
points out, looping structures are much uglier when using
pointers. And I still have to have all of the code for the scalar value
conversion. It's better to force a casting macro on the developers than
to try to do their dirty work for them. Fooling developers who use your
API is a Bad Idea, usually.
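
For illustration, here is a minimal sketch of the scalar value conversion
discussed here, assuming UTF-16 strings held in 16-bit units. The names
u16unit, u32scalar, u16_scalar_at and u16_count_chars are hypothetical and
are not taken from any real API; error handling is kept deliberately simple.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types, for illustration only. */
typedef uint16_t u16unit;   /* one UTF-16 code unit, as stored in strings */
typedef uint32_t u32scalar; /* one Unicode scalar value (isolated char)   */

/* Decode the character that starts at s[0] and store the number of code
   units it occupies (1 or 2) in *len. An unpaired surrogate is returned
   unchanged with *len == 1, so callers always make progress. */
static u32scalar u16_scalar_at(const u16unit *s, size_t *len)
{
    u32scalar hi = s[0];
    if (hi >= 0xD800 && hi <= 0xDBFF) {
        /* s is NUL-terminated and s[0] != 0, so s[1] is readable. */
        u32scalar lo = s[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *len = 2;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    *len = 1;
    return hi;
}

/* Count characters (not code units) in a NUL-terminated UTF-16 string. */
static size_t u16_count_chars(const u16unit *s)
{
    size_t n = 0;
    size_t len;
    while (*s != 0) {
        (void)u16_scalar_at(s, &len);
        s += len;   /* step by one whole character */
        n++;
    }
    return n;
}
```

The loop advances by whole characters, which is the extra bookkeeping that
makes pointer-based iteration less tidy than indexing a plain array.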

Thanks again for the feedback.

Addison

===========================================================
Addison P. Phillips                    Principal Consultant
Inter-Locale LLC                http://www.inter-locale.com
Los Gatos, CA, USA          mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)              +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Fri, 17 Nov 2000, Marco Cimarosti wrote:

> Addison P. Phillips wrote:
> > I ended up deciding that the Unicode API for this OS will only work in
> > strings. CTYPE replacement functions (such as isalpha) and character based
> > replacement functions (such as strchr) will take and return strings for
> > all of their arguments.
> >
> > Internally, my functions are converting the pointed character to its
> > scalar value (to look it up in the database most efficiently).
> >
> > This isn't very satisfying. It goes somewhat against the grain of 'C'
> > programming. But it's equally unsatisfying to use a 32-bit representation
> > for a character and a 16-bit representation for a string, because in 'C',
> > a string *is* an array of characters. Which is more natural? Which is more
> > common? Iterating across an array of 16-bit values or
> 
> Actually, C does have different types for characters within strings and for
> characters in isolation.
> 
> The type of a string literal (e.g. "Hello world!\n") is "array of char",
> while the type of a character literal (e.g. 'H') is "int".
> 
> This distinction is generally reflected also in the C library, so that you
> don't get compiler warnings when passing character constants to functions.
> 
> E.g., compare the following functions from <stdio.h>:
> 
> int fputs(const char * s, FILE * stream);
> int fputc(int c, FILE * stream);
> 
> The same convention is generally used throughout the C library, not only in
> the I/O functions. E.g.:
> 
> int isalpha(int c);
> int tolower(int c);
> 
> This distinction has been retained also in the newer "wide character
> library": "wchar_t" is the wide equivalent of "char", while "wint_t" is the
> wide equivalent of "int".
> 
> The wide version of the examples above is:
> 
> int fputws(const wchar_t * s, FILE * stream);
> wint_t fputwc(wint_t c, FILE * stream);
> 
> int iswalpha(wint_t c);
> wint_t towlower(wint_t c);
> 
> In a Unicode implementation of the "wide character library" (wchar.h and
> wctype.h), this difference may be exploited to use different UTF's for
> strings and characters:
> 
> typedef unsigned short wchar_t;
> /* UTF-16 character, used within string. */
> 
> typedef unsigned long  wint_t;
> /* UTF-32 character, used for handling isolated characters. */
> 
> But, unfortunately, there is a "but". Type "wchar_t" is *also* used for
> isolated characters in a couple of stupid APIs:
> 
> wchar_t * wcschr(const wchar_t * s, wchar_t c);
> wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
> size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
> 
> BTW, here wcschr() and wcsrchr() actually depart from their "narrow"
> ancestors: strchr() and strrchr() are declared with "int c".
> 
> But I think that changing those "wchar_t c" to "wint_t c" is a smaller
> "violence" to the standards than changing them to "const wchar_t * c".
> And you can also implement it in an elegant, quasi-standard way:
> 
> wchar_t * _wcschr_32(const wchar_t * s, wint_t c);
> wchar_t * _wcsrchr_32(const wchar_t * s, wint_t c);
> size_t _wcrtomb_32(char * s, wint_t c, mbstate_t * mbs);
> 
> #ifdef PEDANTIC_STANDARD
> wchar_t * wcschr(const wchar_t * s, wchar_t c);
> wchar_t * wcsrchr(const wchar_t * s, wchar_t c);
> size_t wcrtomb(char * s, wchar_t c, mbstate_t * mbs);
> #else
> #define wcschr  _wcschr_32
> #define wcsrchr _wcsrchr_32
> #define wcrtomb _wcrtomb_32
> #endif
> 
> I would like to see the opinion of C standardization experts (e.g. A. Leca)
> about this forcing of the C standard.
> 
> _ Marco.
> 
> ______________________________________________
> La mia e-mail è ora:         My e-mail is now:
> >>>       marco.cimarostiªeurope.com       <<<
> (Cambiare "ª" in "@")      (Change "ª" to "@")
>
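
As a rough sketch of how a wcschr() variant taking a 32-bit character might
search a UTF-16 string, along the lines of the _wcschr_32 idea quoted above:
unit16 and scalar32 are hypothetical stand-ins for a 16-bit wchar_t and a
32-bit wint_t, and u16_strchr32 is an illustrative name, not a standard or
existing function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, for illustration only. */
typedef uint16_t unit16;   /* plays the role of a 16-bit wchar_t */
typedef uint32_t scalar32; /* plays the role of a 32-bit wint_t  */

/* Find the first occurrence of scalar value c in the NUL-terminated
   UTF-16 string s. Returns a pointer to its first code unit, or NULL. */
static const unit16 *u16_strchr32(const unit16 *s, scalar32 c)
{
    /* Reject anything that is not a Unicode scalar value. */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return NULL;

    if (c < 0x10000) {
        /* BMP character: a single code unit. As with wcschr(),
           searching for 0 finds the terminator itself. */
        for (;; s++) {
            if (*s == c)
                return s;
            if (*s == 0)
                return NULL;
        }
    }

    /* Supplementary character: search for its surrogate pair. */
    unit16 hi = (unit16)(0xD800 + ((c - 0x10000) >> 10));
    unit16 lo = (unit16)(0xDC00 + ((c - 0x10000) & 0x3FF));
    for (; *s != 0; s++) {
        /* *s != 0, so s[1] is at worst the terminator. */
        if (s[0] == hi && s[1] == lo)
            return s;
    }
    return NULL;
}
```

The caller passes one 32-bit value and never sees surrogates; the split into
a surrogate pair stays inside the library function.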