RE: My Querry

Addison Phillips [wM] Tue, 23 Nov 2004 11:40:29 -0800

Hi Mike,

You misread my sentence, I think. I did NOT say that C language strings are compatible with UTF-8, but rather that UTF-8 was designed with compatibility with C language "strings" (char*) in mind. The main point of UTF-8 was actually to be compatible with Unix file systems, of course. But one stimulus for the encoding was that the Plan 9 operating system would not have to rewrite its C libraries to deal with UTF-16 (then UCS-2). In other words, my statement is quite correct about the design goals of FSS-UTF, UTF-8's progenitor. See for example:

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

If you read carefully, you'll see the desire to protect the NUL and / (slash) bytes, so that neither can turn up inside the encoding of some other character.

A NUL character (U+0000) is considered to terminate a char* by many C functions. I don't see how it helps anything to confuse a new user by bringing up the fact that you can't put a NUL character into the middle of a char*. This, as you point out, applies equally to ASCII data.

Java's TES was designed to transport java.lang.String objects in a C char*. Java strings can contain the character U+0000, and Java's developers wished to allow this character in the middle of a java.lang.String. Hence this bit of fudge.

When talking to a newbie, I purposely omitted all of these glorious but pointless details. The point is that UTF-8 can go into your char* just like any other multibyte encoding, contrary to the myth that char* and Unicode cannot mix.

Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

-----Original Message-----
From: Mike Ayers [mailto:[EMAIL PROTECTED]
Sent: November 23, 2004 10:32
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: RE: My Querry

> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Addison Phillips [wM]
> Sent: Tuesday, November 23, 2004 9:14 AM

> One of the nice things about UTF-8 is that the ASCII bytes
> from 0 to 7F hex (including the C0 control characters from
> \x00 through \x1F, including NUL) represent the ASCII
> characters from 0 to 7F hex.

        Correct.

> That is, among other things
> UTF-8 was designed specifically to be compatible with C
> language strings.

        Wrong! Weren't you paying attention last week? C language strings are not even fully compatible with ASCII. UTF-8 is fully compatible with ASCII, therefore C language strings are not fully compatible with UTF-8. The Java folks devised a TES, which was UTF-8 with one change (and therefore no longer UTF-8), which was "designed specifically to be compatible with C language strings". This method apparently upsets some people.

        Since the problem between C strings and ASCII/UTF-8/(your character set here) is solely the inability to handle zero-valued character elements, it may be, and very often is, practical to use C strings anyway, as zero-valued characters are uncommon at best in practice, and explicitly disallowed in many applications.

/|/|ike
