Erik followed up:

> What I'm hearing from you all is that a null
> in UTF-8 is for termination and termination only.
> Is this correct?

Not quite. A null byte (0x00) in UTF-8 is simply the
representation of the NULL character (U+0000). It can
be present in UTF-8 data for whatever purpose one might
use a NULL in textual data.
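
To make that concrete, here is a small C sketch (the
sample bytes are made up for illustration) showing that
U+0000 encodes to the single byte 0x00 in UTF-8, the same
way as any other character in the U+0000..U+007F range:

    #include <stdio.h>

    int main(void)
    {
        /* A buffer containing an embedded U+0000, carried with
         * an explicit length, since the NULL itself is data here. */
        unsigned char data[] = { 'A', 0x00, 'B' };
        size_t len = sizeof data;

        /* Every character in U+0000..U+007F is encoded in UTF-8
         * as the single byte of the same value, so the NULL
         * shows up as the byte 0x00. */
        for (size_t i = 0; i < len; i++)
            printf("%02X ", data[i]);
        printf("\n");   /* prints: 41 00 42 */
        return 0;
    }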

One very common use of a NULL is as a convention for
string termination. And if you are using NULLs that way,
then of course any API which depends on that convention
will have a problem with NULL characters embedded *in*
the string for other reasons, since it will prematurely
detect end-of-string in its processing.
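
Here is a quick C illustration of that failure mode (the
buffer contents are arbitrary): strlen() and printf("%s")
both rely on the 0x00 terminator, so they stop at the
first embedded NULL and silently drop the rest of the data:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Nine bytes of data, with a NULL embedded after "abc"
         * (the compiler also appends a terminating '\0'). */
        char buf[] = "abc\0defgh";

        /* Anything that follows the termination convention sees
         * only the first three bytes. */
        printf("strlen: %zu\n", strlen(buf));   /* 3, not 9 */
        printf("string: %s\n",  buf);           /* "abc" */
        return 0;
    }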

If your string termination convention does *not* use
NULL (but instead some other mechanism such as explicit
length attributes), then there is no inherent reason why
you could not use NULLs embedded in the string for some
other purpose -- for example, to delimit fielded data
within the string. In such cases, if your Unicode data is
represented in the UTF-8 encoding form, those NULLs will
end up as embedded 0x00 bytes, because that is how NULL
characters are represented in UTF-8.
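
As a sketch of that kind of use (the record layout here is
purely hypothetical), the fields below are packed into one
buffer and delimited by 0x00 bytes, with the total length
carried alongside the buffer rather than inferred from a
terminator:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Three UTF-8 fields in one buffer, separated by 0x00;
         * "caf\xC3\xA9" is "cafe" with U+00E9 encoded in UTF-8. */
        const char rec[] = "id\0caf\xC3\xA9\0done";
        size_t len = sizeof rec - 1;   /* drop the trailing '\0' */

        /* Walk the buffer by explicit length; each field is an
         * ordinary NUL-free UTF-8 string, so strlen() is safe
         * within a field. */
        for (const char *p = rec; p < rec + len; p += strlen(p) + 1)
            printf("field: %s\n", p);
        return 0;
    }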

--Ken

