From: "Antoine Leca" <[EMAIL PROTECTED]>
On Thursday, November 25th, 2004 08:05Z Philippe Verdy wrote:

In ASCII, as in all other ISO 646 charsets, code positions are ALL in the range 0 to 127. Nothing is defined outside of this range, exactly as Unicode does not define or mandate anything for code points larger than 0x10FFFF, whether they are stored or handled in memory with 21-, 24-, 32-, or 64-bit code units, more or less packed according to architecture or network framing constraints. So the question of whether an application can or cannot use the extra bits is left to the application, and this has no influence on the standard charset encoding or on the encoding of Unicode itself.

What you seem to miss here is that, since computers are nowadays based on
8-bit units, there was a strong move in the '80s and the '90s to
_reserve_ ALL 8 bits of the octet for characters. And what A. Freitag
was asking for was precisely to avoid bringing in different ideas about
possibilities to encode other classes of information inside the 8th bit
of an ASCII-based storage of a character.

This is true, for example, of an API that simply says that a "char" (or whatever datatype is convenient in a given language) contains an ASCII code or a Unicode code point, and expects the datatype instance to be equal to that ASCII code or Unicode code point.
In that case, the assumption of such an API is that you can compare "char" instances for equality directly, instead of first extracting the effective code points, and this greatly simplifies programming.
So an API that says a "char" will contain ASCII code positions should always assume that only the instance values 0 to 127 will be used; the same goes for an API that says an "int" contains a Unicode code point.
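
For instance, under that assumption a recipient can test the stored value directly against a character constant. A minimal C sketch (the is_at_sign() helper is hypothetical); the comparison only works when no extra bits are present:

    #include <stdio.h>

    /* Recipient that relies on the identity assumption:
       the stored value IS the ASCII code, so direct comparison works. */
    static int is_at_sign(unsigned char c)
    {
        return c == '@';                        /* true only when c == 64 exactly */
    }

    int main(void)
    {
        printf("%d\n", is_at_sign(64));         /* 1: pure ASCII '@' */
        printf("%d\n", is_at_sign(64 | 0x80));  /* 0: '@' with an extra 8th bit (192) */
        return 0;
    }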


The problem arises only when the same datatype is also used to store something else (even if it is just a parity bit, or a bit forced to 1).

As long as this is not documented in the API itself, it should not be done, in order to preserve the rational assumption about the identity of chars and the identity of codes.

So for me, a protocol that adds a parity bit to the ASCII code of a character is doing so on purpose, and this should be isolated in the documented part of its API. If the protocol wants to send this data to an API or interface that does not document this use, it should remove/clear the extra bit, to make sure that the character identity is preserved and interpreted correctly (I can't see how such a protocol implementation could expect that a '@' character coded as 192 will be correctly interpreted by the other, simpler interface that expects all '@' instances to be equal to 64...)
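
A minimal sketch of that boundary, assuming an even-parity convention (the helper names are hypothetical): the protocol layer may set its private bit on the wire, but it must mask it off again before handing the value to an interface that only documents ASCII.

    #include <stdio.h>

    /* Hypothetical wire format: bit 7 carries even parity over bits 0..6. */
    static unsigned char add_parity(unsigned char ascii)
    {
        unsigned char parity = 0;
        for (int i = 0; i < 7; i++)
            parity ^= (ascii >> i) & 1;
        return (unsigned char)(ascii | (parity << 7));
    }

    /* Strip the protocol-private bit before crossing the API boundary. */
    static unsigned char to_plain_ascii(unsigned char wire)
    {
        return wire & 0x7F;
    }

    int main(void)
    {
        unsigned char wire = add_parity('@');
        printf("on the wire: %d\n", wire);                             /* 192 */
        printf("handed to the ASCII API: %d\n", to_plain_ascii(wire)); /* 64 */
        return 0;
    }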

In safe programming, any unused field in a storage unit should be given a mandatory default. As the simplest form that preserves the code identity in ASCII, or the code point identity in Unicode, is the one that uses 0 as this default, extra bits should be cleared. If they are not, anything can happen within the recipient of the "character":

- the recipient may interpret the value as something other than a character, behaving as if the character data were absent (so there will be data loss, in addition to unexpected behavior). This is bad practice, given that it is not documented in the recipient API or interface.

- the recipient may interpret the value as another character, or may fail to recognize the expected character. This is not clearly a bad programming practice for recipients, because it is the simplest form of handling for them. However the recipient will not behave the way the sender expected, and it is the sender's fault, not the recipient's.

- the recipient may take additional unexpected actions on top of the normal handling of the character without the extra bits. This would be a bad programming practice for recipients if this specific behavior is not documented, so senders should not need to care about it.

- the recipient may filter/ignore the value completely, resulting in data loss; this may sometimes be a good practice, but only if this recipient behavior is documented.

- the recipient may filter/ignore the extra bits (for example by masking them off); for me this is a bad programming practice for recipients...

- the recipient may substitute another value for the incorrect one (such as the ASCII SUB control, or U+FFFD REPLACEMENT CHARACTER in Unicode, to mark the presence of an error without changing the string length).

- an exception may be raised (so the interface will fail) because the given value does not belong to the expected ASCII code range or Unicode code point range (the safest practice for recipients working under the "design by contract" model is to check the value range of all incoming data or parameters, to force the senders to obey the contract); a sketch of this and the previous behavior follows this list.
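
A minimal sketch of those last two behaviors (the entry points accept_code_point() and sanitize_code_point() are hypothetical; an errno-style return stands in for the raised exception):

    #include <errno.h>
    #include <stdio.h>

    /* Contract: the caller must pass a code point in 0..0x10FFFF.
       Anything else is rejected before any handling takes place. */
    static int accept_code_point(long cp)
    {
        if (cp < 0 || cp > 0x10FFFFL) {
            errno = EINVAL;
            return -1;                   /* the "exception": the interface fails */
        }
        /* ... normal handling of the character ... */
        return 0;
    }

    /* Alternative policy: keep the string length but mark the error. */
    static long sanitize_code_point(long cp)
    {
        return (cp < 0 || cp > 0x10FFFFL) ? 0xFFFDL : cp;   /* U+FFFD */
    }

    int main(void)
    {
        printf("%d\n", accept_code_point(0x110000L));   /* -1: out of range, rejected */
        printf("U+%04lX\n", sanitize_code_point(-1L));  /* U+FFFD: substituted */
        return 0;
    }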

Don't blindly expect that every interface capable of accepting ASCII codes in 8-bit code units will also transparently accept all values outside the restricted ASCII code range, unless its documentation explicitly states how such values will be handled and whether this extension adds some equivalences (for example when the recipient discards the extra bits)...

The only safe way is then:
- to send only values in the definition range of the standard encoding.
- to not accept values out of this range, by raising a run-time exception. Run-time checking may sometimes be avoided in languages that support value ranges in their datatype definitions; but this requires a new API with explicitly restricted datatypes, distinct from the basic character datatype (the Character class in Java is such a datatype: its code point methods restrict acceptable values to the Unicode code point range 0..0x10FFFF)...
- to create separate datatype definitions if one wants to pack more information into the same storage unit (for example by defining bitfield structures in C/C++, or by hiding the packing within the private implementation of the storage, not accessible directly without accessor methods, and not exposing these storage details in the published, public or protected interfaces), possibly with several constructors (provided that the API can also be used to determine whether an instance is a character or not), but with at least an API to retrieve the original, unique standard code from the instance (see the sketch below).
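
As an illustration of that last point (a sketch only; the struct and field names are hypothetical), a C bitfield keeps the packing private while an accessor hands out the pure standard code:

    #include <stdio.h>

    /* Private storage: the code and the extra protocol bit are separate
       fields, so the extra bit can never leak into the character value. */
    struct packed_char {
        unsigned code   : 7;   /* the ASCII code, 0..127 */
        unsigned parity : 1;   /* protocol-private information */
    };

    /* Accessor: the only view exposed to character-handling interfaces. */
    static unsigned char packed_char_code(struct packed_char pc)
    {
        return (unsigned char)pc.code;
    }

    int main(void)
    {
        struct packed_char pc = { '@', 1 };      /* extra bit set internally */
        printf("%d\n", packed_char_code(pc));    /* 64: the original standard code */
        return 0;
    }

What gets passed around is the accessor's result, never the raw storage, so '@' always comes out as 64.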


For C/C++ programs that use the native "char" datatype along with C strings, the only safe way is to NOT put anything other than the pure standard code in the instance value, so that one can effectively be sure that '@'==64 in an interface that is expected to receive ASCII characters.
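
A small guard of that kind might look like this (is_pure_ascii() is a hypothetical helper, not part of any standard API): it verifies that every byte of a C string is a pure 7-bit code before the string is handed to an ASCII-only interface.

    #include <stddef.h>
    #include <stdio.h>

    /* Returns 1 when every byte of the C string is a pure 7-bit ASCII code. */
    static int is_pure_ascii(const char *s)
    {
        for (size_t i = 0; s[i] != '\0'; i++) {
            if ((unsigned char)s[i] > 127)
                return 0;
        }
        return 1;
    }

    int main(void)
    {
        printf("%d\n", is_pure_ascii("user@host"));   /* 1 */
        printf("%d\n", is_pure_ascii("caf\xE9"));     /* 0: 0xE9 is not an ASCII code */
        return 0;
    }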

The same goes for Java, which assumes that all "char" instances are regular UTF-16 code units (this is less of a problem for UTF-16, because the whole 16-bit code unit space is valid and has a normative behavior in Unicode, even for surrogate and non-character code units), and for C/C++ programs using 16-bit wide code units.

For C/C++ programs that use the ANSI "wchar_t" datatype (which is not guaranteed to be 16-bit wide), no one should expect that extra bits that may exist on some platforms are usable.

For any language that uses a fixed-width integer to store UTF-32 code units, the definition domain should be checked by recipients, or recipients should document their behavior if other values are possible:

Many applications will accept not only valid code points in 0..0x10FFFF, but also some "magic" values like -1 which have another meaning (such as the end of the input stream, or no character available yet). When this happens, the behavior is (or should be) documented explicitly, because the interface no longer communicates only valid characters.
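
The C standard library's getchar()/getc() is the classic instance of this pattern: it returns an int rather than a char precisely so that every valid byte value and the out-of-band EOF value (a negative int) can coexist, and that convention is documented. A hypothetical UTF-32 reader can follow the same documented convention:

    #include <stdio.h>

    #define CP_EOF (-1L)   /* documented magic value: end of input, not a character */

    /* Hypothetical reader: returns a code point in 0..0x10FFFF, or CP_EOF. */
    static long read_code_point(FILE *in)
    {
        int c = getc(in);
        if (c == EOF)
            return CP_EOF;
        return (long)(unsigned char)c;   /* toy version: one byte = one code point */
    }

    int main(void)
    {
        long cp;
        while ((cp = read_code_point(stdin)) != CP_EOF)
            printf("U+%04lX\n", cp);     /* only here is the value a character */
        return 0;
    }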



