From: "Antoine Leca" <[EMAIL PROTECTED]>
On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote:

> I'm not seeing a lot in this thread that adds to the store of knowledge
> on this issue, but I see a number of statements that are easily
> misconstrued or misapplied, including the thoroughly discredited
> practice of storing information in the high bit, when piping seven-bit
> data through eight-bit pathways. The problem with that approach, of
> course, is that the assumption that there were never going to be 8-bit
> data in these same pipes proved fatally wrong.

Since I was the person who introduced this theme into the thread, I feel
there is an important point that should be highlighted here. The "thoroughly
discredited practice of storing information in the high bit" is, in fact, like
the Y2K problem, a bad consequence of past practices. The only difference is
that we do not have a hard time limit to solve it.

Whether an application chooses to use the 8th (or even 9th...) bit of a storage, memory, or networking byte that also holds an ASCII-coded character as a zero, as an even or odd parity bit, or for some other purpose, is the choice of the application. It does not change the fact that this extra bit (or these extra bits) is not used to encode the character itself.
I see this usage as a data structure that *contains* (I do not say *is*) a character code. This is completely out of the scope of the ASCII encoding itself, which is concerned only with the codes assigned to characters, and only with characters.
In ASCII, as in all the other ISO 646 charsets, code positions are ALL in the range 0 to 127. Nothing is defined outside of this range, exactly as Unicode does not define or mandate anything for code points larger than 0x10FFFF, whether they are stored or handled in memory with 21-, 24-, 32-, or 64-bit code units, more or less packed according to architecture or network framing constraints.
So the question of whether an application can or cannot use the extra bits is left to the application, and this has no influence on the standard charset encoding or on the encoding of Unicode itself.
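
To make the distinction concrete, here is a minimal sketch in C (my own illustration, nothing mandated by the standards): a byte carrying a 7-bit ASCII code plus an even parity bit is exactly such a containing structure. The code is recovered by masking; the extra bit is application data riding along.

    /* A storage unit that *contains* an ASCII code: bit 7 holds an
       even parity bit, bits 0-6 hold the code itself. */
    static unsigned char pack_with_parity(unsigned char code)
    {
        unsigned char c = code & 0x7F, p = 0;
        int i;
        for (i = 0; i < 7; i++)
            p ^= (c >> i) & 1;            /* parity of the 7 code bits */
        return (unsigned char)((p << 7) | c);
    }

    /* Recover the character code: the extra bit is simply dropped. */
    static unsigned char code_of(unsigned char unit)
    {
        return unit & 0x7F;
    }

The identity of the character survives because the packing is invertible over the 0..127 definition domain; what the 8th bit means is the application's business alone.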


So a good question to ask is how to handle values of variables or instances that are supposed to contain a character code, but whose internal storage can hold values outside the designated range. For me this is left to the application, but many applications will simply assume that such a datatype accepts exactly one code per designated character. Using the extra storage bits for something else will break this legitimate assumption, so applications must be specially prepared to handle this case, by filtering values before checking for character identity.
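
Concretely (again only a sketch of mine, with a hypothetical helper name): an identity test on such a datatype should filter first and compare afterwards, never compare the raw storage unit.

    /* Hypothetical identity check for a storage unit that may carry
       extra application bits above the 7 ASCII code bits. */
    static int is_char(unsigned char unit, char c)
    {
        return (unit & 0x7F) == ((unsigned char)c & 0x7F);
    }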

Neither Unicode nor US-ASCII nor ISO 646 defines what an application can do there. The code positions or code points they define are *unique* only in their *definition domain*. If you use a larger domain for values, nothing in Unicode or ISO 646 or ASCII defines how to interpret the value: these standards will NOT assume that the low-order bits can safely be used to index equivalence classes, because these equivalence classes cannot be defined strictly within the definition domain of these standards.

So I see no valid rationale behind requiring applications to clear the extra bits, or to leave the extra bits unaffected, or forcing these applications to necessarily interpret the low-order bits as valid code points.
We are outside the definition domain, so any larger domain is application-specific, and applications may as well use ASCII or Unicode within storage code units that add some offset, or multiply the standard codes by a constant, or apply a reordering transformation (a permutation) to them and to other possible non-character values.
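
As an illustration of such an application-specific domain (a hypothetical scheme of my own, sanctioned by none of these standards): 32-bit storage units holding Unicode code points shifted by a fixed offset, the low values being reserved for application sentinels. Only the inverse transformation yields a code point again.

    #include <stdint.h>

    #define APP_OFFSET 0x200000u   /* hypothetical, application-chosen */

    /* Values below APP_OFFSET are free for non-character sentinels. */
    static uint32_t to_storage(uint32_t cp)     { return cp + APP_OFFSET; }
    static uint32_t to_codepoint(uint32_t unit) { return unit - APP_OFFSET; }

Such a scheme is as legitimate as any other, precisely because the transformation is invertible over the definition domain 0..0x10FFFF.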


When ASCII, and ISO 646 in general, define a charset with 128 unique code positions, they do not say how this information will be stored (an application may as well need to use 7 distinct bytes, or other structures, not necessarily consecutive, to *represent* the unique codes that stand for ASCII or ISO 646 characters), and they do not restrict the usage of these codes separately from any other independent information (such as parity bits, or anything else). Any storage structure that preserves the identity and equivalences of the original standard codes over their definition domain is equally valid as a representation of the standard, but such a structure is out of the scope of the charset definition.
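
A deliberately extreme sketch of such a structure (mine, for illustration only): each of the 7 code bits stored in its own, not necessarily consecutive, byte. It is still a perfectly valid *representation* of ASCII, because the original code, and hence the identity of the character, can always be recovered.

    /* Spread the 7 bits of an ASCII code over 7 separate bytes. */
    static void spread(unsigned char code, unsigned char out[7])
    {
        int i;
        for (i = 0; i < 7; i++)
            out[i] = (code >> i) & 1;       /* one bit per byte */
    }

    /* Reassemble the original code: identity is preserved. */
    static unsigned char gather(const unsigned char in[7])
    {
        unsigned char code = 0;
        int i;
        for (i = 0; i < 7; i++)
            code |= (unsigned char)((in[i] & 1) << i);
        return code;
    }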



