Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> Whenever an application chooses to use the 8th (or even 9th...) bit of
> a storage, memory, or networking byte that also holds an ASCII-coded
> character, whether as a zero, as an even or odd parity bit, or for some
> other purpose, that is the application's choice. It does not change the
> fact that this extra bit (or bits) is not used to code the character
> itself. I see this usage as a data structure that *contains* (I don't
> say *is*) a character code. This is completely outside the scope of the
> ASCII encoding itself, which is concerned only with the codes assigned
> to characters, and only characters.

Unfortunately, although *we* understand this distinction, most people
outside this list will not.  And to make things worse, they will use
language that only serves to blur the distinction.

For example, the term "8-bit ASCII" was formerly used to mean an 8-bit
byte that contained an ASCII character code in the bottom 7 bits, and
where bit 7 (the MSB) might be:

- always 0
- always 1
- odd or even parity

depending on the implementation.  (This was before the 1980s, when
companies started populating code points 128 and beyond with "extended
Latin" letters and other goodies, and calling *that* 8-bit ASCII.)

Implementations would pass these 8-bit thingies around, bit 7 and all,
and expect them to remain unscathed.  Programs that emitted bit 7 = 1
expected to receive bit 7 = 1.  Those that emitted odd parity expected
to receive odd parity.  This was not just a data-interchange convention;
many of these programs internally processed the byte as an atomic unit,
parity bit and all.  As John Cowan pointed out, on some systems the 8th
bit was very much considered part of the "character," even though
according to your model (which I do think makes sense) it is really a
separate field within an 8-bit-wide data structure.

> In ASCII, and in all other ISO 646 charsets, code positions are ALL in
> the range 0 to 127. Nothing is defined outside that range, just as
> Unicode does not define or mandate anything for code points larger
> than 0x10FFFF, whether characters are stored or handled in memory in
> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
> architecture or network framing constraints.

This is why it's perfectly legal to design your own TES or other
structure for carrying Unicode (or even ASCII) code points.  Inside your
own black box, it doesn't matter what you do, as long as you don't
corrupt data.  But when communicating with the outside world, one needs
to adhere to established standards.

> Neither Unicode nor US-ASCII nor ISO 646 defines what an application
> can do there. The code positions or code points they define are
> *unique* only in their *definition domain*. If you use a larger domain
> of values, nothing in Unicode or ISO 646 or ASCII defines how to
> interpret the value: these standards will NOT assume that the low-
> order bits can safely be used to index equivalence classes, because
> those equivalence classes cannot be defined strictly within the
> definition domain of these standards.

What I think you are saying is this (and if so, I agree with it):

If I want to design a 32-bit structure that contains a Unicode code
point in 21 of the bits and something else in the remaining 11 -- or
(more generally) uses values 0 through 0x10FFFF for Unicode characters
and other values for something different -- I can do so.  But I MUST NOT
represent this as some sort of extension of Unicode, and I MUST adhere
to all the conformance rules of Unicode inasmuch as they relate to the
part of my structure that purports to represent a code point.  And I
SHOULD be very careful about passing these around to the outside world,
lest someone get the wrong impression.  Same for ASCII.
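
As an illustration only, here is a sketch in C of that kind of private
32-bit structure.  The 21/11 split, the masks, and the names are
invented for this example and do not come from Unicode or any other
standard.

    #include <stdint.h>

    /* Hypothetical private structure: the low 21 bits hold a Unicode
     * code point, the high 11 bits hold unrelated application data. */
    #define CP_MASK   0x001FFFFFu   /* 21-bit code point field */
    #define APP_SHIFT 21            /* application data sits above it */

    /* Values above 0x10FFFF are outside Unicode's definition domain. */
    static int is_unicode_code_point(uint32_t v)
    {
        return v <= 0x10FFFF;
    }

    static uint32_t pack(uint32_t code_point, uint32_t app_bits)
    {
        /* The caller must check is_unicode_code_point(code_point)
         * before presenting the packed value to anyone else as
         * carrying a Unicode character. */
        return ((app_bits & 0x7FFu) << APP_SHIFT) | (code_point & CP_MASK);
    }

    static uint32_t unpack_code_point(uint32_t packed)
    {
        return packed & CP_MASK;
    }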

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


