Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread Doug Ewell
John Cowan  wrote:

> No, I don't agree with this part.  Unicode just isn't going to expand
> past 0x10FFFF unless Earth joins the Galactic Empire.  So the upper
> bits are indeed free for private uses.

A few years ago there was the "Whistler Constant," which basically
stated that at current growth rates it would take over 900 years to fill
all 15 of Unicode's publicly available planes.

Since the rate of growth is actually decreasing every year -- there
being no as-yet-undiscovered Chinas to contribute an extra 100,000
characters -- the projected date of exhaustion is not getting any
closer.

There is also decreasing support, financial and technical, for
significant new character additions outside of Han.  Many scripts have
been on the Roadmap for five or more years, and comparatively few have
been added since that time.  Some may never be encoded, sadly.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread John Cowan
Antoine Leca wrote:

> In a similar vein, I cannot agree that it would be advisable to use the
> 22nd, 23rd, 32nd, 63rd, etc. bits (the upper bits of the storage of a
> Unicode code point). Right now, nobody sees any use for them as part of
> characters, but history should have taught us to prevent this kind
> of optimisation from occurring.

No, I don't agree with this part.  Unicode just isn't going to expand
past 0x10FFFF unless Earth joins the Galactic Empire.  So the upper bits
are indeed free for private uses.

> Particularly when it is NOT defined by the
> standards: such a situation leads everybody and his dog to find his
> own particular "optimum" use for this "free space", and these optima
> do not generally coincide with one another...

I don't think this matters as long as the upper bits are not used in
interchange.  For example, it would be reasonable to represent Unicode
characters as immediates on a virtual machine by using some pattern in
the upper bits that flags them as characters.
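
For what it's worth, a minimal sketch of what such a tagged immediate might
look like in C (the tag value, layout and function names are my own
illustration, not anything specified above):

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical VM immediate: 32 bits, with the top byte used as a type
     * tag. A code point needs only 21 bits, so the upper bits are free
     * inside the VM. */
    #define TAG_CHAR 0x7Eu                /* arbitrary "this is a character" tag */

    static uint32_t make_char_imm(uint32_t cp) {
        assert(cp <= 0x10FFFF);           /* stay within the Unicode code space */
        return (TAG_CHAR << 24) | cp;
    }

    static int is_char_imm(uint32_t imm) {
        return (imm >> 24) == TAG_CHAR;
    }

    static uint32_t char_imm_to_cp(uint32_t imm) {
        return imm & 0x001FFFFF;          /* strip the tag before any interchange */
    }

The essential point is that the tag never leaves the virtual machine; on
output the value is masked back down to a bare code point.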

-- 
Eric Raymond is the Margaret Mead          John Cowan
of the Open Source movement.               [EMAIL PROTECTED]
        --Bruce Perens, some years ago     http://www.ccil.org/~cowan
                                           http://www.reutershealth.com



Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread Asmus Freytag
The fact is, once you dedicate the top bits in a pipe to some purposes, 
you've narrowed the width of the pipe. That's what happened to those 
systems that implemented a 7-bit pipe for ASCII by using the top bit for 
other purposes.

And everybody seems to agree that when you serialize such an encoding the 
'unused' bits indeed do need to be set to 0.  0xFFF0FFFF is *not* the same 
as 0x0010FFFF. Only the second example is the correct UTF-32 value for the 
largest Unicode code point.
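
As a small illustration (the function name is mine, purely a sketch): a
writer that has been carrying flags in the upper bits internally would have
to mask them off before emitting UTF-32, e.g.

    #include <stdint.h>

    /* Clear any internal flag bits before a code unit leaves the system:
     * 0xFFF0FFFF is not a valid UTF-32 code unit, 0x0010FFFF is. */
    static uint32_t to_utf32(uint32_t internal) {
        uint32_t cp = internal & 0x001FFFFF;    /* keep only the 21 code point bits */
        return (cp <= 0x10FFFF) ? cp : 0xFFFD;  /* anything else becomes U+FFFD */
    }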

However, even strictly internal use of the lesser number of bits, though 
not illegal or incorrect, can be *unwise*. It limits the ways such a 
system can be enabled for other character sets.

Now, while ASCII was something of a minimal character set, Unicode strives 
to be universal. The chances of getting burned by limiting your 
architecture to the features of a single character set are  inversely 
proportional to its scope and coverage.

In an ideal world, Unicode would satisfy all needs, present and future, and 
you could build systems that can only ever deal with Unicode. And many such 
systems are being built and will work quite well. However, there's always a 
chance that someday some other coding system(*) may need to be used in 
parts of your system, and you may well be happy to have kept your plumbing 
generically 32-bit.

Call it engineer's caution, if you will.
A./



Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread Philippe Verdy
From: "Antoine Leca" <[EMAIL PROTECTED]>
> On Thursday, November 25th, 2004 08:05Z Philippe Verdy wrote:
>> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
>> the range 0 to 127. Nothing is defined outside of this range, exactly
>> like Unicode does not define or mandate anything for code points
>> larger than 0x10FFFF, should they be stored or handled in memory with
>> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
>> architecture or network framing constraints.
>> So the question of whether an application can or cannot use the extra
>> bits is left to the application, and this has no influence on the
>> standard charset encoding or on the encoding of Unicode itself.
>
> What you seem to miss here is that, given that computers are nowadays
> based on 8-bit units, there was a strong move in the '80s and the '90s to
> _reserve_ ALL 8 bits of the octet for characters. And what A. Freytag was
> asking was precisely to avoid bringing up different ideas about
> possibilities to encode other classes of information inside the 8th bit
> of an ASCII-based storage of a character.

This is true, for example, of an API that just says that a "char" (or 
whatever datatype is convenient in some language) contains an ASCII code or 
Unicode code point, and expects that the datatype instance will be equal to 
the ASCII code or Unicode code point.
In that case, the assumption of such an API is that you can compare "char" 
instances for equality directly, instead of comparing only the effective 
code points, and this greatly simplifies the programming.
So an API that says that a "char" will contain ASCII code positions should 
always assume that only the instance values 0 to 127 will be used; the same 
applies if an API says that an "int" contains a Unicode code point.

The problem arises only when the same datatype is also used to store 
something else (even if it's just a parity bit or a bit forced to 1).

As long as this is not documented with the API itself, it should not be 
done, in order to preserve the reasonable assumption that identity of chars 
means identity of codes.

So for me, a protocol that adds a parity bit to the ASCII code of a 
character is doing that on purpose, and this should be isolated in that 
documented part of its API. If the protocol wants to send this data to an 
API or interface that does not document this use, it should remove/clear 
the extra bit, to make sure that the character identity is preserved and 
interpreted correctly (I can't see how such a protocol implementation can 
expect that a '@' character coded as 192 will be correctly interpreted by 
another, simpler interface that expects all '@' instances to be equal 
to 64...).
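
To make the '@' example concrete (a toy illustration of mine, not any
particular protocol): a receiver that was never told about the parity
convention compares raw values and simply does not match.

    #include <stdio.h>

    int main(void) {
        unsigned char sent = 0xC0;  /* '@' (0x40) with a mark/parity 8th bit set */

        /* A receiver that only documents plain ASCII compares raw values: */
        if (sent == '@')
            printf("matched '@'\n");
        else
            printf("0x%02X is not '@' (0x40) to this receiver\n", (unsigned)sent);

        /* Only a receiver that documents the extra-bit convention can recover it: */
        if ((sent & 0x7F) == '@')
            printf("after masking the documented extra bit, it is '@'\n");
        return 0;
    }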

In safe programming, any unused field in a storage unit should be given a 
mandatory default. As the simplest form that preserves the code identity in 
ASCII or the code point identity in Unicode is the one that uses 0 as this 
default, extra bits should be cleared. If they are not, anything can happen 
at the recipient of the "character":

- the recipient may interpret the value as something other than a character, 
behaving as if the character data were absent (so there will be data loss, 
in addition to unexpected behavior). This is bad practice, given that it is 
not documented in the recipient's API or interface.

- the recipient may interpret the value as another character, or may not 
recognize the expected character. It's not clearly a bad programming 
practice for recipients, because it is the simplest form of handling for 
them. However the recipient will not behave the way expected by the sender, 
and it is the sender's fault, not the recipient's fault.

- the recipient may take additional unexpected actions in addition to the 
normal handling of the character without the extra bits. This would be a bad 
programming practice for recipients, if this specific behavior is not 
documented, so senders should not need to care about it.

- the recipient may filter/ignore the value completely... resulting in data 
loss; this may be sometimes a good practice, but only if this recipient 
behavior is documented.

- the recipient may filter/ignore the extra bits (for example by masking); 
for me it's a bad programming practice for recipients...

- the recipient may substitute the incorrect value by another one (such as a 
SUB ASCII control or a U+FFFD Unicode substitute to mark the presence of an 
error, without changing the string length).

- an exception may be raised (so the interface will fail) because the given 
value does not belong to the expected ASCII code range or Unicode code point 
range. This is the safest practice for recipients working under the "design 
by contract" model: check the value range of all incoming data or 
parameters, to force the senders to obey the contract (see the sketch below).
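
A minimal sketch of that boundary check (the function name and the choice
to also reject surrogates are my own assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    /* "Design by contract" at the API boundary: reject anything that is not
     * in the expected code point range, instead of guessing what any extra
     * bits might mean. */
    static bool accept_code_point(uint32_t value) {
        if (value > 0x10FFFF)                    /* outside the Unicode code space */
            return false;
        if (value >= 0xD800 && value <= 0xDFFF)  /* surrogate code points */
            return false;
        return true;
    }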

Don't blindly expect that any interface capable of accepting ASCII codes in 
8-bit code units will also transparently accept all values outside of the 
restricted ASCII code range, unless this behavior is explicitly documented.

Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread Antoine Leca
On Thursday, November 25th, 2004 08:05Z Philippe Verdy wrote:
>
> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
> the range 0 to 127. Nothing is defined outside of this range, exactly
> like Unicode does not define or mandate anything for code points
> larger than 0x10FFFF, should they be stored or handled in memory with
> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
> architecture or network framing constraints.
> So the question of whether an application can or cannot use the extra
> bits is left to the application, and this has no influence on the
> standard charset encoding or on the encoding of Unicode itself.

What you seem to miss here is that, given that computers are nowadays based
on 8-bit units, there was a strong move in the '80s and the '90s to
_reserve_ ALL 8 bits of the octet for characters. And what A. Freytag was
asking was precisely to avoid bringing up different ideas about possibilities
to encode other classes of information inside the 8th bit of an ASCII-based
storage of a character.

In a similar vein, I cannot agree that it would be advisable to use the
22nd, 23rd, 32nd, 63rd, etc. bits (the upper bits of the storage of a
Unicode code point). Right now, nobody sees any use for them as part of
characters, but history should have taught us to prevent this kind of
optimisation from occurring. Particularly when it is NOT defined by the
standards: such a situation leads everybody and his dog to find his own
particular "optimum" use for this "free space", and these optima do not
generally coincide with one another...


Antoine




Re: Misuse of 8th bit [Was: My Querry]

2004-11-25 Thread Doug Ewell
Philippe Verdy  wrote:

> Whether an application chooses to use the 8th (or even 9th...) bit of a
> storage, memory, or networking byte that also stores an ASCII-coded
> character as a zero, as an even or odd parity bit, or for another
> purpose is the choice of the application. It does not change the fact
> that this (these) extra bit(s) is not used to code the character
> itself. I see this usage as a data structure that *contains* (I don't
> say *is*) a character code. This is completely outside the scope of the
> ASCII encoding itself, which is concerned only with the codes assigned
> to characters, and only characters.

Unfortunately, although *we* understand this distinction, most people
outside this list will not.  And to make things worse, they will use
language that only serves to blur the distinction.

For example, the term "8-bit ASCII" was formerly used to mean an 8-bit
byte that contained an ASCII character code in the bottom 7 bits, and
where bit 7 (the MSB) might be:

- always 0
- always 1
- odd or even parity

depending on the implementation.  (This was before the 1980s, when
companies started populating code points 128 and beyond with "extended
Latin" letters and other goodies, and calling *that* 8-bit ASCII.)

Implementations would pass these 8-bit thingies around, bit 7 and all,
and expect them to remain unscathed.  Programs that emitted bit 7 = 1
expected to receive bit 7 = 1.  Those that emitted odd parity expected
to receive odd parity.  This was not just a data-interchange convention;
many of these programs internally processed the byte as an atomic unit,
parity bit and all.  As John Cowan pointed out, on some systems the 8th
bit was very much considered part of the "character," even though
according to your model (which I do think makes sense) it is really a
separate field within an 8-bit-wide data structure.

> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
> the range 0 to 127. Nothing is defined outside of this range, exactly
> like Unicode does not define or mandate anything for code points
> larger than 0x10FFFF, should they be stored or handled in memory with
> 21-, 24-, 32-, or 64-bit code units, more or less packed according to
> architecture or network framing constraints.

This is why it's perfectly legal to design your own TES or other
structure for carrying Unicode (or even ASCII) code points.  Inside your
own black box, it doesn't matter what you do, as long as you don't
corrupt data.  But when communicating with the outside world, one needs
to adhere to established standards.

> Neither Unicode nor US-ASCII nor ISO 646 defines what an application can
> do there. The code positions or code points they define are *unique*
> only in their *definition domain*. If you use larger domains for
> values, nothing in Unicode or ISO 646 or ASCII defines how to
> interpret the value: these standards will NOT assume that the low-
> order bits can safely be used to index equivalence classes, because
> these equivalence classes cannot be defined strictly within the
> definition domain of these standards.

What I think you are saying is this (and if so, I agree with it):

If I want to design a 32-bit structure that contains a Unicode code
point in 21 of the bits and something else in the remaining 11 -- or
(more generally) uses values 0 through 0x10FFFF for Unicode characters
and other values for something different -- I can do so.  But I MUST NOT
represent this as some sort of extension of Unicode, and I MUST adhere
to all the conformance rules of Unicode inasmuch as they relate to the
part of my structure that purports to represent a code point.  And I
SHOULD be very careful about passing these around to the outside world,
lest someone get the wrong impression.  Same for ASCII.
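
Purely as an illustration of the kind of structure described here (the
field widths, names, and assert-based contract are my own choices):

    #include <assert.h>
    #include <stdint.h>

    /* A private 32-bit record: 21 bits of Unicode code point plus 11 bits
     * of application data. This is an internal structure, not "extended
     * Unicode", and it must never be emitted as if it were UTF-32. */
    static uint32_t pack(uint32_t cp, uint32_t extra) {
        assert(cp <= 0x10FFFF && extra <= 0x7FF);   /* honor both value domains */
        return (extra << 21) | cp;
    }

    static uint32_t unpack_cp(uint32_t rec)    { return rec & 0x1FFFFF; }
    static uint32_t unpack_extra(uint32_t rec) { return rec >> 21; }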

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Misuse of 8th bit [Was: My Querry]

2004-11-25 Thread Philippe Verdy
From: "Antoine Leca" <[EMAIL PROTECTED]>
> On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote:
>> I'm not seeing a lot in this thread that adds to the store of
>> knowledge on this issue, but I see a number of statements that are
>> easily misconstrued or misapplied, including the thoroughly
>> discredited practice of storing information in the high
>> bit, when piping seven-bit data through eight-bit pathways. The
>> problem with that approach, of course, is that the assumption
>> that there were never going to be 8-bit data in these same pipes
>> proved fatally wrong.
>
> Since I was the person who introduced this theme into the thread, I feel
> there is an important point that should be highlighted here. The "widely
> discredited practice of storing information in the high bit" is, in fact,
> like the Y2K problem, a bad consequence of past practices. The only
> difference is that we do not have a hard time limit to solve it.

Whether an application chooses to use the 8th (or even 9th...) bit of a 
storage, memory, or networking byte that also stores an ASCII-coded 
character as a zero, as an even or odd parity bit, or for another purpose 
is the choice of the application. It does not change the fact that this 
(these) extra bit(s) is not used to code the character itself.
I see this usage as a data structure that *contains* (I don't say *is*) a 
character code. This is completely outside the scope of the ASCII encoding 
itself, which is concerned only with the codes assigned to characters, and 
only characters.
In ASCII, or in all other ISO 646 charsets, code positions are ALL in the 
range 0 to 127. Nothing is defined outside of this range, exactly like 
Unicode does not define or mandate anything for code points larger than 
0x10FFFF, should they be stored or handled in memory with 21-, 24-, 32-, or 
64-bit code units, more or less packed according to architecture or network 
framing constraints.
So the question of whether an application can or cannot use the extra bits 
is left to the application, and this has no influence on the standard 
charset encoding or on the encoding of Unicode itself.

So a good question to ask is how to handle values of variables or instances 
that are supposed to contain a character code, but whose internal storage 
can hold values outside the designed range. For me it is left to the 
application, but many applications will simply assume that such a datatype 
is made to accept a unique code per designated character. Using the extra 
storage bits for something else will break this legitimate assumption, so 
applications must be specially prepared to handle this case, by filtering 
values before checking for character identity.

Neither Unicode nor US-ASCII nor ISO 646 defines what an application can do 
there. The code positions or code points they define are *unique* only in 
their *definition domain*. If you use larger domains for values, nothing in 
Unicode or ISO 646 or ASCII defines how to interpret the value: these 
standards will NOT assume that the low-order bits can safely be used to 
index equivalence classes, because these equivalence classes cannot be 
defined strictly within the definition domain of these standards.

So I see no valid rationale behind requiring applications to clear the extra 
bits, or to leave the extra bits unaffected, or forcing these applications 
to necessarily interpret the low-order bits as valid code points.
We are outside the definition domain, so any larger domain is 
application-specific, and applications may as well store ASCII or Unicode 
within code units that add some offset, multiply the standard codes by a 
constant, or apply a reordering transformation (permutation) over them and 
other possible non-character values.
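
For instance (a hypothetical internal form; the bias value is arbitrary),
an application could store code points with a constant offset, and this
remains a faithful representation as long as the mapping is invertible and
never escapes the application:

    #include <stdint.h>

    /* Hypothetical internal form: code points stored with a fixed bias,
     * e.g. so that the value 0 can serve as an "empty slot" marker. */
    #define BIAS 0x100u

    static uint32_t to_internal(uint32_t cp)     { return cp + BIAS; }
    static uint32_t to_code_point(uint32_t slot) { return slot - BIAS; }

    /* Identity is preserved: to_code_point(to_internal(cp)) == cp for every
     * cp, so equality of internal values is still equality of code points. */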

When ASCII, and ISO 646 in general, define a charset with 128 unique code 
positions, they don't say how this information will be stored (an 
application may as well need to use 7 distinct bytes (or other 
structures...), not necessarily consecutive, to *represent* the unique codes 
that represent ASCII or ISO 646 characters), and they don't restrict the 
use of these codes alongside any other independent information (such as 
parity bits, or anything else). Any storage structure that keeps the 
identity and equivalences of the original standard code in its definition 
domain is equally valid as a representation of the standard, but such a 
structure is out of scope for the charset definition.




Misuse of 8th bit [Was: My Querry]

2004-11-25 Thread Antoine Leca
On Wednesday, November 24th, 2004 22:16Z Asmus Freytag wrote:
>
> I'm not seeing a lot in this thread that adds to the store of
> knowledge on this issue, but I see a number of statements that are
> easily misconstrued or misapplied, including the thoroughly
> discredited practice of storing information in the high
> bit, when piping seven-bit data through eight-bit pathways. The
> problem  with that approach, of course, is that the assumption
> that there were never going to be 8-bit data in these same pipes
> proved fatally wrong.

Since I was the person who introduced this theme into the thread, I feel
there is an important point that should be highlighted here. The "widely
discredited practice of storing information in the high bit" is, in fact,
like the Y2K problem, a bad consequence of past practices. The only
difference is that we do not have a hard time limit to solve it.

The practice itself disappeared quite a long time ago (as I wrote, I myself
used it back in 1980, and perhaps also in 1984, in a Forth interpreter that
overused this "feature"), and right now nobody in his right mind would even
think of the idea.
(OK, this is too strong; certainly someone can show me examples of
present-day uses, probably more in the U.S.A. than elsewhere, just as I was
able to encounter projects /designed/ in 1998 with years stored as 2 digits,
and then collating dates on YYMMDD.)

However, what is a real problem right now is the still widespread idea that
this feature is still abundant, and that the data should be "*corrected*":
that one should use toascii() and similar mechanisms that take the
/supposedly corrupt/ input and make it "good compliant 8-bit US-ASCII", as
some of the answers that were made to me pointed out.

It should by now be obvious that a program that *keeps* any parity
information received on a telecommunication line and passes it unmodified to
the next DTE is less of a problem with respect to eventual UTF-8 data than
the equivalent program that actually *removes* the 8th bit unconditionally.
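
To make the point concrete: clearing the 8th bit of bytes that were
actually UTF-8 destroys the data, while passing the bytes through untouched
at least keeps them recoverable. A small sketch (the example string is my
own choice):

    #include <stdio.h>

    int main(void) {
        /* U+00E9 (e with acute accent) in UTF-8 is the two bytes 0xC3 0xA9. */
        unsigned char utf8[2] = { 0xC3, 0xA9 };

        for (int i = 0; i < 2; i++) {
            unsigned char stripped = utf8[i] & 0x7F;  /* what an unconditional
                                                         toascii()-style pass does */
            printf("0x%02X -> 0x%02X ('%c')\n",
                   (unsigned)utf8[i], (unsigned)stripped, stripped);
        }
        /* Output: 0xC3 -> 0x43 ('C') and 0xA9 -> 0x29 (')').
         * The UTF-8 sequence is destroyed and cannot be recovered, whereas a
         * program that merely passed the bytes through would have kept it. */
        return 0;
    }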


The crude reality is that the problem you are referring to above really
comes from these castrating practices, NOT from the now-retired programs of
the '70s that, for economy, reused the 8th bit to store other information
along the pipeline.
And I note that nobody in this thread advocated USING the 8th bit. However,
I saw remarks about possible PREVIOUS uses of it (and these remarks were
accompanied by the relevant "I remember" and "it reminds me" markers that
might show advice from experienced people toward newbies rather than easily
misconstrued or misapplied statements).
On the other hand, I also saw references to practices of /discarding/ the
8th bit when one receives "USASCII" data (some might even be misconstrued to
make one believe it is normative to do so); these latter references did not
come with the same "I remember" markers, quite the contrary; and present
practices of Internet mail will quickly show that these practices are still
in use.

In other words, I believe the practice of /storing/ data into the 8th bit is
effectively discredited. What we really need today is to discredit ALSO the
practice of /removing/ information from the 8th bit.


Antoine