Re: Misuse of 8th bit [Was: My Querry]
John Cowan wrote:

> No, I don't agree with this part. Unicode just isn't going to expand
> past 0x10FFFF unless Earth joins the Galactic Empire. So the upper
> bits are indeed free for private uses.

A few years ago there was the "Whistler Constant," which basically stated that at current growth rates it would take over 900 years to fill all 15 of Unicode's publicly available planes. Since the rate of growth is actually decreasing every year -- there being no as-yet-undiscovered Chinas to contribute an extra 100,000 characters -- the projected date of exhaustion is not getting any closer.

There is also decreasing support, financial and technical, for significant new character additions outside of Han. Many scripts have been on the Roadmap for five or more years, and comparatively few have been added since that time. Some may never be encoded, sadly.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Re: Misuse of 8th bit [Was: My Querry]
Antoine Leca scripsit:

> In a similar vein, I cannot agree that it could be advisable to use
> the 22nd, 23rd, 32nd, 63rd, etc. -- the upper bits of the storage of
> a Unicode code point. Right now, nobody sees any use for them as part
> of characters, but history should have taught us to prevent this kind
> of optimisation from occurring.

No, I don't agree with this part. Unicode just isn't going to expand past 0x10FFFF unless Earth joins the Galactic Empire. So the upper bits are indeed free for private uses.

> Particularly when it is NOT defined by the standards: such a
> situation leads everybody and his dog to find his own particular
> "optimum" use for this "free space", and these classes of optimums do
> not generally coincide...

I don't think this matters as long as the upper bits are not used in interchange. For example, it would be reasonable to represent Unicode characters as immediates on a virtual machine by using some pattern in the upper bits that flags them as characters.

--
John Cowan  [EMAIL PROTECTED]
http://www.ccil.org/~cowan  http://www.reutershealth.com
"Eric Raymond is the Margaret Mead of the Open Source movement."
        --Bruce Perens, some years ago
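Cowan's virtual-machine suggestion can be sketched concretely. The tag value and field layout below are invented for illustration (nothing of the sort is defined by Unicode itself); the point is only that the tag lives strictly inside the VM and is masked off before any value is interchanged:

```python
# Hypothetical sketch of the "character immediate" idea: a 32-bit word
# whose upper bits carry a VM-internal type tag, with the Unicode code
# point in the low 21 bits. The tag value is an arbitrary choice here.

CHAR_TAG = 0x7F << 24          # invented tag marking the word as a character
CODE_POINT_MASK = 0x001FFFFF   # low 21 bits hold the code point (max 0x10FFFF)

def box_char(cp: int) -> int:
    """Wrap a code point in a tagged 32-bit immediate (internal use only)."""
    assert 0 <= cp <= 0x10FFFF
    return CHAR_TAG | cp

def unbox_char(word: int) -> int:
    """Strip the tag before the value leaves the VM -- never interchange it."""
    return word & CODE_POINT_MASK

boxed = box_char(0x10FFFF)
assert boxed != 0x10FFFF             # tagged form differs from the raw code point
assert unbox_char(boxed) == 0x10FFFF # but the code point's identity is preserved
```

The design only works because the boxed form never escapes the VM, which is exactly Cowan's "not used in interchange" proviso.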
Re: Misuse of 8th bit [Was: My Querry]
The fact is, once you dedicate the top bits in a pipe to some purpose, you've narrowed the width of the pipe. That's what happened to those systems that implemented a 7-bit pipe for ASCII by using the top bit for other purposes. And everybody seems to agree that when you serialize such an encoding, the 'unused' bits indeed do need to be set to 0. 0xFFF0FFFF is *not* the same as 0x0010FFFF. Only the second example is the correct UTF-32 value for the largest Unicode code point.

However, even strictly internal use of the lesser number of bits, though not illegal or incorrect, can be *unwise*. It limits the ways such a system can be enabled for other character sets. Now, while ASCII was something of a minimal character set, Unicode strives to be universal. The chances of getting burned by limiting your architecture to the features of a single character set are inversely proportional to its scope and coverage.

In an ideal world, Unicode would satisfy all needs, present and future, and you could build systems that can only ever deal with Unicode. And many such systems are being built and will work quite well. However, there's always a chance that someday some other coding system may need to be used in parts of your system, and you may well be happy to have kept your plumbing generically 32-bit. Call it engineer's caution, if you will.

A./
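Freytag's serialization point can be illustrated with a minimal sketch (the function name is invented): a conformant UTF-32 writer must emit the upper 11 bits of every 32-bit code unit as zero, so U+10FFFF goes on the wire as 0x0010FFFF and nothing else:

```python
import struct

MAX_CP = 0x10FFFF  # the largest Unicode code point

def to_utf32_be(cp: int) -> bytes:
    """Serialize one code point as a big-endian UTF-32 code unit.
    The upper 11 bits of the 32-bit unit are always zero on the wire."""
    if not (0 <= cp <= MAX_CP) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    return struct.pack(">I", cp)

assert to_utf32_be(MAX_CP) == b"\x00\x10\xff\xff"   # 0x0010FFFF, upper bits clear
# A unit with the upper bits set, such as 0xFFF0FFFF, is simply not valid UTF-32.
```

Note that surrogate code points are excluded as well: UTF-32 code units represent Unicode scalar values, not arbitrary 21-bit numbers.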
Re: Misuse of 8th bit [Was: My Querry]
From: "Antoine Leca" <[EMAIL PROTECTED]>

> On Thursday, November 25th, 2004 08:05Z Philippe Verdy va escriure:
>
>> In ASCII, or in all other ISO 646 charsets, code positions are ALL
>> in the range 0 to 127. Nothing is defined outside of this range,
>> exactly like Unicode does not define or mandate anything for code
>> points larger than 0x10FFFF, should they be stored or handled in
>> memory with 21-, 24-, 32-, or 64-bit code units, more or less packed
>> according to architecture or network framing constraints. So the
>> question of whether an application can or cannot use the extra bits
>> is left to the application, and this has no influence on the
>> standard charset encoding or on the encoding of Unicode itself.
>
> What you seem to miss here is that, given that computers are nowadays
> based on 8-bit units, there was a strong move in the '80s and the
> '90s to _reserve_ ALL 8 bits of the octet for characters. And what
> A. Freytag was asking was precisely to avoid introducing different
> ideas about possibilities for encoding other classes of information
> inside the 8th bit of an ASCII-based storage of a character.

This is true, for example, in an API that just says that a "char" (or whatever datatype is used in some convenient language) contains an ASCII code or Unicode code point, and expects that the datatype instance will be equal to the ASCII code or Unicode code point. In that case, the assumption of such an API is that you can compare the "char" instance for equality instead of comparing only the effective code points, and this greatly simplifies programming. So an API that says that a "char" will contain ASCII code positions should always assume that only the instance values 0 to 127 will be used; the same applies if an API says that an "int" contains a Unicode code point. The problem arises only when the same datatype is used to store something else as well (even if it's just a parity bit, or a bit forced to 1).
As long as this is not documented with the API itself, it should not be used, in order to preserve the rational assumption about the identity of chars and the identity of codes. So for me, a protocol that adds a parity bit to the ASCII code of a character is doing that on purpose, and this should be isolated in the documented part of its API. If the protocol wants to send this data to an API or interface that does not document this use, it should remove/clear the extra bit, to make sure that the character identity is preserved and interpreted correctly. (I can't see how such a protocol implementation can expect that a '@' character coded as 192 will be correctly interpreted by a simpler interface that expects all '@' instances to be equal to 64...)

In safe programming, any unused field in a storage unit should be given a mandatory default. As the simplest form that preserves the code identity in ASCII, or code point identity in Unicode, is the one that uses 0 as this default, extra bits should be cleared. If not, anything can happen within the recipient of the "character":

- The recipient may interpret the value as something other than a character, behaving as if the character data were absent (so there will be data loss, in addition to unexpected behavior). This is bad practice, given that it is not documented in the recipient API or interface.

- The recipient may interpret the value as another character, or may not recognize the expected character. This is not clearly a bad programming practice for recipients, because it is the simplest form of handling for them. However, the recipient will not behave the way the sender expects, and it is the sender's fault, not the recipient's.

- The recipient may take additional unexpected actions in addition to the normal handling of the character without the extra bits. This would be a bad programming practice for recipients if the specific behavior is not documented, since senders should not need to care about it.
- The recipient may filter/ignore the value completely, resulting in data loss; this may sometimes be good practice, but only if this recipient behavior is documented.

- The recipient may filter/ignore the extra bits (for example by masking); for me this is a bad programming practice for recipients...

- The recipient may substitute another value for the incorrect one (such as a SUB ASCII control, or a U+FFFD Unicode replacement character, to mark the presence of an error without changing the string length).

- An exception may be raised (so the interface will fail) because the given value does not belong to the expected ASCII code range or Unicode code point range. (The safest practice for recipients working under the "design by contract" model is to check the domain value range of all incoming data or parameters, to force senders to obey the contract.)

Don't expect blindly that any interface capable of accepting ASCII codes in 8-bit code units will also accept transparently all values outside of the restricted ASCII code range, unless this behavior is documented.
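A few of the recipient strategies listed above can be sketched side by side; the function names are invented, and '@' (ASCII 64) arriving as 192 is the parity-bit case discussed earlier:

```python
# Three recipient strategies for a value that should be a character code
# but may carry extra bits. Names are illustrative only.

def accept_strict(byte: int) -> int:
    """Design-by-contract recipient: reject anything outside ASCII's domain."""
    if not 0 <= byte <= 127:
        raise ValueError(f"not an ASCII code: {byte}")
    return byte

def accept_masking(byte: int) -> int:
    """Recipient that silently masks bit 7 (arguably bad practice: it may
    'repair' data that was never parity-coded at all)."""
    return byte & 0x7F

def accept_substituting(cp: int) -> int:
    """Unicode-style recipient: replace out-of-range values with U+FFFD."""
    return cp if 0 <= cp <= 0x10FFFF else 0xFFFD

assert accept_masking(192) == 64              # '@' with an odd-parity bit stripped
assert accept_substituting(0x7F10FFFF) == 0xFFFD
try:
    accept_strict(192)
except ValueError:
    pass  # the strict recipient forces senders to honor the contract
```

The strict variant matches the "safest practice" in the last bullet: it pushes the cost of cleanup back onto the sender, where the extra bits were introduced.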
Re: Misuse of 8th bit [Was: My Querry]
On Thursday, November 25th, 2004 08:05Z Philippe Verdy va escriure: > > In ASCII, or in all other ISO 646 charsets, code positions are ALL in > the range 0 to 127. Nothing is defined outside of this range, exactly > like Unicode does not define or mandate anything for code points > larger than 0x10, should they be stored or handled in memory with > 21-, 24-, 32-, or 64-bit code units, more or less packed according to > architecture or network framing constraints. > So the question of whever an application can or cannot use the extra > bits is left to the application, and this has no influence on the > standard charset encoding or on the encoding of Unicode itself. What you seem to miss here is that given computers are nowadays based on 8-bit units, there have been a strong move in the '80s and the '90s to _reserve_ ALL the 8 bits of the octet for characters. And what was asking A. Freitag was precisely to avoid bringing different ideas about possibilities to encode other class of informations inside the 8th bit of a ASCII-based storage of a character. In a similar vein, I cannot be in agreement that it could be advisable to use the 22th, 23th, 32th, 63th, etc., the upper bits of the storage of a Unicode codepoint. Right now, nobody is seeing any use for them as part of characters, but history should have learned us we should prevent this kind of optimisations to occur. Particularly when it is NOT defined by the standards: such a situation leads everybody and his dog to find his particular "optimum" use for these "free space", and these classes of optimums do not generally collides between them... Antoine
Re: Misuse of 8th bit [Was: My Querry]
Philippe Verdy wrote:

> Whether an application chooses to use the 8th (or even 9th...) bit of
> a storage or memory or networking byte, also used to store an
> ASCII-coded character, as a zero, as an even or odd parity bit, or
> for other purposes, is the choice of the application. It does not
> change the fact that this extra bit (or bits) is not used to code the
> character itself. I see this usage as a data structure that
> *contains* (I don't say *is*) a character code. This is completely
> off-topic for the ASCII encoding itself, which is concerned only with
> the codes assigned to characters, and only characters.

Unfortunately, although *we* understand this distinction, most people outside this list will not. And to make things worse, they will use language that only serves to blur the distinction.

For example, the term "8-bit ASCII" was formerly used to mean an 8-bit byte that contained an ASCII character code in the bottom 7 bits, and where bit 7 (the MSB) might be:

- always 0
- always 1
- odd or even parity

depending on the implementation. (This was before the 1980s, when companies started populating code points 128 and beyond with "extended Latin" letters and other goodies, and calling *that* 8-bit ASCII.)

Implementations would pass these 8-bit thingies around, bit 7 and all, and expect them to remain unscathed. Programs that emitted bit 7 = 1 expected to receive bit 7 = 1. Those that emitted odd parity expected to receive odd parity. This was not just a data-interchange convention; many of these programs internally processed the byte as an atomic unit, parity bit and all. As John Cowan pointed out, on some systems the 8th bit was very much considered part of the "character," even though according to your model (which I do think makes sense) it is really a separate field within an 8-bit-wide data structure.

> In ASCII, or in all other ISO 646 charsets, code positions are ALL in
> the range 0 to 127.
> Nothing is defined outside of this range, exactly like Unicode does
> not define or mandate anything for code points larger than 0x10FFFF,
> should they be stored or handled in memory with 21-, 24-, 32-, or
> 64-bit code units, more or less packed according to architecture or
> network framing constraints.

This is why it's perfectly legal to design your own TES or other structure for carrying Unicode (or even ASCII) code points. Inside your own black box, it doesn't matter what you do, as long as you don't corrupt data. But when communicating with the outside world, one needs to adhere to established standards.

> Neither Unicode nor US-ASCII nor ISO 646 defines what an application
> can do there. The code positions or code points they define are
> *unique* only in their *definition domain*. If you use larger domains
> for values, nothing in Unicode or ISO 646 or ASCII defines how to
> interpret the value: these standards will NOT assume that the
> low-order bits can safely be used to index equivalence classes,
> because these equivalence classes cannot be defined strictly within
> the definition domain of these standards.

What I think you are saying is this (and if so, I agree with it): If I want to design a 32-bit structure that contains a Unicode code point in 21 of the bits and something else in the remaining 11 -- or (more generally) uses values 0 through 0x10FFFF for Unicode characters and other values for something different -- I can do so. But I MUST NOT represent this as some sort of extension of Unicode, and I MUST adhere to all the conformance rules of Unicode inasmuch as they relate to the part of my structure that purports to represent a code point. And I SHOULD be very careful about passing these around to the outside world, lest someone get the wrong impression. Same for ASCII.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
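The 32-bit structure Doug describes might look like the following sketch (the field layout and names are invented): a private record that *contains* a code point, not an "extension of Unicode", and one that must be unpacked before anything crosses the black-box boundary:

```python
# An invented 32-bit record: low 21 bits hold a Unicode code point,
# the upper 11 bits hold unrelated application flags.

FLAG_SHIFT = 21
CODE_POINT_MASK = 0x1FFFFF

def pack(cp: int, flags: int) -> int:
    """Combine a code point and application flags into one 32-bit word."""
    assert 0 <= cp <= 0x10FFFF and 0 <= flags < (1 << 11)
    return (flags << FLAG_SHIFT) | cp

def unpack(record: int) -> tuple[int, int]:
    """Recover (code_point, flags); only the code point may be interchanged."""
    return record & CODE_POINT_MASK, record >> FLAG_SHIFT

record = pack(0x10FFFF, 0b101)
assert unpack(record) == (0x10FFFF, 0b101)
```

The conformance rules apply to the 21-bit field alone: the pack/unpack boundary is where the private structure ends and standard Unicode begins.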
Re: Misuse of 8th bit [Was: My Querry]
From: "Antoine Leca" <[EMAIL PROTECTED]>

> On Wednesday, November 24th, 2004 22:16Z Asmus Freytag va escriure:
>
>> I'm not seeing a lot in this thread that adds to the store of
>> knowledge on this issue, but I see a number of statements that are
>> easily misconstrued or misapplied, including the thoroughly
>> discredited practice of storing information in the high bit, when
>> piping seven-bit data through eight-bit pathways. The problem with
>> that approach, of course, is that the assumption that there were
>> never going to be 8-bit data in these same pipes proved fatally
>> wrong.
>
> Since I was the person who introduced this theme into the thread, I
> feel there is an important point that should be highlighted here. The
> "widely discredited practice of storing information in the high bit"
> is in fact, like the Y2K problem, a bad consequence of past
> practices. The only difference is that we do not have a hard time
> limit to solve it.

Whether an application chooses to use the 8th (or even 9th...) bit of a storage or memory or networking byte, also used to store an ASCII-coded character, as a zero, as an even or odd parity bit, or for other purposes, is the choice of the application. It does not change the fact that this extra bit (or bits) is not used to code the character itself. I see this usage as a data structure that *contains* (I don't say *is*) a character code. This is completely off-topic for the ASCII encoding itself, which is concerned only with the codes assigned to characters, and only characters.

In ASCII, or in all other ISO 646 charsets, code positions are ALL in the range 0 to 127. Nothing is defined outside of this range, exactly like Unicode does not define or mandate anything for code points larger than 0x10FFFF, should they be stored or handled in memory with 21-, 24-, 32-, or 64-bit code units, more or less packed according to architecture or network framing constraints.
So the question of whether an application can or cannot use the extra bits is left to the application, and this has no influence on the standard charset encoding or on the encoding of Unicode itself.

So a good question to ask is how to handle values of variables or instances that are supposed to contain a character code, but whose internal storage can make values outside the designed range fit in the storage code unit. For me this is left to the application, but many applications will simply assume that such a datatype is made to accept a unique code per designated character. Using the extra storage bits for something else will break this legitimate assumption, and so applications must be specially prepared to handle this case, by filtering values before checking for character identity.

Neither Unicode nor US-ASCII nor ISO 646 defines what an application can do there. The code positions or code points they define are *unique* only in their *definition domain*. If you use larger domains for values, nothing in Unicode or ISO 646 or ASCII defines how to interpret the value: these standards will NOT assume that the low-order bits can safely be used to index equivalence classes, because these equivalence classes cannot be defined strictly within the definition domain of these standards.

So I see no valid rationale behind requiring applications to clear the extra bits, or to leave the extra bits unaffected, or forcing these applications to necessarily interpret the low-order bits as valid code points. We are out of the definition domain, so any larger domain is application-specific, and applications may as well store ASCII or Unicode within storage code units that add some offset, or multiply the standard codes by a constant, or apply a reordering transformation (permutation) on them and other possible non-character values.
When ASCII and ISO 646 in general define a charset with 128 unique code positions, they don't say how this information will be stored (an application may as well need to use 7 distinct bytes (or other structures...), not necessarily consecutive, to *represent* the unique codes that represent ASCII or ISO 646 characters), and they don't restrict the usage of these codes separately of any other independent information (such as parity bits, or anything else). Any storage structure that allows keeping the identity and equivalences of the original standard code in its definition domain is equally valid as a representation of the standard, but such a structure is out of scope of the charset definition.
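Verdy's closing point can be made concrete with a toy sketch: any reversible, application-chosen transformation (here an invented offset scheme) is an equally valid *representation* of the charset, because code identity survives the round trip:

```python
# Invented application-specific storage scheme: ASCII codes are stored
# offset by a constant, modulo 256. The charset standard says nothing
# about this; only the preservation of code identity matters.

OFFSET = 0x30  # arbitrary application-chosen constant

def store(code: int) -> int:
    """Map a standard ASCII code into the application's private storage form."""
    assert 0 <= code <= 127
    return (code + OFFSET) % 256

def retrieve(stored: int) -> int:
    """Invert the transformation, recovering the standard code."""
    return (stored - OFFSET) % 256

# Identity and equivalence are preserved for the whole definition domain,
# so this structure faithfully represents ASCII despite looking nothing like it.
for c in range(128):
    assert retrieve(store(c)) == c
```

The same argument covers the multiply-by-a-constant and permutation schemes Verdy mentions: what matters is invertibility over the definition domain, not the bit pattern in storage.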
Misuse of 8th bit [Was: My Querry]
On Wednesday, November 24th, 2004 22:16Z Asmus Freytag va escriure:

> I'm not seeing a lot in this thread that adds to the store of
> knowledge on this issue, but I see a number of statements that are
> easily misconstrued or misapplied, including the thoroughly
> discredited practice of storing information in the high bit, when
> piping seven-bit data through eight-bit pathways. The problem with
> that approach, of course, is that the assumption that there were
> never going to be 8-bit data in these same pipes proved fatally
> wrong.

Since I was the person who introduced this theme into the thread, I feel there is an important point that should be highlighted here. The "widely discredited practice of storing information in the high bit" is in fact, like the Y2K problem, a bad consequence of past practices. The only difference is that we do not have a hard time limit to solve it.

The practice itself disappeared quite a long time ago (as I wrote, I myself used it back in 1980, and perhaps also in 1984, in a Forth interpreter that overused this "feature"), and right now nobody in his right mind would even think of this idea. (OK, that is too strong; certainly one can show me examples of present-day uses, probably more in the U.S.A. than elsewhere, just as I was able to encounter projects /designed/ in 1998 with years stored as 2 digits, and then collating dates as YYMMDD.)

However, what is a real problem right now is the still widespread idea that this practice remains common, and that the data should therefore be "*corrected*". So one should use toascii() and similar mechanisms that take the /supposedly corrupt/ input and make it "good compliant 8-bit US-ASCII", as some of the answers that were made to me pointed out.
It should now be obvious that a program that *keeps* any parity information received on a telecommunication line, and passes it unmodified to the next DTE, is less of a problem with respect to eventual UTF-8 data than the equivalent program that unconditionally *removes* the 8th bit. The crude reality is that the problem you refer to above really comes from these castrating practices, NOT from the now-retired programs of the '70s that, for economy, reused the 8th bit to store other information along the pipeline.

And I note that nobody in this thread advocated USING the 8th bit. However, I saw remarks about possible PREVIOUS uses of it (and these remarks were accompanied by the relevant "I remember" and "it reminds me" markers that show advice from experienced people toward newbies, rather than easily misconstrued or misapplied statements). On the other hand, I also saw references to practices of /discarding/ the 8th bit when one receives "US-ASCII" data (some might even be misconstrued to make one believe it was normative to do so); these latter references did not come with the same "I remember" markers, quite the contrary, and present practices of Internet mail will quickly show that these practices are still in use.

In other words, I believe the practice of /storing/ data into the 8th bit is effectively discredited. What we really need today is to discredit ALSO the practice of /removing/ information from the 8th bit.

Antoine
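Antoine's warning is easy to demonstrate: a toascii()-style filter that clears bit 7 of every byte passes 7-bit data untouched, but silently destroys any UTF-8 travelling through the same pipe:

```python
# The "castrating practice" in action: clearing bit 7 of every byte, as a
# toascii()-style filter would, mangles multi-byte UTF-8 sequences.

def strip_8th_bit(data: bytes) -> bytes:
    """Clear the high bit of every byte (the practice being criticized)."""
    return bytes(b & 0x7F for b in data)

utf8 = "café".encode("utf-8")             # b'caf\xc3\xa9' -- 'é' takes two bytes
mangled = strip_8th_bit(utf8)

assert strip_8th_bit(b"plain ASCII") == b"plain ASCII"  # 7-bit data survives
assert mangled == b"cafC)"                # but the two bytes of 'é' become 'C' and ')'
```

The corruption is unrecoverable: once the high bits are gone, nothing distinguishes the mangled bytes from genuine ASCII, which is exactly why the removing practice deserves the same discredit as the storing one.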