Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
On Mon, Feb 19, 2001 at 05:42:41PM -0800, [EMAIL PROTECTED] wrote:
> A few days ago I said there was a "widespread belief" that Unicode is a
> 16-bit-only character set that ends at U+FFFF. A corollary is that the
> supplementary characters ranging from U+10000 to U+10FFFF are either
> little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.
>
> At least one list member questioned whether this belief was really
> widespread.

Or, for another example, from the Berlin (GUI project) news
(http://www.berlin-consortium.org/news.html#2001-01-10):

    With the Unicode-related functions in Prague growing out of size, I
    moved them into a new library called 'Babylon'. It will provide all the
    functionality defined in the Unicode standard (it is not Unicode but
    ISO 10646 compliant as it uses 32bit wide characters internally) and is
    written in C++.

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and laughs
at me. In fact, I'd be rather honored." - Joseph_Greg
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
On 02/19/2001 08:05:49 PM David Starner wrote:

>With the Unicode-related functions in Prague growing out of size, I moved
>them into a new library called 'Babylon'. It will provide all the
>functionality defined in the Unicode standard (it is not Unicode but ISO
>10646 compliant as it uses 32bit wide characters internally) and is
>written in C++.

Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 - see
UTR #19 or PDUTR #27) or to C++.

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
The error may arise from a misunderstanding of the reference, on the first
page of chapter 1 of the book, to a 16-bit form and an 8-bit form and to
"using a 16-bit encoding." It is also hard to wrap one's head around the
idea that Unicode isn't just an encoding until one has done extensive
reading on the website (or in the book).

Patrick Rourke

- Original Message -
From: <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, February 20, 2001 8:37 AM
Subject: Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in
Unicode)

> On 02/19/2001 08:05:49 PM David Starner wrote:
>
> >With the Unicode-related functions in Prague growing out of size, I
> >moved them into a new library called 'Babylon'. It will provide all the
> >functionality defined in the Unicode standard (it is not Unicode but ISO
> >10646 compliant as it uses 32bit wide characters internally) and is
> >written in C++.
>
> Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 -
> see UTR #19 or PDUTR #27) or with C++.
>
> - Peter
>
> ---
> Peter Constable
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
In a message dated 2001-02-20 06:18:34 Pacific Standard Time,
[EMAIL PROTECTED] writes:

> >With the Unicode-related functions in Prague growing out of size, I
> >moved them into a new library called 'Babylon'. It will provide all the
> >functionality defined in the Unicode standard (it is not Unicode but ISO
> >10646 compliant as it uses 32bit wide characters internally) and is
> >written in C++.
>
> Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 -
> see UTR #19 or PDUTR #27) or with C++.

I believe that was David's point; he was quoting someone else who believed
that a 32-bit representation was compliant with ISO/IEC 10646 but not with
Unicode.

-Doug Ewell
Fullerton, California
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
On Tuesday 20 February 2001 17:03, you wrote:

> In a message dated 2001-02-20 06:18:34 Pacific Standard Time,
> > >into a new library called 'Babylon'. It will provide all the
> > >functionality defined in the Unicode standard (it is not Unicode but
> > >ISO 10646 compliant as it uses 32bit wide characters internally)
> > Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 -
> > see UTR #19 or PDUTR #27) or with C++.
> I believe that was David's point; he was quoting someone else who
> believed that a 32-bit representation was compliant with ISO/IEC 10646
> but not with Unicode.

Hi! Looks like David was quoting me. I am working on Babylon and wanted to
make clear that it is not Unicode conformant, as its API uses 32-bit wide
characters, which violates conformance clause C1 of Section 3.1. Babylon
can import/export UTF-8/16/32 (UTF-7 is in the works), though, so I'm
aiming for 'Unicode compliant interchange of 16-bit Unicode characters'
with Babylon. For more details please see pages 107/108 of the Standard.

I was not implying that Unicode can't coexist with 32-bit wide characters,
nor that it has any problems with C++... maybe I should have someone who
speaks better English than I do write my announcements in the future. Sorry
for any misunderstandings I might have caused.

--
Gruss,
Tobias

---
Tobias Hunger          The box said: 'Windows 95 or better'
[EMAIL PROTECTED]      So I installed Linux.
---
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
The following statements have been made by participants in this thread.

1. A few days ago I said there was a "widespread belief" that Unicode is a
16-bit-only character set that ends at U+FFFF. A corollary is that the
supplementary characters ranging from U+10000 to U+10FFFF are either
little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.

2. Can we put this thread on a constructive footing? I am sure there is
lots of outdated and/or incorrect information out there and I would like to
preempt its being identified via numerous emails here. If the belief is
that there are misperceptions that need to be corrected, how should the
problem be addressed? Bear in mind the volunteer nature of the
organization.

I wonder if some readers might like to have a look at a specific situation.
This would certainly help me and might also provide a useful case study on
the practical problems.

I do not purport to be an expert in Unicode. Unicode is but one of many
interests. I do recognize that Unicode is attempting to be a comprehensive
standard system, and I would like to do what I can within my own research
to utilize the Unicode system.

As some readers may remember, I am producing a computer language called
1456 object code (in speech, "fourteen fifty-six object code"), which is a
computer language expressible using 7-bit ASCII printing characters and
which may be included in the param statements of an applet call in an HTML
page. The applet called then calls a Java class file named
Engine1456.class, and quite substantial computations with graphic output
may be achieved using a combination of ready-prepared standardized Java
classes and programs written in 1456 object code using a text editor. The
benefit is that people who either do not know Java or do not have Java
compiling facilities available may reasonably straightforwardly produce,
using just a text editor such as Notepad, quite elegant graphics programs
with Java-quality graphics.

There is a speed overhead, but, even for fast-running programs, a 1456
object code program can get up to about 40% of the speed of a specially
written Java program. With programs that wait for user input, the
difference in speed may not be noticeable.

The system is fully described at www.users.globalnet.co.uk/~ngo, which is
our family webspace in England, and readers are welcome to study it in full
if they so wish; yet only a few documents need to be studied, and then only
in part, for the purposes of this case study.

The 1456 object code system relies for its underlying standardization on
the fact that the software that interprets the 1456 object code (that is,
the 1456 engine) is written in Java. Therefore 1456 object code immediately
fits in with being usable with a standard Java-enabled browser on the
internet, and also with being usable on the JavaTV system as telesoftware.
As JavaTV may well become a worldwide broadcasting standard, there is
practical importance in 1456 object code having full capability for
handling character strings in all languages that are encoded in Unicode.

Characters are introduced into the 1456 object code system in the document
www.users.globalnet.co.uk/~ngo/14560600.htm, where 1456 object code
characters are said to be "represented using the 16 bit unicode characters
of Java." Various registers are explained there. The two key items for this
discussion, though, are that one may load a character from the software
into a register, as a sort of "load immediate" type instruction, in two
ways. A 7-bit ASCII printing character may be loaded using a two-character
sequence consisting of the ^ character followed by the desired character.
For example, ^E can be used to encode the character U+0045 in the software.
Any 16-bit Unicode character may be loaded by a six-character sequence
consisting of 'u and four hexadecimal characters. So, the character U+0045
could be loaded using 'u0045 in the software.

Clearly, the six-character method can be used for more characters than the
two-character method, as the two-character method can only be used for the
characters that can be entered as 7-bit ASCII printing characters from the
keyboard when programming. Please note that when the 1456 object code is
being obeyed, the character that follows the ^ character already exists as
a 16-bit Java Unicode character within the software, the conversion from
7-bit ASCII to 16-bit Unicode having taken place when it was loaded into
the applet from the param statement of the applet call.

The page www.users.globalnet.co.uk/~ngo/14560700.htm shows how the
six-character method using 'u may also be used in the entry of strings of
characters.

The next page that is needed for this case study is
www.users.globalnet.co.uk/~ngo/14561100.htm, and within that page the
demo2.htm example. Within the source code of the demo2.htm file there are
the following uses of the six-character method.

'u00e9
'u0108
'u011d

For example, the sequence [ C
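The two "load immediate" escape forms described above can be sketched in
Java roughly as follows. This is a hypothetical illustration, not code from
the actual 1456 engine; the class and method names are invented here.

```java
// Hypothetical sketch of the two escape forms William describes:
// "^X" loads the printable ASCII character X, and "'uXXXX" loads an
// arbitrary 16-bit code unit given four hexadecimal digits.
public class EscapeSketch {

    // Decode a two-character caret escape such as "^E" -> 'E' (U+0045).
    static char decodeCaret(String s) {
        if (s.length() != 2 || s.charAt(0) != '^')
            throw new IllegalArgumentException("expected ^ plus one character");
        return s.charAt(1);
    }

    // Decode a six-character escape such as "'u0045" -> U+0045.
    static char decodeHex(String s) {
        if (s.length() != 6 || s.charAt(0) != '\'' || s.charAt(1) != 'u')
            throw new IllegalArgumentException("expected 'u plus four hex digits");
        return (char) Integer.parseInt(s.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println(decodeCaret("^E"));    // prints E
        System.out.println(decodeHex("'u0045"));  // prints E
    }
}
```

Both forms yield the same 16-bit Java char; the six-character form simply
reaches characters that cannot be typed as 7-bit ASCII.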
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
On 02/20/2001 11:18:40 AM Tobias Hunger wrote:

>Looks like David was quoting me. I am working on Babylon and wanted to
>make clear that it is not unicode conformant as its API uses 32bit wide
>characters which violates clause 1 of Section 3.1.

This is something that UTC should clean up, because C1 is obsolete. In
fact, UTC just took that action when they met a couple of weeks ago:

[86-M8] Motion: Amend Unicode 3.1 to change the Chapter 3, C1 conformance
clause to read "A process shall interpret Unicode code units (values) in
accordance with the Unicode transformation format used." (passed)

So, when TUS 3.1 is published later this year, you will not have any
problems with conformance with that version of the Standard. (C1 was really
obsolete back in version 2.0, when UTF-8 was first adopted into the
Standard, but it took a while for that to get fixed.)

- Peter
---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
Tobias Hunger said:

> Looks like David was quoting me. I am working on Babylon and wanted to
> make clear that it is not unicode conformant as its API uses 32bit wide
> characters which violates clause 1 of Section 3.1.

No longer, as Peter pointed out.

> Babylon can im-/export UTF-8/16/32 (UTF-7 is in the works) though, so
> I'm aiming for 'unicode compliant interchange of 16bit Unicode
> characters' with Babylon. For more details please see pages 107/108 of
> the Standard.

Also out of date. This was also subjected to a major revision in the
just-completed UTC meeting. These actions were taken to make it clear to
everyone that use of a 32-bit encoding form is *not* inconsistent with a
claim of compliance to the Unicode Standard, now that UTF-32 has been
officially added as a sanctioned encoding form. From this date forward, no
one should have to jump through hoops to explain how their 32-bit wide
character implementations are and are not conformant to the Unicode
Standard.

Antoine Leca said:

> [EMAIL PROTECTED] wrote:
> >
> > Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 -
> > see UTR #19 or PDUTR #27) or with C++.
>
> Read also TUS 3.0, par. 5.2 on top of page 108...
> As far as I know, neither UAX-29 nor PDUTR-27 has changed these words...
>
> That said, one can see it as an oversight that ought to be corrected.
> As the guy that fought to introduce the widest uses of ISO 10646/Unicode
> in C99, I will certainly welcome any change in this area! ;-)

All taken care of in the rewrite of section 5.2, based on the last UTC
meeting's review of the text of PDUTR #27.

In general, folks, please calm down a little. The text of PDUTR #27 is out
of date -- it was a *Proposed Draft*, after all, for review by the UTC. And
the editorial committee has been working furiously to update the text for
final posting. We decided not to publicly post a bunch of intermediate
drafts every 3 days during this process, to avoid generating more confusion
about the text drift. But the scheduled date for the next public draft of
what will become UAX #27 in the final Unicode 3.1 release is this Friday,
February 23. I cannot promise that all issues will be resolved and all
truth will be revealed in that document, but much of what has been
discussed on this thread should become moot.

--Ken
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
On Tuesday 20 February 2001 19:29, [EMAIL PROTECTED] wrote:

> This is something that UTC should clean up because C1 is obsolete. In
> fact, UTC just took that action when they met a couple of weeks ago:

Wow, that's great news for me. I am currently very involved with my studies
and other projects, so I failed to stay current with post-3.0 changes to
the standard. :-(

I again have to say that I'm sorry for the amount of traffic my simple
oversight has caused on this list.

--
Gruss,
Tobias

---
Tobias Hunger          The box said: 'Windows 95 or better'
[EMAIL PROTECTED]      So I installed Linux.
---
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
Paul Keinänen said:

> > [86-M8] Motion: Amend Unicode 3.1 to change the Chapter 3, C1
> > conformance clause to read "A process shall interpret Unicode code
> > units (values) in accordance with the Unicode transformation format
> > used." (passed)
>
> While this wording makes it possible to handle any 32 bit character
> API implementation as UTF-32, this wording does not make it any easier
> to implement it on processors with an exotic word length. Depending
> how "process" is defined, but a character API implementation on a 24
> bit computer using one word/character could be non-conformant, even if
> the 24 bits (or even 21 bits :-) would be more than sufficient to
> support the 0 .. 10FFFF range.

To the contrary -- nothing in the wording of UTF-32 prevents an
implementation in 24-bit words on a processor that uses such words.

The basic definitions of UTF-32 that talk about *serialization*, in which
case you are dealing with sequences of 4 (8-bit) bytes, are the three
encoding schemes: UTF-32BE, UTF-32LE, and UTF-32. This is serialization for
interchange of data.

As an encoding *form* (i.e. not serialized, but instead with characters
represented in computer datatypes), the assumption is that each Unicode
scalar value will be represented in a 32-bit word, since that is the most
common architecture that people would be using. But nothing would prevent
putting them in 64-bit registers, for example, or 24-bit registers (since
they fit).

The only thing you need to watch out for is that if you *publish* a UTF-32
API outside of a self-contained environment, you had better make sure that
it is using unsigned 32-bit integers, as that is the expectation that would
be required for interoperating with other systems. But the same caution
would apply to any public API involving integral datatypes -- you cannot
willy-nilly pass integral data between a 32-bit API and a 24-bit API.

> It would have been clearer that C1 would only define that code points
> in the 0 .. 10FFFF range should be supported,

That is everywhere implied in the Unicode Standard. There *are* no code
points beyond 10FFFF.

> allowing character API
> implementations (such as dynamically loadable libraries as separate
> products) for processors with exotic word lengths

Allowed. Although I suppose we should add a note in the future pointing out
that 64-bit and 24-bit implementations are to be expected, although not in
a public API that claims it is "UTF-32".

--Ken

> and in a separate
> clause say something about the transformation formats.
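The distinction Ken draws between encoding *form* (in-memory datatype) and
encoding *scheme* (serialized bytes) can be illustrated with a small
sketch. This is an informal example, not normative code: in memory the
scalar value can sit in any integral type wide enough to hold 21 bits, but
the UTF-32BE encoding scheme fixes the interchange representation as four
8-bit bytes, most significant first.

```java
// Sketch: serializing a Unicode scalar value as UTF-32BE, i.e. four
// 8-bit bytes in big-endian order. The in-memory type (here a Java int,
// but a 24- or 64-bit register would also hold the value) is a separate
// matter from this byte serialization.
public class Utf32 {
    static byte[] toUtf32be(int scalar) {
        if (scalar < 0 || scalar > 0x10FFFF)
            throw new IllegalArgumentException("beyond the Unicode range");
        return new byte[] {
            (byte) (scalar >>> 24),  // always 0x00 for Unicode scalars
            (byte) (scalar >>> 16),
            (byte) (scalar >>> 8),
            (byte) scalar
        };
    }

    public static void main(String[] args) {
        byte[] b = toUtf32be(0x10FFFF);  // the last Unicode code point
        for (byte x : b)
            System.out.printf("%02x ", x & 0xFF);  // prints 00 10 ff ff
        System.out.println();
    }
}
```

UTF-32LE would simply emit the same four bytes in the opposite order; the
unmarked UTF-32 scheme allows a BOM to signal which order is in use.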
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
On Tue, 20 Feb 2001 10:29:17 -0800 (GMT-0800), [EMAIL PROTECTED] wrote:

> On 02/20/2001 11:18:40 AM Tobias Hunger wrote:
>
> >Looks like David was quoting me. I am working on Babylon and wanted to
> >make clear that it is not unicode conformant as its API uses 32bit wide
> >characters which violates clause 1 of Section 3.1.
>
> This is something that UTC should clean up because C1 is obsolete. In
> fact, UTC just took that action when they met a couple of weeks ago:
>
> [86-M8] Motion: Amend Unicode 3.1 to change the Chapter 3, C1 conformance
> clause to read "A process shall interpret Unicode code units (values) in
> accordance with the Unicode transformation format used." (passed)

While this wording makes it possible to handle any 32-bit character API
implementation as UTF-32, it does not make it any easier to implement on
processors with an exotic word length. Depending on how "process" is
defined, a character API implementation on a 24-bit computer using one word
per character could be non-conformant, even though 24 bits (or even 21
bits :-) would be more than sufficient to support the 0 .. 10FFFF range.

I have not recently seen BCD computers or 24-bit computers, but at least in
digital signal processors (DSPs) the 24-bit word length is common.

It would have been clearer for C1 to define only that code points in the
0 .. 10FFFF range should be supported, allowing character API
implementations (such as dynamically loadable libraries as separate
products) for processors with exotic word lengths, and in a separate clause
say something about the transformation formats.

Paul Keinänen

> So, when TUS3.1 is published later this year, you will not have any
> problems with conformance with that version of the Standard. (C1 was
> really obsolete back in version 2.0 when UTF-8 was first adopted into
> the Standard, but it took a while for that to get fixed.)
>
> - Peter
>
> ---
> Peter Constable
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
Hi. I took several minutes to scan through your post and I am not sure what
you are asking. Would you like to see some examples, for instance, of real
(assigned) code points that require encoding by surrogate pairs to be
represented as Java char? Looking at what you are trying to do, I think I
would rather try to explain UTF-8, but you indicate you are using Java.

First, a link I couldn't find from the home page:

http://www.unicode.org/charts/draftunicode31

So we have the "musical symbol G clef" at code point 0x1d11e. (I want to
say \u1d11e, but I think that requires a change to Java syntax.) To encode
that in Java chars, we need two chars:

Subtract 0x10000:  0xd11e  (binary 1101 0001 0001 1110)

Split into two pieces of ten bits each by shifting off the bottom ten bits:
(binary 11 0100 | 01 0001 1110)

Hi half: 0x0034  (binary 00 0011 0100)
Lo half: 0x011e  (binary 01 0001 1110)

Add the base of the appropriate surrogate area:

0xd800 + 0x0034 => 0xd834
0xdc00 + 0x011e => 0xdd1e

Store these in two chars:

char[] GClefPair = { '\ud834', '\udd1e' };

Does this answer your question, and could someone check my math?

Hmm. I would still suggest you check out UTF-8 and see if that standard
transformation might make sense for your application.

Joel Rees, Media Fusion KK
Amagasaki, Japan

- Original Message -
From: "William Overington" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, February 21, 2001 2:30 AM
Subject: Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in
Unicode)

> The following statements have been made by participants in this thread.
>
> 1. A few days ago I said there was a "widespread belief" that Unicode is
> a 16-bit-only character set that ends at U+FFFF. A corollary is that the
> supplementary characters ranging from U+10000 to U+10FFFF are either
> little-known or perceived to belong to ISO/IEC 10646 only, not to
> Unicode.
>
> 2. Can we put this thread on a constructive footing? I am sure there is
> lots of outdated and/or incorrect information out there and I would like
> to preempt its being identified via numerous emails here. If the belief
> is that there are misperceptions that need to be corrected, how should
> the problem be addressed? Bear in mind the volunteer nature of the
> organization.
>
> I wonder if some readers might like to have a look at a specific
> situation. This would certainly help me and might also provide a useful
> case study on the practical problems.
>
> [...]
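Joel's arithmetic above checks out, and it can be verified mechanically.
The sketch below (method name is ours, not from any quoted code) performs
the same three steps: subtract 0x10000, split into two ten-bit halves, and
add the surrogate bases.

```java
// Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair, following the steps in Joel's worked example.
public class Surrogates {
    static char[] toSurrogatePair(int scalar) {
        if (scalar < 0x10000 || scalar > 0x10FFFF)
            throw new IllegalArgumentException("not a supplementary code point");
        int v = scalar - 0x10000;                 // 0x1D11E -> 0xD11E
        char hi = (char) (0xD800 + (v >>> 10));   // top ten bits + base
        char lo = (char) (0xDC00 + (v & 0x3FF));  // bottom ten bits + base
        return new char[] { hi, lo };
    }

    public static void main(String[] args) {
        // MUSICAL SYMBOL G CLEF, U+1D11E, from the worked example.
        char[] p = toSurrogatePair(0x1D11E);
        System.out.printf("%04x %04x%n", (int) p[0], (int) p[1]);
        // prints d834 dd1e
    }
}
```

Modern Java versions expose the same computation as
`Character.toChars(int)`, which is worth preferring over hand-rolled code.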
Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)
Hi, William,

I have to admit that I really haven't looked carefully at your
transformation techniques and their intended purpose. But it strikes me
that you might be re-inventing the wheel. A number of schemes exist for
squeezing wide bit patterns into narrow bit streams. UTF-8 has been adopted
by Unicode for squeezing Unicode into 8-bit streams. UTF-7 is a proposal
for squeezing Unicode into 7-bit streams. I strongly urge you to examine
both before you finalize your code. Explanations of UTF-8 are on the
Unicode site (somewhere), but you may need to look up UTF-7 via google.com
or another search site. I assume that you have already examined the "quoted
printable" and "base64" techniques, since the state machine you describe
seems to bear their influence.

I'm glad my quick description helped. You may also want to check your code
against the example Java (I think) source for handling surrogate pairs
available either on the Unicode site or the ISO site for ISO/IEC 10646. I
should have mentioned that in the earlier post, and I apologize.

Joel Rees, Media Fusion KK
Amagasaki, Japan
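For reference, the UTF-8 byte layout Joel recommends examining can be
sketched as follows. This is a simplified illustration of the standard
transformation (one byte for U+0000..U+007F, up to four bytes at U+10FFFF);
a production encoder should follow the normative definition, which also
forbids encoding surrogate code points.

```java
// Simplified UTF-8 encoder sketch: map a code point to 1-4 bytes.
// Lead bytes carry the length; continuation bytes are 10xxxxxx.
public class Utf8Sketch {
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF)
            throw new IllegalArgumentException("beyond the Unicode range");
        if (cp < 0x80)        // 1 byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        if (cp < 0x800)       // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xC0 | (cp >>> 6)),
                                (byte) (0x80 | (cp & 0x3F)) };
        if (cp < 0x10000)     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] { (byte) (0xE0 | (cp >>> 12)),
                                (byte) (0x80 | ((cp >>> 6) & 0x3F)),
                                (byte) (0x80 | (cp & 0x3F)) };
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] { (byte) (0xF0 | (cp >>> 18)),
                            (byte) (0x80 | ((cp >>> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >>> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        for (byte b : encode(0x00E9))  // e-acute, 'u00e9 in 1456 notation
            System.out.printf("%02x ", b & 0xFF);  // prints c3 a9
        System.out.println();
    }
}
```

Note that UTF-8 reaches all of U+0000..U+10FFFF without surrogate pairs,
which is one reason Joel suggests it for a byte-oriented format.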