Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-19 Thread David Starner

On Mon, Feb 19, 2001 at 05:42:41PM -0800, [EMAIL PROTECTED] wrote:
> A few days ago I said there was a "widespread belief" that Unicode is a 
> 16-bit-only character set that ends at U+FFFF.  A corollary is that the 
> supplementary characters ranging from U+10000 to U+10FFFF are either 
> little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.
> 
> At least one list member questioned whether this belief was really widespread.

Or, for another example, from the Berlin (GUI project) news 
(http://www.berlin-consortium.org/news.html#2001-01-10):

With the Unicode-related functions in Prague growing out of size, I moved them
into a new library called 'Babylon'. It will provide all the functionality
defined in the Unicode standard (it is not Unicode but ISO 10646 compliant as
it uses 32bit wide characters internally) and is written in C++.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg



Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread Peter_Constable


On 02/19/2001 08:05:49 PM David Starner wrote:

>With the Unicode-related functions in Prague growing out of size, I moved them
>into a new library called 'Babylon'. It will provide all the functionality
>defined in the Unicode standard (it is not Unicode but ISO 10646 compliant as
>it uses 32bit wide characters internally) and is written in C++.

Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 - see
UTR#19 or PDUTR#27) or C++.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread P. T. Rourke

The error may arise from a misunderstanding of the reference, on the first
page of chapter 1 of the book, to a 16-bit form and an 8-bit form and to
"using a 16-bit encoding."  It's also hard to wrap one's head around
the idea that Unicode isn't just an encoding until one does extensive
reading on the website (or in the book).

Patrick Rourke


- Original Message -
From: <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, February 20, 2001 8:37 AM
Subject: Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in
Unicode)






Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread DougEwell2

In a message dated 2001-02-20 06:18:34 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> > With the Unicode-related functions in Prague growing out of size, I moved them
> > into a new library called 'Babylon'. It will provide all the functionality
> > defined in the Unicode standard (it is not Unicode but ISO 10646 compliant as
> > it uses 32bit wide characters internally) and is written in C++.
> 
> Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 - see
> UTR#19 or PDUTR#27) or C++.

I believe that was David's point; he was quoting someone else who believed 
that a 32-bit representation was compliant with ISO/IEC 10646 but not with 
Unicode.

-Doug Ewell
 Fullerton, California



Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread Tobias Hunger


On Tuesday 20 February 2001 17:03, you wrote:
> In a message dated 2001-02-20 06:18:34 Pacific Standard Time,
> [EMAIL PROTECTED] writes:
> > > into a new library called 'Babylon'. It will provide all the
> > > functionality defined in the Unicode standard (it is not Unicode but
> > > ISO 10646 compliant as it uses 32bit wide characters internally)
> 
> > Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 -
> > see UTR#19 or PDUTR#27) or C++.

> I believe that was David's point; he was quoting someone else who believed
> that a 32-bit representation was compliant with ISO/IEC 10646 but not with
> Unicode.

Hi!

Looks like David was quoting me. I am working on Babylon and wanted to make 
clear that it is not Unicode conformant as its API uses 32bit wide characters 
which violates clause 1 of Section 3.1. Babylon can import and export 
UTF-8/16/32 (UTF-7 is in the works) though, so I'm aiming for 'Unicode 
compliant interchange of 16bit Unicode characters' with Babylon. For more 
details please see pages 107/108 of the Standard.

I was not implying that Unicode can't coexist with 32bit wide characters, nor 
that it has any problems with C++... maybe I should have someone who speaks 
better English than I do write my announcements in the future. Sorry for any 
misunderstandings I might have caused.

-- 
Gruss,
Tobias

---
Tobias Hunger  The box said: 'Windows 95 or better'
[EMAIL PROTECTED]  So I installed Linux.
---




Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread William Overington

The following statements have been made by participants in this thread.

1.

A few days ago I said there was a "widespread belief" that Unicode is a
16-bit-only character set that ends at U+FFFF.  A corollary is that the
supplementary characters ranging from U+10000 to U+10FFFF are either
little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.

2.

Can we put this thread on a constructive footing? I am sure there is
lots of outdated and/or incorrect information out there and I would
like to preempt its being identified via numerous emails here.
If the belief is that there are misperceptions that need to be corrected, how
should the problem be addressed? Bear in mind the volunteer nature of the
organization.



I wonder if some readers might like to have a look at a specific situation.
This would certainly help me and might also provide a useful case study on
the practical problems.

I do not purport to be an expert in Unicode.  Unicode is but one of many
interests.  I do recognize that Unicode is attempting to be a comprehensive
standard system and I would like to do what I can within my own research to
utilize the Unicode system.

As some readers may remember, I am producing a computer language called 1456
object code (in speech, "fourteen fifty-six object code").  It is a
computer language expressible using 7 bit ASCII printing characters and
may be included in the param statements of an applet call in an HTML
page.  The applet called then calls a Java class file named Engine1456.class,
and quite substantial computations with graphic output may be achieved using
a combination of ready prepared standardized Java classes and programs
written in 1456 object code using a text editor.  The benefit is that people
who either do not know Java or do not have Java compiling facilities
available may reasonably straightforwardly produce, using just a text editor
such as Notepad, quite elegant graphics programs with Java quality graphics.
There is a speed overhead, but, even for fast running programs, a 1456
object code program can run at up to about 40% of the speed of a specially
written Java program.  With programs that wait for user input, the
difference in speed may not be noticeable.

The system is fully described at www.users.globalnet.co.uk/~ngo, which is our
family webspace in England.  Readers are welcome to study it in full if
they so wish, yet only a few documents need to be studied, and then only in
part, for the purposes of this case study.

The 1456 object code system relies for its underlying standardization on the
fact that the software that interprets the 1456 object code (that is, the
1456 engine) is written in Java.  Therefore 1456 object code immediately
lends itself to being usable with a standard Java enabled browser on the
internet and also to being usable on the JavaTV system as telesoftware.  As
JavaTV may well become a worldwide broadcasting standard, there is practical
importance in 1456 object code having full capability for handling character
strings in all languages that are encoded in Unicode.

Characters are introduced into the 1456 object code system documents in the
document

www.users.globalnet.co.uk/~ngo/14560600.htm

where 1456 object code characters are said to be "represented using the 16
bit unicode characters of Java."

There are various registers explained.  The key item for this
discussion is that one may load a character from the software into a
register, as a sort of "load immediate" type instruction, in two ways.

A 7 bit ASCII printing character may be loaded using a two character
sequence consisting of the ^ character followed by the desired character.
For example, ^E can be used to encode the character U+0045 in the software.

Any 16 bit Unicode character may be loaded by a six character sequence
consisting of 'u and four hexadecimal characters.  So, the character U+0045
could be loaded using 'u0045 in the software.

Clearly, the six character method can be used for more characters than the
two character method, as the two character method can only be used for the
characters that can be entered as 7 bit ASCII printing characters from the
keyboard when programming.

Please note that when the 1456 object code is being obeyed, the character
that follows the ^ character already exists as a 16 bit Java Unicode
character within the software, the conversion from 7 bit ASCII to 16 bit
Unicode having taken place when it was loaded into the applet from the param
statement of the applet call.
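
To illustrate how a 1456 engine might decode these two forms, here is a
minimal sketch in Java.  This is not the actual Engine1456 source; the
method name and error handling are invented purely for this example.

    // Decode one character at position pos of the object code:
    // "^E" yields the character after '^' directly, while
    // "'u0045" yields the character named by four hex digits.
    static char decodeCharacter(String code, int pos) {
        if (code.charAt(pos) == '^') {
            return code.charAt(pos + 1);              // e.g. ^E -> U+0045
        }
        if (code.charAt(pos) == '\'' && code.charAt(pos + 1) == 'u') {
            String hex = code.substring(pos + 2, pos + 6);
            return (char) Integer.parseInt(hex, 16);  // e.g. 'u0045 -> U+0045
        }
        throw new IllegalArgumentException("no escape at position " + pos);
    }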

The page

www.users.globalnet.co.uk/~ngo/14560700.htm

shows how the six character method using 'u may also be used in the entry of
strings of characters.

The next page that is needed for this case study is

www.users.globalnet.co.uk/~ngo/14561100.htm

and within that page the demo2.htm example.

Within the source code of the demo2.htm file there are the following uses of
the six character method.

'u00e9

'u0108

'u011d

For example, the sequence

[ C

Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread Peter_Constable


On 02/20/2001 11:18:40 AM Tobias Hunger wrote:

>Looks like David was quoting me. I am working on Babylon and wanted to make
>clear that it is not Unicode conformant as its API uses 32bit wide characters
>which violates clause 1 of Section 3.1.

This is something that UTC should clean up because C1 is obsolete. In fact,
UTC just took that action when they met a couple of weeks ago:

[86-M8] Motion: Amend Unicode 3.1 to change the Chapter 3, C1 conformance
clause to read "A process shall interpret Unicode code units (values) in
accordance with the Unicode transformation format used." (passed)

So, when TUS3.1 is published later this year, you will not have any
problems with conformance with that version of the Standard. (C1 was really
obsolete back in version 2.0 when UTF-8 was first adopted into the
Standard, but it took a while for that to get fixed.)



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread Kenneth Whistler

Tobias Hunger said:

> 
> Looks like David was quoting me. I am working on Babylon and wanted to make 
> clear that it is not Unicode conformant as its API uses 32bit wide characters 
> which violates clause 1 of Section 3.1.

No longer, as Peter pointed out.

> Babylon can import and export UTF-8/16/32 
> (UTF-7 is in the works) though, so I'm aiming for 'Unicode compliant 
> interchange of 16bit Unicode characters' with Babylon. For more details 
> please see pages 107/108 of the Standard.

Also out of date. This was also subjected to a major revision in the just-completed
UTC meeting.

These actions were taken to make it clear to everyone that use of a 32-bit
encoding form is *not* inconsistent with a claim of compliance to the Unicode
Standard, now that UTF-32 has been officially added as a sanctioned encoding
form. From this date forward, no one should have to jump through hoops to
explain how their 32-bit wide character implementations are and are not
conformant to the Unicode Standard.

Antoine Leca said:

> [EMAIL PROTECTED] wrote:
> > 
> > Eh? Unicode has no aversion to either a 32-bit encoding form (UTF-32 - see
> > UTR#19 or PDUTR#27) or C++.
> 
> Read also TUS3.0, par. 5.2 at the top of page 108...
> As far as I know, neither UAX-29 nor PDUTR-27 has changed these words...
> 
> That said, one can see it as an oversight that ought to be corrected.
> As the guy who fought to introduce the widest use of ISO10646/Unicode
> in C99, I will certainly welcome any change in this area!  ;-)
> 

All taken care of in the rewrite of section 5.2, based on the last
UTC meeting's review of the text of PDUTR #27.

In general, folks, please calm down a little. The text of PDUTR #27 is
out-of-date -- it was a *Proposed Draft*, after all, for review by
the UTC. And the editorial committee has been working furiously to update
the text for final posting. We decided not to publicly post a bunch of
intermediate drafts every 3 days during this process, to avoid generating
more confusion about the text drift. But the scheduled date for the
next public draft of what will become UAX #27 in the final Unicode 3.1
release is this Friday, February 23.

I cannot promise that all issues will be resolved and all truth will
be revealed in that document, but much of what has been discussed on
this thread should become moot.

--Ken



Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread Tobias Hunger

On Tuesday 20 February 2001 19:29, [EMAIL PROTECTED] wrote:
> This is something that UTC should clean up because C1 is obsolete. In fact,
> UTC just took that action when they met a couple of weeks ago:

Wow, that's great news for me. I am currently very involved with my studies 
and other projects, so I failed to stay current with post-3.0 changes to the 
standard. :(

I again have to say that I'm sorry for the amount of traffic my simple 
oversight has caused on this list.

-- 
Gruss,
Tobias

---
Tobias Hunger  The box said: 'Windows 95 or better'
[EMAIL PROTECTED]  So I installed Linux.
---




Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread Kenneth Whistler

Paul Keinänen said:

> >[86-M8] Motion: Amend Unicode 3.1 to change the Chapter 3, C1 conformance
> >clause to read "A process shall interpret Unicode code units (values) in
> >accordance with the Unicode transformation format used." (passed)
> 
> While this wording makes it possible to handle any 32 bit character
> API implementation as UTF-32, it does not make it any easier
> to implement on processors with an exotic word length. Depending on
> how "process" is defined, a character API implementation on a 24
> bit computer using one word per character could be non-conformant, even if
> the 24 bits (or even 21 bits :-) would be more than sufficient to
> support the 0 .. 10FFFF range.

To the contrary--nothing in the wording of UTF-32 prevents an implementation
in 24-bit words on a processor that uses such words.

The basic definitions of UTF-32 are talking about *serialization*, in
which case you are talking about sequences of 4 (8-bit) bytes, and
the three encoding schemes: UTF-32BE, UTF-32LE, and UTF-32. This is
serialization for interchange of data.

As an encoding *form* (i.e. not serialized, but instead with characters
represented in computer datatypes), the assumption is that each
Unicode scalar value will be represented in a 32-bit word, since that
is the most common architecture that people would be using. But
nothing would prevent putting them in 64-bit registers, for example,
or 24-bit registers (since they fit).

The only thing you need to watch out for is that if you *publish*
a UTF-32 API outside of a self-contained environment, you had better
make sure that it is using unsigned 32-bit integers, as that is
the expectation that would be required for interoperating with other
systems. But the same caution would apply to any public API involving
integral datatypes -- you cannot willy-nilly pass integral data
between a 32-bit API and a 24-bit API.
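
To make the serialization point concrete, here is a minimal sketch in Java
(the method names are mine, purely for illustration): the in-memory integer
can live in whatever register width the processor likes, but the serialized
encoding schemes fix the order of the four 8-bit bytes.

    // Serialize one Unicode scalar value under the two byte orders.
    static byte[] toUtf32BE(int scalar) {
        return new byte[] {
            (byte) (scalar >>> 24), (byte) (scalar >>> 16),
            (byte) (scalar >>> 8),  (byte) scalar };
    }

    static byte[] toUtf32LE(int scalar) {
        return new byte[] {
            (byte) scalar,          (byte) (scalar >>> 8),
            (byte) (scalar >>> 16), (byte) (scalar >>> 24) };
    }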

>  
> It would have been clearer if C1 only defined that code points
> in the 0 .. 10FFFF range should be supported,

That is everywhere implied in the Unicode Standard. There *are* no code points
beyond 10FFFF.

> allowing character API
> implementations (such as dynamically loadable libraries as separate
> products) for processors with exotic word lengths 

Allowed. Although I suppose we should add a note in the future pointing
out that 64-bit and 24-bit implementations are to be expected, though
not in a public API that claims to be "UTF-32".

--Ken

> and a separate
> clause said something about the transformation formats.
> 




Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-20 Thread Paul Keinanen

On Tue, 20 Feb 2001 10:29:17 -0800 (GMT-0800), [EMAIL PROTECTED]
wrote:

>
>On 02/20/2001 11:18:40 AM Tobias Hunger wrote:
>
>>Looks like David was quoting me. I am working on Babylon and wanted to make
>>clear that it is not Unicode conformant as its API uses 32bit wide characters
>>which violates clause 1 of Section 3.1.
>
>This is something that UTC should clean up because C1 is obsolete. In fact,
>UTC just took that action when they met a couple of weeks ago:
>
>[86-M8] Motion: Amend Unicode 3.1 to change the Chapter 3, C1 conformance
>clause to read "A process shall interpret Unicode code units (values) in
>accordance with the Unicode transformation format used." (passed)

While this wording makes it possible to handle any 32 bit character
API implementation as UTF-32, it does not make it any easier
to implement on processors with an exotic word length. Depending on
how "process" is defined, a character API implementation on a 24
bit computer using one word per character could be non-conformant, even if
the 24 bits (or even 21 bits :-) would be more than sufficient to
support the 0 .. 10FFFF range. 

While I have not recently seen BCD computers or 24 bit general-purpose
computers, the 24 bit word length is at least common in digital signal
processors (DSPs).
 
It would have been clearer if C1 only defined that code points
in the 0 .. 10FFFF range should be supported, allowing character API
implementations (such as dynamically loadable libraries as separate
products) for processors with exotic word lengths, and a separate
clause said something about the transformation formats.

Paul Keinänen





Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-21 Thread Joel Rees
Hi.

I took several minutes to scan through your post and I am not sure what you
are asking. Would you like to see some examples, for instance, of real
(assigned) code points that require encoding by surrogate pairs to be
represented as Java char? Looking at what you are trying to do, I think I
would rather try to explain UTF-8, but you indicate you are using Java.

First, a link I couldn't find from the home page:

http://www.unicode.org/charts/draftunicode31

So we have the "musical symbol G clef" at code point 0x1d11e. (I want to say
\u1d11e, but I think that requires a change to Java syntax.) To encode
that in Java chars, we need two of them:

Subtract 0x10000:
0xd11e    (binary 1101 0001 0001 1110)

Split into two pieces of ten bits each by shifting off the bottom ten bits:
    (binary 11 0100 | 01 0001 1110)
Hi half: 0x0034    (binary 00 0011 0100)
Lo half: 0x011e    (binary 01 0001 1110)

Add the base of the appropriate surrogate area:
0xd800 + 0x0034 => 0xd834
0xdc00 + 0x011e => 0xdd1e

Store these in two char:
char[] GClefPair = { '\ud834', '\udd1e' };

Does this answer your question, and could someone check my math?
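
In code form, the same calculation looks like this (a minimal sketch; the
class and method names are mine, not anything from the standard libraries):

    public class SurrogateCheck {
        // Convert a supplementary code point (0x10000..0x10FFFF)
        // into its UTF-16 surrogate pair.
        static char[] toSurrogatePair(int codePoint) {
            int v = codePoint - 0x10000;              // 20-bit value
            char hi = (char) (0xd800 + (v >>> 10));   // top ten bits
            char lo = (char) (0xdc00 + (v & 0x3ff));  // bottom ten bits
            return new char[] { hi, lo };
        }

        public static void main(String[] args) {
            char[] gClef = toSurrogatePair(0x1d11e);
            // Prints "d834 dd1e", matching the hand calculation above.
            System.out.println(Integer.toHexString(gClef[0]) + " "
                    + Integer.toHexString(gClef[1]));
        }
    }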

Hmm. I would still suggest you check out UTF-8 and see if that standard
transformation might make sense for your application.

Joel Rees, Media Fusion KK
Amagasaki, Japan

- Original Message -
From: "William Overington" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, February 21, 2001 2:30 AM
Subject: Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in
Unicode)



Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

2001-02-22 Thread Joel Rees

Hi, William,

I have to admit that I really haven't looked carefully at your
transformation techniques and their intended purpose. But it strikes me that
you might be re-inventing the wheel. A number of schemes exist for squeezing
wide bit patterns into narrow bit streams. UTF-8 has been adopted by Unicode
for squeezing Unicode into eight bit streams. UTF-7 is a proposal for
squeezing Unicode into 7 bit streams. I strongly urge you to examine both
before you finalize your code.
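
To give the flavour of how UTF-8 does its squeezing, here is a minimal
sketch in Java covering code points up to U+FFFF (the four-byte form for
supplementary characters is omitted, and the method name is mine):

    // Encode one code point (up to U+FFFF here) as a UTF-8 byte sequence.
    static byte[] toUtf8(int cp) {
        if (cp < 0x80)          // one byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        if (cp < 0x800)         // two bytes: 110xxxxx 10xxxxxx
            return new byte[] { (byte) (0xc0 | (cp >> 6)),
                                (byte) (0x80 | (cp & 0x3f)) };
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return new byte[] { (byte) (0xe0 | (cp >> 12)),
                            (byte) (0x80 | ((cp >> 6) & 0x3f)),
                            (byte) (0x80 | (cp & 0x3f)) };
    }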

Explanations of UTF-8 are on the Unicode site (somewhere), but you may need
to look up UTF-7 via google.com or another search site. I assume that you
have already examined the "quoted printable" and "base 64" techniques, since
the state machine you describe seems to bear their influence.

I'm glad my quick description helped. You may also want to check your code
against the example Java (I think) source for handling surrogate pairs
available either on the Unicode site or the ISO site for ISO/IEC 10646. I
should have mentioned that in the earlier post, and I apologize.

Joel Rees, Media Fusion KK
Amagasaki, Japan