Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

William Overington Tue, 20 Feb 2001 10:38:20 -0800
The following statements have been made by participants in this thread.

1.

A few days ago I said there was a "widespread belief" that Unicode is a
16-bit-only character set that ends at U+FFFF.  A corollary is that the
supplementary characters ranging from U+10000 to U+10FFFF are either
little-known or perceived to belong to ISO/IEC 10646 only, not to Unicode.

2.

Can we put this thread on a constructive footing? I am sure there is
lots of outdated and/or incorrect information out there and I would
like to preempt its being identified via numerous emails here.
If the belief is there are misperceptions that need to be corrected, how
should the problem be addressed? Bear in mind the volunteer nature of the
organization....

----

I wonder if some readers might like to have a look at a specific situation.
This would certainly help me and might also provide a useful case study on
the practical problems.

I do not claim to be an expert in Unicode.  Unicode is but one of my many
interests.  I do recognize that Unicode aims to be a comprehensive
standard, and I would like to do what I can within my own research to
make use of it.

As some readers may remember, I am producing a computer language called 1456
object code (in speech, "fourteen fifty-six object code").  It is expressible
using 7-bit ASCII printing characters and may be included in the param
statements of an applet call in an HTML page.  The applet that is called
then calls a Java class file named Engine1456.class, and quite substantial
computations with graphic output may be achieved using a combination of
ready-prepared standardized Java classes and programs written in 1456
object code using a text editor.  The benefit is that people who either do
not know Java or do not have Java compiling facilities available may,
reasonably straightforwardly and using just a text editor such as Notepad,
produce quite elegant graphics programs with Java-quality graphics.  There
is a speed overhead but, even for fast-running programs, a 1456 object code
program can reach about 40% of the speed of a specially written Java
program.  With programs that wait for user input, the difference in speed
may not be noticeable.

The system is fully described on www.users.globalnet.co.uk/~ngo which is our
family webspace in England and readers are welcome to study it in full if
they so wish, yet only a few documents need to be studied, and then only in
part, for the purposes of this case study.

The 1456 object code system owes its underlying standardization to the fact
that the software that interprets 1456 object code (that is, the 1456
engine) is written in Java.  1456 object code can therefore be used
immediately with a standard Java-enabled browser on the internet, and also
on the JavaTV system as telesoftware.  As JavaTV may well become a
worldwide broadcasting standard, it is of practical importance that 1456
object code be fully capable of handling character strings in all languages
that are encoded in Unicode.

Characters are introduced into the 1456 object code system documents in the
document

www.users.globalnet.co.uk/~ngo/14560600.htm

where 1456 object code characters are said to be "represented using the 16
bit unicode characters of Java."

Various registers are explained there.  The key point for this discussion
is that a character may be loaded from the software into a register, as a
sort of "load immediate" instruction, in two ways.

A 7 bit ascii printing character may be loaded using a two character
sequence consisting of the ^ character followed by the desired character.
For example, ^E can be used to encode the character U+0045 in the software.

Any 16 bit unicode character may be loaded by a six character sequence
consisting of 'u and four hexadecimal characters.  So, the character U+0045
could be loaded using 'u0045 in the software.

Clearly, the six character method can be used for more characters than the
two character method, as the two character method can only be used for the
characters that can be entered as 7 bit ascii printing characters from the
keyboard when programming.
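
The two load forms described above can be sketched as follows.  This is
only an illustration and is not taken from the 1456 engine: the class and
method names are hypothetical, and the range 0x21 to 0x7E for "printing
characters" is an assumption (whether space is allowed is not stated).

```java
// Hypothetical sketch; not the actual Engine1456 code.
final class LoadImmediateSketch {

    /** Decodes "^X" or "'uXXXX" into the 16-bit Java char it denotes. */
    static char decode(String token) {
        // Two-character form: ^ followed by one 7-bit ASCII printing character.
        if (token.length() == 2 && token.charAt(0) == '^') {
            char c = token.charAt(1);
            // Assumption: printing characters are 0x21..0x7E (space excluded).
            if (c < 0x21 || c > 0x7E) {
                throw new IllegalArgumentException("Not a printing character: " + token);
            }
            return c;
        }
        // Six-character form: 'u followed by exactly four hexadecimal digits.
        if (token.length() == 6 && token.startsWith("'u")) {
            int value = 0;
            for (int k = 2; k < 6; k++) {
                int digit = Character.digit(token.charAt(k), 16);
                if (digit < 0) {
                    throw new IllegalArgumentException("Bad hex digit in: " + token);
                }
                value = value * 16 + digit;
            }
            return (char) value; // four hex digits always fit in 16 bits
        }
        throw new IllegalArgumentException("Unrecognized token: " + token);
    }

    public static void main(String[] args) {
        System.out.println((int) decode("^E"));     // prints 69
        System.out.println((int) decode("'u0045")); // prints 69
    }
}
```

Both calls yield the same value, U+0045, which matches the example given
above for ^E and 'u0045.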

Please note that when the 1456 object code is being obeyed, the character
that follows the ^ character already exists as a 16-bit Java Unicode
character within the software, the conversion from 7-bit ASCII to 16-bit
Unicode having taken place when it was loaded into the applet from the param
statement of the applet call.

The page

www.users.globalnet.co.uk/~ngo/14560700.htm

shows how the six character method using 'u may also be used in the entry of
strings of characters.

The next page that is needed for this case study is

www.users.globalnet.co.uk/~ngo/14561100.htm

and within that page the demo2.htm example.

Within the source code of the demo2.htm file there are the following uses of
the six character method.

'u00e9

'u0108

'u011d

For example, the sequence

[ Caf'u00e9]

is used to load from the software the four-character string Cafe, with an
acute accent on the e.
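
As an illustration of how such a string body might be expanded, here is a
minimal sketch.  It is not the engine's own code: the names are
hypothetical, the handling of the enclosing brackets is left out, and
treating an apostrophe that is not followed by u and four hexadecimal
digits as an ordinary character is an assumption.

```java
// Hypothetical sketch; not the actual Engine1456 code.
final class StringEscapeSketch {

    /** Replaces each 'uXXXX sequence in the string body by the char it names. */
    static String expand(String body) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            if (body.startsWith("'u", i) && i + 6 <= body.length()) {
                int value = 0;
                boolean valid = true;
                for (int k = i + 2; k < i + 6; k++) {
                    int digit = Character.digit(body.charAt(k), 16);
                    if (digit < 0) {
                        valid = false;
                        break;
                    }
                    value = value * 16 + digit;
                }
                if (valid) {
                    out.append((char) value);
                    i += 6; // skip the whole six-character sequence
                    continue;
                }
            }
            // Assumption: anything else is copied through unchanged.
            out.append(body.charAt(i));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = expand("Caf'u00e9");
        System.out.println(s.length());                           // prints 4
        System.out.println(Integer.toHexString(s.charAt(3)));     // prints e9
    }
}
```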

After that, the 'u method is used wherever it is needed to produce the
desired effect.  It proved very useful when writing the software that
produced the diagram used in the document

www.users.globalnet.co.uk/~ngo/14563100.htm

later in the sequence.  The diagram is near the end of the document.

In that software, the characters

'u03b1

'u03b2

'u03b3

'u03be

were used.

The fonts that I have used are from Microsoft as mentioned in the document

www.users.globalnet.co.uk/~ngo/14561100.htm

mentioned previously.  They provide about 600 characters, which is far
fewer than the 65536 that the 'u command could produce.  They include
Latin, Greek and Cyrillic characters, and more.

Having set the scene of how I apply unicode to my own application at
present, the question arises as to how to proceed to use the full unicode
system.

I am quite happy to designate 'v, followed by however many hexadecimal
characters are judged necessary, as the way to load a Unicode character of
whatever width is required into a register from the software.  Perhaps that
is 'v followed by eight hexadecimal characters, or perhaps 'v followed by
six.  I can use both 'V and 'v without any problem if that is what is
needed.

Yet two further matters arise.

1.  What about the fact that Java uses 16 bit characters?

2.  Even if I code the extra characters using some system involving 'v and
maybe 'V commands, with however many hexadecimal characters following, and
store them in the software, how am I supposed to display them on the
screen?  Are these characters available in font files?  Suppose that an
application needs only, say, ten of these extra characters out of the large
number of codes that are available, much as the fonts that I am using have
characters for only about 600 of the 65536 possible codes.  Can an ordinary
font file be used to encode these ten characters with their large code
numbers?  I would quite like to have a go at encoding the 'v and maybe 'V
in a reasonable manner and trying it out with real data for real
characters.
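
On the first question, one possible approach can be sketched as follows.
It rests on assumptions: a hypothetical 'v form taking exactly six
hexadecimal digits (enough for the range up to U+10FFFF quoted at the top
of this posting), with each supplementary character stored as a UTF-16
surrogate pair of two 16-bit Java chars.  The names are illustrative only.
Later Java releases (5 onward) provide Character.toChars for the same
conversion; the arithmetic is written out here to show the structure.

```java
// Hypothetical sketch of a 'v command; nothing here is part of 1456 as published.
final class SupplementarySketch {

    /** Parses "'vXXXXXX" (six hex digits) into a Unicode code point. */
    static int parseCodePoint(String token) {
        if (token.length() != 8 || !token.startsWith("'v")) {
            throw new IllegalArgumentException("Expected 'v plus six hex digits: " + token);
        }
        int cp = 0;
        for (int k = 2; k < 8; k++) {
            int digit = Character.digit(token.charAt(k), 16);
            if (digit < 0) {
                throw new IllegalArgumentException("Bad hex digit in: " + token);
            }
            cp = cp * 16 + digit;
        }
        if (cp > 0x10FFFF) {
            throw new IllegalArgumentException("Beyond U+10FFFF: " + token);
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            throw new IllegalArgumentException("Surrogate code point: " + token);
        }
        return cp;
    }

    /** Encodes a code point as one or two 16-bit chars (UTF-16). */
    static char[] toUtf16(int cp) {
        if (cp < 0x10000) {
            return new char[] { (char) cp };
        }
        int offset = cp - 0x10000;                     // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toUtf16(parseCodePoint("'v010400"));
        System.out.println(Integer.toHexString(pair[0])); // prints d801
        System.out.println(Integer.toHexString(pair[1])); // prints dc00
    }
}
```

With this scheme a register that holds one 16-bit char would need either
two loads or a wider register for such a character; which of these suits
the 1456 engine is a design decision that the sketch does not settle.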

In this posting I have tried, with reference to just a few web pages, to
provide sufficient detail of the practical problem that I face in relation
to the matters raised in this thread.  I wonder whether the Unicode
specialists, in resolving this thread, might like to prepare a document for
someone who is not a Unicode specialist yet is trying to apply Unicode to a
real project in which Unicode is but one part.  From such a document that
person should be able to find, straightforwardly, an explanation of the
Unicode system sufficient to understand the underlying structure, program
it into software, and apply it correctly using font files.  Such a document
would be very helpful.  If it already exists, I would be pleased to know of
a reference to it.

William Overington

20 February 2001