Re: Surrogate space in Unicode

J M Sykes Fri, 16 Feb 2001 07:42:25 -0800
See end ->

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Friday, February 16, 2001 6:05 AM
Subject: Re: Surrogate space in Unicode


> In a message dated 2001-02-15 15:26:55 Pacific Standard Time,
> [EMAIL PROTECTED] writes:
>
> > > At 2001-02-06 07:48:29 -0800 Mark Davis wrote:
> >  >> At 2001-02-06 01:51 "nikita k" <[EMAIL PROTECTED]> wrote:
> >  >> What is surrogate space in unicode?
> >
> >  (Mark defines various terms relating to 'supplementary' and
> >  'surrogate')
> >
> >  So, I guess it's safe to say that a surrogate code point is
> >  a surrogate code point... which is a surrogate for a supplementary
> >  code point, which is a code point between something and something
> >  else.
> >
> >  Someone needs to take a break from the bureaucratese and learn
> >  again how to communicate clearly.  Is that not a part of the
> >  goal, here?
>
> I thought Mark's definitions were both accurate and clear, unlike John's
> rejoinder, which was neither.
>
> It has proven difficult to come up with convenient terms for the Unicode
> characters encoded at U+10000 and beyond.  The term 'surrogate' has been
> misused in an attempt to do this.  It is important to use consistent terms
> that demonstrate an understanding of what is going on.
>
> I am not a member of the Consortium, and certainly would not consider myself
> a bureaucrat, so I will take a stab at this in the plainest English I can find
> that does not sacrifice accuracy.
>
> 1.  A Unicode 'code point' is a number between 0 and 1,114,111 inclusive,
> usually expressed in hexadecimal (U+0000 through U+10FFFF).  Not every code
> point necessarily represents a valid character, although most do.  For
> example, there is no character encoded at U+FFFF.
>
> 2.  A 'basic' code point, which may represent a 'basic character', can range
> from U+0000 through U+FFFF.  The remaining code points (U+10000 through
> U+10FFFF) are 'supplementary' code points, each of which may represent a
> 'supplementary character'.
>
> 3.  'Surrogate' code points range from U+D800 through U+DFFF (not U+DC00).
> They do not directly represent characters (so there is no such thing as a
> 'surrogate character'), but two of them may be used together according to the
> rules of UTF-16 to represent a supplementary character.  The two surrogate
> code points used for this purpose would be called a 'surrogate pair'.  Don't
> separate them.
>
> Is that better?
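
As a rough illustration of the three definitions quoted above, here is a
minimal Python sketch. The function names (classify, to_surrogate_pair) are
made up for this example and are not taken from any Unicode document. Note
one assumption: classify reports the surrogate range separately, although
under definition 2 as worded those values also fall inside U+0000..U+FFFF.

```python
from typing import Tuple

MAX_CODE_POINT = 0x10FFFF  # 1,114,111 decimal


def classify(cp: int) -> str:
    """Label a code point using the quoted definitions 1 to 3."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %#x" % cp)
    # Assumption: report surrogates separately, even though they lie
    # numerically inside the 'basic' range U+0000..U+FFFF.
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if cp <= 0xFFFF:
        return "basic"
    return "supplementary"


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if classify(cp) != "supplementary":
        raise ValueError("only supplementary code points use a surrogate pair")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits, in U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits, in U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x10400) yields (0xD801, 0xDC00), and
from_surrogate_pair turns that pair back into 0x10400.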

It's clearer, but misses what I understand to be the absolutely crucial
distinction between a code point (correctly defined) and a code unit
(mentioned by Mark but not by Doug). For what a code unit is, see
http://www.unicode.org/unicode/reports/tr17
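
To make the code point / code unit distinction concrete, the following
Python sketch lists the code units that one and the same code point produces
in three encoding forms. The helper name code_units is invented for this
illustration; the codec names are those of Python's standard library.

```python
def code_units(ch: str) -> dict:
    """Return the code units of a single character in UTF-8/16/32."""
    utf8 = list(ch.encode("utf-8"))    # 8-bit code units
    b16 = ch.encode("utf-16-be")       # big-endian, no byte order mark
    utf16 = [int.from_bytes(b16[i:i + 2], "big")
             for i in range(0, len(b16), 2)]   # 16-bit code units
    b32 = ch.encode("utf-32-be")
    utf32 = [int.from_bytes(b32[i:i + 4], "big")
             for i in range(0, len(b32), 4)]   # 32-bit code units
    return {"utf-8": utf8, "utf-16": utf16, "utf-32": utf32}


# One supplementary code point, U+10400:
units = code_units(chr(0x10400))
for form, values in units.items():
    print(form, [hex(v) for v in values])
```

The single code point U+10400 comes out as four 8-bit units, two 16-bit
units (the surrogate pair), or one 32-bit unit. The code point is the same
in all three cases; only the code units differ.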

I would question whether 'surrogate code points' are really code points. In
the sense that they are a subset of 'code points' as defined, I guess they
are; but they are not only unlike every other code point in that they "do
not directly represent characters", they are explicitly and inexorably
disqualified from so doing, being reserved for use, in pairs, as UTF-16 code
units. (Which is what Mark said, of course.)

Looked at in this way, it surely becomes clearer that transcoding a
surrogate (code point) into UTF-8 is an abomination.
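
As one illustration of this point, Python's strict UTF-8 codec refuses to
encode a lone surrogate, whereas a surrogate pair read as UTF-16 becomes one
supplementary code point and transcodes cleanly. This is only a sketch of how
one implementation behaves; the helper name utf8_or_none is hypothetical.

```python
def utf8_or_none(s: str):
    """Encode s as UTF-8, or return None if the strict codec refuses."""
    try:
        return s.encode("utf-8")
    except UnicodeEncodeError:
        return None


# A lone surrogate code point: not a character, so no UTF-8 form.
lone = chr(0xD801)
lone_result = utf8_or_none(lone)

# The UTF-16 code unit sequence D801 DC00 is a surrogate pair. Decoding
# it gives the single supplementary code point U+10400.
pair_bytes = bytes([0xD8, 0x01, 0xDC, 0x00])
ch = pair_bytes.decode("utf-16-be")
pair_result = utf8_or_none(ch)

print(lone_result)        # None
print(hex(ord(ch)))       # 0x10400
print(pair_result.hex())  # f0909080
```

The pair is transcoded as a unit into one four-byte UTF-8 sequence; the two
surrogates are never converted one at a time.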

Simplification is all very well, but it can be taken too far, as when
important distinctions are lost.

For what it's worth,

Mike.

***********************************************************

J M Sykes              Email: [EMAIL PROTECTED]
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire   SK8 3SN
UK                        Tel: (44) 161 437 5413

***********************************************************