Re: surrogate terminology

2000-10-02 Thread Asmus Freytag


This discussion has become quite "surreal".

In the meantime, I and other people who need to write about these
characters have, with more or less encouragement from the Unicode Editorial
Committee, started to use the terms "Supplementary Planes", "Supplementary
Characters", etc. This view has now also taken hold in WG2 and is being
reflected in part 2 of ISO 10646.

With that, the discussion is effectively settled - except for the fun of
continuing it ad infinitum for its own sake. However, continuing it on
this list carries a certain element of misinformation with it, as the
question of the official terminology is no longer an open one. Please bear
this in mind, and help esp. novice readers on the list understand that this
is now a hypothetical speculation.

Thank you.
A./

Re: surrogate terminology

2000-09-29 Thread John Cowan


Asmus Freytag wrote:

> With that, the discussion is effectively settled

Well and good.  Now that there is no faintest possibility that anyone will use it
as an official term, I will resume saying "Astral Planes" and "Basic Mundane Plane".

I am also pushing, informally, the term "zigamorph" for the non-character
U+FFFF, based on its use within IBM as a term for EBCDIC FF, also a non-character.

-- 
There is / one art   || John Cowan <[EMAIL PROTECTED]>
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein

Re: surrogate terminology

2000-09-29 Thread Asmus Freytag


This discussion has become quite "surreal".

In the meantime, I and other people who need to write about these
characters have, with more or less encouragement from the Unicode Editorial
Committee, started to use the terms "Supplementary Planes", "Supplementary
Characters", etc. This view has now also taken hold in WG2 and is being
reflected in part 2 of ISO 10646.

With that, the discussion is effectively settled - except for the fun of
continuing it ad infinitum for its own sake. However, continuing it on
this list carries a certain element of misinformation with it, as the
question of the official terminology is no longer an open one. Please bear
this in mind, and help esp. novice readers on the list understand that this
is now a hypothetical speculation.

Thank you.
A./

Re: surrogate terminology

2000-09-16 Thread Edward Cherlin

At 3:36 AM -0800 9/13/00, Michael Everson wrote:
>Ar 14:43 -0800 2000-09-12, scríobh Kenneth Whistler:
>>BMP:  real characters
>>Plane 1:  complex characters
>>Plane 2:  irrational characters
>>Plane 14: imaginary characters
>
>A lovely taxonomy.
>
>Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
>15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
>Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
>27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire

Not quite accurate, though, since irrationals are a subset of the 
reals, and imaginaries are complex. Mathematically, quaternions come 
next in the sequence after complex numbers, followed by octonions. 
This is of no help in the current case, of course.

Of course[1], there are numerous other types of numbers. Donald 
Knuth's book "Surreal Numbers" describes Conway's scheme for defining 
all numbers at once. This term may be of some use to us. Conway 
numbers include transfinites and infinitesimals of all orders. 
Numbers, in Conway's definition, turn out to be a subset of 
turn-based two-person discrete games with perfect information.

In reality, of course, the Unicode code space is a very short initial
segment of the natural numbers (taken as ordinals rather than
cardinals). Deal with it.

"Let us consider the natural numbers [begins writing on blackboard] 
0,...1,...2,...oops."--From an actual math lecture, according to my 
brother Gregory Cherlin, Professor of Mathematics at Rutgers 
University

[1] This is the mathematician's "of course", used to introduce a fact 
known only to the speaker among those present
-- 

Edward Cherlin, Spamfighter 
"It isn't what you don't know that hurts you, it's
what you know that ain't so."--Mark Twain, or else
some other prominent 19th century humorist and wit

Re: surrogate terminology

2000-09-13 Thread Peter_Constable

On 09/13/2000 01:47:57 AM Mark Davis wrote:

>Not all code points are assigned (or even assignable) to characters. U+xx
>is used to refer to code points, which range from 0 to 10FFFF. Of these code
>points, some are assigned to characters (including regular characters, control
>characters, format characters, and private use characters [whose interpretation
>is a matter of private agreement]), some are assigned to noncharacters (e.g.
>U+FFFF), some are assigned to surrogate area code points (U+D800..U+DFFF), and
>some are as yet unassigned (e.g. U+20B0). You will see examples of this usage
>of U+ all throughout the Unicode Standard.

[snip]

>You are absolutely right that no one should be speaking of surrogate area code
>points as "characters"

[snip]

>People do use the term "character" ambiguously to refer to any of a number of
>very different entities: abstract characters but also graphemes, glyphs, code
>points, code units, bytes, etc. To avoid confusion, the broad and misleading
>uses of the term "character" should be avoided; or at least one should clarify
>which sense one is using when not absolutely obvious from the context.

The main concern I have in mind is that people get confused by thinking of
Unicode as a uniformly 16-bit encoding standard, but then having to
understand "surrogate characters", which have generally been described as
characters represented using a special pair of codepoints. But then it's
easy to also get confused as to whether those special code points
individually correspond to characters or not. Talking about a
supplementary-plane character in terms of U+d800 U+dc00 doesn't make this
as clear as it could be. I'd suggest that U+0000 - U+10FFFF should refer to
Unicode scalar values in the space of a CCS, in which case U+d800 and
U+dc00 are unused. But if we're talking about the space of data values used
within UTF-16 where these need to be distinguished from USVs, then use
0xd800 notation. It's a subtle point, but I think it would be helpful
precisely in helping people understand the relationship between USVs, the
UTF-16 encoding form, and surrogates in particular.
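
A minimal sketch of the relationship described above, in Python: a Unicode scalar value above FFFF maps to two UTF-16 code units in the d800 - dfff range. The function name `to_utf16_code_units` is hypothetical and chosen only for illustration; the arithmetic is the standard UTF-16 mapping.

```python
def to_utf16_code_units(scalar):
    """Return the UTF-16 code units (as integers) for one Unicode scalar value."""
    if not 0 <= scalar <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= scalar <= 0xDFFF:
        # Surrogate code points are not scalar values and cannot be encoded.
        raise ValueError("surrogate code point, not a scalar value")
    if scalar < 0x10000:
        # BMP: one code unit, numerically equal to the scalar value.
        return [scalar]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = scalar - 0x10000
    high = 0xD800 + (offset >> 10)    # high surrogate code unit
    low = 0xDC00 + (offset & 0x3FF)   # low surrogate code unit
    return [high, low]


# The first supplementary-plane scalar value, U+10000, is represented by the
# code units 0xD800 0xDC00, which is the pair used as the example above.
print([hex(u) for u in to_utf16_code_units(0x10000)])
```

Writing the result in 0x notation follows the suggestion in this message: the two values are data in the UTF-16 encoding form, not scalar values.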

- Peter

---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

RE: surrogate terminology (was Re: Surrogate support in *ML?

2000-09-13 Thread Marco . Cimarosti


Peter Constable wrote:
> - code values: integers within the space of some encoding form; d800 - dfff
> *are* code values, but not codepoints
> - surrogate: I'm inclined to say that this should refer *only* to a UTF-16
> code value in the range d800 - dfff; equal to "surrogate code value"
> - surrogate pair: a valid pair of UTF-16 surrogate code values used to
> encode an "astral" character; note that a surrogate pair is *different*
> from the character they encode: surrogates come from the sphere of code
> values, not the sphere of characters/codepoints

I add my bit, just to mess things up a little bit.

How about the term "pointer":

- "surrogate pair" --> "(surrogate) pointer"

- "surrogate (code value)" --> "(surrogate) pointer element"

- "high surrogate (code value)" --> "(surrogate) pointer head"

- "low surrogate (code value)" --> "(surrogate) pointer tail"

But, against my own proposal, it should be considered that this terminology
will mainly be used by programming professionals, who may confuse the term
"pointer" with the similar concept in their programming languages.

_ Marco

Re: surrogate terminology

2000-09-13 Thread Michael Everson


Ar 14:43 -0800 2000-09-12, scríobh Kenneth Whistler:
>BMP:  real characters
>Plane 1:  complex characters
>Plane 2:  irrational characters
>Plane 14: imaginary characters

A lovely taxonomy.

Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
Vox +353 1 478 2597 ** Fax +353 1 478 2597 ** Mob +353 86 807 9169
27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire

Re: surrogate terminology

2000-09-13 Thread Mark Davis

Not all code points are assigned (or even assignable) to characters. U+xx
is used to refer to code points, which range from 0 to 10FFFF. Of these code
points, some are assigned to characters (including regular characters, control
characters, format characters, and private use characters [whose interpretation
is a matter of private agreement]), some are assigned to noncharacters (e.g.
U+FFFF), some are assigned to surrogate area code points (U+D800..U+DFFF), and
some are as yet unassigned (e.g. U+20B0). You will see examples of this usage
of U+ all throughout the Unicode Standard.

People may use "U+2035" to refer to a character. In that case, it is understood
as referring to the abstract character that Unicode associates with that code
point. If I say "the character U+20B0", then I am, strictly speaking, in error,
since there is no character associated with that code point. It is a bit like
saying "the present king of France". I may be speaking loosely of the character
which is proposed for that code point in
(http://www.unicode.org/unicode/alloc/Pipeline.html), the GERMAN PENNY SYMBOL.

You are absolutely right that no one should be speaking of surrogate area code
points as "characters". They are not assigned to characters, and will never
be. The surrogate area code points are special -- they cannot be assigned to
characters, and their only use is to be reserved so that the corresponding code
units can be used in UTF-16 in pairs as a representation of the supplementary
characters (using that term for characters assigned to codepoints above FFFF).
They are, however, still code points.
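
The categories of code point described above can be sketched as a small Python classifier. This is only an illustration under stated assumptions: the function name `classify_code_point` and its return labels are hypothetical, and the sketch covers only what can be decided by arithmetic (surrogate area, and the noncharacters ending in FFFE or FFFF in each plane). Telling assigned from unassigned code points would need the character database for a particular version of the standard, so both fall under "other" here.

```python
def classify_code_point(cp):
    """Classify a code point using range checks only (no character database)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a code point: outside 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        # Reserved so that the matching UTF-16 code units can be used in pairs.
        return "surrogate"
    if (cp & 0xFFFE) == 0xFFFE:
        # The last two code points of every plane: xxFFFE and xxFFFF.
        return "noncharacter"
    # Assigned characters, private use and unassigned code points all land
    # here; this sketch does not try to distinguish them.
    return "other"


for cp in (0x0041, 0xD800, 0xFFFF, 0x1FFFE, 0x10000):
    print(f"U+{cp:04X}: {classify_code_point(cp)}")
```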

People do use the term "character" ambiguously to refer to any of a number of
very different entities: abstract characters but also graphemes, glyphs, code
points, code units, bytes, etc. To avoid confusion, the broad and misleading
uses of the term "character" should be avoided; or at least one should clarify
which sense one is using when not absolutely obvious from the context.

Mark

[EMAIL PROTECTED] wrote:

> On 09/12/2000 02:59:38 PM Kenneth Whistler wrote:
>
> [snip]
>
> I think Ken's comments on planes are good.
>
> >3. The term "surrogate character" should be eschewed altogether, because
> >   of the confusion it causes. "Surrogate code point" can continue to
> >   be used as it currently is, and the term "surrogate pair" is also
> >   useful. But the other terminology related to characters...
>
> The other terminology Ken discussed had to do with the plane in which a
> character is found. What I think is still open is how d800 - dfff get
> referred to. Ken indicated that "surrogate code point" can continue in use
> as is; I don't recall exactly how TUS 3.0 uses it. (Would have made for a
> rather challenging trivia question :-) My biggest concern here is that
> people should not be referring to U+d800 - U+dfff as characters. (I'd be
> willing to accept code point, provided there is a clear statement as to
> what is meant by a code point.) For that matter, I'd be inclined to say
> that the U+ notation should not be used here - U+ should be reserved for
> use to refer to encoded characters in terms of their Unicode scalar values.
> So, 0xd800 is OK, but U+d800 would be wrong.
>
> - Peter
>
> ---
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>

Re: surrogate terminology

2000-09-12 Thread 11digitboy


So what notation do you use? 0x8000 is just another
way to say 32768.

By the way, how was the conference?

Um... give him a REALLY high plane, like in 
oh, I don't know how high.
You can't keep giving people their own planes, because
then I'll want one, and then you'll rip my skin away
from my body for wanting one. And Sarasvati will
make sure no woman ever wants me, not that that would
change anything.

--
Robert Lozyniak
Accusplit pedometer manufactures can go suck eggs
My page: http://walk.to/11
[EMAIL PROTECTED] - email
(917) 421-3909 x1133 - voicemail/fax



 [EMAIL PROTECTED] wrote:
>
> On 09/12/2000 02:59:38 PM Kenneth Whistler wrote:
>
> [snip]
>
> I think Ken's comments on planes are good.
>
> >3. The term "surrogate character" should be eschewed altogether, because
> >   of the confusion it causes. "Surrogate code point" can continue to
> >   be used as it currently is, and the term "surrogate pair" is also
> >   useful. But the other terminology related to characters...
>
> The other terminology Ken discussed had to do with the plane in which a
> character is found. What I think is still open is how d800 - dfff get
> referred to. Ken indicated that "surrogate code point" can continue in use
> as is; I don't recall exactly how TUS 3.0 uses it. (Would have made for a
> rather challenging trivia question :-) My biggest concern here is that
> people should not be referring to U+d800 - U+dfff as characters. (I'd be
> willing to accept code point, provided there is a clear statement as to
> what is meant by a code point.) For that matter, I'd be inclined to say
> that the U+ notation should not be used here - U+ should be reserved for
> use to refer to encoded characters in terms of their Unicode scalar values.
> So, 0xd800 is OK, but U+d800 would be wrong.
>
> - Peter
>
> ---
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>


Re: surrogate terminology

2000-09-12 Thread Peter_Constable

On 09/12/2000 02:59:38 PM Kenneth Whistler wrote:

[snip]

I think Ken's comments on planes are good.

>3. The term "surrogate character" should be eschewed altogether, because
>   of the confusion it causes. "Surrogate code point" can continue to
>   be used as it currently is, and the term "surrogate pair" is also
>   useful. But the other terminology related to characters...

The other terminology Ken discussed had to do with the plane in which a
character is found. What I think is still open is how d800 - dfff get
referred to. Ken indicated that "surrogate code point" can continue in use
as is; I don't recall exactly how TUS 3.0 uses it. (Would have made for a
rather challenging trivia question :-) My biggest concern here is that
people should not be referring to U+d800 - U+dfff as characters. (I'd be
willing to accept code point, provided there is a clear statement as to
what is meant by a code point.) For that matter, I'd be inclined to say
that the U+ notation should not be used here - U+ should be reserved for
use to refer to encoded characters in terms of their Unicode scalar values.
So, 0xd800 is OK, but U+d800 would be wrong.

- Peter

---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

Re: Rép. : Re: surrogate terminology

2000-09-12 Thread Michael \(michka\) Kaplan


Hmmm, I hope this is tongue in cheek, the math flashbacks are scary here!

We could have some fun in the future with "hostile characters", "psychotic
characters", "apathetic characters", and other variations. :-)

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

- Original Message -
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Tuesday, September 12, 2000 3:43 PM
Subject: Re: Rép. : Re: surrogate terminology


>
> BMP:  real characters
> Plane 1:  complex characters
> Plane 2:  irrational characters
> Plane 14: imaginary characters
>
> --Ken
>

Re: surrogate terminology

2000-09-12 Thread Mark Leisher



John> Le Mardi, Septembre 12, 2000, à 11:36 AM, Misha Wolf a écrit :

>> I can't stand "astral planes".  The term suggests, to me at least, that
>> these planes (and, hence, the characters in them) are not as "real" as
>> the BMP.
>> 
>> 

John> I guess it doesn't bother me much since I'm used to talking about
John> "imaginary" and "irrational" numbers, too.  (My thirteen-year-old
John> refuses to believe that imaginary numbers are "real."  *sigh*)

Being fond of the term "bucky bits," I always harboured a secret wish to see
"bucky planes" :-)
-
Mark Leisher
Computing Research Lab            Cinema, radio, television, magazines are a
New Mexico State University       school of inattention: people look without
Box 30001, Dept. 3CRL             seeing, listen without hearing.
Las Cruces, NM  88003                 -- Robert Bresson

Re: Rép. : Re: surrogate terminology

2000-09-12 Thread Kenneth Whistler



BMP:  real characters
Plane 1:  complex characters
Plane 2:  irrational characters
Plane 14: imaginary characters

--Ken

Rép. : Re: surrogate terminology

2000-09-12 Thread John H. Jenkins



Le Mardi, Septembre 12, 2000, à 11:36 AM, Misha Wolf a écrit :

> I can't stand "astral planes".  The term suggests, to me at  
> least, that these planes (and, hence, the characters in them)  
> are not as "real" as the BMP. 
>  
>  

I guess it doesn't bother me much since I'm used to talking about "imaginary" and 
"irrational" numbers, too.  (My thirteen-year-old refuses to believe that imaginary 
numbers are "real."  *sigh*)


=
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.blueneptune.com/~tseng

RE: surrogate terminology

2000-09-12 Thread Murray Sargent


For what it's worth, I've been referring to characters between 0x10000 and
0x10FFFF as "higher-plane" characters as distinguished from BMP characters.
Seems to work well in a general way. For plane 1, I use "plane=1"
characters, etc.

Murray

Re: surrogate terminology

2000-09-12 Thread Kenneth Whistler


Peter noted:

> > We do need to clean up terminology, and we need to do so in a way that
> > incorporates understanding of UTR-17. I think we need:
> > 
> > - BMP characters: characters in the BMP; note that d800-dfff are not
> > characters; fffe and ffff are also not characters
> > - "astral"/supplementary/extended-plane/?? characters: everything in planes
> > 1 - 16 (excluding anything ending in fffe and ffff)

This is part of a discussion of terminology regarding surrogates
that has been ongoing among an ad hoc group working on the proposed
UTR on surrogate handling, and a separate but related discussion
among the editorial committee. Now it seems to have migrated out
to the general list.

Misha noted:

> 
> I can't stand "astral planes".  The term suggests, to me at 
> least, that these planes (and, hence, the characters in them) 
> are not as "real" as the BMP.
> 
> By contrast, "supplementary planes" is a factual description.
> 

I'll repeat some of the consensus that seems to have emerged from
the other smaller list discussions.

1. The terminology used by 10646 and by the Unicode Standard should
   be convergent in this area, to minimize the proliferation of
   confusion. The FCD for 10646-2 already uses the term "supplementary
   planes", and this seems perfectly acceptable for the Unicode
   Standard as well.

10646 definition:

plane: A subdivision of a group; of 256 x 256 cells.

Suggested Unicode definitions that could be added to the Unicode
glossary, to cover this convergence:

plane: A subdivision of the encoding space; 64K code points starting
   on an even 64K boundary. (Plane 0 0x0000..0xFFFF; Plane 1 0x10000..
   0x1FFFF, etc.)

BMP: Basic Multilingual Plane, a synonym for Plane 0.

SMP: Supplementary Multilingual Plane, a synonym for Plane 1.

The Supplementary Planes: The collective term for Planes
   1 through 16, considered as a group.

The Astral Planes: Jocular synonym for the Supplementary Planes.
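
The suggested definition of "plane" (64K code points starting on a 64K boundary) amounts to taking the code point's value above the low 16 bits. A minimal Python sketch, with hypothetical helper names `plane_of` and `is_supplementary` that are not taken from either standard:

```python
def plane_of(cp):
    """Return the plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the code space 0..10FFFF")
    # Each plane holds 0x10000 (64K) code points, so the plane number is
    # whatever remains after dropping the low 16 bits.
    return cp >> 16


def is_supplementary(cp):
    """True for code points in Planes 1 through 16, False for the BMP."""
    return plane_of(cp) >= 1


for cp in (0x0041, 0xFFFF, 0x10000, 0x20000, 0xE0000, 0x10FFFF):
    print(f"{cp:06X} -> plane {plane_of(cp)}, supplementary: {is_supplementary(cp)}")
```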

2. The plane names in the FCD for 10646-2 should be modified just
   slightly to tie together the terminology better. The best
   suggestion to date is:

>Plane 1: Supplementary Multilingual Plane for scripts and symbols (SMP)
>Plane 2: Supplementary Ideographic Plane (SIP)
>Plane 14: Supplementary Special-purpose Plane (SSP)

   This makes consistent use of "supplementary plane", and ties the
   plane names and acronyms together in a way which can actually be
   remembered without having to look up the TLA's.

3. The term "surrogate character" should be eschewed altogether, because
   of the confusion it causes. "Surrogate code point" can continue to
   be used as it currently is, and the term "surrogate pair" is also
   useful. But the other terminology related to characters should be
   coordinated with establishing "supplementary planes" as the way to
   refer to Planes 1-16. Some text I wrote earlier about this topic,
   in response to a suggestion to use the terms "extended character"
   and "basic character":

I don't like "extended character", because of the cognitive dissonance
regarding whether the character is an ordinary character that extends
the set located elsewhere, or whether the character itself is extended
in some way -- that is bound to cause confusions, since the UTF-16
encoding scheme for these "extended characters" extends the encoding
form to 2 wydes, as well as extending the character set by adding
the character.

Because of that, I think "supplementary character" is a far better choice
for talking about characters on Planes 1-16. There can be no confusion
there with the mechanics of the encoding form, and there is no artificial
discrimination in that term regarding the status of the good characters
we like in the Supplementary Planes versus the bad characters we don't like
in the Supplementary Planes -- just as for characters in the BMP.

And I would prefer not to start talking about characters in the BMP
as "basic characters", since, as we know, there are many thousands of
them that aren't particularly basic (or useful for implementation).

--Ken

Re: surrogate terminology

2000-09-12 Thread Misha Wolf


> We do need to clean up terminology, and we need to do so in a way that
> incorporates understanding of UTR-17. I think we need:
> 
> - BMP characters: characters in the BMP; note that d800-dfff are not
> characters; fffe and ffff are also not characters
> - "astral"/supplementary/extended-plane/?? characters: everything in planes
> 1 - 16 (excluding anything ending in fffe and ffff)

I can't stand "astral planes".  The term suggests, to me at 
least, that these planes (and, hence, the characters in them) 
are not as "real" as the BMP.

By contrast, "supplementary planes" is a factual description.

Misha Wolf
W3C I18N WG Chair



Re: surrogate terminology (was Re: Surrogate support in *ML?

2000-09-12 Thread John Cowan


[EMAIL PROTECTED] wrote:
 
> - BMP characters: characters in the BMP; note that d800-dfff are not
> characters; fffe and ffff are also not characters

Not in the Glossary, but "BMP" is.

> - "astral"/supplementary/extended-plane/?? characters: everything in planes
> 1 - 16 (excluding anything ending in fffe and ffff)

We do need a term for this.

> - codepoint: I'm inclined to use this as an alternate term for Unicode
> Scalar Value; note that by this def'n d800 - dfff, fffe, etc. are *not*
> codepoints

Same as the Glossary.  Note that "code point" can also be applied to non-Unicode
standards:  0x30 is the codepoint for DIGIT ZERO in US-ASCII.

> - code values: integers within the space of some encoding form; d800 - dfff
> *are* code values, but not codepoints

According to the Glossary, code values are bit strings, not integers.

> - surrogate: I'm inclined to say that this should refer *only* to a UTF-16
> code value in the range d800 - dfff; equal to "surrogate code value"

Yes, this is the obvious abstraction from D25 and D26.

> - surrogate pair: a valid pair of UTF-16 surrogate code values used to
> encode an "astral" character; note that a surrogate pair is *different*
> from the character they encode: surrogates come from the sphere of code
> values, not the sphere of characters/codepoints

Matches D27 and the Glossary.
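
The point that a surrogate pair is different from the character it encodes can be illustrated with a short Python sketch that maps a pair of UTF-16 code values back to a single scalar value. The function name `from_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 mapping, and the code is offered as an illustration rather than as text from the Standard.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate code value into one scalar value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first code value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second code value is not a low surrogate")
    # Each surrogate carries 10 bits; together they index the 16
    # supplementary planes, which start at 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Two code values in, one scalar value out.
print(hex(from_surrogate_pair(0xD800, 0xDC00)))
```

A pair in the wrong order, or a lone surrogate paired with an ordinary code value, is rejected, which matches the "valid pair" wording in the quoted definition.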

Summary: the Unicode Standard's terms are in good shape.

-- 
There is / one art   || John Cowan <[EMAIL PROTECTED]>
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein

RE: surrogate terminology (was Re: Surrogate support in *ML?

2000-09-12 Thread Ayers, Mike



> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]

> 
> >> What is confusing is that sometimes "surrogates" refer to
> >> certain code units (for UTF-16) that are reserved as code points,
> >> and sometimes "surrogates" is used to refer to 'characters
> >> on planes 01-10'.  I think the latter is a misuse.
>
> >Good point. In the past, I have used "surrogate characters" to refer to the
> >characters encoded above FFFF, and surrogate code units to refer to the UTF-16
> >units D800-DFFF. However, I think that leads to confusion. Nobody has come up
> >with a good term for all characters above FFFF. "Plane 1-16 characters" is
> >clunky and requires explanation, as does "non-BMP characters". Another
> >possibility is "surrogate-pair characters". My personal favorite is "astral
> >characters" (don't remember who came up with that).

I think this is the only memorable term I've heard so far, which
alone should recommend it.

> We do need to clean up terminology, and we need to do so in a way that
> incorporates understanding of UTR-17. I think we need:
> 
> - BMP characters: characters in the BMP; note that d800-dfff are not
> characters; fffe and ffff are also not characters

Can we take this one further and say "basic characters"?  I've got
enough TLAs floating about already...   ;-)

> - "astral"/supplementary/extended-plane/?? characters: everything in planes
> 1 - 16 (excluding anything ending in fffe and ffff)

"high characters"?

> - codepoint: I'm inclined to use this as an alternate term for Unicode
> Scalar Value; note that by this def'n d800 - dfff, fffe, etc. are *not*
> codepoints
> - code values: integers within the space of some encoding form; d800 - dfff
> *are* code values, but not codepoints

This is, to me, counterintuitive.  I would be inclined to say "at
point d800 there is no valid value" rather than vice versa.  I would
consider all enumerable integers in the code space to be code points -
whether or not there is anything actually at that point (i.e. fffe and ffff
are valid, but unused, codepoints).

On second glance I see that you want to use the word "code" to mean
two things.  I suspect such wording will cause the same confusion in others
that it caused in me.  Combining your suggestion with mine (in the paragraph
above), I suggest:

code point: integers within the space of some encoding form
code value: the meaning assigned to a code point or code points
Unicode point: a value in the range 0-0x10FFFF which may or may not have a
meaning assigned
Unicode value: Unicode Scalar Value, a fancy way to say "character"

This way, we say "Unicode" to refer to the CCS and "code" to refer
to the CEF.  If we want to go a level higher and talk about the CES, I
suggest "byte points" and "byte values".  I believe that we won't wish to
discuss TES in other than abstract fashion.

I admit I am going a little deeper than I am truly familiar with, so
please accept my apologies if I got this all mixed up.

> - surrogate: I'm inclined to say that this should refer *only* to a UTF-16
> code value in the range d800 - dfff; equal to "surrogate code value"

The problem here is a clash with the traditional definition of
"surrogate", which would be much closer to your "surrogate pair" below.
Can't we call these "surrogate prefixes"?

> - surrogate pair: a valid pair of UTF-16 surrogate code values used to
> encode an "astral" character; note that a surrogate pair is *different*
> from the character they encode: surrogates come from the sphere of code
> values, not the sphere of characters/codepoints


/|/|ike
