Re: Abstract character?

2002-07-24 Thread Mark Davis

I disagree with Ken, but don't have time now to write a lengthy
reply. I'll try to get to that soon.

Mark
__
http://www.macchiato.com
◄  “Eppur si muove” ►

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>
Sent: Tuesday, July 23, 2002 19:44
Subject: Re: Abstract character?


> Kenneth Whistler  wrote:
>
> >> UTF-16 does not allow the representation of an unpaired surrogate
> >> 0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
> >> (It maps the two to U+10000.)  Among the standard UTFs, only UTF-32
> >> allows the two to be treated as unpaired surrogates.
> >
> > Actually, not that, either.
> >
> >> In fact, before UTF-8 was
> >> "tightened up" in 3.2, the only UTF that DID NOT permit these two
> >> coincidental unpaired surrogates was UTF-16.
> >>
> >> UTF-8:  D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
> >> UTF-32:  D800 DC00 <==> D800 DC00
> >
> > This is ill-formed in UTF-32, and thereby, illegal.
>
> I'm glad to hear that unpaired surrogates are now also illegal in
> UTF-32, and presumably also in UTF-16.  However, I did do my homework
> before writing yesterday's post, and that wasn't the impression I got,
> so I sense another opportunity to tighten up the definitions before
> Unicode 4.0 is released.
>
> In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular
> Sequences" starts out talking about "transformation formats such as
> UTF-8."  However, the rest of the section deals exclusively with UTF-8;
> UTF-16 and UTF-32 are not mentioned.
>
> UAX #19, "UTF-32" (written by Mark) is listed in the header block as
> having been updated to Unicode 3.2, but it does not state anywhere that
> unpaired surrogates are illegal.  In particular, the following passages
> from UAX #19 led me to believe that all code points, from 0x0000 through
> 0x10FFFF inclusive, are legal in UTF-32:
>
> "UTF-32 is restricted in values to the range 0..10FFFF₁₆,
> which precisely matches the range of characters defined in the Unicode
> Standard (and other standards such as XML), and those representable by
> UTF-8 and UTF-16."
>
> "(b) An illegal UTF-32 code unit sequence is any byte sequence that
> would correspond to a numeric value outside of the range 0 to
> 10FFFF₁₆.
>
> "(c) An irregular UTF-32 code unit sequence is an eight-byte sequence
> where the first four bytes correspond to a high surrogate, and the next
> four bytes correspond to a low surrogate. As a consequence of C12, these
> irregular UTF-32 sequences shall not be generated by a conformant
> process."
>
> I suggest that the Unicode 4.0 text specifically state, in unambiguous
> terms, which code points are and are not valid in UTF-8, UTF-16, and
> UTF-32.  And if it is true that the surrogate code points 0xD800 through
> 0xDFFF are illegal in UTF-32, then I suggest that UAX #18 be revised to
> state this unambiguously.
>
> -Doug Ewell
>  Fullerton, California
>
>
>





Re: Abstract character?

2002-07-23 Thread Doug Ewell

I typo'd:

> I suggest that UAX #18 be revised to
> state this unambiguously.

s/#18/#19/

-Doug





Re: Abstract character?

2002-07-23 Thread Doug Ewell

Kenneth Whistler  wrote:

>> UTF-16 does not allow the representation of an unpaired surrogate
>> 0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
>> (It maps the two to U+10000.)  Among the standard UTFs, only UTF-32
>> allows the two to be treated as unpaired surrogates.
>
> Actually, not that, either.
>
>> In fact, before UTF-8 was
>> "tightened up" in 3.2, the only UTF that DID NOT permit these two
>> coincidental unpaired surrogates was UTF-16.
>>
>> UTF-8:  D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
>> UTF-32:  D800 DC00 <==> D800 DC00
>
> This is ill-formed in UTF-32, and thereby, illegal.

I'm glad to hear that unpaired surrogates are now also illegal in
UTF-32, and presumably also in UTF-16.  However, I did do my homework
before writing yesterday's post, and that wasn't the impression I got,
so I sense another opportunity to tighten up the definitions before
Unicode 4.0 is released.

In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular
Sequences" starts out talking about "transformation formats such as
UTF-8."  However, the rest of the section deals exclusively with UTF-8;
UTF-16 and UTF-32 are not mentioned.

UAX #19, "UTF-32" (written by Mark) is listed in the header block as
having been updated to Unicode 3.2, but it does not state anywhere that
unpaired surrogates are illegal.  In particular, the following passages
from UAX #19 led me to believe that all code points, from 0x0000 through
0x10FFFF inclusive, are legal in UTF-32:

"UTF-32 is restricted in values to the range 0..10FFFF₁₆,
which precisely matches the range of characters defined in the Unicode
Standard (and other standards such as XML), and those representable by
UTF-8 and UTF-16."

"(b) An illegal UTF-32 code unit sequence is any byte sequence that
would correspond to a numeric value outside of the range 0 to
10FFFF₁₆.

"(c) An irregular UTF-32 code unit sequence is an eight-byte sequence
where the first four bytes correspond to a high surrogate, and the next
four bytes correspond to a low surrogate. As a consequence of C12, these
irregular UTF-32 sequences shall not be generated by a conformant
process."

I suggest that the Unicode 4.0 text specifically state, in unambiguous
terms, which code points are and are not valid in UTF-8, UTF-16, and
UTF-32.  And if it is true that the surrogate code points 0xD800 through
0xDFFF are illegal in UTF-32, then I suggest that UAX #18 be revised to
state this unambiguously.
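
For readers who want to probe the question empirically, here is a minimal
Python sketch. Python is an assumption of this note (the thread names no
language), and the helper name can_encode is hypothetical. It only shows
how one present-day implementation, CPython 3, treats lone surrogate code
points under its default strict error handler; it is not evidence of what
UAX #19 or Unicode 3.2 required in 2002.

```python
# Sketch: does CPython's strict codec accept a lone surrogate code point?
# This reports one implementation's behaviour, not the wording of any standard.

def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` encodes without error under the strict handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

lone_surrogate = "\ud800"        # a single surrogate code point
supplementary = "\U00010000"     # an ordinary supplementary-plane scalar value

for name in ("utf-8", "utf-16", "utf-32"):
    print(name, can_encode(lone_surrogate, name), can_encode(supplementary, name))
```

In this implementation all three UTFs agree: the lone surrogate is refused
and the supplementary scalar value is accepted.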

-Doug Ewell
 Fullerton, California





Re: Abstract character?

2002-07-23 Thread Kenneth Whistler

Lars Marius Garshol followed up:

> Hmmm. OK. So combining diacritics are also abstract characters? 

Yes, clearly.

Each encoded character in the Unicode CCS, ipso facto, associates
an abstract character with a code point.

So U+0300 COMBINING GRAVE ACCENT associates the code point U+0300
with an abstract character {grave accent mark that attaches above
a base form}. We have agreed to associate, further, a normative
name COMBINING GRAVE ACCENT with that encoded character, to
facilitate transcoding U+0300 COMBINING GRAVE ACCENT with any other
CCS which might include the same abstract character.

> (I
> was also unclear on ZWNJ and similar things, but you explicitly
> mention that above, so...)

Yep, they're all abstract characters, by the nature of the beast.

>  
> | (Note above -- abstract characters are also a concept which applies
> | to other character encodings besides the Unicode Standard, and not
> | all encoded characters in other character encodings automatically
> | make it into the Unicode Standard, for various architectural
> | reasons.)
> 
> Right. So VIQR, for example, also has abstract characters, then?

Yes, I think the character encoding model is broad enough to apply
to any CCS, not just the Unicode Standard. That's basically why we
can transcode between Unicode and legacy standards.

> However, it does raise a new problem. Isn't the definition of 'string'
> in the XPath specification then wrong?
> 
>   Strings consist of a sequence of zero or more characters, where a
>   character is defined as in the XML Recommendation [XML]. A single
>   character in XPath thus corresponds to a single Unicode abstract
>   character with a single corresponding Unicode scalar value (see
>   [Unicode]); [...]
>   http://www.w3.org/TR/xpath#strings
> 
> As far as I can tell, one of these two claims must be wrong. That is,
> either a single XPath character does not necessarily correspond to a
> single Unicode abstract character, or else a single XPath character
> need not correspond to a single scalar value.

No, I think it is correct, if somewhat convoluted. Basically, it
is trying to say that each XPath character corresponds to a single
Unicode encoded character. By my discussion in the previous note,
a Unicode encoded character maps a (single) abstract character to
a (single) code point [and because of the constraints of the UTF
definitions, the only valid code points are the Unicode scalar
values].
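
A small Python sketch of that reading, offered only as an analogy: a
Python 3 str is a sequence of code points, so its length counts encoded
characters in the sense described here. Whether a given XPath
implementation counts the same way is an assumption, not something this
snippet demonstrates. The function name code_points is hypothetical.

```python
# Sketch: one string element per code point, regardless of how many
# "user-perceived" characters the reader sees.

def code_points(text: str) -> list[str]:
    """List the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

precomposed = "\u00c5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"       # A followed by COMBINING RING ABOVE

print(len(precomposed), code_points(precomposed))
print(len(decomposed), code_points(decomposed))
```

The two strings display alike, yet the first is one character and the
second is two under the one-character-per-code-point reading.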
 
> 
> Does that sound reasonable?

The trick for understanding Unicode -- once you have the
basic encoding principle down -- is to realize that the ideal
one-to-one mapping situation is violated in practice for a variety
of reasons. In the clarification about abstract characters that
I quoted earlier in this thread, I abridged further discussion
about all the edge cases. For those strong of stomach, read on.

--Ken

 formulation of CCS and exceptions 


What is the CCS?

The CCS is a function f.

The domain X of the function is a repertoire of abstract characters.

The codomain Y of the function is the codespace.

For each x [abstract character in the repertoire], f(x) 
[a code point in the codespace] is assigned to x.

Ideally, the CCS is one-to-one, since two different abstract characters [x] 
cannot be mapped to the same code point [f(x)], and two different
code points are not assigned to the same abstract character.

The CCS is not onto, since there are many code points which are not 
assigned to any abstract character.

In other words, conceptually a CCS is an injection. What we keep
doing is expanding the domain (but not the codomain) of the injection.
(I.e., we keep adding encoded characters, but don't expand the
codespace.)
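
The idealized picture can be sketched in Python. Everything here is
illustrative: the dictionary ccs is a three-entry toy repertoire, abstract
characters are stood in for by descriptive labels, and the helper names
is_injective and is_onto are hypothetical.

```python
# Sketch: a CCS modelled as a function from abstract characters (labels)
# to code points in the codespace 0..0x10FFFF.

CODESPACE = range(0x110000)

ccs = {
    "latin capital letter a": 0x0041,
    "grave accent mark that attaches above a base form": 0x0300,
    "latin capital letter a with ring above": 0x00C5,
}

def is_injective(mapping: dict[str, int]) -> bool:
    """One-to-one: no two abstract characters share a code point."""
    return len(set(mapping.values())) == len(mapping)

def is_onto(mapping: dict[str, int], codomain: range) -> bool:
    """Onto: every code point in the codomain is assigned to something."""
    values = set(mapping.values())
    return all(v in codomain for v in values) and len(values) == len(codomain)

# Adding an encoded character grows the domain; the codomain stays fixed.
ccs["word joiner"] = 0x2060

print(is_injective(ccs), is_onto(ccs, CODESPACE))
```

A false unification would show up here as two labels mapped to the same
code point, at which point is_injective returns False.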

In actuality, the Unicode CCS is messier. As we know, it is not really
one-to-one in practice:

   a. There are instances where two different abstract characters
      are mapped to the same code point. These are the "false
      unifications", and usually engender some more-or-less
      vociferous discussion about requiring a disunification.
      (Cf. U+0192, which unified f-hook with the florin sign.)
      There are also legacy overunifications which result in
      intentionally ambiguous characters: U+002D hyphus,
      U+0027 apostroquote, U+0060 gravquote, and the like.
      False unifications engender controversy in inverse proportion
      to the length of time people have lived with their ambiguities,
      so Unicode tends to be more smothered by nitpicking about
      the examples which are introduced by Unicode itself.

   b. There are instances where the same abstract character is
      mapped to more than one code point. These are the "false
      disunifications", and constitute the set of compatibility
      characters that we give singleton canonical mappings to

Re: Abstract character?

2002-07-23 Thread Kenneth Whistler
ral range
for the code points, as demanded by some of the implementers.

> "code point" should be defined as an integer corresponding to an
> encoded character in any CCS, not just Unicode.

This doesn't really work, since it doesn't account for the
unassigned (reserved) code points, nor the noncharacters.
The Unicode architecture for its codespace is more complex than
any other CCS, precisely because the encoding is more complex:
only Unicode has three bijective encoding forms, and only Unicode
has noncharacters. These need to be taken into account.

> The integers 0xD800..0xDFFF are legal *as code units* in UTF-16. IMHO
> allowing them as code points (i.e. allowing any process to conformantly
> generate unpaired surrogates) is a really bad idea. 

This I agree with.

> The set of code
> point sequences that are validly representable in each UTF should be
> identical (which ensures that mappings between UTFs are bijective and
> always succeed iff the input is valid in the source UTF).

I also think this is of paramount importance.

> I.e. U+D800..DFFF, like U+110000, should be undesignated and
> unrepresentable.

However, you can't go quite this far. As Markus pointed out, code points
themselves may have properties -- even code points which cannot, in
principle, be assigned to characters. And there are already existing
APIs which handle these code points. Their function is clearly
*designated* by the standard, normatively; that, however, is different
from saying that an abstract character could ever be assigned to them.
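
A short Python sketch of the "existing APIs" point, using the standard
unicodedata module. It reflects whatever version of the Unicode Character
Database ships with the Python in use, which is an assumption relative to
the 2002 data discussed in this thread. The helper name
general_category is hypothetical.

```python
import unicodedata

def general_category(code_point: int) -> str:
    """Look up the General_Category property for an integer code point."""
    # chr() accepts every integer in 0..0x10FFFF, surrogates included,
    # and raises ValueError for anything outside that range.
    return unicodedata.category(chr(code_point))

print(general_category(0xD800))   # a surrogate code point
print(general_category(0xFFFF))   # a noncharacter
print(general_category(0x0041))   # an assigned character

try:
    general_category(0x110000)    # outside the codespace altogether
except ValueError as exc:
    print("not a code point:", exc)
```

The property lookup succeeds for surrogates and noncharacters even though
no abstract character can be assigned to them, while a value beyond
0x10FFFF is rejected outright.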

--Ken




Re: Abstract character?

2002-07-23 Thread Markus Scherer

So far, the Unicode Standard has defined code points to be from the contiguous range 
of 0..0x10FFFF.
Some definitions are fuzzy in the standard, with hopes of clarification in Unicode 4.0.

It is true that UTF-16 cannot encode , but it can encode .

There are at least three reasons not to forbid the representation
of surrogate code points in UTF-16 (and also UTF-32)
or the code-pointed-ness of surrogates:

1. Compatibility.
UTF-16 was explicitly created to be backwards compatible with UCS-2.
Valid UCS-2 text must be valid UTF-16 text.
In UCS-2, code points d800..dfff were legal, so they must be in UTF-16.

2. Performance.
When you iterate through a UTF-16/32 string, you don't want to forbid
surrogate code points because it adds complexity to your logic.
In fact, iterating through UTF-16 text currently does not produce any
decoding errors.
When you go through <d800 0061 dc00 d800 dc01> you get code points
d800, 0061, dc00, 10001.

Similarly, you don't want to forbid appending d800 to a string
because the application might deliberately append code units
(and dc00 would follow), or the application might just be blind
towards surrogates and pass code units through one by one
(UCS-2 application) with reasonable hopes that a surrogate pair
would be rejoined by default.

3. Properties.
An API that takes a code point and returns a property for that code point
must be able to deal with surrogate code points because there are non-trivial
properties assigned to them, e.g., general category Cs.

Surrogate code points have been listed in the UCD for a long time,
which shows that they are different from illegal code point values
like 0x110000 or -1.
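
The iteration described under "Performance" can be sketched in Python as
follows. The function name iter_code_points is hypothetical, and the
input sequence is a reconstruction chosen to yield the four code points
listed above (d800, 0061, dc00, 10001). This is a lenient decoder of the
kind described, not a validating one.

```python
from typing import Iterator, Sequence

def iter_code_points(units: Sequence[int]) -> Iterator[int]:
    """Yield code points from 16-bit code units without ever raising.

    A high surrogate immediately followed by a low surrogate is joined
    into one supplementary code point; any other unit, including an
    unpaired surrogate, is passed through as its own code point.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1

units = [0xD800, 0x0061, 0xDC00, 0xD800, 0xDC01]
print([f"{cp:04x}" for cp in iter_code_points(units)])
```

Because no branch reports an error, the loop stays a single comparison
chain per unit, which is the simplicity argued for above.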

markus





Re: Abstract character?

2002-07-23 Thread Peter_Constable


On 07/22/2002 03:38:50 PM Kenneth Whistler wrote:

>Abstract character
>
>   that which is encoded; an element of the repertoire (existing
>   independent of the character encoding standard, and often
>   identifiable in other character encoding standards, as well
>   as the Unicode Standard); the implicit basis of transcodings.

[snip]

>>  - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining ring
>>    above) represent the same abstract character?
>
>Yes. That is the implicit claim behind a specification of canonical
>equivalence.

This brings to mind another question: what's the relationship between
character sequences and abstract characters? Does < 0041, 030A > represent
a single abstract character or a sequence of abstract characters? Ken's
answer above suggests a single abstract character. Actually, the question
that's really bothering me is the next one.

Moving one step further (perhaps you already guessed where I was going),
what of < 1000, 102D, 102F >? Whether we consider it a single abstract
character, or a sequence of abstract characters, the more important
question to me is whether it is the same abstract character (sequence) as
< 1000, 102F, 102D >. The only thing that makes sense is that they are the
same abstract character sequences. But, they are not canonically
equivalent! Is the contrapositive to your statement true? I.e. is it true
that lack of canonical equivalence implies a distinction in abstract
character (sequences)?
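
The two cases can be checked mechanically with Python's standard
unicodedata module; this is a sketch under the assumption that the
Unicode data bundled with the Python in use still gives U+102D and U+102F
canonical combining class 0. The helper name canonically_equivalent is
hypothetical. The check says nothing about whether the sequences are the
"same abstract character (sequence)", which is the open question here.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# < 00C5 > versus < 0041, 030A >
print(canonically_equivalent("\u00c5", "A\u030a"))

# < 1000, 102D, 102F > versus < 1000, 102F, 102D >
print(canonically_equivalent("\u1000\u102d\u102f", "\u1000\u102f\u102d"))

# Canonical reordering only moves marks with a non-zero combining class.
print(unicodedata.combining("\u102d"), unicodedata.combining("\u102f"))
```

With both vowel signs at combining class 0, normalization leaves their
relative order untouched, so the two Myanmar sequences stay distinct.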



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>







Re: Abstract character?

2002-07-23 Thread David Hopwood

-----BEGIN PGP SIGNED MESSAGE-----

Mark Davis wrote:
> A small correction to Ken's message:
> 
> >The Unicode scalar value
> >definitionally excludes D800..DFFF, which are only code unit
> >values used in UTF-16, and which are not code points associated
> >with any well-formed UTF code unit sequences.
> 
> The UTC has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points.

I think it would be a mistake for the standard to refer to "surrogate
code points". The term "code point" is used for other CCS's where there
may also be gaps in the code space; in that case, the gaps are not
considered valid code points. When 0xD800..0xDFFF are used in UTF-16,
they are used as code units, not code points. As Unicode code points,
0xD800..0xDFFF are (or at least should be) invalid in the same sense
that 0x110000 is.

I.e. IMHO "Unicode scalar value" and "Unicode code point" should be
synonyms, with the set of valid values 0..0xD7FF, 0xE000..0x10FFFF.
"code point" should be defined as an integer corresponding to an
encoded character in any CCS, not just Unicode.

> While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

The integers 0xD800..0xDFFF are legal *as code units* in UTF-16. IMHO
allowing them as code points (i.e. allowing any process to conformantly
generate unpaired surrogates) is a really bad idea. The set of code
point sequences that are validly representable in each UTF should be
identical (which ensures that mappings between UTFs are bijective and
always succeed iff the input is valid in the source UTF).
I.e. U+D800..DFFF, like U+110000, should be undesignated and
unrepresentable.

(As well as UTF-16, the definition of UTF-32 in UAX #19 does not
specifically exclude 0xD800..0xDFFF, although the ISO 10646 definition
does. In this case I think Unicode should be changed to be consistent
with ISO 10646.)
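
The value sets under discussion are easy to state as predicates. The
Python sketch below uses hypothetical function names and simply encodes
the ranges quoted in this thread; which of the predicates deserves the
name "code point" is exactly what is being argued, and the code takes no
side.

```python
def in_codespace(n: int) -> bool:
    """True for integers in the codespace 0..0x10FFFF."""
    return 0 <= n <= 0x10FFFF

def is_surrogate(n: int) -> bool:
    """True for the range 0xD800..0xDFFF."""
    return 0xD800 <= n <= 0xDFFF

def is_scalar_value(n: int) -> bool:
    """True for 0..0xD7FF and 0xE000..0x10FFFF."""
    return in_codespace(n) and not is_surrogate(n)

for value in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000):
    print(f"{value:#x}", in_codespace(value), is_scalar_value(value))
```

Under the proposal above, "code point" would mean is_scalar_value; under
the position it responds to, "code point" means in_codespace.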

> Ken is pushing for this change; I believe it would be a very bad idea.

What precisely do you think would be a bad idea?

- -- 
David Hopwood <[EMAIL PROTECTED]>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPT0/MjkCAxeYt5gVAQEOvQf8DEmtbZpQ59nSSbVa8HN/BXCoMG/UOqYy
lSknQ+dUaIS3S0QgpVSIs5tFOjShw2YZ117cXioxzADMbU2MlbY3NITJYkatbgqf
UWIH9ENnqe0YDLdg1FWjyFFWuYLz1kf7c4M16OblhrHMJCjc9+Gba8dikIjJolWi
WNtzfX9ftuzcvFwssReGjyemXMhN6ugeUv3T1hGXjMRT834rSG9eLEr98BWpE1xR
m8wQPBWizSCDF3xFrRg6SwfSt1g+SrhGjLd/ccG96ENdC1XBHYyF4WgggdIO6Ilb
0WSaLbBV4uEPxyPihsy4pV3w8GLRXDhwpK34InLRHJFkMcgNWMTE2w==
=Kn1u
-----END PGP SIGNATURE-----




Re: Abstract character?

2002-07-23 Thread Lars Marius Garshol


* Kenneth Whistler
| 
| Abstract character
| 
|    that which is encoded; an element of the repertoire (existing
|    independent of the character encoding standard, and often
|    identifiable in other character encoding standards, as well
|    as the Unicode Standard); the implicit basis of transcodings.
| 
|    Note that while in some sense abstract characters exist a
|    priori by virtue of the nature of the units of various writing
|    systems, their exact nature is only pinned down at the point
|    that an actual encoding is done. They are not always obvious,
|    and many new abstract characters may arise as the result of
|    particular textual processing needs that can be addressed by
|    characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
|    etc., etc.)

This helps a little, but not all that much. I think spelling out the
details of how the term relates to the other terms would help.

The rest of the definitions were quite clear.
 
* Lars Marius Garshol
|
|   - are all assigned Unicode characters also abstract characters?
 
* Kenneth Whistler
|
| Yes. Or rather: all encoded characters are assigned to abstract
| characters.

Hmmm. OK. So combining diacritics are also abstract characters? (I
was also unclear on ZWNJ and similar things, but you explicitly
mention that above, so...)
 
| (Note above -- abstract characters are also a concept which applies
| to other character encodings besides the Unicode Standard, and not
| all encoded characters in other character encodings automatically
| make it into the Unicode Standard, for various architectural
| reasons.)

Right. So VIQR, for example, also has abstract characters, then?
 
* Lars Marius Garshol
|
|  - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining ring
|    above) represent the same abstract character?
 
* Kenneth Whistler
|
| Yes. That is the implicit claim behind a specification of canonical
| equivalence.

Right. Then I think I've more-or-less got it.

This helped a lot. Thank you!

However, it does raise a new problem. Isn't the definition of 'string'
in the XPath specification then wrong?

  Strings consist of a sequence of zero or more characters, where a
  character is defined as in the XML Recommendation [XML]. A single
  character in XPath thus corresponds to a single Unicode abstract
  character with a single corresponding Unicode scalar value (see
  [Unicode]); [...]
  http://www.w3.org/TR/xpath#strings

As far as I can tell, one of these two claims must be wrong. That is,
either a single XPath character does not necessarily correspond to a
single Unicode abstract character, or else a single XPath character
need not correspond to a single scalar value.

Does that sound reasonable?

-- 
Lars Marius Garshol, Ontopian         <http://www.ontopia.net>
ISO SC34/WG3, OASIS GeoLang TC        <http://www.garshol.priv.no>





Re: Abstract character?

2002-07-22 Thread Doug Ewell

Mark Davis  wrote:

> The UTC has decided to make scalar value mean unambiguously the
> code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points. While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

They are not legal in UTF-16 unless you believe that the two code points
(0xD800, 0xDC00) are fundamentally equivalent to the single code point
0x10000 -- that is, unless you believe Unicode *is* UTF-16.

UTF-16 does not allow the representation of an unpaired surrogate 0xD800
followed by another, coincidental unpaired surrogate 0xDC00.  (It maps
the two to U+10000.)  Among the standard UTFs, only UTF-32 allows the
two to be treated as unpaired surrogates.  In fact, before UTF-8 was
"tightened up" in 3.2, the only UTF that DID NOT permit these two
coincidental unpaired surrogates was UTF-16.

UTF-8:  D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
UTF-32:  D800 DC00 <==> D800 DC00
- but -
UTF-16:  D800 DC00 ==> D800 DC00 ==> 10000
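
The three mappings can be reproduced in Python, with two caveats stated
up front: the 'surrogatepass' error handler is a CPython facility for
deliberately bypassing well-formedness checks, used here only to emulate
the permissive behaviour being described, and the helper name
combine_surrogates is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Arithmetic UTF-16 applies to a high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = "\ud800\udc00"   # two separate surrogate code points

# Force each UTF to serialize the two code points as-is.
utf8_bytes = pair.encode("utf-8", "surrogatepass")
utf32_bytes = pair.encode("utf-32-be", "surrogatepass")
utf16_bytes = pair.encode("utf-16-be", "surrogatepass")

# UTF-8 and UTF-32 can hand the two code points back unchanged ...
print(utf8_bytes.decode("utf-8", "surrogatepass") == pair)
print(utf32_bytes.decode("utf-32-be", "surrogatepass") == pair)

# ... but in UTF-16 the same two units are a well-formed surrogate pair.
print(hex(ord(utf16_bytes.decode("utf-16-be"))))
print(hex(combine_surrogates(0xD800, 0xDC00)))
```

The UTF-16 round trip is the only one of the three that cannot return the
original two code points, which is the asymmetry the table records.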

> Ken is pushing for this change; I believe it would be a very bad idea.
> (I think the reasons have already appeared on this list, so I am not
> trying to reopen the discussion; just state the current situation.)

I don't recall seeing the reasons conclusively discussed on this list;
I'd be happy to hear them again.  I've been complaining about the
paragraph after D29 for two years now.

-Doug Ewell
 Fullerton, California





Re: Abstract character?

2002-07-22 Thread Mark Davis

A small correction to Ken's message:

>The Unicode scalar value
>definitionally excludes D800..DFFF, which are only code unit
>values used in UTF-16, and which are not code points associated
>with any well-formed UTF code unit sequences.

The UTC has decided to make scalar value mean unambiguously the
code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
code points. While surrogate code points cannot be represented in
UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
code points are illegal in all UTFs; notably, they are legal in
UTF-16.

Ken is pushing for this change; I believe it would be a very bad idea.
(I think the reasons have already appeared on this list, so I am not
trying to reopen the discussion; just state the current situation.)

Mark
__
http://www.macchiato.com
◄  “Eppur si muove” ►

----- Original Message -----
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, July 22, 2002 13:38
Subject: Re: Abstract character?


> Lars Marius Garshol asked:
>
> > I'm trying to find out what an abstract character is. I've been
> > looking at chapter 3 of Unicode 3.0, without really achieving
> > enlightenment.
> >
> > The term Unicode scalar value (apparently synonymous with code point)
> > seems clear. It is the identifying number assigned to assigned
> > Unicode characters.
>
> Here is one of my attempts at a more rigorous term rectification:
>
> Abstract character
>
>    that which is encoded; an element of the repertoire (existing
>    independent of the character encoding standard, and often
>    identifiable in other character encoding standards, as well
>    as the Unicode Standard); the implicit basis of transcodings.
>
>    Note that while in some sense abstract characters exist a
>    priori by virtue of the nature of the units of various writing
>    systems, their exact nature is only pinned down at the point
>    that an actual encoding is done. They are not always obvious,
>    and many new abstract characters may arise as the result of
>    particular textual processing needs that can be addressed by
>    characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
>    etc., etc.)
>
> Code point
>
>    A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.
>
> Encoded character
>
>    An *association* of an abstract character with a code point.
>
> Unicode scalar value
>
>    A number from 0..D7FF, E000..10FFFF; the domain of the
>    functions which define UTF's. The Unicode scalar value
>    definitionally excludes D800..DFFF, which are only code unit
>    values used in UTF-16, and which are not code points associated
>    with any well-formed UTF code unit sequences.
>
> Assignment (of code points)
>
>    Refers to the process of associating abstract character with
>    code points. Mathematically a code point is
>    "assigned to" an abstract character and an abstract
>    character is "mapped to" a code point.
>
>    This is distinguished from the vaguer sense of "assigned"
>    in general parlance as meaning "a code point given some
>    designated function by the standard", which would include
>    noncharacters and surrogates.
>
> >
> > So far, so good. Some questions:
> >
> >  - are all assigned Unicode characters also abstract characters?
>
> Yes. Or rather: all encoded characters are assigned to abstract
> characters.
>
> (See above for my distinction between "assigned" and
> "designated", which would apply to noncharacters and surrogate
> code points -- neither of which classes of code points get
> assigned to abstract characters.)
>
> >
> >  - it seems that not all abstract characters have code points (since
> >    abstract characters can be formed using combining characters). Is
> >    that correct?
>
> Yes. (Note above -- abstract characters are also a concept which
> applies to other character encodings besides the Unicode Standard,
> and not all encoded characters in other character encodings automatically
> make it into the Unicode Standard, for various architectural reasons.)
>
> >
> >  - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining ring
> >    above) represent the same abstract character?
>
> Yes. That is the implicit claim behind a specification of canonical
> equivalence.
>
> --Ken
>
> >
> > Would be good if someone could clear this up.
> >
> > --
> > Lars Marius Garshol, Ontopian         <http://www.ontopia.net>
> > ISO SC34/WG3, OASIS GeoLang TC        <http://www.garshol.priv.no>
> >
> >
> >
>
>
>





Re: Abstract character?

2002-07-22 Thread Barry Caplan

I usually define an abstract character in talks I give as "an element of a writing 
system that you care about, independent of glyphs, and certainly independent of 
encodings or specific code points". 

If it could be described more precisely than that, it wouldn't be "abstract", would 
it? :)

This is usually brought up in a series of definitions leading from "character" (what 
we are referring to here as "abstract character"), and then:

- "character list" - a list of "characters" one is interested in
- "character set" - a list of "character lists", which may or may not be ordered, but 
still has no codepoints
- "encoding scheme" - an algorithm for assigning code points to a "character set"
- "code point" the representation of an "abstract character" in an "encoding scheme"
- "font" - a series of glyphs that are used to display characters represented by 
code points, in their immediate context

All of this is filled with examples - building to an explanation of Unicode. For 
example, wrt "abstract character", I ask the audience to ponder if "upper case A" and 
"lower case a" are the same "abstract character". Also, I ask them to ponder if 
"lower case a" displayed in "Helvetica" is the same "character" as "lower case a" in 
"Times Roman". Finally, how about "lower case a in 9 point Helvetica" and "lower case 
a in 18 point Helvetica"?

And apropos a thread from last week, Unicode introduces new concepts such as 
"character properties" which means the anticipation and intrigue I spend time building 
in the audience that there is a neat solution to the historical morass I just spent 40 
minutes describing, gets thoroughly dashed! Joy!

Implicit in this set of definitions is of course that a "character" may or may not be 
of interest to all "character lists", and therefore may or may not end up represented 
in more than one encoding. Also note that even when it does end up in more than one, 
this model in no way implies a round trip capability.

This leads nicely into a discussion about some very important aspects of 
internationalizing code and working with 3rd party components.

Barry Caplan
www.i18n.com

At 01:38 PM 7/22/2002 -0700, Kenneth Whistler wrote:
>Lars Marius Garshol asked:
>
>> I'm trying to find out what an abstract character is. I've been
>> looking at chapter 3 of Unicode 3.0, without really achieving
>> enlightenment. 
>> 
>> The term Unicode scalar value (apparently synonymous with code point)
>> seems clear. It is the identifying number assigned to assigned
>> Unicode characters.
>
>Here is one of my attempts at a more rigorous term rectification:
>
>Abstract character
>
>   that which is encoded; an element of the repertoire (existing
>   independent of the character encoding standard, and often
>   identifiable in other character encoding standards, as well
>   as the Unicode Standard); the implicit basis of transcodings.
>
>   Note that while in some sense abstract characters exist a
>   priori by virtue of the nature of the units of various writing
>   systems, their exact nature is only pinned down at the point
>   that an actual encoding is done. They are not always obvious,
>   and many new abstract characters may arise as the result of
>   particular textual processing needs that can be addressed by
>   characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
>   etc., etc.)
>
>Code point
>
>   A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.
>
>Encoded character
>
>   An *association* of an abstract character with a code point.
>
>Unicode scalar value
>
>   A number from 0..D7FF, E000..10FFFF; the domain of the
>   functions which define UTF's. The Unicode scalar value
>   definitionally excludes D800..DFFF, which are only code unit
>   values used in UTF-16, and which are not code points associated
>   with any well-formed UTF code unit sequences.
>
>Assignment (of code points)
>
>   Refers to the process of associating abstract character with
>   code points. Mathematically a code point is
>   "assigned to" an abstract character and an abstract
>   character is "mapped to" a code point.
>
>   This is distinguished from the vaguer sense of "assigned"
>   in general parlance as meaning "a code point given some
>   designated function by the standard", which would include
>   noncharacters and surrogates.
>
>> 
>> So far, so good. Some questions:
>> 
>>  - are all assigned Unicode characters also abstract character

Re: Abstract character?

2002-07-22 Thread Kenneth Whistler

Lars Marius Garshol asked:

> I'm trying to find out what an abstract character is. I've been
> looking at chapter 3 of Unicode 3.0, without really achieving
> enlightenment. 
> 
> The term Unicode scalar value (apparently synonymous with code point)
> seems clear. It is the identifying number assigned to assigned
> Unicode characters.

Here is one of my attempts at a more rigorous term rectification:

Abstract character

   that which is encoded; an element of the repertoire (existing
   independent of the character encoding standard, and often
   identifiable in other character encoding standards, as well
   as the Unicode Standard); the implicit basis of transcodings.

   Note that while in some sense abstract characters exist a
   priori by virtue of the nature of the units of various writing
   systems, their exact nature is only pinned down at the point
   that an actual encoding is done. They are not always obvious,
   and many new abstract characters may arise as the result of
   particular textual processing needs that can be addressed by
   characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
   etc., etc.)

Code point

   A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.

Encoded character

   An *association* of an abstract character with a code point.

Unicode scalar value

   A number from 0..D7FF, E000..10FFFF; the domain of the
   functions which define UTF's. The Unicode scalar value
   definitionally excludes D800..DFFF, which are only code unit
   values used in UTF-16, and which are not code points associated
   with any well-formed UTF code unit sequences.

Assignment (of code points)

   Refers to the process of associating abstract characters with
   code points. Mathematically a code point is
   "assigned to" an abstract character and an abstract
   character is "mapped to" a code point.

   This is distinguished from the vaguer sense of "assigned"
   in general parlance as meaning "a code point given some
   designated function by the standard", which would include
   noncharacters and surrogates.
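
The numeric ranges in these definitions can be written down as simple
predicates. What follows is only a minimal sketch in Python; the function
names are made up for illustration and come from no library. The
noncharacter test assumes the usual set of 66 (U+FDD0..U+FDEF plus the
last two code points of each plane), which is not spelled out above.

```python
MAX_CODE_POINT = 0x10FFFF  # top of the codespace 0..10FFFF


def is_code_point(n: int) -> bool:
    """Any number in the codespace, surrogates included."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """D800..DFFF: designated by the standard, never assigned."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """0..D7FF and E000..10FFFF: the domain of the UTFs."""
    return is_code_point(n) and not is_surrogate(n)


def is_noncharacter(n: int) -> bool:
    """Assumed set: FDD0..FDEF plus xxFFFE and xxFFFF in every plane."""
    return is_code_point(n) and (
        0xFDD0 <= n <= 0xFDEF or (n & 0xFFFE) == 0xFFFE
    )


def encodes_in_utf8(n: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(n).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for n in (0x0041, 0xD800, 0xFFFE, 0x10FFFF):
        print(f"{n:06X}", is_code_point(n), is_scalar_value(n),
              is_noncharacter(n), encodes_in_utf8(n))
```

Note that a surrogate such as D800 is a code point but not a scalar
value, while a noncharacter such as FFFE is both a code point and a
scalar value; neither is assigned to an abstract character.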

> 
> So far, so good. Some questions:
> 
>  - are all assigned Unicode characters also abstract characters?

Yes. Or rather: all encoded characters are assigned to abstract
characters.

(See above for my distinction between "assigned" and
"designated", which would apply to noncharacters and surrogate
code points -- neither of which classes of code points get
assigned to abstract characters.)

> 
>  - it seems that not all abstract characters have code points (since
>    abstract characters can be formed using combining characters). Is
>    that correct?

Yes. (Note above -- abstract characters are also a concept which
applies to other character encodings besides the Unicode Standard,
and not all encoded characters in other character encodings automatically
make it into the Unicode Standard, for various architectural reasons.)
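
One rough way to see this from code is to ask whether a combining
sequence composes to a single code point under NFC. This is a sketch
only, using Python's standard unicodedata module; the helper name is
hypothetical, and the test is an approximation, since a few precomposed
characters are deliberately excluded from composition.

```python
import unicodedata


def composes_to_single_code_point(seq: str) -> bool:
    """True if NFC turns the sequence into exactly one code point.

    Approximation only: characters listed as composition exclusions
    have a precomposed code point but are never produced by NFC.
    """
    return len(unicodedata.normalize("NFC", seq)) == 1


if __name__ == "__main__":
    # A + COMBINING RING ABOVE has a precomposed encoding (U+00C5).
    print(composes_to_single_code_point("A\u030A"))
    # q + COMBINING TILDE has no precomposed encoding.
    print(composes_to_single_code_point("q\u0303"))
```

Under this test the first sequence has a single-code-point encoding and
the second is representable only as a sequence.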

> 
>  - do U+00C5 (Å) and U+0041 U+030A (A followed by combining ring
>    above) represent the same abstract character?

Yes. That is the implicit claim behind a specification of canonical
equivalence.
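
In practice, canonical equivalence can be checked by normalizing both
strings to the same form and comparing. Below is a minimal sketch with
Python's standard unicodedata module; the function name is invented for
illustration, and NFD is chosen arbitrarily (NFC would serve equally
well for this comparison).

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
    decomposed = "A\u030A"   # A followed by COMBINING RING ABOVE

    # The code point sequences differ...
    print(precomposed == decomposed)
    # ...but they are canonically equivalent.
    print(canonically_equivalent(precomposed, decomposed))
```

The plain comparison is false and the normalized comparison is true,
which is the sense in which both sequences represent the same abstract
character.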

--Ken

> 
> Would be good if someone could clear this up.
> 
> -- 
> Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
> ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >
> 
> 
> 





Re: Abstract character?

2002-07-22 Thread Markus Scherer

Lars Marius Garshol wrote:

> I'm trying to find out what an abstract character is.


http://www.unicode.org/reports/tr17/
http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/

markus





Abstract character?

2002-07-22 Thread Lars Marius Garshol


I'm trying to find out what an abstract character is. I've been
looking at chapter 3 of Unicode 3.0, without really achieving
enlightenment. 

The term Unicode scalar value (apparently synonymous with code point)
seems clear. It is the identifying number assigned to assigned
Unicode characters.

So far, so good. Some questions:

 - are all assigned Unicode characters also abstract characters?

 - it seems that not all abstract characters have code points (since
   abstract characters can be formed using combining characters). Is
   that correct?

 - do U+00C5 (Å) and U+0041 U+030A (A followed by combining ring
   above) represent the same abstract character?

Would be good if someone could clear this up.

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >