RE: Collation - last character?

Kenneth Whistler Mon, 18 Mar 2002 13:12:30 -0800

Lars Kristan responded:

> Markus Scherer wrote:
> > How about U+10ffff?
> > It is a non-character, which gives it a high (unassigned 
> > character) weight in the UCA. It is the highest code point = 
> > "the last character".
> 
> That is definitely not what I was looking for. It is an illegal codepoint,


Not exactly. What ISO/IEC 10646 says is that [10FFFF] shall not be used
[for representing graphic characters]. So you cannot have an interchangeable
character encoding there, but that doesn't mean that the code *position*
per se is illegal -- it is part of the 10646 architecture.

In Unicode terminology, U+10FFFF is a "non-character" (see Unicode 3.1 for
details). You cannot exchange any interpretation of that, but that doesn't
prevent you from using it (or U+FFFF, or any other non-character code point)
for internal processing purposes, as needed.

Markus' suggestion was to use U+10FFFF as just such an internal processing
sentinel, since the definition of the Unicode Collation Algorithm will
automatically give it the highest weight.
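
A minimal sketch of that sentinel idea in Python. It assumes plain code
point comparison (which is what Python's string ordering does) as a
stand-in for a real UCA implementation, and the helper names
with_reserved_last and is_valid_user_input are hypothetical:

```python
# U+10FFFF is a noncharacter: usable internally, never for interchange.
SENTINEL = "\U0010FFFF"

def with_reserved_last(items, reserved):
    """Sort items by code point, pinning the reserved entry to the bottom."""
    keyed = list(items) + [SENTINEL + reserved]
    keyed.sort()  # Python compares str values code point by code point
    # Strip the sentinel again before the list leaves the process.
    return [s[len(SENTINEL):] if s.startswith(SENTINEL) else s for s in keyed]

def is_valid_user_input(text):
    """Screen the sentinel out at the point of input."""
    return SENTINEL not in text

# U+2A6D6 is the highest-weighted Han character mentioned in this thread.
result = with_reserved_last(["zebra", "apple", "\U0002A6D6"], "(new item)")
```

The sentinel only ever exists inside the sort key; it is removed before
display and rejected in user input, so it is never interchanged.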

> while I was looking for a legal codepoint, and one that would not 'happen to
> be' the last, but would be 'defined as' last.

Actually, in the UCA, U+10FFFF is *defined as* last, by the nature of the
handling of weightings for unassigned code points.

But I understand that you may be looking for an interchangeable character
that would be defined as sorting last.

> 
> Initially, I wanted to have such a codepoint, which would counterpart the
> underscore (_). Meaning, it would be a valid alpha character (one that is
> guaranteed to be accepted for identifiers, even as the first character), and
> would have a non-zero-width representation.

This is a contradictory requirement, as best I can tell.

The highest-weighted alpha characters in the current UCA table are all
the Han characters. So without tailoring, you'd pick U+2A6D6 (the last
character in CJK Vertical Extension B) as the highest weight. But then
any tailoring of Han ordering could, in principle, destabilize that. And any
future addition to the Han character encoding would certainly cause a
problem. So the highest-weighted alpha character cannot merely be the
currently highest-weighted alpha character.

Instead, as you surmise, you would have to have a "special", analogous
to the underscore, but weighted higher than any ordinary alphas. The
problem is that you would first have to pick such a beast and get
it accepted into the identifier syntax of all the programming languages.
The status of underscore is somewhat accidental in this regard, since
it represents an identifier hack to indicate multi-word identifiers
as units, and was grandfathered into many formal language syntaxes.
But its status as "lowest" weight is somewhat arbitrary, too. It certainly
is not the lowest weighted "special" in the Unicode Collation Algorithm,
so there is always the potential, by admission of new specials
into programming language identifier syntax, that underscore won't
sort lowest, either.

Also, you need to keep in mind that specials in the UCA behave differently
than you might presume from simple, single-pass weighted sortings that
only deal with primary weights. You are expecting:

_abc
a_bc
ab_c
abc
abc_
bbc

but in the UCA, the (default) ordering of those strings would be:

abc
_abc
a_bc
ab_c
abc_
bbc

since "abc" < the same string with any special character in it. The
specials are unweighted at the primary level, so they only come into play
after the letters themselves have compared equal.
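
The two listings above can be reproduced with a toy model in Python.
This is only a sketch under stated assumptions: it is not the UCA, the
only "special" is the underscore, and the tie-break used in
multi_level_key (positions of the specials) is an invented stand-in for
the lower-level comparison, chosen because it yields the order shown.

```python
words = ["abc", "_abc", "a_bc", "ab_c", "abc_", "bbc"]

SPECIALS = {"_"}  # assumption: underscore is the only special here

def single_pass_key(s):
    # One level only: underscore weighs 0, below every letter.
    return [0 if ch in SPECIALS else ord(ch) for ch in s]

def multi_level_key(s):
    # Primary level: specials carry no weight, so they are skipped.
    primary = [ord(ch) for ch in s if ch not in SPECIALS]
    # Toy tie-break: where the specials sit. A string with no specials
    # gets an empty list, which sorts before any non-empty list.
    tie_break = [i for i, ch in enumerate(s) if ch in SPECIALS]
    return (primary, tie_break)

single_pass = sorted(words, key=single_pass_key)
multi_level = sorted(words, key=multi_level_key)
```

In the multi-level key, every string except "bbc" has the same primary
key, so the underscore can only reorder strings that are otherwise equal.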

And as regards symbols which could be used as specials to try to
get the highest weight behavior, they all sort lower than the alphas
in the current UCA tables, anyway. So some other route would be
required to fix that.

> 
> Asmus Freytag [[EMAIL PROTECTED]] also noted that there could be use for
> such characters in user interfaces. However, for this type of usage, it
> would be preferred to have two zero-width, non-breaking characters, that
> would typically NOT be allowed in user input,

But this is the sort of thing for which you can use a non-interchanged
sentinel that has the appropriate weighting behavior, if you want.

> allowing the application to
> keep reserved items on top or bottom of a sorted list, also knowing that the
> user can never delete them or add an item with the same name, as long as
> these are screened at point of input. Things get more complicated if you
> allow reversed sort order, so I cannot say at this point whether or not
> anyone would really choose to use such an approach.
> 
> The question would then be, if we pursue this issue, are we looking for a
> single character, that would counterpart the underscore, or are we looking
> for four characters, two alpha characters and two zero-width spaces? To
> allow for the latter, I now think that these would fit more in the General
> Punctuation block than in the Specials block.

I do not feel that this is an *encoding* issue at all. Nor is it even
an issue for the Unicode Collation Algorithm to define such a usage.

What you are looking for is something that the programming language
communities could agree upon:

   1. a symbol, from among the vast collection already encoded in
      Unicode, that would be acceptable in identifiers, as is "_".

   2. when using a simple, single-level ordering (e.g., for
      sorting menu-items), would be given a primary weight above
      all alphas, as "_" would be given a primary weight below
      all alphas.

   3. when using a multi-level, sophisticated ordering according
      to the UCA, would also be given a primary weight above all
      alphas, as "_" would be given a primary weight below all
      alphas, so as to preserve the expected behavior, while allowing
      all the sophistication of language-specific ordering behavior
      for sorting lists.

In either case, whether doing simple sorting or complex, multi-level
sorting, you are talking about some tailored behavior here. You
can't just sort on code point order, and you cannot simply use the
Unicode Collation Algorithm without tailoring to get the effects you
want.

By the way, my suggestion for an appropriate, already encoded symbol
to meet your requirements would be U+221E INFINITY. ;-) Or how about
U+261F WHITE DOWN POINTING INDEX, if you want something more iconic?
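
A rough sketch of the simple, single-level tailoring described in item 2,
using U+221E as the "sorts last" symbol. The three-band weighting and the
name tailored_key are assumptions made for illustration; letters are
ordered by code point here, not by any real collation table.

```python
LOW = "_"        # weighted below all alphas
HIGH = "\u221E"  # INFINITY, tailored to weigh above all alphas

def tailored_key(s):
    """Single-level key: LOW band, then everything else, then HIGH band."""
    key = []
    for ch in s:
        if ch == LOW:
            key.append((0, 0))
        elif ch == HIGH:
            key.append((2, 0))
        else:
            key.append((1, ord(ch)))  # stand-in for a real primary weight
    return key

names = ["zeta", "\u221Elast", "_first", "alpha", "\U0002A6D6"]
ordered = sorted(names, key=tailored_key)
untailored = sorted(names)  # plain code point order, for comparison
```

Without the tailoring, U+221E falls below the Han characters in code
point order, so "∞last" would not end up at the bottom of the list.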

--Ken
