RE: Collation - last character?

2002-03-22 Thread Yves Arrouye

 TUS does not prevent anyone from putting noncharacter code points in Unicode
 strings. As a matter of fact, p. 23 of TUS 3.0 reads "U+FFFF is reserved for
 private program use as a sentinel or other signal." I would expect this to
 hold true for the noncharacters that were introduced later too. It may not
 fit your needs if you're looking for a character, but it is available for
 use by applications.
 
 But it is *not* available to *users* to put into lists to make certain
 elements sort at the end.

When dealing with user-specified lists, I would, if possible, introduce some
markup so that my application can deal with those two special cases
(lowest/highest) as it wishes internally, without burdening the user with the
need to enter a codepoint that is improbable in her everyday context.
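
A minimal sketch of that idea, assuming the markup is carried as a separate
"pin" field next to each label rather than as a character inside the string.
The names Pin and sort_entries are hypothetical and only illustrate the
approach; nothing here comes from an existing library.

```python
from enum import IntEnum


class Pin(IntEnum):
    """Hypothetical markup: where an entry must appear in the sorted list."""
    TOP = 0
    NORMAL = 1
    BOTTOM = 2


def sort_entries(entries):
    """Sort (label, pin) pairs: pin group first, then label within the group.

    casefold() is only a stand-in for a real collation key.
    """
    return sorted(entries, key=lambda e: (e[1], e[0].casefold()))


entries = [
    ("Baker", Pin.NORMAL),
    ("None of the above", Pin.BOTTOM),
    ("Able", Pin.NORMAL),
    ("(default)", Pin.TOP),
]
ordered = [label for label, _ in sort_entries(entries)]
print(ordered)
```

Because the pin lives outside the label, the user never has to type a special
codepoint, and the labels themselves can be collated by whatever rules the
application prefers.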

YA





Re: Collation - last character?

2002-03-20 Thread Kenneth Whistler

David Hopwood said:

  At 09:01 AM 3/19/02 -0800, Yves Arrouye wrote:
  TUS does not prevent anyone from putting noncharacter code points in Unicode
  strings. As a matter of fact, p. 23 of TUS 3.0 reads "U+FFFF is reserved
  for private program use as a sentinel or other signal."

  
  But it is *not* available to *users* to put into lists to make certain
  elements sort at the end.
 
 No, but U+1FFFD is.

Make that U+10FFFD, of course.

Incidentally, in case anyone is interested, in the default table for
the Unicode Collation Algorithm, the character with the lowest primary
weight (other than zero, or variables set to be ignorable) is:

02D0 ; [.081F.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON

That is the value in the current table (inclusive of the Unicode 3.0.1
repertoire). In the table that matches the current table under ballot
for ISO 14651, extending the repertoire to Unicode 3.1.0, the same
entry still has the lowest primary weight, but the absolute value
has changed to:

02D0  ; [.09D3.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON
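
For readers who want to compare the two entries programmatically, here is a
small sketch that parses lines of this shape. It assumes only the layout
visible above (code points, a semicolon, bracketed collation elements whose
first character is "." or "*", then a "#" comment); parse_entry is a
hypothetical helper, not part of any UCA tool.

```python
import re

# One collation element: "[", a flag ("." or "*"), dot-separated hex weights, "]".
ELEMENT = re.compile(r"\[([.*])([0-9A-Fa-f.]+)\]")


def parse_entry(line):
    """Return (code_points, elements, comment) for one table line.

    Each element is (is_variable, weights); weights is a tuple of ints.
    """
    data, _, comment = line.partition("#")
    chars, _, elements = data.partition(";")
    code_points = [int(cp, 16) for cp in chars.split()]
    parsed = []
    for flag, body in ELEMENT.findall(elements):
        weights = tuple(int(w, 16) for w in body.split("."))
        parsed.append((flag == "*", weights))
    return code_points, parsed, comment.strip()


old = parse_entry("02D0 ; [.081F.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON")
new = parse_entry("02D0  ; [.09D3.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON")

old_primary = old[1][0][1][0]
new_primary = new[1][0][1][0]
print(f"primary weight changed from {old_primary:04X} to {new_primary:04X}")
```

Only the first weight differs between the two lines; the remaining fields of
the element are identical in both.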

--Ken





RE: Collation - last character?

2002-03-19 Thread Yves Arrouye

 Markus Scherer wrote:
  How about U+10FFFF?
  It is a non-character, which gives it a high (unassigned
  character) weight in the UCA. It is the highest code point =
  the last character.
 
 That is definitely not what I was looking for. It is an illegal codepoint,
 while I was looking for a legal codepoint, and one that would not 'happen to
 be' the last, but would be 'defined as' last.

TUS does not prevent anyone from putting noncharacter code points in Unicode
strings. As a matter of fact, p. 23 of TUS 3.0 reads "U+FFFF is reserved for
private program use as a sentinel or other signal." I would expect this to
hold true for the noncharacters that were introduced later too. It may not
fit your needs if you're looking for a character, but it is available for
use by applications.

YA





Re: Collation - last character?

2002-03-19 Thread David Hopwood

Asmus Freytag wrote:
 At 09:01 AM 3/19/02 -0800, Yves Arrouye wrote:
 TUS does not prevent anyone from putting noncharacter code points in Unicode
 strings. As a matter of fact, p. 23 of TUS 3.0 reads "U+FFFF is reserved
 for private program use as a sentinel or other signal." I would expect
 this to hold true for the noncharacters that were introduced later too.
 It may not fit your needs if you're looking for a character, but it is
 available for use by applications.
 
 But it is *not* available to *users* to put into lists to make certain
 elements sort at the end.

No, but U+1FFFD is.

- -- 
David Hopwood [EMAIL PROTECTED]

Home page  PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip






RE: Collation - last character?

2002-03-18 Thread Lars Kristan


Markus Scherer wrote:
 How about U+10FFFF?
 It is a non-character, which gives it a high (unassigned 
 character) weight in the UCA. It is the highest code point = 
 the last character.

That is definitely not what I was looking for. It is an illegal codepoint,
while I was looking for a legal codepoint, and one that would not 'happen to
be' the last, but would be 'defined as' last.

Initially, I wanted to have such a codepoint, which would counterpart the
underscore (_). Meaning, it would be a valid alpha character (one that is
guaranteed to be accepted for identifiers, even as the first character), and
would have a non-zero-width representation.

Asmus Freytag [[EMAIL PROTECTED]] also noted that there could be use for
such characters in user interfaces. However, for this type of usage, it
would be preferred to have two zero-width, non-breaking characters, that
would typically NOT be allowed in user input, allowing the application to
keep reserved items on top or bottom of a sorted list, also knowing that the
user can never delete them or add an item with the same name, as long as
these are screened at point of input. Things get more complicated if you
allow reversed sort order, so I cannot say at this point whether or not
anyone would really choose to use such an approach.
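
A rough sketch of the screening step described above. It substitutes a
noncharacter code point for the proposed zero-width characters, since no such
dedicated characters exist; BOTTOM_MARK and screen_input are hypothetical
names, and plain code point order stands in for real collation.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Assumption: the application reserves U+10FFFF as its "keep at bottom" marker.
BOTTOM_MARK = "\U0010FFFF"


def screen_input(text):
    """Reject user input that contains any reserved (noncharacter) code point."""
    for ch in text:
        if is_noncharacter(ord(ch)):
            raise ValueError(f"reserved code point U+{ord(ch):04X} not allowed in input")
    return text


# User-supplied items pass through the screen; the reserved item bypasses it.
items = [screen_input("Baker"), screen_input("Able")]
items.append(BOTTOM_MARK + "None of the above")

ordered = sorted(items)  # code point order, not a real collation
display = [s.replace(BOTTOM_MARK, "") for s in ordered]
print(display)
```

Sorting the same list with reverse=True would move the reserved item to the
top, which is the complication with reversed sort order mentioned above.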

The question would then be, if we pursue this issue, are we looking for a
single character, that would counterpart the underscore, or are we looking
for four characters, two alpha characters and two zero-width spaces? To
allow for the latter, I now think that these would fit more in the General
Punctuation block than in the Specials block.


Lars Kristan




RE: Collation - last character?

2002-03-18 Thread Kenneth Whistler

Lars Kristan responded:

 Markus Scherer wrote:
  How about U+10FFFF?
  It is a non-character, which gives it a high (unassigned 
  character) weight in the UCA. It is the highest code point = 
  the last character.
 
 That is definitely not what I was looking for. It is an illegal codepoint,

Not exactly. What ISO/IEC 10646 says is that [10] shall not be used
[for representing graphic characters]. So you cannot have an interchangeable
character encoding there, but that doesn't mean that the code *position*
per se is illegal -- it is part of the 10646 architecture.

In Unicode terminology, U+10FFFF is a non-character (see Unicode 3.1 for
details). You cannot exchange any interpretation of that, but that doesn't
prevent you from using it (or U+FFFF, or any other non-character code point)
for internal processing purposes, as needed.

Markus' suggestion for using U+10FFFF was as such an internal processing
sentinel, since the definition of the Unicode Collation Algorithm will
automatically give it the highest weight.

 while I was looking for a legal codepoint, and one that would not 'happen to
 be' the last, but would be 'defined as' last.

Actually, in the UCA, U+10FFFF is *defined as* last, by the nature of the
handling of weightings for unassigned code points.

But I understand that you may be looking for an interchangeable character
that would be defined as sorting last.

 
 Initially, I wanted to have such a codepoint, which would counterpart the
 underscore (_). Meaning, it would be a valid alpha character (one that is
 guaranteed to be accepted for identifiers, even as the first character), and
 would have a non-zero-width representation.

This is a contradictory requirement, as best I can tell.

The highest-weighted alpha characters in the current UCA table are all
the Han characters. So without tailoring, you'd pick U+2A6D6 (the last
character in CJK Unified Ideographs Extension B) as the highest weight. But then
any tailoring of Han ordering could, in principle, destabilize that. And any
future addition to the Han character encoding would certainly cause a
problem. So the highest-weighted alpha character cannot merely be the
currently highest-weighted alpha character.

Instead, as you surmise, you would have to have a special character, analogous
to the underscore, but weighted higher than any ordinary alphas. The
problem is that you would first have to pick such a beast and get
it accepted into the identifier syntax of all the programming languages.
The status of underscore is somewhat accidental in this regard, since
it represents an identifier hack to indicate multi-word identifiers
as units, and was grandfathered into many formal language syntaxes.
But its status as lowest weight is somewhat arbitrary, too. It certainly
is not the lowest weighted special in the Unicode Collation Algorithm,
so there is always the potential, by admission of new specials
into programming language identifier syntax, that underscore won't
sort lowest, either.

Also, you need to keep in mind that specials in the UCA behave differently
than you might presume from simple, single-pass weighted sortings that
only deal with primary weights. You are expecting:

_abc
a_bc
ab_c
abc
abc_
bbc

but in the UCA, the (default) ordering of those strings would be:

abc
_abc
a_bc
ab_c
abc_
bbc

since abc sorts before the same string with any special character in it,
the special being unweighted at the primary level.
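
To make the primary-level point concrete, here is a toy sketch. It assumes
the underscore is the only special and models "unweighted at the primary
level" by simply dropping it from the key; SPECIALS and primary_key are
hypothetical names. It does not model the lower levels that decide the order
among strings whose primary keys are equal.

```python
# Assumption: "_" is the only special character in play.
SPECIALS = {"_"}


def primary_key(s):
    """Toy primary key: specials contribute nothing at this level."""
    return "".join(ch for ch in s if ch not in SPECIALS)


strings = ["_abc", "a_bc", "ab_c", "abc", "abc_", "bbc"]
keys = {s: primary_key(s) for s in strings}

for s in strings:
    print(f"{s!r:8} -> primary key {keys[s]!r}")

# Strings that tie at the primary level; only lower levels can separate them.
tied_with_abc = [s for s in strings if keys[s] == "abc"]
print(tied_with_abc)
```

All five variants of abc share one primary key, so the position of the
underscore cannot pull any of them ahead of the others at this level, and
bbc still follows the whole group.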

And as regards symbols which could be used as specials to try to
get the highest weight behavior, they all sort lower than the alphas
in the current UCA tables, anyway. So some other route would be
required to fix that.

 
 Asmus Freytag [[EMAIL PROTECTED]] also noted that there could be use for
 such characters in user interfaces. However, for this type of usage, it
 would be preferred to have two zero-width, non-breaking characters, that
 would typically NOT be allowed in user input,

But this is the sort of thing for which you can use a non-interchanged
sentinel that has the appropriate weighting behavior, if you want.

 allowing the application to
 keep reserved items on top or bottom of a sorted list, also knowing that the
 user can never delete them or add an item with the same name, as long as
 these are screened at point of input. Things get more complicated if you
 allow reversed sort order, so I cannot say at this point whether or not
 anyone would really choose to use such an approach.
 
 The question would then be, if we pursue this issue, are we looking for a
 single character, that would counterpart the underscore, or are we looking
 for four characters, two alpha characters and two zero-width spaces? To
 allow for the latter, I now think that these would fit more in the General
 Punctuation block than in the Specials block.

I do not feel that this is an *encoding* issue at all. Nor is it even
an issue for the Unicode Collation Algorithm to define such a usage.

What you are looking for is something that could be agreed upon by
the programming 

Re: Collation - last character?

2002-03-15 Thread Michael (michka) Kaplan

Since collation depends on the language and not the code point or encoding
or anything else, there is no absolute last character that would be the last
character in every possible collation.


MichKa

Michael Kaplan
Trigeminal Software, Inc.  -- http://www.trigeminal.com/

- Original Message -
From: Lars Kristan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, March 15, 2002 9:21 AM
Subject: Collation - last character?



 Is there a character (codepoint), that is guaranteed to be sorted (collated)
 after all other codepoints?

 Like:

 _WantThisOneOnTop
 Able
 Baker
 NoMatterWhat
 ^WantThisOneOnBottom
 ^^and_so_on

 Where _ is the underscore, which is usually collated 'quite high'.
 And ^ is the hypothetical character I am querying about.


 Lars Kristan








Re: Collation - last character?

2002-03-15 Thread Kenneth Whistler

Lars Kristan asked:

 Is there a character (codepoint), that is guaranteed to be sorted (collated)
 after all other codepoints?
 
 Like:
 
 _WantThisOneOnTop
 Able
 Baker
 NoMatterWhat
 ^WantThisOneOnBottom
 ^^and_so_on
 
 Where _ is the underscore, which is usually collated 'quite high'.
 And ^ is the hypothetical character I am querying about.

ISO/IEC 14651 contains a special symbol S which is deliberately
left at the end of the list of all other primary-weighted symbols,
so that there will be a highest weight. You would still have to
tailor the table, to assign a particular character a high weight
making use of S or a weight tailored with respect to S,
since there is no highest character, per se, in the list. In the
amendment to 14651 under current ballot, S is still present.
In the default table, the highest weighted characters before S
are the Han characters, so that the last Extension B character would
be weighted high.

In the Unicode Collation Algorithm (UTS #10), there is no explicit
weight assigned corresponding to S, but a primary weight
assignment of 0xFFFF is guaranteed to be higher than that of
any Han character. (The Han character weights are constructed
synthetically based on first element primary weights in the
range 0xFF40..0xFFBF.) Once again, if you want a *character* to
correspond to that highest weight, then you have to tailor the
table to do so. But then, of course, you could assign any character
you want to have that highest weight value, including a private
use character or even a noncharacter code point.
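
A toy illustration of such a tailoring follows. It does not use real UCA
weights: it assumes 16-bit primary weights (so 0xFFFF is the maximum), places
a partial set of Han ranges at the start of the band mentioned above, and
falls back to the code point for everything else. The names primary, sort_key
and HAN_BAND are hypothetical.

```python
MAX_PRIMARY = 0xFFFF  # assumption: primaries are 16-bit, so this is the ceiling
HAN_BAND = 0xFF40     # start of the synthetic Han range mentioned above


def is_han(cp):
    """Toy test covering only part of the Han repertoire."""
    return 0x4E00 <= cp <= 0x9FA5 or 0x20000 <= cp <= 0x2A6D6


def primary(ch, tailoring):
    """Return a (first, second) primary pair; tailored entries win."""
    if ch in tailoring:
        return (tailoring[ch], 0)
    cp = ord(ch)
    if is_han(cp):
        return (HAN_BAND, cp)
    # Toy default for everything else: code point, kept below the Han band.
    return (min(cp, HAN_BAND - 1), 0)


def sort_key(s, tailoring):
    return tuple(primary(ch, tailoring) for ch in s)


words = ["\uE000last", "\u4E2D", "\U0002A6D6", "zebra"]

untailored = sorted(words, key=lambda s: sort_key(s, {}))
# Tailor a private use character to the highest primary weight.
tailored = sorted(words, key=lambda s: sort_key(s, {"\uE000": MAX_PRIMARY}))

print([w.encode("unicode_escape").decode() for w in untailored])
print([w.encode("unicode_escape").decode() for w in tailored])
```

Without the tailoring the private use character sorts ahead of the Han
characters in this toy table; with it, the same string moves to the end.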

--Ken




RE: Collation - last character?

2002-03-15 Thread Lars Kristan

Kenneth Whistler wrote:
 In the Unicode Collation Algorithm (UTS #10), there is no explicit
 weight assigned corresponding to S, but a primary weight
 assignment of 0xFFFF is guaranteed to be higher than that of
 any Han character.

Well, then I am proposing to introduce such a character. U+FFFD could be
used, but then why repeat the mistake of assigning two roles to a single
codepoint?

U+FFF0? My proposal for its rendering would be an overline. 7.1.2 of UTS #10
would state that U+FFF0 must have the highest possible collation weight for
any language (collation).

I don't have a specific need for such a character. It simply occurred to me
that it may prove to be useful to have it. If not for developers, then for
end users. Maybe for those who want to be the last in a telephone directory
(assuming it's in Unicode;), or for those who want a file to appear at the
bottom of a folder. Which is what I once wanted to have, hence the idea.


Lars Kristan




Re: Collation - last character?

2002-03-15 Thread Asmus Freytag

At 11:13 AM 3/15/02 -0800, you wrote:
Once again, if you want a *character* to
correspond to that highest weight, then you have to tailor the
table to do so. But then, of course, you could assign any character
you want to have that highest weight value, including a private
use character or even a noncharacter code point.

This works for people that do their own tailorings. What about users
that want to create a list such that certain items go to the top and
others to the bottom?

Unless an implementer provides some reasonable choices for such
characters, there seems little that users can do. And each implementer
would assign different characters, if any.

The need to have a default choice at the top of a list, or 'none of
the above' at the bottom of a list, is a pretty common task in user
interfaces. Perhaps it would be worth considering support for that
not just in the overall machinery of a tailored implementation but
already in the default weights, to encourage consistent behavior.

A./




Re: Collation - last character?

2002-03-15 Thread Markus Scherer

How about U+10FFFF?
It is a non-character, which gives it a high (unassigned character) weight in the UCA.
It is the highest code point = the last character.

It cannot be a Private-Use character, so few people will be tempted to tailor it to
something other than its default UCA weight.
It also sorts highest in a strcmp-style comparison in Unicode code point order.

I think that, at least in the ICU implementation of UCA, it will give you the highest
weight unless you tailor U+10FFFF.
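
The code point order part of this is easy to check. The sketch below relies
only on Python's built-in string comparison, which compares code point by
code point, and on the standard unicodedata module; it says nothing about
what a UCA implementation such as ICU does. U+10FFFF is the highest code
point, and the list contents are made up for illustration.

```python
import unicodedata

LAST = "\U0010FFFF"  # highest code point, a noncharacter

# Noncharacters carry the "unassigned" general category.
category = unicodedata.category(LAST)
print(category)

names = [
    "Baker",
    "\U0010FFFDuser",               # last private use code point
    "Able",
    LAST + "WantThisOneOnBottom",
    "\u9FA5",                       # a Han character
]

# str comparison in Python is by code point, like a code point order strcmp.
ordered = sorted(names)
print([n.encode("unicode_escape").decode() for n in ordered])
```

A comparison over UTF-16 code units would order supplementary code points
differently, so this holds for code point order specifically.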

markus