RE: Collation - last character?
TUS does not prevent anyone from putting noncharacter code points in Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads U+ is reserved for private program use as a sentinel or other signal. I would expect this to hold true for the noncharacters that were introduced later too. It may not fit your needs if you're looking for a character, but it is available for use by applications.

But it is *not* available to *users* to put into lists to make certain elements sort at the end.

When dealing with user-specified lists, I would, if possible, introduce some markup so that my application can deal with those two special cases (lowest/highest) as it wishes internally, without burdening the user with the need to enter a codepoint that is improbable in her everyday context.

YA
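The markup approach described above could look roughly like the sketch below: each list entry carries an out-of-band placement flag, and the application folds that flag into its ordering instead of asking the user to type a special code point. The names `Pin`, `Entry`, and `sort_entries` are hypothetical, and plain string comparison stands in for whatever collation the application really uses.

```python
from dataclasses import dataclass
from enum import IntEnum


class Pin(IntEnum):
    """Out-of-band placement flag (hypothetical markup, not a Unicode feature)."""
    TOP = 0
    NORMAL = 1
    BOTTOM = 2


@dataclass(frozen=True)
class Entry:
    label: str
    pin: Pin = Pin.NORMAL


def sort_entries(entries, descending=False):
    """Sort by label; pinned entries stay first/last in either direction."""
    by_label = lambda e: e.label  # stand-in for a real collation key
    top = sorted((e for e in entries if e.pin == Pin.TOP), key=by_label)
    normal = sorted((e for e in entries if e.pin == Pin.NORMAL),
                    key=by_label, reverse=descending)
    bottom = sorted((e for e in entries if e.pin == Pin.BOTTOM), key=by_label)
    return top + normal + bottom


entries = [
    Entry("Baker"),
    Entry("None of the above", Pin.BOTTOM),
    Entry("Able"),
    Entry("Default", Pin.TOP),
]
ascending = [e.label for e in sort_entries(entries)]
descending = [e.label for e in sort_entries(entries, descending=True)]
print(ascending)
print(descending)
```

Because the flag lives outside the label text, reversing the sort only affects the unpinned entries, and users cannot accidentally create or delete a pinned item by typing a particular character.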
Re: Collation - last character?
David Hopwood said:

At 09:01 AM 3/19/02 -0800, Yves Arrouye wrote:

TUS does not prevent anyone from putting noncharacter code points in Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads U+ is reserved for private program use as a sentinel or other signal.

But it is *not* available to *users* to put into lists to make certain elements sort at the end.

No, but U+1FFFD is.

Make that U+10FFFD, of course.

Incidentally, in case anyone is interested, in the default table for the Unicode Collation Algorithm, the character with the lowest primary weight (other than zero, or variables set to be ignorable) is:

02D0 ; [.081F.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON

That is the value in the current table (inclusive of the Unicode 3.0.1 repertoire). In the table that matches the current table under ballot for ISO 14651, extending the repertoire to Unicode 3.1.0, the same entry still has the lowest primary weight, but the absolute value has changed to:

02D0 ; [.09D3.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON

--Ken
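For readers who want to compare the two table entries quoted above, here is a minimal parsing sketch. It assumes a line holds exactly one code point and one collation element in the form shown in the message; the `parse_entry` helper and its field names are hypothetical and it does not cover the full table syntax.

```python
import re

# One code point, one collation element, optional "# NAME" comment.
ENTRY_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})\s*;\s*"      # code point
    r"\[([.*])([0-9A-Fa-f.]+)\]"          # [.pppp.ssss.tttt.xxxx] or [*...]
    r"\s*(?:#\s*(.*))?$"                  # optional comment with the name
)


def parse_entry(line):
    """Split a single-element table line into its numeric fields."""
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    code_point, marker, weights, name = m.groups()
    fields = [int(w, 16) for w in weights.split(".")]
    return {
        "code_point": int(code_point, 16),
        "variable": marker == "*",        # '*' marks a variable element
        "primary": fields[0],
        "secondary": fields[1],
        "tertiary": fields[2],
        "name": (name or "").strip(),
    }


old = parse_entry("02D0 ; [.081F.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON")
new = parse_entry("02D0 ; [.09D3.0020.0002.02D0] # MODIFIER LETTER TRIANGULAR COLON")
print(hex(old["primary"]), hex(new["primary"]))
```

Only the absolute primary value differs between the two entries; the secondary and tertiary fields are unchanged.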
RE: Collation - last character?
Markus Scherer wrote:

How about U+10? It is a non-character, which gives it a high (unassigned character) weight in the UCA. It is the highest code point = the last character.

That is definitely not what I was looking for. It is an illegal codepoint, while I was looking for a legal codepoint, and one that would not 'happen to be' the last, but would be 'defined as' last.

TUS does not prevent anyone from putting noncharacter code points in Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads U+ is reserved for private program use as a sentinel or other signal. I would expect this to hold true for the noncharacters that were introduced later too. It may not fit your needs if you're looking for a character, but it is available for use by applications.

YA
Re: Collation - last character?
Asmus Freytag wrote:

At 09:01 AM 3/19/02 -0800, Yves Arrouye wrote:

TUS does not prevent anyone from putting noncharacter code points in Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads U+ is reserved for private program use as a sentinel or other signal. I would expect this to hold true for the noncharacters that were introduced later too. It may not fit your needs if you're looking for a character, but it is available for use by applications.

But it is *not* available to *users* to put into lists to make certain elements sort at the end.

No, but U+1FFFD is.

-- David Hopwood
RE: Collation - last character?
Markus Scherer wrote:

How about U+10? It is a non-character, which gives it a high (unassigned character) weight in the UCA. It is the highest code point = the last character.

That is definitely not what I was looking for. It is an illegal codepoint, while I was looking for a legal codepoint, and one that would not 'happen to be' the last, but would be 'defined as' last.

Initially, I wanted to have such a codepoint, which would be the counterpart of the underscore (_). Meaning, it would be a valid alpha character (one that is guaranteed to be accepted for identifiers, even as the first character), and would have a non-zero-width representation.

Asmus Freytag [[EMAIL PROTECTED]] also noted that there could be use for such characters in user interfaces. However, for this type of usage, it would be preferred to have two zero-width, non-breaking characters, that would typically NOT be allowed in user input, allowing the application to keep reserved items on top or bottom of a sorted list, also knowing that the user can never delete them or add an item with the same name, as long as these are screened at point of input. Things get more complicated if you allow reversed sort order, so I cannot say at this point whether or not anyone would really choose to use such an approach.

The question would then be, if we pursue this issue, are we looking for a single character that would be the counterpart of the underscore, or are we looking for four characters, two alpha characters and two zero-width spaces? To allow for the latter, I now think that these would fit better in the General Punctuation block than in the Specials block.

Lars Kristan
RE: Collation - last character?
Lars Kristan responded:

Markus Scherer wrote:

How about U+10? It is a non-character, which gives it a high (unassigned character) weight in the UCA. It is the highest code point = the last character.

That is definitely not what I was looking for. It is an illegal codepoint,

Not exactly. What ISO/IEC 10646 says is that [10] shall not be used [for representing graphic characters]. So you cannot have an interchangeable character encoding there, but that doesn't mean that the code *position* per se is illegal -- it is part of the 10646 architecture.

In Unicode terminology, U+10 is a non-character (see Unicode 3.1 for details). You cannot exchange any interpretation of that, but that doesn't prevent you from using it (or U+, or any other non-character code point) for internal processing purposes, as needed.

Markus' suggestion for using U+10 was as such an internal processing sentinel, since the definition of the Unicode Collation Algorithm will automatically give it the highest weight.

while I was looking for a legal codepoint, and one that would not 'happen to be' the last, but would be 'defined as' last.

Actually, in the UCA, U+10 is *defined as* last, by the nature of the handling of weightings for unassigned code points. But I understand that you may be looking for an interchangeable character that would be defined as sorting last.

Initially, I wanted to have such a codepoint, which would counterpart the underscore (_). Meaning, it would be a valid alpha character (one that is guaranteed to be accepted for identifiers, even as the first character), and would have a non-zero-width representation.

This is a contradictory requirement, as best I can tell. The highest-weighted alpha characters in the current UCA table are all the Han characters. So without tailoring, you'd pick U+2A6D6 (the last character in CJK Vertical Extension B) as the highest weight. But then any tailoring of Han ordering could, in principle, destabilize that.
And any future addition to the Han character encoding would certainly cause a problem. So the highest-weighted alpha character cannot merely be the currently highest-weighted alpha character. Instead, as you surmise, you would have to have a special, analogous to the underscore, but weighted higher than any ordinary alphas. The problem is that you would first have to pick such a beast and get it accepted into the identifier syntax of all the programming languages.

The status of underscore is somewhat accidental in this regard, since it represents an identifier hack to indicate multi-word identifiers as units, and was grandfathered into many formal language syntaxes. But its status as lowest weight is somewhat arbitrary, too. It certainly is not the lowest weighted special in the Unicode Collation Algorithm, so there is always the potential, by admission of new specials into programming language identifier syntax, that underscore won't sort lowest, either.

Also, you need to keep in mind that specials in the UCA behave differently than you might presume from simple, single-pass weighted sortings that only deal with primary weights. You are expecting:

_abc
a_bc
ab_c
abc
abc_
bbc

but in the UCA, the (default) ordering of those strings would be:

abc
_abc
a_bc
ab_c
abc_
bbc

since abc and the same string with any special character in it compare as equal at the primary level, where the special character is unweighted.

And as regards symbols which could be used as specials to try to get the highest weight behavior, they all sort lower than the alphas in the current UCA tables, anyway. So some other route would be required to fix that.

Asmus Freytag [[EMAIL PROTECTED]] also noted that there could be use for such characters in user interfaces.
However, for this type of usage, it would be preferred to have two zero-width, non-breaking characters, that would typically NOT be allowed in user input,

But this is the sort of thing for which you can use a non-interchanged sentinel that has the appropriate weighting behavior, if you want.

allowing the application to keep reserved items on top or bottom of a sorted list, also knowing that the user can never delete them or add an item with the same name, as long as these are screened at point of input. Things get more complicated if you allow reversed sort order, so I cannot say at this point whether or not anyone would really choose to use such an approach.

The question would then be, if we pursue this issue, are we looking for a single character, that would counterpart the underscore, or are we looking for four characters, two alpha characters and two zero-width spaces? To allow for the latter, I now think that these would fit more in the General Punctuation block than in the Specials block.

I do not feel that this is an *encoding* issue at all. Nor is it even an issue for the Unicode Collation Algorithm to define such a usage. What you are looking for is something that could be agreed upon by the programming
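The difference between the two orderings of the _abc strings in the message above can be illustrated with a toy comparison. This is only a sketch and not the UCA: it treats the underscore as the sole "special", drops it for the primary comparison, and breaks ties by the position of the underscore, a simplification chosen here purely to reproduce the ordering given in the message. The helper names are hypothetical.

```python
SPECIALS = "_"  # toy assumption: underscore is the only special character


def primary_key(s):
    """Primary-level view of a string: specials carry no weight."""
    return "".join(ch for ch in s if ch not in SPECIALS)


def toy_key(s):
    """Primary comparison first; underscore position only breaks ties."""
    return (primary_key(s), s.find("_"))  # find() is -1 when absent


words = ["abc_", "bbc", "_abc", "abc", "ab_c", "a_bc"]

naive = sorted(words)             # single-pass, code point order
toy = sorted(words, key=toy_key)  # specials ignored at the primary level

print(naive)
print(toy)
```

In the single-pass ordering the underscore participates in every comparison, so `_abc` comes first. In the toy two-level ordering all five abc variants are equal at the primary level, and the underscore only decides among them afterwards.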
Re: Collation - last character?
Since collation depends on the language and not the code point or encoding or anything else, there is no absolute last character that would be the last character in every possible collation?

MichKa

Michael Kaplan
Trigeminal Software, Inc. -- http://www.trigeminal.com/

- Original Message -
From: Lars Kristan [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, March 15, 2002 9:21 AM
Subject: Collation - last character?

Is there a character (codepoint) that is guaranteed to be sorted (collated) after all other codepoints? Like:

_WantThisOneOnTop
Able
Baker
NoMatterWhat
^WantThisOneOnBottom
^^and_so_on

Where _ is the underscore, which is usually collated 'quite high'. And ^ is the hypothetical character I am querying about.

Lars Kristan
Re: Collation - last character?
Lars Kristan asked:

Is there a character (codepoint) that is guaranteed to be sorted (collated) after all other codepoints? Like:

_WantThisOneOnTop
Able
Baker
NoMatterWhat
^WantThisOneOnBottom
^^and_so_on

Where _ is the underscore, which is usually collated 'quite high'. And ^ is the hypothetical character I am querying about.

ISO/IEC 14651 contains a special symbol S which is deliberately left at the end of the list of all other primary-weighted symbols, so that there will be a highest weight. You would still have to tailor the table, to assign a particular character a high weight making use of S or a weight tailored with respect to S, since there is no highest character, per se, in the list. In the amendment to 14651 under current ballot, S is still present. In the default table, the highest weighted characters before S are the Han characters, so that the last Extension B character would be weighted high.

In the Unicode Collation Algorithm (UTS #10), there is no explicit weight assigned corresponding to S, but a primary weight assignment of 0x is guaranteed to be higher than that of any Han character. (The Han character weights are constructed synthetically based on first element primary weights in the range 0xFF40..0xFFBF.)

Once again, if you want a *character* to correspond to that highest weight, then you have to tailor the table to do so. But then, of course, you could assign any character you want to have that highest weight value, including a private use character or even a noncharacter code point.

--Ken
RE: Collation - last character?
Kenneth Whistler wrote:

In the Unicode Collation Algorithm (UTS #10), there is no explicit weight assigned corresponding to S, but a primary weight assignment of 0x is guaranteed to be higher than that of any Han character.

Well, then I am proposing to introduce such a character. U+FFFD could be used, but then why repeat the mistake of assigning two roles to a single codepoint? U+FFF0? My proposal for its rendering would be an overline. 7.1.2 of UTS #10 would state that U+FFF0 must have the highest possible collation weight for any language (collation).

I don't have a specific need for such a character. It simply occurred to me that it may prove to be useful to have it. If not for developers, then for end users. Maybe for those who want to be the last in a telephone directory (assuming it's in Unicode;), or for those who want a file to appear at the bottom of a folder. Which is what I once wanted to have, hence the idea.

Lars Kristan
Re: Collation - last character?
At 11:13 AM 3/15/02 -0800, you wrote:

Once again, if you want a *character* to correspond to that highest weight, then you have to tailor the table to do so. But then, of course, you could assign any character you want to have that highest weight value, including a private use character or even a noncharacter code point.

This works for people that do their own tailorings.

What about users that want to create a list such that certain items go to the top and others to the bottom? Unless an implementer provides some reasonable choices for such characters, there seems little that users can do. And each implementer would assign different characters, if any.

The need to have a default choice at the top of a list, or 'none of the above' at the bottom of a list, is a pretty common task in user interfaces. Perhaps it would be worth considering support for that not just in the overall machinery of a tailored implementation but already in the default weights, to encourage consistent behavior.

A./
Re: Collation - last character?
How about U+10? It is a non-character, which gives it a high (unassigned character) weight in the UCA. It is the highest code point = the last character. It cannot be a Private-Use character, so few people will be tempted to tailor it to something other than its default UCA weight. It also sorts highest in a Unicode-code point order-strcmp. I think that at least in the ICU implementation of UCA, except if you tailor U+10, it will give you the highest weight. markus
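The code point order claim in the message above can be checked with a short sketch. It assumes the code point meant is U+10FFFF, the highest Unicode code point and a noncharacter; the archived text has lost part of the number, so that reading is an assumption. It only demonstrates binary ordering, not UCA behavior in ICU or any other implementation. Python compares strings by code point, and UTF-8 byte order agrees with code point order.

```python
# Assumption: the sentinel discussed is U+10FFFF (highest code point).
LAST = "\U0010FFFF"

items = [
    "Baker",
    LAST + "WantThisOneOnBottom",   # prefixed with the sentinel
    "Able",
    "\U0010FFFDprivate-use",        # last private-use code point, for contrast
    "_WantThisOneOnTop",
]

by_code_point = sorted(items)                              # code point order
by_utf8 = sorted(items, key=lambda s: s.encode("utf-8"))   # UTF-8 byte order

for s in by_code_point:
    print([hex(ord(s[0])), s[1:]])
```

Note that in plain code point order the underscore sorts after the uppercase letters, so `_WantThisOneOnTop` does not land on top here; only the sentinel-prefixed entry is guaranteed its position at the end.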