Lars Kristan responded:

> Markus Scherer wrote:
> > How about U+10ffff?
> > It is a non-character, which gives it a high (unassigned
> > character) weight in the UCA. It is the highest code point =
> > "the last character".
>
> That is definitely not what I was looking for. It is an illegal codepoint,

Not exactly. What ISO/IEC 10646 says is that [10FFFF] shall not be
used [for representing graphic characters]. So you cannot have an
interchangeable character encoding there, but that doesn't mean that
the code *position* per se is illegal -- it is part of the 10646
architecture.

In Unicode terminology, U+10FFFF is a "non-character" (see Unicode 3.1
for details). You cannot exchange any interpretation of that, but that
doesn't prevent you from using it (or U+FFFF, or any other
non-character code point) for internal processing purposes, as needed.

Markus' suggestion for using U+10FFFF was as such an internal
processing sentinel, since the definition of the Unicode Collation
Algorithm will automatically give it the highest weight.

> while I was looking for a legal codepoint, and one that would not 'happen to
> be' the last, but would be 'defined as' last.

Actually, in the UCA, U+10FFFF is *defined as* last, by the nature of
the handling of weightings for unassigned code points. But I
understand that you may be looking for an interchangeable character
that would be defined as sorting last.

>
> Initially, I wanted to have such a codepoint, which would counterpart the
> underscore (_). Meaning, it would be a valid alpha character (one that is
> guaranteed to be accepted for identifiers, even as the first character), and
> would have a non-zero-width representation.

This is a contradictory requirement, as best I can tell.

The highest-weighted alpha characters in the current UCA table are all
the Han characters. So without tailoring, you'd pick U+2A6D6 (the last
character in CJK Vertical Extension B) as the highest weight. But then
any tailoring of Han ordering could, in principle, destabilize that.
And any future addition to the Han character encoding would certainly
cause a problem. So the highest-weighted alpha character cannot merely
be the currently highest-weighted alpha character.
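
To make the weight comparison concrete, here is a rough sketch of how implicit primary weights could be derived for U+2A6D6 and U+10FFFF. It is not a collator. The formula and the two base values are assumptions taken from recent versions of UTS #10 (the UCA in use at the time of this message may have computed them differently), and it assumes a non-character such as U+10FFFF is weighted like an unassigned code point. The function and constant names are hypothetical.

```python
# Toy sketch of UCA-style implicit primary weights (not a full collator).
# Assumption: formula and base values as in recent versions of UTS #10.

BASE_HAN_EXTENSION = 0xFB80  # assumed base for Han characters in extension blocks
BASE_UNASSIGNED = 0xFBC0     # assumed base for unassigned code points


def implicit_weights(cp, base):
    """Return the pair of 16-bit implicit primary weights for code point cp."""
    aaaa = base + (cp >> 15)          # high bits of the code point, offset by the base
    bbbb = (cp & 0x7FFF) | 0x8000     # low 15 bits, with the top bit forced on
    return (aaaa, bbbb)


last_ext_b = implicit_weights(0x2A6D6, BASE_HAN_EXTENSION)
sentinel = implicit_weights(0x10FFFF, BASE_UNASSIGNED)

print([hex(w) for w in last_ext_b])
print([hex(w) for w in sentinel])
print(sentinel > last_ext_b)  # the sentinel outweighs the last Extension B character
```

Under these assumptions the sentinel's weights compare higher than those of any Han character, whatever Han characters are added later, which is why it works as an internal "sorts last" marker while an ordinary alpha does not.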

Instead, as you surmise, you would have to have a "special", analogous
to the underscore, but weighted higher than any ordinary alphas.

The problem is that you would first have to pick such a beast and get
it accepted into the identifier syntax of all the programming
languages. The status of underscore is somewhat accidental in this
regard, since it represents an identifier hack to indicate multi-word
identifiers as units, and was grandfathered into many formal language
syntaxes. But its status as "lowest" weight is somewhat arbitrary,
too. It certainly is not the lowest weighted "special" in the Unicode
Collation Algorithm, so there is always the potential, by admission of
new specials into programming language identifier syntax, that
underscore won't sort lowest, either.

Also, you need to keep in mind that specials in the UCA behave
differently than you might presume from simple, single-pass weighted
sortings that only deal with primary weights. You are expecting:

  _abc
  a_bc
  ab_c
  abc
  abc_
  bbc

but in the UCA, the (default) ordering of those strings would be:

  abc
  _abc
  a_bc
  ab_c
  abc_
  bbc

since "abc" < the same string with any special character in it
unweighted at the primary level.

And as regards symbols which could be used as specials to try to get
the highest weight behavior, they all sort lower than the alphas in
the current UCA tables, anyway. So some other route would be required
to fix that.

>
> Asmus Freytag also noted that there could be use for
> such characters in user interfaces. However, for this type of usage, it
> would be preferred to have two zero-width, non-breaking characters, that
> would typically NOT be allowed in user input,

But this is the sort of thing for which you can use a non-interchanged
sentinel that has the appropriate weighting behavior, if you want.
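
The difference between the two orderings of the "abc" strings can be illustrated with a toy model. This is only a sketch, not the UCA: the first key treats "_" as having a primary weight below every other character, while the second ignores "_" on the first pass and breaks ties afterwards. The tie-breaker (the positions of the specials) is an assumption chosen to stand in for the UCA's lower levels, and the names `SPECIALS`, `simple_key` and `uca_like_key` are hypothetical.

```python
SPECIALS = {"_"}  # hypothetical set of "special" characters for this toy model


def simple_key(s):
    # Single-pass ordering: a special gets a primary weight below all other characters.
    return [(0 if ch in SPECIALS else 1, ord(ch)) for ch in s]


def uca_like_key(s):
    # First pass: specials carry no primary weight, so they are skipped.
    primary = [ord(ch) for ch in s if ch not in SPECIALS]
    # Tie-break (assumption): positions of the specials; a string with none sorts first.
    positions = [i for i, ch in enumerate(s) if ch in SPECIALS]
    return (primary, positions)


words = ["bbc", "abc_", "abc", "ab_c", "a_bc", "_abc"]

print(sorted(words, key=simple_key))
# ['_abc', 'a_bc', 'ab_c', 'abc', 'abc_', 'bbc']

print(sorted(words, key=uca_like_key))
# ['abc', '_abc', 'a_bc', 'ab_c', 'abc_', 'bbc']
```

In the second ordering every variant of "abc" ties on the first pass, so the underscore only matters as a tie-breaker and can no longer pull a string ahead of plain "abc".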

> allowing the application to
> keep reserved items on top or bottom of a sorted list, also knowing that the
> user can never delete them or add an item with the same name, as long as
> these are screened at point of input. Things get more complicated if you
> allow reversed sort order, so I cannot say at this point whether or not
> anyone would really choose to use such an approach.
>
> The question would then be, if we pursue this issue, are we looking for a
> single character, that would counterpart the underscore, or are we looking
> for four characters, two alpha characters and two zero-width spaces? To
> allow for the latter, I now think that these would fit more in the General
> Punctuation block than in the Specials block.

I do not feel that this is an *encoding* issue at all. Nor is it even
an issue for the Unicode Collation Algorithm to define such a usage.

What you are looking for is something that could be agreed upon by the
programming language communities as:

  1. a symbol, from among the vast collection already encoded in
     Unicode, that would be agreed to by the programming language
     communities as acceptable in identifiers, as is "_".

  2. when using a simple, single-level ordering (e.g., for sorting
     menu-items), would be given a primary weight above all alphas,
     as "_" would be given a primary weight below all alphas.

  3. when using a multi-level, sophisticated ordering according to
     the UCA, would also be given a primary weight above all alphas,
     as "_" would be given a primary weight below all alphas, so as
     to preserve the expected behavior, while allowing all the
     sophistication of language-specific ordering behavior for
     sorting lists.

In either case, whether doing simple sorting or complex, multi-level
sorting, you are talking about some tailored behavior here. You can't
just sort on code point order, and you cannot simply use the Unicode
Collation Algorithm without tailoring to get the effects you want.
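
Point 2 above, the simple single-level case, might be sketched like this. It is a minimal illustration under stated assumptions, not a recommendation: U+221E INFINITY is used purely as a placeholder for the "sorts last" symbol, and `LOW_SPECIAL`, `HIGH_SPECIAL` and `tailored_key` are hypothetical names. Everything other than the two specials falls back to code point order, which a real tailoring would replace with proper alphabetic weights.

```python
LOW_SPECIAL = "_"        # given a primary weight below all alphas
HIGH_SPECIAL = "\u221e"  # placeholder symbol given a primary weight above all alphas


def tailored_key(s):
    # Rank each character: low special, then everything else, then high special.
    def rank(ch):
        if ch == LOW_SPECIAL:
            return 0
        if ch == HIGH_SPECIAL:
            return 2
        return 1

    return [(rank(ch), ord(ch)) for ch in s]


items = ["\u221ex", "\u4e2d", "abc", "_a"]

# Plain code point order: the Han character U+4E2D lands after the symbol.
print(sorted(items) == ["_a", "abc", "\u221ex", "\u4e2d"])  # True

# Tailored order: the high special now sorts after every other character.
print(sorted(items, key=tailored_key) == ["_a", "abc", "\u4e2d", "\u221ex"])  # True
```

The comparison with plain `sorted` shows why code point order alone is not enough: any character encoded above the chosen symbol would otherwise sort after it.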

By the way, my suggestion for an appropriate, already encoded symbol
to meet your requirements would be U+221E INFINITY. ;-)

Or how about U+261F WHITE DOWN POINTING INDEX, if you want something
more iconic?

--Ken