Re: Upside Down Fu character
This discussion has veered close enough to my pet project (the Shwa script) for me to comment. I plan to implement it in the PUA, both to demonstrate its value and viability and to find the problems and correct them before they get frozen into Unicode (if ever). I accept that this means an eventual recoding, which I hope can be applied without changing the last two hex digits of each code point, e.g. by changing just the pages. That seems to me a reasonable and viable route to inclusion.

What worries me is the number of technical decisions that will have been made long before any application for inclusion in Unicode. It seems to me there is a risk that those decisions will then become written in stone for compatibility reasons, without ever having been exposed to the debate they would engender on this list, for instance. There seems to be a gap in the process.

One solution would be to institute a sandbox in the PUA, inspired by the ConScript Unicode Registry but with non-permanent entries. There, new characters and scripts could enjoy widespread use and delayed stability until they were ready for inclusion, while at the same time profiting from the scrutiny of the Unicode community. If this "Unicode Prep" existed, you could assign Upside-Down Fu to it without much hesitation, and see who uses it in plain text for a few years. If accepted, it would have to be recoded, but that might be the least of all evils.

On Mon, Jan 9, 2012 at 9:23 PM, Asmus Freytag asm...@ix.netcom.com wrote:

On 1/9/2012 2:52 AM, vanis...@boil.afraid.org wrote:

From: Asmus Freytag asmusf_at_ix.netcom.com

I have no opinion on the Upside-down FU ideograph as a candidate for encoding, but I think any analysis of its merits needs to be more nuanced than what your message seemed to imply. A./

While I generally agree with your more nuanced view on this matter, Asmus, I'm afraid I have to disagree in this particular case.
The upside-down Fu has been used decoratively for a thousand years (it's a Chinese pun), and if anyone had wanted to use it in plain text, they would have by now. With a character of such antiquity, there really is no question of computer technology suppressing its use. Put simply, people have either used this character in plain text, or they haven't. If someone can dig up a couple of example texts, then there is no question. If nobody can find such texts, I think that speaks volumes about the utility of the character and its suitability for encoding. -Van

Van, I wrote "I have no opinion..." Reading your reply may nudge me closer to having an opinion :) And, for the record, I think what you wrote is rather nuanced. If there's a smoking gun, I'm sure that would settle the encodability question, and given the history of the character, you make a good argument that searching for plain text in pre-digital technologies is feasible, and appropriate.

Still, I'm interested in the general issue: what to do about a (hypothetical) character, or a hypothetical new use for an otherwise existing character, that doesn't have the benefit of first having been around for thousands of years in an age of hand-lettering or hot-metal print. If Unicode wants to be the only game in town, and if non-digital text is disappearing as a medium, how does one address innovation without leaving broken (non-supportable) digital data as a prerequisite?

Currency symbols have been given an exemption from this chicken-and-egg conundrum, because everyone realizes that using temporary encodings for them until their use is established is unreasonable. So, my question remains: are there any other avenues besides hot-metal printed text and compatibility encodings to demonstrate that a character (not this example, but in general) is a viable candidate? A./
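The recoding Peter hopes for in the opening message, keeping the last two hex digits of each code point and changing only the "pages", is simple code-point arithmetic. A minimal sketch in Python; both block bases are invented for illustration (Shwa has no assigned block):

```python
# Sketch: remap a PUA code point to a hypothetical final allocation,
# preserving the last two hex digits and changing only the "page".
# Both bases below are invented for illustration.
OLD_BASE = 0xE000    # where the script sits in the PUA
NEW_BASE = 0x1B600   # a hypothetical final block

def recode(cp):
    # Only code points within the provisional 256-code-point block qualify.
    assert OLD_BASE <= cp < OLD_BASE + 0x100
    return NEW_BASE | (cp & 0xFF)   # the low byte survives the move

print(hex(recode(0xE042)))  # 0x1b642
```

Under this scheme, converting existing PUA text to the final encoding is a per-character table-free transform, which is what makes the "least of all evils" recoding tolerable.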
Re: Sorting and Volapük
German does both, so there may be a CLDR locale for the choice you need.

On Sun, Jan 1, 2012 at 4:27 PM, Michael Everson ever...@evertype.com wrote:

Swedish and Finnish treat ä and ö as separate letters of the alphabet, but sort them at the end, after z. Volapük sorts a ä b c d e f g h i j k l m n o ö p r s t u ü v w x y z, with ä a separate letter after a, ö separate after o, and ü separate after u. There is as yet no CLDR locale for Volapük... does anyone know if any other language treats ä/ö/ü in the same way?

Michael Everson * http://www.evertype.com/
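Pending a CLDR tailoring, the order Everson describes can be sketched directly as a sort key. This is a primary-level illustration only (no secondary weights, no case handling beyond lowercasing):

```python
# Sketch: primary-level Volapük sort, treating ä, ö, ü as separate
# letters after a, o, u, per Everson's description. (The alphabet
# string omits q, which the quoted Volapük order does not include.)
VOLAPUK_ALPHABET = "aäbcdefghijklmnoöprstuüvwxyz"
RANK = {c: i for i, c in enumerate(VOLAPUK_ALPHABET)}

def volapuk_key(word):
    # Characters outside the alphabet sort last; a real collation
    # would need far more care (case, accents, multi-level weights).
    return [RANK.get(c, len(VOLAPUK_ALPHABET)) for c in word.lower()]

words = ["öl", "ok", "üt", "ut", "äb", "ab", "zal"]
print(sorted(words, key=volapuk_key))
# ['ab', 'äb', 'ok', 'öl', 'ut', 'üt', 'zal']
```

In ICU/CLDR terms this corresponds to a tailoring along the lines of `&a<ä &o<ö &u<ü`, which is how such a Volapük locale would presumably be expressed if one were submitted.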
Re: Sorting and German (was: Sorting and Volapük)
Sounds like Michael could use the Austrian system.

On Sun, Jan 1, 2012 at 6:46 PM, Otto Stolz otto.st...@uni-konstanz.de wrote:

Happy New Year,

on Sun, Jan 1, 2012 at 4:27 PM, Michael Everson ever...@evertype.com wrote: Volapük sorts [...] ä a separate letter after a, ö separate after o, and ü separate after u. does anyone know if any other language treats ä/ö/ü in the same way?

Am 2012-01-01 16:54, schrieb Peter Cyrus: German does both,

Not really. According to DIN 5007, German features two different sort orders:
• In lists of personal names, Ä, Ö, Ü may be sorted as AE, OE, and UE, respectively; this order is mainly used in telephone directories.
• In dictionaries and encyclopedias, Ä, Ö, Ü are sorted as A, O, and U, respectively.

As encyclopedias may well comprise personal names, the scope of the former scheme is not well defined, imho, and I stick to the latter one whenever I have to sort a list. In both schemes, ß is sorted as SS. In both schemes, true A (or AE, respectively) goes before Ä iff two sort keys are otherwise identical; likewise for Ö, Ü, and ß.

In Austria, a third scheme is used in telephone directories (but not in the yellow pages): here, Ä, Ö, and Ü are indeed treated as distinct letters, to go between A and B, O and P, and U and V, respectively; and ß is treated as a distinct pair of letters, to go between SS and ST.

Best wishes, Otto Stolz
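The two DIN 5007 orders Otto describes can be sketched as simple sort keys. Primary level only; the "true A before Ä iff the keys are otherwise identical" tie-break would need a secondary weight and is omitted here:

```python
# Sketch of the two DIN 5007 sort orders, as primary-level sort keys.
UMLAUTS_AS_BASE = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
UMLAUTS_AS_PAIRS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def dictionary_key(s):
    # Dictionaries and encyclopedias: Ä sorts as A, Ö as O, Ü as U.
    return s.lower().translate(UMLAUTS_AS_BASE)

def phonebook_key(s):
    # Lists of personal names (telephone directories): Ä sorts as AE, etc.
    return s.lower().translate(UMLAUTS_AS_PAIRS)

names = ["Goch", "Göbel"]
print(sorted(names, key=dictionary_key))  # ['Göbel', 'Goch']
print(sorted(names, key=phonebook_key))   # ['Goch', 'Göbel']
```

The same pair of names sorts differently under the two schemes, which is exactly why the "German does both" answer is not enough to pick a CLDR locale without knowing which order is wanted.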
empty codepoints at 00 and FF?
I'm sure this is explained somewhere, but I can't find it. Must blocks leave the first and last codepoints unassigned?
Re: missing characters: combining marks above runs of more than 2 base letters
Ken, you mention defined markup constructions, but nothing would prevent specialized rendering software from, for example, connecting a left half mark with the corresponding right half mark via a titlo, even though the text is still only plain text with no markup, right? The titlo would simply not display as such in the absence of the right software.

On Fri, Nov 18, 2011 at 8:03 PM, Ken Whistler k...@sybase.com wrote:

On 11/17/2011 11:28 PM, Philippe Verdy wrote: Could the Unicode text specify that a left half mark, when it is followed by a right half mark on the same line, has to be joined? And which character can we select in a font to mark the intermediate characters between them?

No. This kind of stuff is not plain text. Mathematicians and musical scorers long ago got over the notion that marking of scoped constructs (with beams and ties in music, and similar kinds of scoping for expressions in math) could be plain text. People who score text, whether metricians, prosodic analysts, or phoneticians, need to learn the same lesson.

Unicode is not a repository for text-scoring hacks, with the expectation that all of the rendering implementations will quietly incorporate this kind of complexity into their already complex requirements for plain text rendering of writing systems. People who need to score text will have to make use of specialized rendering software and defined markup constructions, just like the mathematicians and musicians do. --Ken
Re: missing characters: combining marks above runs of more than 2 base letters
I mention it because I think plain text could at least indicate what _should_ display (instead of what *does* display), and a rich environment could make the same text look great. I think we'll all need, for a long time to come, to write text that displays adequately as plain text in the absence of even OpenType advanced typography features.

On Fri, Nov 18, 2011 at 8:46 PM, Ken Whistler k...@sybase.com wrote:

On 11/18/2011 11:21 AM, Peter Cyrus wrote: Ken, you mention defined markup constructions, but nothing would prevent specialized rendering software from, for example, connecting a left half mark with the corresponding right half mark via a titlo, even though the text is still only plain text with no markup, right? The titlo would simply not display as such in the absence of the right software.

Correct. Specialized rendering software can pretty much do whatever its programmers want it to do. But there would be no reason to limit that to what it could do with the hacky left- and right-half marks, either. Specialized rendering software could detect a sequence letter, combining titlo, letter, letter, decide that the three letters constituted a Cyrillic number, and draw the titlo over all three letters as well. Or a specialized Cyrillic font could contain ligatures which would do the same, without requiring specialized code in a rendering engine.

The problem is that there will be people who expect such specialized rendering to be specified *in* the standard and to be supported by *non*-specialized rendering engines and fonts, because their multi-letter titlos don't display correctly when posted on websites and viewed by people who don't have specialized rendering software or specialized fonts. That's when the answer has to be no.
At that point, the responsibility really falls on the folks who need to score text to define the higher-level protocols to do so, and then convince the people who want to support that kind of text convention to do the implementation(s) required to make it happen. --Ken
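Ken's scenario of specialized software detecting a "letter, combining titlo, letter, letter" sequence can be sketched as a simple pattern match. This is only an illustration of the detection step, not real layout code, and the basic Cyrillic range used here is an assumption about what such software would count as a letter:

```python
import re

# Sketch: find the "letter, combining titlo, letter, letter" runs that
# a specialized layout pass could then render with one titlo drawn
# across all three letters. U+0483 is COMBINING CYRILLIC TITLO.
CYR = r"[\u0400-\u04FF]"  # assumption: the basic Cyrillic block is "a letter"
cyrillic_number = re.compile(rf"({CYR})\u0483({CYR}{{2}})")

def find_titlo_runs(text):
    # Return (start offset, the three bare letters) for each detected run.
    return [(m.start(), m.group(1) + m.group(2))
            for m in cyrillic_number.finditer(text)]

sample = "а\u0483вг"  # a three-letter run with the titlo after the first letter
print(find_titlo_runs(sample))
```

The point of the sketch is Ken's: the detection is easy for software that has decided to do it, and nothing in the plain text itself needs to change; the hard part is that no *non*-specialized engine is obliged to do any of this.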
more flexible pipeline for new scripts and characters
I've only been on this list for some months, and I only came to it with my own little project in mind, but it occurs to me, as I follow all these threads, that Unicode might benefit from a more flexible process of adaptation, of Unicodification. The model would be an asymptotic approach to standardization that tolerated an amount of change in inverse proportion to the time elapsed.

In other words, people could propose a new script or character, and rather than have it discussed before encoding and then encoded in permanence, with no possibility even to correct obvious errors (as in U+FE18), it would be provisionally accepted but still subject to modification as implementors worked with it. Hopefully, most mistakes would be unearthed early and corrections applied before much text had been encoded. As time passed and the encoding became more stable, the size of mistake open to correction would be reduced, e.g. to spelling errors, until it was frozen as a result of this process before being declared permanent.

My thought is that some of the problems I've seen discussed might have been discovered and addressed had a community been using the proposed standard before it became immutable. In the current process, that transition may occur too early to be useful. It may be easier to fix all the existing text if very little time has passed than to fix all future text forever.

This idea could also be extended to new characters and scripts that might or might not make it into Unicode: Unicode could offer a provisional acceptance that allowed users to demonstrate the utility of proposed changes in use, even if they're later modified or withdrawn. This policy might have prevented the recoding of Tengwar, Cirth, Shavian, Phaistos Disc and Deseret as they moved from the PUA to the SMP.
It seems to me that the current policy is intended to offer implementors a guarantee of our best effort, to save them the trouble of chasing down problems; but in fact they might prefer having some problems fixed early (while the developers still remember the application) to having to take unfixed problems into account forever. This idea is WAY beyond my expertise, but I thought I'd mention it for you all to consider.
Re: more flexible pipeline for new scripts and characters
I guess what I'm proposing is that proposed allocations be implemented, so that problems may be unearthed, even as the users accept that the standard is still only provisional.

On Wed, Nov 16, 2011 at 3:25 PM, Asmus Freytag asm...@ix.netcom.com wrote:

Peter, in principle, the idea of a provisional status is a useful concept whenever one wants to publish something based on potentially doubtful or possibly incomplete information. And you are correct that, in principle, such an approach could be most useful whenever there's no possibility of correcting some decision taken in standardization. Unicode knows the concept of a provisional property, which works roughly in the manner you suggested.

However, for certain types of information to be standardized, in particular the code allocation and character names, it would be rather problematic to have an extended provisional status. The reason is that once something is exposed in an implementation, it enables users to create documents. These documents would all have to be provisional, because they would become obsolete once a final (corrected or improved) code allocation were made. The whole reason that some aspects of character encoding are "write once" (can never be changed) is to prevent such obsolete data in documents. Therefore, the only practical way is that of having a bright line between proposed allocations (which are not implemented and are under discussion) and final, published allocations that anyone may use.

Instead of a provisional status, the answer would seem to lie in making the details of proposed allocations more accessible for review during the period where they are under consideration and balloting in the standardization committee. One possible way to do that would be to make repertoire additions subject to the Public Review process. Another would be for more interested people to become members and to follow submissions as soon as they hit the Unicode document registry.
The former is much more labor-intensive, and I suspect not something the Consortium could easily manage with its existing funding and resources. The latter would have the incidental benefit of adding to the funding for the work of the Consortium via membership fees. A./
Re: definition of plain text
Perhaps the idea of something embedded in the text that then controls the display of the subsequent run of text is the very definition of markup, whether or not that markup is a special character or an ASCII sequence like </span><span style="gait: xxx"> or </span><span style="font: xxx">.

On Mon, Oct 17, 2011 at 1:07 AM, Richard Wordingham richard.wording...@ntlworld.com wrote:

On Sun, 16 Oct 2011 21:37:20 +0200 Peter Cyrus pcy...@alivox.net wrote: Perhaps, awkwardly. But that is ultimately equivalent to marking the gait on every letter, in which case I probably wouldn't need to distinguish between initial and non-initial letters.

If you allow C(R)V(C) as a 'fixed' syllable structure, the location of the syllable boundary in words like /tatrat/ would be significant, as in Thai. /tata/ would also be awkward if you had null initials, as, again, some claim for Thai. (There are languages that need to be analysed as having a phonemic contrast between null initials and initial glottal stops, even if German isn't one of them.) You might be able to handle syllable breaks just by having an optional syllable-break character, analogous to CGJ and ZWSP.

Marking gait on every letter may not be necessary, but gait-selecting characters present issues. They're analogous to the deprecated numeric shape selectors U+206E and U+206F, whose use is strongly discouraged. These characters need explicit support in rendering engines, which is an argument against gait-selecting characters. You might be able to propagate gait by contextual substitution, *if* you could propagate it through automatic line breaks.

Richard.
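The pattern under discussion, a character embedded in the text that controls the display of the following run until the next such character, can be sketched as a one-pass scan. The gait-selector code points below are hypothetical PUA assignments, invented purely for illustration:

```python
# Sketch: hypothetical gait selectors (PUA code points invented for
# illustration) set the gait in force for every following letter,
# until the next selector changes it.
GAITS = {"\uE000": "alphabet", "\uE001": "syllabary",
         "\uE002": "block", "\uE003": "abjad"}

def tag_gaits(text, default="alphabet"):
    gait, out = default, []
    for ch in text:
        if ch in GAITS:
            gait = GAITS[ch]        # selector: change state, emit nothing
        else:
            out.append((ch, gait))  # letter: tag with the current gait
    return out

print(tag_gaits("ab\uE001cd"))
# [('a', 'alphabet'), ('b', 'alphabet'), ('c', 'syllabary'), ('d', 'syllabary')]
```

The statefulness is exactly what makes Richard's line-break caveat bite: after an automatic break, a renderer that starts shaping mid-paragraph has to recover the gait in force, which is why the selectors would need explicit engine support rather than ordinary per-glyph substitution.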
Re: definition of plain text
It's been done already: the International Phonetic Alphabet. If we all just wrote in that, it would make Unicode much easier to implement, too. I'm just working on Plan B, just in case.

On Mon, Oct 17, 2011 at 8:48 PM, Ken Whistler k...@sybase.com wrote:

On 10/17/2011 1:23 AM, Peter Cyrus wrote: Perhaps the idea of something embedded in the text that then controls the display of the subsequent run of text is the very definition of markup, whether or not that markup is a special character or an ASCII sequence like </span><span style="gait: xxx"> or </span><span style="font: xxx">.

Yep. And FWIW, rather than invent a new conscript for what you are attempting to do, my recommendation would be to simply use a strictly defined phonetic subset of the already-encoded Unicode characters, and then use XML markup to define gait. Your marked-up text would be more voluminous, but the chances of it displaying decently and being processable with widely available tools would be much better. --Ken
Re: definition of plain text
Your idea of propagation seems worth exploring - thanks!

On Mon, Oct 17, 2011 at 1:07 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: [...]
definition of plain text
Is there a definition or guideline for the distinction between plain text and rich text?

For example, in the expression 3², the exponent is a single character, superscript two. Semantically, this expression is equivalent to 3^2, using a visible character to indicate exponentiation and leaving the exponent in normal notation. Both seem to me clear examples of plain text. But if the circumflex were replaced by an invisible character that meant "the following number should be superscripted", would that still be plain text? Or would it be formatting that should be relegated to markup?

What about a character that inhibited the composition of following Hangul jamo into a syllable? That seems to me to be markup, but if it could be replaced by a medial ZWNJ, I'm no longer sure. Is the ZWNJ another tricky case? One could say that it's an invisible formatting character whose role is simply to control how other characters are displayed, and thus it should be markup. For that matter, perhaps the normal space is a type of markup, especially when it triggers the use of a final variant in the previous character. Finally, aren't the LTR and RTL marks markup? What if we wanted characters that put a run of text into vertical directionality?

One candidate guideline would be that plain text never includes anything that affects non-adjacent characters. But isn't that just the equivalent of requiring repetition of markup for each character? For example, if you wanted to write 3²⁽ⁿ⁺¹⁾ with m instead of n, the plain text would be 3^2^(^m^+^1^), using ^ as a superscripting prefix. If that is acceptable as plain text, then perhaps the Unicode superscripted characters should all decompose into a superscripting prefix. Maybe I just need more sleep...
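Incidentally, Unicode's character data already records something very close to the "superscripting prefix" idea: the compatibility decomposition of U+00B2 SUPERSCRIPT TWO tags the plain digit with a `<super>` qualifier, and NFKC normalization folds the superscript back to the plain digit. A quick check in Python:

```python
import unicodedata

# U+00B2 SUPERSCRIPT TWO carries a compatibility decomposition:
# a <super> formatting tag plus the plain digit it decorates.
print(unicodedata.decomposition("\u00b2"))       # <super> 0032
print(unicodedata.normalize("NFKC", "3\u00b2"))  # 32
```

Note that the fold is lossy in exactly the way the message worries about: after NFKC, 3² and 32 are indistinguishable, so the "superscripting" information behaves like markup that the plain-text level is allowed to discard.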
Re: definition of plain text
Ken, your explanation seems more permissive than I had anticipated. Your example of 3<sup>2</sup> would seem to me at risk of behaving in unforeseen ways if, for instance, it were split up. Wouldn't it match a search string like "up>2"? Wouldn't it fail to match 3²? I guess I thought that plain text should be more canonical.

I was also being coy about my real question, but perhaps I shouldn't have been. I'm working on a conscript intended as a universal phonetic script, and even though it is unlikely ever to merit inclusion in Unicode, I'd like to design it to Unicode's standards. (For the curious, the work done to date is online at www.shwascript.org.)

One particularity of this script is that it is written in different gaits, depending on the phonology of the language. Languages with open syllables, like most Niger-Congo or Austronesian languages, would write it as a syllabary. Languages with fixed syllables, like Chinese, Korean or Vietnamese, would write it as blocks, like Hangul. Languages with variable syllables, like most Indo-European languages, would write it as an alphabet. And Afro-Asiatic languages would write the vowels as diacritics, to highlight the triliteral roots. But all these gaits would use the same underlying letters, and the same underlying Unicode PUA characters.

The obvious way to encode this is to add a set of invisible characters that specify the gait of the following run of plain text. Each would also serve as the end-of-run character for the preceding run. This solution seems to me analogous to the use of the LTR and RTL marks to delimit runs for directionality, but I don't know enough about the UBA to know where the pitfalls are or whether a better solution is feasible. These gait characters would be ignored in search, which is the desired behavior. Alternatives might include markup or even different fonts, but the gaits seem to me as much a part of the text as the letters themselves.
Writers would have to explicitly change gaits when they want to embed a Chinese name in an English text, for example. It seems unwieldy to capture that information at the keyboard and then package it separately for encoding, transmission and rendering. Nor does considering it as a case distinction seem elegant. May I ask for advice?

On Fri, Oct 14, 2011 at 9:17 PM, Ken Whistler k...@sybase.com wrote:

On 10/14/2011 11:47 AM, Joó Ádám wrote: Peter asked for what the Unicode Consortium considers plain text, i.e. what principles it applies when deciding whether to encode a certain element or aspect of writing as a character. In turn, you thoroughly explained that plain text is what the Unicode Consortium considered to be plain text and encoded as characters.

Correct. And basically, that is what it comes down to. One cannot look at *rendered* text and somehow know, a priori, exactly how that text should be represented in characters. (In the case of most of what is still being considered for encoding, "rendered text" means non-digitally printed historic materials, because there isn't any character encoding for it in the first place, and hence no compatibility encoding issues.) Sure, there are some general principles which apply:

1. We don't represent font size differences by means of encoded characters.
2. We don't represent text coloration differences by means of encoded characters.
3. We don't represent pictures by means of encoded characters.

and so on. Add your favorites. But character encoding, as a process engaged in by character encoding committees (in this case, the UTC and SC2/WG2), is an art form which needs to balance: existing practice, if any; graphological analysis of writing systems; complexity of implementation for proposed solutions to encoding; architectural consistency across the entire standard; linguistic politics in user communities; and even national-body politics involved in voting on amendments to the standard.
It is impossible to codify that process in a set of a priori, axiomatic principles about what is and is not plain text, and then sit in committee and run down some checklist and determine, logically, what exactly is and is not a character to be encoded. People can wish all they want that it were that way, but it ain't.

So yeah, what the Unicode Consortium considers to be plain text is what can be represented by a sequence of Unicode characters, once those characters ended up standardized and published in the standard. You can't start at the other end, define exactly what plain text is, and then pick and choose amongst the already standardized characters based on that definition. Given the universal (including historic) scope of the Unicode Standard, that way lies madness. --Ken
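The "ignored in search" behaviour Peter wants for his gait characters can be sketched as a folding step applied before comparison, much as default-ignorable code points are dropped in loose matching. The selector code points below are hypothetical PUA assignments, invented for illustration:

```python
# Sketch: strip hypothetical PUA gait selectors before comparing text,
# so that the same letters match regardless of the gait they were
# written in. The code points are invented for illustration.
GAIT_SELECTORS = frozenset("\uE000\uE001\uE002\uE003")

def fold_gaits(text):
    return "".join(ch for ch in text if ch not in GAIT_SELECTORS)

def gait_insensitive_equal(a, b):
    return fold_gaits(a) == fold_gaits(b)

print(gait_insensitive_equal("\uE001ab", "ab"))  # True
```

This is also the test of Peter's claim that the gaits are "part of the text": if every search, sort, and comparison first folds the selectors away, the selectors behave operationally like formatting, whatever their conceptual status.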