Re: PUA properties, default or otherwise (was: Re: What is the principle?)
From: Doug Ewell [EMAIL PROTECTED] To: Unicode Mailing List [EMAIL PROTECTED] Cc: Kenneth Whistler [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Wednesday, March 31, 2004 8:38 AM Subject: PUA properties, default or otherwise (was: Re: What is the principle?) This discussion has focused pretty tightly on the *default* properties of PUA code points, without really addressing the issue of specifying new properties to override those defaults, and I think that's a mistake. Exactly what I was saying. But you had more arguments for my remark. But Ken and Rick are absolutely right that very few companies are going to see a business opportunity in this. Even SC UniPad, which has implemented many comparatively arcane features of Unicode, has never done anything with the PUA, though it has been on their future versions list for 6 years now. One of the main reasons may be that they are limited precisely by the lack of accurate properties for PUAs. But I see no reason why there could not exist an interoperable format to send these properties. I proposed to include that information in fonts (notably OpenType), but it may also be sent separately (in a font without the glyphs?). Of course we can argue that some of the missing features may in some cases be encoded directly within the main text (for example by using RLO/PDF controls in the plain text to override the BiDi properties). I also don't think that such an application is only for idiosyncratic characters. There are LOTS of scripts on earth that will probably never come under the scrutiny of Unicode, but that users may wish to start studying in an interoperable way with common reusable technical solutions to create the documents they need. You may think that using some rich text format (Word DOC, Acrobat PDF, HTML+SVG...) would palliate the lack of standardization. But I do think that there is still some place for plain text.
Re: What is the principle?
From: Kenneth Whistler [EMAIL PROTECTED] Consider another example. The normalization algorithm has to work for *all* Unicode code points, assigned or not, because it guarantees stability into the future when characters are encoded at code points which were previously unencoded. It also, then, obviously has to work for PUA characters, as well. That implies that two additional properties *MUST* have some default values set for PUA characters. One of those is decomposition, which is defaulted to the null string (no decomposition) for all PUA characters. The other is canonical combining class, which is defaulted to ccc=0 for all PUA characters. Doing anything else would have just been stupid. But again, None of the Above was not an option. All these arguments are in favor of defining a default property set with reasonable values that match the most common (?) needs. Still, this should not prohibit the use and interchange of other properties: these defaults are then not mandatory and are overridable. Even in the case of an API that requires being able to do something like Character(E000).getProperty(), that API may be pre-fed with a table of property overrides for PUAs.
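The defaults Ken lists, and the "pre-fed override table" idea in the reply, can be illustrated with a minimal sketch. It assumes Python's standard `unicodedata` module as the source of the default values; the `PUA_OVERRIDES` table, its contents, and the helper names `default_properties` and `get_property` are hypothetical, invented here for illustration, and do not correspond to any real registry or API.

```python
import unicodedata

# Hypothetical private-agreement table: code point -> property overrides.
# The values are illustrative only, not taken from any real registry.
PUA_OVERRIDES = {
    0xE000: {"bidi_class": "R", "general_category": "Lo"},
    0xE001: {"general_category": "Mn", "ccc": 230},
}

def default_properties(cp):
    """Return the default property values Python's unicodedata reports for cp."""
    ch = chr(cp)
    return {
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "ccc": unicodedata.combining(ch),
        "decomposition": unicodedata.decomposition(ch),
    }

def get_property(cp, name):
    """Look up a property, letting the private table take precedence over the default."""
    override = PUA_OVERRIDES.get(cp, {})
    if name in override:
        return override[name]
    return default_properties(cp)[name]
```

With this sketch, `get_property(0xE002, "bidi_class")` falls back to the default, while `get_property(0xE000, "bidi_class")` returns the privately agreed value. Normalization still only sees the defaults: `unicodedata.normalize("NFD", "\ue000")` leaves the character unchanged, which is the stability point Ken makes.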
RE: Unicode 4.0.1 Released
Rick McGowan wrote: Unicode 4.0.1 has been released! [...] The main new features in Unicode 4.0.1 are the following: [...] 3. Unicode Character Database: [...] * Changed: general category of U+200B ZERO WIDTH SPACE * Changed: bidi class of several characters (If I am asking a FAQ, I apologize in advance...) So far, my understanding was that the normative properties of existing code points were carved in stone. Won't these fixes break applications out there? I.e., won't they turn previously conformant applications into non-conformant ones? _ Marco
Re: What is the principle?
From: Michael Everson [EMAIL PROTECTED] At 17:02 -0800 2004-03-30, Mike Ayers wrote: I feel obligated to take this one step further - these folks are forgetting that P stands for private. Their use of this space is their own problem, in all senses. It does not seem reasonable to me that *any* standard behavior could be expected of PUA code points, from operating systems or applications, as such may have chosen to, or may yet choose to, use those code points to encapsulate very un-font-rendering-like behavior, and such a decision, made past, present or future, is a perfectly valid private use. Which I assume means: it's wrong for Unicode to make ANY property pronouncements for ANY PUA characters, since that defines them, and removes the P from the Use. Do you mean here that any properties currently defined in Unicode for PUAs should be deprecated with their current normative value, and left to implementers, so that no application can be said to be non-conforming if it implements other defaults? Maybe this would require some adjustments in the normative wording related to Unicode conformance... Likewise, variation selectors, if they are used on PUAs, should not be constrained (the current restrictions on variation selector usage should not apply to PUAs, given that a VSn should still be fully ignorable, including for PUAs that have no defined normative semantics in Unicode, meaning that the combination PUA+VSn also has no defined normative semantics in Unicode itself). Leave that to implementations, and maybe we'll ease the development of new scripts, by allowing other groups to work on some interchangeable formats based on PUAs, which could then be later integrated in Unicode after an easier phase in which these scripts would have been experimented with. 
It would ease the adoption of a later consensus, and would offer a great tool for developers and researchers, who could safely base their work on Unicode encoding conventions. Also this would be a good indicator that specialized 8-bit code sets are no longer necessary, and IANA could then close its 8-bit encodings registry, in favor of PUA-based encodings defined by some conventional rules which could then become a standard and open extension mechanism... This will have the advantage of avoiding pressure on Unicode to standardize new scripts too fast, and longer open experimentation would avoid many future errors in the newly standardized scripts. The CSUR registry is one approach for the definition of new scripts, SIL.org has its own, but for now I see little effort to allow specifying these properties in a partially interchangeable format, and one reason may be that Unicode has placed too many restrictions on the usage of PUAs, so that developers fear that their protocols which need them become non-conforming. I do think that there must exist a way to have PUAs used safely without ambiguities or risks of collisions, using extension mechanisms similar to namespaces in XML, and some normative declarations and possibly a registry of PUA sets (why not the IANA charsets registry if it can reference the associated properties with some URL to a script definition schema?).
Re: What is the principle?
On 30/03/2004 17:32, Michael Everson wrote: At 17:02 -0800 2004-03-30, Mike Ayers wrote: I feel obligated to take this one step further - these folks are forgetting that P stands for private. Their use of this space is their own problem, in all senses. It does not seem reasonable to me that *any* standard behavior could be expected of PUA code points, from operating systems or applications, as such may have chosen to, or may yet choose to, use those code points to encapsulate very un-font-rendering-like behavior, and such a decision, made past, present or future, is a perfectly valid private use. Which I assume means: it's wrong for Unicode to make ANY property pronouncements for ANY PUA characters, since that defines them, and removes the P from the Use. This is of course a principle which they have already broken, as they have defined default properties for all of them. Although in principle people can implement non-default properties, no one has, as far as I know. The result is that in practice the P has been removed from the PUA and it has been restricted to LTR base characters. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: What is the principle?
On 30/03/2004 16:46, Kenneth Whistler wrote: ... Work it out. Any proposal to assign property ranges into the PUA would run up on the rocks of all the details. And *then* it would meet a stonewall in the UTC. And *then* it would meet another stonewall in SC2. Quit banging your head against the walls and look for alternatives more likely to lead somewhere. The only alternative I see is to rewrite from scratch the display routines of my favourite OS. I think banging my head against walls is likely to be faster. After all, even the hardest wall cracks eventually, and my head is quite hard. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
On Tuesday, March 30, 2004 11:42 PM, Ernest Cline wrote: The main usage is with compound words such as ice cream or Louis XIV or commercial phrases such as Camry SE where for esthetic reasons an author would prefer that the space not expand upon justification, Well, as one who takes the pain to enter ALT+0160 here and there (particularly around « and » in French), I should say that I certainly would like the space between Louis and XIV, or between Camry and SE, to stay of fixed width; on the other hand, I would expect the one between ice and cream to expand according to the rhythm of the paragraph, in order not to break the reading. Like in Mum, I want an ice cream against Mum, Iwant anice cream I am not aware of any style guides that offer either normative or informative guidance for either choice. The French style guides (after all, we can use Unicode to write French as well as English, can't we?) generally say that NBSP should not be expanded on justification. I do not know right now (I lack access to definitive references) if this is general to all non-breaking spaces, including those that do have fixed-width per se, or if it specifically applies to U+00A0. It should be pointed out that non-breaking spaces occur rather frequently in French (around several punctuation characters), and because many word processors are not rich enough to encode it as they should (i.e., as ZWNBSP+THSP+ZWNBSP, \uFEFF\u2009\uFEFF), well they encode it as U+00A0 :-(. NBSP ZWNJ breaks, but should it justify? ^^ This is an error, isn't it? Antoine
RE: Why is U+17C1 of General category Mc while U+0E40 and U+0EC0 are of category Lo ?
[EMAIL PROTECTED] wrote: Thai (and Lao, whose encoding closely parallels that of Thai) are encoded in Unicode on unique principles: by a straight left-to-right typewriter-style encoding. This was done for compatibility with the pervasive Thai 8-bit standard. It also means that for collation purposes what are historically left-side vowels must be moved after the following consonant. For more on collation of Thai, Lao, and Khmer, see the proposed update to ISO/IEC 14651 CTT (and the UAX 10 DUCET), and a tailoring for the CTT, in the two documents: N2718 http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2718.doc N2717 http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2717.doc (Note that the swapping part for Thai/Lao of the tailoring is dealt with by other means (in the prehandling) in the Unicode collation algorithm.) Note that the Thai characters are not labeled LETTER or VOWEL SIGN or what have you, but simply CHARACTER. Yes, but that has no particular consequence. Note that the vowel signs are in the documents referenced above treated as vowel signs, regardless of if they are called LETTER, VOWEL SIGN, or CHARACTER (and, actually, regardless of their general category, as it happens). There is also the complication that some of the consonant characters are logically used as vowel (parts), but the modern convention is to ignore that in the collation rules, and always treat them as consonants in collation. /kent k
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
On 30/03/2004 18:01, fantasai wrote: Ernest Cline wrote: The main usage is with compound words such as ice cream or Louis XIV or commercial phrases such as Camry SE where for esthetic reasons an author would prefer that the space not expand upon justification, Given wide enough measures, good text layout program should be able to produce justified text without very noticeable changes in word spacing. NBSP doesn't break, but should it justify? I believe NBSP should be, to the reader, indistinguishable from a regular space. It does not have a semantic function as a compound- word-joiner; it's just a space that doesn't break, and therefore should be treated like any other space. ~fantasai So perhaps the best thing to do in cases like Ernest's and mine, where a fixed width non-breaking space is required, is to use FIGURE SPACE, which I understand is non-breaking. But then perhaps this is too wide in some circumstances - in many fonts it is twice the regular width of SPACE. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
sara am ordering (was RE: Why is U+17C1 of General category Mc while U+0E40 and U+0EC0 are of category Lo ?)
Kent: Your doc says, quote, emphasis added And Ó should be ordered as Ò followed by í (**which is the logical sequence, despite the Unicode compatibility decomposition**). /quote What do you mean here by logical sequence? That that's how it should be interpreted phonologically and for sorting purposes, or that that is the correct encoded sequence for decomposed representations? If the latter, that seems to me to be quite wrong: I would not expect *any* data that includes a decomposed representation of sara am to have the sequence C, sara aa, nikkahit : it would always be the other way around: C, nikkahit, sara aa . Of course, if the former, I would agree. Peter Peter Constable Globalization Infrastructure and Font Technologies Microsoft Windows Division
Re: What is the principle?
Peter Kirk peterkirk at qaya dot org wrote: Which I assume means: it's wrong for Unicode to make ANY property pronouncements for ANY PUA characters, since that defines them, and removes the P from the Use. This is of course a principle which they have already broken, as they have defined default properties for all of them. Although in principle people can implement non-default properties, no one has, as far as I know. The result is that in practice the P has been removed from the PUA and it has been restricted to LTR base characters. Unicode allows the properties of the PUA code points, unlike all others, to be customized by the end user. I've done so myself, on the Web page I mentioned. Characters are classified as General Category Lo, Nd, or No, and the digits have numeric values. Although all are still LTR base characters, there's no reason they had to be (except that that's how my script works); for Tengwar there would be both RTL digits and combining marks. The perception that no-one has yet implemented custom PUA properties does not mean that doing so is prohibited or unworkable, any more than the shortage of widely available rendering engines for the Tibetan and Khmer encoding models implies that those models are unworkable. Failure to see this distinction, between (a) what Unicode allows and prohibits and (b) what software products do and do not support, is doing more to convince us of the hardness of Peter's head than anything else. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: French typographic thin space (was: Fixed Width Spaces)
From: Antoine Leca [EMAIL PROTECTED] The French style guides (after all, we can use Unicode to write French as well as English, can't we?) generally say that NBSP should not be expanded on justification. I do not know right now (I lack access to definitive references) if this is general to all non-breaking spaces, including those that do have fixed-width per se, or if it specifically applies to U+00A0. It should be pointed out that non-breaking spaces occur rather frequently in French (around several punctuation characters), and because many word processors are not rich enough to encode it as they should (i.e., as ZWNBSP+THSP+ZWNBSP, \uFEFF\u2009\uFEFF), well they encode it as U+00A0 :-(. In fact the typographic tradition for French is to use a THIN non-breaking space, which is not precisely what NBSP encodes, but NBSP is used as a common APPROXIMATION simply because the THIN non-justifiable and non-breaking space is absent from legacy 8-bit sets (including ISO-8859-1, ISO-8859-15, Windows 1252, CP850, for the most widely used ones). The rule is to use this thin space (called une fine or une espace fine in French) before punctuation marks composed of two separate glyphs: the colon, semicolon, exclamation mark and question mark, and between « and the quoted phrase, and between the quoted phrase and ». A similar rule also exists in traditional English typography, with a small variation: the French thin space is a bit wider than the English one, so the best approximation for French is to use NBSP, and for English to use nothing (also because most fonts made by English typographers already incorporate the additional very thin space within the spacing width of the punctuation mark)... There are pros and cons with the NBSP approximation used in French. 
Some have argued that it would be better not to encode anything here, and instead to use fonts containing punctuation marks that already include the appropriate additional spacing within the glyph spacing width. Still, many French typographic composition engines (notably those used by newspapers, magazines, guides and diaries -- for example the French product Calligrame distributed by X-Media in various countries, or other composition engines used by regional or national newspapers) already recognize the sequence NBSP+punctuation or punctuation+NBSP and interpret the NBSP code as meaning the presence of the French espace fine, so printed books, newspapers and magazines already apply the correct style (these newspapers in France have long been accustomed to using SGML to create their laser masters, and to using quite advanced, precise and coherent stylesheets that are part of the signature of the publication, i.e. its maquette design, which also incorporates many custom logographs and symbols, notably in dictionaries, guides and newspapers). So yes, the correct code for French should be ZWNBSP+THSP+ZWNBSP (but beware of the difference of spacing between the English and French thin space, with one at 1/6 em, the other at 1/8 em...)
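The substitution the composition engines are described as doing (reading NBSP next to French punctuation as an espace fine) can be sketched as a plain-text rewrite to the ZWNBSP+THSP+ZWNBSP sequence named in this thread. This is only an illustration under stated assumptions: the function name `nbsp_to_fine` is hypothetical, and the punctuation set is limited to the marks listed in the message above.

```python
import re

ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE
THSP = "\u2009"    # THIN SPACE
NBSP = "\u00A0"    # NO-BREAK SPACE

# The sequence proposed in the thread for the French "espace fine".
FINE = ZWNBSP + THSP + ZWNBSP

# NBSP directly before : ; ! ? or the closing guillemet.
_BEFORE_PUNCT = re.compile(NBSP + r"(?=[:;!?»])")
# NBSP directly after the opening guillemet.
_AFTER_OPEN_QUOTE = re.compile(r"(?<=«)" + NBSP)

def nbsp_to_fine(text):
    """Replace NBSP adjacent to French punctuation with ZWNBSP+THSP+ZWNBSP.

    Any other NBSP (e.g. between "Louis" and "XIV") is left untouched.
    """
    text = _BEFORE_PUNCT.sub(FINE, text)
    return _AFTER_OPEN_QUOTE.sub(FINE, text)
```

For example, `nbsp_to_fine("Quoi\u00A0?")` yields the word followed by the three-character sequence and the question mark, while `"Louis\u00A0XIV"` passes through unchanged. Whether a given renderer then keeps the thin space unbroken and unjustified is a separate question that this sketch does not address.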
Re: What is the principle?
On 31/03/2004 08:08, Doug Ewell wrote: ... The perception that no-one has yet implemented custom PUA properties does not mean that doing so is prohibited or unworkable, any more than the shortage of widely available rendering engines for the Tibetan and Khmer encoding models implies that those models are unworkable. Failure to see this distinction, between (a) what Unicode allows and prohibits and (b) what software products do and do not support, is doing more to convince us of the hardness of Peter's head than anything else. Doug, I don't know who you are accusing of failing to see this distinction, but it certainly isn't me. I have made it very clear several times that I understand that IN PRINCIPLE I am free to write my own operating system, or a large part of it, to display these characters as I wish. The problem is one IN PRACTICE. Your advice reminds me of the advice that might have been given to Burbage (?) not to hire Shakespeare, but rather to use a team of monkeys because given enough time they would write the same plays - true, but not practical. The ones I am comparing to monkeys are would-be PUA users like myself who are no more capable than monkeys of writing OSs in a sensible time frame. (Sadly there are no OSs in the Shakespeare category.) :-) But this practical problem would go away (in time, but a lot less time than it would take me to write an OS!) if Unicode specified different DEFAULT (read only ones supported in any commercial or open source software) properties for parts of the PUA, and the software companies implemented this - which would be trivial if specified. You claim to have customised the properties of PUA characters. Do you mean that you have written software which processes them according to your customisations? It is easy to list properties. It is very hard to implement them, if one has to start from scratch, without any help from the established manufacturers. 
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
So perhaps the best thing to do in cases like Ernest's and mine, where a fixed width non-breaking space is required, is to use FIGURE SPACE, which I understand is non-breaking. But then perhaps this is too wide in some circumstances - in many fonts it is twice the regular width of SPACE. Going out on a limb here... It sorta seems like the need to keep phrases like Louis XIV together is a valid one that deserves a solution, but it also seems fairly esoteric-- typesetters and people who give a lot of thought to the presentation of their text might use this, but most people wouldn't. This makes me wonder if it's a plain-text thing. I'm not saying this is a problem that should be solved through markup, but if you care enough about the presentation of the text to care about this, you're probably also already using styled text to specify other things you care about, such as the font you're using. And if you know what font you're using, you can use THREE-PER-EM SPACE or FOUR-PER-EM SPACE (or maybe SIX-PER-EM SPACE or FIGURE SPACE), because you know which one is the right width in your font. For that matter, if a typical space is usually either a third of an em or a quarter of an em wide, my guess is you could probably use either THREE-PER-EM SPACE or FOUR-PER-EM SPACE anyway, and even if this didn't exactly match the width of a space in the particular font used to render your text, it'd probably still look okay. But then again, I'm not a typographer. Fading back into the background... --Rich Gillam Language Analysis Systems, Inc.
Re: What is the principle?
On: 2004-03-31 06:43:38 -0800 Peter Kirk peterkirk at qaya.org scribed: The only alternative I see is to rewrite from scratch the display routines of my favourite OS. I think banging my head against walls is likely to be faster. After all, even the hardest wall cracks eventually, and my head is quite hard. Bang on, O Mighty One! Yer ol' Pal, Youtie
Re: What is the principle?
On 30/03/2004 16:30, Kenneth Whistler wrote: ... Uh, sorry, Peter, but the implications here are so much b, err, ... baloney. The majority of the world's scripts are left-to-right. They also happen to be non-Western. There are more *Indic* scripts encoded in the Unicode Standard than *Western* scripts. The majority of *entities* that the majority of users put into PUA characters in actual application usage are unencoded CJK ideograph variants and symbols from Asian code pages. It was primarily the need to accommodate those *Eastern* users that drove the setting of default values for the PUA. OK, in that case let's allocate properties to PUA characters in proportion to the number of RTL vs LTR scripts, and the proportion of combining marks vs. base characters, in actual encoded scripts. The majority of PUA characters are unchanged. A significant minority become RTL or non-spacing. A lot of effort has gone into accommodating certain *Eastern* users. Something like 100,000 CJK characters have already been defined, and already that is not enough and they have requisitioned two more planes of PUA with LTR properties. Fair enough if they might be needed. But what if users of certain other scripts e.g. RTL scripts want just a handful of PUA characters with the properties they need? Why is preference given to CJK? This sounds like bias to me even if I was wrong to call it western. This bias is also reflected in their system software which (as far as I know with no exceptions) does not allow users to specify properties for PUA characters other than the default decided by the UTC. Bias? Or business sense? If you want some specialized behavior for software, you either write it yourself, or pay someone to write it, or convince someone else that adding such a feature to the software *they* write will pay for the investment cost in terms of incremental increased sales. You may not like how the software industry works, but thems the breaks for any mature industry. 
Well, I don't quite see why it is business sense for software companies to support the huge PUAs for variant CJK characters, outside the 100,000 or so already defined by Unicode. I do understand that it is business sense not to support user specification of properties, because that would be hard work for little or no gain. ... Scenario: The UTC listens to you and defines some section of the PUA as strong right-to-left by default for use in PUA-defined bidirectional scripts. Somebody else is *already* using that section of the PUA for something else. Now they have an interoperability problem, because the default behavior they were depending on changes over in some future version of some software, not under their control, and their data gets munged by bidi. Well, they weren't supposed to rely on these default properties anyway, they were supposed to use the PUA at their own risk. They are not the only ones who are messed up by features of software which is not under their control. But it might be preferable in practice to define an additional PUA with RTL properties and one with default ignorable properties, outside all of the existing PUAs. I am not asking for a large space; very likely 256 characters of each type would be more than adequate. This is the kind of stuff the UTC refuses to start up by trying to provide some subdivision of semantics in the PUA. *That* is the principle, by the way, which guides the UTC position on the PUA: Use at your own risk, by private agreement. What we do want is compatibility between our applications and the system software, and this proposal is the way to do that. I don't see how any proposal to create some particular behavior in the PUA is a way to accomplish that. If a new PUA is created with default RTL properties, one can expect that system software will soon support it at least to the extent of defining these characters as RTL for bidi algorithm etc purposes. Similarly with default ignorable. ... 
A default value for a property is not a requirement by the UTC *ON AN IMPLEMENTER* that they use that value. They can use whatever property values they desire, but if they depart from what system platforms provide them (by default) then they are buying themselves an implementation task to get characters to do what they want. Ken, you are a master of understatement. The task they are buying themselves is a rewrite of the whole system. Companies don't provide the details needed for others to customise individual modules, and it would probably be a breach of copyright etc to attempt to do so. Open Source is different here, of course. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
On 31/03/2004 08:49, Language Analysis Systems, Inc. Unicode list reader wrote: So perhaps the best thing to do in cases like Ernest's and mine, where a fixed width non-breaking space is required, is to use FIGURE SPACE, which I understand is non-breaking. But then perhaps this is too wide in some circumstances - in many fonts it is twice the regular width of SPACE. Going out on a limb here... It sorta seems like the need to keep phrases like Louis XIV together is a valid one that deserves a solution, but it also seems fairly esoteric-- typesetters and people who give a lot of thought to the presentation of their text might use this, but most people wouldn't. This makes me wonder if it's a plain-text thing. I'm not saying this is a problem that should be solved through markup, but if you care enough about the presentation of the text to care about this, you're probably also already using styled text to specify other things you care about, such as the font you're using. And if you know what font you're using, you can use THREE-PER-EM SPACE or FOUR-PER-EM SPACE (or maybe SIX-PER-EM SPACE or FIGURE SPACE), because you know which one is the right width in your font. For that matter, if a typical space is usually either a third of an em or a quarter of an em wide, my guess is you could probably use either THREE-PER-EM SPACE or FOUR-PER-EM SPACE anyway, and even if this didn't exactly match the width of a space in the particular font used to render your text, it'd probably still look okay. But then again, I'm not a typographer. Fading back into the background... --Rich Gillam Language Analysis Systems, Inc. Fair enough. To most people, a space is a space. To rather more, there is a second kind of space which they expect to be non-breaking and often also expect to be fixed width. (Those who had the latter expectation have had a nasty surprise today because with the release of 4.0.1 NBSP is suddenly no longer fixed width.) 
The problem is that when we get beyond that we get lost in a world of typography, and in uncertainty over which spaces are supposed to be breaking or non-breaking, fixed or variable width, and if fixed what width. It would be useful to have all of this clearly laid out somewhere, so that those of us who do care about what our text looks like, but are not professional typographers, know what we should use. LouisTHREE-PER-EM SPACEXVI may have lost his head, but we don't want his number also to fall off on to the next line, or even to become too far separated from his name. We need to know what kind of space to use to resist the guillotine! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
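As a partial answer to the request for a clear layout of the space characters, here is a rough sketch that lists them from the character database shipped with Python. It relies on one assumption that should be stated loudly: the `<noBreak>` compatibility-decomposition tag is used as a hint for "non-breaking", whereas the authoritative source is the line-break class in UAX #14, which Python's standard library does not expose. Widths are not in the database at all and depend on the font. The names `SPACES` and `describe_space` are hypothetical.

```python
import unicodedata

# Space characters discussed in this thread, plus U+202F for comparison.
SPACES = [
    0x0020, 0x00A0, 0x2002, 0x2003, 0x2004, 0x2005,
    0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x202F,
]

def describe_space(cp):
    """Summarize one space character using only stdlib unicodedata.

    'no_break' is derived from the <noBreak> decomposition tag, which is
    a hint rather than the line breaking algorithm itself.
    """
    ch = chr(cp)
    decomposition = unicodedata.decomposition(ch)
    return {
        "code": f"U+{cp:04X}",
        "name": unicodedata.name(ch),
        "no_break": decomposition.startswith("<noBreak>"),
    }

if __name__ == "__main__":
    for cp in SPACES:
        info = describe_space(cp)
        kind = "non-breaking" if info["no_break"] else "breaking"
        print(f'{info["code"]}  {info["name"]:<24} {kind}')
```

By this measure THREE-PER-EM SPACE is an ordinary breaking space, so it would not keep Louis and his number together, whereas NO-BREAK SPACE and FIGURE SPACE carry the no-break tag.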
RE: What is the principle?
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Peter Kirk Sent: Wednesday, March 31, 2004 9:12 AM On 30/03/2004 16:30, Kenneth Whistler wrote: But what if users of certain other scripts e.g. RTL scripts want just a handful of PUA characters with the properties they need? Why is preference given to CJK? This sounds like bias to me even if I was wrong to call it western. Oh, yes, Peter, you have identified a clear bias against... against... against... uh, certain hypothetical situations? If you want some specialized behavior for software, you either write it yourself, or pay someone to write it, or convince someone else that adding such a feature to the software *they* write will pay for the investment cost in terms of incremental increased sales. You may not like how the software industry works, but thems the breaks for any mature industry. Well, I don't quite see why it is business sense for software companies to support the huge PUAs for variant CJK characters, outside Support? ROFL! Call up one of those companies and tell them that you are having trouble displaying PUA fonts, eastern or otherwise. I'd like to snoop on that call. they were supposed to use the PUA at their own risk. Well, gee, somebody understands that principle so clearly WHEN IT APPLIES TO SOMEONE ELSE. This is the kind of stuff the UTC refuses to start up by trying to provide some subdivision of semantics in the PUA. *That* is the principle, by the way, which guides the UTC position on the PUA: Use at your own risk, by private agreement. ...and quit bothering us about it. That's gotta be in there somewhere. If not, I have an amendment to propose. What we do want is compatibility between our applications and the system software, and this proposal is the way to do that. No. 
The *only* way to maintain compatibility between your applications and the system software is to ensure that your applications only do things that are supported by the system software. If you want RTL PUA, ask your system software vendor. Here, you're just whining into the wind. /|/|ike
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Language Analysis Systems, Inc. Unicode list reader scripsit: It sorta seems like the need to keep phrases like Louis XIV together is a valid one that deserves a solution, but it also seems fairly esoteric-- typesetters and people who give a lot of thought to the presentation of their text might use this, but most people wouldn't. This makes me wonder if it's a plain-text thing. In the TeX typesetting tradition, at least, it *is* done by markup. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan Promises become binding when there is a meeting of the minds and consideration is exchanged. So it was at King's Bench in common law England; so it was under the common law in the American colonies; so it was through more than two centuries of jurisprudence in this country; and so it is today. --Specht v. Netscape
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Peter Kirk wrote: Louis<THREE-PER-EM SPACE>XVI may have lost his head, but we don't want his number also to fall off on to the next line, or even to become too far separated from his name. We need to know what kind of space to use to resist the guillotine! NBSP You should not rely on fixed-width spaces to approximate regular spaces. A simple switch from Arial Narrow to Verdana will demonstrate why: the widths of normal spaces and non-breaking spaces are related to the width of the fonts' glyphs, whereas the width of a fixed-width space is related to the /height/ of the glyphs. -- http://fantasai.inkedblade.net/contact
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
On 31/03/2004 11:57, Kenneth Whistler wrote: ... To most people, a space is a space. To rather more, there is a second kind of space which they expect to be non-breaking and often also expect to be fixed width. (Those who had the latter expectation have had a nasty surprise today because with the release of 4.0.1 NBSP is suddenly no longer fixed width.) Hardly. It has *always* been the intent and understanding of the UTC that NBSP was comparable in all ways to a SPACE character, except for disallowing line break opportunities. ... Thanks for the clarification. I should say that the behaviour of NBSP suddenly reverted to what it had been in previous versions of the standard, although a perhaps inadvertant change was made in 4.0.0. Nevertheless, there does seem to be a widespread misunderstanding that NBSP is intended to be fixed width, and in many systems it is implemented as such. Perhaps there is a need to clarify this further, perhaps by reinstating text similar to what was in Unicode 3.0. I take your point about the advantages of having the drafters of the standard available to explain parts of the standard which are unclear. I certainly wish we could do that with other texts that you allude to. But there must also be controls here. If the text says black, we can't have the drafters saying that the text really means white. They can say that they made a mistake, and correct it in a new version, but there are limits on how far they can reinterpret even a text which they wrote themselves. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: What is the principle?
[Original Message] From: Kenneth Whistler [EMAIL PROTECTED] To: [EMAIL PROTECTED] Peter Kirk continued: You can do it privately. See above. But attempting to do such things in terms of formally specified usages of the PUA is an invitation to failure of interoperability. I don't understand this last comment. Scenario: The UTC listens to you and defines some section of the PUA as strong right-to-left by default for use in PUA-defined bidirectional scripts. Somebody else is *already* using that section of the PUA for something else. Now they have an interoperability problem, because the default behavior they were depending on changes over in some future version of some software, not under their control, and they data gets munged by bidi. This is the kind of stuff the UTC refuses to start up by trying to provide some subdivision of semantics in the PUA. *That* is the principle, by the way, which guides the UTC position on the PUA: Use at your own risk, by private agreement. Which is why if any private use characters with default characteristics other than those of the existing Private Use blocks are ever to be part of Unicode they will need to be added as additional Private Use blocks, not by redefining existing PUA's There are currently some 10 totally unused planes, with not even any tentative plans for them, Allocating one or two those into additional Private Use Areas with a variety of default characteristics instead of the monotonous default characteristics of the existing Private Use Areas should not prove too difficult. For example, 26 blocks of 128 Private Use Combining Marks each, each block corresponding to one of the existing canonical combining classes (with perhaps a larger block for class 0) would amply satisfy the needs of most private use scripts for combining marks. 
Similarly, blocks for additional characters that would have other properties should be simple to define and for most combinations of property values, 128 characters should also prove to be exceedingly ample I'd have to take the time to list them, but a quick glance convinces me that there are at most several hundred combinations that would need to be supported if we limit things to just those combinations already in use. (it might take more, if for example all 256 potential combining classes were supported instead of the 26 listed in UCD.html), At 128 characters per combination plus more for a few that might need them, it should prove possible to handle this in 1 or 2 planes.
Re: What is the principle?
Peter Kirk wrote... ... I have a real requirement. The UTC has the power to meet my requirement, and to do so rather simply. I am asking them to meet it. Actually, you are not asking UTC anything. You are discussing the PUA on a public-access mail list. There's a big difference. This *is* the place to discuss as you are doing, and a good place to formulate your positions for eventual submission of a proposal, if any. Once you have formulated a position and you actually want to ask the UTC to do something or vote on something, then please fill out the form: http://www.unicode.org/reporting.html That's one place to start. If you have more text than will fit in the form, or you wish to submit a PDF or other bulky document with circles and arrows and such things to UTC, then please discuss it with me off list. I will assist you and see that your document is forwarded and submitted appropriately into the UTC process. See also RFC 3718, particularly sections 8 and 9. As usual, this is all my own opinion and reflects no official policy or position. Rick
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Peter continued: Thanks for the clarification. I should say that the behaviour of NBSP suddenly reverted to what it had been in previous versions of the standard, although a perhaps inadvertant change was made in 4.0.0. Even that is not correct. The *Introduction* to UAX #14 was expanded by 3 paragraphs between the Unicode 3.2.0 and the Unicode 4.0.0 version, in an attempt to help explain the context of how a line break algorithm works, by measuring lines and then seeking a locally optimal line break. In that context, the issue of how compression or expansion of a line works under justification was raised, and the author of UAX #14 added some explanatory qualifications regarding what spaces are involved in the kinds of compression and expansion which can impact line measurement and thus the choice of optimal line break positions. That text omitted mention of NBSP as parallel to SPACE in that context -- that was an oversight by the author and not caught in editorial review. When it became clear that the paragraph in question was being (erroneously) cited as proving that the intent of the UTC was that NBSP be implemented as a fixed-width space, the author acknowledged the oversight and quickly fixed the text. There is *NO* UTC decision on record to make the NBSP be a fixed-width space, in the history of its decision making. Nevertheless, there does seem to be a widespread misunderstanding that NBSP is intended to be fixed width, and in many systems it is implemented as such. Perhaps there is a need to clarify this further, perhaps by reinstating text similar to what was in Unicode 3.0. I didn't cite the parallel text from Unicode 4.0 along with the Unicode 1.0, Unicode 2.0, and Unicode 3.0 text I quoted, for the simple reason that it is almost word-for-word identical to Unicode 3.0. There is no need to reinstate any text -- it was unchanged and its intent was unchanged. 
I take your point about the advantages of having the drafters of the standard available to explain parts of the standard which are unclear. I certainly wish we could do that with other texts that you allude to. But there must also be controls here. If the text says black, we can't have the drafters saying that the text really means white. They can say that they made a mistake, and correct it in a new version, but there are limits on how far they can reinterpret even a text which they wrote themselves. Of course. Exegesis provided above. Now please stop claiming that the status of NBSP has changed, either pre- or post-4.0.0. That some implementations treat NBSP as fixed-width is a matter of those implementations. Note that even SPACE is treated as fixed-width by some implementations, and has a long history of that. Any implementation that is mono-pitch has a fixed-width SPACE, and that goes back to the dark prehistory of SPACE as a Teletype character. The Unicode Standard does not require that SPACE or NBSP be fixed-width, nor does it preclude an implementation which, for whatever reason (limitations of mechanical rendering, font design, or simply aesthetics) treats them as fixed-width. The point the standard is making is that the nominally *fixed-width* space characters (U+2000..U+200A, U+3000) are, by their very character identity, associated with particular display widths. But even for those, as UAX #14 notes, there are typographical practices which may result, for example, in an ideographic space character being compressed or a thin space character being expanded. What *matters* is that the encoded content of the text be correctly specified in an interoperable manner and that proper typographic practice be followed to produce the rendered results that people desire. The Unicode Standard provides a large number of space characters to assist that. 
But if even this most elaborate set of encoded space characters in the history of character encoding standards does not suffice, then, as for TeX, you always have the option to move to mark-up to get the desired results. --Ken
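As a rough illustration of the distinction Ken draws, here is a minimal Python sketch using the standard `unicodedata` module. It lists SPACE and NBSP next to the nominally fixed-width spaces (U+2000..U+200A, U+3000). The values come from whatever UCD version ships with the Python build, not from Unicode 4.0.1, and the helper name `describe` is invented for this example.

```python
import unicodedata

# SPACE and NBSP, followed by the nominally fixed-width spaces named above.
SPACES = [0x0020, 0x00A0] + list(range(0x2000, 0x200B)) + [0x3000]

def describe(cp):
    """Return (label, name, General_Category, decomposition) for one code point."""
    ch = chr(cp)
    return (
        f"U+{cp:04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )

print("UCD version:", unicodedata.unidata_version)
for cp in SPACES:
    print(*describe(cp), sep=" | ")
```

Nothing in these properties records a width: NBSP differs from SPACE only by its `<noBreak>` decomposition tag, while the characters in U+2000..U+200A carry their intended widths in their identity (their names), which is the point made above.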
Re: What is the principle?
On 31/03/2004 10:44, Mike Ayers wrote: ... Well, I don't quite see why it is business sense for software companies to support the huge PUAs for variant CJK characters, outside Support? ROFL! Call up one of those companies and tell them that you are having trouble displaying PUA fonts, eastern or otherwise. I'd like to snoop on that call. Well, support has a range of meanings. Call up one of those companies and tell them you are having trouble with one of the Indic or SE Asian scripts which they do claim to support, and I suspect you will discover what that support really means in practice - unless you can get through to the specialised development team. What I meant by support in this case was more that in the UTC they voted in favour of assigning two HUGE PUAs, consisting of more than one eighth of the entire Unicode code space, for variant CJK characters; and that in practice these characters can be displayed successfully with a variety of software from these companies. If CJK merits more than 100,000 PUA characters as well as a similar number of defined characters, why can't a measly two or three, or better 256 or so, be allowed for RTL languages and for combining marks? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: What is the principle?
On 31/03/2004 10:44, Mike Ayers wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Peter Kirk Sent: Wednesday, March 31, 2004 9:12 AM On 30/03/2004 16:30, Kenneth Whistler wrote: But what if users of certain other scripts e.g. RTL scripts want just a handful of PUA characters with the properties they need? Why is preference given to CJK? This sounds like bias to me even if I was wrong to call it western. Oh, yes, Peter, you have a identified a clear bias against... against... against... uh, certain hypothetical situations? Well, if you haven't read it between the lines, the clear bias is against RTL scripts and those scripts (including Indic by the way) which use combining characters. There is no way currently (with the default properties) to support PUA characters relating to such scripts, although there is for western and CJK scripts. ... they were supposed to use the PUA at their own risk. Well, gee, somebody understands that principle so clearly WHEN IT APPLIES TO SOMEONE ELSE. Yes, Ken! Read the context and don't snip it. He is the one who said (correctly) that what I get when I use the PUA must be at my own risk, but, I quote: Somebody else is *already* using that section of the PUA for something else. Now they have an interoperability problem... Why is their interoperability problem something which the UTC cares about, when mine isn't? Why doesn't the use at your own risk principle apply to them just as much as to me? ... No. The *only* way to maintain compatibility between your applications and the system software is to ensure that your applications only do things that are supported by the system software. If you want RTL PUA, ask your system software vendor. Here, you're just whining into the wind. If you want me to quit whining, quit asking me to do things which you and I know very well are a waste of time. System software vendors are not going to do what I want, and we all know that very well. But I have a real requirement. 
The UTC has the power to meet my requirement, and to do so rather simply. I am asking them to meet it. Actually my current requirement is not so much for RTL PUA as for PUA variation selectors and/or combining characters which are default ignorable. RTL PUA is not so much of a problem, because at least in principle it should be possible to make PUA characters RTL by enclosing them in RLO ... PDF. I am not sure how well this is actually supported by system software. My current requirement could be met by defining a probably quite small set of PUA combining characters (with combining class zero) which would be default ignorable. For an example of why this might be useful, see my posting today to the Unicode Hebrew list. /|/|ike -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
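For readers who want to try the RLO ... PDF workaround mentioned above, here is a minimal Python sketch. It only builds the string and reports the default properties that Python's `unicodedata` module gives a private-use code point; whether a given renderer honours the override is a separate question, as the message says. The helper names `force_rtl` and `pua_defaults` and the choice of U+E000..U+E002 are assumptions made for the example.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def force_rtl(text):
    """Wrap text in RLO ... PDF so a bidi-aware renderer lays it out right to left."""
    return RLO + text + PDF

def pua_defaults(ch):
    """Return the default properties unicodedata reports for one character."""
    return {
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "ccc": unicodedata.combining(ch),
        "decomposition": unicodedata.decomposition(ch),
    }

pua_word = "\uE000\uE001\uE002"  # hypothetical private-use letters
print(pua_defaults(pua_word[0]))
print([f"U+{ord(c):04X}" for c in force_rtl(pua_word)])
```

The override lives in the plain text itself, so it travels with the data and needs no agreement about properties; the cost is that every run of such characters has to be bracketed.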
Re: What is the principle?
On 31/03/2004 12:40, Rick McGowan wrote: Peter Kirk wrote... ... I have a real requirement. The UTC has the power to meet my requirement, and to do so rather simply. I am asking them to meet it. Actually, you are not asking UTC anything. You are discussing the PUA on a public-access mail list. There's a big difference. This *is* the place to discuss as you are doing, and a good place to formulate your positions for eventual submission of a proposal, if any. Thanks for the clarification. I was aware of the distinction, and was using am asking loosely. I am undecided yet whether to make a formal proposal. Ken seems to suggest that this would be a waste of time - although I can see some advantages in obtaining a formal rejection. I wonder if anyone else on the UTC or associated with it might give some hope for such a proposal? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
On 31/03/2004 12:27, fantasai wrote: Peter Kirk wrote: Louis<THREE-PER-EM SPACE>XVI may have lost his head, but we don't want his number also to fall off on to the next line, or even to become too far separated from his name. We need to know what kind of space to use to resist the guillotine! NBSP You should not rely on fixed-width spaces to approximate regular spaces. A simple switch from Arial Narrow to Verdana will demonstrate why: the widths of normal spaces and non-breaking spaces are related to the width of the fonts' glyphs, whereas the width of a fixed-width space is related to the /height/ of the glyphs. But, as Ken has just clarified, with NBSP Louis' neck may be stretched rather uncomfortably, if not cut completely. Here is what I don't want to see (fixed width font required): Louis XVI was guillotinedin 1793. Here is what I do want: Louis XVI was guillotinedin 1793. These columns are unrealistically narrow to make the point clear, although such narrow columns are sometimes found in newspapers. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
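The line-breaking half of this example can be sketched in a few lines of Python. This is a toy, not a typographic engine: it uses the standard `textwrap` module, which counts one cell per character and treats only ASCII whitespace as a break opportunity, so NBSP (U+00A0) keeps the name and number together. It says nothing about how wide a justifying renderer would make the NBSP, which is the part of the problem raised above. The helper name `wrap_narrow` and the column width of 8 are assumptions for the illustration.

```python
import textwrap

def wrap_narrow(text, width=8):
    """Greedy wrap into a narrow column, one cell per character, never splitting words."""
    return textwrap.wrap(text, width=width, break_long_words=False)

plain = wrap_narrow("Louis XVI was guillotined in 1793.")
nbsp = wrap_narrow("Louis\u00A0XVI was guillotined in 1793.")

print(plain)  # the ordinary space is a break opportunity
print(nbsp)   # the NBSP is not, so "Louis XVI" stays on one line
```

With the ordinary space the number drops to the next line; with NBSP the pair overflows the column rather than being separated.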
RE: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
The NBSP issue was extensively discussed a couple of years ago, I don't remember in which list. In short, it was wrongly used by early web users as a fixed width space, and there is such a vast legacy it cannot be changed. However, there are other applications that use the intended meaning - see ISO 8859. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk Sent: Wednesday, March 31, 2004 10:13 PM To: Kenneth Whistler Cc: [EMAIL PROTECTED] Subject: Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels) On 31/03/2004 11:57, Kenneth Whistler wrote: ... To most people, a space is a space. To rather more, there is a second kind of space which they expect to be non-breaking and often also expect to be fixed width. (Those who had the latter expectation have had a nasty surprise today because with the release of 4.0.1 NBSP is suddenly no longer fixed width.) Hardly. It has *always* been the intent and understanding of the UTC that NBSP was comparable in all ways to a SPACE character, except for disallowing line break opportunities. ... Thanks for the clarification. I should say that the behaviour of NBSP suddenly reverted to what it had been in previous versions of the standard, although a perhaps inadvertant change was made in 4.0.0. Nevertheless, there does seem to be a widespread misunderstanding that NBSP is intended to be fixed width, and in many systems it is implemented as such. Perhaps there is a need to clarify this further, perhaps by reinstating text similar to what was in Unicode 3.0. I take your point about the advantages of having the drafters of the standard available to explain parts of the standard which are unclear. I certainly wish we could do that with other texts that you allude to. But there must also be controls here. If the text says black, we can't have the drafters saying that the text really means white. 
They can say that they made a mistake, and correct it in a new version, but there are limits on how far they can reinterpret even a text which they wrote themselves. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: What is the principle?
No. The *only* way to maintain compatibility between your applications and the system software is to ensure that your applications only do things that are supported by the system software. If what is meant here by your applications is any applications running on your system, then that is correct. If it means applications you have developed, then I'd suggest a revision: whenever your application depends upon system-supplied services, it must do things in the ways expected by those services; if those services don't serve the needs of an application, you must implement that functionality on your own. E.g. SIL's Graphite technology can deal with RTL PUA characters, but then it isn't relying on system-supplied services to do complex-script shaping of text. Peter Peter Constable Globalization Infrastructure and Font Technologies Microsoft Windows Division
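A minimal sketch of what "implement that functionality on your own" could look like at the property level, assuming an application that keeps its own override table for private-use code points and falls back to the standard data for everything else. The class name `PuaPropertyTable` and its methods are hypothetical, invented for this illustration; this is not how Graphite or any system service does it.

```python
import unicodedata

class PuaPropertyTable:
    """Hypothetical per-application override table for private-use code points."""

    def __init__(self):
        self._overrides = {}

    def define(self, cp, **props):
        """Record overrides (e.g. bidi='R', ccc=0) for one private-use code point."""
        if unicodedata.category(chr(cp)) != "Co":
            raise ValueError(f"U+{cp:04X} is not a private-use code point")
        self._overrides.setdefault(cp, {}).update(props)

    def bidi(self, ch):
        """Bidi class: the private override if present, else the UCD default."""
        default = unicodedata.bidirectional(ch)
        return self._overrides.get(ord(ch), {}).get("bidi", default)

    def ccc(self, ch):
        """Canonical combining class: the private override if present, else the UCD default."""
        default = unicodedata.combining(ch)
        return self._overrides.get(ord(ch), {}).get("ccc", default)

table = PuaPropertyTable()
table.define(0xE000, bidi="R")  # treat U+E000 as strong right-to-left, by private agreement
print(table.bidi("\uE000"), table.bidi("a"))
```

Such a table only helps inside code that consults it; text handed to a system-supplied bidi or shaping service would still be processed with the default properties.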
Re: What is the principle?
Ernest suggested: There are currently some 10 totally unused planes, with not even any tentative plans for them, Allocating one or two those into additional Private Use Areas with a variety of default characteristics instead of the monotonous default characteristics of the existing Private Use Areas should not prove too difficult. Fine. Make your formal proposal to the UTC and to SC2/WG2 and see whether it is difficult or not to convince the committees of the appropriateness of your approach. For example, 26 blocks of 128 Private Use Combining Marks each, each block corresponding to one of the existing canonical combining classes (with perhaps a larger block for class 0) would amply satisfy the needs of most private use scripts for combining marks. Similarly, blocks for additional characters that would have other properties which would be what, exactly? should be simple to define and for most combinations of property values, which would be what, exactly? As of Unicode 4.0.1, PropertyAliases.txt now lists 82 distinct character properties. Some of those, particularly those most relevant to complex script behavior and rendering, such as General_Category, Bidi_Class, Canonical_Combining_Class, Joining_Type, etc., are multi-valued. Do you have any idea how big the numbers start getting when combinatorics start to get involved here? Or are you planning to do the research first, via a comprehensive implementation of character properties such as IUC, to first determine what the actual existing number of combinations of property values is for the existing repertoire and properties and then make a principled projection of that into the uncertain world of characters for scripts which have not yet been encoded or modeled? 128 characters should also prove to be exceedingly ample For what? 
I'd have to take the time to list them, but a quick glance convinces me that there are at most several hundred combinations that would need to be supported if we limit things to just those combinations already in use. This may be correct, but you'd have to make the case based on the existing data from property implementations. (it might take more, if for example all 256 potential combining classes were supported instead of the 26 listed in UCD.html), At 128 characters per combination plus more for a few that might need them, it should prove possible to handle this in 1 or 2 planes. Which still begs the fundamental questions: Why this scheme instead of a much more flexible scheme, as outlined by Rick, for having an implementation with API support for establishing PUA properties on an as-needed basis? (Which requires *no* action by the UTC at all, by the way.) What makes you think, once you have such a scheme of property combinations worked out, and once you convinced the UTC of it (which I doubt), that you could also convince SC2/WG2 to do something comparable in 10646 to keep the standards in synch? Recall that SC2/WG2 has almost *no* concept of character properties -- those are added by the Unicode Standard. Bring in a proposal that says, We need to add two more planes of private use characters, with these special properties, because XYZ... and you'll get a row of blank stares from the national body representatives. Finally, assuming that you could get something like this into the standards, what makes you think that the platform vendors would complicate and expand their character property tables to support this speculative scheme? They have the option to not support all characters in the standard, and a new plane or two full of PUA characters with a checkerboard of speculative property assignments strike me as prime candidates for the kind of stuff they would simply say, We have no interest in supporting these things. 
I think you're spitting into the wind if you think you can force, through the character standardization process, the major platform vendors to support the kind of PUA functionality you are after, when they could do so *today* via much more extensible and architecturally sensible means given the existing PUA characters, but have not yet chosen to do so. --Ken
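The "do the research first" step can be approximated with a short Python sketch. It counts the distinct combinations of just three of the properties named above (General_Category, Bidi_Class, Canonical_Combining_Class) over assigned, non-private-use characters, using the UCD version bundled with the Python build rather than Unicode 4.0.1. Restricting to three properties is an assumption made to keep the example small; each additional multi-valued property can only increase the count.

```python
import unicodedata

def property_combinations():
    """Collect distinct (General_Category, Bidi_Class, ccc) triples over assigned characters."""
    combos = set()
    for cp in range(0x110000):
        ch = chr(cp)
        gc = unicodedata.category(ch)
        if gc in ("Cn", "Cs", "Co"):  # skip unassigned, surrogates, private use
            continue
        combos.add((gc, unicodedata.bidirectional(ch), unicodedata.combining(ch)))
    return combos

combos = property_combinations()
print("UCD", unicodedata.unidata_version, "->", len(combos),
      "combinations of 3 properties")
```

The result is only a lower bound on what a block-per-combination scheme would need, since it ignores Joining_Type, Line_Break, and the rest of the property list.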
Re: What is the principle?
From: Ernest Cline [EMAIL PROTECTED] I'd have to take the time to list them, but a quick glance convinces me that there are at most several hundred combinations that would need to be supported if we limit things to just those combinations already in use. (it might take more, if for example all 256 potential combining classes were supported instead of the 26 listed in UCD.html), At 128 characters per combination plus more for a few that might need them, it should prove possible to handle this in 1 or 2 planes. This seems highly excessive. We already have plenty of PUA space. All we need is a standard way (file format? protocol?) to transport PUA character properties, and possibly encode a reference (URI?) to the definition file or service. If Unicode does not want to do this job, at least it could participate in such independent development by commenting about the protocol/format used to encode these properties (notably to make sure that the system remains extensible and can encode new properties that may be added later). This would work in relation with the evolution of the Unicode standard itself (versioning) which may be handled correctly (however less efficiently) through a sort of emulation layer that would mimic the behavior of new standardized characters and properties. I won't expect that every application will be able to interpret this protocol or implement the emulation layer, but at least it becomes possible to create less ambiguous interoperable solutions based on other existing standards (that's why I think that, if such separate development is created, it should be based on the most advanced interoperability technologies of today, notably XML and its schemas and namespaces). You think this is overkill? 
Well, in the near future I think it will be difficult for applications to follow the evolution of the Unicode standard, and version differences will soon cause a nightmare unless there is a more formal way to specify what is implicitly part of a Unicode version (and so needs no complex protocol negotiation), clearly identified by an identifier resolvable by online services, and what can be supported as completely as possible by an emulation layer. XML schemas, because they are versionable, can really help here (notably because of the capability of modern XML parsers to use local caches for definition data, including local prebuilt-in implementations which are the most efficient). So I don't like the idea of adding more PUAs with other defaults. I much favor some more freedom in the use of PUAs, and a way to turn what looks like a deviation from the standard today into a conforming solution. It will become more important with the remaining scripts to encode, simply because we really lack some resources to be able to produce any standard for them. What this means is that the evolution of Unicode will soon become impossible without experimentation and gradual integration with some interoperable services. With the current standard stability policy, this need is even more important because further corrections of past errors will become nearly impossible (and so this will stop any attempt to make significant evolutions to the standard itself). It's clear that there are needs for PUAs today, just because Unicode is becoming a universal standard for more and more applications. If this universal standard blocks evolution, then others will want to develop independent standards and there will be a risk of splits caused by OS vendors themselves. 
(See what happened to Unix 15 years ago, and how difficult it is today to reunify what was initially a single standard. Thankfully, GNU and Linux have been the motors of such reunification, because other proprietary *nix versions are now converging for interoperability with Linux; but this unification is probably 15 to 20 years away, unless *nix vendors decide to preemptively abandon some dead branches and keep only those that users want and are ready to learn and support themselves.)
Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode
XML has become the de facto standard for fancy text. It is therefore useful to explore ways and means of bringing XML into plain text, since obviously plain text is simpler than, and superior to, fancy text. The current method involving and and and / and who knows what else is obviously much too complicated, and cannot interoperate with even the simplest plain text. Fortunately, the characters in planes 4 through B can come to our rescue. Plane 4 will be divided into mini-blocks of 32 (or perhaps 64) characters. The Unicode Consortium will allocated these on the usual basis (first come first served, once and for all, and free) to users for the representation of start-tags. For example, supposing that block 4 was allocated to the W3C HTML WG, we might represent html as U+4, head as U+40001, body as U+40002, and so on. In this way, the start-tag (exclusive of attributes and attribute values) is reduced to a single character. The last block will not be allocated; U+4FFFC will be used to indicate the beginning of a comment, and U+4FFFD the beginning of a processing instruction. Plane 5 will be automatically assigned in parallel to plane 4 for the representation of end-tags: thus, U+5 would be /html. U+5FFFC and U+5FFFD will have the obvious meanings. Plane 6 will also be allocated as mini-blocks and used for the representation of attribute names. If a Plane 4 character is followed by a Plane 6 character, then the start-tag has at least one attribute. The last mini-block will not be allocated; 6FFFD will be used to indicate that the current tag has no more attributes. Plane 7 is reserved for future use. Planes 8 through A are clones of planes 0 through 3 respectively, and are used to represent attribute value, comment, and processing instruction text. In this way, only character content is encoded using traditional Unicode characters. It is expected that a secondary market in mini-blocks would eventually arise. 
-- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~/cowan http://www.reutershealth.com But I am the real Strider, fortunately, he said, looking down at them with his face softened by a sudden smile. I am Aragorn son of Arathorn, and if by life or death I can save you, I will. --LotR Book I Chapter 10
Re: What is the principle?
Peter Kirk wrote... I am undecided yet whether to make a formal proposal. Ken seems to suggest that this would be a waste of time - Yes. I also think it would be a waste of time, but... although I can see some advantages in obtaining a formal rejection. ... I can also see some value in a formal rejection of any notion to subdivide the PUA into different property areas or make new PUAs with other properties. Rick
Re: What is the principle?
On 31/03/2004 12:28, Ernest Cline wrote: ... This is the kind of stuff the UTC refuses to start up by trying to provide some subdivision of semantics in the PUA. *That* is the principle, by the way, which guides the UTC position on the PUA: Use at your own risk, by private agreement. Which is why if any private use characters with default characteristics other than those of the existing Private Use blocks are ever to be part of Unicode they will need to be added as additional Private Use blocks, not by redefining existing PUA's There are currently some 10 totally unused planes, with not even any tentative plans for them, Allocating one or two those into additional Private Use Areas with a variety of default characteristics instead of the monotonous default characteristics of the existing Private Use Areas should not prove too difficult. For example, 26 blocks of 128 Private Use Combining Marks each, each block corresponding to one of the existing canonical combining classes (with perhaps a larger block for class 0) would amply satisfy the needs of most private use scripts for combining marks. Similarly, blocks for additional characters that would have other properties should be simple to define and for most combinations of property values, 128 characters should also prove to be exceedingly ample I'd have to take the time to list them, but a quick glance convinces me that there are at most several hundred combinations that would need to be supported if we limit things to just those combinations already in use. (it might take more, if for example all 256 potential combining classes were supported instead of the 26 listed in UCD.html), At 128 characters per combination plus more for a few that might need them, it should prove possible to handle this in 1 or 2 planes. Ernest, I support your general ideas here. But I am concerned about the implications of defining PUA characters with combining classes other than zero. 
I can see this causing some confusion with normalisation etc. And it does hugely multiply the number of PUA characters required. Let's think when one might need PUA characters with a non-zero combining class (cc). The relevant cases are all like B, M1, M2, where B is a base character and M1 and M2 are combining characters, one or both of them in your proposed extended PUA. And a non-zero cc is required only if you want this sequence to be canonically equivalent to B, M2, M1, and so want one of these to be converted to the other during normalisation - a reordering which can only happen if M1 and M2 both have a non-zero cc (and different values). Is it really necessary to support to this level of detail the concept of canonical equivalence of PUA sequences? Would it not be enough for those specifying the PUA characters to specify one of the orderings as correct and the other as a spelling error? I really can't see this requirement being widespread enough to justify defining the thousands of PUA characters with different combining classes which you propose. My proposal would rather be for a single group of PUA combining marks which all have cc=0, and are all default ignorable, with the result that they are not displayed when a regular font is selected. These could be used for non-standardised diacritics, mark-up (I mean this in the old-fashioned sense of marks added to the text rather than as a way of specifying formatting etc) etc, and also in effect as variation selectors if the private font specifies pseudo-digraphs. I don't know exactly how many might be required, but I am thinking tens or hundreds rather than thousands. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
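[Editorial illustration] The B, M1, M2 case discussed above can be tried out with Python's standard unicodedata module. This is only a sketch: U+0301 and U+0323 are standard marks chosen here as stand-ins for hypothetical PUA marks with non-zero classes, and the helper name canonically_equivalent is made up for the example.

```python
import unicodedata

B = "a"
M1 = "\u0301"  # COMBINING ACUTE ACCENT, canonical combining class 230
M2 = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220


def canonically_equivalent(s, t):
    """Two strings are canonically equivalent when their NFD forms are identical."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# Both marks have different, non-zero classes, so either order normalises
# to the same sequence (lower class first).
print(unicodedata.combining(M1), unicodedata.combining(M2))
print(canonically_equivalent(B + M1 + M2, B + M2 + M1))
```

If both marks had class 0 instead, no reordering would take place and the two spellings would simply be different strings.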
Re: Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode
Oops. Well... *That* was a day early. Rick
RE: What is the principle?
Mike Ayers [EMAIL PROTECTED] writes: Support? ROFL! Call up one of those companies and tell them that you are having trouble displaying PUA fonts, eastern or otherwise. I'd like to snoop on that call. Apple seemed pretty concerned about displaying PUA fonts on Mac OS X recently on this mail list. Personally, I doubt if I could get Microsoft to care if Windows or Word was causing my monitor to spin around and spit out pea soup, but if, say, Xerox was having trouble displaying the correct spelling of its directors' names and mentioned that they might have to go to Open Office for this, I'm sure Microsoft would find it quite important. It has nothing to do with the PUA; it has to do with who's complaining and how much weight they carry. This is the kind of stuff the UTC refuses to start up by trying to provide some subdivision of semantics in the PUA. *That* is the principle, by the way, which guides the UTC position on the PUA: Use at your own risk, by private agreement. ...and quit bothering us about it. That's gotta be in there somewhere. If not, I have an amendment to propose. Why don't we add that note to other blocks? It'd be so much easier if we could just tell the people using, say, the Hebrew block that "we've thrown something together for you, don't bother us if it doesn't work". Surely Unicode didn't waste two planes for something that no one can practically use.
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Peter Kirk scripsit: But, as Ken has just clarified, with NBSP Louis' neck may be stretched rather uncomfortably, if not cut completely. Here is what I don't want to see (fixed width font required): Louis XVI was guillotinedin 1793. This, however, is a matter of presentation rather than semantics, and as such fitly belongs in the realm of presentational markup. In HTML, one might specify <tt>&nbsp;</tt> to generate a fixed-width space. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan I amar prestar aen, han mathon ne nen, han mathon ne chae, a han noston ne 'wilith. --Galadriel, LOTR:FOTR
Re: What is the principle?
While I disagree with most of what you've said on this list, it is not an unreasonable proposal to change the default properties for some ranges of the private use blocks. I don't think that this would, in practice, really disturb any applications, because of #1 below. I have, however, a few observations. 1. PUA properties, as is clear from Ken's excellent descriptions, are simply defaults. With the exception of normalization, no Unicode implementation is required to observe them. So even if this change is made, any conformant implementation is free to simply ignore it and just assign its own properties. This would not be a magic wand. 2. Unicode properties are not sufficient for rendering. With technologies such as Apple's, all of the other work can be done in a font. With OpenType, most but not all can -- in particular, reordering has to be done by the application/OS. So complex scripts that require reordering still would not be interchangeable without private agreement. 3. Even excluding the normalization properties and other obviously inapplicable properties (such as name or age), there are some 50-odd possible character properties, many of them with multiple possible values: see http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt http://www.unicode.org/Public/UNIDATA/UCD.html#Properties http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt A concrete proposal would have to specify exactly which properties were relevant, and what the values are for the proposed ranges. (Clearly an even partition according to all the possible combinations would be completely impractical.) If the goal is rendering, this means looking at the possible combinations of properties that are relevant for rendering and proposing a division that makes sense. Mark __ http://www.macchiato.com - Original Message - From: Peter Kirk [EMAIL PROTECTED] To: Rick McGowan [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wed, 2004 Mar 31 16:24 Subject: Re: What is the principle? 
On 31/03/2004 12:40, Rick McGowan wrote: Peter Kirk wrote... ... I have a real requirement. The UTC has the power to meet my requirement, and to do so rather simply. I am asking them to meet it. Actually, you are not asking UTC anything. You are discussing the PUA on a public-access mail list. There's a big difference. This *is* the place to discuss as you are doing, and a good place to formulate your positions for eventual submission of a proposal, if any. Thanks for the clarification. I was aware of the distinction, and was using am asking loosely. I am undecided yet whether to make a formal proposal. Ken seems to suggest that this would be a waste of time - although I can see some advantages in obtaining a formal rejection. I wonder if anyone else on the UTC or associated with it might give some hope for such a proposal? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
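[Editorial illustration] The default PUA properties mentioned in this thread (general category, bidi class, combining class, decomposition) can be inspected with Python's standard unicodedata module. This is a sketch only; the helper name pua_defaults is made up for the example, and the values printed are whatever the Unicode Character Database bundled with the local Python reports.

```python
import unicodedata


def pua_defaults(cp):
    """Return a few default properties of a code point as reported by the UCD."""
    ch = chr(cp)
    return {
        "gc": unicodedata.category(ch),          # General_Category
        "bidi": unicodedata.bidirectional(ch),   # Bidi_Class
        "ccc": unicodedata.combining(ch),        # Canonical_Combining_Class
        "decomposition": unicodedata.decomposition(ch),
    }


# First and last BMP private use code points, plus one each from Planes 15 and 16.
for cp in (0xE000, 0xF8FF, 0xF0000, 0x10FFFD):
    print(f"U+{cp:04X}", pua_defaults(cp))
```

Every private use code point reports the same defaults, which is the "monotonous" behaviour the proposals in this thread want to vary.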
Re: Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode
From: [EMAIL PROTECTED] XML has become the de facto standard for fancy text. It is therefore useful to explore ways and means of bringing XML into plain text, since obviously plain text is simpler than, and superior to, fancy text. The current method involving < and > and & and / and who knows what else is obviously much too complicated, and cannot interoperate with even the simplest plain text. Fortunately, the characters in planes 4 through B can come to our rescue. Is it a joke? XML does not need it, and if something is expected it's possibly a structured binary representation of XML for easier processing with simplified parsers. Your proposal also ignores the current development of XML with namespaces and parsed entities resolved according to interchangeable schemas... (these last can be described in a structured binary representation, by respecting the XML document InfoSet), but until there's a need and formal definition for use in XML RPC services, I see no usage for such obscure encoding in planes 4 to 11 (because such encoding is unparsable, notably for document validation and reusability with non-colliding namespaces)... Thankfully, the current text-based XML syntax offers much more flexibility than what you propose here, which could lead to new flaws (probably even more than with the current syntax supported by existing XML parser implementations), including at the security level (think about how you would break XML Signatures)...
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Peter Kirk wrote: Louis XVI was guillotinedin 1793. Here is what I do want: Louis XVI was guillotinedin 1793. Louis\ XVI was guillotined in 1793. If you aren't using TeX, and you're doing this type of justification in small columns, your program ought to provide a way to do this. This is approaching italics or small capitals; it's necessary to look right, but it's not plain text.
Wasting Planes (was: RE: What is the principle?)
Surely Unicode didn't waste two planes for something that no one can practically use. Plane 15 and Plane 16 private use characters weren't the invention of the UTC, by the way. They derive from the original specification of ISO/IEC 10646-1. From ISO/IEC 10646-1: 1993: "The code positions of 32 planes from Plane E0 to Plane FF of Group 00 shall be for Private Use. The code positions of the 32 groups from Group 60 to Group 7F shall be for Private Use." That would have been: U-00E00000..U-00FFFFFD U-60000000..U-7FFFFFFD That was 8224 *planes* of private use code positions. Amendment 1 (the one that defined UTF-16) amended that to read: "The code positions of the 32 groups from Group 60 to Group 7F shall be for private use. The code positions of Plane 0F and Plane 10, and of the 32 planes from Plane E0 to Plane FF, of Group 00 shall be for private use. The 6400 code positions E000 to F8FF of the Basic Multilingual Plane shall be for private use." That was 8226 *planes* of private use code positions, besides the 6400 code positions on the BMP (which had been defined earlier, but not spelled out in the same clause with the rest of the private use allocation). The addition of Plane 0F and Plane 10 was so there were some private use planes accessible via UTF-16. In that grand proliferation of wastage, 10646 allowed for 539,089,084 private use code positions. That was a wee tad more than anyone actually needed to use, by the way. More recent amendments to 10646 have simply settled on the principle that *all* code positions beyond U-0010FFFF are reserved, leaving the 6400 private use code positions on the BMP, plus Plane 0F and Plane 10. In the grand scheme of things, that seems to be the Goldilocks solution -- not too small (6400) and not too big (539,089,084) -- but just right (137,468). There are people who have valid reasons for making use of Plane 0F or Plane 10 private use characters, by the way, but most of those reasons have to do with CJK. 
And the reason for that should be pretty obvious -- only the CJK script deals with the kind of entity numbers (multiple 10's of thousands) that make the 6400 code points of the BMP PUA seem cramped. *Any* other unencoded script, for example, with the possible exceptions of Egyptian hieroglyphics or Tangut ideographs, would fit into the BMP PUA with plenty of room to spare. --Ken
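[Editorial illustration] The plane and code position counts quoted in Ken's message can be reproduced with a few lines of Python. One assumption is made here that the message does not state explicitly: each private use plane is counted as 65,534 positions, i.e. excluding the two positions ending in FFFE and FFFF. With that assumption the arithmetic matches the figures given above.

```python
# Assumption: each full private use plane contributes 0x10000 - 2 positions
# (the positions ending in FFFE and FFFF are not counted).
PER_PLANE = 0x10000 - 2
BMP_PUA = 0xF8FF - 0xE000 + 1          # E000..F8FF on the BMP

planes_1993 = 32 + 32 * 256            # Planes E0..FF of Group 00, plus Groups 60..7F
planes_amd1 = planes_1993 + 2          # Amendment 1 adds Plane 0F and Plane 10

total_amd1 = planes_amd1 * PER_PLANE + BMP_PUA
total_now = 2 * PER_PLANE + BMP_PUA    # Plane 0F, Plane 10, and the BMP PUA

print(planes_1993, planes_amd1)
print(f"{total_amd1:,}", f"{total_now:,}")
```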
Re: What is the principle?
[Original Message] From: Peter Kirk [EMAIL PROTECTED] Ernest, I support your general ideas here. But I am concerned about the implications of defining PUA characters with combining classes other than zero. I can see this causing some confusion with normalisation etc. And it does hugely multiply the number of PUA characters required. snip Is it really necessary to support to this level of detail the concept of canonical equivalence of PUA sequences? If you want them to be able to interact with the existing combining marks then any proposal for more specific private use characters will need to include combining characters for every existing combining class. 128 characters per class may prove to be overly generous, but it serves as a starting point for discussion. The number was chosen because of the stated preference of assigning character blocks that line up in groups of 128. A detailed proposal would definitely need to examine existing scripts as it would be wasteful to assign too many yet pointless to assign too few. I can't see any useful proposal for more specific Private Use characters as using less than half a plane. Any proposal that uses more than one plane will need a lot of justifying to have any chance, and even with ten unspoken-for planes out there, any proposal that would call for more than two planes will not go anywhere.
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
On 31/03/2004 14:25, [EMAIL PROTECTED] wrote: Peter Kirk scripsit: But, as Ken has just clarified, with NBSP Louis' neck may be stretched rather uncomfortably, if not cut completely. Here is what I don't want to see (fixed width font required): Louis XVI was guillotinedin 1793. This, however, is a matter of presentation rather than semantics, and as such fitly belongs in the realm of presentational markup. In HTML, one might specify <tt>&nbsp;</tt> to generate a fixed-width space. I disagree. Surely there is something SEMANTICALLY different about the space in Louis XVI. One semantic difference is that it is non-breaking. But another one is that these words should not be split apart. An additional semantic distinction might be that they should be treated as one word for the purposes of word breaking algorithms. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: What is the principle?
On 31/03/2004 13:30, Kenneth Whistler wrote: ... I think you're spitting into the wind if you think you can force, through the character standardization process, the major platform vendors to support the kind of PUA functionality you are after, when they could do so *today* via much more extensible and architecturally sensible means given the existing PUA characters, but have not yet chosen to do so. --Ken Ken, I take many of your points. But in this last paragraph you are comparing two very different things. If Ernest's proposal were accepted, major platform vendors could (although that does not necessarily imply that they would) implement it rather simply by updating the tables of character properties within their systems. Indeed I would expect such tables to be updated more or less automatically by some process of importing and compiling the Unicode character database (including the default properties for the PUA). That is a far easier task than the one which they have not yet chosen to do, to support tables of character properties within PUA fonts, because this latter requires significant software development effort and may not fit well within existing system architecture. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Here is what I do want: Louis XVI was guillotinedin 1793. Louis\ XVI was guillotined in 1793. If you aren't using TeX, and you're doing this type of justification in small columns, your program ought to provide a way to do this. Other possible approaches that any industrial-strength typesetting program ought to provide: A. Select Louis XVI. Set 'Keep together on line' as a property to prevent inappropriate line breaking. Set 'Prevent inter-word space justification' to prevent the justification algorithm from adjusting the space width beyond the value provided by the SPACE in the font. B. Select Louis XVI. Enter it into the hyphenation and line breaking dictionary used by the program and set appropriate properties on the entry in the dictionary. C. Simply select the space in the text and set it to 'no-break', 'no-adjust'. Any of these alternatives could be implemented with just a simple U+0020 SPACE character sitting in the text itself. That is in addition to solutions that make use of actual fixed-width space character codes surrounded by ZWNBSP characters to prevent line breaking. The point is that looking to encode a special character in Unicode for every distinct visual effect in typesetting is not necessarily the first, best solution to settle on. It might not even be seventh or eighth best on the possible list of alternative approaches to solve the problem. --Ken Louis XVI was guillotined on Jan. 21, 1793, facing death with courage.
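[Editorial illustration] The last approach Ken mentions, a fixed-width space code surrounded by ZWNBSP characters, might look like this in Python. This is a sketch under assumptions: U+2002 EN SPACE is picked here only as an example of a fixed-width space, and the helper name glue is made up. Note also that U+2060 WORD JOINER has since become the preferred character for the no-break function of U+FEFF.

```python
import unicodedata

ZWNBSP = "\ufeff"     # ZERO WIDTH NO-BREAK SPACE, as named in the message above
EN_SPACE = "\u2002"   # one possible fixed-width space (an assumption for this sketch)


def glue(left, right, space=EN_SPACE):
    """Join two words with a fixed-width space that is protected from line breaks."""
    return f"{left}{ZWNBSP}{space}{ZWNBSP}{right}"


text = glue("Louis", "XVI") + " was guillotined in 1793."

# Show the three characters that sit between "Louis" and "XVI".
print([unicodedata.name(c) for c in text[5:8]])
```

Whether a given renderer honours this is a separate question; the sketch only shows how the character sequence is built.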
Re: Unicode 4.0.1 Released
Marco Cimarosti scripsit: So far, my understanding was that the normative properties of existing code points were carved in stone. Not all normative properties are immutable. A normative property is simply one which you have to get right if you claim conformance to that part of Unicode: you cannot make PLUS SIGN a letter. Immutable properties are those which Unicode guarantees will never change; they are a subset of the normative properties. Won't these fixes break applications out there? I.e., won't they turn previously conformant applications into non-conformant ones? They will conform to previous versions but not to newer versions. -- BALIN FUNDINUL UZBAD KHAZADDUMU [EMAIL PROTECTED] BALIN SON OF FUNDIN LORD OF KHAZAD-DUM http://www.ccil.org/~cowan
Arabic Shaping Classes
Well, I've decided to start what is probably a quixotic quest for a better set of private use characters. Such a proposal will need to be complete, but it had best be as simple as possible. That leads me to my first question. Where is the Arabic Shaping Class property normally taken care of? Is this something that is normally handled inside a font, and as such, something that can be left with a generic default, or is this something which would require explicitly setting aside ranges for each shaping class? (With 54 existing shaping classes, I sincerely hope that not having to set ranges for each class will prove feasible.) Ernest Cline [EMAIL PROTECTED]
Re: What is the principle?
On 31/03/2004 14:27, Mark Davis wrote: While I disagree with most of what you've said on this list, it is not an unreasonable proposal to change the default properties for some ranges of the private use blocks. I don't think that this would, in practice, really disturb any applications, because of #1 below. I have, however, a few observations. 1. PUA properties, as is clear from Ken's excellent descriptions, are simply defaults. With the exception of normalization, no Unicode implementation is required to observe them. So even if this change is made, any conformant implementation is free to simply ignore it and just assign its own properties. This would not be a magic wand. Understood. But I was rather thinking that at least some implementations base their character properties directly on the Unicode character database. Isn't this what ICU does? And so, if the PUA default properties are the ones in the UCD, they would automatically be used by implementations. 2. Unicode properties are not sufficient for rendering. With technologies such as Apples, all of the other work can be done in a font. With OpenType, most but not all can -- in particular, reordering has to be done by the application/OS. So complex scripts that require reordering still would not be interchangeable without private agreement. This is why the suggestions made for storing character properties in the font are unrealistic; they require major restructuring of system software (close to rewriting the whole OS, as I wrote earlier), not just tinkering. I accept that there may be some practical limitations on PUA complex scripts, but I would like them to be a lot less than they are now. 3. 
Even excluding the normalization properties and other obvious inapplicable properties (such as name or age), there are some 50-odd possible character properties, many of them with multiple possible values: see http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt http://www.unicode.org/Public/UNIDATA/UCD.html#Properties http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt A concrete proposal would have to specify exactly which properties were relevant, and what the values are for the proposed ranges. (Clearly an even partition according to all the possible combinations would be completely impractical.) If the goal is rendering, this means looking at the possible combinations of properties that are relevant for rendering and proposing a division that makes sense. That is why I (rather than Ernest) have discussed only rendering related properties like bidi and default ignorable. I realise that there may be other properties which need to be considered, but I am not yet sure which these are. I sense that you prefer to change the default properties of existing PUA characters rather than add new ones. Might it be sensible to adjust the properties in one of the PUA planes but leave the other one untouched? Has ANYONE actually defined characters in one or other of these planes, and if so, which? It would make more sense to change the default properties of a plane which no one is actually using. Mark __ http://www.macchiato.com -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: What is the principle?
On 31/03/2004 15:32, Ernest Cline wrote: [Original Message] From: Peter Kirk [EMAIL PROTECTED] Ernest, I support your general ideas here. But I am concerned about the implications of defining PUA characters with combining classes other than zero. I can see this causing some confusion with normalisation etc. And it does hugely multiply the number of PUA characters required. snip Is it really necessary to support to this level of detail the concept of canonical equivalence of PUA sequences? If you want them to be able to interact with the existing combining marks then any proposal for more specific private use characters will need to include combining characters for every existing combining class. ... I don't see it. If the PUA combining marks have cc=0, they can never be reordered. As long as other marks are always written in canonical order, they will in practice never be moved relative to other marks. Perhaps you are thinking of a sequence something like B, M1, M2, M3 in which M1 and M2 interact typographically, but M1 is PUA and M2 and M3 are not. The normal Unicode rule would be that cc(M1)=cc(M2). But this is no guarantee against a reordering to B, M1, M3, M2, which is still canonically equivalent but M1 and M2 have been separated. If instead cc(M1)=0 and cc(M2) and cc(M3) are distinct non-zero classes, B, M1, M3, M2 is still canonically equivalent but with M1 and M2 separated, but the situation is no worse than by the normal Unicode rule. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
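[Editorial illustration] The B, M1, M2, M3 case, with M1 a private use mark of class 0, can be checked with Python's standard unicodedata module. This is a sketch: U+E000 stands in for a hypothetical PUA combining mark, and U+0301 (class 230) and U+0323 (class 220) are arbitrary standard marks chosen for the example.

```python
import unicodedata

B = "a"
M1 = "\ue000"   # private use code point; combining class 0 by default
M2 = "\u0301"   # COMBINING ACUTE ACCENT, class 230
M3 = "\u0323"   # COMBINING DOT BELOW, class 220


def nfd(s):
    return unicodedata.normalize("NFD", s)


def show(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)


# M1 has class 0, so it never moves; M2 and M3 are still reordered after it,
# which separates M1 from M2.
print(show(nfd(B + M1 + M2 + M3)))

# A standard mark written before M1 cannot be reordered across it either.
print(show(nfd(B + M2 + M1 + M3)))
```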
RE: Unicode 4.0.1 Released
* Changed: bidi class of several characters Won't these fixes break applications out there? I.e., won't they turn previously conformant applications into non conformant ones? And the other thing to understand about this particular change is that it is the outcome of a years-long debate and a painstakingly negotiated settlement that reminded me of other difficult negotiations involving Middle Eastern issues. The upshot will be that Microsoft-based applications will come *into* compliance with the Bidirectional Algorithm, with the crucial few character property changes as stated. And IBM and other vendors who had implemented based on the prior property values agreed that it was worth the tweak in order to bring the entire industry into an agreed, interoperable state for bidirectional behavior (in the absence of higher-level protocols) involving the crucial ASCII characters, '+', '-', and '/', that interact with URL's, dates, times, and numbers in a bidirectional context. --Ken
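[Editorial illustration] The bidi classes of the three ASCII characters named above can be looked up with Python's standard unicodedata module. This is a sketch; the values reported depend on the version of the Unicode Character Database bundled with the local Python, which is later than the 4.0.1 release discussed here. The helper name bidi_classes is made up for the example.

```python
import unicodedata


def bidi_classes(chars):
    """Map each character to the Bidi_Class reported by the local UCD."""
    return {ch: unicodedata.bidirectional(ch) for ch in chars}


print("UCD version:", unicodedata.unidata_version)
for ch, cls in bidi_classes("+-/").items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cls}")
```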
Re: What is the principle?
comments below. Mark __ http://www.macchiato.com - Original Message - From: Peter Kirk [EMAIL PROTECTED] To: Mark Davis [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wed, 2004 Mar 31 19:15 Subject: Re: What is the principle? On 31/03/2004 14:27, Mark Davis wrote: While I disagree with most of what you've said on this list, it is not an unreasonable proposal to change the default properties for some ranges of the private use blocks. I don't think that this would, in practice, really disturb any applications, because of #1 below. I have, however, a few observations. 1. PUA properties, as is clear from Ken's excellent descriptions, are simply defaults. With the exception of normalization, no Unicode implementation is required to observe them. So even if this change is made, any conformant implementation is free to simply ignore it and just assign its own properties. This would not be a magic wand. Understood. But I was rather thinking that at least some implementations base their character properties directly on the Unicode character database. Isn't this what ICU does? And so, if the PUA default properties are the ones in the UCD, they would automatically be used by implementations. Yes, some do (and ICU does pick up the default). Just pointing out that implementations can freely choose the properties (except normalization). BTW, you have been mentioning the combining class; you can have combining marks in the PUA, but they have to have zero combining classes. 2. Unicode properties are not sufficient for rendering. With technologies such as Apples, all of the other work can be done in a font. With OpenType, most but not all can -- in particular, reordering has to be done by the application/OS. So complex scripts that require reordering still would not be interchangeable without private agreement. 
This is why the suggestions made for storing character properties in the font are unrealistic; they require major restructuring of system software (close to rewriting the whole OS, as I wrote earlier), not just tinkering. I accept that there may be some practical limitations on PUA complex scripts, but I would like them to be a lot less than they are now. ANY dynamic reassignment of properties requires a major overhaul. There have been proposals over the years for exchange of PU property data. All of them have died, and I never expect to see any succeed. The reason is that most implementations just get properties with static calls, e.g. isLetter(x). To change it to be dynamic, all of these calls in all programs would have to be changed to reference a dynamic collection of properties. In a single-threaded world, this wouldn't be too bad. But that is not our world -- which is a multi-threaded world -- there it is nasty; and horrible if the same document is expected to contain different sets of PU properties. There are also performance implications, since properties are used so heavily in processing. These are not whims of software vendors; they would be very expensive retrofits for essentially no benefit. 3. Even excluding the normalization properties and other obvious inapplicable properties (such as name or age), there are some 50-odd possible character properties, many of them with multiple possible values: see http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt http://www.unicode.org/Public/UNIDATA/UCD.html#Properties http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt A concrete proposal would have to specify exactly which properties were relevant, and what the values are for the proposed ranges. (Clearly an even partition according to all the possible combinations would be completely impractical.) 
If the goal is rendering, this means looking at the possible combinations of properties that are relevant for rendering and proposing a division that makes sense. That is why I (rather than Ernest) have discussed only rendering related properties like bidi and default ignorable. I realise that there may be other properties which need to be considered, but I am not yet sure which these are. Those alone won't work. If you want stuff to render right, then you have to include *any* property that systems may use to affect display. You do want these characters to linebreak correctly, eh? That's why I said that a complete proposal would have to spell out all the properties that would be considered, and give reasons for the inclusions/exclusions. I sense that you prefer to change the default properties of existing PUA characters rather than add new ones. Might it be sensible to adjust the properties in one of the PUA planes but leave the other one untouched? Has ANYONE actually defined characters in one or other of these planes, and if so, which? It would make more sense to change the default properties of a plane which no one is actually using. 1. There is no way I would advocate
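[Editorial illustration] The contrast drawn above between static property calls such as isLetter(x) and a dynamic, per-document set of private use properties can be sketched in Python. Everything here is hypothetical: the PrivateAgreement class and its methods are invented for the example and do not correspond to any real library API.

```python
import unicodedata


def is_letter(ch):
    """'Static' style: one global answer per code point, straight from the UCD."""
    return unicodedata.category(ch).startswith("L")


class PrivateAgreement:
    """Hypothetical per-document table of PUA overrides (not a real API)."""

    def __init__(self, categories):
        # Maps code point -> General_Category agreed on privately.
        self._categories = dict(categories)

    def category(self, ch):
        # Fall back to the UCD default when no override is present.
        return self._categories.get(ord(ch), unicodedata.category(ch))

    def is_letter(self, ch):
        return self.category(ch).startswith("L")


# Two documents that assign different meanings to the same code point.
doc_a = PrivateAgreement({0xE000: "Lo"})
doc_b = PrivateAgreement({0xE000: "Mn"})

print(is_letter("\ue000"), doc_a.is_letter("\ue000"), doc_b.is_letter("\ue000"))
```

The point of the sketch is the extra argument: every caller has to know which agreement applies, which is the retrofit cost described above.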
Re: Fixed Width Spaces (was: Printing and Displaying DependentVowels)
Peter Kirk wrote: On 31/03/2004 14:25, [EMAIL PROTECTED] wrote: Peter Kirk scripsit: But, as Ken has just clarified, with NBSP Louis' neck may be stretched rather uncomfortably, if not cut completely. Here is what I don't want to see (fixed width font required): Louis XVI was guillotinedin 1793. This, however, is a matter of presentation rather than semantics, and as such fitly belongs in the realm of presentational markup. In HTML, one might specify <tt>&nbsp;</tt> to generate a fixed-width space. I disagree. Surely there is something SEMANTICALLY different about the space in Louis XVI. One semantic difference is that it is non-breaking. But another one is that these words should not be split apart. An additional semantic distinction might be that they should be treated as one word for the purposes of word breaking algorithms. non-breaking and non-stretching are presentational properties, not semantic ones. They don't change the meaning of the space: it's still just a space, not a hyphen or the letter g. They don't affect non-visual media; we don't break lines in spoken speech. Louis XVI is semantically different from Louis' head because the former is a bare noun whereas the latter is a noun phrase, but as far as the reader is concerned, they're both separated with a space. Whether the space breaks or not or stretches or not has no effect on either the meaning or correctness of the text. It only affects its (visual) aesthetic quality. ~fantasai -- http://fantasai.inkedblade.net/contact
Re: What is the principle?
[Original Message] From: Kenneth Whistler [EMAIL PROTECTED] To: [EMAIL PROTECTED] Scenario: The UTC listens to you and defines some section of the PUA as strong right-to-left by default for use in PUA-defined bidirectional scripts. Somebody else is *already* using that section of the PUA for something else. Now they have an interoperability problem, because the default behavior they were depending on changes over in some future version of some software, not under their control, and their data gets munged by bidi. So? Let *them* fix *their* software. They should know, same as the rest of us, that you can't depend on the PUA. If they wanted LTR base glyphs, then they should have coded that into their system. Didn't someone just say that the normative properties of characters are not necessarily etched in stone? If that's true of anything, it should be of the PUA. ~mark
Re: Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode
[EMAIL PROTECTED] wrote: XML has become the de facto standard for fancy text. It is therefore useful to explore ways and means of bringing XML into plain text, since obviously plain text is simpler than, and superior to, fancy text. The current method involving < and > and & and / and who knows what else is obviously much too complicated, and cannot interoperate with even the simplest plain text. Fortunately, the characters in planes 4 through B can come to our rescue. Heh... I've occasionally caught myself almost wishing for this kind of setup, ridiculous though it be. It would be nice to be able to get just the *content* of the text without having to bother with all that mucking about with HTML rendering engines and whatnot. I suppose only a programmer (and a semi-Luddite one at that, who won't or can't use existing packages) would really care, though. Now, if we can just simplify ASCII down to *one* character and some variation selectors... ~mark
Line Break class of U+FE51 Small Ideographic Comma
Given that U+3001 IDEOGRAPHIC COMMA and U+FE50 SMALL COMMA are both of Line Break class CL, wouldn't it make sense for U+FE51 SMALL IDEOGRAPHIC COMMA to also be of class CL instead of class ID?
Re: Doing Markup in Plain Text: A Modest Proposal for Planes 4-B of Unicode
Mark E. Shoulson scripsit: Heh... I've occasionally caught myself almost wishing for this kind of setup, ridiculous though it be. It would be nice to be able to get just the *content* of the text without having to bother with all that mucking about with HTML rendering engines and whatnot. TSaxon (http://www.ccil.org/~cowan/XML/tagsoup/tsaxon) is the ticket here, with a trivial stylesheet that just specifies text output. Use the -H switch to allow arbitrary HTML input. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com Not to know The Smiths is not to know K.X.U. --K.X.U.