Re: Questions on ZWNBS - for line initial holam plus alef
On 19/09/2003 00:47, Kent Karlsson wrote: ... How should a text rendering library deal with dbl_diacritic>? Should the character after the diacritic be drawn under the right half of the diacritic, yes Nitpick: under or above, as appropriate, the right "half" of the dbl diacritic. There are some dbl diacritics that are below (combining class 233, in the current cc numbering). The double diacritics aren't really intended for use with bidi scripts, ... While we are nitpicking... This may be true of the currently defined double diacritics. But there may be double diacritics not yet defined in RTL scripts. This thread started with a suggestion that Hebrew holam might be considered as a double diacritic, at least in that it tends to be centred above the gap between the preceding and following characters, and a similar analysis might be suitable for the Arabic hamza currently being discussed on the bidi list. And I can see that people might well want to use the existing double diacritics e.g. to indicate ligatures or double articulation in Hebrew or Arabic script phonetic transcriptions. ... but logically the character after a dbl diacritic would ideally in RTL go under the LEFT "half" of the diacritic (I don't expect implementations to actually do the latter). They might not have to do this in the current version of Unicode, because of the accident that no double diacritics have so far been defined for RTL scripts, but they might have to in the next one. So it might be sensible to avoid using direction-bound code here. /kent k -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Questions on ZWNBS - for line initial holam plus alef
... > >How should a text rendering library deal with >dbl_diacritic>? Should the character after the diacritic be > >drawn under the right half of the diacritic, > > yes Nitpick: under or above, as appropriate, the right "half" of the dbl diacritic. There are some dbl diacritics that are below (combining class 233, in the current cc numbering). The double diacritics aren't really intended for use with bidi scripts, but logically the character after a dbl diacritic would ideally in RTL go under the LEFT "half" of the diacritic (I don't expect implementations to actually do the latter). /kent k
Re: Questions on ZWNBS - for line initial holam plus alef
At 08:36 PM 9/18/03 -0400, Noah Levitt wrote:

On Mon, Aug 11, 2003 at 12:57:11 -0700, Kenneth Whistler wrote: > Kent asked: > > > How should a freestanding double diacritic be encoded (for purposes of > > meta-discussions, and the like): or > diacritic, SPACE>? > > It *could* be represented as , of course, > or for that matter , or other possibilities. > The combining character sequence, in either case, is the > sequence.

How should a text rendering library deal with ? Should the character after the diacritic be drawn under the right half of the diacritic,

yes

or beyond its rightmost ink?

no. For the simple fallback rendering, this should be driven by the ABC width of the double diacritic. The drawback of that method is that it must make assumptions about the width of the characters to which the combining mark is applied. By making the A width = -B/2 and the C width = -B/2, the combining character has no advance width and its central point resides between the two characters.

For more complex rendering, both the central point and the width would need to be adjusted, as well as the height for use with capitals vs lower case.

A./

Noah
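The ABC arithmetic above can be checked with a few lines of Python. This is only a sketch of the fallback metrics as described in the message; the function names and the sample ink width are made up for illustration and do not come from any font API.

```python
def double_diacritic_abc(ink_width):
    """Return hypothetical (A, B, C) widths for a double diacritic.

    A and C are each set to -B/2, as suggested in the message, so the
    mark's ink is centred on the pen position between two characters.
    """
    a = -ink_width / 2
    b = ink_width
    c = -ink_width / 2
    return a, b, c


def ink_extent(pen_x, abc):
    """Return (left ink edge, right ink edge, advance) for a mark drawn at pen_x."""
    a, b, c = abc
    left = pen_x + a
    right = left + b
    advance = a + b + c
    return left, right, advance


# A mark with 40 units of ink, drawn with the pen at x = 100, i.e. at the
# boundary between the preceding and the following character.
extent = ink_extent(100, double_diacritic_abc(40))
print(extent)
```

The ink straddles the pen position evenly and the advance is zero, which is the property the message relies on. As noted there, this says nothing about how wide the two neighbouring characters really are.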
Re: Questions on ZWNBS - for line initial holam plus alef
On Mon, Aug 11, 2003 at 12:57:11 -0700, Kenneth Whistler wrote: > Kent asked: > > > How should a freestanding double diacritic be encoded (for purposes of > > meta-discussions, and the like): or > diacritic, SPACE>? > > It *could* be represented as , of course, > or for that matter , or other possibilities. > The combining character sequence, in either case, is the > sequence. How should a text rendering library deal with ? Should the character after the diacritic be drawn under the right half of the diacritic, or beyond its rightmost ink? Noah
RE: Questions on ZWNBS - for line initial holam plus alef
Kent Karlsson said:

> I see no particular *technical* problem with using WJ, though. In contrast to the suggestion of using CGJ (re. another problem) anywhere else but at the end of a combining sequence. CGJ has combining class 0, despite being invisible and not ("visually") interfering with any other combining mark. Using CGJ at a non-final position in a combining sequence puts in doubt the entire idea with combining classes and normal forms.

Why? There are any number of combining characters with combining class 0, including the vast majority of Indic dependent vowels, for instance.

A combining character sequence is a base character followed by any number of combining characters. There is no constraint in that definition that the combining characters have to have non-zero combining class.

Canonical reordering is scoped to stop at combining class = 0. It doesn't say that it applies to combining character sequences per se. It applies to *decomposed* character sequences (meaning, effectively, any sequence which has had the recursive application of the decomposition mappings done).

Take a Myanmar example: /kau/:

character sequence:    <1000, 1031, 102C, 1039, 200C>
combining?:               no   yes   yes   yes    no
combining classes:         0     0     0     9     0
comb char sequence:    |---------------------|
canon reorder scope:   |---| |---| |---------| |---|

The combining character sequence here is: <1000, 1031, 102C, 1039>

The *syllable* consists of that plus the trailing ZWNJ.

But the relevant sequences for application of the canonical reordering algorithm are each sequence starting with combining class zero and continuing through any sequence with combining class not zero.

I don't see how introduction of CGJ into such sequences calls any of the definitions or algorithms into question.

--Ken
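The scoping described above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch: `reorder_scopes` is a name invented here, and the values reported depend on the Unicode data shipped with the Python build in use.

```python
import unicodedata


def reorder_scopes(text):
    """Split text into the runs over which canonical reordering operates.

    Each run starts at a character with combining class 0 and continues
    through the following characters whose combining class is non-zero.
    """
    scopes = []
    for ch in text:
        if unicodedata.combining(ch) == 0 or not scopes:
            scopes.append([ch])
        else:
            scopes[-1].append(ch)
    return ["".join(run) for run in scopes]


# The Myanmar /kau/ example: <1000, 1031, 102C, 1039, 200C>
kau = "\u1000\u1031\u102C\u1039\u200C"
classes = [unicodedata.combining(ch) for ch in kau]
scopes = reorder_scopes(kau)

print(classes)
print([[f"{ord(ch):04X}" for ch in run] for run in scopes])
```

CGJ (U+034F) also has combining class 0, so under the same logic it simply starts a new run, like the class-0 dependent vowels in the example.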
Re: Questions on ZWNBS - for line initial holam plus alef
Kenneth Whistler scripsit:

> D17a Defective combining character sequence: A combining character sequence that does not start with a base character.
>
> * Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character. Such sequences are defective from the point of view of handling of combining marks, but are not ill-formed.
> ^^

What, if anything, does the term "ill-formed" mean when attached to a sequence of characters? I understood that every sequence of characters whatsoever is permitted.

--
"But the next day there came no dawn, and the Grey Company passed on into the darkness of the Storm of Mordor and were lost to mortal sight; but the Dead followed them. --"The Passing of the Grey Company"
John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://reutershealth.com
Re: Questions on ZWNBS - for line initial holam plus alef
On Wednesday, August 06, 2003 12:38 PM, Kent Karlsson <[EMAIL PROTECTED]> wrote: > Since I think should be canonically > equivalent to , but cannot be made > so (now), the only ways out seem to be to either formally deprecate > CGJ, or at least confine it to very specific uses. Other occurrences > would not be ill-formed or illegal, but would then be non-conforming.

There's a way to specify that is well-formed, but not : a CGJ can be authorized in a combining sequence only if it precedes a base character, or precedes a combining character whose combining class is strictly lower than the combining class of the previous character.

So, with this definition, with the combining classes indicated:
- is well-formed because 220 < 230. It is distinct from: , whose canonical ordering is
- is ill-formed because 230 > 220. The CGJ is superfluous and should be removed to create:
- is ill-formed because 220 = 220. The CGJ is superfluous and should be removed to create: which is well-formed and in canonical order.
- is ill-formed because 220 = 220. The CGJ is superfluous and should be removed to create: which is well-formed and in canonical order.

This "well-formed" rule would clearly give an exact semantics for CGJ, used in the middle of a combining sequence as the only way to bypass the canonical reordering of combining characters.
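The proposed rule is easy to prototype. The sketch below is one reading of it, not anything defined by Unicode: `cgj_allowed` is a hypothetical name, a combining class of 0 is used as a stand-in for "base character", and a CGJ at the very end of the string is accepted because the rule as stated does not cover that case. The sample marks (U+0301, class 230; U+0323 and U+0331, class 220) are chosen here for illustration because the original sequences are not preserved in the message.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def cgj_allowed(text):
    """Check every CGJ in text against the rule proposed in the message."""
    for i, ch in enumerate(text):
        if ch != CGJ:
            continue
        if i + 1 == len(text):
            continue  # assumption: a trailing CGJ is not covered by the rule
        next_ccc = unicodedata.combining(text[i + 1])
        if next_ccc == 0:
            continue  # CGJ precedes a base (class 0 used as an approximation)
        prev_ccc = unicodedata.combining(text[i - 1]) if i > 0 else 0
        if not next_ccc < prev_ccc:
            return False  # CGJ is superfluous under the proposed rule
    return True


print(cgj_allowed("a\u0301" + CGJ + "\u0323"))  # 230, CGJ, 220
print(cgj_allowed("a\u0323" + CGJ + "\u0301"))  # 220, CGJ, 230
print(cgj_allowed("a\u0323" + CGJ + "\u0331"))  # 220, CGJ, 220

# Why the first case matters: without CGJ, NFD reorders the two marks;
# with CGJ between them, the order is preserved.
print(unicodedata.normalize("NFD", "a\u0301\u0323") == "a\u0323\u0301")
print(unicodedata.normalize("NFD", "a\u0301" + CGJ + "\u0323") == "a\u0301" + CGJ + "\u0323")
```

Only the first sequence passes, which matches the four cases listed in the message: CGJ is accepted exactly where removing it would let canonical reordering change the order of the marks.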
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk followed up: > On 07/08/2003 07:27, Philippe Verdy wrote: > > >On Thursday, August 07, 2003 2:40 AM, Doug Ewell <[EMAIL PROTECTED]> wrote: > > > >>Kenneth Whistler wrote: > >> > >>>But I challenge you to find anything in the standard that > >>>*prohibits* such sequences from occurring. > >>> > >>> > >>I've learned that this question of "illegal" or "invalid" character > >>sequences is one of the main distinguishing factors between those who > >>truly understand Unicode and those who are still on the Road to > >>Enlightenment. > >> > >>... > >> > >If the term "valid" cannot be changed, then I suggest defining > >"conforming" for encoded text independantly of its validity (a > >"conforming text" would still need to use a "valid encoding"). > > > As a very quick thought, maybe what we need is not restrictions to the > Unicode standard but a set of rules for each language or group of > languages, defining exactly how Unicode characters should be used to > write the words etc of that language. Such definitions might be > independent of the actual Unicode standard. I emphatically agree with Peter on this. The impulse to get the Unicode Standard to head down the road to becoming the "spelling standard" for all languages of the world has to be constrained, simply because there is not the expertise or the bandwidth in the UTC to accomplish this and because it isn't the business of the UTC in the first place. This is the kind of task which *must* be distributed to the relevant stakeholders around the world, wherever they may be and however their relevant jurisdictions are defined and constituted. 
The establishment of orthographic rules for particular language in the context of the Unicode Standard means transferring the notion of what the printed conventions for that language are -- whatever they may be -- into a determination of exactly which Unicode characters are to be used to represent those conventions, including any constraints on cooccurrence with particular format control characters, and so on. The scope of the task of defining rendering rules in the Unicode Standard is generic to script behavior -- establishing the general rules of the road, as it were, for how the scripts behave in the encoding, so that people and implementations have a determinate sense of what order characters should be in, what it means for combining characters to "combine" with base characters, how format control characters may impact script rendering generically, and so on. But beyond that, one is getting into the realm of orthographic rules for particular languages or jurisdictions and the realm of typographic conventions for particular styles and regions. Making those determinations belongs to the stakeholders themselves: ministries, academies, associations, type designers, whoever. It is precisely because the developers of the Unicode Standard cannot foresee all possible orthographic conventions and uses to which the standard may be put in representing text that it is deliberately permissive: essentially any sequence of characters is "legal", and it is up to the users of the standard to determine, for them, what is a *sensible* sequence of characters for their multitudinous purposes. --Ken
Re: Questions on ZWNBS - for line initial holam plus alef
At 14:22 -0700 2003-08-08, Kenneth Whistler wrote: Philippe, you are tilting at windmills, here. There is no chance that the UTC is going to consider such a character, in my assessment, let alone give it the properties you suggest. Nor WG2 either. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
On Friday, August 08, 2003 9:54 PM, Peter Kirk <[EMAIL PROTECTED]> wrote: > On 08/08/2003 08:54, Philippe Verdy wrote: > > But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you > are suggesting other uses in which it really has zero width. Well, it > might have in a case like line initial holam which shifts on to a > following silent alef, but that is a rather special case.

I picked "SYMBOL" just to match the required property that would match other spacing variants of diacritics. The "ZERO WIDTH" is probably confusing, but it just marks the fact that it has no associated glyph and a null *minimum* width (which expands to the largest diacritic(s) with which it is combined). Its main role would be to fill the gap for missing spacing versions of existing diacritics.

What about the name "INVISIBLE CARRIER SYMBOL"? Note that I avoid any occurrence of the term "COMBINING" in the name, because there would be no requirement for this character to be followed by any diacritic(s), but the character would itself be handled as a symbol, in a way similar to the existing spacing diacritics (that are already of category Sk, and are conceptually a combination of the INVISIBLE CARRIER SYMBOL and diacritics, defined for compatibility purposes as an approximation of the sequence SPACE+diacritic).

It is worth noting that for now it is quite tricky to get an isolated diacritic without getting disappointing results (in some cases, the only way to do it is by using what Unicode describes as "defective" combining sequences, not illegal by themselves but whose rendering and interpretation is not guaranteed). Conversely, Unicode offers a standard way to force the appearance of the dotted circle for an isolated diacritic, a function that may not always be desirable, using a dotted circle symbol as the base character.
As someone corrected me in this list, SPACE+combiningdiacritic is admitted in the standard, but only as a compatibility equivalence for spacing diacritics, where in fact the isolated spacing diacritic is really a symbol (gc=Sk), unlike the base SPACE character used in the compatibility decomposition (which has gc=Zs), meaning that SPACE+combining diacritic does not have the same textual semantics as the effectively already encoded spacing diacritics (all of them seem to have property gc=Sk, and are not considered as Letters with gc=Lo, and that's why I thought the name "SYMBOL" was accurate). Also I tried to justify a possible codepoint assignment at U+20CF, where it would group more logically, given that the U+02XX block is already full and U+20XX is used for both symbols (including currencies) and a set of additional combining diacritics. Of course the U+20CF is just a suggestion, not something approved or documented. -- Philippe. Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
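The property difference described here can be inspected directly with Python's `unicodedata` module. A small sketch follows; the `describe` helper and the choice of sample characters are illustrative only.

```python
import unicodedata


def describe(ch):
    """Return (name, general category, decomposition mapping) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decomposition(ch),
    )


# Already-encoded spacing diacritics: category Sk, with a compatibility
# decomposition to SPACE + combining mark.
for ch in "\u00B4\u00A8\u00AF\u02D8":
    print(f"U+{ord(ch):04X}", describe(ch))

# The SPACE used in those decompositions is a separator (Zs), not a symbol.
print("U+0020", describe(" "))

# Compatibility normalization makes the mapping explicit.
print(unicodedata.normalize("NFKD", "\u00B4") == " \u0301")
```

So the spacing acute accent and the sequence SPACE + U+0301 are only compatibility equivalents, and the two carry different general categories, which is the mismatch the message points at.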
Re: Questions on ZWNBS - for line initial holam plus alef
On Sunday, August 10, 2003 11:53 AM, Kent Karlsson <[EMAIL PROTECTED]> wrote: > <> > > Spam from Philippe Verdy not tolerated: any unsolicited message will be > reported to his Internet service provider.

There was no spam in the message you deleted. This was a single post to the list, no cross-posting, no advertising, no product sold, no money claimed, no required action, no identity forged, and no deceptive subject line; the message was on topic... Reread the definition of spam: "bulk + unsolicited". Maybe you don't like my message, but reporting it to my ISP will not be successful for you, and in fact you risk more by doing so because my ISP could complain to yours. If you don't like my message, which was on topic, don't reply to it, delete it, ignore it, but don't make such false claims... Thanks.
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 12:26, Kenneth Whistler wrote: Peter Kirk wrote: I think this may be a "Peter mistake". I meant to refer to spacing diacritics. Sorry. It is certainly highly inappropriate for spacing diacritics to be considered word boundaries. Why? It is entirely dependent on the orthography and conventions involved. ... Well, agreed, there may be orthographic conventions in which a spacing diacritic is considered a word boundary or a break opportunity e.g. if used like a hyphen. But there are other mechanisms for forcing a word boundary where otherwise there would not be one. Are there to suppress a word boundary? Perhaps I need to encode to avoid the word boundary implication? Would this work? ... There is probably as much (or more) bad ASCII usage of spacing diacritics like `this', where a grave accent character is being misapplied to make a directional quotation mark, as there is actual, linguistically appropriate use of spacing diacritics. But this is an abuse of the spacing diacritic as punctuation. Proper, linguistically appropriate use of spacing diacritics should not be broken in order to support abuse. Or, if the standard wants to support such abuse, we can reserve for the abuse and define a new character XXX such that has the properties for the linguistically appropriate use. Also, everyone should consider carefully the status of UAX #29, Text Boundaries. 2 Conformance This is informative material. There are many different ways to divide text elements corresponding to grapheme clusters, words and sentences, and the Unicode Standard and this document do not restrict the ways in which implementations can do this. This specification is a default mechanism; more sophisticated engines can and should tailor it for particular locales or environments. ... The whole UAX is informative. ... Then let it be correctly informative and not full of misinformation. 
And let its default mechanism and recommendations be appropriate for the majority of uses, including such cases as lists of diacritics which may occur in any orthography.

Ken, it seems to me all the more clearly from looking at the latest batch of postings on this list that the mechanism defined by Unicode is fundamentally flawed. It works, but it creates a serious and needless complication for all kinds of other processes, including rendering and higher level processes. These processes cannot simply take a space as a space and process it as such. Every time they come across a space (which is very often!) they have to test whether it is followed by a combining character, and if it is they have to treat that space specially. This has created a serious problem for implementers, which is why they have produced non-conforming implementations - and we are not talking about small companies which have rushed into the market recently, we are talking about Microsoft, among others, which has been sponsoring Unicode from the start, I understand. Surely the UTC should not create difficulties for implementers and then just shout at them for getting things wrong. The UTC should try to produce a standard which is workable without unnecessary complications.

I agree that it works better to use NBSP here. There are fewer such problems, but they have not gone away entirely. And NBSP is more likely to be treated by implementers (in the absence of other guidelines from Unicode) as fixed width, not trimmed to the width needed for the diacritic.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 08:39, Doug Ewell wrote: Peter Kirk wrote: Thank you, Ken. Well, you make it sound as if the problems are minimal, and that version I can just about accept. But if Philippe is correct about what he says about UAX#29 and UAX#14, there are some more serious problems. It is certainly highly inappropriate for non-spacing diacritics to be considered word boundaries. Non-spacing diacritics had better not be word boundaries, otherwise a string like Québec (spelled with U+0301, as here) would be considered two words. I don't have time right now to look up the relevant properties and UAX's, but I sincerely hope this is just another "Philippe mistake" and not a general misinterpretation that anyone might make. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/ I think this may be a "Peter mistake". I meant to refer to spacing diacritics. Sorry. It is certainly highly inappropriate for spacing diacritics to be considered word boundaries. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
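Doug's Québec example is not hypothetical: a word matcher that ignores combining marks really does split the decomposed spelling. The sketch below uses Python's `re` module, whose `\w` does not match non-spacing marks, as the naive segmenter; `mark_aware_words` is a deliberately simplified alternative written for this illustration, not an implementation of UAX #29.

```python
import re
import unicodedata


def naive_words(text):
    """Word matching that knows nothing about combining marks."""
    return re.findall(r"\w+", text)


def mark_aware_words(text):
    """Keep combining marks (categories Mn, Mc, Me) attached to the word in progress."""
    words, current = [], ""
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalnum() or (is_mark and current):
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


quebec = "Que\u0301bec"  # spelled with U+0301 COMBINING ACUTE ACCENT

print(naive_words(quebec))
print(mark_aware_words(quebec))
print(naive_words(unicodedata.normalize("NFC", quebec)))
```

The naive matcher reports two words for the decomposed spelling and one for the precomposed spelling, which is exactly the inconsistency a default word-boundary rule has to avoid.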
RE: Questions on ZWNBS - for line initial holam plus alef
> For me the term "difficult" is inappropriate. In fact it is invalid for > interoperability (even though it is valid, not forbidden, for > ISO10646/Unicode, as a string fragment for intermediate processing), > and such sequence should not occur in actual documents, out of any > external processing context which defines its behavior. So the fact that you can't stick it into XML won't cause you many tears then. Good.
Re: Questions on ZWNBS - for line initial holam plus alef
From: "Jon Hanna" <[EMAIL PROTECTED]> > If this is > > different, then it is not XML but a derived language (for example HTML or > > SGML which are using more "relaxed" syntaxes). > > XML is derived from SGML, not the other way around. Still doesn't matter. I did not say that, despite the sentence may let you think so. Of course XML is born based on the ground of SGML and its HTML application, but now contains enough differences that it can no longer be considered an application of SGML, as it is both a subset and a superset of SGML (XML allows things forbidden in SGML, and forbids things that is completely valid in SGML). Additionally the DTD syntax profile used in XML is very limited face to SGML, and even this DTD syntax is not enough to represent in SGML XML features like namespaces (in XML, namespace prefixes can be freely substituted without requiring a new DTD, and are resolved as URIs instead of being part of the element or attribute names). Naming conventions in XML are based on two orthogonal dimensions, unlike in HTML and SGML which just use a single namespace. Finally DTDs are being deprecated in XML, because they cannot represent correctly the semantics of allowed attributes and even the allowed content models for schemas (so a XML document would validate with a DTD which would not if the schema was defined more precisely with a XSD schema: nearly all DTDs I have seen for XML, HTML and SGML contain important comments that cannot be represented in a parsable way. OK I used the term DOM instead of InfoSet but what I said was "DOM-like" data-representation (meaning InfoSet if this is what is used to represent the document). I won't discuss the case of element names or attribute names, which are by essence constrained by XML datatypes and do not represent any arbitrary Unicode text. 
But CDATA sections, attribute values (in non-validating parsers), and anonymous text elements are where the handling of initial/final whitespace, as well as sequences of whitespace, causes problems. This is clearly NOT markup, but plain text data, which may or may not be constrained by datatype facets, without even the need to specify a special xml:whitespace attribute in the markup of the document itself. As validating documents against their definitions is an optional part of a valid XML document, normalization of whitespace sequences occurs only if the schema is known. In the case of standardized schemas, like XHTML, it becomes mandatory, and there's no way to bypass this rule, as any client could assume and load the corresponding schema and preprocess the DOM-like data contained in the parsed document to create data which will not expose unnormalized whitespace. So the behavior of spaces must be assumed by authors, who cannot predict whether the XML parser will validate the parsed document or not. It is clearly not a rendering issue in fonts or XSLT processors or stylesheets.

I see absolutely no place where an XML author can create a valid XML schema instance that will work with parsers if the author wants to use SPACE+diacritics sequences in the document. The only way to bypass this behavior safely is to use unparsed entities to represent the leading SPACE, or the whole combining sequence. It is really a shame that there is no "XML-safe" base character in Unicode to represent leading spacing diacritics in actual documents (either in HTML, XML, SGML, or even for other rich-text formats, including TeX, RTF, or proprietary text formats like MS-Doc, or PDF, which already can and do use Unicode as their now-preferred encoding). Ignoring the huge number of applications that assign this role to spaces is then a critical flaw, as such rules cannot be changed easily.
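What a given parser actually does with a leading SPACE before a combining mark can be tested rather than assumed. The sketch below uses Python's standard `xml.etree.ElementTree`, which is a non-validating parser; it shows that parser's behaviour only and says nothing about validating pipelines or schema-driven whitespace facets. The element and attribute names are arbitrary.

```python
import xml.etree.ElementTree as ET

seagull = " \u033C"  # SPACE + COMBINING SEAGULL BELOW

# Character data: the parser hands back the text as written.
literal = ET.fromstring("<p>" + seagull + "</p>").text

# The same content written with numeric character references.
referenced = ET.fromstring("<p>&#x20;&#x33C;</p>").text

# Attribute values are different: XML attribute-value normalization
# replaces a literal tab (or line break) with a space.
attr = ET.fromstring('<p v="a\tb"/>').get("v")

print(repr(literal), repr(referenced), repr(attr))
```

With this parser the leading space survives in element content either way, while whitespace inside attribute values is normalized. Any later stage that trims or collapses whitespace would still need separate checking.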
Re: Questions on ZWNBS - for line initial holam plus alef
There are a number of incorrect statements. My comments below. - Original Message - From: "Peter Kirk" <[EMAIL PROTECTED]> To: "Kenneth Whistler" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Monday, August 11, 2003 16:28 Subject: Re: Questions on ZWNBS - for line initial holam plus alef > I was aware that there should not be a line break or word break between > the space and the NSM, although I suspect that many implementers will > not be aware of this, or at least will not test for it properly and so > treat any space as a word break and a line break opportunity. Hard to be clearer than what is written in the LineBreak UAX. (see below). > As I just > wrote, this requirement to test all spaces for following NSMs is a > significant inefficiency built into the standard. This is incorrect. Characters (not just spaces) only need to be checked for following NSMs in *those processes where that makes a difference*. And in most of those processes, like line-break, some lookahead is required anyway. To see, for example, whether there is a linebreak after a character X, in almost all cases I have to look at the character after X, and in many cases I have to look at more than one character. Notice, for example, that in the sequence "a" I have to look ahead to see if there is a ":", so that French punctuation works correctly. In practice, looking at a character past a space does not represent a significant performance issue. One is typically using a mechanism (like an augmented state machine) that maintains enough state that that is not an issue. > > But there is still a problem if there is considered by default to be a > word break and a line break opportunity AFTER the NSM. I would suggest, > as a candidate for a concrete proposal, that the default behaviour be > adjusted so that there is no word break or line break opportunity here > either. It helps if "concrete proposals" were actually, well, concrete. I see no problem with Line Break. 
(http://www.unicode.org/reports/tr14/#Algorithm): Space + NSM is treated as a unit, with behavior that is pretty consistent with a stand-alone accent like "^". To quote:

LB 7a In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP CM* in the same cases as one would break before an ID.
Treat SP CM* as if it were ID

If you want non-breaking behavior, you use NBSP + NSM; if you want breaking behavior, you use SP + NSM. The algorithm does that.

I also see no problem with word-break (http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the specific text. To quote:

Treat a grapheme cluster as if it were a single character: the first character of the cluster.
GC→FC (3)
...
Otherwise, break everywhere (including around ideographs).
Any÷Any (14)

None of the other rules are relevant. So what this does is that SPACE + NSM will break before the space and after the NSM (assuming there is only one). So it will behave like a symbol, such as "*", or ")", or "^".

The one area I do see that there may be an issue is with one that you didn't mention, http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM should not behave as Sp in the rules (8), (10), and (11). Even there, it will produce at most a minor oddity. If we wanted to change it, the *concrete* change would be to replace (4) by:

Treat a grapheme cluster as if it were a single character: the first character of the cluster, except if that first character is a space. In that case, change to Any.
SGC→FC (4a)
GC→FC (4b)

> > -- > Peter Kirk > [EMAIL PROTECTED] (personal) > [EMAIL PROTECTED] (work) > http://www.qaya.org/ > > >
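The effect of the quoted LB 7a can be sketched as a pre-pass that groups each base with its combining marks and then assigns one class to the unit. This is a simplified illustration, not the UAX #14 algorithm: `resolve_units` is a made-up name, the CM class is approximated by general categories Mn, Mc and Me, and every base other than SPACE and NBSP gets the placeholder class "XX" instead of a real line-break class lookup.

```python
import unicodedata

NBSP = "\u00A0"


def is_cm(ch):
    """Approximate the line-break class CM by general category (an assumption)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def resolve_units(text):
    """Group base + marks into units and classify them in the spirit of LB 7a."""
    units = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and is_cm(text[j]):
            j += 1
        base, cluster = text[i], text[i:j]
        if base == " ":
            # SP CM* is treated as ID; a bare space stays SP.
            cls = "ID" if j > i + 1 else "SP"
        elif base == NBSP:
            cls = "GL"  # NBSP + marks keeps the non-breaking (glue) behaviour
        elif is_cm(base):
            cls = "CM"  # defective sequence: marks with no base
        else:
            cls = "XX"  # placeholder for the base character's own class
        units.append((cluster, cls))
        i = j
    return units


print(resolve_units("a \u0302b"))  # space carrying a circumflex
print(resolve_units("a b"))        # ordinary space
print(resolve_units(NBSP + "\u0302"))
```

The lookahead needed is a single pass over the marks that follow each base, which is the point made in the message about the cost being small inside a state-machine style implementation.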
Re: Questions on ZWNBS - for line initial holam plus alef
From: "Jon Hanna" <[EMAIL PROTECTED]> > Lots of different things happen that affect the whitespace of an XML > document (whether a DOM tree is constructed or not, since it isn't the only > legal way to process an XML document).

Of course one is not required to build an actual DOM tree; however, XML, HTML and the like are now defined in terms of the DOM, where the text/xml syntax is just a serialization, which is the only place where whitespace normalization is defined (such normalization does not occur at the DOM level, and an XML document may be serialized with another concrete syntax than the one assigned to the "text/xml" MIME type, registered and documented by the W3C). When processing XML documents, the DOM part is the most important feature and it is logically separated from the concrete syntax used by text XML parsers. The W3C defines very strict rules to ensure that the DOM-equivalent data will be preserved, and whitespace normalization in XML documents serialized as "text/xml" is mandatory, or it is not a valid "text/xml" serialization. Processing a "text/xml" document in a way that would be incompatible with what a DOM tree builder would create is not conforming. If this is different, then it is not XML but a derived language (for example HTML or SGML, which use more "relaxed" syntaxes). In XML, whitespace normalization can be overridden using very precise rules within the parser only, but not in the resulting DOM tree, so it is important to understand each step that goes from the concrete text/xml syntax to the DOM tree or its equivalents (notably the successive steps required in parsed entities, named entities, ...). No XML application is required to use the "text/xml" MIME syntax, and such examples exist (for example the serialization and compression formats used by WAP, MMS, Nec's i-Mode, and SOAP).
If an application does not build the DOM tree, it is still required to perform namespace resolution and to resolve named entities according to the standard "text/xml" MIME rules formulated by the W3C reference, including all its facets, needed for interoperability of document properties independently of the character encoding used in the serialized document, or its syntactic representation. In my opinion, all XML-based languages should now be defined in terms of their DOM structure, and the XML application should be defined by a valid DTD, or better now with a standard XSD schema, that can be processed by validating parsers (parsers that absolutely need to create a DOM-like tree or flow of tokens with strictly defined properties, value sets and behavior). Without DOM interoperability, XML would be another imprecise language like HTML, with very little reusability due to naming conflicts. This is the most important benefit of XHTML (strictly based on XML) compared with HTML (4.x and before) and SGML (all versions), notably when a schema is explicitly specified for the document, and is loaded for validating purposes (some schemas are normative, like XHTML, and cannot be changed by authors).
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 18:46, Mark Davis wrote:

> There are a number of incorrect statements. My comments below.

Thanks for the clarifications. Sorry about the inaccuracies. On some maybe Philippe misled me, on others it is just my inadequate understanding.

> ... In practice, looking at a character past a space does not represent a significant performance issue. One is typically using a mechanism (like an augmented state machine) that maintains enough state that that is not an issue.

Understood. I hope Microsoft is listening.

> ... It helps if "concrete proposals" were actually, well, concrete.

Of course! But I need help to get rid of any inaccuracies before the concrete sets.

> I see no problem with Line Break (http://www.unicode.org/reports/tr14/#Algorithm): Space + NSM is treated as a unit, with behavior that is pretty consistent with a stand-alone accent like "^". To quote:
>
> LB 7a In all of the following rules, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP CM* in the same cases as one would break before an ID.
>
> Treat SP CM* as if it were ID
>
> If you want non-breaking behavior, you use NBSP + NSM; if you want breaking behavior, you use SP + NSM. The algorithm does that.

Thank you. I have looked at this. Well, the ideal for me would be a mechanism whereby base + NSM was AL, rather than ID or GL. The problem comes, if I understand correctly, with a sequence like SP XX CM* AL, where I want a break opportunity after SP but not before AL. If I use NBSP for XX, I get no break opportunity at all. If I use SP, I may get a break before AL. But I suppose SP SP CM* WJ AL would do what I want, perhaps also SP ZWSP NBSP CM* AL, as the break opportunity after ZWSP takes precedence over the no-break before NBSP.

> I also see no problem with word-break (http://www.unicode.org/reports/tr29/#Word_Boundaries). Look at the specific text. To quote:
>
> Treat a grapheme cluster as if it were a single character: the first character of the cluster. GC → FC (3)
> ...
> Otherwise, break everywhere (including around ideographs). Any ÷ Any (14)
>
> None of the other rules are relevant. So what this does is that SPACE + NSM will break before the space and after the NSM (assuming there is only one). So it will behave like a symbol, such as "*", or ")", or "^".

OK, no real problem then. In some circumstances it might be more appropriate for space + NSM to behave like a letter rather than a symbol, but I recognise that tailoring may be required for fine details.

> The one area I do see that there may be an issue is with one that you didn't mention, http://www.unicode.org/reports/tr29/#Sentence_Boundaries. Sp + NSM should not behave as Sp in the rules (8), (10), and (11). Even there, it will produce at most a minor oddity. If we wanted to change it, the *concrete* change would be to replace (4) by:
>
> Treat a grapheme cluster as if it were a single character: the first character of the cluster, except if that first character is a space. In that case, change to Any.
> SGC → FC (4a)
> GC → FC (4b)

Do you mean: "SGC → Any (4a)"? How should I go about making a concrete proposal for this?

Anyway, many thanks for your help. I think I am beginning to realise that this is a small problem which has been blown out of proportion by others. I still see the space + NSM choice as a rather poor initial design, but one which can be lived with.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit:

> So far so good, but when I get to an accent with no predefined spacing variant, I have a problem!

No you don't. If you want to say is the diacritic used to represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C at the beginning of the next line. If the seagull doesn't line up properly, you complain to the foundry or the implementor.

--
John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan
Is it not written, "That which is written, is written"?
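As a rough illustration of the encoding John describes, the sketch below groups each character with the combining marks that follow it, so that U+0020 U+033C comes out as one unit. The function name `simple_clusters` is invented for this example, and the rule it applies (attach every character of general category M to whatever precedes it) is a deliberate simplification of the grapheme cluster rules in UAX #29, not an implementation of them.

```python
import unicodedata


def simple_clusters(text):
    """Group each character with the combining marks (category M*) after it.

    Simplifying assumption: a mark attaches to any preceding character.
    A mark with nothing before it starts its own (defective) sequence.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # extend the current base + marks unit
        else:
            clusters.append(ch)  # start a new unit
    return clusters


# SPACE + COMBINING SEAGULL BELOW at the start of a line is one unit.
spacing_seagull = simple_clusters("\u0020\u033Cab")
# The same mark with no base in front of it stands alone.
bare_seagull = simple_clusters("\u033Cab")
```

In this sketch the space carrying the seagull is never returned as a separate item, which is the property the line-breaking and word-breaking discussion in this thread depends on.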
Re: Questions on ZWNBS - for line initial holam plus alef
On 12/08/2003 04:17, Jon Hanna wrote:

>> Thanks for the clarification. I probably misunderstood Jon's intention. But is there a problem if, for example, an application sees the string <line break, combining mark> and regularises it (wrongly!) to <space, combining mark>?
>
> Yes, I was not saying that it wouldn't be sensible to begin a line of text with a spacing diacritic (whether precomposed or created using space or NBSP). I was saying that it wouldn't be sensible to begin a line with a combining diacritic, since that combining diacritic would be combining with a newline character, which it's difficult to think of any possible sensible meaning for. Attribute normalisation would change the U+000A in that sequence to U+0020, which would arguably change the meaning, but changing the meaning of a meaningless construct isn't a problem to my mind.

Thanks for the clarification. I think the combining mark would not combine with the new line mark but would be a defective combining sequence. I might wish to do this simply because, according to UTR #14, this is the only way to get a combining mark to be treated as AL as I might wish. Probably not the best way to do this, but not illegal!

So it seems to me that this attribute normalisation is a problem. It is a problem for the higher level protocol, as it thinks it has created a space but in fact it has created a combining sequence which it must not treat as a space. A legal sequence at a lower level, even if meaningful, should not confuse the higher level. (Indeed I don't think the higher level ought to be confused even by illegal sequences at the lower level; it should be transparent as far as possible.) So the higher level protocol needs to know not only not to split a <space, combining mark> sequence but also not to create one where one was not present before. Perhaps it needs to insert a suitable separator (ZWNJ?) to ensure that when the space is created it is not combined with the combining mark.

So another example of needless complication created by the long-standing decision to permit space as a carrier for combining marks.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit:

> Philippe or anyone else, would it be "XML-safe" to use NBSP rather than SP as the base character for spacing diacritics in XML? Perhaps that's the answer here. I know there are still some issues of detail concerning the line breaking, but apart from that is there any other problem?

NBSP is not usable in attribute values other than those of type CDATA, but it is usable in character content. XML does not consider it whitespace (the only whitespace characters are LF, SPACE, TAB and marginally CR).

IMHO, it is best practice not to use anything in attribute values, certainly non-CDATA attribute values, that is in any way intended to handle fully general text: attribute values should be protocol strings.

--
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com http://www.ccil.org/~cowan
Yakka foob mog. Grug pubbawup zink wattoom gazork. Chumble spuzz.
    -- Calvin, giving Newton's First Law "in his own words"
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit:

> Sorry, I'm confused. Are you saying that the input processing will translate line breaks into spaces within attribute values, unless inserted as &#10;? Well, I suppose this is fair enough as it is up to the user not to enter garbage.

Yes, that is how attribute values work. The idea is that when you have a long string in an attribute value, you can introduce a line break for readability without its having any effect on processing, thus:

The line break gets turned into a space before the application sees it. Additionally, if you have a long list of tokens in an attribute value, thus:

the application does not have to deal with either the line break or the tab character specially, but sees simply a list of tokens separated by a single space.

> OK if this is clearly illegal, but this might restrict use of some languages in NMTOKEN. Would NBSP + combining be allowed?

No, it isn't. As I say, attribute values aren't meant to handle arbitrary natural-language human-readable text.

> There is some potential for real trouble here, if one process outputs an NMTOKEN starting with a combining character preceded by a separating space, or something else which is changed into a space, and another process takes the new space plus combining character as a unit and so doesn't recognise the separation.

If the second processor is XML-compliant, it will treat the space as a token separator, not as part of the token (as I say, spacing diacritics aren't allowed in tokens). If the XML document is printed or displayed in its raw form (that is, treating it as plain rather than structured text), you may see something a bit strange, but that will not affect the processing model.

> Any hackers and virus programmers reading this will soon start flooding the Internet with tokens beginning with combining characters in the hope of crashing implementations or finding back doors.

Very, very unlikely.

--
Winter: MIT,                  John Cowan
Keio, INRIA,                  [EMAIL PROTECTED]
Issue lots of Drafts.         http://www.ccil.org/~cowan
So much more to understand!   http://www.reutershealth.com
Might simplicity return?      (A "tanka", or extended haiku)
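The token-list behaviour described in this message can be sketched roughly as follows. This is an illustration written for this thread, not code from any XML processor: the helper names `normalize_non_cdata` and `split_tokens` are made up, and the two steps are a plain reading of the XML 1.0 attribute-value normalization rules for attributes whose declared type is not CDATA.

```python
import re


def normalize_non_cdata(value):
    """Normalize an attribute value the way a non-CDATA type requires.

    Step 1: TAB, LF and CR are each replaced by a single space.
    Step 2: runs of spaces collapse to one, and leading and trailing
    spaces are removed.
    """
    value = re.sub(r"[\t\n\r]", " ", value)
    return re.sub(r" +", " ", value).strip(" ")


def split_tokens(value):
    """Split a token-list value on U+0020 only, after normalization."""
    normalized = normalize_non_cdata(value)
    return normalized.split(" ") if normalized else []


# A token list broken over two lines; the second token begins with
# U+0301 COMBINING ACUTE ACCENT.
tokens = split_tokens("abc\n\u0301def")
```

Because the split works on code points rather than on grapheme clusters, the space stays a separator and the combining mark stays with the second token, which matches John's point that a conforming processor never treats the space as part of the token.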
Re: Questions on ZWNBS - for line initial holam plus alef
----- Original Message -----
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Peter Kirk" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, August 13, 2003 5:31 AM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> Peter Kirk scripsit:
>
> > Philippe or anyone else, would it be "XML-safe" to use NBSP rather than SP as the base character for spacing diacritics in XML? Perhaps that's the answer here. I know there are still some issues of detail concerning the line breaking, but apart from that is there any other problem?

For XML, using NBSP would be safe; however, it brings another caveat, as it introduces a non-break property, which may be an issue for rendering but normally not for text processing. This can be corrected by saying that NBSP+combining does not have a non-break property, and that a "don't break here" format control can be used if needed to specify the breaking behavior. In that case, this change in the properties of the combining sequence (in fact something that has not been specified until now) would be harmless (as the behavior was not clearly specified and was implementation-dependent), and we could say that SPACE+diacritics is deprecated in favor of NBSP+diacritics (which would NOT inherit the non-breaking behavior but would have its own properties).

> NBSP is not usable in attribute values other than those of type CDATA, but it is usable in character content. XML does not consider it whitespace (the only whitespace characters are LF, SPACE, TAB and marginally CR.

And NEL (for compatibility with EBCDIC systems).
RE: Questions on ZWNBS - for line initial holam plus alef
Suggested but not accepted. I am inherently suspicious when pressure is being exerted to decide complex and difficult questions in a hurry.

Jony

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk
> Sent: Wednesday, August 13, 2003 8:43 PM
> To: Philippe Verdy
> Cc: [EMAIL PROTECTED]
> Subject: Re: Questions on ZWNBS - for line initial holam plus alef
>
> On 13/08/2003 11:09, Philippe Verdy wrote:
>
> >... For this reason, defective combining sequences (combining characters without a leading base character) should be forbidden (invalid for XML).
>
> If there is even the remotest possibility of this happening, we need to know quickly! Defective combining sequences are legal Unicode and are now being suggested for use in Hebrew e.g. for holam male. But such a definition would be useless if XML restricts the texts it can represent to a subset of Unicode excluding such sequences.
>
> --
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 12/08/2003 07:05, John Cowan wrote:

> Very true. But what is this whitespace normalization?
>
> 1) Throughout the document, line-end characters and sequences are normalized to LF. Not relevant here.
>
> 2) In attribute values, LF, CR, and TAB characters are normalized to spaces. Not relevant here.

This would be relevant if it is legal for the character after LF, CR, and TAB to be a combining mark. Is this legal? In this case what was previously a defective (but legal) combining sequence would turn into a non-defective one, but the intended whitespace would be lost.

> 3) In attribute values that have a declared type other than CDATA, multiple spaces are compressed to a single space, and leading and trailing spaces are removed.
>
> After this is done, there can be no spaces in attributes of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types. In the types IDREFS and ENTITIES, spaces are used to separate individual tokens, none of which may begin with a combining character. In the remaining type, NMTOKENS, individual tokens may begin with a combining character, so it is possible that such a token, if not the first in the attribute, will be rendered in a peculiar way, with the combining character placed over the separating space. But that is a mere rendering glitch and in no way affects anything.

Not just a rendering glitch, I suspect. If the combining character is combined with the separating space, the space loses many of its separating functions, and perhaps keeps a confusing subset of them with all sorts of possibilities of error. At best tokens beginning with combining characters will be unusable. At worst they will crash the implementation (and count on someone trying deliberately to do that!). The only safe thing to do is to specify that space followed by a combining mark is NEVER considered to be a space and this combination is NEVER generated.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
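Point 2 of the quoted list, and the worry about a combining mark that follows the line break, can be observed with an ordinary XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` (which sits on top of expat); the element and attribute names `doc` and `note` and the helper `attribute_as_seen_by_application` are invented for the example, and the attribute is undeclared, so it is treated as CDATA.

```python
import xml.etree.ElementTree as ET


def attribute_as_seen_by_application(xml_text, name):
    """Parse a one-element document and return the named attribute value."""
    return ET.fromstring(xml_text).get(name)


# A literal line break followed by U+0301 COMBINING ACUTE ACCENT.
literal = '<doc note="first\n\u0301second"/>'
# The same line break written as a numeric character reference.
referenced = '<doc note="first&#10;\u0301second"/>'

seen_literal = attribute_as_seen_by_application(literal, "note")
seen_referenced = attribute_as_seen_by_application(referenced, "note")
```

With the literal line break the application receives U+0020 directly in front of the combining mark, which is the space plus combining mark sequence discussed above; with the character reference the line feed survives and the mark stays a defective sequence after it.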
Re: Questions on ZWNBS - for line initial holam plus alef
On 12/08/2003 09:00, Philippe Verdy wrote:

> It is really a shame that there is no "XML-safe" base character in Unicode to represent leading spacing diacritics in actual documents (whether in HTML, XML, SGML, or other rich-text formats, including TeX, RTF, proprietary text formats like MS-Doc, or PDF, which already can and do use Unicode as their now preferred encoding). Ignoring the extremely large number of applications that assign this role to spaces is then a critical caveat, as such rules cannot be changed easily.

Philippe or anyone else, would it be "XML-safe" to use NBSP rather than SP as the base character for spacing diacritics in XML? Perhaps that's the answer here. I know there are still some issues of detail concerning the line breaking, but apart from that is there any other problem?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 13/08/2003 14:07, Philippe Verdy wrote:

> I did not notice that the discussion about Hebrew holam male was related. In fact I don't know anything about the Hebrew alphabet so I could not understand the semantics discussed, and so did not note that it was a "defective" encoding (in terms of combining sequences).

Well, it wasn't very related - although the subject line here "line initial holam plus alef" reminds me that it is very near to where we started this thread.

> When using the term "forbidden", it was only related to possible security problems with XML, but the term was certainly too hasty. However, given that possible security and parsing issues do exist, the case of used to encode "holam-male" may be another argument to propose a neutral/invisible base character for combining characters. For the case of Hebrew, it then needs to have a "letter" behavior, but for the case of other isolated diacritics in Latin, Greek, Cyrillic, and probably also Hiragana and Katakana (voice marks), it would be better handled as a symbol. I suggested several semantics for this invisible character(s) in an earlier message:
>
> - An invisible symbol
> - An invisible LTR letter
> - An invisible RTL letter
>
> all of them having a *compatibility* decomposition (or NFKD form) as a SPACE like other existing spacing combining marks, but not being canonically equivalent to SPACE (to keep separately the legacy semantics, properties, behavior and known caveats unchanged and implementation/usage-dependent, as they are now with SPACE+NSM, which could then be discouraged in Unicode and strongly deprecated in SGML/HTML/XML)

My latest idea is to use RLM as in effect your "invisible RTL letter". So I would encode word or line initial holam male as . This is technically a defective combining sequence (is that correct?), as RLM is a format control character, but the RLM has the double effect of keeping the holam separate from any spaces which a higher level protocol might put there and ensuring RTL directionality. And I suppose the same technique would be legal with any combining character. But of course it would all be spoiled if XML were to forbid defective combining sequences, which fortunately is unlikely.

Actually I suppose you could use or for your spacing diacritics as the RLM or LRM would protect the space from combination with any previous space etc. Or perhaps . As RLM effectively disappears in searches etc, in effect you have your compatibility decomposition.

I note that there is no line break opportunity in . But is there one after the space in ? If so, has a third advantage, that it gives the right line break opportunity when this sequence is word initial, which it wouldn't do without the RLM.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
RE: Questions on ZWNBS - for line initial holam plus alef
> I do agree: an XML document could require the use at some place of a given attribute or element. If this attribute name follows the element name after a line break, which gets changed into a space during parsing, forcing XML parsers to treat SPACE+combining as an unbreakable grapheme cluster acting like a letter would have the effect of creating a new element name which may violate the element name identity. Now suppose that the attribute name contains a colon, you have created a custom namespace name, under which you can add any element you like, even if this was forbidden by the content-model of the reference schema.

1. SPACE is treated "blindly" as a SPACE by XML. String + space + combining + string would not be treated as a single token, no matter how that space was introduced. That's what you were complaining about in the first place (as far as I can make out).

2. While nmtokens can begin with a combining character, names cannot, nor can they contain spaces.

3. This would in no way change the content-model. So even if the above two points didn't hold they would only sneak the document past something which performed validation before parsing(!), and where the content-model was already pretty loose (so it didn't complain about the unrecognised attribute). You've just discovered a way to disguise one document that isn't well-formed as a different document that isn't well-formed. l33t!

> So this would invalidate existing documents, or create holes allowing insertion of arbitrary XML content, if the XML application is not validating extremely strictly the element names (the pair namespace+name) and exclude completely from processing any unrecognized element (including all its content and attributes).

This argument is not on friendly terms with the concept of causality.

> This would be a breach in the content model which may have been validated and tested for security in another layer of the document encoding process (notably when XML documents are created from templates, such as XSL processors, or custom C source using simple template substitution).

Testing validity without testing well-formedness is not possible.

> So for me the sequence SPACE+combining should not be acceptable as a valid grapheme cluster within element names or attribute names,

As it already isn't.

> and thus would need to be excluded from NMTOKEN. The correct way to do it is to consider it NOT A LETTER, but a symbol (Sk), exactly like other spacing diacritics, which are already invalid in NMTOKEN.

Wait a second. That was my justification for why the fact that space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered a failure on the part of XML to allow for freedom of choice with the strings used for NMTOKENs. Now you actually want to introduce this (already existent) feature.

> There still remains the unresolved question of grapheme clusters that could span the starting "<" or ending ">" or "/>" of tags, or the leading "&" of an entity reference.

No, there isn't. What goes before <, >, / or & isn't a problem, since those are all non-combining characters and each starts a new unit for any sort of processing that treats more than one codepoint as a unit. What goes after < or & has to be a name (not an nmtoken) and as such is already prohibited from beginning with a combiner. What goes after > is already dealt with by the Charmod, and even if you ignore Charmod, apart from the possibility of normalisation turning the sequence U+003E, U+0338 into U+226F (a possibility that is well noted) it still isn't going to hurt.
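The normalisation hazard mentioned at the end of this message is easy to reproduce. The sketch below uses Python's standard `unicodedata` module; the helper name `nfc` and the sample markup string are made up for illustration. Per the Unicode Character Database, ">" followed by U+0338 COMBINING LONG SOLIDUS OVERLAY composes to U+226F NOT GREATER-THAN, and "<" followed by U+0338 composes to U+226E NOT LESS-THAN.

```python
import unicodedata


def nfc(text):
    """Return the NFC (canonical composition) form of a string."""
    return unicodedata.normalize("NFC", text)


# Element content that begins with U+0338, directly after a start tag.
markup = "<a>\u0338</a>"

# Normalizing the serialized document as plain text swallows the ">"
# that closed the start tag.
damaged = nfc(markup)
```

This is only a hazard if serialized markup is normalized as plain text, without regard to where the tags end; it says nothing about what a conforming XML parser does with the unnormalized input.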
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit:

> > 2) In attribute values, LF, CR, and TAB characters are normalized to spaces. Not relevant here.
>
> This would be relevant if it is legal for the character after LF, CR, and TAB to be a combining mark. Is this legal? In this case what was previously a defective (but legal) combining sequence would turn into a non-defective one, but the intended whitespace would be lost.

The point is that there is no such thing as an *intended* line break in an attribute value; it will *always* be translated to a space before the application sees it. (More exactly, line-break characters can be inserted into attribute values, but only with the use of a numeric character reference such as "&#10;".)

> Not just a rendering glitch, I suspect. If the combining character is combined with the separating space, the space loses many of its separating functions, and perhaps keeps a confusing subset of them with all sorts of possibilities of error.

The space(s) will be used to separate individual tokens at processing time. No spacing diacritic (either single-character or space+combining) is permitted in a NMTOKEN.

> At best tokens beginning with combining characters will be unusable. At worst they will crash the implementation (and count on someone trying deliberately to do that!).

In effect, the combining character will constitute a defective combining sequence at the beginning of the individual token.

Stepping away from the letter of the standard for a moment, there is no real reason to begin a NMTOKEN with a combining character. It is only allowed as a result of the miscegenation of SGML concepts with Unicode ones.

In SGML's original design of tokens, they consisted of letters and digits (and a few punctuation marks, which functioned as letters). There were four kinds: a NUMBER could contain only digits, a NAME could not begin with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no restrictions. ID and IDREF had the same syntax as NAME with additional semantics. Later, the categories "letter" and "digit" were generalized, by redefining the concrete syntax, to be whatever you wanted, and were renamed "name-start" and "name" characters (technically, a name character was a letter *or* a digit).

When SGML was simplified to produce XML, only NMTOKEN, the most general type of token, was kept. However, in order to keep the semantics of "letter" and "digit" in the Unicode world, "letter" was extended to be any letter and "digit" to be any digit *or* combining character. That worked well for ID and IDREF, since treating combining characters as part of "digit" prevented them from appearing first, as was only sensible.

Unfortunately, NMTOKENs, since there were no restrictions, became able to begin with a combining character, though that made no real sense. To write in a restriction would make it impossible to specify XML's concrete syntax in SGML terms, which did not allow for three different classes of characters within tokens. So we wound up with a basically useless capability that if used will only cause trouble.

--
John Cowan [EMAIL PROTECTED] www.reutershealth.com ccil.org/~cowan
Dievas dave dantis; Dievas duos duonos --Lithuanian proverb
Deus dedit dentes; deus dabit panem --Latin version thereof
Deity donated dentition; deity'll donate doughnuts --English version by Muke Tever
God gave gums; God'll give granary --Version by Mat McVeagh
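The two-class scheme John describes (name-start characters versus name characters, with combining characters placed in the second class) can be sketched as below. This is an approximation made for illustration only: XML 1.0 defines these classes by explicit code point ranges, whereas the sketch substitutes Unicode general categories, leaves out the extender characters, and uses invented function names throughout.

```python
import unicodedata


def is_name_start(ch):
    """Approximate 'letter' class: any Unicode letter, plus '_' and ':'."""
    return ch in "_:" or unicodedata.category(ch).startswith("L")


def is_name_char(ch):
    """Approximate 'digit' class added on top of name-start characters.

    Decimal digits and combining marks (Mn, Mc, Me) fall in this class,
    together with '.' and '-'.
    """
    category = unicodedata.category(ch)
    return (
        is_name_start(ch)
        or ch in ".-"
        or category == "Nd"
        or category in ("Mn", "Mc", "Me")
    )


def is_nmtoken(s):
    """An Nmtoken is one or more name characters, with no rule for the first."""
    return len(s) > 0 and all(is_name_char(c) for c in s)


def is_name(s):
    """A Name is an Nmtoken whose first character is a name-start character."""
    return is_nmtoken(s) and is_name_start(s[0])
```

Under these assumptions a string such as U+0301 followed by "abc" passes `is_nmtoken` but fails `is_name`, while anything containing a space fails both, which is the asymmetry the message above traces back to SGML.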
Re: Questions on ZWNBS - for line initial holam plus alef
Some of this seems to be in reference to an earlier contention that Text Boundaries (inc. Lines) break between the space and the non-spacing mark. I think this was attributed to Philippe. [This may not be true: I don't actually read his email, because the information content per line falls below my email threshold; not to say that there may not be information there, but I cannot afford to take the time to find out -- sadly, one of my character flaws.]

All of the text boundaries preserve grapheme cluster boundaries, which never separate a base character (including space and NBSP) from a following NSM. In addition, each of the boundary types above grapheme clusters makes some statement about the behavior of the grapheme cluster. For example, with line boundaries a SPACE + NSM has a special behavior. With the others, the behavior is the same as the base character.

As Ken points out, in any event these are default boundaries, and can be tailored. That being said, if the normal behavior of the default can be improved, and someone has a concrete proposal for doing so, then it can be considered.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, August 11, 2003 12:26
Subject: Re: Questions on ZWNBS - for line initial holam plus alef

> Peter Kirk wrote:
>
> > I think this may be a "Peter mistake". I meant to refer to spacing diacritics. Sorry.
> >
> > It is certainly highly inappropriate for spacing diacritics to be considered word boundaries.
>
> Why? It is entirely dependent on the orthography and conventions involved. There is probably as much (or more) bad ASCII usage of spacing diacritics like `this', where a grave accent character is being misapplied to make a directional quotation mark, as there is actual, linguistically appropriate use of spacing diacritics.
>
> Also, everyone should consider carefully the status of UAX #29, Text Boundaries.
>
> 2 Conformance
>
> This is informative material. There are many different ways to divide text elements corresponding to grapheme clusters, words and sentences, and the Unicode Standard and this document do not restrict the ways in which implementations can do this.
>
> This specification is a default mechanism; more sophisticated engines can and should tailor it for particular locales or environments. ...
>
> The whole UAX is informative. It is a here's-how-you-can-approach-the-problem implementation guide with some suggestions for rules and classes.
>
> *If* you are working with an orthography that uses one or more spacing diacritics, and
> *If* those spacing diacritics need to be represented by sequences,
>
> then you are in the situation where your implementation of text boundaries should take sequences explicitly into account, so as to result in expected behavior for that orthography.
>
> Everyone has had experiences with their platform UI producing bad results for text boundaries. The Solaris platform I am writing this on right now, for example, implements a double-click word selection that treats the string "`this'," above, including the grave accent, the apostrophe, and the comma, as a "word". Is that right or wrong? Well, it depends on what you are trying to do, I expect.
>
> But even the most sophisticated platform implementers can only do so much with processes like default word selection. It is bound to be wrong for one purpose or another and for one orthography or another. Ultimately you need to have tailored processes that can be orthography-specific if you want to get best results.
>
> --Ken
Re: Questions on ZWNBS - for line initial holam plus alef
On 12/08/2003 20:28, John Cowan wrote:

> Peter Kirk scripsit:
>
> > > 2) In attribute values, LF, CR, and TAB characters are normalized to spaces. Not relevant here.
> >
> > This would be relevant if it is legal for the character after LF, CR, and TAB to be a combining mark. Is this legal? In this case what was previously a defective (but legal) combining sequence would turn into a non-defective one, but the intended whitespace would be lost.
>
> The point is that there is no such thing as an *intended* line break in an attribute value; it will *always* be translated to a space before the application sees it. (More exactly, line-break characters can be inserted into attribute values, but only with the use of a numeric character reference such as "&#10;".)

Sorry, I'm confused. Are you saying that the input processing will translate line breaks into spaces within attribute values, unless inserted as &#10;? Well, I suppose this is fair enough as it is up to the user not to enter garbage.

> > Not just a rendering glitch, I suspect. If the combining character is combined with the separating space, the space loses many of its separating functions, and perhaps keeps a confusing subset of them with all sorts of possibilities of error.
>
> The space(s) will be used to separate individual tokens at processing time. No spacing diacritic (either single-character or space+combining) is permitted in a NMTOKEN.

OK if this is clearly illegal, but this might restrict use of some languages in NMTOKEN. Would NBSP + combining be allowed?

> > At best tokens beginning with combining characters will be unusable. At worst they will crash the implementation (and count on someone trying deliberately to do that!).
>
> In effect, the combining character will constitute a defective combining sequence at the beginning of the individual token.
>
> Stepping away from the letter of the standard for a moment, there is no real reason to begin a NMTOKEN with a combining character. It is only allowed as a result of the miscegenation of SGML concepts with Unicode ones.
>
> In SGML's original design of tokens, they consisted of letters and digits (and a few punctuation marks, which functioned as letters). There were four kinds: a NUMBER could contain only digits, a NAME could not begin with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no restrictions. ID and IDREF had the same syntax as NAME with additional semantics. Later, the categories "letter" and "digit" were generalized, by redefining the concrete syntax, to be whatever you wanted, and were renamed "name-start" and "name" characters (technically, a name character was a letter *or* a digit).
>
> When SGML was simplified to produce XML, only NMTOKEN, the most general type of token, was kept. However, in order to keep the semantics of "letter" and "digit" in the Unicode world, "letter" was extended to be any letter and "digit" to be any digit *or* combining character. That worked well for ID and IDREF, since treating combining characters as part of "digit" prevented them from appearing first, as was only sensible.
>
> Unfortunately, NMTOKENs, since there were no restrictions, became able to begin with a combining character, though that made no real sense. To write in a restriction would make it impossible to specify XML's concrete syntax in SGML terms, which did not allow for three different classes of characters within tokens. So we wound up with a basically useless capability that if used will only cause trouble.

There is some potential for real trouble here, if one process outputs an NMTOKEN starting with a combining character preceded by a separating space, or something else which is changed into a space, and another process takes the new space plus combining character as a unit and so doesn't recognise the separation. Any hackers and virus programmers reading this will soon start flooding the Internet with tokens beginning with combining characters in the hope of crashing implementations or finding back doors.

Of course this wouldn't have been a problem if Unicode had never defined space plus combining character as legal and meaningful. But this is not my problem!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 13/08/2003 11:09, Philippe Verdy wrote: ... For this reason, defective combining sequences (combining characters without a leading base character) should be forbidden (invalid for XML). If there is even the remotest possibility of this happening, we need to know quickly! Defective combining sequences are legal Unicode and are now being suggested for use in Hebrew e.g. for holam male. But such a definition would be useless if XML restricts the texts it can represent to a subset of Unicode excluding such sequences. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit: > These processes cannot > simply take a space as a space and process it as such. Every time they > come across a space (which is very often!) they have to test whether it > is followed by a combining character, and if it is they have to treat > that space specially. This must be done for all other base characters as well. > This has created a serious problem for > implementers, which is why they have produced non-conforming > implementations - and we are not talking about small companies which > have rushed into the market recently, we are talking about Microsoft, > among others, which has been sponsoring Unicode from the start, I > understand. You don't have (nor do I) the vaguest idea why Microsoft produced this particular nonconforming implementation, or whether they consider it a bug or not. > Surely the UTC should not create difficulties for > implementers and then just shout at them for getting things wrong. The > UTC should try to produce a standard which is workable without > unnecessary complications. This is sheer conjecture. -- John Cowan www.ccil.org/~cowan [EMAIL PROTECTED] www.reutershealth.com [P]olice in many lands are now complaining that local arrestees are insisting on having their Miranda rights read to them, just like perps in American TV cop shows. When it's explained to them that they are in a different country, where those rights do not exist, they become outraged. --Neal Stephenson
RE: Questions on ZWNBS - for line initial holam plus alef
> Of course one is not required to build an actual DOM tree, > however XML, HTML > and alike is now defined in terms of the DOM, where the text/xml syntax is > just a serialization, which is the only place where whitespaces > normalization is defined (such normalization does not occur at the DOM > level, and a XML document may be serialized with another concrete syntax > than the one assigned to the "text/xml" MIME type, registered and > documented > by the W3C. No. "XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure." (XML, Introduction. XML1.1 will not change that). *XML applications* can be defined in terms of the DOM, but they can also be defined in terms of the XML Information Set, XPath, by extending one of the above, or through some other model (e.g. in terms of SAX events). Many applications are defined in terms of the Information Set or XPath. None of this actually matters here of course, because there is still no problem with the use of space and NBSP with combining characters unless you use that in names or nmtokens. and whitespace normalization in XML documents > serialized as "text/xml" is mandatory, or it is not a valid "text/xml" > serialization. But it doesn't matter. > Processing a "text/xml" document in a way that would be incompatible with > what a DOM tree builder would create is not conforming. Doesn't matter. If this is > different, then it is not XML but a derived language (for example HTML or > SGML which are using more "relaxed" syntaxes). XML is derived from SGML, not the other way around. Still doesn't matter. 
> If an application does not build the DOM tree, it is still required to > perform namespace resolution Namespace resolution, do you mean complying with Namespaces in XML? XML parsers aren't required to do that, and it still doesn't matter. > Without DOM interoperability, XML would be another imprecise language like > HTML, HTML is pretty precise; most of the imprecision is quite possible in XML as well. Comparing HTML with XML is a pretty fruitless exercise beyond "oh look this one has pointy brackets as well". Still doesn't matter. > with very little reusability due to naming conflicts. Naming conflicts are perfectly possible with XML applications that don't use Namespaces, which they are perfectly within the spec in doing, and where combining diacritics still don't matter.
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk <[EMAIL PROTECTED]> writes: > On 08/08/2003 08:54, Philippe Verdy wrote: > > > ... Could there be another codepoint assigned that has > > > >these properties: > > > >20CF;ZERO WIDTH SYMBOL;Sk;0;ON; 0020N; > > [...] > But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you > are suggesting other uses in which it really has zero width. Well, it > might have in a case like line initial holam which shifts on to a > following silent alef, but that is a rather special case. What would be a better name? ACCENT CARRIER? /Thomas -- Thomas Widmann, MA +44 141 419 9872 Glasgow, Scotland, EU [EMAIL PROTECTED] http://www.widmann.uklinux.net
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit: > On 13/08/2003 11:09, Philippe Verdy wrote: > > >... For this reason, defective > >combining sequences (combining characters without a leading base > >character) should be forbidden (invalid for XML). > > > > > If there is even the remotest possibility of this happening, we need to > know quickly! As a member of the XML Core Working Group of the W3C, I can assure you that there is not even the remotest possibility of it. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan Is it not written, "That which is written, is written"?
Re: Questions on ZWNBS - for line initial holam plus alef
John Hudson scripsit: > Again, you are working on the assumption that U+0020 is represented by an > actual painted glyph and not e.g. by a horizontal offset. In my experience, > the more sophisticated the application -- e.g. a professional page layout > application rather than a word processor -- the more likely it is that > white space characters will not be consistently treated as painted glyphs. I'm working on the assumption that applications that claim to conform to Unicode actually do conform to it. If they don't, and it's not the font foundry's fault, then complain, complain, complain! It's not Unicode that's broken, it's the implementation. > I've heard convincing arguments from the engineers of such applications > that the space character shouldn't be a glyph in the font at all, but > should simply be a numeric value telling applications how large an offset > to apply. Since most fonts do not contain glyphs for variant white space > characters such as thin and hair spaces, applications typically treat these > as offset values. Painting a glyph is only one way to represent a character. Nothing in the Unicode Standard says those oddball spaces have to work "correctly" with combining diacritics. -- John Cowan http://www.ccil.org/~cowan http://www.reutershealth.com [EMAIL PROTECTED] A mosquito cried out in his pain, / "A chemist has poisoned my brain!" / The cause of his sorrow / Was para-dichloro- / Diphenyltrichloroethane. (aka DDT)
Re: Questions on ZWNBS - for line initial holam plus alef
On Monday, August 11, 2003 12:27 AM, Kenneth Whistler <[EMAIL PROTECTED]> wrote: > A point I keep trying to make, but which often gets overlooked > by people trying to code Unicode mechanisms for dealing with > edge cases, is that the design goal of the Unicode Standard is, > and always has been, to represent *plain text content*. It > cannot, and should not, IMO, deal with requirements for > representing arbitrarily fine distinctions of typographical > detail in all manuscripts and other documents in all writing > systems of the world. Spacing diacritics are not "on the edge" of the standard, when they are already given a full block and handled there as symbols (not as letters, as suggested in some parts of the UAXs), with their own identity independent of their actual glyphic representation. I am not discussing the typesetting of these grapheme clusters but rather the textual semantics of such combining sequences with an invisible base character, which affect all their properties and are not fully described in the various standard annexes. Due to the huge legacy use of SPACE+diacritics in existing text, and the already normative parts of some standard annexes, it will be hard to correct the behavior or change the text of these annexes. That is where a new base character, better suited than SPACE, could help resolve the ambiguities cleanly. -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: Questions on ZWNBS - for line initial holam plus alef
- Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]> Cc: "Peter Kirk" <[EMAIL PROTECTED]>; "Kenneth Whistler" <[EMAIL PROTECTED]> Sent: Monday, August 11, 2003 5:39 PM Subject: Re: Questions on ZWNBS - for line initial holam plus alef > Peter Kirk wrote: > > > Thank you, Ken. Well, you make it sound as if the problems are > > minimal, and that version I can just about accept. But if Philippe is > > correct about what he says about UAX#29 and UAX#14, there are some > > more serious problems. It is certainly highly inappropriate for > > non-spacing diacritics to be considered word boundaries. > > Non-spacing diacritics had better not be word boundaries, otherwise a > string like Québec (spelled with U+0301, as here) would be considered > two words. I don't have time right now to look up the relevant > properties and UAX's, but I sincerely hope this is just another > "Philippe mistake" and not a general misinterpretation that anyone might > make. Not a mistake on my part, sorry; on yours, yes: Peter Kirk probably meant to speak about *spacing* diacritics (when coded with SPACE+NSM). There is no such *spacing* character in "Québec". Don't accuse me of something I did not say. And please be more tolerant of what is an obvious typo in the message from Peter Kirk. Instead of just flaming, could you read the message more carefully, accept errors and correct them, rather than sending such unconstructive replies? Thanks.
Re: Questions on ZWNBS - for line initial holam plus alef
On 13/08/2003 04:44, Jon Hanna wrote: No, the safe thing to do (and the thing that is done) is to treat the space as a space ignoring the fact that the NMTOKEN contains a combining character, this is even safer than your suggestion since it can't mis-identify the combining properties of a character. OK, it's safe, but it is a misuse of Unicode. As space plus combining character is a unit in Unicode, it should be treated as a unit by higher level protocols. If higher level protocols are allowed to do arbitrary things within Unicode units, there is no end to the possible confusion. See for example, from Unicode 4.0 chapter 3: C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 06:59, Jon Hanna wrote: There are only two theoretical problems that I can see here, the first is that a whitespace character other than space gets converted to space by attribute value normalisation, and that this changes the meaning of the text in some way. This could only occur if the combining character were the first character in a line of text, which is quite a nonsensical construct to begin with. Not at all! Imagine a tutorial on a language, which might well list the accents used, in a format like this: ` (grave accent) is used with a, e and o, and indicates more open pronunciation ^ (circumflex accent) is used with any vowel, and indicates lengthening So far so good, but when I get to an accent with no predefined spacing variant, I have a problem! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
- Original Message - From: "Peter Kirk" <[EMAIL PROTECTED]> To: "Jon Hanna" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Wednesday, August 13, 2003 3:05 PM Subject: Re: Questions on ZWNBS - for line initial holam plus alef > On 13/08/2003 04:44, Jon Hanna wrote: > > >No, the safe thing to do (and the thing that is done) is to treat the space > >as a space ignoring the fact that the NMTOKEN contains a combining > >character, this is even safer than your suggestion since it can't > >mis-identify the combining properties of a character. > > > > > OK, it's safe, but it is a misuse of Unicode. As space plus combining > character is a unit in Unicode, it should be treated as a unit by higher > level protocols. If higher level protocols are allowed to do arbitrary > things within Unicode units, there is no end to the possible confusion. > See for example, from Unicode 4.0 chapter 3: > > C7 A process shall interpret a coded character representation according > to the character > semantics established by this standard, if that process does interpret > that coded character > representation. OK, but XML inherits its behavior from SGML and you won't change it. The only way to bypass this would be to use entity references to encode the base space needed by the Unicode convention, so this is related to what Unicode defines as a higher level protocol, needed here to bypass the limitations of basic text. However, it still creates a problem within CDATA sections, which are not supposed to contain entity references. One would then need to combine the XML CDATA escaping mechanism with another escaping system specific to CDATA sections (which are formally anonymous text elements and equivalent to them).
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 11:45, Kenneth Whistler wrote: Peter Kirk responded: On 11/08/2003 06:59, Jon Hanna wrote: There are only two theoretical problems that I can see here, the first is that a whitespace character other than space gets converted to space by attribute value normalisation, and that this changes the meaning of the text in some way. This could only occur if the combining character were the first character in a line of text, which is quite a nonsensical construct to begin with. Not at all! Imagine a tutorial on a language, which might well list the accents used, in a format like this: ` (grave accent) is used with a, e and o, and indicates more open pronunciation ^ (circumflex accent) is used with any vowel, and indicates lengthening We're going round and round in circles here. Those are not lines starting with a combining character, but lines starting with a *spacing diacritic*. So far so good, but when I get to an accent with no predefined spacing variant, I have a problem! Either you have the spacing diacritic encoded (as in those instances), or the standard indicates that you can represent one by applying the nonspacing, *combining* mark to SPACE. In those instances, the line still doesn't start with a combining mark -- it starts with a SPACE character serving as the base character for the combining mark. --Ken Thanks for the clarification. I probably misunderstood Jon's intention. But is there a problem if, for example, an application sees the string and regularises it (wrongly!) to ? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
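Ken's point above, that an isolated accent is a SPACE serving as the base for a nonspacing mark, can be checked against the character database. The following is only a minimal Python sketch using the standard `unicodedata` module; the helper name `isolated_mark` is invented for illustration and is not part of any standard or library.

```python
import unicodedata


def isolated_mark(mark, base=" "):
    """Return base + mark, the convention described above for showing a
    combining mark on its own. `isolated_mark` is a hypothetical helper."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark")
    return base + mark


# U+033C COMBINING SEAGULL BELOW has no predefined spacing variant,
# so it is carried by a SPACE base character.
seagull = isolated_mark("\u033C")

# The line then starts with the base character, not with the mark.
starts_with_mark = unicodedata.category(seagull[0]).startswith("M")

# U+00B4 ACUTE ACCENT (a predefined spacing diacritic) has a compatibility
# decomposition to SPACE + U+0301, i.e. the same convention.
acute_compat = unicodedata.normalize("NFKD", "\u00B4")
```

Here `starts_with_mark` is `False`, and `acute_compat` is SPACE followed by U+0301, which is how the existing spacing diacritics relate to the SPACE-plus-mark convention.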
Re: Questions on ZWNBS - for line initial holam plus alef
From: "Kenneth Whistler" <[EMAIL PROTECTED]> > It is perfectly reasonable, as I see it, to consider the > in a sequence to be: > a. significant > b. part of the characters in a document that are not markup > (at least in the cases we are talking about, since the > problem is not about defining Nmtokens for markup in > Biblical Hebrew, but rather the representation of the > Biblical Hebrew document content itself) > > So I *still* don't see the problem you are on about, and even > if there was one, the xml:space attribute could be used to > require preservation of a particular space. Maybe you are forgetting that in XML and HTML, attributes (including "special" attributes like "xml:space") can have default values, and in fact they have such values set in DTDs or schemas by normative XML applications like XHTML. Authors are not supposed to modify normative schemas or DTDs, and so use elements with their default attributes. This is the case of XHTML as an application of XML, and HTML as an application of SGML (neither HTML nor SGML parsers will interpret the xml:space attribute, and XML parsers will handle it only if they are validating documents with their DTD or schema).
Re: Questions on ZWNBS - for line initial holam plus alef
Philippe Verdy scripsit: > Of course one is not required to build an actual DOM tree, however XML, HTML > and alike is now defined in terms of the DOM, where the text/xml syntax is > just a serialization, This is absolutely false. XML is defined by the XML Recommendation, which is entirely syntactic. As a matter of convenience, many other XML recommendations use the XML Infoset, which is by no means the same as the DOM. The DOM is an abstract API for programmatic access to the content of XML documents. > which is the only place where whitespaces > normalization is defined (such normalization does not occur at the DOM > level, and a XML document may be serialized with another concrete syntax > than the one assigned to the "text/xml" MIME type, registered and documented > by the W3C. "May" be, yes. You can serialize it in ASN.1 if you want to. That doesn't make ASN.1 an instance of XML. > [W]hitespace normalization in XML documents > serialized as "text/xml" is mandatory, or it is not a valid "text/xml" > serialization. Very true. But what is this whitespace normalization? 1) Throughout the document, line-end characters and sequences are normalized to LF. Not relevant here. 2) In attribute values, LF, CR, and TAB characters are normalized to spaces. Not relevant here. 3) In attribute values that have a declared type other than CDATA, multiple spaces are compressed to a single space, and leading and trailing spaces are removed. After this is done, there can be no spaces in attributes of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types. In the types IDREFS and ENTITIES, spaces are used to separate individual tokens, none of which may begin with a combining character. In the remaining type, NMTOKENS, individual tokens may begin with a combining character, so it is possible that such a token, if not the first in the attribute, will be rendered in a peculiar way, with the combining character placed over the separating space. 
But that is a mere rendering glitch and in no way affects anything. > No XML application is required to use the "text/xml" > MIME syntax, and there exists such examples (for example the serialization > and compression formats used by WAP, MMS, Nec's i-Mode, and SOAP). That is not the definition of "XML application" given in the XML Recommendation, which is the sole authority on the subject. You can invent your own definitions if you like, but you need not expect to be listened to. > If an application does not build the DOM tree, it is still required to > perform namespace resolution No XML application is required to perform "namespace resolution", whatever that may be. > to solve named entities according to the > standard "text/xml" MIME rules formulated by the W3C reference, Only certain named entities *must* be resolved: specifically, internal entities that are defined in the internal subset. > In my opinion, all XML-based languages should > be defined now in terms of its DOM structure, and the XML application should > be defined by a valid DTD, or better now with a now standard XSD schema, that > can be processed by validating parsers (parsers that absolutely need to > create a DOM-like tree or flow of tokens with strictly defined properties, > value sets and behavior.) In your *opinion*. > Without DOM interoperability, XML would be another imprecise language like > HTML, with very little reusability due to naming conflicts. Nonsense. *plonk* -- John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com http://www.ccil.org/~cowan There is / One art / No more / No less / To do / All things / With art- / Lessness -- Piet Hein
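The three normalization steps listed in the message above can be sketched roughly as follows. This is a simplified Python illustration, not a conforming XML processor: it ignores character and entity references (which the real rules treat separately), and the function names `normalize_attribute_value` and `tokens_starting_with_mark` are invented here.

```python
import re
import unicodedata


def normalize_attribute_value(raw, declared_type="CDATA"):
    """Simplified sketch of XML attribute-value whitespace normalization."""
    # Step 1: line ends (CR LF, or a lone CR) are normalized to LF.
    value = re.sub(r"\r\n?", "\n", raw)
    # Step 2: each remaining TAB or LF becomes a single space.
    value = re.sub(r"[\t\n]", " ", value)
    # Step 3: for declared types other than CDATA, runs of spaces are
    # compressed and leading/trailing spaces are removed.
    if declared_type != "CDATA":
        value = re.sub(r" +", " ", value).strip(" ")
    return value


def tokens_starting_with_mark(value):
    """Return the space-separated tokens whose first character is a combining mark."""
    return [
        token
        for token in value.split(" ")
        if token and unicodedata.category(token[0]).startswith("M")
    ]


normalized = normalize_attribute_value("  abc \t\u0301def\r\n ghi ", "NMTOKENS")
flagged = tokens_starting_with_mark(normalized)
```

With this input, `normalized` holds three tokens separated by single spaces, and `flagged` contains only the token beginning with U+0301. The separating space is still treated purely as a separator, which is the behaviour described above; only a renderer would show the mark sitting over it.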
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 16:06, Mark Davis wrote: Some of this seems to be in reference to an earlier contention that Text Boundaries (inc. Lines) break between the space and the non-spacing mark. I think this was attributed to Philippe. [This may not be true: I don't actually read his email, because the information content per line falls below my email threshold; not to say that there may not be information there, but I cannot afford to take the time to find out -- sadly, one of my character flaws.] All of the text boundaries preserve grapheme cluster boundaries, which never separate a base character (including space and NBSP) from a following NSM. In addition, each of the boundary types above grapheme clusters makes some statement about the behavior of the grapheme cluster. For example, with line boundaries a SPACE + NSM has a special behavior. With the others, the behavior is the same as the base character. As Ken points out, in any event these are default boundaries, and can be tailored. That being said, if the normal behavior of the default can be improved, and someone has a concrete proposal for doing so, then it can be considered. Mark __ http://www.macchiato.com ► “Eppur si muove” ◄ I was aware that there should not be a line break or word break between the space and the NSM, although I suspect that many implementers will not be aware of this, or at least will not test for it properly and so treat any space as a word break and a line break opportunity. As I just wrote, this requirement to test all spaces for following NSMs is a significant inefficiency built into the standard. But there is still a problem if there is considered by default to be a word break and a line break opportunity AFTER the NSM. I would suggest, as a candidate for a concrete proposal, that the default behaviour be adjusted so that there is no word break or line break opportunity here either. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
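The test discussed above (does a combining mark follow this space?) amounts to a one-character lookahead. The Python sketch below is an illustration only, not the UAX #14 or UAX #29 algorithm: it splits on U+0020 unless the space is serving as the base of a following mark. The function names are hypothetical.

```python
import unicodedata


def is_combining_mark(ch):
    """True for general categories Mn, Mc and Me."""
    return unicodedata.category(ch).startswith("M")


def split_on_separating_spaces(text):
    """Split on U+0020, except where the space is the base of a following mark."""
    tokens, current = [], []
    for i, ch in enumerate(text):
        following = text[i + 1] if i + 1 < len(text) else ""
        if ch == " " and not (following and is_combining_mark(following)):
            # An ordinary separating space: close the current token.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            # Any other character, or a space acting as a base character.
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# "a", a separating space, SPACE + U+033C as a unit, a separating space, "b".
units = split_on_separating_spaces("a  \u033C b")
naive = "a  \u033C b".split(" ")
```

`units` keeps SPACE + U+033C together as a single token, whereas the naive `split(" ")` result detaches the mark from its base. Whether a break should also be suppressed after the mark, as proposed above, is a separate policy question this sketch does not address.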
RE: Questions on ZWNBS - for line initial holam plus alef
Michael wrote: > The Name Police reject this utterly. ZERO WIDTH cannot have an > expanding dynamic width. Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238, "can grow to have a visible width when justified"? And it has the NamesList comment: * nominally zero width, but may expand in justification (But U+0082, BREAK PERMITTED HERE, which otherwise is very similar to ZWSP according to 6429, does apparently not allow such stretching...) /kent k
Re: Questions on ZWNBS - for line initial holam plus alef
At 11:36 AM 8/11/2003, John Cowan wrote: > So far so good, but when I get to an accent with no predefined spacing > variant, I have a problem! No you don't. If you want to say is the diacritic used to represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C at the beginning of the next line. If the seagull doesn't line up properly, you complain to the foundry or the implementor. Again, you are working on the assumption that U+0020 is represented by an actual painted glyph and not e.g. by a horizontal offset. In my experience, the more sophisticated the application -- e.g. a professional page layout application rather than a word processor -- the more likely it is that white space characters will not be consistently treated as painted glyphs. I've heard convincing arguments from the engineers of such applications that the space character shouldn't be a glyph in the font at all, but should simply be a numeric value telling applications how large an offset to apply. Since most fonts do not contain glyphs for variant white space characters such as thin and hair spaces, applications typically treat these as offset values. Painting a glyph is only one way to represent a character. Regards, John Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] The sight of James Cox from the BBC's World at One, interviewing Robin Oakley, CNN's man in Europe, surrounded by a scrum of furiously scribbling print journalists will stand for some time as the apogee of media cannibalism. - Emma Brockes, at the EU summit
Re: Questions on ZWNBS - for line initial holam plus alef
On 13/08/2003 15:54, Jony Rosenne wrote: Suggested but not accepted. I am inherently suspicious when pressure is being exerted to decide complex and difficult questions in a hurry. Jony Jony, I am not trying to hurry anything. I am putting a lot of time and effort into trying to reach proper decisions on these complex and difficult questions. What I am not prepared to do is to accept a quick answer that the lowest common denominator of printers don't bother to do X, therefore we need not bother to support X in Unicode although X is a definite requirement of a significant subset of Hebrew users. If you have problems with this particular suggestion, let's discuss them on the Hebrew list. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Questions on ZWNBS - for line initial holam plus alef
> The only way to bypass this would be to use entity references to encode > the base space needed by the Unicode convention, so this is related to > what Unicode defines as a higher level protocol, needed here to bypass > the limitations of basic text. However it still creates a problem within > CDATA sections, which are not supposed to contain entity references. > One needs then to use the XML CDATA escaping mechanism with > another escaping system specific to CDATA sections (which are > formally anonymous text elements and equivalent to them). Wow! You can't have a CDATA section within or containing a name or nmtoken. You can't have an entity reference within element or attribute names, the most common use of names. You don't want an entity reference with any other name or within an nmtoken; it would be very poor design to use characters that were awkward for developers (the only people who would ever have to deal with this stuff at that level) to type. CDATA sections aren't affected by the part of white-space handling we are discussing. The idea of creating an escaping mechanism specific to (or at all applicable to) CDATA sections is mind-hurtingly bad even in hypothetical terms.
RE: Questions on ZWNBS - for line initial holam plus alef
Kenneth Whistler wrote: > Kent Karlsson said: > > > I see no particular *technical* problem with using WJ, though. In > > contrast > > to the suggestion of using CGJ (re. another problem) > anywhere else but > > at the end of a combining sequence. CGJ has combining class > 0, despite > > being invisible and not ("visually") interfering with any other > > combining > > mark. Using CGJ at a non-final position in a combining sequence puts > > in doubt the entire idea with combining classes and normal forms. > > Why? See above (I DID write the motivation!). Combining classes are generally assigned according to "typographic placement". Combining characters (except those that are really letters) that have the "same" placement, and "interfere typographically" are assigned the same combining class, while those that don't get different classes, and the relative order is then considered unimportant (canonically equivalent). How is then, e.g. supposed to be different from (supposing all involved characters are fully supported), when is NOT supposed to be much different from (them being canonically equivalent)? An invisible combining character does not interfere typographically with anything, it being invisible! The other invisible (per se!) combining characters with combining class 0, the variation selectors, are ok, since their *conforming* use is very highly constrained. Maybe I've been wrong, but I have taken CGJ as similarly constrained as it was given a semantics only when followed by a base character (but now it seems to have no semantics at all). > There are any number of combining characters with combining > class 0, including the vast majority of Indic dependent vowels, > for instance. These are ok. 
They are not invisible, and the vowels should not reorder amongst themselves in a single combining sequence (I know, there is normally only one vowel per syllable, but as the Hebrew discussion has shown, one should not generalise too much), regardless of placement (before, above, below, after, before&after, ...). So at least they should have the same combining class, regardless of typographic placement. (This should have been the case also for the Hebrew vowels...) But class 0 (which is specially treated), I'm not sure if that was ideal. > A combining character sequence is a base character followed > by any number of combining characters. There is no constraint > in that definition that the combining characters have to > have non-zero combining class. Well, you cannot *conformantly* place a VS anywhere in a combining sequence! Only certain combinations of base+vs are allowed in any given version of Unicode. (Breaking that does not make the combining sequence ill-formed, or illegal, but would make it non-conformant, just like using an unassigned code point.) > Canonical reordering is scoped to stop at combining class = 0. (I know it is. But I confess I'm not sure why.) > It doesn't say that it applies to combining character sequences > per se. It applies to *decomposed* character sequences > (meaning, effectively, any sequence which has had the recursive > application of the decomposition mappings done). Yes, for the definition of normalisation. But not necessary for canonical equivalence. Your point? > Take a Myanmar example: /kau/: >
> character sequence:   <1000, 1031, 102C, 1039, 200C>
> combining?:            no    yes   yes   yes   no
> combining classes:     0     0     0     9     0
> comb char sequence:    |--------------------|
> canon reorder scope:   |--|  |--|  |--------|  |--|
> > The combining character sequence here is: <1000, 1031, 102C, 1039> > The *syllable* consists of that plus the trailing ZWNJ. 
> But the relevant sequences for application of the > canonical reordering algorithm are each sequence starting > with combining class zero and continuing through any > sequence with combining class not zero. Formally, a character *pair* based definition is enough: xy S yx,if 0 < cc(y) < cc(x) (and apply that repeatedly); no need to define any "canonically reordering scope", though that may be marginally more efficient in an implementation of normalisation (but this is getting beside the topic of this discussion). > I don't see how introduction of CGJ into such sequences calls > any of the definitions or algorithms into question. No, not the algorithm, but the basic idea and design. The algorithm as such has no "idea" how or why the combining class numbers were assigned. But we humans do, or might have. Again, why should not be canonically equivalent to , when is canonically equivalent to ? And I want a design answer, not a formal answer! (The latter I already know, and is uninteresting.) Since I think should be canonically equivalent to , but cannot be made so (now), the only ways out seem to be to either formally deprecate CGJ, or at least confine it to very specific uses. Other occurrences would not be ill-formed or illegal, but would then be non-conformin
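The pairwise rule stated in this message (exchange adjacent x and y whenever 0 < cc(y) < cc(x), applied repeatedly) can be tried out directly. The Python sketch below assumes the input is already decomposed and uses `unicodedata.combining` for the combining class; `canonical_reorder` is a hypothetical name. It also shows the effect under discussion: CGJ (U+034F, class 0) stops two marks from being exchanged.

```python
import unicodedata


def canonical_reorder(text):
    """Repeatedly swap adjacent characters x, y where 0 < cc(y) < cc(x)."""
    chars = list(text)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            cc_x = unicodedata.combining(chars[i])
            cc_y = unicodedata.combining(chars[i + 1])
            if 0 < cc_y < cc_x:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return "".join(chars)


# U+0301 (class 230) followed by U+0323 (class 220): the marks are exchanged.
plain = canonical_reorder("a\u0301\u0323")

# With CGJ (class 0) between them, neither adjacent pair satisfies the rule,
# so the order is preserved and the two strings are not canonically equivalent.
with_cgj = canonical_reorder("a\u0301\u034F\u0323")
```

For these already-decomposed inputs the results agree with `unicodedata.normalize("NFD", ...)`. This is the formal behaviour only; it says nothing about the design question of why an invisible character should block reordering.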
Re: Questions on ZWNBS - for line initial holam plus alef
From: "John Cowan" <[EMAIL PROTECTED]> > Peter Kirk scripsit: > > > So far so good, but when I get to an accent with no predefined spacing > > variant, I have a problem! > > No you don't. If you want to say is the diacritic used to > represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C > at the beginning of the next line. If the seagull doesn't line up properly, > you complain to the foundry or the implementor. It's true that you can complain to a foundry about inappropriate glyph positioning, but not to an implementor of other components dealing with text boundaries. The inaccuracies we are speaking about are not in the glyph representation but in text-handling algorithms, which are clearly part of the Unicode standard, unlike font problems.
Re: Questions on ZWNBS - for line initial holam plus alef
On Saturday, August 09, 2003 3:11 PM, Kent Karlsson <[EMAIL PROTECTED]> wrote: > Michael wrote: > > The Name Police reject this utterly. ZERO WIDTH cannot have an > > expanding dynamic width. > > Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238, > "can grow to have a visible width when justified"? And it has the > NamesList comment: > * nominally zero width, but may expand in justification > > (But U+0082, BREAK PERMITTED HERE, which otherwise is very similar > to ZWSP according to 6429, does apparently not allow such > stretching...) > > /kent k - ZERO WIDTH SPACE would be suitable only if it did not have the "Zs" general category, which qualifies it as whitespace and as a word breaker. (In fact the same problem occurs with the general category of SPACE and NBSP, which is a good reason why they are open to criticism as base characters for word-like sequences: even with NBSP there is still a word delimitation, which may be important for orthographic and grammatical analysis, given that the main difference between SPACE and NBSP is the line-breaking behavior, not the word-breaking behavior.) - BREAK PERMITTED HERE is a control and does not qualify as a base character. In fact, the gaps to fill depend on the usage: 1) When the isolated diacritic is to be used as a spacing symbol that should not be forcibly glued to the surrounding characters, the NBSP base character is a problem; it also has the wrong character properties, since a combining sequence should normally inherit the properties of its base character. For this usage, we need something like an "INVISIBLE SYMBOL" base character (with gc=Sk, as for other existing spacing diacritics, and probably with neutral directionality). The combining sequence would have its width adjusted to the largest diacritic(s) applied to that "INVISIBLE SYMBOL" base character.
The nearest existing character to fit this function is ZWSP, but it is whitespace, not symbolic. 2) When the isolated diacritic is to be used as a regular letter within words (e.g. in Traditional Hebrew), we need something like an "INVISIBLE LETTER" base character (with gc=Lo and neutral directionality), whose width is not necessarily adjusted but may adjust depending on the left or right context (in rendering engines). One could then use an isolated circumflex between the two characters of the pair "oo", with the diacritic either centered on the touching edges of the surrounding spacing base characters, or given a sufficient margin on either side to make it fit. The resulting combining sequence of the INVISIBLE LETTER and its non-spacing diacritics would be mostly non-spacing. But this rendering may be tricky to implement in many cases, and the renderer should be allowed to render it as a spacing diacritic, as for the invisible symbol, except that it would not be a symbol but really a letter that can fit within a word (with applications for elided letters in the middle of an unbreakable word). This function is partially implementable with CGJ, but only if there is a preceding combining sequence or base letter, or with WJ (Word Joiner), but that is a format control and not applicable as a base character. For texts that want to present the isolated diacritic in its normal function as a diacritic, the current best solution is to use the existing (spacing) dotted circle symbol as the base character.
However, this usage is quite technical and too closely tied to Unicode, and it is not appropriate everywhere, since the dotted circle base character may conflict with other uses of this symbol in a document. (Some documents prefer to use a gray-coloured Latin small letter o for such presentation forms in rich text like HTML or RTF, but this still has the problem that a rich-text format like HTML will break the plain text into separate sequences, and the non-grayed diacritic must still be rendered on top of this separate sequence. Which base character can be used in that case? There is currently none, except trying with ZWSP (which does not always work); it would better be a non-spacing INVISIBLE LETTER than a spacing INVISIBLE SYMBOL, which by itself has no defined width, just a minimum width of 0.) -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: Questions on ZWNBS - for line initial holam plus alef
At 05:27 PM 8/8/2003, Kenneth Whistler wrote: Because the mechanism for doing so -- application to SPACE or to NBSP -- has been specified by the standard for a decade now. True enough, but I'm also a bit concerned about this mechanism because white space characters are another pesky thing that not all applications paint. TeX, perhaps most famously, uses its own 'glue' instead of the space glyph in the font. And what happens when word spacing is expanded or contracted in text? The diacritic mark ends up being shoved to the left or right of where it should be. Of course, if the space glyph is not painted you have to rely on blind offsets for mark positioning, because unpainted glyphs can't be found for smart positioning lookups. As someone who cares about typography, I don't like blind offsets because they don't offer precise enough control: I would much rather have a mechanism that I can reliably and precisely use with glyph positioning lookups. I'm not suggesting that the use of space/nbspace for this purpose should be deprecated, only that an alternate mechanism would be useful for those who want more control of how combining marks are rendered on a blank base. A similar but not identical issue was raised by Peter Constable when we were talking about Qere vs Ketiv readings in Biblical Hebrew. There are cases in which vowels are applied to elided consonants, which in some texts results in marks applied to a blank base in mid-word. In this case, my concern about using space or nbspace is that these imply a word break where there is not, in fact, any break in the word: the blank base is part of the word. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] The sight of James Cox from the BBC's World at One, interviewing Robin Oakley, CNN's man in Europe, surrounded by a scrum of furiously scribbling print journalists will stand for some time as the apogee of media cannibalism. - Emma Brockes, at the EU summit
RE: Questions on ZWNBS - for line initial holam plus alef
Kent asked: > How should a freestanding double diacritic be encoded (for purposes of > meta-discussions, and the like): or diacritic, SPACE>? It *could* be represented as , of course, or for that matter , or other possibilities. The combining character sequence, in either case, is the sequence. But it *should* be represented by something visually more meaningful, such as , which is how the standard itself tends to represent it when needing to engage in a meta-discussion. The whole point of a double diacritic is its graphic application to two base characters, which point is lost in the discussion if you don't show a graphic base when displaying the character in isolation. > How should combining characters (spacing as well > as non-spacing) that are not vertically centered *roughly* be displayed, > e.g. , should that *roughly* > be displayed with or without a typographic void to the left of it? It's up to the application. And again, I would say that if this level of detail is a concern to the person originating the text, then the better convention is to represent the combining character on a *visible* generic base. > So > if I want a space (though not an overgrown one), should one use > ? Or even ZWSP, SPACE, right-side combining character>, to prevent "space > collapse". > And similarly for left-side combining characters. Likewise for defective > combining sequences. If I want a visible pseudo-base, a dotted ring, or > an > underline, the answers are fairly clear, using a suitable character as a > base. Exactly. Which is why you should use such conventions if you care about the placement in this detail. Otherwise, you up-level and make use of whatever mechanisms a typesetting application makes available for individual adjustment of the placement of glyphs. --Ken > But not for the cases above. I don't think that should entirely up > to each font (maker), without any recommendation. (A "should" rather > than a "shall" is quite sufficient.) > > /kent k > >
Re: Questions on ZWNBS - for line initial holam plus alef
On Saturday, August 09, 2003 12:49 AM, Michael Everson <[EMAIL PROTECTED]> wrote: > At 14:22 -0700 2003-08-08, Kenneth Whistler wrote: > > > Philippe, you are tilting at windmills, here. There is no chance > > that the UTC is going to consider such a character, in my > > assessment, let alone give it the properties you suggest. > > Nor WG2 either. Why is that? Is it because I am suggesting something that others may find useful to fill a large gap in Unicode for spacing diacritics, but I am not trusted enough, owing to my errors or confusions here, for the suggestion to be endorsed by more "serious" UTC or WG2 members? I admit that the properties of such a character can be discussed, and that it is not necessarily an "Sk" symbol but possibly an "Lo" letter, in which case the name "INVISIBLE LETTER" may be appropriate (it could then also fill the gap for Hebrew "Yerushala(y)im", though that is possibly a distinct function, for a missing letter in phonology). Why do you think it is stupid to have a single carrier character that would avoid adding new spacing diacritics, when the standard combining diacritics could be used without quirks like "defective" sequences to produce the desired effect? If you think that spacing diacritics are stupid, why then are they given these properties and not deprecated (no longer recommended) in the standard in favor of the SPACE+diacritic sequences, which are really not equivalent to spacing diacritics used as symbols (sometimes also described as "MODIFIER LETTER", which is very misleading given their gc=Sk property) and as base characters (to which other diacritics can be applied)? -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: Questions on ZWNBS - for line initial holam plus alef
On 06/08/2003 03:38, Kent Karlsson wrote: Kenneth Whistler wrote: Kent Karlsson said: I see no particular *technical* problem with using WJ, though. In contrast to the suggestion of using CGJ (re. another problem) anywhere else but at the end of a combining sequence. CGJ has combining class 0, despite being invisible and not ("visually") interfering with any other combining mark. Using CGJ at a non-final position in a combining sequence puts in doubt the entire idea with combining classes and normal forms. Why? See above (I DID write the motivation!). Combining classes are generally assigned according to "typographic placement". Combining characters (except those that are really letters) that have the "same" placement, and "interfere typographically" are assigned the same combining class, while those that don't get different classes, ... Not true, as we have seen for Hebrew. It's supposed to be true, but isn't, and the problems can't be fixed. ... and the relative order is then considered unimportant (canonically equivalent). How is then, e.g. supposed to be different from (supposing all involved characters are fully supported), when is NOT supposed to be much different from (them being canonically equivalent)? ... There is no difference when the characters really do not interfere typographically. But when they do, there is a real and, in some languages, meaningful distinction. ... ... the only ways out seem to be to either formally deprecate CGJ, or at least confine it to very specific uses. Other occurrences would not be ill-formed or illegal, but would then be non-conforming. OK, let's confine it to those specific uses where it is really needed, e.g. to get round the problem of combining characters with different combining classes which actually do interact typographically, and perhaps there was another one being suggested. 
I have no problem with that - as long as the list of permitted uses is not set in stone, so that new uses can be approved when they are discovered. But there is no good reason to object to its use in those cases where it is needed, simply because in many other cases it is not needed. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
RE: Questions on ZWNBS - for line initial holam plus alef
> >3) In attribute values that have a declared type other than > CDATA, multiple > > spaces are compressed to a single space, and leading and > trailing spaces > > are removed. After this is done, there can be no spaces in attributes > > of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types. > > In the types IDREFS and ENTITIES, spaces are used to separate > > individual tokens, none of which may begin with a combining character. > > In the remaining type, NMTOKENS, individual characters may begin > > with a combining character, so it is possible that such a token, if > > not the first in the attribute, will be rendered in a peculiar way, > > with the combining character placed over the separating space. > > But that is a mere rendering glitch and in no way affects anything. > > > > > Not just a rendering glitch, I suspect. If the combining character is > combined with the separating space, the space loses many of its > separating functions, and perhaps keeps a confusing subset of them with > all sorts of possibilities of error. At best tokens beginning with > combining characters will be unusable. At worst they will crash the > implementation (and count on someone trying deliberately to do that!). > The only safe thing to do is to specify that space followed by a > combining mark is NEVER considered to be a space and this combination is > NEVER generated. No, the safe thing to do (and the thing that is done) is to treat the space as a space ignoring the fact that the NMTOKEN contains a combining character, this is even safer than your suggestion since it can't mis-identify the combining properties of a character. This effectively bans space+combining (and for that matter NBSP+combining since NBSP isn't allowed in NMTOKENs) within an NMTOKEN and means that if you attempt to begin an NMTOKEN with space+combining it will be treated as beginning with the combining character. 
The resulting loss of expressive power in having this banned is negligible. It means that you can't use what is quite a linguistic oddity (space+combining is mainly used in meta-discussion of combining marks, as was mentioned earlier) in a context where the text is human-readable (hopefully) but not fully general. NMTOKENs should only be given "raw" to a user by relatively low-level tools (i.e. general purpose XML tools for developers); in other contexts they should be represented by a more user-friendly and application-appropriate indicator (perhaps text, perhaps not), so the inability to use space+combining won't apply at that level.
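The tokenising behaviour described above can be sketched roughly as follows. This is only an illustration, not what any particular XML parser does internally: the function name `normalize_nmtokens` is hypothetical, and the sketch assumes that tabs and line breaks have already been turned into U+0020 by an earlier normalisation step.

```python
import unicodedata


def normalize_nmtokens(value: str) -> list[str]:
    """Hypothetical helper: trim and collapse U+0020, then split into tokens.

    The space is always treated as a separator, with no regard to whether
    the next character happens to be a combining mark.
    """
    # str.split(" ") yields empty strings for runs of spaces; drop them.
    return [token for token in value.split("\u0020") if token]


tokens = normalize_nmtokens("  a  \u0301b ")
# The second token now starts with the combining acute accent: the space
# before it was consumed as a separator, not kept as a base character.
print([[f"U+{ord(ch):04X}" for ch in token] for token in tokens])
print(unicodedata.combining(tokens[1][0]) != 0)
```

The point of the design is that the splitter never needs to consult combining properties at all, which is why it cannot mis-identify them.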
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk wrote: > Thank you, Ken. Well, you make it sound as if the problems are > minimal, and that version I can just about accept. But if Philippe is > correct about what he says about UAX#29 and UAX#14, there are some > more serious problems. It is certainly highly inappropriate for > non-spacing diacritics to be considered word boundaries. Non-spacing diacritics had better not be word boundaries, otherwise a string like Québec (spelled with U+0301, as here) would be considered two words. I don't have time right now to look up the relevant properties and UAX's, but I sincerely hope this is just another "Philippe mistake" and not a general misinterpretation that anyone might make. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
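The Québec example can be made concrete with a small sketch. It is not the UAX #29 algorithm, only a toy word splitter with a pluggable test for "word characters"; the names `split_words`, `naive` and `mark_aware` are made up for the illustration.

```python
import unicodedata


def split_words(text, is_word_char):
    """Split text into maximal runs of characters accepted by is_word_char."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def naive(ch):
    # Letters only: a combining mark (category Mn) is not alphabetic.
    return ch.isalpha()


def mark_aware(ch):
    # Letters plus any combining mark (general categories Mn, Mc, Me).
    return ch.isalpha() or unicodedata.category(ch).startswith("M")


text = "Que\u0301bec City"  # decomposed spelling: e + U+0301
print(split_words(text, naive))       # the accent acts as a boundary
print(split_words(text, mark_aware))  # the accent stays inside the word
```

With the naive test the decomposed spelling falls apart into "Que" and "bec", which is exactly the outcome described as unacceptable above.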
RE: Questions on ZWNBS - for line initial holam plus alef
> Thanks for the clarification. I probably misunderstood Jon's intention. > But is there a problem if, for example, an application sees the string > and regularises it (wrongly!) to combining mark>? Yes, I was not saying that it wouldn't be sensible to begin a line of text with a spacing diacritic (whether precomposed or created using space or NBSP). I was saying that it wouldn't be sensible to begin a line with a combining diacritic, since that combining diacritic would be combining with a newline character which it's difficult to think of any possible sensible meaning for. Attribute normalisation would change the sequence U+000A, to U+0020, which would arguably change the meaning, but changing the meaning of a meaningless construct isn't a problem to my mind.
Re: Questions on ZWNBS - for line initial holam plus alef
On Wednesday, August 06, 2003 10:19 PM, Kenneth Whistler <[EMAIL PROTECTED]> wrote: > Kent Karlsson responded: > > > > > I see no particular *technical* problem with using WJ, though. > > > > In contrast > > > > to the suggestion of using CGJ (re. another problem) > > > anywhere else but > > > > at the end of a combining sequence. CGJ has combining class > > > 0, despite > > > > being invisible and not ("visually") interfering with any other > > > > combining > > > > mark. Using CGJ at a non-final position in a combining sequence > > > > puts in doubt the entire idea with combining classes and normal > > > > forms. > > > > > > Why? > > > > See above (I DID write the motivation!). > > I guess that I did not (and still do not) see the motivation for > your final statement. > > > Combining classes are generally > > assigned according to "typographic placement". Combining characters > > (except those that are really letters) that have the "same" > > placement, and "interfere typographically" are assigned the same > > combining class, while those that don't get different classes, and > > the relative order is then considered unimportant (canonically > > equivalent). How is then, > > e.g. supposed to be different from > > (supposing all involved characters > > are fully supported), when is NOT > > supposed to be much different from > > (them being canonically equivalent)? An invisible combining > > character does not interfere typographically with anything, it > > being invisible! > > The same thing can be said about any inserted invisible character, > combining or not. > > How is: supposed to be different from > > > How is: supposed to be different from > > > In display, they might not be distinct, unless you were doing some > kind of show-hidden display. Yet these sequences are not canonically > equivalent, and the presence of an embedded control character or an > embedded format control character would block canonical reordering. 
I disagree with you: using an LRM mark in the middle of a combining sequence conforms to the canonicalization rules but is clearly ill-formed, as is using a NULL control in the middle, which breaks the combining sequence. So in your two examples above, inserting the LRM or NULL splits a combining sequence into three, each with its own properties, and the last one is ill-formed because it contains a combining character after a control rather than after a base or combining character. The proposal to use CGJ, however, is legal: it does not break combining sequences or grapheme clusters, so the whole sequence encoded with CGJ will be considered by rendering engines. CGJ is a no-op for rendering but not for canonical ordering, where I see its only well-formed uses as a canonical ordering fix for the NF* normalized forms, or before a base character to extend the combining sequences used by renderers or by character parsers and breakers. So your example with: would in fact be rendered and parsed as three combining sequences: , , i.e. a well-formed , a control (normally invisible, but possibly shown in an editor with a visible glyph in a dotted square, as in the Unicode charts), and an ill-formed isolated (most probably rendered with a dotted circle). So it cannot be considered equivalent to, or even rendered equivalently to: or its canonical equivalents (not in normalized order but still conforming and well-formed, and handled equivalently): -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
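The pairwise reordering rule mentioned in this thread (swap adjacent x, y whenever 0 < cc(y) < cc(x), repeatedly) and the blocking effect of any class-0 character can be sketched as below. This is a minimal illustration using Python's standard `unicodedata` module; `canonical_reorder` is a hypothetical name, it only reorders (it performs no decomposition), and the choice of a + acute + dot below is just a convenient test case.

```python
import unicodedata


def canonical_reorder(text):
    """Apply the pairwise rule until no adjacent pair can be swapped."""
    chars = list(text)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            x, y = chars[i], chars[i + 1]
            # Swap only if y has a non-zero class strictly lower than x's.
            if 0 < unicodedata.combining(y) < unicodedata.combining(x):
                chars[i], chars[i + 1] = y, x
                swapped = True
    return "".join(chars)


ACUTE = "\u0301"      # combining class 230 (above)
DOT_BELOW = "\u0323"  # combining class 220 (below)
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0
LRM = "\u200E"        # LEFT-TO-RIGHT MARK, combining class 0

plain = "a" + ACUTE + DOT_BELOW
with_cgj = "a" + ACUTE + CGJ + DOT_BELOW
with_lrm = "a" + ACUTE + LRM + DOT_BELOW

for label, s in [("plain", plain), ("CGJ", with_cgj), ("LRM", with_lrm)]:
    reordered = canonical_reorder(s)
    print(label, [f"U+{ord(c):04X}" for c in reordered], reordered == s)
```

As far as canonical ordering is concerned, CGJ and LRM behave identically here: both have combining class 0, so neither mark can move across them. Whether the resulting sequences are well-formed or sensible is the separate question argued above.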
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 18:03, John Cowan wrote: You don't have (nor do I) the vaguest idea why Microsoft produced this particular nonconforming implementation, or whether they consider it a bug or not. Don't make assumptions about things you don't know anything about. I have been working closely and personally with Microsoft's head of typography on support for Hebrew and other scripts in Uniscribe. While I don't happen to have detailed information on this particular point, I am aware of some of the constraints that Microsoft has been under e.g. to avoid the inefficiency of calling Uniscribe for rendering of plain text in western languages. This is why they have been slow to support use of arbitrary diacritics with Latin text. I think this issue may have been fixed with the soon to be released new version of Uniscribe, and perhaps the problem with spaces and diacritics has also been fixed. We'll see. Surely the UTC should not create difficulties for implementers and then just shout at them for getting things wrong. The UTC should try to produce a standard which is workable without unnecessary complications. This is sheer conjecture. No, it is not. For one thing I have not said that the UTC has done anything bad, and certainly not that it has done so deliberately, only that it should not do so. But it is not just me who has pointed to the difficulty for implementers of the space + diacritic convention which the UTC defined (with inadequate forethought rather than malicious intention), see also John Hudson's independent opinions and the failure of Microsoft to implement it. I was wrong to suggest that the UTC is shouting at implementers for getting things wrong though I think it should do so if they do. But UTC members have told me to complain to implementers for getting things wrong. As for my last statement, that is simply my opinion.
If you wish to disagree with it, do you prefer that the UTC should deliberately produce an unworkable standard, or that it should introduce unnecessary complications? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 08/08/2003 13:56, Thomas M. Widmann wrote: Peter Kirk <[EMAIL PROTECTED]> writes: On 08/08/2003 08:54, Philippe Verdy wrote: ... Could there be another codepoint assigned that has these properties: 20CF;ZERO WIDTH SYMBOL;Sk;0;ON; 0020N; [...] But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you are suggesting other uses in which it really has zero width. Well, it might have in a case like line initial holam which shifts on to a following silent alef, but that is a rather special case. What would be a better name? ACCENT CARRIER? /Thomas Perhaps CARRIER FOR COMBINING CHARACTERS - not COMBINING CHARACTER CARRIER as that gives the wrong idea that this should itself be a combining character, it should not. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk asked: > Thanks for the clarification. I probably misunderstood Jon's intention. > But is there a problem if, for example, an application sees the string > and regularises it (wrongly!) to combining mark>? Then you have a problem, of course. What the Unicode Standard says about application of nonspacing combining marks to SPACE seem clear to me. What other standards say about space folding is clear in their own contexts. If someone is implementing both such standards together, then one has to be careful how the requirements articulate. In Unicode terms, a space folding is an example of a "knowing modification" of the content of the text. It is perfectly o.k. to modify Unicode text, of course, *as long as you know what you are doing* -- i.e., you aren't converting valid text to bit hash because you aren't conforming to the meaning of the characters or to their encoding forms. Now if a process is doing a space folding, but is applying it to Unicode text as a "semi-ignorant modification", i.e., without being aware of the fact that nonspacing combining marks can apply to SPACE characters (and that such sequences are valid combining character sequences and should be treated analogously with other grapheme clusters, viz UAX #29), then it is modifying the text away from its intended content without *knowing* what it is actually doing. Such mistakes are programming errors in application of the relevant standards. Of course a standard which mandates space folding is also within its rights to mandate, for example, the non-use of nonspacing marks applied to SPACE characters. It can simply rule out such sequences as valid for its context, in which case the problem goes away. The important thing here is to know what you are doing when you modify text, and, as far as possible, to accomplish such modifications in ways that are the same as other processes which also know what they are doing. That is the basis for interoperability of textual data. --Ken
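The difference between a "semi-ignorant" and a "knowing" space folding can be illustrated with a short sketch. Both functions are hypothetical, and the policy chosen for the second one (a SPACE immediately followed by a combining mark is kept as that mark's base and never folded away) is only one possible reading of the requirement, stated here as an assumption.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def naive_fold(text):
    # Collapses and trims whitespace with no knowledge of combining marks.
    return " ".join(text.split())


def fold_spaces(text):
    """Collapse runs of U+0020, but keep any SPACE that carries a mark."""
    out = []
    prev_plain_space = False
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == " " and nxt and is_mark(nxt):
            out.append(ch)            # base of a SPACE + mark sequence
            prev_plain_space = False
        elif ch == " ":
            if not prev_plain_space:  # first plain space of a run
                out.append(ch)
            prev_plain_space = True
        else:
            out.append(ch)
            prev_plain_space = False
    return "".join(out)


sample = " \u0301x"  # isolated acute accent shown on a SPACE, then "x"
print([f"U+{ord(c):04X}" for c in naive_fold(sample)])   # base is lost
print([f"U+{ord(c):04X}" for c in fold_spaces(sample)])  # base is kept
```

The naive version silently turns a valid combining character sequence into a defective one; the mark-aware version leaves it alone while still folding ordinary runs of spaces.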
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk wrote: > I think this may be a "Peter mistake". I meant to refer to spacing > diacritics. Sorry. > > It is certainly highly inappropriate for spacing diacritics to > be considered word boundaries. Why? It is entirely dependent on the orthography and conventions involved. There is probably as much (or more) bad ASCII usage of spacing diacritics like `this', where a grave accent character is being misapplied to make a directional quotation mark, as there is actual, linguistically appropriate use of spacing diacritics. Also, everyone should consider carefully the status of UAX #29, Text Boundaries. 2 Conformance This is informative material. There are many different ways to divide text elements corresponding to grapheme clusters, words and sentences, and the Unicode Standard and this document do not restrict the ways in which implementations can do this. This specification is a default mechanism; more sophisticated engines can and should tailor it for particular locales or environments. ... The whole UAX is informative. It is a here's-how-you-can-approach- the-problem implementation guide with some suggestions for rules and classes. *If* you are working with an orthography that uses one or more spacing diacritics, and *If* those spacing diacritics need to be represented by sequences, then you are in the situation where your implementation of text boundaries should take sequences explicitly into account, so as to result in expected behavior for that orthography. Everyone has had experiences with their platform UI producing bad results for text boundaries. The Solaris platform I am writing this on right now, for example, implements a double-click word selection that treats the string "`this'," above, including the grave accent, the apostrophe, and the comma, as a "word". Is that right or wrong? Well, it depends on what you are trying to do, I expect. 
But even the most sophisticated platform implementers can only do so much with processes like default word selection. It is bound to be wrong for one purpose or another and for one orthography or another. Ultimately you need to have tailored processes that can be orthography-specific if you want to get best results. --Ken
Re: Questions on ZWNBS - for line initial holam plus alef
Ken Whistler posted: Of course a standard which mandates space folding is also within its rights to mandate, for example, the non-use of nonspacing marks applied to SPACE characters. It can simply rule out such sequences as valid for its context, in which case the problem goes away. And for such standards or applications one can usually use U+00A0 NO-BREAK SPACE to force multiple spacings. One can also use this followed by a non-spacing combining character to call for rendering of that combining character in isolation. My feeling is that because of the special qualities of regular SPACE using NBSP (U+00A0) should be the more robust way to go. Essentially, since the Unicode specifications say that a non-spacing diacritic can be applied to any base character, including the spaces, it is up to fonts and other presentation software to support this and to try to make the results look good according to orthographic and cultural expectations, just as it is with any text coded in Unicode. Sometimes fonts don't do this. I would not at all be surprised to find for example that _g_ followed by U+0325 COMBINING RING BELOW would come out with the combining ring overlapping the tail of the _g_ unless I were using a font especially designed for linguistic use. I would not be at all surprised that some fonts and display devices wouldn't justify NBSP + COMBINING DOT BELOW at the beginning of a line. But good typographical fonts should justify such combinations and should presumably change the width of NBSP when appropriate. Such changes of width and shapes are what one finds with ligatures in fonts that support ligatures. Jim Allan
Re: Questions on ZWNBS - for line initial holam plus alef
From: "Peter Kirk" <[EMAIL PROTECTED]> > On 13/08/2003 11:09, Philippe Verdy wrote: > > >... For this reason, defective > >combining sequences (combining characters without a leading base > >character) should be forbidden (invalid for XML). > > > > > If there is even the remotest possibility of this happening, we need to > know quickly! Defective combining sequences are legal Unicode and are > now being suggested for use in Hebrew e.g. for holam male. But such a > definition would be useless if XML restricts the texts it can represent > to a subset of Unicode excluding such sequences. I did not notice that the discussion about Hebrew holam male was related. In fact I don't know anything about the Hebrew alphabet, so I could not follow the semantics being discussed, and did not notice that this was a "defective" encoding (in terms of combining sequences). My use of the term "forbidden" related only to possible security problems with XML, and the term was certainly too hasty. However, given that possible security and parsing issues do exist, the case of used to encode "holam-male" may be another argument for proposing a neutral/invisible base character for combining characters. For Hebrew, it needs to have "letter" behavior, but for other isolated diacritics in Latin, Greek, Cyrillic, and probably also Hiragana and Katakana (voice marks), it would better be handled as a symbol. I suggested several semantics for such invisible character(s) in an earlier message: - An invisible symbol - An invisible LTR letter - An invisible RTL letter all of them having a *compatibility* decomposition (or NFKD form) to SPACE plus the mark, like the existing spacing diacritics, but not being canonically equivalent to SPACE (so that the legacy semantics, properties, behavior and known caveats stay unchanged and implementation/usage-dependent, as they are now with SPACE+NSM, which could then be discouraged in Unicode and strongly deprecated in SGML/HTML/XML)
Re: Questions on ZWNBS - for line initial holam plus alef
From: "Jon Hanna" <[EMAIL PROTECTED]> > I was saying that it wouldn't be sensible to begin a line with a > combining diacritic, since that combining diacritic would be combining > with a newline character which it's difficult to think of any possible > sensible meaning for. A newline is a control with a whitespace property and line-breaking behavior. It must not combine with a combining diacritic, according to the UAX definition of grapheme clusters. So +NSM is clearly defective and must be parsed as two distinct combining sequences: the first for the newline sequence, the second being "defective" because the combining character has no base character to which it applies. (The standard suggests using a dotted circle to render it in editors, but suggests nothing for the rendering of final documents, which could simply drop the defective sequence, display it with a replacement base character, use a dotted circle, or use an invisible glyph.) So the result in this case is implementation-dependent and not interoperable. For me the term "difficult" is inappropriate. In fact the sequence is invalid for interoperability (even though it is valid, not forbidden, in ISO 10646/Unicode as a string fragment for intermediate processing), and it should not occur in actual documents outside an external processing context that defines its behavior.
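A rough way to spot such defective sequences in a string is sketched below. It is a simplification and not a conformance checker: the function name `defective_marks` is hypothetical, "base character" is approximated as any character of general category L, N, P, S or Zs, and ZWJ/ZWNJ are simply treated as continuing whatever sequence they occur in.

```python
import unicodedata


def is_mark(ch):
    return unicodedata.category(ch).startswith("M")


def is_base(ch):
    # Approximation: graphic characters that are not combining marks.
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"


def defective_marks(text):
    """Return the indices of combining marks that have no base character."""
    hits = []
    has_base = False  # nothing precedes the first character
    for i, ch in enumerate(text):
        if is_mark(ch):
            if not has_base:
                hits.append(i)
        elif ch in "\u200C\u200D":
            pass  # ZWNJ / ZWJ: leave the current state unchanged
        else:
            # Controls, format characters and line separators reset the base.
            has_base = is_base(ch)
    return hits


print(defective_marks("\n\u0301a"))  # mark right after a newline
print(defective_marks(" \u0301a"))   # mark applied to SPACE
print(defective_marks("e\u0301"))    # ordinary letter + mark
```

A process that wants interoperable output could use a check of this kind to decide whether to reject the text, insert a visible base such as the dotted circle, or pass it on unchanged; which of these is right depends on the processing context, as discussed above.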
RE: Questions on ZWNBS - for line initial holam plus alef
At 15:11 +0200 2003-08-09, Kent Karlsson wrote: Michael wrote: The Name Police reject this utterly. ZERO WIDTH cannot have an expanding dynamic width. Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238, "can grow to have a visible width when justified"? And it has the NamesList comment: * nominally zero width, but may expand in justification (Rolls eyes.) Fine. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
Jon Hanna scripsit: > If this is not the case (I'm not entirely sure this bans what XML does with > spaces) then all we would need is a change so that rather than a de facto > ban on space+combining within names and nmtokens we would have an explicit > ban on the same; then we'd all be happy, except possibly for some sadistic > XML application designer that was planning on use that combination out of > ill-will towards his or her colleagues.

Space in any case is not allowed in a token. There are far worse conformance problems than this anyway, notably the fact that canonical equivalence is not respected in XML names: a start-tag that is decomposed and an end-tag that is composed (or vice versa) will not match.

--
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com http://www.ccil.org/~cowan
The Imperials are decadent, 300 pound free-range chickens (except they have teeth, arms instead of wings and dinosaurlike tails). --Elyse Grasso
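The mismatch between composed and decomposed names can be illustrated with a short Python sketch. The element name used here and the helper `names_match_nfc` are made up for illustration; the point is only that a code-point-by-code-point comparison, which is how XML names are matched, treats canonically equivalent spellings as different, while comparing after NFC normalization does not.

```python
import unicodedata

# Two canonically equivalent spellings of the same (hypothetical) tag name.
composed = "caf\u00e9"      # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

def names_match_nfc(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

raw_equal = composed == decomposed               # code-point comparison
nfc_equal = names_match_nfc(composed, decomposed)
```

Here `raw_equal` is `False` and `nfc_equal` is `True`, which is why a start-tag in one form and an end-tag in the other do not pair up unless something normalizes the document first.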
Re: Questions on ZWNBS - for line initial holam plus alef
On 08/08/2003 08:54, Philippe Verdy wrote: ... Could there be another codepoint assigned that has these properties: 20CF;ZERO WIDTH SYMBOL;Sk;0;ON; 0020N; i.e. being considered symbolic, not a whitespace, with combining class 0 (not combining), and used as an explicit base for a isolated spacing diacritic to never show with a dotted circle? (note U+20CF is just a suggestion, as it fits at end of the symbolic block used for currency symbols, just before the "extended" combining characters block, and because the U+02XX block where other "Sk" spacing diacritics are defined is full). The compatibility decomposition to a space is to make it in sync with other compatibly decomposable spacing diacritics. The new character would allow to represent diacritics that currently don't have a spacing counterpart, and use them as if they were letter like. Let's look at a similar diacritic which currently has an existing "precombined" spacing version: 00B4;ACUTE ACCENT;Sk;0;ON; 0020 0301N;SPACING ACUTE Philippe, this sounds like an excellent suggestion, at least in general terms. There is a missing function here, which has been provided (since Unicode 1.0) by overloading the characters space and NBSP with an inappropriate second function. Of course we can't make existing practice illegal, but we can recommend that in future versions of the standard your new ZERO WIDTH SYMBOL character should be used for display of isolated diacritics where there is no separate spacing form. We can also suggest that the width of the combination should be that of the diacritic only. But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you are suggesting other uses in which it really has zero width. Well, it might have in a case like line initial holam which shifts on to a following silent alef, but that is a rather special case. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
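The properties quoted above for U+00B4 ACUTE ACCENT can be checked against the Unicode Character Database with Python's standard `unicodedata` module. This is only a sketch for inspecting an existing spacing diacritic; the helper name `describe` is hypothetical, and nothing here models the proposed new character.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, combining class, bidi class, decomposition)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.bidirectional(ch),
        unicodedata.decomposition(ch),
    )

acute = describe("\u00b4")
# Compatibility normalization replaces the spacing accent with SPACE + U+0301,
# while canonical normalization leaves it alone.
nfkd = unicodedata.normalize("NFKD", "\u00b4")
nfd = unicodedata.normalize("NFD", "\u00b4")
```

The tuple in `acute` carries the same fields as the UnicodeData line in the message: category `Sk`, combining class 0, bidi class `ON`, and a `<compat>` decomposition to 0020 0301.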
RE: Questions on ZWNBS - for line initial holam plus alef
the > solution with > SPACE is really tricky due to the special treatment of SPACE notably > in HTML, SGML, XML I disagree. There are a few different things that happen with whitespace in such technologies. Some of these only apply to elements that do not allow any character data apart from whitespace to appear directly within them, and hence are not an issue here. Some happen at relatively high level of processing, e.g. rendering (not parsing) of HTML, and as such should correctly process spaces combined with combining characters. There are only two theoretical problems that I can see here, the first is that a whitespace character other than space gets converted to space by attribute value normalisation, and that this changes the meaning of the text in some way. This could only occur if the combining character were the first character in a line of text, which is quite a nonsensical construct to begin with. The other would be with names, qnames, nmtokens and such. These are not normal textual content; they are human-readable constructs that are based on normal text because that makes it easier for some developers to work at a plain-text level (if they speak the natural language that the human-readable constructs were based on). Support for the linguistic oddity of a dialectic divorced from the context in which it would normally exist would have little justification in this place except for fulfilling the general goal of "completeness". Completeness is a laudable aim of course, but extreme edge-cases need only be brought in if they are both safe and cheap. Anyone designing an XML application who frequently considers isolated diacritics as the most natural choice in part of such tokens probably needs to take a couple of weeks holidays before continuing the design. Of course some of the characters that could be considered to be precomposed isolated diacritics are banned from use in nmtokens anyway.
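The first of the two theoretical problems, attribute value normalisation turning a line break before a combining character into a space, can be observed with a small Python sketch using the standard `xml.etree.ElementTree` parser. The element and attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

# "literal" holds a real line feed followed by U+0302 COMBINING CIRCUMFLEX;
# "escaped" holds the same line feed written as a numeric character reference.
doc = '<t literal="a\n\u0302b" escaped="a&#10;\u0302b"/>'

elem = ET.fromstring(doc)
literal = elem.get("literal")   # the line feed was normalized to U+0020
escaped = elem.get("escaped")   # the referenced line feed survives
```

After parsing, `literal` contains SPACE followed by the combining mark, so a sequence that was defective in the source has become a SPACE+combining sequence, while `escaped` still contains the line feed.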
Re: Questions on ZWNBS - for line initial holam plus alef
From: "Peter Kirk" <[EMAIL PROTECTED]> > There is some potential for real trouble here, if one process outputs an > NMTOKEN starting with a combining character preceded by a separating > space, or something else which is changed into a space, and another > process takes the new space plus combining character as a unit and so > doesn't recognise the separation. Any hackers and virus programmers > reading this will soon start flooding the Internet with tokens beginning > with combining characters in the hope of crashing implementations or > finding back doors. Of course this wouldn't have been a problem if > Unicode had never defined space plus combining character as legal and > meaningful. But this is not my problem!

I do agree: an XML document could require a given attribute or element at some place. If the attribute name follows the element name after a line break, which is changed into a space during parsing, then forcing XML parsers to treat SPACE+combining as an unbreakable grapheme cluster acting like a letter would create a new element name, which may violate the identity of the element name. Now suppose that the attribute name contains a colon: you have created a custom namespace name, under which you can add any element you like, even if this was forbidden by the content model of the reference schema. So this would invalidate existing documents, or create holes allowing insertion of arbitrary XML content, if the XML application does not validate element names (the namespace + name pair) extremely strictly and completely exclude from processing any unrecognized element (including all its content and attributes). This would be a breach in the content model, which may have been validated and tested for security in another layer of the document encoding process (notably when XML documents are created from templates, such as XSL processors, or custom C source using simple template substitution).

So for me the sequence SPACE+combining should not be acceptable as a valid grapheme cluster within element names or attribute names, and thus would need to be excluded from NMTOKEN. The correct way to do it is to consider it NOT A LETTER, but a symbol (Sk), exactly like other spacing diacritics, which are already invalid in NMTOKEN. There still remains the unresolved question of grapheme clusters that could span the starting "<" or ending ">" or "/>" of tags, or the leading "&" of an entity reference. For this reason, defective combining sequences (combining characters without a leading base character) should be forbidden (invalid for XML). So there remains an unsolved conflict here: defective combining sequences cause security or validity problems in XML documents, and a non-defective SPACE+combining sequence also causes security problems. There's no secure choice to represent spacing diacritics which are not already encoded in a precomposed form...
Re: Questions on ZWNBS - for line initial holam plus alef
Peter, in XML you really don't want to use attributes for any general text; there are too many restrictions on the content. For example, we never put translatable text into them. Attributes should really be treated more like sequences of symbols, with a constrained syntax. This is also not in violation of the Unicode conformance clause. A "space plus combining character" is a unit in some sense. That is, it is a combining character sequence (and grapheme cluster). However, there is no clause that says that such units cannot be changed, or that any particular sequence of characters cannot be changed; operations such as case mapping or normalization do just that, they change characters. There are restrictions on what can be changed *if* a process purports to not modify the text (C10). But an XML parser is certainly capable of interpreting a sequence A B, and deciding that it wants to change A to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek Alpha, *that* would be a violation of C7. But interpreting a space as a space, then deciding to modify it, is perfectly legit. Mark __ http://www.macchiato.com ► “Eppur si muove” ◄ - Original Message - From: "Peter Kirk" <[EMAIL PROTECTED]> To: "John Cowan" <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Wednesday, August 13, 2003 05:09 Subject: Re: Questions on ZWNBS - for line initial holam plus alef > On 12/08/2003 20:28, John Cowan wrote: > > >Peter Kirk scripsit: > > > > > > > >>>2) In attribute values, LF, CR, and TAB characters are normalized to > >>>spaces. Not relevant here. > >>> > >>> > >>This would be relevant if it is legal for the character after LF, CR, > >>and TAB to be a combining mark. Is this legal? In this case what was > >>previously a defective (but legal) combining sequence would turn into a > >>non-defective one, but the intended whitespace would be lost. 
> >> > >> > > > >The point is that there is no such thing as an *intended* line break in > >an attribute value; it will *always* be translated to a space before > >the application sees it. (More exactly, line-break characters can > >be inserted into attribute values, but only with the use of a numeric > >character reference such as " ".) > > > > > Sorry, I'm confused. Are you saying that the input processing will > translate line breaks into spaces within attribute values, unless > inserted as ? Well, I suppose this is fair enough as it is up to > the user not to enter garbage. > > > > > > >>Not just a rendering glitch, I suspect. If the combining character is > >>combined with the separating space, the space loses many of its > >>separating functions, and perhaps keeps a confusing subset of them with > >>all sorts of possibilities of error. > >> > >> > > > >The space(s) will be used to separate individual tokens at processing > >time. No spacing diacritic (either single-character or space+combining) > >is permitted in a NMTOKEN. > > > > > OK if this is clearly illegal, but this might restrict use of some > languages in NMTOKEN. Would NBSP + combining be allowed? > > > > > > >>At best tokens beginning with > >>combining characters will be unusable. At worst they will crash the > >>implementation (and count on someone trying deliberately to do that!). > >> > >> > > > >In effect, the combining character will constitute a defective combining > >sequence at the beginning of the individual token. > > > >Stepping away from the letter of the standard for a moment, there is > >no real reason to begin a NMTOKEN with a combining character. It is > >only allowed is a result of the miscegenation of SGML concepts with > >Unicode ones. > > > >In SGML's original design of tokens, they consisted of letters and digits > >(and a few punctuation marks, which functioned as letters). 
There were > >four kinds: a NUMBER could contain only digits, a NAME could not begin > >with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no > >restrictions. ID and IDREF had the same syntax as NAME with additional > >semantics. Later, the categories "letter" and "digit" were generalized, > >by redefining the concrete syntax, to be whatever you wanted, and were > >renamed "name-start" and "name" characters (technically, a name character > >was a letter *or* a digit). > > > >When SGML was simplified to produce XML, only NMTOKEN, the most general > &
Re: Questions on ZWNBS - for line initial holam plus alef
From: "Peter Kirk" <[EMAIL PROTECTED]> > I note that there is no line break opportunity in . But is > there one after the space in ? If so, combining character> has a third advantage, that it gives the right line > break opportunity when this sequence is word initial, which it wouldn't > do without the RLM.

Why make this so complicated when a new base character with the needed properties would be much simpler and easier to support in implementations? What is wrong with encoding new recommended alternatives to SPACE or NBSP, i.e. an invisible symbol, an invisible LTR letter, an invisible RTL letter? This way we can fix some issues in the current text of the UAXes but recommend that new writers use a new base character which will behave correctly, without those overly complex hacks that users and implementers won't understand.
RE: Questions on ZWNBS - for line initial holam plus alef
> OK, it's safe, but it is a misuse of Unicode. As space plus combining > character is a unit in Unicode, it should be treated as a unit by higher > level protocols. If higher level protocols are allowed to do arbitrary > things within Unicode units, there is no end to the possible confusion. > See for example, from Unicode 4.0 chapter 3: > > C7 A process shall interpret a coded character representation according > to the character > semantics established by this standard, if that process does interpret > that coded character > representation. If this is not the case (I'm not entirely sure this bans what XML does with spaces) then all we would need is a change so that rather than a de facto ban on space+combining within names and nmtokens we would have an explicit ban on the same; then we'd all be happy, except possibly for some sadistic XML application designer that was planning on use that combination out of ill-will towards his or her colleagues.
Re: Questions on ZWNBS - for line initial holam plus alef
- Original Message - From: "Jon Hanna" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Thursday, August 14, 2003 1:49 PM Subject: RE: Questions on ZWNBS - for line initial holam plus alef > > I do agree: a XML document could require the use at some place of a > > given attribute or element. If this attribute name follows the element > > name > > after a line break, which gets changed into a space during parsing, > > forcing > > XML parsers to treat SPACE+combining as a unbreakable grapheme > > cluster acting like a letter would have the effect of creating a new > > element > > name which may violate the lement name identity. Now suppose that the > > attribute name contains a colon, you have created a custom namespace > > name, under which you can add any element you like, even if this was > > forbidden by the content-model of the reference schema. > > 1. SPACE is treated "blindly" as a SPACE by XML. String + space + combining > + string would not be treated as a single token, no matter how that space > was introduced. That's what you were complaining about in the first place > (as far as I can make out). > 2. While nmtokens can begin with a combining character names cannot, nor can > they contain spaces. > 3. This would in no way change the content-model. So even if the above two > points didn't hold they would only sneak the document past something which > performed validation before parsing(!), and where the content-model was > already pretty loose (so it didn't complain about the unrecognised > attribute). > > You've just discovered a way to disguise one document that isn't well-formed > as a different document that isn't well-formed. l33t! 
> > > So this would invalidate existing documents, or create holes allowing > > insertion of arbitrary XML content, if the XML application is not > > validating extremely strictly the element names (the pair namespace+ > > name) and exclude completely from processing any unrecognized > > element (including all its content and attributes). > > This argument is not on friendly terms with the concept of causality. > > This would be a > > breach in the content model which may have been validated and tested > > for security in another layer of the document encoding process (notably > > when XML documents are created from templates, such as XSL > > processors, or custom C source using simple template substitution). > > Testing validity without testing well-formedness is not possible. > > > So for me the sequence SPACE+combining should not be acceptable > > as a valid grapheme cluster within element names or attribute names, > > As it already isn't. > > > and thus would need to be excluded from NMTOKEN. The correct > > way to do it is to consider it NOT A LETTER, but a symbol (Sk), > > exactly like other spacing diacritics, which are already invalid in > > NMTOKEN. > > Wait a second. That was my justification for why the fact that > space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered a > failure on the part of XML to allow for freedom of choice with the strings > used for NMTOKENs. Now you actually want to introduce this (already > existent) feature. > > > There still remains the unresolved question of grapheme clusters > > that could span the starting "<" or ending ">" or "/>" of tags, or > > the leading "&" of a entitity reference. > > No there isn't. What goes before <, >, / or & isn't a problem since those > are all non-combining characters and a new unit for any sort of processing > treating more than one codepoint as a unit. 
What goes after < or & has to be > a name (not an nmtoken) and as such is already prohibited from beginning > with a combiner. What goes after > is already dealt with by the Charmod, and > even if you ignore charmod apart from the possibility of normalisation > turning the sequence U+003E, U+0338 into U+226F (a possibility that is well > noted) it still isn't going to hurt. One note: in Unicode, grapheme clusters (considered unbreakable) are more than just combining sequences! Look at CGJ, WJ, ZWJ, ... So what is after or *before* a base character may impact parsing grapheme clusters! As the well-formedness of XML documents goes even before its validity (which is optional, but required in some applications that need to parse the DOM-tree or InfoSet rather than), this impacts the way Unicode can be used (read it as "embedded") within XML. Depending on where this encoded text is used (NMTOKENs, text elements, attribute values,...) the em
Re: Questions on ZWNBS - for line initial holam plus alef
From: "John Cowan" <[EMAIL PROTECTED]> > Peter Kirk scripsit: > > On 13/08/2003 11:09, Philippe Verdy wrote: > > > > >... For this reason, defective > > >combining sequences (combining characters without a leading base > > >character) should be forbidden (invalid for XML). > > > > > > > > If there is even the remotest possibility of this happening, we need to > > know quickly! > > As a member of the XML Core Working Group of the W3C, I can assure you that > there is not even the remotest possibility of it.

OK, "forbidden" is possibly excessive. Would you prefer "strongly discouraged in favor of a new encoding" that could be used by applications concerned about security and parsing issues? If no such new encoding is proposed, XML Core WG members could at least discuss how to solve the security problems. There may be solutions that I have not thought of...
RE: Questions on ZWNBS - for line initial holam plus alef
> > <> > > > > Spam from Philippe Verdy not tolerated: any unsolicited message will be > > reported to his Internet service provider. > > There was no spam in the message you deleted. This was a > single post to the list, no cross-posting, no advertising, no > product sold, no money claimed, no required action, no > identity forged, and no deceptive subject line, the message > was on topic... > > Reread the definition of spam: "bulk + unsolicited". May > be you don't like my message, but reporting it to my ISP will > not be successful for you, and in fact you risk more by doing > so because my ISP could complain to yours.

Some people just don't get sarcasm... ;-( For someone who has such a sentence as yours at the end of their mails, you do generate quite a lot of unsolicited comments, misreading the contents of the messages you reply to. (Yes, that does annoy me.)

> If you think you don't like my message which was on topic, > don't reply to it, delete it, ignore it, but don't do such > false claim...

Please don't go out on a limb about something that the message you replied to did not talk about. Please read the message first, and understand what it's about. Then read it again, and make sure you have not misread. And keep any replies at a suitable length. And no, I will not let you mislead readers of this list by not commenting on what you write. I'm not the only one chastising you. You may have noticed e.g. Ken do the same, in no subtle ways. You are of course welcome to participate in the discussions on this list. But please,
. be careful about terminology, and about what is what,
. don't misread, and misreply, to messages from others,
. keep your posting at a reasonable length, concentrated on the issue at hand,
. don't mislead readers by erroneous statements formulated as unquestionable truths,
. (I'm sure to have missed something...).
That way you can contribute positively to the discussion, while not constantly annoying or misleading people.
Don't worry Philippe, I of course never intended to report you anywhere. Just trying to get you to behave a bit more conscientiously. /kent k > Thanks. >
RE: Questions on ZWNBS - for line initial holam plus alef
> From: "Jon Hanna" <[EMAIL PROTECTED]> > > Some of these only apply to elements that do not allow any > > character data apart from whitespace to appear directly within them, and > > hence are not an issue here. Some happen at relatively high level of > > processing, e.g. rendering (not parsing) of HTML, and as such should > > correctly process spaces combined with combining characters. > > Here I have to disagree: in XML, the normalization of whitespaces occurs > during parsing before the DOM tree is built, and so the initial > whitespaces > are made inaccessible; rendering occurs only later based on the parsed > DOM tree. This is to ensure the equivalence of the encoding under very > strict conditions defined in the XML standard (and retrofitted now in the > HTML standard to mimic the standard practices of HTML 4.01 in > XHTML 1.0 (and now 1.1 with the XHTML modularization). Lots of different things happen that affect the whitespace of an XML document (whether a DOM tree is constructed or not, since it isn't the only legal way to process an XML document). Of course rendering can do something further to parsing with whitespace. Rendering can do whatever the rendering engine wants to do, it isn't defined by XML. When an application receives U+0020, U+0020, U+0302, U+0020 then it should probably (unless there are good application-specific reasons why not) treat that more or less the same as if it had received U+0020, U+005E, U+0020 (if there are minor glyph differences fair enough). This isn't a matter of XML's whitespace rules, but it is a matter of how what we are discussing affects XML-based technology as a whole. Further it is completely true that some of the rules only affect elements that only allow element content. 
> Strict conformance for the behavior of these whitespaces is mandatory and > cannot be bypassed or negociated, Well if a non-validating parser hasn't seen a declaration for an attribute of type NMTOKENS it would treat it as being of type CDATA which would alter how whitespace was treated. However that is mostly correct, it just isn't a problem except if someone attempts to use the sequence {space, combining char} in a name or nmtoken, which as I said would be a pretty bizarre design decision anyway. notably when XML data needs to be > certified against alteration, i.e. cryptographically signed. (XML > signature > is now standardized), or when the DOM tree is used and altered in a > predictable way with technologies like XPath which needs to refer to > exact encoding position in the encoded Unicode NFC form of text elements, > attribute values, or CDATA sections. Yep, Yep, Yep. Still doesn't mean there is any problem.
RE: Questions on ZWNBS - for line initial holam plus alef
... > > (them being canonically equivalent)? An invisible combining > character > > does not interfere typographically with anything, it being > invisible! > > The same thing can be said about any inserted invisible character, > combining or not. > > How is: supposed to be different from > The first would be an å followed by separate dot below (under a space, according to p. 131 of TUS 3.0). The second one is an with a separate ring above (over a space according to TUS 3.0 p. 131). > How is: supposed to be different from > As above (yea, would look the same as ; but neither of these are singe combining sequences). > In display, they might not be distinct, unless you were doing > some kind of > show-hidden display. Yet these sequences are not canonically > equivalent, and the presence of an embedded control character or an > embedded format control character would block canonical reordering. > > Of course, they *might* be distinct in rendering, depending on > what assumptions the renderer makes about default ignorable > characters and their interaction with combining character sequences. > But you cannot depend on them being distinct in display -- the > standard doesn't mandate the particulars here. Well, it does (did?) say "should"... > Whether you think it is *reasonable* or not that there should be > non-canonically equivalent ways of representing the same > visual display, sequences such as those above, including sequences > with CGJ, are possible and allowed by the standard. They are: > >a. well-formed sequences, conformantly interpretable >b. could be displayed by reasonable renderers, making reasonable > assumptions, as visually identical > > I have been pointing out use of the CGJ, which *exists* as an encoded Regrettable! > character, and which has a particular set of properties defined, > would result in the kinds of non-canonically equivalent ordering > distinctions required in Hebrew, if inserted into vowel sequences. 
As I've mentioned, if restricted (similar to the VS restrictions) to particular cases (like just before (or between) Hebrew (and Arabic) vowel marks, then ok. But only because the combining classes of the Arabic and Hebrew vowel marks are bizarre (read: wrong). ... > > The other invisible (per se!) combining characters with combining > > class 0, the variation selectors, are ok, since their *conforming* use > > is > > vary highly constrained. Maybe I've been wrong, but I have taken > > CGJ as similarly constrained as it was given a semantics only when > > followed by a base character (but now it seems to have no semantics > > at all). > > There was no such constraint defined for CGJ. While perhaps not explicitly stated as a restriction, the only *intended* use (after some suggestions had been dropped) was to be at the *end* of a combining character sequence. > The current statement > about CGJ is merely that it should be ignored in language-sensitive > sorting and searching unless "it specifically occurs within > a tailored collation element mapping." There is no constraint > on what particular sequences involving CGJ could be tailored > that way, and hence no constraint on what particular sequences > CGJ might occur in, in Unicode plain text. > > > > A combining character sequence is a base character followed > > > by any number of combining characters. There is no constraint > > > in that definition that the combining characters have to > > > have non-zero combining class. > > > > Well, you cannot *conformantly* place a VS anywhere in a combining > > sequence! Only certain combinations of base+vs are allowed in > > any given version of Unicode. (Breaking that does not make the > > combining sequence ill-formed, or illegal, but would make it > > non-conformant, just like using an unassigned code point.) > > Actually, it is not non-conformant like using an unassigned > code point would be. 
The latter is directly subject to conformance > clause C6: > > C6 A process shall not interpret an unassigned code point as an >abstract character. > > The case for variation sequences is subtly different. Suppose > I encounter a variation sequence , where X could be > any Unicode character. X itself is conformantly interpretable. > VS1 itself is conformantly interpretable. The constraints are > on the interpretation of the variation sequence itself. And > they consist of: > > "Only the variation sequences specifically defined in the >file StandardizedVariants.txt in the Unicode Character >Database are sanctioned for standard use; in all other >cases the variation selector cannot change the visual >appearance of the preceding base character from what it >would have had in the absence of the variation selector." > > In other words, you can drop VS1's to your heart's content into > plain text, but a conformant implementation should ignore all > of them, unless a) it is interpreting variation selec
Re: Questions on ZWNBS - for line initial holam plus alef
At 10:58 -0700 2003-08-11, Peter Kirk wrote: On 11/08/2003 06:59, Jon Hanna wrote: There are only two theoretical problems that I can see here, the first is that a whitespace character other than space gets converted to space by attribute value normalisation, and that this changes the meaning of the text in some way. This could only occur if the combining character were the first character in a line of text, which is quite a nonsensical construct to begin with. Not at all! Imagine a tutorial on a language, which might well list the accents used, in a format like this: ` (grave accent) is used with a, e and o, and indicates more open pronunciation ^ (circumflex accent) is used with any vowel, and indicates lengthening So far so good, but when I get to an accent with no predefined spacing variant, I have a problem! It has been explained the mechanism for doing this, and it has been explained that if it is not implemented correctly you should yell at the implementors. In Mac OS X, for instance, the horizontal spacing seems to work all right for many accents, but they seem to prefer to rest just above the baseline. I'll report this as a rendering bug. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
Philippe replied: > From: "Kenneth Whistler" <[EMAIL PROTECTED]> > > Of course a standard which mandates space folding is also > > within its rights to mandate, for example, the non-use of > > nonspacing marks applied to SPACE characters. It can simply > > rule out such sequences as valid for its context, in which > > case the problem goes away. > > Try to change now the XML or even the HTML or SGML > standards! I'm not trying to. > The use of space folding was standardized and widely > used long before Unicode published a workable standard. For HTML and XML this was clearly not the case, since the Unicode Standard was published before either of those standards, and was used as the document character set for both. For SGML, I grant you the practice is older. > So this > is a unsolved problem whose Unicode is the only source! It isn't an unsolved problem, as I can see. Look, HTML 4.0.1 defines ASCII space (U+0020), along with TAB, FF, and ZWSP, as "white space characters". The sum total I can find that it has to say about collapsing white space is: "In particular, user agents should collapse input white space sequences when producing output inter-word space." *Even if* this were to be taken as a mandate to blindly convert each sequence into , regardless of the presence of non-spacing marks in the data, which I doubt is the intent of the standard, the fix for that would be to simply apply the combining mark to U+00A0 NBSP instead. U+00A0 NBSP is *not* specified to be a "white space character" in HTML 4.0.1, and thus seems not to fall under the recommendation regarding collapsing white space sequences. > Now support of this space folding is a FULL MANDATORY part > of the XML standard, and it is by far more important and more > widely used than SPACE+diacritics sequences in plain-text Unicode. XML 1.0, section 2.10: "In editing XML documents, it is often convenient to use 'white space' (spaces, tabs, and blank lines) to set apart the markup for greater readability. 
Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, 'significant' white space that should be preserved in the delivered version is common, for example in poetry and source code. An XML processor must always pass all characters in a document that are not markup through to the application. ... A special attribute named xml:space may be attached to an element to signal an intention that in that element white space should be preserved by applications. ..." It is perfectly reasonable, as I see it, to consider the in a sequence to be: a. significant b. part of the characters in a document that are not markup (at least in the cases we are talking about, since the problem is not about defining Nmtokens for markup in Biblical Hebrew, but rather the representation of the Biblical Hebrew document content itself) So I *still* don't see the problem you are on about, and even if there was one, the xml:space attribute could be used to require preservation of a particular space. > If Unicode members can't fix it, will the W3C need to create a > formal request to you? What are you going on about? W3C architects have been familiar with this Unicode convention of applying NSM's to SPACE or NBSP as a means of representing an isolated spacing diacritic for years. --Ken
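Editorial sketch of the white-space point made in the message above. It assumes a simplified collapsing step that replaces each run of HTML "white space characters" with one SPACE. The set used is the one listed in the message (SPACE, TAB, FF, ZWSP), with CR and LF added as an assumption about line breaks. The function name `collapse_white_space` is hypothetical and does not come from any specification.

```python
import re

# Assumed set of collapsible characters: SPACE, TAB, FORM FEED and
# ZERO WIDTH SPACE as listed in the message, plus CR and LF (an assumption).
# U+00A0 NO-BREAK SPACE is deliberately absent from the set.
_WHITE_SPACE_RUN = re.compile(r"[ \t\f\u200b\r\n]+")

COMBINING_ACUTE = "\u0301"
NBSP = "\u00a0"


def collapse_white_space(text: str) -> str:
    """Replace every run of collapsible white space with a single SPACE."""
    return _WHITE_SPACE_RUN.sub(" ", text)


# NBSP is not matched, so NBSP + combining mark passes through untouched
# even when the surrounding ordinary spaces are collapsed.
safe = collapse_white_space("a  " + NBSP + COMBINING_ACUTE + "  b")
```

Under these assumptions the NBSP-based sequence survives collapsing intact, which is the workaround Ken describes.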
Re: Questions on ZWNBS - for line initial holam plus alef
Philippe Verdy wrote: > Spacing diacritics are not "on the edge" of the standard, The "edge" I was speaking of was the requirement for the exact display width of a nonspacing diacritic on top of a SPACE to be specifiable in some determinant way. > when they > are already given a full block and handled there as symbols (not as > letters as suggested in some parts of UAX's), with their own identity > independant of their actual glyphic representation. I am not > discussing about the typesetting of these grapheme clusters but > really about the textual semantics of such combining sequences > with an invisible base character, affecting all their properties and > not fully described in the various standard annexes. In case you didn't notice, I was responding to Peter Kirk's note -- not to yours. > Due to the > huge legacy use of SPACE+diacritics in legacy text, and the > already normative parts of some standard annexes, it will be hard > to correct the behavior or change the text of these annexes. Um, yes. > And it's where a new better base character than SPACE could > help solve cleanly the ambiguities. Um, no. Precisely because it would introduce *another* way to do what is already specified in the standard. It would, I predict, lead to nothing but more trouble. You might, perhaps, find it satisfying, but I can guarantee that there would then be a future critic complaining about an unnecessary distinction introduced into the standard. And then there would be *more* text in different places of the standard to try to correct and change, in an attempt to try to make consistent distinctions between the behavior of and . --Ken
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit: > The gap may not be large, but Philippe, John H and I have identified a > real gap. Why this antagonism against filling it? What you have identified is a set of implementation defects, not problems with the Unicode Standard. The standard way to do what you want is to precede the combining mark with SP or NBSP. If that "doesn't work", then the implementation that makes it not work needs to be fixed. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan Does anybody want any flotsam? / I've gotsam. Does anybody want any jetsam? / I can getsam. --Ogden Nash, _No Doctors Today, Thank You_
Re: Questions on ZWNBS - for line initial holam plus alef
On Monday, August 11, 2003 2:05 AM, Kenneth Whistler <[EMAIL PROTECTED]> wrote: > Um, no. Precisely because it would introduce *another* way > to do what is already specified in the standard. It would, I > predict, lead to nothing but more trouble. > > You might, perhaps, find it satisfying, but I can guarantee > that there would then be a future critic complaining about > an unnecessary distinction introduced into the standard. And > then there would be *more* text in different places of the > standard to try to correct and change, in an attempt to > try to make consistent distinctions between the behavior > of and . I don't think so: for texts that are already coded with SPACE+NSM, it won't be needed to do changes, as long as applications using them are satisfied with their existing behavior, even if it's ambiguous or causes problems in other applications. The rule would be not to change things, but offer to writers a way to create new texts without those ambiguities and problems, and correct them if authors wish it. For me, the "ACCENT ANCHOR" if you call it like this, is solving the usage of isolated diacritics as plain letters (such as the implied missing y in Hebrew Yerushala(y)im), and so would behave like an alphabetic character (whose directionality is still to define...) Existing coded spacing diacritics are coded as symbols (Sk) and mostly for accents used in LTR scripts, so the confusion of these symbols with letters behavior in some UAX's which give them the AL property (including for one case of SPACE+NSM) is not a problem. The usage as symbols is mostly correct for the case where a text is speaking about a diacritic as a isolated symbol and not within words (this is correct for most languages). The usage within words (for an implied missing base letter, including when this missing letter is an initial) leaves a distinct hole (for example if one was trying to encode a word like "(Y)erushala(y)im", where the missing base letter is the initial. 
For languages written in Arabic and South Asian scripts, there is no problem, as there already is a base letter to hold initial combining vowel signs; this also works for multiple combining vowels which should not stack but be written on this base letter. In fact, in those scripts the missing consonantal base letter is actually written with a visible glyph. But for Latin, Cyrillic, Greek, Hebrew, and probably other scripts, isolated diacritics are missing an explicit coded form. And even for Arabic and Brahmic scripts there is still a need to be able to speak about the diacritic itself, without an explicit base letter, so the SPACE+NSM combining sequence is for now the only solution, with the problems of its undocumented properties. Reread some UAXes to see the problematic impact of SPACE+NSM in areas which are NOT related to rendering, notably when extracting word sequences (for search and indexing), managing keyboard selections, computing line breaks, and handling directionality. Now consider the even greater impact of the legacy use of SPACE as a normalizable padding whitespace (a key feature of SGML, HTML and XML): the legacy use of SPACE+NSM causes too many problems that won't satisfy authors, who in some cases will not be able to use it because it will not work as expected. Due to these problems, authors then resort to even worse hacks, like putting a control before the NSM, even though this creates "defective" combining sequences, the dotted circle is sometimes displayed, and the text is parsed with an invisible but still additional grapheme cluster for the control itself, whose presence is pollution. Instead of forcing authors to use defective combining sequences like control+NSM, which would be an even worse hack, why not designate a clean and pure invisible base character with the required properties, so that it creates a pure combining sequence for the isolated diacritic(s)?
So the question is which invisible base character(s) to define, and with which properties:
- An invisible symbolic base character (Sk), with neutral directionality (I called it an INVISIBLE SYMBOL);
- An invisible letter base character (Lo) with neutral directionality (you call it an ACCENT ANCHOR, and I called it an INVISIBLE LETTER), or
- An invisible letter base character (Lo) with LTR directionality and
- An invisible letter base character (Lo) with RTL directionality
Personally, I find the term ACCENT ANCHOR ambiguous: it does not indicate the usage precisely (it better describes the current ambiguous usage of SPACE as this anchor for accents), and it seems to restrict the kind of diacritic or other combining mark that may (should?) be applied to it. In addition, nothing would forbid combining several diacritics or marks on this base character. Consider then that these new characters are better base characters than SPACE, and define them with only a compatibility decomposition to SPACE, to match the previous encoding. If those new base characters are us
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk responded: > On 11/08/2003 06:59, Jon Hanna wrote: > > >There are only two theoretical problems that I can see here, the first is > >that a whitespace character other than space gets converted to space by > >attribute value normalisation, and that this changes the meaning of the text > >in some way. This could only occur if the combining character were the first > >character in a line of text, which is quite a nonsensical construct to begin > >with. > > > > > Not at all! Imagine a tutorial on a language, which might well list the > accents used, in a format like this: > > ` (grave accent) is used with a, e and o, and indicates more open > pronunciation > ^ (circumflex accent) is used with any vowel, and indicates lengthening We're going round and round in circles here. Those are not lines starting with a combining character, but lines starting with a *spacing diacritic*. > > So far so good, but when I get to an accent with no predefined spacing > variant, I have a problem! Either you have the spacing diacritic encoded (as in those instances), or the standard indicates that you can represent one by applying the nonspacing, *combining* mark to SPACE. In those instances, the line still doesn't start with a combining mark -- it starts with a SPACE character serving as the base character for the combining mark. --Ken
Re: Questions on ZWNBS - for line initial holam plus alef
On 10/08/2003 15:27, Kenneth Whistler wrote: Peter Kirk said: Tell Microsoft! (See Noah Levitt's posting.) Indeed. If this is indeed "The standard way to do what you want", then the standard needs to make it clear that the sequence of or has the properties which I want, i.e. it has the width of the combining mark alone, and not the full width of a space, This is up to the implementation and the font, and is not something that the Unicode Standard should mandate, IMO. This steps over the bound of the plain text content. ... Continuing to require that the Unicode Standard *must* specify some inherent mechanism for indicating the display width of combining character sequences clearly steps over the bounds of what is required to represent plain text content. --Ken Thank you, Ken. Well, you make it sound as if the problems are minimal, and that version I can just about accept. But if Philippe is correct about what he says about UAX#29 and UAX#14, there are some more serious problems. It is certainly highly inappropriate for non-spacing diacritics to be considered word boundaries. Philippe's quotations also show that Unicode does concern itself with details of character positioning and not just with plain text. Since Unicode does specify all kinds of properties to do with spacing, breaking, word and sentence boundaries, bidi behaviour etc etc, it is within the scope of Unicode and indeed the responsibility of Unicode to define appropriate values of all of these properties for spacing diacritics. I accept that some things I have mentioned may have gone beyond this responsibility, so I will withdraw those comments and continue to push only for appropriate values of the properties which Unicode does define. And, if Philippe is correct, many such properties are currently inappropriately defined, and so either the text needs to be changed to correct these mistakes or a new mechanism needs to be introduced. 
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Questions on ZWNBS - for line initial holam plus alef
... > >and "interfere typographically" are assigned the same > combining class, > >while those that don't get different classes, ... > > > Not true, as we have seen for Hebrew. It's supposed to be true, but > isn't, and the problems can't be fixed. The combining classes for Hebrew (and Arabic) vowels are bizarre. I have no idea how they came about. They should (ideally) probably have been dealt with in the same way as Indic vowels. /kent k
RE: Questions on ZWNBS - for line initial holam plus alef
> > If this is indeed "The standard way to do what you want", then the > > standard needs to make it clear that the sequence of > > mark> or has the properties which I > want, i.e. it > > has the width of the combining mark alone, and not the full > width of a > > space, > > This is up to the implementation and the font, and is not something > that the Unicode Standard should mandate, IMO. This steps over the > bound of the plain text content. I may agree with that, but id does not answer the questions I had earlier: How should a freestanding double diacritic be encoded (for purposes of meta-discussions, and the like): or ? How should combining characters (spacing as well as non-spacing) that are not vertically centered *roughly* be displayed, e.g. , should that *roughly* be displayed with or without a typographic void to the left of it? So if I want a space (though not an overgrown one), should one use ? Or even , to prevent "space collapse". And similarly for left-side combining characters. Likewise for defective combining sequences. If I want a visible pseudo-base, a dotted ring, or an underline, the answers are fairly clear, using a suitable character as a base. But not for the cases above. I don't think that should entirely up to each font (maker), without any recommendation. (A "should" rather than a "shall" is quite sufficient.) /kent k
Re: Questions on ZWNBS - for line initial holam plus alef
On 07/08/2003 07:27, Philippe Verdy wrote: On Thursday, August 07, 2003 2:40 AM, Doug Ewell <[EMAIL PROTECTED]> wrote: Kenneth Whistler wrote: But I challenge you to find anything in the standard that *prohibits* such sequences from occurring. I've learned that this question of "illegal" or "invalid" character sequences is one of the main distinguishing factors between those who truly understand Unicode and those who are still on the Road to Enlightenment. ... If the term "valid" cannot be changed, then I suggest defining "conforming" for encoded text independantly of its validity (a "conforming text" would still need to use a "valid encoding"). As a very quick thought, maybe what we need is not restrictions to the Unicode standard but a set of rules for each language or group of languages, defining exactly how Unicode characters should be used to write the words etc of that language. Such definitions might be independent of the actual Unicode standard. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk said: > Tell Microsoft! (See Noah Levitt's posting.) Indeed. > > If this is indeed "The standard way to do what you want", then the > standard needs to make it clear that the sequence of mark> or has the properties which I want, i.e. it > has the width of the combining mark alone, and not the full width of a > space, This is up to the implementation and the font, and is not something that the Unicode Standard should mandate, IMO. This steps over the bound of the plain text content. > and does not expand for justification, This is likewise an issue for the implementation. The Unicode Standard does not mandate how a typographic implementation must implement interword, intercharacter, or any other kind of justification. > is not a line breaking > opportunity, This, however, *is* specified. See UAX #14, in the section discussing CM (the line break class associated with combining marks): "If U+0020 SPACE is used as a base character, it is treated as AL instead of SP." What that means is that rather than sifting down through the line break rule determinations according to a lb=SP category, it is then handled as lb=AL, which puts it in the same class with ordinary letters for the purposes of determining a line break opportunity. Of course, a conformant Unicode implementation is not *required* to implement line-breaking as specified in UAX #14. But if it claims it is doing so, and does not handle SP+combining_mark combinations this way, then it is a nonconformant implementation of line-breaking. > does not in fact have any of the properties of a space. It does, in fact, have some of the properties of a space, since it is U+0020 SPACE, after all. But the important fact is that implementations are supposed to be implementing the semantics of the combining character sequence taking the SPACE as the base and any following *non*-spacing combining mark as applied to that base. 
If the implementations then result in inappropriate rendering or line-breaking for that sequence, that is, as Kent said, an issue to take up with the implementers. > I > expect to see such a clarification in the next edition of the Unicode > Standard. See above for the reasons why it is unlikely to be any more constrained by the standard than it already is. A point I keep trying to make, but which often gets overlooked by people trying to code Unicode mechanisms for dealing with edge cases, is that the design goal of the Unicode Standard is, and always has been, to represent *plain text content*. It cannot, and should not, IMO, deal with requirements for representing arbitrarily fine distinctions of typographical detail in all manuscripts and other documents in all writing systems of the world. Continuing to require that the Unicode Standard *must* specify some inherent mechanism for indicating the display width of combining character sequences clearly steps over the bounds of what is required to represent plain text content. --Ken
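Editorial sketch of the UAX #14 rule quoted in the message above ("If U+0020 SPACE is used as a base character, it is treated as AL instead of SP"). This is a toy under stated assumptions and not an implementation of the line breaking algorithm: it approximates the CM class by the general categories Mn, Mc and Me, the function names are hypothetical, and later revisions of the annex may treat this case differently.

```python
import unicodedata


def is_combining_mark(ch: str) -> bool:
    """Approximate line break class CM by general category (an assumption)."""
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def line_break_class_of_space(text: str, index: int) -> str:
    """Classify the SPACE at text[index] following the rule quoted above.

    Returns "AL" when the SPACE serves as the base of a combining mark,
    otherwise "SP".
    """
    if text[index] != " ":
        raise ValueError("expected U+0020 SPACE at the given index")
    following = text[index + 1 : index + 2]  # empty string at end of text
    if following and is_combining_mark(following):
        return "AL"
    return "SP"
```

With this reading, the SPACE in "a", SPACE, U+0301, "b" is handled like an ordinary letter when looking for break opportunities, while a plain interword SPACE keeps its usual class.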
RE: Questions on ZWNBS - for line initial holam plus alef
> > Michael wrote: > > > The Name Police reject this utterly. ZERO WIDTH cannot have an > > > expanding dynamic width. > > > > Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238, > > "can grow to have a visible width when justified"? And it has the > > NamesList comment: > > * nominally zero width, but may expand in justification <> Note that *my* comment ("> >" above) only referred to the "name policing" (given that the policing principle Michael mentioned is already broken). /kent k Spams de Philippe Verdy non tolérés: tout message non sollicité sera rapporté à son fournisseur de services Internet.
Re: Questions on ZWNBS - for line initial holam plus alef
John Cowan asked: > > D17a Defective combining character sequence: A combining character > > sequence that does not start with a base character. > > > > * Defective combining character sequences occur when a sequence > >of combining characters appears at the start of a string or > >follows a control or format character. Such sequences are > >defective from the point of view of handling of combining > >marks, but are not ill-formed. > > ^^ > > What, if anything, does the term "ill-formed" mean when attached to > a sequence of characters? Nothing, really. The bullet goes on to point to the definition (D30) of "ill-formed", which applies to code *unit* sequences in the context of the encoding forms. The rewrite of Chapter 3 of the Unicode Standard dispensed with the ill-advised ;-) and confusing distinction between "illegal", "irregular", and "ill-formed" "code value sequences" in the context of the discussion of "transformations", in favor of a much starker and simpler distinction: a code unit sequence is either well-formed or it is not > I understood that every sequence of > characters whatsoever is permitted. As regards code *point* sequences, these sequences can either be conformant to the standard or not conformant to the standard. They are conformant if they meet the conformance requirements (the "C" clauses of Chapter 3). And as regards sequences of characters that basically comes down to not trying to interchange reserved or noncharacter code points. So if you include an reserved (unassigned) code point (for a particular version of the Unicode Standard) in an interchanged data stream, a recipient could claim that data stream is not conformant to (that version of) the standard. Shorthand: the data contains "illegal" characters. 
But even that is relative to the version of the standard, since a recipient of reserved code points is obliged to preserve their values -- they may, after all, be "legal" assigned code points in a future version of the standard that that particular implementation is not supporting. So, yeah, basically every sequence of code points "assigned to abstract characters" is "legal" for interchange. What you cannot interchange are code points with gc=Cs (U+D800..U+DFFF) or code points with gc=Cn (noncharacters and reserved). What D17a is trying to tell people is that while certain sequences of Unicode characters may be "defective" from the point of view of certain kinds of processing -- in this case rendering of combining character sequences -- that does not make them ill-formed (for which see the specification of encoding forms), nor does it make them nonconformant to the standard. There are many sequences of Unicode characters that we could dream up which would be abominable, distasteful, problematical, defective, implementation-busting, or just plain screwy, but the standard itself isn't prohibiting people from conformantly creating such sequences and then challenging Microsoft or anybody else to display them without blowing a gasket. One of the reasons why we have to be so incredibly careful now before introducing conceptually new *types* of characters, like the COMBINING GRAPHEME JOINER or such things as INVISIBLE BASE CHARACTER or COMBINING CLASS CHANGER or whatnot, is precisely that it gets harder and harder to program defensively against all the possible combinations and interactions that such beasties might have when mixed with everything else that is available. --Ken
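Editorial sketch of definition D17a as quoted above: a combining character sequence is "defective" when the marks appear at the start of a string or follow a control or format character. The helper name `defective_positions` is hypothetical, and the check uses general categories (M* for marks, Cc and Cf for controls and formats) as a simplifying assumption. As Ken stresses, such sequences are still not ill-formed and not nonconformant, so the function only reports them.

```python
import unicodedata


def defective_positions(text: str) -> list[int]:
    """Return indices of combining marks that begin a defective sequence.

    A mark is reported when it is the first character of the string or
    when the character before it is a control (Cc) or format (Cf) character.
    Marks that merely continue a sequence are not reported.
    """
    hits = []
    for i, ch in enumerate(text):
        if not unicodedata.category(ch).startswith("M"):
            continue  # not a combining mark
        if i == 0 or unicodedata.category(text[i - 1]) in ("Cc", "Cf"):
            hits.append(i)
    return hits
```

A string that begins with U+0301 is flagged at index 0, whereas SPACE followed by U+0301 is not flagged, because the SPACE acts as the base character.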
Re: Questions on ZWNBS - for line initial holam plus alef
On Saturday, August 09, 2003 11:14 PM, Peter Kirk <[EMAIL PROTECTED]> wrote: > On 09/08/2003 13:41, John Cowan wrote: > > > Peter Kirk scripsit: > > > > > > > > > The gap may not be large, but Philippe, John H and I have > > > identified a real gap. Why this antagonism against filling it? > > > > > > > > > > What you have identified is a set of implementation defects, not > > problems with the Unicode Standard. The standard way to do what > > you want is to precede the combining mark with SP or NBSP. If that > > "doesn't work", then the implementation that makes it not work > > needs to be fixed. > > > > > > > Tell Microsoft! (See Noah Levitt's posting.) And the W3C or SGML committees with the *ML character model! > If this is indeed "The standard way to do what you want", then the > standard needs to make it clear that the sequence of mark> or has the properties which I want, i.e. > it has the width of the combining mark alone, and not the full width > of a space, and does not expand for justification, is not a line > breaking opportunity, does not in fact have any of the properties of > a space. I expect to see such a clarification in the next edition of > the Unicode Standard. Don't forget the issues created by the fact that in many cases there is no other way than to use "defective" sequences, hoping that the implementation will render the diacritic alone and not the dotted circle, and will space the diacritic correctly. For now, the solution of putting any (unspecified) control character before the diacritic is really a trick and is not interoperable. It also complicates plain-text search, since there is no predictable or stable base character to match this diacritic. In addition, many input methods or keyboard drivers will not allow you to enter such a "defective" sequence, meaning that, for example, the word "Yerushala(y)im" cannot be entered and searched for exactly within a large text, as the implied invisible letter has no stable representation.
Note that the CGJ solution will not work when the isolated diacritic must be the initial of a word or breakable token: for this case, the solution with SPACE is really tricky due to the special treatment of SPACE, notably in HTML, SGML, XML and often SQL, which "normalize" whitespace. Thankfully, the existing spacing diacritics do not have these problems, as they are not canonically equivalent to the suggested SPACE+diacritic "compatibility equivalent". However, this is only part of a solution, for some diacritics (not ALL), and it only covers their use as symbols, not as regular letters within the same word with surrounding letters. So there are really two gaps: a small gap for missing spacing diacritics used as symbols, and a large gap for all isolated diacritics used within a word (which the CGJ solution only solves in the middle or at the end of a word, but not at its start). -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
RE: Questions on ZWNBS - for line initial holam plus alef
Ken Whistler wrote on 08/06/2003 03:19:34 PM: > > Again, why should not be canonically > > equivalent to , when > dot below> is canonically equivalent to ? > > And I want a design answer, not a formal answer! (The latter I already > > know, and is uninteresting.) > > The formal answer is the true and interesting answer! > > It shouldn't be canonically equivalent because it *isn't* > canonically equivalent. > > But instead of obsessing about the particular case of the CGJ, > admit that the same shenanigans can apply to any number of > default ignorable characters which will not result in visually > distinct renderings under normal assumptions about rendering. What I think is different here, Ken, is that a suggestion has been made that CGJ be recommended for use within a combining sequence in order to maintain a distinction for Biblical Hebrew, which it does by virtue of it's property of blocking canonical reordering. No other default ignorable has ever been specifically given this function. In introducing this function for a particular character (CGJ, in this case), the issue really arises for the first time. And I don't think it's insignificant: surely there will be implementers out there wondering what the implications are with a canonical-reordering blocker that can be inserted into sequences creating a distinction where none previously existed -- and where none was ever desired. (I think I mentioned this issue shortly after the CGJ suggestion was first raised.) - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
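Editorial sketch of the property Peter Constable refers to: U+034F COMBINING GRAPHEME JOINER has canonical combining class 0, so it blocks canonical reordering of the marks on either side of it. The example sequence (LAMED with PATAH then HIRIQ) is an assumption chosen for illustration and is not quoted from the message; it uses Python's standard `unicodedata` module.

```python
import unicodedata

LAMED = "\u05dc"  # HEBREW LETTER LAMED, combining class 0
PATAH = "\u05b7"  # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05b4"  # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034f"    # COMBINING GRAPHEME JOINER, combining class 0

# Without CGJ, normalization sorts the two points by combining class,
# so the written order PATAH, HIRIQ is lost.
plain = LAMED + PATAH + HIRIQ
plain_nfc = unicodedata.normalize("NFC", plain)

# With CGJ between them, the class-0 character stops the reordering
# and the written order is preserved.
guarded = LAMED + PATAH + CGJ + HIRIQ
guarded_nfc = unicodedata.normalize("NFC", guarded)

reordered = plain_nfc == LAMED + HIRIQ + PATAH
preserved = guarded_nfc == guarded
```

The two normalized strings are not canonically equivalent, which is the new distinction the message warns implementers about.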
Re: Questions on ZWNBS - for line initial holam plus alef
On 08/08/2003 17:27, Kenneth Whistler wrote: Philippe continued: On Saturday, August 09, 2003 12:49 AM, Michael Everson wrote: At 14:22 -0700 2003-08-08, Kenneth Whistler wrote: Philippe, you are tilting at windmills, here. There is no chance that the UTC is going to consider such a character, in my assessment, let alone give it the properties you suggest. Nor WG2 either. Why that? Because I suggest something that some other may think as useful to fill a large gap in Unicode for spcing diacritics, but I'm not trusted enough due to my errors or confusions here, so that this suggestion would be endorsed by more "serious" UTC or WG2 members? Mostly because there is no "large gap" here in the first place. The gap may not be large, but Philippe, John H and I have identified a real gap. Why this antagonism against filling it? Is it just because you don't like the name Philippe suggested? I accept that there may be rational arguments to be made that the gap is not significant enough for Unicode to fill, but I have not seen any such rational arguments, just "over my dead body" type irrational responses. Why do you think it is stupid to have a single carrier character that would avoid adding new spacing diacritics, when the standard combining diacritics could be used without less "quirks" like "defective" sequences just to produce the desired effect? Because the mechanism for doing so -- application to SPACE or to NBSP -- has been specified by the standard for a decade now. Understood. But John H has clearly spelled out several of the weaknesses in this mechanism. And this is not something set in stone, there is I think no mention of it in the stability document. So there is no a priori reason not to define a new and improved mechanism, with the old mechanism still supported but now discouraged. If you think that spacing diacritics are stupid, We do not. Some of them are necessary compatibility characters. 
Others have distinct usage as spacing forms that warrant their separate encoding. And what if it decided that others have "distinct usage as spacing forms" which cannot be adequately represented by space or NBSP plus diacritic? Of course we could propose more spacing diacritics, but surely rather than define a potentially large number of new spacing forms it would make sense to define one new character which can combine with any diacritic to produce a spacing form. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 09/08/2003 13:41, John Cowan wrote: Peter Kirk scripsit: The gap may not be large, but Philippe, John H and I have identified a real gap. Why this antagonism against filling it? What you have identified is a set of implementation defects, not problems with the Unicode Standard. The standard way to do what you want is to precede the combining mark with SP or NBSP. If that "doesn't work", then the implementation that makes it not work needs to be fixed. Tell Microsoft! (See Noah Levitt's posting.) If this is indeed "The standard way to do what you want", then the standard needs to make it clear that the sequence of or has the properties which I want, i.e. it has the width of the combining mark alone, and not the full width of a space, and does not expand for justification, is not a line breaking opportunity, does not in fact have any of the properties of a space. I expect to see such a clarification in the next edition of the Unicode Standard. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On Tuesday, August 05, 2003 1:52 AM, Kenneth Whistler <[EMAIL PROTECTED]> wrote: > Peter, > > > > The carrier for a combining mark that is to display in isolation > > > without a base character is U+0020 SPACE. If you want to also > > > indicate the absence of a line break opportunity, then the > > > carrier is U+00A0 NO-BREAK SPACE (NBSP). > > > > > Neither of these is appropriate to the case I have in mind > > (described in greater detail below) as they are not zero width and > > therefore give an unwanted indent at the start of a line. > > Of course, because the whole point of this convention is to display > a non-spacing mark in isolation, not applied to a base character. > > > U+200B ZERO WIDTH SPACE might be > > appropriate, but this has the problem that it is a break > > opportunity, which is not always appropriate. > > U+200B ZERO WIDTH SPACE is not appropriate, for the same reason > the U+FEFF (or U+2060) is not appropriate: The Standard does > not specify the display of non-spacing marks on it as a means > of showing the marks without base characters. And, as you indicate, > U+200B (but also U+FEFF and U+2060) are implicated in the control > of line break opportunities. They are certainly not defined > as glyph display anchors or some such. Here I disagree: ZWS is a white-space, not a format control, and thus it has a glyphic and semantic identity by itself (unlike ZWNBSP or WJ). So ZWS clearly qualifies as a base character, and is certainly better (conceptually and per its breaking properties) than the standard ASCII space which has an implied minimum width (which may be too large to be used as a holder for a tiny diacritic like a dot above, or even an acute accent. 200B;ZERO WIDTH SPACE;Zs;0;BN;N; When we speak about combining sequences, they are already supposed to expand the width or height of a base character to which it applies, so ZWS despite being zero-width itself, does not make this property inherited to the combining sequence which includes it. 
For me, the best two candidates for holders of isolated diacritics are ZWS (if breakable before and after the combining sequence), or WJ (if not breakable, when the isolated diacritic must be used within the same word without an internal break opportunity). However, WJ is a control and does not fit well for the second usage. Could another codepoint be assigned that has these properties:

20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;

i.e. being considered symbolic, not a whitespace, with combining class 0 (not combining), and used as an explicit base for an isolated spacing diacritic, so that it never shows with a dotted circle? (Note that U+20CF is just a suggestion, as it fits at the end of the symbolic block used for currency symbols, just before the "extended" combining characters block, and because the U+02XX block where other "Sk" spacing diacritics are defined is full.)

The compatibility decomposition to a space is there to keep it in sync with other compatibly decomposable spacing diacritics. The new character would make it possible to represent diacritics that currently don't have a spacing counterpart, and to use them as if they were letter-like. Let's look at a similar diacritic which currently has an existing "precombined" spacing version:

00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE
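The decomposition recorded for U+00B4 can be read back from the character database and compared with what the normalization forms actually do. This is a minimal sketch using Python's standard `unicodedata` module; the helper name `decomposition_mapping` is hypothetical, and the results reflect the Unicode version bundled with the interpreter.

```python
import unicodedata

ACUTE_SPACING = "\u00b4"    # ACUTE ACCENT (spacing)
ACUTE_COMBINING = "\u0301"  # COMBINING ACUTE ACCENT


def decomposition_mapping(ch):
    """Parse the UCD decomposition field of `ch`.

    Returns (tag, text): tag is a compatibility tag such as "<compat>",
    or None for a canonical mapping. Returns None when there is no mapping.
    """
    field = unicodedata.decomposition(ch)
    if not field:
        return None
    parts = field.split()
    tag = parts.pop(0) if parts[0].startswith("<") else None
    return tag, "".join(chr(int(p, 16)) for p in parts)


if __name__ == "__main__":
    sequence = " " + ACUTE_COMBINING  # SPACE + combining acute

    # Compatibility mapping: tagged, so only the K forms apply it.
    print(decomposition_mapping(ACUTE_SPACING))
    print(unicodedata.normalize("NFKD", ACUTE_SPACING) == sequence)

    # Canonical forms leave both spellings untouched.
    print(unicodedata.normalize("NFD", ACUTE_SPACING) == ACUTE_SPACING)
    print(unicodedata.normalize("NFC", sequence) == sequence)
```

The tagged mapping is a compatibility decomposition only: NFKD and NFKC rewrite U+00B4 as SPACE plus U+0301, while NFC and NFD treat the two spellings as distinct.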
Re: Questions on ZWNBS - for line initial holam plus alef
Philippe continued:

> On Saturday, August 09, 2003 12:49 AM, Michael Everson wrote:
>
> > At 14:22 -0700 2003-08-08, Kenneth Whistler wrote:
> >
> > > Philippe, you are tilting at windmills, here. There is no chance
> > > that the UTC is going to consider such a character, in my
> > > assessment, let alone give it the properties you suggest.
> >
> > Nor WG2 either.
>
> Why that? Because I suggest something that some other may think
> as useful to fill a large gap in Unicode for spacing diacritics, but I'm
> not trusted enough due to my errors or confusions here, so that this
> suggestion would be endorsed by more "serious" UTC or WG2
> members?

Mostly because there is no "large gap" here in the first place.

> Why do you think it is stupid to have a single carrier character that
> would avoid adding new spacing diacritics, when the standard
> combining diacritics could be used without less "quirks" like
> "defective" sequences just to produce the desired effect?

Because the mechanism for doing so -- application to SPACE or to NBSP -- has been specified by the standard for a decade now.

> If you think that spacing diacritics are stupid,

We do not. Some of them are necessary compatibility characters. Others have distinct usage as spacing forms that warrant their separate encoding.

> why then are they
> given these properties and not deprecated (no more recommended)
> in the standard,

Because the ones in the standard, and particularly the ASCII and Latin-1 spacing diacritics, were required for a number of legacy and implementation reasons...

> in favor of the SPACE+diacritics sequences,

...and because these are not, and never have been, canonically equivalent.
--Ken

"Well then, if he be mad, as he is, and with a madness that mostly takes one thing for another, and white for black, and black for white, as was seen when he said the windmills were giants, and the monk's mules dromedaries, flocks of sheep armies of enemies, and much more to the same tune, it will not be very hard to make him believe that some country girl, the first I come across here, is the lady Dulcinea; and if he does not believe it, I'll swear it; and if he should swear, I'll swear again; and if he persists I'll persist still more, so as, come what may, to have my quoit always over the peg. Maybe, by holding out in this way, I may put a stop to his sending me on messages of this kind another time..."