RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
From: [EMAIL PROTECTED] on behalf of Kenneth Whistler

> Athabascan languages in Canada are also written with practical orthographies such as these

At least two of which (Dogrib and one or both varieties of Slavey) use a cased glottal stop, not U+0027.

> Nobody is agitating for an uppercase apostrophe.

Not in Canada, that I know of. (I've seen indication of languages in Russia that have a case distinction for ' and possibly also another character.)

> For these, and thousands of other documents published on Athabascan languages over the last century, there was just a glottal stop -- not an uppercase and a lowercase glottal stop.

That's true of phonetic transcriptions. But for orthographies, there are some that have case.

> It is because that is what the IPA settled on for their prescriptive preference for the shape of a glottal stop. (Note: for a *glottal stop*, not for a *capital glottal stop*. The IPA does not have casing distinctions.)

Which only tells us that there should be no predisposition to consider the glottal stop upper rather than lower case, or vice versa. It does not tell us that the character cannot be involved in a case-pair relationship.

> The prestige of the IPA specification is such that many fonts have used that form as well. And, indeed, it influenced the choice for the Unicode representative glyph, which in turn has influenced what OS vendors have put in their fonts. So, while there are multiple different glyphs in print for a glottal stop (see Pullum & Ladusaw for different examples), most of which don't *look like* capital letters, the IPA glyph has become the preferred one, simply because IPA prefers it.

All of which is very germane to my argument.

> And that is unfortunate, because that one glyph is the one that people think *looks like* a capital letter, and which thus causes the confusion when an orthographic innovation decides it needs to introduce casing for it.

Not only think it *looks like* one, but behave as though it is one.

> Now I presume from Michael's assertion that there is some Athabascan community *somewhere* that has started to make an initial case distinction for glottal stop,

This thread began when I provided a scanned image.

> and that in the fonts they use, their uppercase glottal stop *looks like* the IPA glottal stop, and that for the body text they innovated a miniature of same. Hence the conclusion that we must treat the existing form as the *capital* and need to encode a new lowercase form.

That alone is not the basis of the argument. You have provided the basis for additional, strong argumentation yourself:

> > 0294 cannot be displayed using the lowercase glyph as its design as a cap-height letter is well established in many fonts. If a new upper-case glottal character is created, a distinct lowercase glottal would be needed, but then there would be two characters (0294 and the new UC glottal) that have exactly the same appearance and would get confused, with mixed-up and inconsistent data and processes for years to come.
>
> That, however, is utterly backward. It is clear in these cases, following 100 years of monocase usage of glottal stop, that the innovation (as in many adaptations of IPA) is to create an uppercase letter to go with the lowercase one.

This argument is completely empty, as it depends on the premise that the existing character can be considered the lowercase one. You have asserted it to be so, but you have not given reasoning why it must be considered so. That seems to be especially required given that you observe in the same breath that during the 100 years of its usage this character has been monocase.

Let's roll back the discussion for a moment. Suppose, before this thread had started up, someone had come along and said, "0294 is obviously caseless and has always been so; it should have a general category of Lo rather than Ll, just like the dental click (01C0) and other caseless phonetic symbols." Would you be able to make a compelling argument that it must be Ll and not Lo? I don't see how anyone possibly could. But to maintain the premise that this only-ever-monocase character is the lowercase one, you've got to have solid reasons to say it could not be Lo and must be Ll.

IMO, it takes an emperor willing to wear clothes spun with thread that only the wisest could see to say that, though the cap-height character the Dogrib and Slavey are using as a capital has *exactly* the same appearance and metrics as 0294, it is actually the thing that is half the height and has a different shape that is the same as 0294, and that this exact replica is really a new innovation.

Ken, you have not given reasons why 0294 cannot be considered uppercase -- no evidence that it has in the past been used as lowercase in a case pair, or that usage as an uppercase in a case pair would result in problems in implementation, usability, or management of data. You have merely asserted that the original character was a
RE: New symbols (was Qumran Greek)
> > and why aren't they linked together for us fringies?
>
> They are...

For some reason, my first thought was of Ford Prefect asking the fellow regarding the not-well-publicized plans to build a by-pass, "Have you ever thought of going into advertising?" :-)

> No, it is made from the River Liffey

When I was there in 1979, the river was introduced to me as "the whiffy Liffey", and I was told that the ships in the river scooped up the water and took it upstream, where they just put it into the bottles.

Peter
RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
From: [EMAIL PROTECTED] on behalf of Michael Everson

> > to use the kinds of uppercase glyph models used in similar instances of after-the-fact uppercase inventions based on IPA or other phonetic alphabets and usages.
>
> A modified capital P would probably do.

[??!!]

Michael, you've seen what they are using. How will the community be served when type designers start creating fonts that have a cap-height glyph for 0294 supplemented by a modified capital P?

If a band of Rumple-stiltskin Latins from Caesar's administration suddenly awoke from their 2000-year slumber, reviewed the situation and then pronounced, "This 'w' is not acceptable to us; you shall be permitted to inscribe an additional sound from your barbaric northern tongue using an O split in two parts, and one size is adequate," how excited with their decision do you think we'd be?

Peter Constable
RE: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
From: [EMAIL PROTECTED] on behalf of Kenneth Whistler

> > Unicode doesn't prevent styling, of course. But having 'logical' order instead of 'visual' makes it a hard task for the application and the renderer. This is witnessed by the thin-spread support for this.
>
> Yes...

Ken conceded the claim too readily. Glyph re-ordering due to a logical encoding order that is different from visual order may mean that certain types of styling (of the re-ordered character) may not be supported in some implementations, but it does *not* mean that this is, in general, a hard task. Style information is applied to characters, and as long as there is a 1:m association between characters and glyphs and there is a path to transform the styling information to match the character/glyph transformations, styling is in principle possible. (There's a constraint that styling might not be possible if the styling differences require different fonts but the glyph transformations that occur require rule contexts to span such a style boundary.) (Expecting one component of a precomposed character to be styled differently from the rest, however, would be somewhat hard.)

In particular, for reordering this is easy to demonstrate by considering a hypothetical complex-script rendering implementation in which processing is divided into two stages: character re-ordering, and glyph transformation. In the first stage, all that happens is that a string is mapped to a temporary string used internally only, in which characters are reordered into visual order. (Surrounding characters with no decomposition would be mapped into multiple internal-use-only virtual characters.) Thus, a styled string such as <string>k<span color="red">e</span></string> would transform in the first stage to <string><span color="red">e</span>k</string>. There is nothing hard in such processing. (Of course, whether it is harder to get people to implement support for one thing rather than another is an entirely different question.)

Peter Constable
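[Peter's two-stage model is easy to make concrete. Below is a minimal sketch -- hypothetical types, and a toy reordering rule standing in for real script knowledge; this is not code from any actual renderer. Because the style attribute is attached to each character, it simply travels with the character through the reordering stage, and the glyph stage never needs to know.]

#include <cstdio>
#include <utility>
#include <vector>

struct StyledChar {
    char32_t ch;    // code point
    int      style; // opaque style id (e.g. index into a style-run table)
};

// Toy rule: a "prefixed vowel" is stored after its consonant in logical
// order but displayed before it, as with Tamil U+0BC6.
static bool isPrefixedVowel(char32_t c) { return c == U'\u0BC6'; }

// Stage one: logical order -> visual order, styles travelling along.
static std::vector<StyledChar> toVisualOrder(std::vector<StyledChar> s) {
    for (std::size_t i = 1; i < s.size(); ++i)
        if (isPrefixedVowel(s[i].ch))
            std::swap(s[i - 1], s[i]);
    return s;
}

int main() {
    // Logical order: Tamil KA (unstyled) then vowel sign E (styled "red" = 1).
    std::vector<StyledChar> logical = {{U'\u0B95', 0}, {U'\u0BC6', 1}};
    for (const StyledChar& sc : toVisualOrder(logical))
        std::printf("U+%04X style=%d\n", (unsigned)sc.ch, sc.style);
    // Prints the vowel sign (still style 1) before the consonant (style 0).
    return 0;
}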
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 08/12/2003 15:51, Philippe Verdy wrote:

> ... Peter Kirk writes:
> > Agreed. But now we are told that the latter is illegal XML because a combining mark is not permitted (by XML, not by Unicode) after <span>.
>
> It is not forbidden by XML. It's just that handling an XML file (which is not plain text) as if it were Unicode plain text when performing normalization of the file may produce unexpected composition of characters which are part of the XML syntax. ...

Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.

Anyone, please, is it or is it not true that XML forbids, or will forbid in future versions, combining characters immediately after markup?

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
On 08/12/2003 16:17, Kenneth Whistler wrote:

> > Having an 'invisible consonant' to call for rendering of the vowel sign in isolation (and without the dotted circle) would also help the limited number of cases where the styled single character is needed - but in a rather hackish way.
>
> That is what SPACE as a base character is for. If some renderers insist on rendering such combinations with a dotted circle glyph, that is an issue in the renderer -- it is not a defect in the encoding standard for not having a way to represent the vowel sign in isolation.

SPACE is unsuitable for this function for at least two good reasons: 1) because of its word and line breaking characteristics; 2) because in a case like this no extra spacing is required. The vowel sign is a spacing character in itself, although a combining mark. SPACE is expected to add its own spacing. In the absence of clearly defined rules to the contrary, renderers will render this combination of SPACE with a Tamil vowel with an extra space which is not wanted. (As for which side of the vowel the space will appear on, that is anyone's guess!)

This is yet another example to add to a number that I have identified showing that the reuse of SPACE and NBSP as carriers for diacritics is an undesirable overloading of character semantics. I propose again a new base character for carrying combining marks, with no glyph and a width just as wide as that required to display the combining marks. The mechanism already defined for using SPACE and NBSP for this should be deprecated, though not abolished.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Peter Kirk writes:

> Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.

I don't know how you interpreted what I may have said a few days before. I have certainly not said that XML forbids combining marks at the start of XML, just that the W3C does not _recommend_ it, as with any other defective combining sequences, as they are known to cause problems (for example when it's difficult to track the effective text file type). That's the same for NFC: it's just a recommendation, not a requirement, and for XML there are no such canonical equivalents, just distinct strings. It's up to the application using the _parsed_ XML document tree to do, if needed, the normalization steps. But this should occur only _after_ the document has been parsed and possibly validated according to its schema.

Generally, normalization of strings will only occur in the very last step, just before outputting the result (for example for font rendering), but even at this step the font may provide information which may require glyph processing or character substitutions that are not well performed with just a normalized NFC form. So in fact, the XML application can/should perform its own necessary normalizations only at steps where it has a benefit, but not at the file stream level, as the XML stream itself is not plain text.
Re: [OT]
On 08/12/2003 17:29, Philippe Verdy wrote:

> ... Nota: when speaking about alcohol in public areas, we have to add here in France a mandatory legal notice: "L'abus d'alcool est dangereux pour la santé; appréciez et consommez-le avec modération." ["Alcohol abuse is dangerous for the health; appreciate and consume it in moderation."] ...

Despite your French notice about danger to the health (not to the sanity, though that might be true, too), Guinness was actually introduced as a health drink. I think the problem was that too many Irish people were spending their money on whiskey and not eating well, so Arthur Guinness introduced a drink that was so full of nutrients that you could live on it, and so heavy that you can't drink enough of it to get drunk!

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 09/12/2003 03:41, Philippe Verdy wrote:

> Peter Kirk writes:
> > Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.
>
> I don't know how you interpreted what I may have said a few days before. I have certainly not said that XML forbids combining marks at the start of XML, just that the W3C does not _recommend_ it, as with any other defective combining sequences, as they are known to cause problems (for example when it's difficult to track the effective text file type)

So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write:

e<span class="red-text">{U+0301}</span>

where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?

If this is correct, then the Tamil problem which Peter J is concerned about has gone away completely, or at least it is reduced to a tricky rendering issue.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Peter Kirk scripsit:

> Anyone, please, is it or is it not true that XML forbids, or will forbid in future versions, combining characters immediately after markup?

XML 1.0 is silent on the subject. The W3C Character Model (which is not official yet) says that content developers SHOULD "avoid composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup." XML 1.1 (which is not official yet either) references the Character Model and states which constructs are significant.

The technical meaning of SHOULD is defined by RFC 2119: "This word [...] means that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course."

-- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan It's the old, old story. Droid meets droid. Droid becomes chameleon. Droid loses chameleon, chameleon becomes blob, droid gets blob back again. It's a classic tale. --Kryten, _Red Dwarf_
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> Anyone, please, is it or is it not true that XML forbids, or will forbid in future versions, combining characters immediately after markup?

XML does not forbid it, but it does recommend that you avoid it.

Charmod defines include-normalization and full-normalization, which go beyond Unicode normalisation in guaranteeing that normalisation will not be altered through the various concatenations and inclusions that may occur in the processing of XML data. These do forbid it, though I don't think Charmod insists on their being used. The specification of an application of XML could cite Charmod and insist on include- or full-normalisation. In some cases this would have no real effect (in some data-orientated rather than document-orientated uses of XML); in others it would be a restriction on what could be done in the application.

Not forbidding it causes problems, the most spectacular being the possibility of COMBINING LONG SOLIDUS OVERLAY causing a well-formed XML document to have a canonically equivalent (in both the Unicode and XML concepts of c14n, since the latter makes use of NFC) document that was not well-formed XML.

Colouring of diacritics can be performed through other means. http://www.w3.org/TR/charmod/benoit.svg is an SVG example. This seems a superior method for at least some of the use-cases cited anyway (I've missed some of this thread though).

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
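[Jon's "most spectacular" case is worth spelling out: U+003E '>' followed by U+0338 COMBINING LONG SOLIDUS OVERLAY composes under NFC into U+226F NOT GREATER-THAN, so normalizing a serialized document whose element content begins with U+0338 consumes the '>' that closed the start-tag. A toy demonstration -- the composer below knows only this one pair; a real implementation would use a full normalizer such as ICU's.]

#include <cstdio>
#include <string>

// Toy NFC step handling exactly one canonical composition:
// U+003E '>' + U+0338 -> U+226F NOT GREATER-THAN.
static std::u32string toyComposeNFC(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (!out.empty() && out.back() == U'>' && c == U'\u0338')
            out.back() = U'\u226F';
        else
            out.push_back(c);
    }
    return out;
}

int main() {
    std::u32string xml = U"<p>\u0338x</p>"; // content begins with U+0338
    std::u32string nfc = toyComposeNFC(xml);
    // The '>' that closed the start-tag has become part of U+226F;
    // the result is no longer well-formed XML.
    std::printf("start-tag survives: %s\n",
                nfc.find(U"<p>") != std::u32string::npos ? "yes" : "no");
    return 0;
}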
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Philippe Verdy scripsit:

> When in doubt, don't perform any normalization of XML _files_ as they are NOT plain text: you need an XML parser to do it safely, only in relevant sections of the file. All you could do safely is to possibly reencode XML files (for example from the UTF-8 to the UTF-16 encoding scheme).

This is wildly overstated. XML files most certainly are plain text, though they may be interpreted as fancy text in contexts that understand XML. With the insignificant exception of markup immediately followed by a U+0338 character, it is entirely safe to normalize XML files according to any normalization. (It is true that NFK* normalization forms may lose information, but XML document authors are discouraged from using compatibility decomposables in any case.)

What is not allowed, and this makes XML technically non-conformant to the Unicode Standard, is to make arbitrary and unsystematic replacements of one canonically equivalent form with another. For example, if an element name is "hétérogénéité" (a favorite word of mine), decomposing the start-tag while leaving the end-tag composed would make the document no longer well-formed XML. In my opinion, this is a corner case that may be safely ignored.

-- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] 'Tis the Linux rebellion / Let coders take their place, The Linux-nationale / Shall Microsoft outpace, We can write better programs / Our CPUs won't stall, So raise the penguin banner of / The Linux-nationale.
Re: [OT]
> Despite your French notice about danger to the health (not to the sanity, though that might be true, too), Guinness was actually introduced as a health drink. I think the problem was that too many Irish people were spending their money on whiskey and not eating well, so Arthur Guinness introduced a drink that was so full of nutrients that you could live on it, and so heavy that you can't drink enough of it to get drunk!

Alas, you can easily drink enough of it to get drunk, even if you don't like being drunk, since its delicious taste will lead you to exceed your limit.

Stout was indeed given as a health drink in small doses in certain cases; it's one of the few foods that are a good source of both iron and calcium. However, the only doctor I've heard of recommending it in recent years was a bone specialist who was trained in China; he professed a belief that Guinness was why the Irish had thicker bones than the Chinese in his experience. There are considerably more doctors who would say that if you were going to drink a beer it should be stout, without going so far as to actually recommend it in and of itself.

A pint of plain's your only man.

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write: e<span class="red-text">{U+0301}</span> where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?

You can; whether you should is another thing, and whether it would render correctly yet another.

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
[EMAIL PROTECTED] writes:

> What is not allowed, and this makes XML technically non-conformant to the Unicode Standard

Where did you see that XML files need to be conformant to the Unicode standard? XML files are definitely NOT plain text (if this were the case, then it would be forbidden to interpret < as a special markup character instead of as the standard Unicode base character with its associated glyph)... _Only_ fragments of XML files are plain text and fully conformant to the Unicode standard.
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
-----Original Message-----
From: Peter Kirk [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 9 December 2003 13:17
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

> On 09/12/2003 03:41, Philippe Verdy wrote:
> > Peter Kirk writes:
> > > Philippe, you have now stated this (several times). But just a day earlier you yourself stated that the rule forbidding combining marks at the start of a string would never be relaxed because it is fundamental to the XML containment model. You don't usually contradict yourself quite so obviously.
> >
> > I don't know how you interpreted what I may have said a few days before. I have certainly not said that XML forbids combining marks at the start of XML, just that the W3C does not _recommend_ it, as with any other defective combining sequences, as they are known to cause problems (for example when it's difficult to track the effective text file type)
>
> So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write: e<span class="red-text">{U+0301}</span> where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?

That's right: the text element within <span> just contains the string with the isolated diacritic; it is already in NFC form even though it is defective. And it must not be parsed by creating a combining sequence that includes the terminating </span> tag (interpretation of combining sequences is only valid within plain text, and thus excludes syntactic characters used in XML).

Note that this is not specific to XML. Any text/* format that is not plain text (notably programming source files, shell scripts, HTML files, stylesheets, and JavaScript files) should be handled this way, where the syntax of the language governs the rules for parsing it, before even trying to apply Unicode definitions to the parsed tokens of that language. So normalization should never be performed on whole files that are not explicitly of file type text/plain (either with explicit metadata such as MIME headers during transmission, or locally with OS-specific conventions on file extensions such as .txt).

When in doubt, for example in CVS repositories or in diff/merge tools, normalization must not be performed, and the current encoding form of text files must be preserved, whenever the tool does not implement an accurate parser for the syntactic and lexical rules of the effective file type or language, which may or may not accept defective combining sequences as valid plain-text strings (this includes identifiers; however, Unicode recommends a list of characters that can be used to start an identifier, and this list excludes all non-starter combining characters).
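[Philippe's "normalize after parsing, not at the stream level" approach is straightforward to sketch: parse first, then normalize only the text nodes, so markup characters are never touched. This assumes libxml2 for parsing; nfc() is an identity stand-in for a real normalizer (e.g. ICU's), and the document string is hypothetical.]

#include <libxml/parser.h>
#include <cstring>
#include <string>

// Stand-in: replace with a real NFC implementation (e.g. ICU).
static std::string nfc(const std::string& s) { return s; }

// Walk the tree and normalize text nodes only; element and attribute
// names (the markup) are left exactly as parsed.
static void normalizeTextNodes(xmlNode* node) {
    for (xmlNode* cur = node; cur != nullptr; cur = cur->next) {
        if (cur->type == XML_TEXT_NODE && cur->content != nullptr) {
            std::string t = nfc(reinterpret_cast<const char*>(cur->content));
            // (Real code would escape '&' and '<' before setting content.)
            xmlNodeSetContent(cur, reinterpret_cast<const xmlChar*>(t.c_str()));
        }
        normalizeTextNodes(cur->children);
    }
}

int main() {
    const char* doc = "<p>caf\xC3\xA9</p>"; // UTF-8 for "café"
    xmlDocPtr d = xmlReadMemory(doc, (int)std::strlen(doc), "in.xml", nullptr, 0);
    if (d != nullptr) {
        normalizeTextNodes(xmlDocGetRootElement(d));
        xmlFreeDoc(d);
    }
    return 0;
}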
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Philippe Verdy scripsit:

> XML files are definitely NOT plain text (if this were the case, then it would be forbidden to interpret < as a special markup character instead of as the standard Unicode base character with its associated glyph)...

You might as well say that C code is not plain text because it too is subject to special canons of interpretation. But both XML/HTML/SGML and the various programming languages are plain text: they are written with plain-text editors, manipulated with plain-text tools, and can be rendered with plain-text renderers. The fact that other things can be done with them is neither here nor there.

-- John Cowan http://www.ccil.org/~cowan [EMAIL PROTECTED] http://www.reutershealth.com | "In my last lifetime, I believed in reincarnation; in this lifetime, I don't." --Thiagi
Re: New symbols (was Qumran Greek)
> > http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2676.pdf is a complete listing of new symbols to go into Unicode
>
> Thanks! -- how many Web sites do you all have?

http://www.evertype.com/formal.html is a good link to what Michael Everson is doing. http://www.dkuug.dk/jtc1/sc2/wg2/docs/ is an index page with all the proposals and documents, if you want to keep up with the new proposals.
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Hi Peter, All,

Peter Kirk [EMAIL PROTECTED] wrote:

> [...] [About e<span class="red-text">&#x0301;</span> being correct HTML] [...] If this is correct, then the Tamil problem which Peter J is concerned about has gone away completely, or at least it is reduced to a tricky rendering issue.

Jungshik and Martin have already voted that

<span style='color:#00f'>&#x0BB2;</span>&#x0BC6;

is perfectly valid HTML, and I assume the same holds for

&#x0BB2;<span style='color:#00f'>&#x0BC6;</span>

But, seeing real-life user agents mishandle this, and being confronted with the task of writing a converter from legacy Tamil encodings (in visual order), there is some temptation to mark this up as:

{INV}&#x0BC6;<span style='color:#00f'>&#x0BB2;</span>

or respectively

<span style='color:#00f'>{INV}&#x0BC6;</span>&#x0BB2;

with {INV} being the hypothetical, not-spacing-adding, invisible consonant. But a) {INV} doesn't exist (so far), and b) the user agents I tested render {SPACE}&#x0BC6; with the misguided dotted circle. So, I can easily withstand this temptation (for now).

Regards, Peter Jacobi
RE: [OT]
[EMAIL PROTECTED] wrote:

> Stout was indeed given as a health drink in small doses in certain cases; it's one of the few foods that are a good source of both iron and calcium. However, the only doctor I've heard of recommending it in recent years was a bone specialist who was trained in China; he professed a belief that Guinness was why the Irish had thicker bones than the Chinese in his experience. There are considerably more doctors who would say that if you were going to drink a beer it should be stout, without going so far as to actually recommend it in and of itself.

You'll also find interesting studies about why French people experience low levels of heart and vessel disease even though they eat above-average quantities of fat and sugar. One reason is that they drink wine. A very moderate absorption of alcohol is beneficial to the health, but this is true ONLY if you compare populations drinking NO alcohol with those that drink just a little. This is easy to see when comparing children that drink no alcohol, who experience more heart/vessel diseases than those that get a few millilitres of alcohol each day (a small and beneficial absorption of alcohol is possible just by eating fruit, or by taking it in a medical form as a food supplement).

A very small daily absorption of alcohol helps the body to thin the blood and clean its vessels of excess fat. You don't need a lot (a full bottle of beer is not needed, but you cannot keep beer drinkable for long once it has been opened). That's why beer is not a recommended form of absorption of alcohol, as it requires you to exceed the sufficient level to get its benefit. On the other hand, you can open a 75cl bottle of wine at 6° (which contains 45ml of pure alcohol) and drink it over three days to have 15ml of pure alcohol each day (a reasonable and beneficial level for adults of an average weight of about 80kg). For children, due to their reduced weight, and thus reduced volume of blood, this quantity must be reduced accordingly, and this is possible by using medical forms of alcohol, which you will find in pharmacies in products that also contain essential oils, vitamins, and mineral supplements. You should know that even babies are sometimes given tiny quantities of alcohol within curative medications to help them recover from infectious diseases: the needed quantity is less than 5 millilitres.
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
[EMAIL PROTECTED] writes:

> Philippe Verdy scripsit:
> > XML files are definitely NOT plain text (if this were the case, then it would be forbidden to interpret < as a special markup character instead of as the standard Unicode base character with its associated glyph)...
>
> You might as well say that C code is not plain text because it too is subject to special canons of interpretation. But both XML/HTML/SGML and the various programming languages are plain text: they are written with plain-text editors, manipulated with plain-text tools, and can be rendered with plain-text renderers. The fact that other things can be done with them is neither here nor there.

The fact that plain-text renderers are used is not relevant here, as any normalization the renderer would use is hidden in the background, and the renderer does not expose the transformations it makes to the editor itself. Also, nobody uses an editor that performs implicit normalization of text when saving a file. If there is such an editor that can do it on the fly, this option should be disabled for source files. It's best for editors to allow the user to select the parts of the text to normalize, and then apply normalization only in those selected parts. A simpler editor could implement a global normalization, but this should be an explicit editing action by the user. For various reasons, I would not like to use any Unicode plain-text editor that implicitly normalizes the text without asking me, to work on programming source files or XML or HTML files. But I will accept it, if the editor really understands the language or XML syntax (and exhibits it to the user with syntax coloring).
Re: [OT]
On 09/12/2003 04:44, [EMAIL PROTECTED] wrote:

> > Despite your French notice about danger to the health (not to the sanity, though that might be true, too), Guinness was actually introduced as a health drink. I think the problem was that too many Irish people were spending their money on whiskey and not eating well, so Arthur Guinness introduced a drink that was so full of nutrients that you could live on it, and so heavy that you can't drink enough of it to get drunk!
>
> Alas, you can easily drink enough of it to get drunk, even if you don't like being drunk, since its delicious taste will lead you to exceed your limit.

I think the current version has been watered down, and strengthened in alcohol, compared to the original. And I am not thinking of being slightly woozy and not safe to drive; I am thinking of being blind drunk and unable to crawl home, which (I am told!) is much easier with whiskey.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 09/12/2003 05:13, [EMAIL PROTECTED] wrote:

> > So, let's get this clear. Within an XML or HTML document, if I want an e with a red acute accent on it, it is quite permissible to write: e<span class="red-text">{U+0301}</span> where {U+0301} is replaced by the actual Unicode character, and red-text is defined in the stylesheet. So it is not a problem that there is a defective combining sequence, nor that the accent is not combined with the e as it would be in NFC. Is that correct?
>
> You can; whether you should is another thing, and whether it would render correctly yet another.

Well, users need to know whether they should do this, or what else they should do, when this is the effect they require; and implementers need to know whether they should work towards making this render correctly, to meet the demands of users, including the Tamil users in question. It seems that this is the simple and meaningful way of specifying the effect that is required. Rendering this is of course a challenge, but at least the requirement is clear.

Your alternative suggestion using SVG seemed to require the user to handle the details of glyph positioning with specified horizontal advances, which is surely a very strange requirement. Or maybe I have misunderstood what was going on here.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> Your alternative suggestion using SVG seemed to require the user to handle the details of glyph positioning with specified horizontal advances, which is surely a very strange requirement. Or maybe I have misunderstood what was going on here.

Perhaps so does yours. It isn't clear whether the CSS for .red-text would have to override the default behaviour whereby an inline element like span is rendered by stacking it to the left or right (depending on text directionality) of the previous inline element or text node, or if the accent should go over the e by default.

Briefly testing on a Win2000 box, I found that IE6 ignored the styling on the accent, Mozilla 1.4 didn't show the accent, and Opera 7.2 displayed the red accent (tests had the same results with &#x0301; as with the combining character used directly). It isn't clear to me which, if any, of these are examples of conformant behaviour.

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
On Mon, 8 Dec 2003, Peter Jacobi wrote:

> It would be most interesting if someone can point out a word processor or even a rendering library (shouldn't Pango be the solution to everything?) which enables styling of individual Tamil letters.

I think Pango's attributed string ( http://developer.gnome.org/doc/API/2.0/pango/pango-Text-Attributes.html ) can be used for this. I believe that other layout/rendering libraries such as Uniscribe, ATSUI and the rendering/layout part of ICU have similar data types/APIs.

Jungshik
Re: [OT]
At 12:44 +0000 2003-12-09, [EMAIL PROTECTED] wrote:

> A pint of plain's your only man.

Yes, yes, yes; now will you people start talking about fragile-glass symbols or plate-and-cutlery symbols or something and drag this back into some semblance of topicality? Hm. We have a hot beverage symbol. Maybe we need a pint glass.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> You might as well say that C code is not plain text because it too is subject to special canons of interpretation.

C, C++ and Java source files are not plain text either (they have their own text/* MIME type, which is NOT text/plain, notably because of the rules associated with end-of-lines, notably in the presence of comments).

> But both XML/HTML/SGML and the various programming languages are plain text.

See the text/xml, text/html and text/sgml MIME types. They also aren't text/plain, so they have their own interpretation of Unicode characters which is not the one found in the Unicode standard.
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):

int n = wcslen(L"café");

(That's int n = wcslen(L"caf\u00E9"); for those without HTML email.) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment.)

So, should n equal four or five? The answer would appear to depend on whether the source file was saved in NFC or NFD form. There is more to consider than just how and whether a text editor normalizes.

If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. The question I posed in the previous paragraph should ideally be answerable by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. This implies that such a text editor should display NFD text as separate glyphs for each character. On the other hand, such a text editor must also acknowledge that "é" and "e + U+0301" are actually equivalent. The intention of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything. So in other contexts, it should display them the same. Yuk. That's a lot to think about for anyone considering writing a programmers' text editor with serious Unicode support.

Jill

-----Original Message-----
From: Philippe Verdy [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 09, 2003 2:04 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

> I would not like to use any Unicode plain-text editor that implicitly normalizes the text without asking me, to work on programming source files or XML or HTML files. But I will accept it, if the editor really understands the language or XML syntax (and exhibits it to the user with syntax coloring).
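[Jill's question is easy to make concrete on a platform where wchar_t holds Unicode code points, as her post assumes: the same word saved in the two normalization forms gives two different answers, because wcslen() counts code units, not canonical-equivalence classes.]

#include <cstdio>
#include <cwchar>

int main() {
    const wchar_t* nfc = L"caf\u00E9";  // e-acute precomposed: 4 code points
    const wchar_t* nfd = L"cafe\u0301"; // e + combining acute: 5 code points
    std::printf("NFC: %zu, NFD: %zu\n", std::wcslen(nfc), std::wcslen(nfd));
    return 0;
}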
Re: Ideographic Description Characters
On Dec 8, 2003, at 6:20 PM, Mark Davis wrote:

> John, I don't see why you are saying that it is a 'no-no'. There is no reason that someone couldn't do something like that.

Strictly speaking, it isn't in violation of TUS, which only says (p. 309), "Ideographic Description Sequences are not to be used to provide alternative graphic representations of encoded ideographs." Less formally, however, the discussion in The Book is focused on using them to represent unencoded ideographs, and we have consistently suggested (a) that IDSs should be as short as possible, and (b) that they shouldn't be used at all for encoded ideographs in text exchange. It may be a good idea to update the language in The Book to specifically state that they are also useful for pedagogy and structural analysis of existing ideographs.

John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/
Re: Transcoding Tamil in the presence of markup (was Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
I agree strongly. Reordering of glyphs doesn't affect the ability to maintain styles. Every reasonable package has to retain the mappings back and forth between character and glyph to maintain styles and to map highlighting/mouse clicks/etc. The only issue is for combinations. That is, the character-to-glyph mappings can be arbitrary combinations of the following:

- reordering: easy to retain style
- 1:1 mapping: easy to retain style
- 1:n mapping: also easy to retain style
- n:1 mapping: this is the place where it gets tricky

Any time the n:1 mapping is involved, maintaining styles is difficult. For example, with <sample><red>f</red><green>i</green></sample>, if ligatures are used for fi, then you have some choices: (a) disallow the ligature, (b) color it all one or the other color, (c) if (and that's a big if) your font allows for the production of an fi ligature with two adjacent 'fitting' pieces, essentially contextual forms instead of a ligature, then you can do both the ligature and the color.

Mark

__ http://www.macchiato.com
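[A sketch of Mark's choice (a): refuse the ligature when its components do not share a style, so every glyph carries exactly one style. Hypothetical types; a real shaper makes this decision inside glyph substitution.]

#include <cstdio>
#include <vector>

struct StyledChar { char32_t ch; int style; };
struct Glyph { const char* name; int style; };

// An n:1 mapping (the ligature) is allowed only within a single style run.
static std::vector<Glyph> shape(const std::vector<StyledChar>& s) {
    std::vector<Glyph> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i + 1 < s.size() && s[i].ch == U'f' && s[i + 1].ch == U'i' &&
            s[i].style == s[i + 1].style) {
            out.push_back({"fi_ligature", s[i].style});
            ++i; // consumed two characters
        } else {
            // (This toy handles only 'f' and 'i'.)
            out.push_back({s[i].ch == U'f' ? "f" : "i", s[i].style});
        }
    }
    return out;
}

int main() {
    // red 'f' (style 1) + green 'i' (style 2): the ligature is refused.
    for (const Glyph& g : shape({{U'f', 1}, {U'i', 2}}))
        std::printf("%s (style %d)\n", g.name, g.style);
    return 0;
}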
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
> > You might as well say that C code is not plain text because it too is subject to special canons of interpretation.
>
> C, C++ and Java source files are not plain text either (they have their own

C, C++ and Java source files are plain text.

> text/* MIME type, which is NOT text/plain notably because of the rules

I've seen text/cpp and text/java, but really there are no such types. I've also seen text/x-source-code, which is at least legal, if of little value to interoperability. The correct MIME type for C and C++ source files is text/plain. I'd be prepared to give good odds that that is the case with Java source files as well.

> associated with end-of-lines, notably in the presence of comments).

As source files (that is, at the stage in processing at which a human user can see the source and edit it), the only handling required for end-of-lines is conversion of new line function characters, the same as for any other use of plain text. The treatment of end-of-lines as significant when processed (for example following one-line // comments) is a matter of what an application chooses to do with a particular character. This is no different from an indexer deciding that a plain text file contains a particular word, or for that matter my putting coffee filters into my basket if I see "coffee filters" written on my shopping list.

> > But both XML/HTML/SGML and the various programming languages are plain text.
>
> See the text/xml, text/html and text/sgml MIME types. They also aren't text/plain, so they have their own interpretation of Unicode characters which is not the one found in the Unicode standard.

They have their own interpretation of the Unicode characters which is *in addition to* the one found in the Unicode standard. As with all but the simplest applications that use Unicode (as interesting as many of them are, characters are of little use on their own).
RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
At 23:40 -0800 2003-12-08, Peter Constable wrote:

> > > to use the kinds of uppercase glyph models used in similar instances of after-the-fact uppercase inventions based on IPA or other phonetic alphabets and usages.
> >
> > A modified capital P would probably do.
>
> [??!!]
>
> Michael, you've seen what they are using. How will the community be served when type designers start creating fonts that have a cap-height glyph for 0294 supplemented by a modified capital P?

I meant that it would do for the code charts, given Ken's model. It's not a very satisfactory model. I think language-specific font requirements for Latin are generally unsatisfactory, particularly where minorities are concerned. But I'm not in a position to fight this particular battle with Ken at the moment.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Arcane Jill wrote:

> The intention of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything.

The intention of canonical equivalence is that *all* operations that involve interpreting the text treat two canonically equivalent strings the same. This is by no means limited to display.

One of the first things that surprised me when I was first learning about Unicode in 1992 (from the big softcover 1.0 books) was how much attention was paid to processing issues. Topics like bidirectionality, backing store, sorting and searching, and what became known as the character-glyph model were all discussed. It was a real eye-opener for me to see a formal character standard that didn't just treat characters as something to be typed, displayed, and printed.

-Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
From: Philippe Verdy [mailto:[EMAIL PROTECTED]]

> > I see no particular value in this. The font rendering of <span>base diacritic</span> should be exactly the same as that for <span>base</span><span>diacritic</span> provided the font characteristics are the same or do not affect metrics.
>
> This is wrong here: there's no guarantee offered by HTML...

My comment was intended to refer to generic markup, not specifically HTML.

Peter

Peter Constable Globalization Infrastructure and Font Technologies Microsoft Windows Division
Re: [OT]
[EMAIL PROTECTED] wrote:

> Stout was indeed given as a health drink in small doses in certain cases; it's one of the few foods that are a good source of both iron and calcium. However, the only doctor I've heard of recommending it in recent years was...

I know of an (Irish) obstetrician in NYC who recommends it to his patients!

Not to prolong this tangent, but to warn any North Americans who are inspired by it to rush out and buy a bottle of Guinness -- don't bother. It's not real Guinness any more, since about 2 years ago. If you read the fine print, you'll see it's from Toronto. And it is AWFUL -- the consistency of Pepsi and the taste of toxic waste. It staggers the imagination to conceive of how this could happen. Real Irish Guinness was a constant in this world for centuries, and suddenly some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!)

Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-)

- Frank
plain text (was RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]

> XML files most certainly are plain text

XML *can* be interpreted as plain text, or it can be interpreted as something *other* than plain text (i.e. XML). This ambiguity exists for any plain-text-based markup format, such as RTF, PostScript, ...

Perhaps we need some new terminology here. It might be helpful to describe an XML file as a "plain-text-markup" file (PTM, for acronym lovers), but reserve the term "plain text file" for files that contain text with no markup. Note that the terms being defined are "xxx file", not simply "plain text". Thus, John can continue to say that XML is plain text, but in some contexts that wouldn't be as useful as saying XML files are plain-text-markup files.

Peter

Peter Constable Globalization Infrastructure and Font Technologies Microsoft Windows Division
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 07:00, Arcane Jill wrote:

> Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it):
>
> int n = wcslen(L"café");
>
> (That's int n = wcslen(L"caf\u00E9"); for those without HTML email.) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment.)
>
> So, should n equal four or five? The answer would appear to depend on whether the source file was saved in NFC or NFD form.

No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. (One can imagine a second parameter specifying whether NFC or NFD is required.) This makes the issue one not for the text editor but for the programming language or its string handling library.

> There is more to consider than just how and whether a text editor normalizes. If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. The question I posed in the previous paragraph should ideally be answerable by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. This implies that such a text editor should display NFD text as separate glyphs for each character. On the other hand, such a text editor must also acknowledge that "é" and "e + U+0301" are actually equivalent. The /intention/ of canonical equivalence is that the glyphs should display the same - otherwise we'd need precomposed versions of, well, everything. So in other contexts, it should display them the same.

The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible.

> Yuk. That's a lot to think about for anyone considering writing a programmers' text editor with /serious/ Unicode support.
>
> Jill

Simply allow the text editor to save as either NFC or NFD, and let the programming language sort out the rest.

-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
Doh! (It was late.)

-----Original Message-----
From: Curtis Clark [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 09, 2003 8:00 AM
To: Peter Constable
Subject: Re: Glottal stops (bis) (was RE: Missing African Latin letters (bis))

on 2003-12-08 23:40 Peter Constable wrote:

> If a band of Rumple-stiltskin Latins

I think you mean Rip van Winkle, but your point is well-made.

-- Curtis Clark http://www.csupomona.edu/~jcclark/ Mockingbird Font Works http://www.mockfont.com/
Re: [OT]
> It staggers the imagination to conceive of how this could happen. Real Irish Guinness was a constant in this world for centuries, and suddenly some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!)

There has always been variation in the way it was brewed internationally. The stuff we have in Ireland would have been too weak to last long in the African heat before refrigeration became so cheap. The sweeter, stronger African variety, as brewed in Nigeria, is now to go on sale here though, as immigrants from Africa are complaining that you can't get a proper Guinness in Ireland.

> Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-)

Bah, if we had any pull we could stop them making it increasingly colder and colder. They've already gone past the stage where you can't taste it (I understand heavily refrigerated beer is an American invention, and given the way American beer tastes this makes sense); soon it'll be served to you on a stick. I can't even remember if this thread was ever on topic. How did we get into this?

-- Jon Hanna | http://www.hackcraft.net/ | Toys and books for hospitals: http://santa.boards.ie/
Overload (was Re: Text Editors and Canonical Equivalence (was Coloured diacritics))
> No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input.

No, that is not a requirement of Unicode conformance.

BTW, I must confess to an inability to keep up with the level of mail on this list. There are so many things in these mails that are simply wrong, and insufficient time for knowledgeable people to correct them. I would just caution people to first consult the materials on the Unicode site (Standard, TRs, FAQs, etc.), and take much of what is on this list with a quite sizable grain of salt.

Mark

__ http://www.macchiato.com
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk scripsit: No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. Not so. Remember, the conformance requirement is not that a process can't distinguish between canonically equivalent strings (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) but that a process can't assume that *other* processes will distinguish between canonically equivalent strings. Equally, it can't assume that the other process will fail to distinguish them, either. In an environment in which C wide characters are Unicode characters, wcslen returns the number of distinct characters in the literal string. How many characters it contains depends on how many were placed in the source file by the author and what, if anything, has happened to the source file since. -- As you read this, I don't want you to feel sorry for me, because, I believe everyone will die someday. -- From a Nigerian-type scam spam I got. John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
On 09/12/2003 06:36, [EMAIL PROTECTED] wrote: Perhaps so does yours. It isn't clear whether the CSS for .red-text would have to over-ride the default behaviour whereby an inline element like span is rendered by stacking it to the left or right (depending on text directionality) of the previous inline element or text node, or if the accent should go over the e by default. Well, I would put it like this. Consider the following: (1) <span class="black-text">{U+00E9}</span> (2) <span class="black-text">e{U+0301}</span> (3) <span class="black-text">e<span class="black-text">{U+0301}</span></span> (4) <span class="black-text">e<span class="red-text">{U+0301}</span></span> I would expect (1), (2) and (3) to be rendered identically, and (4) to differ only in the colour of the accent, just as it would be (apart from (1)) if U+0301 were replaced by a regular letter. I am assuming nothing special defined in the CSS - the behaviour should be the same with a simple colour attribute. And so I would expect the behaviour of an in-line span element to be subtly different from its normal behaviour when the text starts with a combining mark. I think this is what any naive user would expect in the circumstances, and is also what is sensible. Briefly testing on a Win2000 box I found that IE6 ignored the styling on the accent, Mozilla 1.4 didn't show the accent, and Opera 7.2 displayed the red accent (tests had the same results with &#x0301; as with the combining character used directly). It isn't clear to me which, if any, of these are examples of conformant behaviour. Looking at existing implementations is a very bad guide to what behaviour is actually conformant, sensible, or expected by users. We have four independent variables here! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: plain text (was RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup))
Peter Constable scripsit: Perhaps we need some new terminology here. It might be helpful to describe an XML file as a plain-text-markup file (PTM, for acronym lovers), but reserve the term "plain text file" for files that contain text with no markup. Note that the terms being defined are "xxx file", not simply "plain text". Thus, John can continue to say that XML is plain text, but in some contexts that wouldn't be as useful as saying XML files are plain-text-markup files. Fair enough, though technically even plain-text files typically mark either line ends or paragraph breaks with markup (= control) characters. -- My corporate data's a mess! / It's all semi-structured, no less. / But I'll be carefree / Using XSLT / In an XML DBMS. John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Hmm. Now here's some C++ source code (syntax colored as Philippe suggests, to imply that the text editor understands C++ at least well enough to color it) int n = wcslen(L"café"); (That's int n = wcslen(L"café"); for those without HTML email) The L prefix on a string literal makes it a wide-character string, and wcslen() is simply a wide-character version of strlen(). (There is no guarantee that "wide character" means "Unicode character", but let's just assume that it does, for the moment). Even assuming that you can assume that wide characters are Unicode, you have not yet assumed in what kind of UTF they are. (Don't assume I'm deliberately making puns :-) The only thing that the C(++) standards say about type wchar_t is that it is not smaller than type char, so a wide character could well be a byte, and a wide character string could well be UTF-8, or even ASCII. So, should n equal four or five? Why not six? If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6. The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. The answer is: int n = wcslen(L"café"); That's why you take the burden of calling the wcslen library function rather than assuming a hard-coded value such as: int n = 4; // the length of string "café" There is more to consider than just how and whether a text editor normalizes. Whatever the editor does, what if the *compiler* then normalizes it? The source file and the compiled object file are not necessarily in the same encoding and/or normalization. A certain compiler could accept a certain range of input encodings (maybe declared with a command-line parameter) and convert them all into a certain internal representation in the compiled object file (e.g., Unicode expressed in a particular UTF and with a particular normalization). That's why library functions such as strlen or wcslen exist. You don't need to bother with what these functions will return in a particular compiler or environment, as long as the following code is guaranteed to work: const wchar_t * myText = L"café"; wchar_t * myBuffer = malloc(sizeof(wchar_t) * (wcslen(myText) + 1)); if (myBuffer != NULL) { wcscpy(myBuffer, myText); } If a text editor is capable of dealing with Unicode text, perhaps it should also be able to explicitly DISPLAY the actual composition form of every glyph. Again, this is neither possible nor desirable, because a text editor is not supposed to know how the compiler (or its runtime libraries) will transform string literals. The question I posed in the previous paragraph should ideally be obvious by sight - if you see four characters, there are four characters; if you see five characters, there are five characters. Provided that you can define what a character is... After a few years reading this mailing list, I haven't seen a single acceptable definition of "character". Moreover, I have formed the impression that it is totally irrelevant to have such a definition: - as an end user, I am interested in a higher level kind of objects (let's call them "graphemes", i.e. those things I see on the screen and can interact with using my mouse); - as a programmer, I am interested in a lower level kind of objects (let's call them "encoding units", i.e. those things that I count when I have to allocate memory for a string, or the like). The term "character" is in a sort of conceptual limbo which makes it pretty useless for everybody, IMHO. _ Marco
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
I (Marco Cimarosti) wrote: So, should n equal four or five? Why not six? Erratum: "six" should read "seven". If, in our C(++) compiler, type wchar_t is an alias for char, and wide character strings are encoded in UTF-8, and the é is decomposed, then n will be equal to 6. Erratum: "6" should read "7". Sorry. _ Marco
Re: Overload (was Re: Text Editors and Canonical Equivalence (was Coloured diacritics))
On 09/12/2003 10:01, Mark Davis wrote: No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. No, that is not a requirement of Unicode conformance. BTW, I must confess to an inability to keep up with the level of mail on this list. There are so many things in these mails that are simply wrong, and insufficient time for knowledgeable people to correct them. I would just caution people to first consult the materials on the Unicode site (Standard, TRs, FAQs, etc.), and take much of what is on this list with a quite sizable grain of salt. Mark, I understand your problem with the level of mail. But, in this case, I have read the appropriate section of TUS 4.0 and quote it here to prove it, from p.59: C9 A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct. ... Ideally, an implementation would always interpret two canonical-equivalent character sequences identically. ... Perhaps my error is that I have raised (or is it lowered?) "ideally would" to "should". So let me rephrase what I said before: If the wcslen() function is fully Unicode conformant, ideally it would give the same output whatever the canonically equivalent form of its input. Surely that is what C9 is saying. Or is the issue about whether such a function is a "process"? I didn't say that conformance implies that a process should normalise its input (I accept that that is not true), but only that for this particular function, counting the length of a string, sensible results can be given only if the string is normalised, or at least transformed in some other way which removes distinctions between canonically equivalent forms (e.g. normalisation with some kinds of modified data). I am tacitly assuming at this point that the function is part of a general-purpose library for use by users who are not interested in the details of character coding etc. I can see that different considerations may apply for an internal function within a Unicode processing and rendering implementation. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk wrote: So, should n equal four or five? The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. Standards and fantasy are both good things, provided you don't mix them up. wcslen has nothing whatsoever to do with the Unicode standard, but everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop. (One can imagine a second parameter specifying whether NFC or NFD is required.) One can imagine whatever (s)he wants, but should avoid claiming that his/her imagination corresponds to some existing standard. This makes the issue one not for the text editor but for the programming language or its string handling library. This is correct. The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible. Can you please cite the passage where the Unicode standard would not allow this? _ Marco
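To make Marco's point concrete, here is a minimal sketch (an illustration added here, not from the thread; it assumes a compiler where each of these code points occupies a single wchar_t unit, e.g. a 32-bit wchar_t): wcslen simply counts code units, so the four-versus-five question is settled entirely by what was put in the literal, not by any notion of canonical equivalence:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Precomposed spelling: c a f U+00E9 -- four units */
        const wchar_t *nfc = L"caf\u00e9";
        /* Decomposed spelling: c a f e U+0301 -- five units */
        const wchar_t *nfd = L"cafe\u0301";

        printf("NFC: %zu\n", wcslen(nfc)); /* prints 4 */
        printf("NFD: %zu\n", wcslen(nfd)); /* prints 5 */
        return 0;
    }

The two literals are canonically equivalent, yet the counts differ; nothing in the C standard makes wcslen fold them together.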
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 10:16, [EMAIL PROTECTED] wrote: Peter Kirk scripsit: No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. Not so. Remember, the conformance requirement is not that a process can't distinguish between canonically equivalent strings ... Remembered. This is not a conformance requirement, just an "ideally". See C9 and the posting I just made. ... (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) ... Not so. Normalisation is idempotent, i.e. the result of normalising an already normalised string (with the same normalisation form) is identical to that of not normalising it. So the normaliser doesn't need to know in advance if the string is normalised. Now it may be more efficient to test for normalisation first; but the conformance clause says nothing to stop you making implementation shortcuts. ... but that a process can't assume that *other* processes will distinguish between canonically equivalent strings. Equally, it can't assume that the other process will fail to distinguish them, either. In an environment in which C wide characters are Unicode characters, wcslen returns the number of distinct characters in the literal string. How many characters it contains depends on how many were placed in the source file by the author and what, if anything, has happened to the source file since. This implies that wcslen is not doing what C9 says that it "ideally... would" always do. But see the caveats in my other posting. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
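The idempotence Peter describes is easy to demonstrate with a toy composer (again a sketch added for illustration, handling only the thread's single example pair rather than the full Unicode composition data):

    #include <wchar.h>

    /* Toy composer: rewrites e + U+0301 to U+00E9 in place.
       Idempotent: the output never contains the sequence e + U+0301,
       so applying the function a second time changes nothing. */
    static void compose_toy(wchar_t *s)
    {
        size_t r = 0, w = 0;
        while (s[r] != L'\0') {
            if (s[r] == L'e' && s[r + 1] == 0x0301) {
                s[w++] = 0x00E9; /* fold the pair into the precomposed letter */
                r += 2;
            } else {
                s[w++] = s[r++];
            }
        }
        s[w] = L'\0';
    }

So, exactly as stated above, a normaliser need not know in advance whether its input is already normalised.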
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk scripsit: ... (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) ... Not so. Normalisation is idempotent Quite right. I should have said that normalization *checking* would be impossible. -- Only do what only you can do. --Edsger W. Dijkstra, deceased 6 August 2002 John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan
RE: [OT]
[...] some greedy investors turned it into a scam just for a quick buck (for surely it will be quick!) Sorry, I had to get that off my chest. Hopefully someone with some pull in Ireland will read this and do something about it :-) Or simply flush Guinne$$ and drink Murphix. :-) Ciao. Marco
Re: [OT]
At 06:54 AM 12/9/2003, Michael Everson wrote: Hm. We have a hot beverage symbol. Maybe we need a pint glass ... and combining shamrock and harp marks. JH Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] What was venerated as style was nothing more than an imperfection or flaw that revealed the guilty hand. - Orhan Pamuk, _My name is red_
Re: [OT]
At 11:06 -0800 2003-12-09, John Hudson wrote: At 06:54 AM 12/9/2003, Michael Everson wrote: Hm. We have a hot beverage symbol. Maybe we need a pint glass ... and combining shamrock and harp marks. I did get the shamrock in. ;-) -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 10:22, Marco Cimarosti wrote: Peter Kirk wrote: So, should n equal four or five? The answer would appear to depend on whether or not the source file was saved in NFC or NFD format. No, surely not. If the wcslen() function is fully Unicode conformant, it should give the same output whatever the canonically equivalent form of its input. That more or less implies that it should normalise its input. Standards and fantasy are both good things, provided you don't mix them up. wcslen has nothing whatsoever to do with the Unicode standard, but everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop. OK, as a C function handling wchar_t arrays it is not expected to conform to Unicode. But if it is presented as a function available to users for handling Unicode text, for determining how many characters (as defined by Unicode) are in a string, it should conform to Unicode, including C9. ... The Unicode standard does allow for special display modes in which the exact underlying string, including control characters, is made visible. Can you please cite the passage where the Unicode standard would not allow this? TUS 4.0 p.60 (part of C9): Even processes that normally do not distinguish between canonical-equivalent character sequences can have reasonable exception behavior. Some examples of this behavior include ... Show Hidden Text modes that reveal memory representation structure; ... Somewhere else I think there is more detail. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
[EMAIL PROTECTED] writes: You might as well say that C code is not plain text because it too is subject to special canons of interpretation. C, C++ and Java source files are not plain text as well (they have their own text/* MIME type, which is NOT text/plain, notably because of the rules associated with end-of-lines, notably in presence of comments). C, C++ and Java source files are plain text. I've seen text/cpp and text/java, but really there are no such types. I've also seen text/x-source-code which is at least legal, if of little value to interoperability. The correct MIME type for C and C++ source files is text/plain. This is where I disagree: a plain text file makes no difference of interpretation between characters' meta-linguistic meaning for the programming language that uses and needs them, and the same characters used to create string constants or identifier names. Unicode cannot, and must not, specify how the meta-characters used in a programming language must combine with other actual strings that are treated by the language syntax itself as _separate tokens_. This means that the concept of combining sequences MUST NOT be used across language token boundaries. These boundaries are outside the scope of Unicode, but part of the spec for the language, and they must be respected at the first level even before trying to create other combining sequences within the _same_ token. So even if text/c, text/cpp, text/pascal or text/basic are not officially registered (but text/java and text/javascript are registered...) it is important to handle text sources that aren't plain text as another text/* type, for example text/x-other or text/x-source or text/x-c or text/x-cpp. I'd be prepared to give good odds that that is the case with Java source files as well. As I said, text/java is the appropriate MIME type for Java source files. As source files (that is, at the stage in processing at which a human user can see the source and edit it) the only handling required for end-of-lines is conversion of new line function characters, the same as for any other use of plain text. The treatment of end-of-lines as significant when processed (for example following one-line // comments) is a matter of what an application chooses to do with a particular character. This is no different than an indexer deciding that a plain text file contains a particular word, or for that matter my putting coffee filters into my basket if I see coffee filters written on my shopping list. Just imagine what would be created with your assumption with this source: const wchar_t c = L'?'; where ? is a combining character. Using the text/plain content type for this C source would imply that it combines with the previous single quote. This would create an opportunity for canonical composition, and thus would create an equivalent source file which would be: const wchar_t c = L§'; where this § character is a composed character. Now the source file contains a syntax error and does not compile, even though the previous source compiled and gave the c constant the value of the code point coding the ? diacritic... Of course the programmer could avoid this nightmare by using escape sequences as in: const wchar_t c = L'\u0309'; or maybe (but less portable, as it assumes the runtime encoding form used by wchar_t is UCS-4 or UTF-16 or UTF-32, when the source file may be coded in a non-Unicode charset): const wchar_t c = (wchar_t)0x0309ul; But both XML/HTML/SGML and the various programming languages are plain text.
See text/xml, text/html and text/sgml MIME types. They also aren't text/plain so they have their own interpretation of Unicode characters which is not the one found in the Unicode standard. They have their own interpretation of the Unicode characters which is *in addition to* the one found in the Unicode standard. As do all but the simplest applications that use Unicode (as interesting as many of them are, characters are of little use on their own). This is not *in addition* but *instead of*, and thus this breaks the rule of Unicode conformance at that level, as the code point does not match the meaning REQUIRED by conforming applications as being a code point, coding an abstract character with a well-defined representative glyph and REQUIRED composability with surrounding characters. Note that a simple text editor such as NotePad can safely be used to edit source files, simply because it does not attempt to perform any normalization of the loaded or saved files, even when editing (there's not even an edit menu option to normalise any area of the text in the edit buffer). Most editors for programming languages treat individual characters as really individual and completely unrelated to each other. This means that they won't attempt any normalization, so characters will not be reordered, or
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
On 09/12/2003 10:41, [EMAIL PROTECTED] wrote: Peter Kirk scripsit: ... (otherwise a normalizer would be impossible; it wouldn't know whether to normalize or not!) ... Not so. Normalisation is idempotent Quite right. I should have said that normalization *checking* would be impossible. Agreed. C9 clearly specifies that a process cannot assume that another process will give a correct answer to the question "is this string normalised?", because that is to assume that another process will make a distinction between two different but canonically equivalent character sequences. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
RE: Glottal stops (bis)
Peter Constable posted: Michael, you've seen what they are using. How will the community be served when type designers start creating fonts that have a cap-height glyph for 0294 supplemented by a modified capital P? The characters can be seen on the web at http://www.wkss.nt.ca/HTML/08_ProjectsReports/PDF/DogribPlaceCaribouHabitat2002.pdf Search on "Small Clear Lake", "Jackfish", "Moosenose", and "glottal stop" in the file for a few of many examples. These characters are obviously being used. See also http://members.tripod.com/~DeneFont/win_char.htm There appear to me to be two possibilities for Unicode: 1. Encode a new character for the lowercase glottal stop and recategorize U+0294 as uppercase. 2. Encode two new characters and leave U+0294 as is. The second suggestion has the advantage that a font designer would be more free than otherwise to render the uppercase glottal stop to match more closely other uppercase characters in a particular lettering style. Jim Allan
RE: unification (CJKV history) ; Alphabetic Aramaic+ ...
I'm working on unification and would like to know more about the earliest CJKV work--was it from the RLG? The history of unification is laid out pretty clearly in Appendix A of TUS. I read a book on computerizing languages by a Sproat from Bell Labs--not as satisfying as I had hoped, although he had the good taste to mention Hebrew accents. Which book? A Computational Theory of Writing Systems or Morphology and Computation? Neither is really related to the topic of Han unification. What exactly are you looking for? Tree
RE: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)
Just imagine what would be created with your assumption with this source: const wchar_t c = L'?'; where ? is a combining character. The programmer would get bit. At best, there's no reason to assume that every compiler accepts UTF-8, beside the fact that you can't assume that the compiler or any intermediary step doesn't normalize. That's why Unicode escapes exist, and partially why Java as a general rule translates source into a form that uses Unicode escapes for non-ASCII characters. Even if you assume the compiler can accept Unicode text in whatever UTF you choose, it still seems needlessly dangerous to use a bare combining character instead of a Unicode escape or a numeric entity. Despite your distinction, there's no clear line between programming editors and non-programming editors. Any editor that gives you variable names in Hindi or Arabic is likely to have the sophistication needed to combine that ? with that ', and I see no reason they won't; quite possibly, the underlying system won't give them the option to handle Hindi or Arabic without combining that ? with that '. Emacs, for one notorious programming editor, fully plans to have that sophistication.
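The safe practice alluded to above can be sketched in C (assuming a C99 compiler with universal character names and a wchar_t of at least 16 bits; the variable names here are invented for illustration): write the combining mark as an escape, so that no editor, transcoder, or normalising tool ever sees a bare mark sitting next to a quote character:

    #include <wchar.h>

    /* A bare combining mark inside a literal invites recombination by
       any tool that normalizes the file; an escape cannot be touched. */
    const wchar_t combining_acute     = L'\u0301';           /* universal character name */
    const wchar_t combining_acute_num = (wchar_t)0x0301;     /* plain numeric value */

This is the same reasoning the poster attributes to Java's convention of escaping non-ASCII characters in portable source.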
RE: unification (CJKV history) ; Alphabetic Aramaic+ ...
Elaine Keown still in Texas Dear Tom Emerson: The history of unification is laid out pretty clearly in Appendix A of TUS. I hope this is online--and does it go all the way back to the 200 previous suggestions, some from the Chinese Language Computer Society? A Computational Theory of Writing Systems or I'm looking for previous thought on properties of scripts that affect how they are encoded. Sproat did that only a little--that was the disappointment. I think that encoding standards are actually the technical end of what they call sociolinguistics in linguistics departments, plus discussing a script's computational properties. Elaine
XML based mapping files.
Hello, I am trying to implement Unicode Technical Report #22 and I have a few questions about this specification. Since this specification is normative, XML must be the way to go when including local-encoding-to-Unicode mapping files in your application; this requires conversion of existing mapping files to XML form. Have other applications performed this conversion on their mapping files? Is there a tool which could be useful in doing the conversion, or can it only be done manually? ICU has an extensive repository of XML-based mapping files, but does any other reference source for XML-formatted mapping files or the alias table exist, apart from ICU? Thank you in advance. Regards, Shubhagam Gupta [EMAIL PROTECTED]
RE: Qumran Greek
Michael Everson wrote: At 13:34 -0800 2003-12-08, Elaine Keown wrote: I include 2 Qumran symbols that are probably Greek. Obviously it's impossible to tell from two tiny gifs I'm looking for help with the large 'X'. I would guess that the first of your symbols, if Greek, is a PARAGRAPHOS or a FORKED PARAGRAPHOS. It's also used in Coptic. The X looks like a CHI of course. I had the same feeling when I replied to Elaine that this may be an annotation added by a Coptic scribe within the Hebrew text. But it was hard to guess if this was the case. Coptic religious have made extensive studies in Egypt related to ancient texts in Hebrew, and it's quite natural that they may have mixed their own annotations in Coptic into the margins of the original Hebrew texts. It's exactly similar to annotating today a Han text with notes in English. So I'm not sure it needs a specific encoding, as this may just be a shift from one script to another. Elaine could look within her copy of the whole text to see whether there are occurrences other than just single symbols, i.e. added words, in the margins of the text.
RE: Qumran Greek
I have no problem with Qumran scribes being multilingual or using Greek symbols in either Coptic or Hebrew or Aramaic text. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Qumran Greek
At 00:27 +0100 2003-12-10, Philippe Verdy wrote: It's exactly similar to annotating today a Han text with notes in English. So I'm not sure it needs a specific encoding, as this may just be a shift from one script to another. Of course the PARAGRAPHOS characters are to be encoded in the Supplemental Punctuation block where they can be used for many scripts. There's a FORKED PARAGRAPHOS and a REVERSED FORKED PARAGRAPHOS, though, whose names may not be all that good if they can be used in a bidirectional context. Ken? -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Qumran Greek
Elaine in central Texas Hi, I would guess that the first of your symbols, if Greek, is a PARAGRAPHOS or a FORKED PARAGRAPHOS. It's also used in Coptic. Yes, both of those seem to be at Qumran. In Coptic, do you know what period of time they are from? The X looks like a CHI of course. Even though it's sort of curvy and oversized?--though of course you can't tell the size from this. I had been assuming that it was something else, since Emanuel Tov didn't name it as such and he mostly did Septuagint. Elaine
Re: Glottal stops (bis) (was RE: Missing African Latin letters (bis))
On 12/09/03 02:26, Peter Constable wrote: From: [EMAIL PROTECTED] on behalf of Kenneth Whistler Nobody is agitating for an uppercase apostrophe. Not in Canada, that I know of. (I've seen indication of languages in Russia that have a case distinction for ' and possibly also .) Early versions of Volapük used ʻ (U+02BB, I think) for the sound /h/, and specified that the uppercase apostrophe-shape was a boldface one. I can provide a scan, I think, if people think it matters. ~mark
Re: Text Editors and Canonical Equivalence (was Coloured diacritics)
Peter Kirk peterkirk at qaya dot org wrote: wcslen has nothing whatsoever to do with the Unicode standard, but everything to do with the *C* standard. And, according to the C standard, wcslen must simply count the number of wchar_t array elements from the location pointed to by its argument up to the first wchar_t element whose value is L'\0'. Full stop. OK, as a C function handling wchar_t arrays it is not expected to conform to Unicode. But if it is presented as a function available to users for handling Unicode text, for determining how many characters (as defined by Unicode) are in a string, it should conform to Unicode, including C9. wcslen() is very definitely presented as a function for counting _code_units_. You can't even rely on it to count Unicode characters accurately, if a wchar_t is 16 bits long, because supplementary characters will require 2 code units (a high and a low surrogate). Programmers rely on primitive functions like wcslen() to do what they do very rapidly, and not to change their meaning in new versions of the language standard. It would be very handy to have a suite of C functions that normalize their input string to any of NFK*[CD], or to compare strings or measure their length taking normalization into account, but those would have to be all-new functions. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
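What such an all-new function might look like, as a minimal sketch (the name wcslen_canonical is invented here; it handles only the e + U+0301 pair from the thread's example, where a real version would consult the full Unicode decomposition data):

    #include <wchar.h>

    /* Toy canonically-aware length: counts e + U+0301 as one character,
       so the NFC and NFD spellings of "café" both measure 4. */
    static size_t wcslen_canonical(const wchar_t *s)
    {
        size_t i = 0, n = 0;
        while (s[i] != L'\0') {
            if (s[i] == L'e' && s[i + 1] == 0x0301)
                i += 2; /* base letter plus combining acute */
            else
                i += 1;
            n++;
        }
        return n;
    }

Crucially, this is a new function with new semantics; the meaning of wcslen itself stays fixed, just as Doug argues it must.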
Re: XML based mapping files.
Shubhagam Gupta wrote: I am trying to implement Unicode Technical Report #22 and I have a few questions about this specification. Since this specification is normative, XML must be the way to go when including local-encoding-to-Unicode mapping files in your application; this requires conversion of existing mapping files to XML form. No Unicode Technical Report is normative in the sense that one must follow it in order to conform to the Unicode Standard. (If it were, it would be a Unicode Standard Annex.) UTRs contain additional information on the use of Unicode in certain environments, or guidelines for the use of Unicode with other standards. In particular, UTR #22 does *not* require existing mapping tables to be converted to XML. It provides an appropriate XML-based format for mapping tables, along with other suitable guidelines for things like fallback assignments. But there is no requirement to convert existing tables, and in fact the official tables available on the Unicode FTP site continue to be available in the plain-text Format A. Have other applications performed this conversion on their mapping files? Is there a tool which could be useful in doing the conversion, or can it only be done manually? ICU has an extensive repository of XML-based mapping files, but does any other reference source for XML-formatted mapping files or the alias table exist, apart from ICU? I don't know of any such tools, but there is a possibility that something could be put together using ICU. Indeed, while UTR #22 contains plenty of good material, I tend to think of it as public documentation of a format used by ICU and probably few others. BTW, anyone catch the error in section 4.2.1 of this UTR? -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/