Re: Questions on ZWNBS - for line initial holam plus alef
From: John Cowan [EMAIL PROTECTED]
Peter Kirk scripsit:
On 13/08/2003 11:09, Philippe Verdy wrote: ... For this reason, defective combining sequences (combining characters without a leading base character) should be forbidden (invalid for XML).
If there is even the remotest possibility of this happening, we need to know quickly!
As a member of the XML Core Working Group of the W3C, I can assure you that there is not even the remotest possibility of it.
OK, "forbidden" is possibly excessive. Would you prefer the term "strongly discouraged", in favor of a new encoding that could be used by applications that are concerned about security and parsing issues? If no such new encoding is proposed, XML Core WG members could at least discuss how to solve the security problems. There may be solutions that I have not thought of...
Re: Questions on ZWNBS - for line initial holam plus alef
- Original Message -
From: Jon Hanna [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 14, 2003 1:49 PM
Subject: RE: Questions on ZWNBS - for line initial holam plus alef
I do agree: an XML document could require the use at some place of a given attribute or element. If this attribute name follows the element name after a line break, which gets changed into a space during parsing, forcing XML parsers to treat SPACE+combining as an unbreakable grapheme cluster acting like a letter would have the effect of creating a new element name, which may violate the element name identity. Now suppose that the attribute name contains a colon: you have created a custom namespace name, under which you can add any element you like, even if this was forbidden by the content model of the reference schema.
1. SPACE is treated blindly as a SPACE by XML. String + space + combining + string would not be treated as a single token, no matter how that space was introduced. That's what you were complaining about in the first place (as far as I can make out).
2. While nmtokens can begin with a combining character, names cannot, nor can they contain spaces.
3. This would in no way change the content model. So even if the above two points didn't hold, they would only sneak the document past something which performed validation before parsing(!), and where the content model was already pretty loose (so it didn't complain about the unrecognised attribute). You've just discovered a way to disguise one document that isn't well-formed as a different document that isn't well-formed. l33t!
So this would invalidate existing documents, or create holes allowing insertion of arbitrary XML content, if the XML application does not validate the element names (the namespace + name pair) extremely strictly and completely exclude from processing any unrecognized element (including all its content and attributes).
This argument is not on friendly terms with the concept of causality.
This would be a breach in the content model which may have been validated and tested for security in another layer of the document encoding process (notably when XML documents are created from templates, such as XSL processors, or custom C source using simple template substitution).
Testing validity without testing well-formedness is not possible.
So for me the sequence SPACE+combining should not be acceptable as a valid grapheme cluster within element names or attribute names,
As it already isn't.
and thus would need to be excluded from NMTOKEN. The correct way to do it is to consider it NOT A LETTER, but a symbol (Sk), exactly like other spacing diacritics, which are already invalid in NMTOKEN.
Wait a second. That was my justification for why the fact that space+combining is ALREADY prohibited from NMTOKEN shouldn't be considered a failure on the part of XML to allow for freedom of choice with the strings used for NMTOKENs. Now you actually want to introduce this (already existent) feature.
There still remains the unresolved question of grapheme clusters that could span the starting < or ending > or /> of tags, or the leading & of an entity reference.
No there isn't. What goes before <, >, /> or & isn't a problem, since those are all non-combining characters and each starts a new unit for any sort of processing that treats more than one code point as a unit. What goes after < or & has to be a name (not an nmtoken) and as such is already prohibited from beginning with a combiner. What goes after > is already dealt with by the Charmod, and even if you ignore Charmod, apart from the possibility of normalisation turning the sequence U+003E, U+0338 into U+226F (a possibility that is well noted), it still isn't going to hurt.
One note: in Unicode, grapheme clusters (considered unbreakable) are more than just combining sequences! Look at CGJ, WJ, ZWJ, ... So what is after or *before* a base character may impact parsing grapheme clusters!
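The normalisation case mentioned above can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only. It shows that a ">" followed by U+0338 COMBINING LONG SOLIDUS OVERLAY composes under NFC into the single character U+226F NOT GREATER-THAN (U+226E is the corresponding composition for "<"), so a normalizer run over raw markup could in principle absorb a tag delimiter.

```python
import unicodedata

# ">" followed by COMBINING LONG SOLIDUS OVERLAY, as would occur if text
# content beginning with U+0338 came directly after the ">" that closes a tag.
raw = ">\u0338"

# NFC fuses the pair into one precomposed character.
composed = unicodedata.normalize("NFC", raw)

print([f"U+{ord(c):04X}" for c in composed])
print(unicodedata.name(composed))

# The same happens for "<" + U+0338, which composes to U+226E.
composed_lt = unicodedata.normalize("NFC", "<\u0338")
print([f"U+{ord(c):04X}" for c in composed_lt])
```

This only demonstrates the composition itself; whether a given XML toolchain normalizes before or after parsing is a separate question that the sketch does not address.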
As the well-formedness of XML documents goes even before its validity (which is optional, but required in some applications that need to parse the DOM-tree or InfoSet rather than), this impacts the way Unicode can be used (read it as embedded) within XML. Depending on where this encoded text is used (NMTOKENs, text elements, attribute values,...) the embedding constraints will be different, but in my opinion anonymous text elements and attribute values should both use the same encoding capabilities, as they both can (should be able to) represent any kind of valid Unicode plain text. As SPACE is handled differently in attribute values, this causes a problem for SPACE+NSM (considered valid but with imprecise properties for now). The constraints are less severe in anonymous text elements, as there exist several techniques (including CDATA sections) to represent them. In fact, XML will consider each text element or attribute value as an
RE: Questions on ZWNBS - for line initial holam plus alef
OK, it's safe, but it is a misuse of Unicode. As space plus combining character is a unit in Unicode, it should be treated as a unit by higher level protocols. If higher level protocols are allowed to do arbitrary things within Unicode units, there is no end to the possible confusion. See for example, from Unicode 4.0 chapter 3: C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation.
If this is not the case (I'm not entirely sure this bans what XML does with spaces) then all we would need is a change so that rather than a de facto ban on space+combining within names and nmtokens we would have an explicit ban on the same; then we'd all be happy, except possibly for some sadistic XML application designer who was planning on using that combination out of ill-will towards his or her colleagues.
Re: Handwritten EURO sign
At 23:35 +0200 2003-08-05, Pim Blokland wrote: I have absolutely no idea what you are talking about. You are lucky not having to put up with bad English like five euro and six cent, living in the Netherlands and speaking Dutch as you do. See http://www.evertype.com/standards/euro if you wish to learn more about a disaster in language planning. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Conflicting principles
Anyway, John J, what code are we talking about that has to work from the positions of the combining marks back to the underlying representation? Are you talking about OCR? No, the issue is more how to start from a base form and work forward to encompass the whole series of characters which need to be treated as one in certain processes, which can include cursor movement, hit testing, display, line breaking, collation, normalization. Collation isn't really based on combining sequences (even though UTS 10 specifies a certain spanning over non-blocking (combining) characters). Note in particular the following entry in the CTT (and with different syntax in the UTS 10 tables): U0E4D_0E32 S0E33;BASE;MIN;U0E33 % THAI CHARACTER SARA AM (and a similar one for Lao). This is a collation entry for a contraction of a combining mark followed(!) by (formally) a base character. (I'm not really sure what the true logical sequence would be, though.) /kent k
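The SARA AM entry above can be related to the character's compatibility mapping. This is a minimal sketch with Python's standard `unicodedata` module (it says nothing about how any particular collation implementation handles the contraction): U+0E33 decomposes to U+0E4D followed by U+0E32, that is, a nonspacing mark followed by a letter, which is the unusual mark-then-base order noted in the message.

```python
import unicodedata

sara_am = "\u0e33"  # THAI CHARACTER SARA AM

# Raw decomposition field from the character database.
print(unicodedata.decomposition(sara_am))

# Compatibility decomposition: the mark comes first, the letter second.
parts = unicodedata.normalize("NFKD", sara_am)
for ch in parts:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```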
Re: Questions on ZWNBS - for line initial holam plus alef
From: Peter Kirk [EMAIL PROTECTED]
I note that there is no line break opportunity in space, NBSP. But is there one after the space in space, RLM, NBSP? If so, RLM, NBSP, combining character has a third advantage, that it gives the right line break opportunity when this sequence is word initial, which it wouldn't do without the RLM.
Why make things so complicated when a new base character with the needed properties would be much simpler and easier to support in implementations? What is wrong with encoding new recommended alternatives to SPACE or NBSP, i.e. an invisible symbol, an invisible LTR letter, an invisible RTL letter? This way we can fix some issues in the current text of the UAXes but recommend that new writers use a new base character which will behave correctly without those overly complex hacks that users and implementers won't understand.
Re: Handwritten EURO sign (off topic?)
On 14/08/2003 09:54, Michael Everson wrote: Lepton in Greek was accepted from the beginning. Leptó pl leptá. The same word as the original widow's mite (Mark 12:42). Probably worth even less now! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Diacriticals and descents in upper case (was: Re: Caron / Hacek?)
On 2003.06.12, 18:38, Philippe Verdy [EMAIL PROTECTED] wrote:
Capital letters simply don't use ascents or descents, and thus they occupy a *smaller* space than the lowercase letters.
Some upper case letters commonly (i.e. in some typical fonts) have descents, especially, though not only, in italic style:
U+0047 : LATIN CAPITAL LETTER G
U+004A : LATIN CAPITAL LETTER J
U+0051 : LATIN CAPITAL LETTER Q
U+005A : LATIN CAPITAL LETTER Z
U+01B7 : LATIN CAPITAL LETTER EZH
U+0396 : GREEK CAPITAL LETTER ZETA
U+0414 : CYRILLIC CAPITAL LETTER DE
U+0423 : CYRILLIC CAPITAL LETTER U
U+0426 : CYRILLIC CAPITAL LETTER TSE
U+0429 : CYRILLIC CAPITAL LETTER SHCHA
U+046E : CYRILLIC CAPITAL LETTER KSI
In some cases, there is no space in the font point size to put some upper diacritics above the letter, and the diacritic will almost always be written after the base character, sometimes with a distinct glyph, if the printed lines must fit in narrow lines (to save paper in books).
This is indeed the current practice in Czech and Slovak, as said in the thread, but it's completely out of fashion to do so for, at least, Portuguese. Nineteenth-century books do have E for UC e-acute, but that has been replaced by É in all quality media for quite a long time.
the color of a font is not what you think:
Note that what I think may not be what you think I think... ;-)
-- . António MARTINS-Tuválkin, | ()| [EMAIL PROTECTED] || R. Laureano de Oliveira, 64 r/c esq. | PT-1885-050 MOSCAVIDE (LRS) Não me invejo de quem tem | +351 934 821 700 carros, parelhas e montes | http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe | http://pagina.de/bandeiras/ a água em todas as fontes |
Re: Unicode 4.0 is online at last!
Peter Kirk suggested...
Interesting and a little embarrassing that Unicode's own documentation is not Unicode compatible!
I don't think it's very embarrassing... The Unicode Consortium after all doesn't produce book editing and typesetting software; we use other people's software. I think it's rather amazing that we can now actually produce a PDF of the entire book. This is incredibly better than the situation ten years ago. In any case, perhaps you can suggest a Unicode-conformant authoring tool that is up to the task of editing and typesetting the standard itself? It must have at least the capability of FrameMaker 6 (i.e., tables, figures, sectioning, table of contents, index, etc.) whilst implementing the full standard, including all scripts... Even the ones that would be newly defined in the next version... ;-)
Rick
Re: Handwritten EURO sign (off topic?)
- Original Message -
From: Marco Cimarosti [EMAIL PROTECTED]
António Martins-Tuválkin wrote: After all the euro is a common currency and its figures should be written in a common way.
Why?
Very good question. Multilingual countries like Belgium or Canada already were or are writing the same amounts using different cultural conventions depending on the language of the text where they appear. Otherwise, I'm personally quite flexible if only one convention is used and imposed upon all, as long as it is the French one ;-)
P. Andries - o - 0 - o - Unicode en français http://pages.infinit.net/hapax (translation of UTR 20 in progress)
Re: Compatibility decompositions
John Cowan asked:
I realize that existing compatibility decompositions are a rag-bag, especially those marked with the generic compat tag rather than one of the specific tags such as font, initial, or super. I wonder what principles, if any, can be enunciated for giving a newly introduced character a compatibility decomposition at the present time?
Fortunately, I have just the material to hand to answer such a question -- a file listing all the additions to Unicode 3.2 and Unicode 4.0. We can look in those tea leaves and divine the probable intentions of the UTC, based on a pretty good sampling of 2000+ recent character additions.
03F9;GREEK CAPITAL LUNATE SIGMA SYMBOL;Lu;compat 03A303F2;
Reason: uppercase of U+03F2, which has a compatibility mapping
1D2C;MODIFIER LETTER CAPITAL A;Lm;super 0041;
...
1D61;MODIFIER LETTER SMALL CHI;Lm;super 03C7;
Reason: analogy to existing superscript modifier letters
1D62;LATIN SUBSCRIPT SMALL LETTER I;Ll;sub 0069;
...
1D6A;GREEK SUBSCRIPT SMALL LETTER CHI;Ll;sub 03C7;
Reason: analogy to existing superscript modifier letters (but these are *sub*script)
2047;DOUBLE QUESTION MARK;Po;compat 003F 003F;
Reason: analogy to existing U+2048..U+2049
2057;QUADRUPLE PRIME;Po;compat 2032 2032 2032 2032;
Reason: analogy to existing U+2033..U+2034
205F;MEDIUM MATHEMATICAL SPACE;Zs;compat 0020;
Reason: analogy to existing fixed-width spaces
2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;super 0069;
Reason: analogy to existing U+207F superscript n
213D;DOUBLE-STRUCK SMALL GAMMA;Ll;font 03B3;
...
2149;DOUBLE-STRUCK ITALIC SMALL J;Ll;font 006A;
Reason: analogy to existing font variant letterlike symbols
2A0C;QUADRUPLE INTEGRAL OPERATOR;Sm;compat 222B 222B 222B 222B;
Reason: analogy to existing U+222C..U+222D
2A74;DOUBLE COLON EQUAL;Sm;compat 003A 003A 003D;
2A75;TWO CONSECUTIVE EQUALS SIGNS;Sm;compat 003D 003D;
2A76;THREE CONSECUTIVE EQUALS SIGNS;Sm;compat 003D 003D 003D;
Reason: symbols were explicitly representing sequences of elements, but were single entities in the math entity set
309F;HIRAGANA DIGRAPH YORI;Lo;vertical 3088 308A;
30FF;KATAKANA DIGRAPH KOTO;Lo;vertical 30B3 30C8;
Reason: vertical ligated variants of Japanese syllable sequences
321D;PARENTHESIZED KOREAN CHARACTER OJEON;So;compat 0028 110B 1169 110C 1165 11AB 0029;
321E;PARENTHESIZED KOREAN CHARACTER O HU;So;compat 0028 110B 1169 1112 116E 0029;
3250;PARTNERSHIP SIGN;So;square 0050 0054 0045;
Reason: analogy with all the rest of the existing squared compatibility characters originating in Korean standards
3251;CIRCLED NUMBER TWENTY ONE;No;circle 0032 0031;21
...
32BF;CIRCLED NUMBER FIFTY;No;circle 0035 0030;50
Reason: analogy with existing circled number characters
32CC;SQUARE HG;So;square 0048 0067;
...
33FF;SQUARE GAL;So;square 0067 0061 006C;
Reason: analogy with the rest of the existing squared compatibility characters originating in Korean standards
FDFC;RIAL SIGN;Sc;isolated 0631 06CC 0627 0644;
Reason: explicit request in the proposal to provide decomposition, approved by the committees
FE47;PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET;Ps;vertical 005B;
FE48;PRESENTATION FORM FOR VERTICAL RIGHT SQUARE BRACKET;Pe;vertical 005D;
Reason: analogy with existing vertical form variants
FF5F;FULLWIDTH LEFT WHITE PARENTHESIS;Ps;wide 2985;;*;;;
FF60;FULLWIDTH RIGHT WHITE PARENTHESIS;Pe;wide 2986;;*;;;
Reason: analogy with existing fullwidth characters
1D4C1;MATHEMATICAL SCRIPT SMALL L;Ll;font 006C;
Reason: analogy with the rest of the math alphanumerics
And then there are canonical equivalences added:
2ADC;FORKING;Sm;2ADD 0338;;not independent;;;
Reason: analogy with the other negated math symbols (and allowable under Unicode stability policies because the base character U+2ADD was encoded at the same time)
FA30;CJK COMPATIBILITY IDEOGRAPH-FA30;Lo;4FAE;
...
FA6A;CJK COMPATIBILITY IDEOGRAPH-FA6A;Lo;983B;
Reason: analogy with the treatment of all the other Han compatibility characters.
So you can see from this that the overwhelming reason for providing a compatibility (or canonical) decomposition for a newly encoded character is analogy with the treatment of existing characters which are arguably just like the character newly encoded. The reason for that is *consistency* in the standard. It would be less useful to have some characters treated one way for decompositions and others (inexplicably, from the point of view of implementers) treated another.
In particular, is it sufficient that the character strongly resembles an existing character or combination of characters, but for one or another reason needs to be distinct from it?
I don't think strong resemblance to an existing character is enough. There were plenty of examples among the math symbols of symbols
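The decomposition tags in the listing above can be inspected programmatically. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `split_decomposition` and the sample list are illustrative and not part of any Unicode tooling. It separates the optional tag (compat, super, font, vertical, ...) from the mapped code points, and treats an untagged mapping as canonical.

```python
import unicodedata


def split_decomposition(ch):
    """Return (tag, code points); tag is None for a canonical mapping."""
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return fields[0].strip("<>"), fields[1:]
    return None, fields


# A few of the characters listed in the message.
samples = ["\u2047", "\u2057", "\u2071", "\u213d", "\u309f", "\u2adc"]

for ch in samples:
    tag, mapping = split_decomposition(ch)
    kind = tag if tag is not None else "canonical"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {kind} -> {' '.join(mapping)}")
```

The output depends on the Unicode version bundled with the Python interpreter, but the mappings for these characters are covered by the stability policies and should not change.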
RE: Pre-orders of The Unicode Standard, Version 4.0
-Original Message- From: John Cowan [mailto:[EMAIL PROTECTED] Sent: Thursday, August 14, 2003 10:20 AM To: Magda Danish (Unicode) Cc: Unicode Core List; [EMAIL PROTECTED] Subject: Re: Pre-orders of The Unicode Standard, Version 4.0 Thanks. Is the Unicode Consortium in any way benefited (or disadvantaged) if non-members order through it rather than through Amazon or BN? The Unicode Consortium has an Associate agreement with both Amazon and BN so we do benefit from members and/or non-members purchasing the book through either of them, as long as they follow the link (to Amazon or BN) from the Unicode website. Magda
Re: Questions on ZWNBS - for line initial holam plus alef
Peter, in XML you really don't want to use attributes for any general text; there are too many restrictions on the content. For example, we never put translatable text into them. Attributes should really be treated more like sequences of symbols, with a constrained syntax. This is also not in violation of the Unicode conformance clause. A space plus combining character is a unit in some sense. That is, it is a combining character sequence (and grapheme cluster). However, there is no clause that says that such units cannot be changed, or that any particular sequence of characters cannot be changed; operations such as case mapping or normalization do just that, they change characters. There are restrictions on what can be changed *if* a process purports to not modify the text (C10). But an XML parser is certainly capable of interpreting a sequence A B, and deciding that it wants to change A to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek Alpha, *that* would be a violation of C7. But interpreting a space as a space, then deciding to modify it, is perfectly legit. Mark __ http://www.macchiato.com Eppur si muove - Original Message - From: Peter Kirk [EMAIL PROTECTED] To: John Cowan [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wednesday, August 13, 2003 05:09 Subject: Re: Questions on ZWNBS - for line initial holam plus alef On 12/08/2003 20:28, John Cowan wrote: Peter Kirk scripsit: 2) In attribute values, LF, CR, and TAB characters are normalized to spaces. Not relevant here. This would be relevant if it is legal for the character after LF, CR, and TAB to be a combining mark. Is this legal? In this case what was previously a defective (but legal) combining sequence would turn into a non-defective one, but the intended whitespace would be lost. The point is that there is no such thing as an *intended* line break in an attribute value; it will *always* be translated to a space before the application sees it. 
(More exactly, line-break characters can be inserted into attribute values, but only with the use of a numeric character reference such as &#xA;.)
Sorry, I'm confused. Are you saying that the input processing will translate line breaks into spaces within attribute values, unless inserted as &#xA; ? Well, I suppose this is fair enough as it is up to the user not to enter garbage.
Not just a rendering glitch, I suspect. If the combining character is combined with the separating space, the space loses many of its separating functions, and perhaps keeps a confusing subset of them with all sorts of possibilities of error.
The space(s) will be used to separate individual tokens at processing time. No spacing diacritic (either single-character or space+combining) is permitted in a NMTOKEN.
OK if this is clearly illegal, but this might restrict use of some languages in NMTOKEN. Would NBSP + combining be allowed?
At best tokens beginning with combining characters will be unusable. At worst they will crash the implementation (and count on someone trying deliberately to do that!).
In effect, the combining character will constitute a defective combining sequence at the beginning of the individual token.
Stepping away from the letter of the standard for a moment, there is no real reason to begin a NMTOKEN with a combining character. It is only allowed as a result of the miscegenation of SGML concepts with Unicode ones. In SGML's original design of tokens, they consisted of letters and digits (and a few punctuation marks, which functioned as letters). There were four kinds: a NUMBER could contain only digits, a NAME could not begin with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no restrictions. ID and IDREF had the same syntax as NAME with additional semantics.
Later, the categories letter and digit were generalized, by redefining the concrete syntax, to be whatever you wanted, and were renamed name-start and name characters (technically, a name character was a letter *or* a digit). When SGML was simplified to produce XML, only NMTOKEN, the most general type of token, was kept. However, in order to keep the semantics of letter and digit in the Unicode world, letter was extended to be any letter and digit to be any digit *or* combining character. That worked well for ID and IDREF, since treating combining characters as part of digit prevented them from appearing first, as was only sensible. Unfortunately, NMTOKENs, since there were no restrictions, became able to begin with a combining character, though that made no real sense. To write in a restriction would make it impossible to specify XML's concrete syntax in SGML terms, which did not allow for three different classes of characters within tokens. So we wound up with a basically useless capability that if used will only cause trouble. There is some
Re: Questions on ZWNBS - for line initial holam plus alef
From: Peter Kirk [EMAIL PROTECTED]
There is some potential for real trouble here, if one process outputs an NMTOKEN starting with a combining character preceded by a separating space, or something else which is changed into a space, and another process takes the new space plus combining character as a unit and so doesn't recognise the separation. Any hackers and virus programmers reading this will soon start flooding the Internet with tokens beginning with combining characters in the hope of crashing implementations or finding back doors. Of course this wouldn't have been a problem if Unicode had never defined space plus combining character as legal and meaningful. But this is not my problem!
I do agree: an XML document could require the use at some place of a given attribute or element. If this attribute name follows the element name after a line break, which gets changed into a space during parsing, forcing XML parsers to treat SPACE+combining as an unbreakable grapheme cluster acting like a letter would have the effect of creating a new element name, which may violate the element name identity. Now suppose that the attribute name contains a colon: you have created a custom namespace name, under which you can add any element you like, even if this was forbidden by the content model of the reference schema. So this would invalidate existing documents, or create holes allowing insertion of arbitrary XML content, if the XML application does not validate the element names (the namespace + name pair) extremely strictly and completely exclude from processing any unrecognized element (including all its content and attributes). This would be a breach in the content model which may have been validated and tested for security in another layer of the document encoding process (notably when XML documents are created from templates, such as XSL processors, or custom C source using simple template substitution).
So for me the sequence SPACE+combining should not be acceptable as a valid grapheme cluster within element names or attribute names, and thus would need to be excluded from NMTOKEN. The correct way to do it is to consider it NOT A LETTER, but a symbol (Sk), exactly like other spacing diacritics, which are already invalid in NMTOKEN. There still remains the unresolved question of grapheme clusters that could span the starting < or ending > or /> of tags, or the leading & of an entity reference. For this reason, defective combining sequences (combining characters without a leading base character) should be forbidden (invalid for XML). So there remains an unsolved conflict here: defective combining sequences cause security or validity problems in XML documents, and a non-defective SPACE+combining sequence also causes security problems. There's no secure choice to represent spacing diacritics which are not already encoded in a precomposed form...
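The "something else which is changed into a space" case raised in this thread can be observed with an ordinary XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` (backed by expat); the element and attribute names `e` and `a` are made up for illustration. A literal line feed inside an attribute value is normalized to a space, so a combining mark that followed the line feed ends up directly after a space, whereas a line feed written as a numeric character reference survives.

```python
import xml.etree.ElementTree as ET

COMBINING_ACUTE = "\u0301"

# A literal line feed inside an attribute value, followed by a combining mark.
literal = ET.fromstring(f'<e a="x\n{COMBINING_ACUTE}y"/>').get("a")

# The same line feed written as a numeric character reference.
escaped = ET.fromstring(f'<e a="x&#xA;{COMBINING_ACUTE}y"/>').get("a")

# The parser hands the application a SPACE in the first case and a real
# LINE FEED in the second; the combining mark itself is left untouched.
print([f"U+{ord(c):04X}" for c in literal])
print([f"U+{ord(c):04X}" for c in escaped])
```

The sketch shows only what the parser reports to the application; it does not show any later process treating the space plus combining mark as a unit, which is the scenario the messages above disagree about.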
Re: Unicode 4.0 is online at last!
On 11/08/2003 17:37, Kenneth Whistler wrote: Well, I've been promising that good things would come to those who wait. ;-) At last, the Unicode website has been updated with the online chapters for Unicode 4.0. See: http://www.unicode.org/versions/Unicode4.0.0/ Or just go to the Unicode 4.0 link from the home page. Enjoy. --Ken P.S. Just FYI, Peter K., now it is o.k. for everyone to come back from their August Unicode vacations. Let the textual criticism begin! The documentation is great, but I have had some problems copying text from it (with Acrobat Reader 5), in particular with text in small capitals e.g. Unicode character names. For example, I get the following from p.44: The sequence of Unicode characters U+0061 a + U+0308 ! + U+0075 u unambiguously encodes u not a. I mentioned this on another list, and received the following as part of a reply from an expert on PDF format: For example, here is some text copied and pasted from the Unicode Standard, p.44, http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf: Interesting choice, since this document was NOT produced using a Unicode-aware authoring tool - they used FrameMaker 6, which doesn't do Unicode. FrameMaker was able to pass enough information into Acrobat Distiller so that SOME of the fonts used have ToUnicode tables - but they appear to be limited to symbol fonts and a few extra glyphs... Therefore, without this information in the PDF, Acrobat is (understandably) unable to properly extract Unicode-based information from the document. Interesting and a little embarrassing that Unicode's own documentation is not Unicode compatible! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Pre-orders of The Unicode Standard, Version 4.0
Magda Danish (Unicode) scripsit: To order, please use the book order form at http://www.unicode.org/book/bookform.html Thanks. Is the Unicode Consortium in any way benefited (or disadvantaged) if non-members order through it rather than through Amazon or BN? -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com If he has seen farther than others, it is because he is standing on a stack of dwarves. --Mike Champion, describing Tim Berners-Lee (adapted)
Re: Colourful scripts and Aramaic
At 13:12 -0700 2003-08-07, Peter Kirk wrote: Well, it seems to me that in the case of the Aramaic proposal we don't even have that. We have an archaic version of the script which is now used mainly for Hebrew, and which many scholars still call Aramaic (in distinction from paleo-Hebrew) although Unicode calls it Hebrew. The Aramaic glyphs are almost all recognisably the same as or slight variants on the Hebrew ones. And Hebrew script is already used, uncontroversially, for large corpora of Aramaic e.g. in the Talmud. Why a new script for the few surviving examples of ancient Aramaic in this script? People. It's the widespread offshoot used throughout the Middle East that spawned Brahmic and Uighur and other scripts. It isn't necessarily the thing you think is confined to three scraps of papyrus or whatever. We aren't working actively on this now. We don't have an active proposal. We have something roadmapped, and I for one don't want to spend time right now defending its roadmapping to you apart from what is in my earlier paper on Semitic scripts. Could you turn off the fire alarms? -- Michael Everson * * Everson Typography * * http://www.evertype.com
Unicode 4.0.1 Beta period now starting
The beta period for Unicode 4.0.1 has now started. Detailed information is available on the beta page: http://www.unicode.org/versions/beta.html Beta versions of Unicode 4.0.1 data files are now available for public comment here: http://www.unicode.org/Public/4.0-Update1/ This is the first update of Unihan.txt since Unicode 3.2, and it includes a large number of corrections and additions. There are several other minor changes to other data files. The beta period closes on August 18, 2003. Since time is short, developers are asked to please focus quickly on the data file review if you have not yet done so. Beta period comments will be reviewed by the Unicode Technical Committee at the upcoming meeting starting August 25, 2003. If you have any feedback on any of the beta files, please submit it by August 18, 2003. You can submit feedback via the online reporting page here: http://www.unicode.org/reporting.html Note: If you are a liaison representative, please forward this message as appropriate within your organization.
[hebrew] Re: Roadmap---Mandaic, Early Aramaic, Samaritan
Elaine, I really, really, really don't have time to debug your dissatisfaction with the use of the word Aramaic in the Roadmaps. This is NOT something anyone is working actively on right now. When a proposal comes forth, there will be evidence in it that can be picked at.
In actuality, one could make a very good case that all extant Semitic/ extended Aramaic-Moabite-Amorite-Yaudic-Hebrew etc. type alphabetic scripts between the earliest Sinaitic / Wadi El-Hol---and middle Parthian are font variants
We are not going to encode Phoenician and Samaritan and Palmyrene as font variants of Hebrew. If you want to write those languages in Hebrew script, do so.
Any border(s) you draw will be either completely artificial or mostly artificial. That's the problem.
The borders we draw are based on the analyses of script experts.
I gather that you are a font person, fascinated by the aesthetic pleasure of wondrous shapes.
I am a lot more than that.
I am a database person, concerned with minimizing unnecessary font variation, which may interfere with future overworked Semitic retrieval engines.
You will never be at as great a disadvantage as a Sanskritist is, considering that the Rg Veda can be written in a dozen or so scripts.
The Mandaic and Samaritan scripts apparently enjoy at least some modern liturgical use.
Yes, they do! But the Samaritan is also heavily used within Jewish studies / Biblical studies communities. The Samaritans also use their shapes in private correspondence.
Then we shall encode them.
of Aramaic script to encode has not been looked at carefully.
Indeed we have no current proposals which are well-advanced at this time.
I'm responding now because this may be the only time period where Hebraists interact with Unicode. Carpe diem.
Hebraists are discussing concerns about METEG and things. You're responding about things which don't even have formal proposals to respond to.
If you want me to start working on encoding other early Semitic scripts, please give generously to the Script Encoding Initiative and ask for prioritization. Failing that, I will be working on things which have higher priority (and more complete proposals) at present, like Coptic, Saurashtra, Nuskhuri, Buginese, N'Ko, Ol Chiki, Avestan and Pahlavi, and so on. I am responding at great length to the Roadmap proposals for the Semitic dialects Mandaic, Early Aramaic, and Samaritan. We are proposing to encode scripts, not languages. Yes, that is your take on it. But scripts are frozen language, not the liquid language of speech or the gaseous language of poetry.. You encode scripts so we can manipulate languages We encode scripts so that we can represent texts. And we will do it, as we have, to the best of our ability, but not by lumping everything together just because it makes things easy for database programmers. Best regards, -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter responded to Mark: On 05/08/2003 14:40, Mark Davis wrote: Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. The standard specifically states in a number of places that to exhibit a combining mark in isolation you use a space (or NBSP). Mark __ http://www.macchiato.com ► “Eppur si muove” ◄ I got this from the Unicode Standard 4.0, as quoted by Jim Allan: *Mis*quoted by Jim Allan. In http://www.unicode.org/book/preview/ch03.pdf the space characters in general are given class Zs: Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. That piece of text is *NOT* a quotation from Chapter 3 of Unicode 4.0. Go to that URL and search for it yourself. It is quoted from Chapter 4 of Unicode *3.0*, p. 88, in the discussion of General Category in Section 4.5, General Category -- Normative in Part. The corresponding paragraph has been deleted from the relevant section in Unicode 4.0, precisely because the standard now precisely defines format control characters as {Cf, Zl, Zp} but *ex*cluding Zs. See p. 25 in: http://www.unicode.org/book/preview/ch02.pdf So the various space characters (class Zs) are also classified as format characters. From http://www.unicode.org/book/ch04.pdf: _D13 Base character:_ a character that does not graphically combine with preceding character, and that is neither control nor a format character. Accordingly, by definition, spaces are not base characters. This conclusion is false. 
As Mark indicated, SPACE (and NBSP) are base characters, and have been treated as such in terms of diacritic application since Unicode 1.0 was published: By convention, diacritical marks used by the Unicode encoding scheme may be exhibited in (apparent) isolation by applying them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This might be done, for example, when talking about the diacritical mark itself as a mark, rather than using it in its normal way in text. -- Unicode 1.0, p. 19 [1991] And that *is* an accurate quote from the standard. In Unicode 4.0 that text survives as: By convention, diacritical marks used by the Unicode Standard may be exhibited in (apparent) isolation by applying them to U+0020 SPACE or to U+00A0 NON-BREAKING SPACE. This tactic might be employed, for example, when talking about the diacritical mark itself as a mark, rather than using it in its normal way in text. -- Unicode 4.0, p. 46 [2003] I'd say the intent of the UTC and the Unicode Standard in this regard has always been rather clear and has stayed unchanged for quite some time. --Ken
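The General Category values Ken cites can be checked against the character database that ships with a language runtime. What follows is a minimal sketch in Python, assuming the standard `unicodedata` module; its data is from a later Unicode version than 4.0, but these particular assignments match what Ken describes. The helper name `isolated` is purely illustrative and is not part of any standard API.

```python
import unicodedata

# General Category of the characters discussed above.
categories = {
    "SPACE": unicodedata.category("\u0020"),
    "NO-BREAK SPACE": unicodedata.category("\u00A0"),
    "LINE SEPARATOR": unicodedata.category("\u2028"),
    "PARAGRAPH SEPARATOR": unicodedata.category("\u2029"),
    "ZERO WIDTH JOINER": unicodedata.category("\u200D"),
}


def isolated(mark: str, base: str = "\u00A0") -> str:
    """Hypothetical helper: exhibit a combining mark in apparent isolation
    by applying it to SPACE or NO-BREAK SPACE, as the standard's convention
    describes."""
    if unicodedata.category(mark) not in ("Mn", "Mc", "Me"):
        raise ValueError("not a combining mark")
    return base + mark


for name, cat in categories.items():
    print(f"{name}: {cat}")
print(repr(isolated("\u0301")))  # NO-BREAK SPACE + COMBINING ACUTE ACCENT
```

The two space characters report Zs, while the line and paragraph separators report Zl and Zp, which is the distinction Ken draws between spaces and format control characters.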
RE: Questions on ZWNBS - for line initial holam plus alef
the solution with SPACE is really tricky due to the special treatment of SPACE notably in HTML, SGML, XML I disagree. There are a few different things that happen with whitespace in such technologies. Some of these only apply to elements that do not allow any character data apart from whitespace to appear directly within them, and hence are not an issue here. Some happen at relatively high level of processing, e.g. rendering (not parsing) of HTML, and as such should correctly process spaces combined with combining characters. There are only two theoretical problems that I can see here, the first is that a whitespace character other than space gets converted to space by attribute value normalisation, and that this changes the meaning of the text in some way. This could only occur if the combining character were the first character in a line of text, which is quite a nonsensical construct to begin with. The other would be with names, qnames, nmtokens and such. These are not normal textual content; they are human-readable constructs that are based on normal text because that makes it easier for some developers to work at a plain-text level (if they speak the natural language that the human-readable constructs were based on). Support for the linguistic oddity of a dialectic divorced from the context in which it would normally exist would have little justification in this place except for fulfilling the general goal of completeness. Completeness is a laudable aim of course, but extreme edge-cases need only be brought in if they are both safe and cheap. Anyone designing an XML application who frequently considers isolated diacritics as the most natural choice in part of such tokens probably needs to take a couple of weeks holidays before continuing the design. Of course some of the characters that could be considered to be precomposed isolated diacritics are banned from use in nmtokens anyway.
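The first theoretical problem above, attribute value normalisation turning a line break into a space, can be observed directly. This is a small sketch, assuming Python's standard `xml.etree.ElementTree` parser; the element and attribute names `a` and `b` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# An attribute value containing a line break immediately followed by
# COMBINING ACUTE ACCENT (U+0301), i.e. a combining mark at the start of a line.
doc = '<a b="x\n\u0301y"/>'

value = ET.fromstring(doc).get("b")

# After parsing, the line break has been normalised to U+0020 SPACE,
# so the mark now follows a space rather than a line break.
print([f"U+{ord(c):04X}" for c in value])
```

The parser does not treat the resulting space plus combining mark as anything special; it is still an ordinary space followed by an ordinary combining character.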
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk peter dot r dot kirk at ntlworld dot com wrote: Point taken. But when different fonts and rendering engines give different results because the standard is unclear or ambiguous, that is a matter for the discussion here. And when conforming fonts and rendering engines fail to give the required results, that may also be because of a deficiency in the standard. Or it may not. It may be a deficiency in the level of Unicode support afforded by the fonts and rendering engines. It may simply reflect a difference between your requirements and what the standard promises, and doesn't promise. It seems that many rendering engines give to the sequence space, combining mark the width normally assigned to a space. Is this actually what the standard suggests? The standard doesn't say anything about width in this case. It leaves it up to the display engine, which is as it should be. I have identified a need to display combining marks with no extra width, only the width required by the mark. Should the sequence space, combining mark do what I want, or shouldn't it? If so, this needs to be spelled out so that rendering engines know what they are supposed to do. If not, there may be a need for a new character. This is a deficiency in the standard, not in the rendering engines. When the specific alignment of isolated glyphs is important to me, I use markup. I'm a big supporter of plain text, as many members of this list know, but the exact spacing of isolated combining marks seems like a layout issue to me. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
there is no such thing as NFD decompositions. Sorry for the confusion. Still even with a NFKD decomposition, And there is no such thing as NFKD decomposition either. It goes as follows, in steps: 1. Canonical and compatibility decomposition mappings (one-step), and canonical classes. 2. Canonical and compatibility full/recursive decompositions and canonical reordering. The compatibility (full) decompositions make use of both the canonical and compatibility decomposition mappings. 3. Canonical and compatibility equivalences. 4. The four Unicode normal forms (NFD, NFC, NFKD, and NFKC). Please don't turn it upside down, that's only confusing! Ok, the formal definition of equivalences and normal forms are a bit backwards in The Unicode standard, defining NFD (in practice, though not the name) before the equivalences. Normally, a normal form is defined as a particular representative element in an equivalence class... But there is no need to aggravate the backwardsness into cyclicity. ... It's true that not all (only most) combining non-spacing characters have a non-combining spacing counterpart. Only a *few* g.c. Mn characters have spacing counterparts! /kent k
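Kent's layering (one-step mappings first, normal forms last) can be seen in how a typical library exposes the data. The sketch below assumes Python's standard `unicodedata` module; the sample characters and the helper `code_points` are chosen only for illustration.

```python
import unicodedata


def code_points(text: str) -> str:
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


# Step 1: one-step decomposition mappings and canonical combining classes.
print(unicodedata.decomposition("\u00E9"))  # canonical mapping for e-acute
print(unicodedata.decomposition("\uFB01"))  # compatibility mapping for the fi ligature
print(unicodedata.combining("\u0301"))      # combining class of COMBINING ACUTE ACCENT

# Step 4: the four normal forms, built on top of the mappings above.
samples = {"e-acute": "\u00E9", "fi ligature": "\uFB01"}
for label, text in samples.items():
    for form in ("NFD", "NFC", "NFKD", "NFKC"):
        normalized = unicodedata.normalize(form, text)
        print(f"{label:12} {form:5} {code_points(normalized)}")
```

The ligature is left alone by NFD and NFC and only changes under NFKD and NFKC, because its mapping is a compatibility mapping rather than a canonical one.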
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
According to the docs at http://www.microsoft.com/typography/otfntdev/indicot/other.htm, uniscribe renders combining marks in isolation when they are applied to SPACE + ZWJ. (Without the ZWJ, it uses a dotted circle.) Perhaps this is an acceptable solution to the people calling for a new character. Combining marks and signs that appear in text not in conjunction with a valid consonant base are considered invalid. Uniscribe displays these marks using the fallback rendering mechanism defined in the Unicode Standard (section 5.12, 'Rendering Non-Spacing Marks' of the Unicode Standard 3.1), i.e. positioned on a dotted circle. Please note that to render a sign standalone (in apparent isolation from any base) one should apply it on a space (see section 2.5 'Combining Marks' of the Unicode Standard). Uniscribe requires a ZWJ to be placed between the space and a mark for them to combine into a standalone sign. Noah
RE: Newbie Question - what are all those duplicated characters FOR?
Ah, now you're making assumptions about me which are not, in fact, valid. I'm not quite sure exactly what you mean by the text, but I own a copy of The Unicode Standard Version 3.0 and have read it pretty much in entirety. I have also read almost everything I could find on the unicode.org web site. In none of these sources have I found an answer to this question. It was for this very reason that I joined this forum, thinking Aha! Maybe someone THERE might know the answer. So, Michael, perhaps you might be so kind as to give the URL of the text to which you refer (or even the page number in the 3.0 book). If I find such a text, I will most certainly read it. Stefan has effectively dealt with SOME of my confusion, but questions remain. For example: between 1D49C (mathematical script capital A) and 1D49E(mathematical script capital C) we find 1D49D (reserved). What is it reserved for? I am aware that codepoint 212C is script capital B, but why does that justify leaving a hole in the codepoint space? Why not just omit mathematical script capital B without leaving a hole? (i.e. why not just go straight from A to C?). More questions. From E0020 to E007E we have tag space through to tag tilde. These are copies of the Basic Latin block at 0020. I still don't know what they are for. I am, however, VERY keen to learn, and so would really appreciate it if someone could tell me, or indeed point me in the direction of the text which explains it. Thanking you in advance for your help, Jill -Original Message- From: Michael Everson [mailto:[EMAIL PROTECTED] Sent: Friday, August 08, 2003 6:54 PM To: [EMAIL PROTECTED] Subject: Re: Newbie Question - what are all those duplicated characters FOR? At 17:46 +0100 2003-08-08, [EMAIL PROTECTED] wrote: I'm reasonably sure that this question reflects my own ignorance, rather than some problem with the standard, but nonetheless, I am confused. Read the text. Don't just read the code charts. 
-- Michael Everson * * Everson Typography * * http://www.evertype.com
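The code points Jill asks about can at least be inspected programmatically. This is a minimal sketch, assuming Python's standard `unicodedata` module (which carries a later Unicode version than 3.0); the function name `describe` is hypothetical. It only shows what the character database records for each code point and does not by itself explain why the hole at 1D49D exists.

```python
import unicodedata


def describe(cp: int) -> str:
    """Hypothetical helper: code point, General Category and name (if any)."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<no name: unassigned or reserved>")
    return f"U+{cp:04X} {unicodedata.category(ch)} {name}"


# Script capital A, the reserved hole, script capital C,
# the BMP script capital B, and one of the tag characters.
for cp in (0x1D49C, 0x1D49D, 0x1D49E, 0x212C, 0xE0041):
    print(describe(cp))
```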
Re: Conflicting principles
On 07/08/2003 13:57, John Cowan wrote: Kent Karlsson scripsit: 4) Encode the vowel signs as combining characters, after the base characters they logically follow. Consider them as double [width] combining characters, that happen to have no ink above/below the character they apply to, but (like double width combining characters) have ink over/under the glyph for the base character that follows. Cool. ... Agreed! ... But an immediate problem comes to mind: what if there is a line break between the two base characters? What if there is a line break between the two characters joined by a double width combining character? Are arbitrary line breaks in the middle of words actually permitted anyway? Presumably any line breaking property of the first base character of the pair is cancelled anyway. That leaves a problem only if the second base character has a line break before possibility. Well, that could just be treated as one of the sequences we were discussing yesterday: not illegal Unicode, but its rendering is undefined. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: AL32UTF8 Vs UTF8
Jay Chandru scripsit: I wanted to know the differences between AL32UTF8 and UTF8. My database (oracle) will be in AL32UTF8 format. Will the applications that require multibyte characters work as they are functioning in UTF8 format. The Oracle UTF8 format is really CESU-8, whereas the AL32UTF8 format is true UTF-8. The difference shows up in characters beyond U+FFFF, which are represented with six bytes in UTF8 format, four bytes in AL32UTF8 format. UTF8 format is not very interoperable, whereas AL32UTF8 format is. I recommend that you use the latter. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com If I have seen farther than others, it is because I am surrounded by dwarves. --Murray Gell-Mann
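The six-byte versus four-byte difference for supplementary characters (those above U+FFFF) can be reproduced by hand. The sketch below is in Python; `cesu8_encode` is an illustrative function written for this example, not an Oracle or Python library API, and it follows the CESU-8 rule of encoding each UTF-16 surrogate as its own three-byte sequence.

```python
def cesu8_encode(text: str) -> bytes:
    """Illustrative CESU-8 encoder: supplementary characters are split into a
    UTF-16 surrogate pair, and each surrogate is encoded as three bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)    # high (lead) surrogate
            low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


sample = "\U0001D49C"  # MATHEMATICAL SCRIPT CAPITAL A, a character above U+FFFF

print(sample.encode("utf-8").hex(" "))   # true UTF-8: 4 bytes
print(cesu8_encode(sample).hex(" "))     # CESU-8: 6 bytes
```

For characters in the Basic Multilingual Plane the two encodings produce identical bytes, which is why the difference often goes unnoticed until supplementary characters appear in the data.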
Re: Conflicting principles
At 01:18 +0200 2003-08-09, Philippe Verdy wrote: Such a break in the middle of a multiple-width diacritic exists in some notations, and is not considered horrible typography. Just look at musical notations where an upper horizontal parenthesis is used to group some elements [...] Music setting is not typesetting, and that kind of music representation is outside of the scope of the Unicode Standard. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
on 2003-08-06 15:24 Doug Ewell wrote: I'm not a typographer (intelligent or otherwise), but I'm having a tough time seeing how Section 2.10 *requires* fonts and rendering engines to give a space-plus-combining-diacritic combination the exact minimum width of the diacritic alone, or to leave equal space before and after such a combination. All I think it is saying is that, for example, the combination i-plus-tilde may be wider than i alone, because tilde is wider than i. Considering that one approach is to use opentype to map a letter plus diacritical to a single glyph, an obvious solution would be to include space + diacritical combos in that table. An important font issue, but a font issue nonetheless. -- Curtis Clark http://www.csupomona.edu/~jcclark/ Mockingbird Font Works http://www.mockfont.com/
24th Unicode Conference - Last week to $SAVE with early-bird rates!
REGISTER THIS WEEK AND SAVE ON EARLY-BIRD CONFERENCE AND HOTEL RATES! Are you falling behind? Version 4.0 of the Unicode Standard is here! Software and Web applications can now support more languages with greater efficiency and lower cost. Do you need to find out how? Do you need to be more competitive around the globe? Is your software upward-compatible with version 4.0? Does your staff need internationalization training? Learn about software and Web internationalization and the new Unicode Standard, including its latest features and requirements. This is the only event endorsed by the Unicode Consortium. The conference will be held September 3-5, 2003 in Atlanta, Georgia and is completely updated. Twenty-fourth Internationalization and Unicode Conference (IUC24) Unicode, Internationalization, the Web: Powering Global Business http://www.unicode.org/iuc/iuc24 September 3-5, 2003 Atlanta, Georgia, USA NEWS Visit the Conference Web site ( http://www.unicode.org/iuc/iuc24 ) to check the updated Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies. Attend the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. Be an Exhibitor! Show off your product at the premier technical conference worldwide for both software and Web internationalization. 
See: http://www.unicode.org/iuc/iuc24/showcase.html To find out about, and register for the TILP Breakfast Meeting and Roundtable, organized by The Institute of Localisation Professionals, and taking place at the same venue on September 4, 7:00 a.m.-9:00 a.m., See: http://www.tilponline.org/events/diary.shtml or http://www.unicode.org/iuc/iuc24 KEYNOTES: Keynote speakers for IUC24 are well-known authors in the Internationalization and Localization industries: Donald De Palma, President, Common Sense Advisory, Inc., and author of Business Without Borders: A Strategic Guide to Global Marketing, and Richard Gillam, author of Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard and a former columnist for C++ Report. TUTORIALS: The redeveloped and enhanced Unicode 4.0 Tutorial is taught by Dr. Asmus Freytag, one of the major contributors to the standard, and extensively experienced in implementing real-world Unicode applications. Structured into 3 independent modules, you can attend just the overview, or only the most advanced material. Tutorials in Web Internationalization, non-Latin scripts, and more, are offered in parallel and taught by recognized industry experts. CONFERENCE TRACKS: Gain the competitive edge! Conference sessions provide the most up-to-date technical information on standards, best practices, and recent advances in the globalization of software and the Internet. Panel discussions and the friendly atmosphere allow you to exchange ideas and ask questions of key players in the internationalization industry. WHO SHOULD ATTEND?: If you have a limited training budget, this is the one Internationalization conference you need. 
Send staff that are involved in either Unicode-enabling software, or internationalization of software and the Internet, including: managers, software engineers, systems analysts, font designers, graphic designers, content developers, Web designers, Web administrators, technical writers, and product marketing personnel. CONFERENCE WEB SITE, PROGRAM and REGISTRATION The Conference Program and Registration form are available at the Conference Web site: http://www.unicode.org/iuc/iuc24 CONFERENCE SPONSORS Agfa Monotype Corporation Basis Technology Corporation ClientSide News L.L.C. Oracle Corporation World Wide Web Consortium (W3C) XenCraft GLOBAL COMPUTING SHOWCASE Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. Sign up for the Exhibitors' track as part of the Conference. For more information, please see: http://www.unicode.org/iuc/iuc24/showcase.html Exhibitors to date: Agfa Monotype Corporation ASET International Services Corporation Basis Technology ClientSide News L.L.C. LingoPort, Inc. Multilingual Computing, Inc. The Symbio Group The Institute of Localisation Professionals
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Sunday, August 10, 2003 9:30 AM, Mark Davis [EMAIL PROTECTED] wrote: As for oe-ligature, the French representative to WG3 (or its predecessor) said that France could live without it. Even worse; the story I heard was that the committee had planned from the start to have Œ and œ in positions D7 and F7, but that late in the process the representative from France objected, so they replaced them by × and ÷. That would certainly explain why these symbols are in the middle of a batch of letters... It's true that in French these are really ligatures, and not plain letters, meaning that this is mostly a standard typographic convention, rather than orthographic. The national AFNOR may have opted for this solution thinking that these holes would have benefited other languages commonly used in Europe, and there were probably other candidate characters that finally got encoded in a separate ISO-8859-* set. I don't know which compromise was taken, but the original DEC VT set also had holes at those positions. It's just strange that the ISO working group opted for those two characters at D7 and F7, when there could have been a pair of characters coded for Finnish, or Catalan (like the dotted L which is still coded with a separate middle dot symbol instead of a true diacritic, and that renders quite poorly with ISO-8859-1 and even with Windows 1252). Well, French and Catalan writers have lived with those encoded sequences, and fixed the rendering using ligating rules in their renderers or fonts (or used the oe/OE ligatures in Windows1252). I just suspect that the French objection on oe/OE was related to the fear of modifying keyboards that were previously created based on the French version of ISO646, where such a ligature could not be coded. 
Since then, the AFNOR version of ISO646-FR has been simplified to remove the tricky combining sequences built with BACKSPACE, like C+BACKSPACE+COMMA to code a C WITH CEDILLA, as they were no longer necessary with a more universally used 8-bit set (7-bit sets have survived only within Teletex/Videotex standards, built in accordance with ISO646 with SS2 sequences to encode non-spacing diacritics *before* the base character with which they combine, to match the keyboard input order based on dead keys for combining diacritics, and this 7-bit set is probably the only one remaining in large use today for French, with ISO646-FR now nearly extinct in favor of ISO646-US/ASCII). -- Philippe. Spam is not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: Questions on ZWNBS - for line initial holam plus alef
Jon Hanna scripsit: If this is not the case (I'm not entirely sure this bans what XML does with spaces) then all we would need is a change so that rather than a de facto ban on space+combining within names and nmtokens we would have an explicit ban on the same; then we'd all be happy, except possibly for some sadistic XML application designer that was planning to use that combination out of ill-will towards his or her colleagues. Space in any case is not allowed in a token. There are far worse conformance problems than this anyway, notably the fact that canonical equivalence is not respected in XML names: a start-tag that is decomposed and an end-tag that is composed (or vice versa) will not match. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan The Imperials are decadent, 300 pound free-range chickens (except they have teeth, arms instead of wings and dinosaurlike tails). --Elyse Grasso
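The mismatch between a decomposed start-tag and a composed end-tag can be demonstrated with an ordinary parser. This is a hedged sketch, assuming Python's standard `unicodedata` and `xml.etree.ElementTree` modules; the element name and the helper `is_well_formed` are invented for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


composed = unicodedata.normalize("NFC", "caf\u00E9")  # e-acute as one code point
decomposed = unicodedata.normalize("NFD", composed)   # e followed by U+0301

# The two names are canonically equivalent but are different code point sequences.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Start-tag decomposed, end-tag composed: the parser compares names
# code point by code point, so the tags do not match.
mixed = f"<{decomposed}>text</{composed}>"
print(is_well_formed(mixed))
```

A producer that normalises names before serialising, so that both tags use the same form, avoids the problem; the parser itself applies no normalisation.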
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 14:40, Mark Davis wrote: Where did you get the notion that space is not a base character? And base characters include those that are not control or format characters. Space is neither one. The standard specifically states in a number of places that to exhibit a combining mark in isolation you use a space (or NBSP). Mark __ http://www.macchiato.com Eppur si muove I got this from the Unicode Standard 4.0, as quoted by Jim Allan: In http://www.unicode.org/book/preview/ch03.pdf the space characters in general are given class Zs: Zs, Zl, and Zp are considered format characters, but their membership in the Z (separator) class takes precedence over their membership in the Cf class, because the General Category assigns only a single value to each character. So the various space characters (class Zs) are also classified as format characters. From http://www.unicode.org/book/ch04.pdf: _D13 Base character:_ a character that does not graphically combine with preceding character, and that is neither control nor a format character. Accordingly, by definition, spaces are not base characters. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
At 01:30 +0200 2003-08-10, Philippe Verdy wrote: Whatever you think, the SPACE+diacritic is still a hack, and certainly not a canonical equivalent (including for its properties), of the existing spacing diacritics, which also do not fit all usages because they are symbols. It is the formally specified way to represent what you say you want to represent. If an implementation doesn't do that nicely enough, complain to the implementors. (This has already been suggested to you.) -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Conflicting principles
what code are we talking about that has to work from the positions of the combining marks back to the underlying representation? Such code is not just common and widespread, it is practically ubiquitous. The principle of base characters always coming first is relied on: Whenever you need to calculate the size of a visual representation of a string. Whenever you need to move a caret, or locate the caret position closest to a cursor position. Whenever you perform normalisation. Whenever you insert a substring which may not begin with a base character into another string. Whenever you need to guarantee that a portion of streamed text is sufficiently complete that operations on it won't have to be redone when more characters are received. Whenever you need to examine the properties of a character which may change if combined (e.g. breaking properties can be changed when combined). This is not code that couldn't necessarily be rewritten to allow cases where combining marks preceded base characters (though it may become considerably more complicated, frightfully so in some cases, which in turn would lead some developers to neglect full support for the scripts that used this new feature). It is code that is all over the place, much of it would be hard to track down, and generally unless coders have all nicely isolated the process of locating combining sequences (and you just know some of them haven't) it's going to be a mess trying to upgrade. This doesn't say we should automatically dismiss any proposal to change the principle, but it does weigh heavily against any such proposal.
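To make the dependence concrete, here is a deliberately simplified sketch of the kind of code described above, written in Python with the standard `unicodedata` module. The function `combining_sequences` is illustrative only; it is not the full grapheme cluster algorithm of any Unicode report, and it assumes exactly the principle under discussion, namely that marks follow their base.

```python
import unicodedata


def combining_sequences(text: str) -> list[str]:
    """Split text into base-plus-marks units, assuming marks FOLLOW their base.

    A mark with nothing before it (a defective combining sequence) becomes
    a unit of its own.
    """
    units: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and units:
            units[-1] += ch   # attach the mark to the preceding unit
        else:
            units.append(ch)  # start a new unit
    return units


print(combining_sequences("e\u0301a"))   # mark after its base
print(combining_sequences("\u0301ea"))   # mark before its intended base
```

In the second call the mark is not attached to the following "e", which is the behaviour that caret movement, width calculation and stream chunking code built on the same assumption would also exhibit if marks were allowed to precede their bases.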
RE: Does Unicode 3.1 take care of all characters of 'Hong Kong Supplementary Character Set - 2001' (HKSCS-2001) ?
Aren't the replies about Unicode 3.2 (or maybe 4.0) rather than 3.1? 1651 - Supplementary Plane 2 - \2e80 - \u2f00 Plane 2 covers U+20000 to U+2FFFF, and is not in the BMP (= Plane 0). /kent k
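The plane of a code point is simply its value divided by 0x10000, so the quoted range can be checked in a couple of lines. A small sketch in Python; the function name `plane` is made up for the example.

```python
def plane(code_point: int) -> int:
    """Return the Unicode plane number (0 = BMP) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # each plane holds 0x10000 code points


# The quoted range \u2e80 to \u2f00 lies in the BMP, not in Plane 2.
for cp in (0x2E80, 0x2F00, 0x20000, 0x2FFFF):
    print(f"U+{cp:04X} is in plane {plane(cp)}")
```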
Re: Display of Isolated Nonspacing Marks (problems with UAX#29)
On 10/08/2003 18:44, Doug Ewell wrote: Has it occurred to anyone yet that the very *concept* of spacing diacritics is a hack? Spacing diacritics are used to conduct a sort of meta-discussion about characters, as in A base character o is combined with an acute accent to create ó. They are not part of the normal writing systems of most natural languages. It is as if I were describing the two typical glyphs used for lower-case g, the one with one bowl and the one with two bowls, but actually showing the separate, constituent pieces of the glyphs instead of using words to describe them. They are interesting things to talk about, but not necessarily things that need to be encoded in plain text. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/ They are indeed interesting things to talk about, and many people do talk about them, and they appear in many texts (including the Unicode Standard!). The goal of Unicode is to define characters which people use, and that must include documents about languages, e.g. dictionaries, tutorials, discussions of writing systems etc., which, put together, form a significant proportion of publishing output. They are indeed meta-content but that does not disqualify them from being plain text. Spacing diacritics clearly come into the category of characters which people use, and so should be defined, and properly and unambiguously so, by the standard. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 08/08/2003 09:54, Jim Allan wrote: ... It certainly makes sense that in the case of space characters that have a defined width that this width is innate to the definition of the character and in such a case should take precedence over the width of the normally non-spacing combining character. I would welcome clear instructions by Unicode on this point where either result would be useful in order that applications may be expected to produce results that are consistent with each other. :-) Agreed! I would think it would be consistent with Unicode for an application to shrink the width of normal space followed by a diacritic such as a single overdot as exact formatting behavior is not defined in such cases. Well, is a space followed by a diacritic actually a space, or is it the same code point reused or overloaded By convention (to quote the standard) for a logically distinct purpose? Some of the discussions here have implied the latter. Either way, the best clarification would be to add a character whose explicit function is to form non-spacing variants of diacritics. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Unicode 4.0 is online at last!
My congratulations to Ken, Julie, and Eric! For those who might not know, this trio (especially Eric with the online bit) get our unadulterated love and appreciation... Lots of difficulties on the road to online Unicode 4.0 :-) !! Lisa - Forwarded by Lisa Moore/Santa Teresa/IBM on 08/11/2003 09:54 PM - From: Kenneth Whistler [EMAIL PROTECTED] Sent by: [EMAIL PROTECTED] To: [EMAIL PROTECTED] cc: [EMAIL PROTECTED] Date: 08/11/2003 05:37 PM Subject: Unicode 4.0 is online at last! Please respond to Kenneth Whistler. Well, I've been promising that good things would come to those who wait. ;-) At last, the Unicode website has been updated with the online chapters for Unicode 4.0. See: http://www.unicode.org/versions/Unicode4.0.0/ Or just go to the Unicode 4.0 link from the home page. Enjoy. --Ken P.S. Just FYI, Peter K., now it is o.k. for everyone to come back from their August Unicode vacations. Let the textual criticism begin!
Re: Roadmap---Mandaic, Early Aramaic, Samaritan
Elaine, I disagree with you. Just because Semitic languages *can* be represented in the Hebrew script does not mean that every script is just a font variant of the Hebrew script. There are genetic relationships of the development of the scripts which are involved in our analysis so far. There are also user community concerns. The Mandaic and Samaritan scripts apparently enjoy at least some modern liturgical use. The question of what kind of Aramaic script to encode has not been looked at carefully. Indeed we have no current proposals which are well-advanced at this time. But I am not disposed to removing them from the Roadmap at this time on foot of the reasons you give. I am responding at great length to the Roadmap proposals for the Semitic dialects Mandaic, Early Aramaic, and Samaritan. BTW, the larger phylum for these dialects is called Afroasiatic. We are proposing to encode scripts, not languages. Samaritan is a Hebrew dialect, still used today in Israel in worship/liturgy and probably elsewhere in the Middle East, with a series of different vowel and other marks, many of them derived from Arabic. And a set of base letter glyphs which differs strongly from Hebrew. But Afroasiatic (Aramaic, Syriac, Mandaic, Egyptian, Somali, Hausa, Hebrew, Samaritan, Amorite, Yaudic, Tigrinya, Arabic, Berber, Moabite, Amorite, Coptic) has not fared as well as CJKV. That is because CJK is a moneymaker, and resources are not available to those who would like to work on the scripts used by these languages. So here's the problem, which seems to me a clear language engineering situation: there are VOLUMINOUS amounts of material in Egyptian and Akkadian that could be computerized. The Hebrew Bible has 1,000 pages of Hebrew and Aramaic, the Talmud has at least 40,000 pages of Aramaic and Hebrew. There's also quite a bit of Ugaritic, a unique alphabet. Yes, we know. 
But for the Early Aramaic, which can be perfectly represented in modern Hebrew square script, there are maybe 3 pages of mostly tiny scraps of text, if that much. For many of the scraps the question is: what language is this, actually?-- Aramaic or something else? But you are proposing a completely unnecessary script for 3 pages of material, and make an overworked search engine go through those 3 pages in a different way than the work it does for the other thousands of pages of Aramaic in the 6 other scripts. We are talking about the Aramaic that was enormously widespread and was the basis for a number of other scripts. Perhaps Early Aramaic is not what it should be called. (Indeed the Roadmap doesn't name it so.) Mandaic is easily represented by Hebrew + one extra letter. There is more material here, but there is no problem in seeing it as a variant font. There is as far as I am concerned. Samaritan is a Hebrew font variant with interesting different sets of vowel points. There's no reason to computerize it separately, despite the exotic shapes. I think there is. Every scrap of early alphabetic Semitic material has different letter shapes. It never did become anything like a standard. Many of these scripts had type designed for them. Scholars did not always use Hebrew to represent all of it, nor should they have. It may be some time before proposals to encode these appear. You and others will have an opportunity to examine them. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Handwritten EURO sign (off topic?)
James H. Cloos Jr. wrote: Anto'nio == Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED] writes: Anto'nio (Let alone the validity of things Anto'nio like k€, c€ etc.) I'm sure things like m€, k€, M€ and even G€ will come into use, though I expect more will use them in front of the digits. I certainly use m$, k$ et al, and regularly see others use them. m€ and m$ would be millieuros and millidollars. How could anyone need anything like that? And why use c$ and c€, wouldn't ¢ be just as good? Stefan
Re: Conflicting principles
John Cowan asked: I would like to ask the old farts^W^Wrespected elders of the UTC which principle they consider more important, abstractly speaking: the principle that combining marks always follow their base characters (a typographical principle), or that text is stored, with a few minor exceptions, in phonetic order (a lexicographical principle). As may often be the case in such hypothetical questions, I think there is a false dichotomy presumed here. The principle of the order of combining marks results from the need to resolve the following architectural question for the standard: Does a combining mark apply to the base character that precedes it or to the base character that follows it? In other words, does á = 0061, 0301 or does á = 0301, 0061? There can only be one right answer to that question, while having a coherent, interoperable character encoding standard. The choice that the Unicode architects made on this principle in 1989 is sacrosanct and inviolable. The principle of logical order of encoding results from the need to resolve the following architectural question for the standard: Is a right-to-left script encoded in visual order in the backing store or in phonetic (= logical) order? In other words, is tsava spelled 05E6, 05D1, 05D0 or 05D0, 05D1, 05E6? There can only be one right answer to that question, while having a coherent, interoperable character encoding standard. The choice that the Unicode architects made on this principle in 1989 is sacrosanct and inviolable. Everything else is just working out the details for making actual script encodings consistent in the context of those overarching principles. The status of a character as combining or not is up for grabs, depending on the analysis of a script's behavior and how it should be represented. And the layout for actual display of rendered texts does not, and never has, slavishly followed logical order in lockstep. 
Again, everyone, if you haven't already, go back and meditate some more on the fundamental mandala of Unicode: Figure 2-3, Unicode Character Code to Rendered Glyphs, which illustrates both issues of combining mark order with respect to base character and general logical order of characters as applied to a particular script encoding (Devanagari). And don't miss the following piece of text associated with that figure: The Unicode Standard documents the default relationship between character sequences and glyphic appearance for the purpose of ensuring that the same text content can be stored with the same, and therefore interchangeable, sequence of character codes. This should, IMO, be put up on a pedestal and have the spotlights shined on it. This is the *fundamental* obligation of a character encoding standard. If you cannot accomplish this, then you just have a bunch of charts full of pretty pictures, and everyone is on their own for trying to figure out how to communicate with anybody else using them. As someone or other said, I believe that hitherto -- *hitherto,* mark you -- [we have] entirely overlooked the existence of, well, scripts that might cause a conflict between these esteemed principles. The reason why the UTC should tackle the encoding of Tengwar is not so much because it would help in the publication of Elvish poetry, but because confronting the architectural issues it poses for encoding would make an excellent tutorial case for how the two principles of combining mark order and logical order impact the task of coming up with an appropriate encoding for a complex script. And it would starkly illustrate the fact that an appropriate character encoding does not necessarily directly reflect the phonological structure of a language as represented by that script. --Ken
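A minimal Python sketch of the two ordering principles discussed above, assuming the standard library's unicodedata module as a stand-in for any conformant implementation. It only shows that a mark composes with the base character that precedes it, and that the Hebrew letters of tsava sit in the backing store in logical order; the variable names are illustrative.

```python
import unicodedata

# Base first, mark second: canonically equivalent to precomposed U+00E1.
base_then_mark = "\u0061\u0301"
# Mark first, base second: the mark has no preceding base to attach to.
mark_then_base = "\u0301\u0061"

composed = unicodedata.normalize("NFC", base_then_mark)
uncomposed = unicodedata.normalize("NFC", mark_then_base)

print(composed == "\u00e1")   # the pair collapses to a single code point
print(len(uncomposed))        # the reversed pair stays as two code points

# tsava in logical (phonetic) order: tsadi, bet, alef.
tsava = "\u05e6\u05d1\u05d0"
print([unicodedata.name(c) for c in tsava])
# The bidi class tells a renderer to lay these out right to left;
# the stored order itself is never reversed.
print({unicodedata.bidirectional(c) for c in tsava})
```

Display order is left entirely to the rendering layer here, which is the separation the figure in the standard is meant to illustrate.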
Re: Questions on ZWNBS - for line initial holam plus alef
From: Jon Hanna [EMAIL PROTECTED] I was saying that it wouldn't be sensible to begin a line with a combining diacritic, since that combining diacritic would be combining with a newline character which it's difficult to think of any possible sensible meaning for. A newline is a control with a whitespace property and a line-breaking behavior. It must not combine with a combining diacritic, according to the UAX definition of grapheme clusters. So newline+NSM is clearly defective and must be parsed as two distinct combining sequences, the first one for the newline sequence, the second one being defective as the combining character does not have a base character to which it applies (the standard suggests using a dotted circle to render it in editors, but suggests nothing for the rendering of final documents, which could simply drop the defective sequence, display it with a replacement base character, use a dotted circle, or use an invisible glyph). So the result in this case is implementation dependent, and not interoperable. For me the term difficult is inappropriate. In fact it is invalid for interoperability (even though it is valid, not forbidden, for ISO 10646/Unicode, as a string fragment for intermediate processing), and such a sequence should not occur in actual documents, outside of any external processing context which defines its behavior.
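The following Python sketch shows one way to locate such defective sequences. It assumes the standard's definition of a base character as any graphic character (General Category L, M, N, P, S or Zs) that is not itself a combining mark; the helper names are hypothetical and not taken from any library.

```python
import unicodedata

GRAPHIC_MAJOR_CLASSES = {"L", "M", "N", "P", "S"}

def is_graphic(ch):
    """True for graphic characters: letters, marks, numbers,
    punctuation, symbols and space separators (Zs)."""
    category = unicodedata.category(ch)
    return category[0] in GRAPHIC_MAJOR_CLASSES or category == "Zs"

def is_mark(ch):
    """True for combining marks (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def defective_positions(text):
    """Return the index of every combining mark that starts a
    defective sequence, i.e. one with no graphic character before it."""
    positions = []
    previous = None
    for index, ch in enumerate(text):
        if is_mark(ch) and (previous is None or not is_graphic(previous)):
            positions.append(index)
        previous = ch
    return positions

# Line-initial holam followed by alef, after a newline control.
print(defective_positions("a\n\u05b9\u05d0"))
# The same mark carried by a SPACE is not defective under this definition.
print(defective_positions("a \u05b9\u05d0"))
```

Under this reading a mark after a newline is flagged while a mark after SPACE is not, which matches the distinction the thread keeps returning to.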
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Ted Hopp asked: I believe that reasonable people might reasonably conclude from factoids 1 and 2 that SPACE is indeed a format character. Reasonable, but evidently wrong. Explanation, please? I provided the text deconstruction in my last email, but to continue, the confusion arises from the strange nature of SPACE in the history of character encoding. SPACE, for a long time now in the history of character encodings, has been classified as a *graphic* character. Certainly, in the general SC2 character encoding context of ISO 2022, SPACE always shows up in the G0 set, with other graphic characters, instead of in the various control functions encoded in C0 or C1 sets. But looked at from the legacy of device control, SPACE could just as well have been categorized as a control function: MOVE PRINT HEAD ONE UNIT RIGHT, comparable to BACKSPACE. And in the context of the Unicode Standard, people often loosely talk about space characters as being format characters, since they a) are more akin to punctuation than normal letters, b) have no glyph associated with them, and c) impact line-breaking and other aspects of the formatting of characters in their vicinity. But the *formal* categorization of Unicode characters, defined by the UTC to help eliminate this kind of ambiguity in talk about the character types, is spelled out in Figure 2.5 of Unicode 4.0 now: http://www.unicode.org/book/preview/ch02.pdf and the *formal* meaning of format control character (Basic type = Format) in Unicode is now any character with the General Category of {Cf, Zl, Zp}. The space characters are all lumped in with graphic characters. So while there are still some ambiguities to be worked out in the definition of base character in the Unicode Standard, neither the status of SPACE as a graphic character nor the recommendation of the standard that non-spacing marks be applied to SPACE as a means of showing them in isolation is in question. --Ken
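The formal split described above can be checked mechanically. This is a small Python sketch, assuming only the General Category values exposed by the standard unicodedata module; the function name is made up for illustration.

```python
import unicodedata

# Basic type "Format" as described above: General Category Cf, Zl or Zp.
FORMAT_CATEGORIES = {"Cf", "Zl", "Zp"}

def is_format_control(ch):
    """True if the character falls under the formal Format basic type."""
    return unicodedata.category(ch) in FORMAT_CATEGORIES

samples = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "ZERO WIDTH NO-BREAK SPACE": "\ufeff",
    "LINE SEPARATOR": "\u2028",
    "PARAGRAPH SEPARATOR": "\u2029",
}

for label, ch in samples.items():
    print(label, unicodedata.category(ch), is_format_control(ch))
```

SPACE and NO-BREAK SPACE report Zs and so count as graphic characters, while the last three fall into the Format type.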
Re: Handwritten EURO sign (off topic?)
At 00:52 +0100 2003-08-14, Anto'nio Martins-Tuva'lkin wrote: Using the cent sign is mostly US specific and the symbol is not recognized as such in most European countries, so the cent sign is bound directly to the dollar. If the dollar sign can be used for currencies other than the USD, even for some whose name is not even dollar, then I suppose there is a theoretical possibility that it may be used as a symbol of euro cent (though I personally prefer c*). There is no reason that the noble ¢ cent sign should not be used for the European currency. Personally I always use it, because 2¢ looks like two cents and 2c looks like two cee. In Ireland of course when we used pence we wrote 2p and said two pee. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
Ken Whistler posted: Of course a standard which mandates space folding is also within its rights to mandate, for example, the non-use of nonspacing marks applied to SPACE characters. It can simply rule out such sequences as valid for its context, in which case the problem goes away. And for such standards or applications one can usually use U+00A0 NO-BREAK SPACE to force multiple spacings. One can also use this followed by a non-spacing combining character to call for rendering of that combining character in isolation. My feeling is that, because of the special qualities of regular SPACE, using NBSP (U+00A0) should be the more robust way to go. Essentially, since the Unicode specifications say that a non-spacing diacritic can be applied to any base character, including the spaces, it is up to fonts and other presentation software to support this and to try to make the results look good according to orthographic and cultural expectations, just as it is with any text coded in Unicode. Sometimes fonts don't do this. I would not at all be surprised to find for example that _g_ followed by U+0325 COMBINING RING BELOW would come out with the combining ring overlapping the tail of the _g_ unless I were using a font especially designed for linguistic use. I would not be at all surprised that some fonts and display devices wouldn't justify NBSP + COMBINING DOT BELOW at the beginning of a line. But good typographical fonts should justify such combinations and should presumably change the width of NBSP when appropriate. Such changes of width and shapes are what one finds with ligatures in fonts that support ligatures. Jim Allan
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk wrote: I think this may be a Peter mistake. I meant to refer to spacing diacritics. Sorry. It is certainly highly inappropriate for spacing diacritics to be considered word boundaries. Why? It is entirely dependent on the orthography and conventions involved. There is probably as much (or more) bad ASCII usage of spacing diacritics like `this', where a grave accent character is being misapplied to make a directional quotation mark, as there is actual, linguistically appropriate use of spacing diacritics. Also, everyone should consider carefully the status of UAX #29, Text Boundaries. Quoting its section 2, Conformance: "This is informative material. There are many different ways to divide text elements corresponding to grapheme clusters, words and sentences, and the Unicode Standard and this document do not restrict the ways in which implementations can do this. This specification is a default mechanism; more sophisticated engines can and should tailor it for particular locales or environments. ..." The whole UAX is informative. It is a here's-how-you-can-approach-the-problem implementation guide with some suggestions for rules and classes. *If* you are working with an orthography that uses one or more spacing diacritics, and *If* those spacing diacritics need to be represented by SPACE, NSM sequences, then you are in the situation where your implementation of text boundaries should take SPACE, NSM sequences explicitly into account, so as to result in expected behavior for that orthography. Everyone has had experiences with their platform UI producing bad results for text boundaries. The Solaris platform I am writing this on right now, for example, implements a double-click word selection that treats the string `this', above, including the grave accent, the apostrophe, and the comma, as a word. Is that right or wrong? Well, it depends on what you are trying to do, I expect. 
But even the most sophisticated platform implementers can only do so much with processes like default word selection. It is bound to be wrong for one purpose or another and for one orthography or another. Ultimately you need to have tailored processes that can be orthography-specific if you want to get best results. --Ken
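To make the idea of tailoring concrete, here is a deliberately naive Python sketch. It is not the UAX #29 algorithm; it only splits on SPACE and offers one hypothetical tailoring in which SPACE followed by a nonspacing mark (Mn) is kept inside the word as a spacing diacritic. The function name and the flag are assumptions made for illustration.

```python
import unicodedata

def split_words(text, keep_space_mark=True):
    """Split on SPACE. With keep_space_mark, a SPACE that carries a
    nonspacing mark is treated as part of the word, not as a boundary."""
    words = []
    current = []
    for index, ch in enumerate(text):
        next_is_mark = (
            index + 1 < len(text)
            and unicodedata.category(text[index + 1]) == "Mn"
        )
        if ch == " " and not (keep_space_mark and next_is_mark):
            # Ordinary word boundary: flush whatever has accumulated.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

sample = "ab \u0301cd ef"
print(split_words(sample))                          # tailored behavior
print(split_words(sample, keep_space_mark=False))   # untailored behavior
```

Whether the tailored or untailored result is the right one depends on the orthography, which is exactly why a default process cannot settle it.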
Re: Conflicting principles
On 06/08/2003 14:04, John Jenkins wrote: Speaking purely as an old fart, I'd say the former. We already break the latter principle in Thai and Lao, and having to be prepared to scan either forward or backward from a base character in order to find its combining marks would add overhead to a lot of code, including existing code. On Wednesday, August 6, 2003, at 2:16 PM, John Cowan wrote: I would like to ask the old farts^W^Wrespected elders of the UTC which principle they consider more important, abstractly speaking: the principle that combining marks always follow their base characters (a typographical principle), or that text is stored, with a few minor exceptions, in phonetic order (a lexicographical principle). John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage..mac.com/jhjenkins/ This answer presupposes that there is a well-defined concept of which base character a combining mark belongs to. That is not always true. The particular combining mark which precipitated the debate may be situated above the gap between the (logically and phonetically) preceding and following characters, or may move on to the preceding or the following characters depending on the precise context and on the typographer's preference. Anyway, John J, what code are we talking about that has to work from the positions of the combining marks back to the underlying representation? Are you talking about OCR? -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Valid encodings
We need an official Unicode Lint. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Philippe Verdy Sent: Thursday, August 07, 2003 4:28 PM To: [EMAIL PROTECTED] Subject: SPAM: Re: Questions on ZWNBS - for line initial holam plus alef On Thursday, August 07, 2003 2:40 AM, Doug Ewell [EMAIL PROTECTED] wrote: Kenneth Whistler kenw at sybase dot com wrote: But I challenge you to find anything in the standard that *prohibits* such sequences from occurring. I've learned that this question of illegal or invalid character sequences is one of the main distinguishing factors between those who truly understand Unicode and those who are still on the Road to Enlightenment. Very, very few sequences of Unicode characters are truly invalid or illegal. Unpaired surrogates are a rare exception. In almost all cases, a given sequence might give unexpected results (e.g. putting a combining diacritic before the base character) or might be ineffectual (e.g. putting a variation selector before an arbitrary character), but it is still perfectly legal to encode and exchange such a sequence. For Unicode itself this is true, but what users want is interoperability of the encoded text with accurate rendering rules. In practice, this means that any undefined or unpredictable behavior will mean lack of interoperability and should not be used. The standard should then highly promote what is a /valid/ encoding for text with regard of interoperability for all text processing algorithms including parsing combining sequences, collation, and computing character properties from those /valid/ encoded sequences. We don't have to care much if some encoded text considered valid under Unicode/ISO-IEC10646 is rendered or processed differently or unpredictably, provided that this does not affect common text for actual languages. 
In fact the standard specifies that ALL sequences made of code points in U+0000 to U+10FFFF (excluding U+xFFFE, U+xFFFF and surrogates in U+D800 to U+DFFF) are valid under ISO/IEC 10646, but it does not attempt to assign properties or behavior to ALL of these characters or encoded sequences, as it is the job of Unicode to specify this behavior. If there's something to enhance in the Unicode standard (not in ISO/IEC 10646), it's exactly the specification of interoperable encoded sequences. This certainly means that concrete examples for actual languages must be documented. Just assigning properties to individual ISO/IEC 10646 characters is not enough, and Unicode should concentrate more effort on the actual encoding of text and not only on individual characters. So for me, the validity of text is an ISO/IEC 10646 concept (shared now with Unicode versions for the assignment of characters in the repertoire), related only to the legally usable code points, and Unicode speaks about well-formed or ill-formed sequences, or about normalized sequences and transformations that preserve the actual text semantics. There is no ambiguity in ISO/IEC 10646 for the character assignments. But composed sequences are the real problem, for which Unicode must seek agreements: the W3C character model is only based on the simplified combining sequences, but Unicode should go further with much more precise rules for the encoding of actual text, even before any attempt to describe other transformation algorithms (only the NF* transformations have for now a stability policy, but actual text writers also need stability for the text composition rules for actual languages). We certainly don't need more assigned code points for existing scripts. But we do need more rules for the actual representation of text using these scripts, and for how distinct scripts can interact and be mixed. 
There are some rules already specified for combining jamos, or combining Latin/Cyrillic/Greek alphabets, or for Hiragana/Katakana, but we are still far from an agreement for Hebrew, and even for some Han composed sequences, which still lack a specification needed for interoperability. The current wording of Unicode validity is for me very weak, and probably defective. What it designates is only an ISO 10646 validity for used code points, and the validity of their UTF* transformations, based on individual code points. The kind of validity rules users want with Unicode is a conformance of the actually encoded scripts for actual languages, for interoperability and data exchange. The fact that Unicode was born from trying to maximize the roundtrip convertibility with legacy codepages or encoded character sets has introduced many difficulties: first the base+combining characters model was introduced as fundamental for alphabetized scripts with separate letters for vowels. Then there's the case of Brahmic scripts which complicates things,
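The distinction drawn in this thread between sequences that are merely odd and sequences that cannot be interchanged at all can be sketched in Python. The example assumes that "can be encoded as UTF-8" is an acceptable proxy for "legal to exchange"; the function name is hypothetical.

```python
def encodes_as_utf8(text):
    """True if every code point in the string can be serialized as
    well-formed UTF-8. Unpaired surrogates are the case that fails."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

# A combining mark before its base: unexpected rendering, but legal.
print(encodes_as_utf8("\u0301a"))
# A variation selector after an arbitrary letter: ineffectual, but legal.
print(encodes_as_utf8("a\ufe00"))
# A lone high surrogate: cannot be exchanged at all.
print(encodes_as_utf8("\ud800"))
```

Everything above the encoding layer, such as whether a sequence is sensible for a given language, is the part the message argues still needs specification.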
Re: Questions on ZWNBS - for line initial holam plus alef
On 08/08/2003 08:54, Philippe Verdy wrote: ... Could there be another codepoint assigned that has these properties: 20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;; i.e. being considered symbolic, not a whitespace, with combining class 0 (not combining), and used as an explicit base for an isolated spacing diacritic to never show with a dotted circle? (note U+20CF is just a suggestion, as it fits at end of the symbolic block used for currency symbols, just before the extended combining characters block, and because the U+02XX block where other Sk spacing diacritics are defined is full). The compatibility decomposition to a space is to make it in sync with other compatibly decomposable spacing diacritics. The new character would allow to represent diacritics that currently don't have a spacing counterpart, and use them as if they were letter like. Let's look at a similar diacritic which currently has an existing precombined spacing version: 00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;; Philippe, this sounds like an excellent suggestion, at least in general terms. There is a missing function here, which has been provided (since Unicode 1.0) by overloading the characters space and NBSP with an inappropriate second function. Of course we can't make existing practice illegal, but we can recommend that in future versions of the standard your new ZERO WIDTH SYMBOL character should be used for display of isolated diacritics where there is no separate spacing form. We can also suggest that the width of the combination should be that of the diacritic only. But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you are suggesting other uses in which it really has zero width. Well, it might have in a case like line initial holam which shifts on to a following silent alef, but that is a rather special case. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
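For readers unfamiliar with the record format quoted above, this Python sketch parses two such lines, assuming the semicolon-separated, 15-field layout of UnicodeData.txt. The 20CF record is the hypothetical character proposed in the message, not an assigned character, and parse_record is an illustrative helper rather than a library function.

```python
import unicodedata

def parse_record(line):
    """Parse the leading fields of a UnicodeData.txt-style record."""
    fields = line.split(";")
    tag = None
    mapping = []
    if fields[5]:
        parts = fields[5].split()
        # A leading <tag> marks a compatibility decomposition.
        if parts[0].startswith("<"):
            tag = parts[0].strip("<>")
            parts = parts[1:]
        mapping = [int(part, 16) for part in parts]
    return {
        "code": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "combining_class": int(fields[3]),
        "bidi": fields[4],
        "decomposition_tag": tag,
        "decomposition": mapping,
    }

acute = parse_record(
    "00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;;"
)
# Hypothetical record from the proposal above; U+20CF is only a suggestion.
proposed = parse_record("20CF;ZERO WIDTH SYMBOL;Sk;0;ON;<compat> 0020;;;;N;;;;;")

print(acute["decomposition_tag"], [hex(cp) for cp in acute["decomposition"]])
print(proposed["decomposition_tag"], [hex(cp) for cp in proposed["decomposition"]])
# Cross-check the existing character against the runtime's own database.
print(unicodedata.category("\u00b4"), unicodedata.decomposition("\u00b4"))
```

The proposed record differs from the existing one only in decomposing to a bare SPACE, which is what would let it act as a carrier for any nonspacing mark.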
Re: Which ancestral links
Indeed, pardon my haste, that was a matter of an addition to the Syriac script. For a comparison of the various scripts used for Sogdian, http://iranianlanguages.com/midiranian/sogdian.htm#Alphabet Raymond - Original Message - From: Michael Everson [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Friday, August 08, 2003 5:43 PM Subject: Re: Which ancestral links At 17:26 +0100 2003-08-08, Raymond Mercier wrote: John Clews writes: I've never seen a description of the Sogdian alphabet (i.e. I have never come across one): is there a good article or URL which illustrates such links? Here is a Unicode proposal for just that: http://wwwold.dkuug.dk/jtc1/sc2/wg2/docs/n2422.pdf That is not the Sogdian script. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
On 08/08/2003 13:56, Thomas M. Widmann wrote: Peter Kirk [EMAIL PROTECTED] writes: On 08/08/2003 08:54, Philippe Verdy wrote: ... Could there be another codepoint assigned that has these properties: 20CF;ZERO WIDTH SYMBOL;Sk;0;ON;compat 0020N; [...] But I'm not sure that ZERO WIDTH SYMBOL is the best name, unless you are suggesting other uses in which it really has zero width. Well, it might have in a case like line initial holam which shifts on to a following silent alef, but that is a rather special case. What would be a better name? ACCENT CARRIER? /Thomas Perhaps CARRIER FOR COMBINING CHARACTERS - not COMBINING CHARACTER CARRIER as that gives the wrong idea that this should itself be a combining character, it should not. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Roadmap---Mandaic, Early Aramaic, Samaritan
Elaine Keown responded to Michael: I really, really, really don't have time to debug your dissatisfaction with the use of the word Aramaic in the Roadmaps. This is NOT something anyone is working actively on right now. When a I'm not writing about nomenclature---not the point at all. I'm objecting to your endlessly fracturing such closely related scripts into completely different blocks, thus making Afroasiatic even harder to handle than it has to be. Back to Michael's point: This is NOT something anyone is working actively on right now. There are no active proposals for Nabatean, Palmyran, Mandaic, whatever... Whether or not these end up encoded in separate blocks is a matter of future debate, *when* an active proposal or proposals are on the table stating the issues. You have no Semitists in your e-world, there is no one to fight you, no one except me and a few Hebraists care about the fate of electronic Afroasiatic. I don't see how this is the case, given that you earlier scoped Afroasiatic to include Ugaritic, Egyptian, Akkadian, and other scripts. The borders we draw are based on the analyses of script experts. You've never had a Semitic script expert, that's the problem. This is nonsense. We are beset with Semitic script experts. What you might mean is that Michael doesn't have to hand an expert on your range of early Aramaic scripts, in particular. Or are you claiming that Hebrew and Arabic are not Semitic scripts? If you continue at the rate you are going, you will continue to build codes that will torture me until I die. If your strongly stated opposition to encoding Mandaic, Samaritan, and early Aramaic (which you have subsequently weakened by admitting, for example, that there is a separate community of usage of Samaritan) means that you don't want to represent some collection of early Aramaic scripts with separate characters (but instead wish to display them as font variants of Hebrew), then nobody is going to stop you. 
As Michael indicated, you are perfectly free to represent them all in Hebrew, if that is the best solution for your research. And if the relations are as transparent as you indicate, then conversion of other corpuses to match your own conventions should be reasonably trivial, in any case. --Ken This isn't an abstract and charming problem, like the conlangs, these are real languages and real software will be built for them. Maybe you have little interest in our small user community, but we are at least as large as the Samaritan one, although I admit they have far more interesting customs. Elaine
Re: Assume everything on this list is ignored (was Re: Newbie Question - what are all those duplicated characters FO R?)
Mark Davis scripsit: I repeat again. Nothing on this list has any guarantee that it will be seen by anyone in the UTC. If you want to submit a FAQ question that's great -- and I strongly encourage it. But please use: http://www.unicode.org/reporting.html to make sure it is tracked. Hearing and obedience. -- Work hard, John Cowan play hard, [EMAIL PROTECTED] die young, http://www.reutershealth.com rot quickly.http://www.ccil.org/~cowan
Note about CGJ in current MS implementation
A note for those interested in how CGJ may be used in font lookups: In the current MS implementation (Office 2002, Wordpad, etc.) if CGJ is inserted immediately after a space character it breaks RTL directionality. So for the time being at least, any use of CGJ to affect rendering in Biblical Hebrew (where it is really proving very useful in a variety of ways) requires that CGJ always be preceded by something other than space. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] The sight of James Cox from the BBC's World at One, interviewing Robin Oakley, CNN's man in Europe, surrounded by a scrum of furiously scribbling print journalists will stand for some time as the apogee of media cannibalism. - Emma Brockes, at the EU summit
RE: Conflicting principles
Madison Hi, Only two people asked me what else exists in the complete Hebrew character set, but maybe others care. The significant points here are that there are other pointing systems to be combined with base letters and that there are manuscripts that have TWO pointing systems marked on EACH consonant, sometimes two Hebrew ones, sometimes a Hebrew one AND an Arabic one. And sometimes, in exotic Karaite manuscripts, there are Arabic letters with Tiberian pointing--there are some of these in England, Cambridge U, I think--Elaine

THE COMPLETE ARAMAIC / HEBREW CHARACTER SET
(PRELIMINARY--missing 11 Jewish dialects, 10 still spoken)

                                                   Net count
                                                   (subtracts overlap)
SECTION A  ANCIENT OR COMMON SYMBOLS
  Original 22-letter alphabet                       22
  Epigraphic punctuation                             4?
  Epigraphic numbers                                11
  Ezra's points                                      2
  Medial letters                                     5
  Tiberian pointing, etc                            52
  Other Hebrew ms symbols                            7
  TOTAL                                            100?

SECTION B  VARIANT LETTERS FOR REGIONAL JEWISH LANGUAGES
  Arabic (=Judeo-Arabic)                             4
  Berber (=Judeo-Berber)                             0
  Persian ()                                         3
  Tajik (=Bukhari)                                   2
  Tat                                                2
  Krimchak                                           1
  Neo-Aramaic (=Kurdit)                              1
  Greek (written in Hebrew..)                        1
  French (written in Hebrew..)                       3
  Shuadit, Comtadin (Provencal written in Hebrew)    0
  Italian                                            1
  Ladino                                             2
  Yiddish                                            3
  Net subset totals                                 20

SECTION C  BABYLONIAN POINTING ETC
  Babylonian                                        35

SECTION D  PALESTINIAN POINTING ETC
  Palestinian                                       18

SECTION E  SAMARITAN POINTING ETC
  Samaritan                                         12
  Net subtotals C, D, E                             65

SECTION F  RARE OR UNIQUE SYMBOLS
  Palmyrene dotted resh                              1
  Bodleian Hebrew e63                                1
  Cairo Codex                                        1

Total Aramaic / Hebrew to date                     188 ?

I have the file with footnotes, but I don't know where-- packed somewhere
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 05/08/2003 09:42, Jim Allan wrote: Peter Kirk posted: If I want to do this, should I explicitly encode a dotted circle, or should I encode nothing and expect the font to generate the dotted circle, as it often does? I think that the practice of a font or application automatically inserting a dotted circle under an orphaned combining character is dubiously compliant with Unicode specifications. ... Thanks, Jim, for all this data, but now I am totally confused. Well, at least it seems clear that if I want a dotted circle I should explicitly encode it. But if I don't... Suppose for example I want to write a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character, a diacritic regularly positioned centred above the letter, which does not have a defined spacing variant. I don't want a dotted circle. And I want it to be spaced as here, i.e. with one space before the diacritic and one after it. It seems to me that at one place in the standard I am told to encode space - combining mark - space, for the combining mark will not combine with the space because the space is not a base character; and in another place I am implicitly told to encode space - space - combining mark - space, because the second space acts as a carrier for the combining mark. I hope that wanting to display this correctly is not another place where I have stepped over the boundaries of what is reasonable to expect plain text to convey, but that this too can be grist for the Unicode 5.0 mill to grind very finely - both quotes from Ken Whistler earlier today. And I think that if this issue is clarified it will also become clear what should be done about string initial holam and alef etc. 
Perhaps a simple way ahead would be to define a new character something like COMBINING MARK HOLDER with no glyph, which is defined specifically for this purpose, is a base character and not a format character, and is expected to be just as wide as is necessary to display the combining mark. Then we could say that a spacing accent is equivalent (possibly even canonically if made a composition exclusion?) to COMBINING MARK HOLDER plus a non-spacing accent, and remove the misleading compatibility equivalences to SPACE plus a non-spacing accent. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
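The equivalences at issue can be inspected with a short Python sketch, assuming the normalization forms implemented by the standard unicodedata module. It shows that the spacing acute accent is only compatibility-equivalent, not canonically equivalent, to SPACE plus the nonspacing accent, and that an NBSP carrier folds to SPACE under compatibility normalization.

```python
import unicodedata

spacing_acute = "\u00b4"          # ACUTE ACCENT (spacing)
space_plus_mark = "\u0020\u0301"  # SPACE + COMBINING ACUTE ACCENT
nbsp_plus_mark = "\u00a0\u0301"   # NO-BREAK SPACE + COMBINING ACUTE ACCENT

# Canonical decomposition leaves the spacing accent alone.
print(unicodedata.normalize("NFD", spacing_acute) == spacing_acute)
# Compatibility decomposition turns it into SPACE + mark.
print(unicodedata.normalize("NFKD", spacing_acute) == space_plus_mark)
# Under NFKC the NBSP carrier is folded to a plain SPACE.
print(unicodedata.normalize("NFKC", nbsp_plus_mark) == space_plus_mark)
# Under NFC the NBSP carrier survives unchanged.
print(unicodedata.normalize("NFC", nbsp_plus_mark) == nbsp_plus_mark)
```

A dedicated carrier character as suggested above would change which of these comparisons hold, so this only describes the situation the message is reacting to.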
IETF, W3 ....?
Elaine Keown still in Madison Dear John Cowan and Peter Kirk: Could you possibly explain to me why these other organizations -- IETF and W3 -- are apparently concerned about character properties, to the point where apparently they also have a hand in deciding what will happen with Hebrew? For a long time, I thought that the gatekeepers were the UTC and the people in Tel Aviv, so there are these others? Elaine
Aramaic scripts
There are omissions in Michael Everson's chart in http://www.dkuug.dk/jtc1/sc2/wg2/docs/n2311.pdf The chart was based on Semitic languages, although purporting to be about scripts. After all, Greek and Latin also derive from the same family of scripts, as we all learn from page 1 of Greek grammars. There are less obvious omissions: 1. Kharoshthi, a RtoL script much used in North West India, and regarded by everyone as a derivative of a form of the Aramaic script used in that region. It is found on coins, Ashokan edicts, various inscriptions and manuscripts. It was used to write mainly Prakrits, although some Sanskrit text is known. See, for example, A.H. Dani, Indian Palaeography, Oxford 1963. 2. Pahlavi, widely used to write Middle Persian. This involved a troublesome mixture of Persian reading of Aramaic words, a subject requiring more elaboration than is needed here. Raymond Mercier
RE: Questions on ZWNBS - for line initial holam plus alef
3) In attribute values that have a declared type other than CDATA, multiple spaces are compressed to a single space, and leading and trailing spaces are removed. After this is done, there can be no spaces in attributes of type ID, IDREF, ENTITY, NMTOKEN, NOTATION, or enumerated types. In the types IDREFS and ENTITIES, spaces are used to separate individual tokens, none of which may begin with a combining character. In the remaining type, NMTOKENS, individual characters may begin with a combining character, so it is possible that such a token, if not the first in the attribute, will be rendered in a peculiar way, with the combining character placed over the separating space. But that is a mere rendering glitch and in no way affects anything. Not just a rendering glitch, I suspect. If the combining character is combined with the separating space, the space loses many of its separating functions, and perhaps keeps a confusing subset of them with all sorts of possibilities of error. At best tokens beginning with combining characters will be unusable. At worst they will crash the implementation (and count on someone trying deliberately to do that!). The only safe thing to do is to specify that space followed by a combining mark is NEVER considered to be a space and this combination is NEVER generated. No, the safe thing to do (and the thing that is done) is to treat the space as a space ignoring the fact that the NMTOKEN contains a combining character, this is even safer than your suggestion since it can't mis-identify the combining properties of a character. This effectively bans space+combining (and for that matter NBSP+combining since NBSP isn't allowed in NMTOKENs) within an NMTOKEN and means that if you attempt to begin an NMTOKEN with space+combining it will be treated as beginning with the combining character. 
The resulting loss of expressive power in having this banned is negligible. It means that you can't use what is quite a linguistic oddity (space+combining is mainly used in meta-discussion of combining marks, as was mentioned earlier) in a context where the text is human-readable (hopefully) but not fully general. NMTOKENs should only be given raw to a user by relatively low-level tools (i.e. general purpose XML tools for developers); in other contexts they should be represented by a more user-friendly and application-appropriate indicator (perhaps text, perhaps not), so the inability to use space+combining won't apply at that level.
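To make the point about tokenized attributes concrete, here is a minimal sketch in Python. It is not an XML parser: the function names are made up for illustration, the normalization only approximates what XML 1.0 prescribes for non-CDATA attribute types, and the "combining" test uses the Unicode general category as a stand-in for XML's own CombiningChar class.

```python
import unicodedata

def normalize_tokenized_attribute(value: str) -> str:
    """Approximate XML 1.0 normalization for a non-CDATA attribute value."""
    # Tabs, line feeds and carriage returns are turned into spaces first.
    for ws in "\t\n\r":
        value = value.replace(ws, " ")
    # Leading and trailing spaces are dropped; runs of spaces collapse to one.
    return " ".join(part for part in value.split(" ") if part)

def nmtokens(value: str) -> list[str]:
    """Split an NMTOKENS value on U+0020 only, without looking at what follows."""
    normalized = normalize_tokenized_attribute(value)
    return normalized.split(" ") if normalized else []

def starts_with_combining_mark(token: str) -> bool:
    # Assumption: general category M* stands in for XML's CombiningChar class.
    return unicodedata.category(token[0]).startswith("M")

# The second token begins with U+0301 COMBINING ACUTE ACCENT.
raw = "alpha\n  \u0301beta   gamma "
tokens = nmtokens(raw)
for tok in tokens:
    print([f"U+{ord(c):04X}" for c in tok], starts_with_combining_mark(tok))
```

The space is consumed as a separator whether or not a combining character follows it, so the token simply begins with the combining character, which is the behaviour described above.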
Re: IETF, W3 ....?
[EMAIL PROTECTED] scripsit: Could you possibly explain to me why these other organizations -- IETF and W3 -- are apparently concerned about character properties, to the point where apparently they also have a hand in deciding what will happen with Hebrew? For a long time, I thought that the gatekeepers were the UTC and the people in Tel Aviv, so there are these others? The IETF and the W3C do not care in the least what properties are assigned by the Unicode Consortium to any specific character, or what treatment is given to any specific script. They do care very much that the Unicode Consortium, having made certain guarantees of stability (viz. that certain character properties would not be changed), abides by those guarantees. It's pretty well agreed by those who care that the combining classes of Hebrew vowel signs were assigned badly. Unfortunately, nobody pointed out the problem (or not forcibly enough) during the period 1991-1999 when something could have been done about it. It's too late to do anything about it now without breaching those guarantees. The Unicode Consortium's word is its bond. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com I must confess that I have very little notion of what [s. 4 of the British Trade Marks Act, 1938] is intended to convey, and particularly the sentence of 253 words, as I make them, which constitutes sub-section 1. I doubt if the entire statute book could be successfully searched for a sentence of equal length which is of more fuliginous obscurity. --MacKinnon LJ, 1940
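For readers who want to see what the frozen combining classes do in practice, the following sketch uses Python's standard unicodedata module (assuming it reflects the published Unicode Character Database). The choice of sample points and the BET + DAGESH + PATAH sequence are illustrative only.

```python
import unicodedata

# A few Hebrew points; each has its own fixed canonical combining class.
points = ["\u05b0", "\u05b7", "\u05b9", "\u05bc", "\u05c1", "\u05c2"]
for ch in points:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: ccc={unicodedata.combining(ch)}")

# Because the classes differ, canonical ordering rearranges the marks:
# BET, DAGESH (ccc 21), PATAH (ccc 17) becomes BET, PATAH, DAGESH under NFD.
typed = "\u05d1\u05bc\u05b7"
normalized = unicodedata.normalize("NFD", typed)
print([f"U+{ord(c):04X}" for c in normalized])
```

Any change to these values would alter the normalized form of existing text, which is why the stability guarantee rules it out.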
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
The NFD decompositions of spacing marks is already defined as a SPACE plus a non-spacing combining character. Philippe, please! Those are *compatibility* decompositions. The normal form NFD only uses *canonical* decompositions. And there is no such thing as NFD decompositions. /kent k
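The distinction can be checked with Python's standard unicodedata module. This is only an illustration, using U+00B4 ACUTE ACCENT as one example of a spacing mark whose decomposition is a compatibility mapping.

```python
import unicodedata

acute = "\u00b4"  # ACUTE ACCENT, a spacing diacritic

# The decomposition mapping is tagged <compat>, so it is not canonical.
print(unicodedata.decomposition(acute))

nfd = unicodedata.normalize("NFD", acute)    # canonical decomposition only
nfkd = unicodedata.normalize("NFKD", acute)  # compatibility decomposition too

print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkd])
```

NFD leaves the character alone; only NFKD (and NFKC) replace it with SPACE followed by the combining acute.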
AL32UTF8 Vs UTF8
Greetings, We are using Oracle9i with the application tier as 11i. I wanted to know the differences between AL32UTF8 and UTF8. My database (Oracle) will be in AL32UTF8 format. Will the applications that require multibyte characters work as they are functioning in UTF8 format? It would be great if anybody could give me a comparison of AL32UTF8 and UTF8. Also, please list any requirement for third-party software for code page conversions in the case of AL32UTF8. Thanks in advance, -Jay
Re: Questions on ZWNBS - for line initial holam plus alef
On 06/08/2003 03:38, Kent Karlsson wrote: Kenneth Whistler wrote: Kent Karlsson said: I see no particular *technical* problem with using WJ, though. In contrast to the suggestion of using CGJ (re. another problem) anywhere else but at the end of a combining sequence. CGJ has combining class 0, despite being invisible and not (visually) interfering with any other combining mark. Using CGJ at a non-final position in a combining sequence puts in doubt the entire idea with combining classes and normal forms. Why? See above (I DID write the motivation!). Combining classes are generally assigned according to typographic placement. Combining characters (except those that are really letters) that have the same placement, and interfere typographically are assigned the same combining class, while those that don't get different classes, ... Not true, as we have seen for Hebrew. It's supposed to be true, but isn't, and the problems can't be fixed. ... and the relative order is then considered unimportant (canonically equivalent). How is then, e.g. a, ring above, cgj, dot below supposed to be different from a, dot below, cgj, ring above (supposing all involved characters are fully supported), when a, ring above, dot below is NOT supposed to be much different from a, dot below, ring above (them being canonically equivalent)? ... There is no difference when the characters really do not interfere typographically. But when they do, there is a real and, in some languages, meaningful distinction. ... ... the only ways out seem to be to either formally deprecate CGJ, or at least confine it to very specific uses. Other occurrences would not be ill-formed or illegal, but would then be non-conforming. OK, let's confine it to those specific uses where it is really needed, e.g. to get round the problem of combining characters with different combining classes which actually do interact typographically, and perhaps there was another one being suggested. 
I have no problem with that - as long as the list of permitted uses is not set in stone, so that new uses can be approved when they are discovered. But there is no good reason to object to its use in those cases where it is needed, simply because in many other cases it is not needed. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
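The effect under discussion can be reproduced with Python's standard unicodedata module. The sketch below uses the very sequences mentioned above (a, ring above, dot below, with and without CGJ); the helper name nfd is just a local convenience.

```python
import unicodedata

RING_ABOVE = "\u030a"  # combining class 230
DOT_BELOW = "\u0323"   # combining class 220
CGJ = "\u034f"         # COMBINING GRAPHEME JOINER, combining class 0

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

plain_1 = "a" + RING_ABOVE + DOT_BELOW
plain_2 = "a" + DOT_BELOW + RING_ABOVE
with_cgj_1 = "a" + RING_ABOVE + CGJ + DOT_BELOW
with_cgj_2 = "a" + DOT_BELOW + CGJ + RING_ABOVE

# Without CGJ the two orders are canonically equivalent.
print(nfd(plain_1) == nfd(plain_2))
# With CGJ (class 0) in between, canonical reordering is blocked,
# so the two orders remain distinct after normalization.
print(nfd(with_cgj_1) == nfd(with_cgj_2))
```

Whether that preserved distinction is a problem or a feature is exactly the disagreement in this thread; the code only shows that normalization keeps it.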
Re: Questions on ZWNBS - for line initial holam plus alef
On Saturday, August 09, 2003 12:49 AM, Michael Everson [EMAIL PROTECTED] wrote: At 14:22 -0700 2003-08-08, Kenneth Whistler wrote: Philippe, you are tilting at windmills, here. There is no chance that the UTC is going to consider such a character, in my assessment, let alone give it the properties you suggest. Nor WG2 either. Why is that? Because I suggest something that others may find useful to fill a large gap in Unicode for spacing diacritics, but I am not trusted enough, due to my errors or confusions here, for this suggestion to be endorsed by more serious UTC or WG2 members? I admit that the properties of such a character can be discussed, and that it is possibly not a Sk symbol but a Lo letter, in which case the name INVISIBLE LETTER may be appropriate (where it could also fill the gap for Hebrew Yerushala(y)im, but this is possibly a distinct function for a missing letter in phonology). Why do you think it is stupid to have a single carrier character that would avoid adding new spacing diacritics, when the standard combining diacritics could be used, without quirks like defective sequences, to produce the desired effect? If you think that spacing diacritics are stupid, why then are they given these properties and not deprecated (no longer recommended) in the standard, in favor of the SPACE+diacritic sequences, which are really not equivalent to spacing diacritics used as symbols (sometimes also described as MODIFIER LETTER, which is very misleading given their gc=Sk property) and as base characters (to which other diacritics can be applied)? -- Philippe.
Re: Conflicting principles
On Friday, August 08, 2003 9:16 PM, Peter Kirk [EMAIL PROTECTED] wrote: On 07/08/2003 13:57, John Cowan wrote: ... But an immediate problem comes to mind: what if there is a line break between the two base characters? What if there is a line break between the two characters joined by a double width combining character? Are arbitrary line breaks in the middle of words actually permitted anyway? Presumably any line breaking property of the first base character of the pair is cancelled anyway. That leaves a problem only if the second base character has a line break before possibility. Well, that could just be treated as one of the sequences we were discussing yesterday, not illegal Unicode but its rendering is undefined. Such breaks in the middle of a multiple-width diacritic exist in some notations and are not considered horrible typography. Just look at musical notation, where an upper horizontal parenthesis is used to group some elements (sorry, I don't know exactly what it is called in English or Italian), even when there is a measure break in the middle, which may carry over to the next line of music: you end up with two parts of the same diacritic broken across the lines. -- Philippe.
Re: Handwritten EURO sign
Michael Everson schreef: More horrifying is the idiotic euro is immune to grammar error which continues to be broadcast daily by our television and radio stations, all because people with power lacked the moral courage to say oops, yeah, that was the wrong interpretation of the Directive which was intended to ensure clean typography. Sigh. I have absolutely no idea what you are talking about. Pim Blokland
Re: Pigpen/Masonic/Poundex
At 18:49 +0200 2003-08-08, Chris Jacobs wrote: This seems to be a clear difference from colorful scripts, where I think there is an agreement about which glyph represents which sound. So I think the analogy between pigpen and colorful scripts does not hold. Two gifs on two websites does not constitute actual use of a script, nor a need for real users to interchange it. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
At 05:27 PM 8/8/2003, Kenneth Whistler wrote: Because the mechanism for doing so -- application to SPACE or to NBSP -- has been specified by the standard for a decade now. True enough, but I'm also a bit concerned about this mechanism because white space characters are another pesky thing that not all applications paint. TEX, perhaps most famously, uses its own 'glue' instead of the space glyph in the font. And what happens when word spacing is expanded or contracted in text? The diacritic mark ends up being shoved to the left or right of where it should be. Of course, if the space glyph is not painted you have to rely on blind offsets for mark positioning, because unpainted glyphs can't be found for smart positioning lookups. As someone who cares about typography, I don't like blind offsets because they don't offer precise enough control: I would much rather have a mechanism that I can reliably and precisely use with glyph positioning lookups. I'm not suggesting that the use of space/nbspace for this purpose should be deprecated, only that an alternate mechanism would be useful for those who want more control of how combining marks are rendered on a blank base. A similar but not identical issue was raised by Peter Constable when we were talking about Qere vs Ketiv readings in Biblical Hebrew. There are cases in which vowels are applied to ellided consonants, which in some texts results in marks applied to a blank base in mid-word. In this case, my concern about using space or nbspace is that these imply a word break where there is not, in fact, any break in the word: the blank base is part of the word. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] The sight of James Cox from the BBC's World at One, interviewing Robin Oakley, CNN's man in Europe, surrounded by a scrum of furiously scribbling print journalists will stand for some time as the apogee of media cannibalism. - Emma Brockes, at the EU summit
RE: Conflicting principles
Ken's point of course is that however bizarre the backing store for Sindarin and English Tengwar modes may be, combining characters per se must follow their base characters no matter what. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Conflicting principles
On Thursday, August 07, 2003 11:29 PM, Michael Everson [EMAIL PROTECTED] wrote: Ken's point of course is that however bizarre the backing store for Sindarin and English Tengwar modes may be, combining characters per se must follow their base characters no matter what. Even if that breaks the logical analysis of text? How does the Sindarin mode affect the line or word breaking rule for example: suppose that the combining character is coded after the next logical base character, would it be valid to break at this base character and thus send the combining vowel to the next line, where in fact what is intended is to use a vowel carier for the combining character logically attached to the previous base character? I don't know Tengwar's Sindarin mode enough to see how word breaking can affect the interpretation of text. But preserving the logical ordering of letters seems much more important for actual text encoding than just being constrained by combining rules that were created taking into account only the first encoded scripts for Latin, Greek, Cyrillic, Hebrew, Arabic and Hiragana/Katakana scripts that use combining characters. The response to such answer would come in relation with other still unencoded scripts; you quoted some of them which have similar difficulties, and that are neither extinct, and have a huge amount of existing texts to represent, including many modern languages that are only partly litterated and that would benefit from a written litteracy form according to similar languages spoken and written in a cultural region, notably in Africa, Central Asia, and Oceania (regions that have suffered for too long of an absence of an easy to adapt and learn writing system for minority languages). 
Even in India, there is still no consensus for the use of the ISCII-based writing system for Brahmic scripts, and the current work on Tibetan or on Indo-Aryan languages show that the currently officially adopted system does not fit the cultural demand of minority users, because the official writing system does not fit very well their language. There will certainly not be a huge revolution in writing systems (families of scripts with similar behaviors), but existing systems will still continue to be adapted to fit local cultural demands for minorities and specialized areas, that a too strict encoding model proposed now by Unicode cannot fit well. Some examples include text that use a non linear layout, where the layout carries important semantics (examples are numerous for hieroglyphic languages, one of which having modern use and not fitting well with Unicode which often fails to represent clusters with simple combining sequences assuming a base character and diacritics). If one looks at Korean jamos, the problem has only been solved by actually *reducing* the number of layout combinations, and creating artificial letters (jamos) for some combinations that are logically perceived as multiple letters (for example the SSANGKIEOK jamo, which is really a pair of KIEOK letters), which are only partly decomposed and represented as their component letters, whose composition layout is greatly simplified but does not match correctly the historic Hangul clusters. Probably the same thing can be said about Han ideographs, constantly updated to present new clusters, and even Hiragana/Katakana clusters currently represented as single codepoints when in fact they are really composed, and constantly enriched with new clusters notably in the scientific area. To allow users to create their own clusters, Unicode has added ideographic description characters which are controls used as prefixes for a combining sequence containing base letters. 
This is already a break in the axiomatic view of combining sequences made with a single base letter. Other areas where combining sequences do not follow this model are, of course, the Hangul script, the CGJ character used between two base letters, double (width) diacritics, ... There really are already many exceptions to the axiomatic view of combining sequences, and I don't see why there could not exist a model allowing new classes of combining characters attached to a *following* base character, such as for Tengwar Sindarin vowels (if we suppose that Sindarin vowels are encoded separately from Quenya vowels, because of their distinct combining properties, and because the Tengwar script is really a family of related scripts, which contains many more differences than there are between the separate Latin, Greek and Cyrillic scripts). So one cannot be satisfied with the currently limited model of a single base letter and combining modifiers, which would create an artificial hierarchy between letters that does not fit the cultural semantics of the encoded language. -- Philippe.
Re: Questions on ZWNBS - for line initial holam plus alef
From: John Cowan [EMAIL PROTECTED] Peter Kirk scripsit: So far so good, but when I get to an accent with no predefined spacing variant, I have a problem! No you don't. If you want to say Seagull is the diacritic used to represent linguolabial sounds in the IPA, then you just encode U+0020 U+033C at the beginning of the next line. If the seagull doesn't line up properly, you complain to the foundry or the implementor. It's true that you can complain to a foundry about inappropriate glyph positioning, but not to an implementor of other components dealing with text boundaries. The inaccuracies we are speaking about are not in the glyph representation but in the text handling algorithms, the latter clearly being part of the Unicode standard, unlike font problems.
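As a small illustration of the U+0020 U+033C encoding mentioned above, the following Python sketch builds the isolated mark and inspects it with the standard unicodedata module. The variable names are made up for the example, and nothing here says anything about how a given font will render the result.

```python
import unicodedata

SEAGULL = "\u033c"
isolated = " " + SEAGULL  # SPACE acting as the carrier for the mark

for ch in isolated:
    print(f"U+{ord(ch):04X}",
          unicodedata.name(ch),
          unicodedata.category(ch),
          unicodedata.combining(ch))

# Normalization leaves the pair alone: there is no precomposed
# character for SPACE plus this mark.
stable = all(unicodedata.normalize(form, isolated) == isolated
             for form in ("NFC", "NFD"))
print(stable)
```

The code points survive normalization intact; where line and word boundaries fall around them is a separate question, which is the objection raised in the reply.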
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Wednesday, August 06, 2003 11:48 PM, Peter Kirk [EMAIL PROTECTED] wrote: OK, what kind of markup should I use, in any well-known markup language, to ensure that an isolated diacritic is centred in the space between the words before and after it? In plain text, I think that this encoding: ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2... is what you need, as it creates the following combining sequences: ...endOfWord1, SPACE, SPACE, diacritic, SPACE, startOfWord2... If you don't want any space around the diacritic which must be displayed isolated but in the middle of a word, the following would work: ...endOfWord1, SPACE, diacritic, startOfWord2... Here the SPACE is not a break opportunity, but just the base character for the diacritic inserted. What is missing in the standard is defining the property of such SPACE+diacritic sequence: normally it inherits the properties of the base character, and properties of diacritics are ignored. But when using a SPACE or NBSP base character new properties may be needed. If there's still a break opportunity on the base SPACE of a combining sequence, it is not clear where the break occurs: before the SPACE (i.e. before the combining sequence), or after the diacritic (i.e. after the combining sequence)? I think that the second option applies here, i.e. the base SPACE would create a break opportunity at end of the whole combining sequence made with a SPACE and the following combining characters (including CGJ if needed to fix canonical ordering). Another similar case would be the use of a isolated nukta (which normally modifies a following base character): the sequence nukta, SPACE is a single combining sequence with a break opportunity. So a sequence like nukta, SPACE, acute accent would be unbreakable but would include a break opportunity at its end, unless it is followed by a NBSP. And the sequence nukta, NBSP, acute accent would also be unbreakable either in the middle or on both ends. -- Philippe. 
RE: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
I would like to point out that with all due respect, how particular fonts or rendering engines behave is only marginally relevant to the Unicode list. I think that we should deal only with the Unicode specification. A particular implementation or many implementations may not behave as expected, and then may be either conformant or non-conformant, or may behave as expected and still be either conformant or non-conformant. Messages such as the attached help the discussion of the specification only as illustrations and as a basis for discussing conformity. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter Kirk Sent: Wednesday, August 06, 2003 12:11 PM To: Curtis Clark Cc: Unicode List Subject: Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...) On 05/08/2003 16:59, Curtis Clark wrote: on 2003-08-05 15:31 Peter Kirk wrote: Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one? In this language the diacritic may appear above the letters... Two spaces, at least in Thunderbird Mail. Thank you. Well, this sort of works. I looked in various fonts. In some of them the diacritic is centred in the space between the words diacritic and may, but in others it is offset to the left or the right. The problem is that the space is wider than the diacritic, which confuses things, and all the more so no doubt if it expands for justification. NBSP would probably be a better choice in that it is less likely to expand. But what I am looking for is a diacritic holder which is defined to be only as wide as the diacritic. 
On the principle that base characters expand to fit the width of the diacritic, ZWSP or, better, a real (rather than misnamed) zero width no break space would seem to have the right properties for that. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Handwritten EURO sign
At 08:55 -0700 2003-08-05, Doug Ewell wrote: The original legislative attempt to dictate the exact proportions (and even color) of the euro sign, regardless of the font in use, was just silly. That is very old history, as detailed on my website (http://www.evertype.com/standards/euro/euroglyph.html). More horrifying is the idiotic euro is immune to grammar error which continues to be broadcast daily by our television and radio stations, all because people with power lacked the moral courage to say oops, yeah, that was the wrong interpretation of the Directive which was intended to ensure clean typography. Sigh. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Assume everything on this list is ignored
Isn't the very notion of submit[ting] a FAQ question a contradiction in terms? Surely, one merely ASKS a question. If enough people ask the same question, we may then classify it as frequently asked. It's like this. Newbies want to find things out. So they read books, and look around on the web. Eventually, they'll encounter some point of confusion they can't resolve by their own research (or don't have time to thoroughly research), so they will then find some forum to join in the hope of finding somebody there who will know the answer. This forum -- indeed, ANY forum -- will have questions asked on it. Some of them may be asked frequently. These are, by definition, Frequently Asked Questions _of the forum_. Forum FAQs are generally put together by long-term members of forums who are sick of having to answer the same question over and over again to all these damn newbies, or by other long-term members who simply wish to cut down the traffic on the list. Now this is, in fact, rather curious. Because the web page http://www.unicode.org/consortium/distlist.html implies that _this_ list (described as the Unicode Public E-mail List) is _the_ place for the public to go to pose questions to the community of Unicode users. In THE SAME PARAGRAPH that web page says as a courtesy to others on the list, please check the ... Frequently Asked Questions [at http://www.unicode.org/faq/];. (Which I did). Now, if it is true, as Mark Davis suggests, that the Frequently Asked Questions list at http://www.unicode.org/faq/; is unrelated to this list, then: (1) This should be made clear on the consortium's web page (http://www.unicode.org/consortium/distlist.html), which currently implies that the stated FAQ is the FAQ _of this list_, and (2) This list should have a FAQ of its own, independent of the consortium's FAQ, and maintained by long-term members of this list (i.e. by those who are in a position to know which questions are, in fact, frequently asked). 
...and for what it's worth, the consortium's submission form at http://www.unicode.org/reporting.html seems (a) difficult to find without knowing the URL (I couldn't find it anyway, at least not by starting at www.unicode.org and clicking on links from there), and (b) intimidating -- it is not worded to encourage the I don't understand feature XYZ type of question from the public. I am therefore forced to wonder who actually _asks_ these frequently asked questions of theirs. Just my thoughts. Please don't take any of this too seriously. Jill -Original Message- From: John Cowan [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 12, 2003 1:35 AM To: Mark Davis Cc: [EMAIL PROTECTED] Subject: Re: Assume everything on this list is ignored (was Re: Newbie Question - what are all those duplicated characters FOR?) Mark Davis scripsit: If you want to submit a FAQ question
The relation between Unicode and ISO/IEC 10646
As far as I know, there are many topics not covered by ISO, for example bi-directional behavior. Jony -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of souravm Sent: Tuesday, August 12, 2003 8:40 AM To: unicode Subject: SPAM: The relation between Unicode and ISO/IEC 10646 Hi All, As I know, historically ISO/IEC 10646 (UCS) is from ISO and Unicode was defined by a consortium of major American computer manufacturers. From version 1.1 on, Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The latest fact I found that Unicode 4.0 character repertoire corresponds to ISO/IEC 10646:2003. Also I understand that from Unicode 2.0 onwards Unicode covers all the code points of UCS-4. Now, my doubt is, in the current situation, - What is the need for continuing both of these two different coded character sets in parallel? Why can't they be merged? - Is there any additional issues/points taken care of by ISO/IEC 10646:2003 which are not there in Unicode 4.0 and vice versa ? Any funda on this will be really appreciated. Regards, Sourav
RE: Conflicting principles
Collation isn't really based on combining sequences (even though UTS 10 specifies a certain spanning over non-blocking (combining) This is a very ignorant question: where in your public documentation are these issues discussed? ... I still don't understand even what happens with basic collation in Hebrew, what effect the shin / sin dots have. Ignored at level 1, considered at level 2. From the 14651 data file: U05C1 IGNORE;SHINP;MIN;U05C1 % HEBREW POINT SHIN DOT U05C2 IGNORE;SINPT;MIN;U05C2 % HEBREW POINT SIN DOT And, of course, I don't understand any of the more complicated issues either, such as what will happen when your database sorts un-pointed Hebrew epigraphy (just the consonants) and pointed medieval Hebrew (all the jots and tittles added). Re. collation, see UTS 10, and associated data files, and if you're really interested, see ISO/IEC 14651 (sort of a parallel to UTS 10, but different), and its data file. /kent k
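The "ignored at level 1, considered at level 2" behaviour can be imitated with a toy two-level sort key in Python. This is only a sketch of the idea: it is not the UTS 10 or ISO/IEC 14651 algorithm, it uses raw code point values as stand-in weights, and toy_sort_key is a made-up name.

```python
import unicodedata

def toy_sort_key(word: str):
    """Two-level key: base letters first, combining marks only as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    # Level 1: base characters only (marks are ignored here).
    primary = tuple(ord(c) for c in decomposed if unicodedata.combining(c) == 0)
    # Level 2: the marks, consulted only when level 1 is equal.
    secondary = tuple(ord(c) for c in decomposed if unicodedata.combining(c) != 0)
    return (primary, secondary)

SHIN, SHIN_DOT, SIN_DOT = "\u05e9", "\u05c1", "\u05c2"
RESH, BET = "\u05e8", "\u05d1"

w_shin = SHIN + SHIN_DOT + RESH   # pointed with shin dot
w_sin = SHIN + SIN_DOT + RESH     # same consonants, sin dot
w_bet = SHIN + SIN_DOT + BET      # different second consonant
w_plain = SHIN + RESH             # unpointed

ordered = sorted([w_sin, w_shin, w_plain, w_bet], key=toy_sort_key)
print([[f"U+{ord(c):04X}" for c in w] for w in ordered])
```

In this toy model, unpointed and pointed spellings of the same consonants sort together, and the dots only decide the order among otherwise identical strings.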
Re: Questions on ZWNBS - for line initial holam plus alef
On 13/08/2003 15:54, Jony Rosenne wrote: Suggested but not accepted. I am inherently suspicious when pressure is being exerted to decide complex and difficult questions in a hurry. Jony Jony, I am not trying to hurry anything. I am putting a lot of time and effort into trying to reach proper decisions on these complex and difficult questions. What I am not prepared to do is to accept a quick answer that the lowest common denominator of printers don't bother to do X, therefore we need not bother to support X in Unicode although X is a definite requirement of a significant subset of Hebrew users. If you have problems with this particular suggestion, let's discuss them on the Hebrew list. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Display of Isolated Nonspacing Marks (problems with UAX#29)
Philippe Verdy verdy_p at wanadoo dot fr wrote: Note that these two ZW and SP classes of characters are *normative*. Another proof that SPACE+diacritics is really a hack causing lots of problems in the Unicode main standard and its standard annexes. Has it occurred to anyone yet that the very *concept* of spacing diacritics is a hack? Spacing diacritics are used to conduct a sort of meta-discussion about characters, as in A base character o is combined with an acute accent to create . They are not part of the normal writing systems of most natural languages. It is as if I were describing the two typical glyphs used for lower-case g, the one with one bowl and the one with two bowls, but actually showing the separate, constituent pieces of the glyphs instead of using words to describe them. They are interesting things to talk about, but not necessarily things that need to be encoded in plain text. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
RE: ADO, SQL-Server and VB6
I might be able to help. Two questions: 1. How firmly have you tracked down the point at which this conversion happens? 2. What is the datatype in the database? (text BLOB?, ntext BLOB? varchar?)
RE: Questions on ZWNBS - for line initial holam plus alef
Michael wrote: The Name Police reject this utterly. ZERO WIDTH cannot have an expanding dynamic width. Then what about ZERO WIDTH SPACE, which, according to TUS3, p. 238, can grow to have a visible width when justified? And it has the NamesList comment: * nominally zero width, but may expand in justification (But U+0082, BREAK PERMITTED HERE, which otherwise is very similar to ZWSP according to 6429, does apparently not allow such stretching...) /kent k
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 16:06, Mark Davis wrote: Some of this seems to be in reference to an earlier contention that Text Boundaries (inc. Lines) break between the space and the non-spacing mark. I think this was attributed to Philippe. [This may not be true: I don't actually read his email, because the information content per line falls below my email threshold; not to say that there may not be information there, but I cannot afford to take the time to find out -- sadly, one of my character flaws.] All of the text boundaries preserve grapheme cluster boundaries, which never separate a base character (including space and NBSP) from a following NSM. In addition, each of the boundary types above grapheme clusters makes some statement about the behavior of the grapheme cluster. For example, with line boundaries a SPACE + NSM has a special behavior. With the others, the behavior is the same as the base character. As Ken points out, in any event these are default boundaries, and can be tailored. That being said, if the normal behavior of the default can be improved, and someone has a concrete proposal for doing so, then it can be considered. Mark __ http://www.macchiato.com Eppur si muove I was aware that there should not be a line break or word break between the space and the NSM, although I suspect that many implementers will not be aware of this, or at least will not test for it properly and so treat any space as a word break and a line break opportunity. As I just wrote, this requirement to test all spaces for following NSMs is a significant inefficiency built into the standard. But there is still a problem if there is considered by default to be a word break and a line break opportunity AFTER the NSM. I would suggest, as a candidate for a concrete proposal, that the default behaviour be adjusted so that there is no word break or line break opportunity here either. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
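The test Peter describes, checking each space for a following non-spacing mark before treating it as a break opportunity, might look roughly like the sketch below. It is not an implementation of UAX #14 or UAX #29. The names is_mark and safe_space_breaks are hypothetical, and using the general categories Mn, Mc and Me to identify marks is an assumption of this example.

```python
import unicodedata

def is_mark(ch):
    # General categories Mn, Mc and Me all start with "M".
    return unicodedata.category(ch).startswith("M")

def safe_space_breaks(text):
    """Return indices of U+0020 characters not followed by a combining mark."""
    return [
        i for i, ch in enumerate(text)
        if ch == " " and not (i + 1 < len(text) and is_mark(text[i + 1]))
    ]

# SPACE + HEBREW POINT HOLAM + ALEF, then an ordinary word space.
sample = "ab \u05B9\u05D0 cd"
naive = [i for i, ch in enumerate(sample) if ch == " "]
careful = safe_space_breaks(sample)
```

A naive implementation that breaks at every space would report both positions in the sample; the lookahead drops the space that carries the holam. The cost is one extra character inspection per space, which is the inefficiency complained about above.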
Unicode Technical Note added
A new Unicode Technical Note on Deterministic Sorting is now available: http://www.unicode.org/notes/tn9/ Unicode Technical Notes provide for the publication of information that may be of interest to implementers or readers of the Unicode Standard, or to users of programs which implement the Standard. The complete list of available notes is accessible here: http://www.unicode.org/notes/ Regards, Rick McGowan Unicode, Inc.
Roadmap-Mandaic, Early Aram., Samarit Alternative Mel Gibson
Elaine Keown still in Madison WISC Hello, Responding again to the deep interest in Aramaic expressed on the list, I am writing with a suggested preliminary Alternative or possibly Countercultural version of the Roadmap and a New, Improved Acronym for EUSAS (Egyptian, Akkadian, Ugaritic, Semitic Alphabetic and Syllabic)... And, slightly OT, I imagine you all are also waiting breathlessly for the new Mel Gibson movie which is, of course, going to be in ARAMAIC with NO subtitles, not even Unicode-conformant ones. If Aramaic is trendy in LA, when will it hit Mountain View? Here is the beginning of an Alternative Roadmap. _Suggested Afroasiatic Roadmap Blocks_ Egyptian Hieroglyphics---the Aramaic glyphs for Aramaic in hieroglyphics (from Wadi El-Hol) are included Egyptian hieratic---the Aramaic ones (see wadi, above) are included Egyptian demotic--Aramaic demots are included The Cuneiform Block --- the one Aramaic cuneiform is included (and also the Arabic in cuneiform) _CEUSAS_ Instead of describing the not-yet-encoded Middle Eastern/ N and East African scripts as EUASAS, I suggest CEUSAS ---Cuneiform, Egyptian, Ugaritic, Semitic Alphabetic and Syllabic. Under cuneiform go Sumerian, Akkadian (old Babylonian and Assyrian), Hittite, Elamite and whatever. Cuneiform had a long shelf life--3,400 B.C. to about 125 A.D. Elaine
Roadmap-Mandaic, Early Aram., Samarit Alternative Mel Gibson
I think we will keep the Roadmap as it is for the time being. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
From: Kenneth Whistler [EMAIL PROTECTED] It is perfectly reasonable, as I see it, to consider the SPACE in a SPACE, NSM sequence to be: a. significant b. part of the characters in a document that are not markup (at least in the cases we are talking about, since the problem is not about defining Nmtokens for markup in Biblical Hebrew, but rather the representation of the Biblical Hebrew document content itself) So I *still* don't see the problem you are on about, and even if there was one, the xml:space attribute could be used to require preservation of a particular space. Maybe you are forgetting that in XML and HTML, attributes (including special attributes like xml:space) can have default values, and in fact such values are set in DTDs or schemas by normative XML applications like XHTML. Authors are not supposed to modify normative schemas or DTDs, and so use elements with their default attributes. This is the case of XHTML as an application of XML, and HTML as an application of SGML (neither HTML nor SGML parsers will interpret the xml:space attribute, and XML parsers will handle it only if they are validating documents with their DTD or schema).
Re: [A12n-Collab] Creating fonts for Akan language
At 12:27 AM 8/7/2003, [EMAIL PROTECTED] wrote: My desire is to create (make) a set of fonts for the Akan language for Windows 2000 to begin with. I have been able to create a crude version for my own use but I know that the people of Ghana would be very happy to be able to install a standardized version for their own use. I would also want to eventually map it to a keyboard, probably with extra keys for the two Akan characters. My problem is: 1. How do I set out to create such a font? 2. How do I use the existing character 0190/025B in such a font? 3. How do I create and get the 15th character accepted in the Unicode set? 1. See www.fontlab.com 2. Make a Unicode encoded font (TrueType or CFF OpenType). For use in Windows 2000 or XP or other Unicode text processing environments, you do not need to worry about 8-bit codepages: so long as the glyphs for these letters are mapped to the correct Unicode characters in the font cmap table, they will work. If you want to make your own keyboard layout driver for Akan, you can use Microsoft's new Keyboard Layout Creator: http://www.microsoft.com/globaldev/tools/msklc.mspx 3. The 'open o' character is already included in the Unicode Standard. The uppercase letter is U+0186 and the lowercase is U+0254. A couple of additional comments: Akan is a tonal language, yes? This likely means that although the Bureau of Ghana languages specifies an alphabet of 22 letters there are circumstances in which it is necessary to indicate tones to differentiate otherwise identical words. For educational and lexicographical texts it may also be desirable to indicate nasalisation. This means that simply providing glyphs for the 44 upper- and lowercase letters might not be sufficient: you may also need dynamic mark positioning. Microsoft are apparently releasing a number of updates to their core font set with upcoming versions of Office and Windows that will include extensive African language support. 
John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] The sight of James Cox from the BBC's World at One, interviewing Robin Oakley, CNN's man in Europe, surrounded by a scrum of furiously scribbling print journalists will stand for some time as the apogee of media cannibalism. - Emma Brockes, at the EU summit
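The code points mentioned in this exchange can be checked against the Unicode character database before building the font's cmap table. The sketch below assumes only Python's standard unicodedata module; it prints the character names and confirms the upper/lowercase pairing of the open o and open e letters.

```python
import unicodedata

# Open o (U+0186 / U+0254) and open e (U+0190 / U+025B).
for cp in (0x0186, 0x0254, 0x0190, 0x025B):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

# Case mapping links each capital to its small letter.
open_o_pair_ok = "\u0186".lower() == "\u0254"
open_e_pair_ok = "\u0190".lower() == "\u025B"
```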
Which ancestral links
In message [EMAIL PROTECTED] Michael Everson writes: Re: Colourful scripts and Aramaic This is nearly off topic, but I'd be glad of any clarifications, or references that anybody has. In message [EMAIL PROTECTED] Michael Everson wrote in response to Peter Kirk, with a clarification I agree with mainly: People. It [Aramaic] is the widespread offshoot used throughout the Middle East that spawned Brahmic and Uighur and other scripts. It isn't necessarily the thing you think is confined to three scraps of papyrus or whatever. I'd always been under the impression that the Brahmic script family and their offshoots, and the Phoenician script family and their offshoots, developed independently of each other, and although links between the two families had been suggested by some scholars, many other scholars disagreed with this suggestion. Are there some articles which show these links reasonably well, and if so, which family predated the other? Also Uighur script (as in Old Uighur, as in Sogdian) has, as a cursive script, a superficial resemblance to Arabic script (an offshoot from the Phoenician family) and I imagine that links are easier to show. I've never seen a description of the Sogdian alphabet (i.e. I have never come across one): is there a good article or URL which illustrates such links? Best wishes John -- John Clews, Keytempo Limited (Information Management), 8 Avenue Rd, Harrogate, HG2 7PG Tel:+44 1423 888 432 mobile: +44 7766 711 395 Email: [EMAIL PROTECTED] Web:http://www.keytempo.com
Unicode 4.0 is online at last!
Well, I've been promising that good things would come to those who wait. ;-) At last, the Unicode website has been updated with the online chapters for Unicode 4.0. See: http://www.unicode.org/versions/Unicode4.0.0/ Or just go to the Unicode 4.0 link from the home page. Enjoy. --Ken P.S. Just FYI, Peter K., now it is o.k. for everyone to come back from their August Unicode vacations. Let the textual criticism begin!
Re: Questions on ZWNBS - for line initial holam plus alef
- Original Message - From: Peter Kirk [EMAIL PROTECTED] To: Jon Hanna [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wednesday, August 13, 2003 3:05 PM Subject: Re: Questions on ZWNBS - for line initial holam plus alef On 13/08/2003 04:44, Jon Hanna wrote: No, the safe thing to do (and the thing that is done) is to treat the space as a space ignoring the fact that the NMTOKEN contains a combining character, this is even safer than your suggestion since it can't mis-identify the combining properties of a character. OK, it's safe, but it is a misuse of Unicode. As space plus combining character is a unit in Unicode, it should be treated as a unit by higher level protocols. If higher level protocols are allowed to do arbitrary things within Unicode units, there is no end to the possible confusion. See for example, from Unicode 4.0 chapter 3: C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation. OK, but XML inherits its behavior from SGML and you won't change it. The only way to bypass this would be to use entity references to encode the base space needed by the Unicode convention, so this is related to what Unicode defines as a higher level protocol, needed here to bypass the limitations of basic text. However it still creates a problem within CDATA sections, which are not supposed to contain entity references. One then needs to replace the XML escaping mechanism with another escaping system specific to CDATA sections (which are formally anonymous text elements and equivalent to them).
Re: Questions on ZWNBS - for line initial holam plus alef
On 11/08/2003 06:59, Jon Hanna wrote: There are only two theoretical problems that I can see here, the first is that a whitespace character other than space gets converted to space by attribute value normalisation, and that this changes the meaning of the text in some way. This could only occur if the combining character were the first character in a line of text, which is quite a nonsensical construct to begin with. Not at all! Imagine a tutorial on a language, which might well list the accents used, in a format like this: ` (grave accent) is used with a, e and o, and indicates more open pronunciation ^ (circumflex accent) is used with any vowel, and indicates lengthening So far so good, but when I get to an accent with no predefined spacing variant, I have a problem! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Questions on ZWNBS - for line initial holam plus alef
On 13/08/2003 04:44, Jon Hanna wrote: No, the safe thing to do (and the thing that is done) is to treat the space as a space ignoring the fact that the NMTOKEN contains a combining character, this is even safer than your suggestion since it can't mis-identify the combining properties of a character. OK, it's safe, but it is a misuse of Unicode. As space plus combining character is a unit in Unicode, it should be treated as a unit by higher level protocols. If higher level protocols are allowed to do arbitrary things within Unicode units, there is no end to the possible confusion. See for example, from Unicode 4.0 chapter 3: C7 A process shall interpret a coded character representation according to the character semantics established by this standard, if that process does interpret that coded character representation. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Philippe Verdy posted: Could ZWS+combining diacritic be the best solution for isolated diacritics in text? From http://www.unicode.org/book/ch04.pdf: * Such characters may be large enough to affect the placement of their base character relative to preceding and succeeding base characters. For example, a circumflex applied to an i may affect spacing (î), as might the character U+20DD COMBINING ENCLOSED CIRCLE. Unless Unicode 4.0 has changed this, the words may and might here would indicate that ZWSP is not *necessarily* the best solution. There is no specification about what an application *must* do to be conforming in this circumstance, merely an indication that an application that does expand spacing for the sake of appearance is not non-conforming. It is *probably* implied that this is the right way to go. But I would guess that it would also be conforming for an application to not expand spacing at all on ZWSP so that coding of _o_ + ZWSP + COMBINING CIRCUMFLEX + _o_ would place the circumflex centered over _oo_ with its center point between the two letters. Either result would be useful for different purposes. It certainly makes sense that in the case of space characters that have a defined width that this width is innate to the definition of the character and in such a case should take precedence over the width of the normally non-spacing combining character. I would welcome clear instructions by Unicode on this point where either result would be useful, in order that applications may be expected to produce results that are consistent with each other. :-) I would think it would be consistent with Unicode for an application to shrink the width of a normal space followed by a diacritic such as a single overdot, as exact formatting behavior is not defined in such cases. Jim Allan
Re: Questions on ZWNBS - for line initial holam plus alef
- Original Message - From: Doug Ewell [EMAIL PROTECTED] To: Unicode Mailing List [EMAIL PROTECTED] Cc: Peter Kirk [EMAIL PROTECTED]; Kenneth Whistler [EMAIL PROTECTED] Sent: Monday, August 11, 2003 5:39 PM Subject: Re: Questions on ZWNBS - for line initial holam plus alef Peter Kirk peter dot r dot kirk at ntlworld dot com wrote: Thank you, Ken. Well, you make it sound as if the problems are minimal, and that version I can just about accept. But if Philippe is correct about what he says about UAX#29 and UAX#14, there are some more serious problems. It is certainly highly inappropriate for non-spacing diacritics to be considered word boundaries. Non-spacing diacritics had better not be word boundaries, otherwise a string like Québec (spelled with U+0301, as here) would be considered two words. I don't have time right now to look up the relevant properties and UAX's, but I sincerely hope this is just another Philippe mistake and not a general misinterpretation that anyone might make. Not a mistake from me, sorry. From you yes: Peter Kirk probably wanted to speak about *spacing* diacritics (when coded with SPACE+NSM). There is no such *spacing* character in Québec. Don't accuse me of something I did not say. And please be more tolerant of what is an obvious typo in the message from Peter Kirk. Instead of just flaming, could you read the message more carefully, accept errors and correct them, instead of sending such unconstructive replies? Thanks.
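The "two words" failure Doug describes is easy to reproduce with a tokenizer that is unaware of combining marks. The sketch below assumes Python's standard re module, whose \w class does not match non-spacing marks such as U+0301; it is meant only as an illustration of a naive word splitter, not of UAX #29 default word boundaries.

```python
import re
import unicodedata

decomposed = "Que\u0301bec"          # e followed by COMBINING ACUTE ACCENT

# A naive "run of word characters" tokenizer splits at the combining mark.
naive_words = re.findall(r"\w+", decomposed)

# After NFC normalization the accent is part of U+00E9 and the word survives.
composed = unicodedata.normalize("NFC", decomposed)
composed_words = re.findall(r"\w+", composed)
```

Normalizing first only hides the problem for marks that have precomposed forms; a boundary algorithm that keeps base plus mark together is still needed for sequences that do not compose.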
Re: Questions on ZWNBS - for line initial holam plus alef
On Monday, August 11, 2003 12:27 AM, Kenneth Whistler [EMAIL PROTECTED] wrote: A point I keep trying to make, but which often gets overlooked by people trying to code Unicode mechanisms for dealing with edge cases, is that the design goal of the Unicode Standard is, and always has been, to represent *plain text content*. It cannot, and should not, IMO, deal with requirements for representing arbitrarily fine distinctions of typographical detail in all manuscripts and other documents in all writing systems of the world. Spacing diacritics are not on the edge of the standard, when they are already given a full block and handled there as symbols (not as letters as suggested in some parts of UAX's), with their own identity independent of their actual glyphic representation. I am not discussing the typesetting of these grapheme clusters but really the textual semantics of such combining sequences with an invisible base character, affecting all their properties and not fully described in the various standard annexes. Due to the huge legacy use of SPACE+diacritics in legacy text, and the already normative parts of some standard annexes, it will be hard to correct the behavior or change the text of these annexes. And that is where a new, better base character than SPACE could help solve the ambiguities cleanly. -- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: Questions on ZWNBS - for line initial holam plus alef
Peter Kirk scripsit: On 13/08/2003 11:09, Philippe Verdy wrote: ... For this reason, defective combining sequences (combining characters without a leading base character) should be forbidden (invalid for XML). If there is even the remotest possibility of this happening, we need to know quickly! As a member of the XML Core Working Group of the W3C, I can assure you that there is not even the remotest possibility of it. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan Is it not written, That which is written, is written?
RE: AL32UTF8 Vs UTF8
Jay, Oracle's UTF8 character set is not really valid UTF-8. It encodes surrogates as if they were characters. They kept the old Unicode 2.x code that only supports the BMP to provide sort key compatibility for clients who never upgraded to Unicode 3.0 support and are using 16 bit character encoding improperly. UTF8 sorts in the same way as the old 16 bit Unicode before surrogates. Do not use UTF8 because it is really not Unicode conformant with any Unicode standard. Instead use AL32UTF8. Carl -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jay Chandru Sent: Sunday, August 10, 2003 8:58 AM To: [EMAIL PROTECTED] Subject: AL32UTF8 Vs UTF8 Greetings, We are using Oracle9i with application tier as 11i. I wanted to know the differences between AL32UTF8 and UTF8. My database (oracle) will be in AL32UTF8 format. Will the applications that require multibyte characters work as they do in UTF8 format? It would be great if anybody could give me a comparison of AL32UTF8 and UTF8. Also please list the requirements for any 3rd party software for code page conversions in the case of AL32UTF8. Thanks in advance, -Jay
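The difference Carl describes shows up only for characters outside the BMP. The sketch below contrasts standard UTF-8 with an encoding that treats each UTF-16 surrogate as if it were a character, which is how the message characterizes Oracle's UTF8. The helper name surrogates_as_characters is hypothetical, and the code uses no Oracle API; it is an assumption that this reproduces the byte layout in question.

```python
import struct

def surrogates_as_characters(text):
    """Encode each UTF-16 code unit separately, surrogates included."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # "surrogatepass" lets Python emit the 3-byte form for lone surrogates.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

supplementary = "\U00010000"                   # first character beyond the BMP
standard = supplementary.encode("utf-8")       # one 4-byte sequence
legacy = surrogates_as_characters(supplementary)   # two 3-byte sequences

bmp_text = "abc\u05D0"
bmp_same = surrogates_as_characters(bmp_text) == bmp_text.encode("utf-8")
```

For text confined to the BMP the two forms are byte-identical, which is why the difference often goes unnoticed until supplementary characters appear.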
Re: Questions on ZWNBS - for line initial holam plus alef
On 12/08/2003 20:28, John Cowan wrote: Peter Kirk scripsit: 2) In attribute values, LF, CR, and TAB characters are normalized to spaces. Not relevant here. This would be relevant if it is legal for the character after LF, CR, and TAB to be a combining mark. Is this legal? In this case what was previously a defective (but legal) combining sequence would turn into a non-defective one, but the intended whitespace would be lost. The point is that there is no such thing as an *intended* line break in an attribute value; it will *always* be translated to a space before the application sees it. (More exactly, line-break characters can be inserted into attribute values, but only with the use of a numeric character reference such as &#xA;.) Sorry, I'm confused. Are you saying that the input processing will translate line breaks into spaces within attribute values, unless inserted as &#xA; ? Well, I suppose this is fair enough as it is up to the user not to enter garbage. Not just a rendering glitch, I suspect. If the combining character is combined with the separating space, the space loses many of its separating functions, and perhaps keeps a confusing subset of them with all sorts of possibilities of error. The space(s) will be used to separate individual tokens at processing time. No spacing diacritic (either single-character or space+combining) is permitted in a NMTOKEN. OK if this is clearly illegal, but this might restrict use of some languages in NMTOKEN. Would NBSP + combining be allowed? At best tokens beginning with combining characters will be unusable. At worst they will crash the implementation (and count on someone trying deliberately to do that!). In effect, the combining character will constitute a defective combining sequence at the beginning of the individual token. Stepping away from the letter of the standard for a moment, there is no real reason to begin a NMTOKEN with a combining character. 
It is only allowed as a result of the miscegenation of SGML concepts with Unicode ones. In SGML's original design of tokens, they consisted of letters and digits (and a few punctuation marks, which functioned as letters). There were four kinds: a NUMBER could contain only digits, a NAME could not begin with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no restrictions. ID and IDREF had the same syntax as NAME with additional semantics. Later, the categories letter and digit were generalized, by redefining the concrete syntax, to be whatever you wanted, and were renamed name-start and name characters (technically, a name character was a letter *or* a digit). When SGML was simplified to produce XML, only NMTOKEN, the most general type of token, was kept. However, in order to keep the semantics of letter and digit in the Unicode world, letter was extended to be any letter and digit to be any digit *or* combining character. That worked well for ID and IDREF, since treating combining characters as part of digit prevented them from appearing first, as was only sensible. Unfortunately, NMTOKENs, since there were no restrictions, became able to begin with a combining character, though that made no real sense. To write in a restriction would make it impossible to specify XML's concrete syntax in SGML terms, which did not allow for three different classes of characters within tokens. So we wound up with a basically useless capability that if used will only cause trouble. There is some potential for real trouble here, if one process outputs an NMTOKEN starting with a combining character preceded by a separating space, or something else which is changed into a space, and another process takes the new space plus combining character as a unit and so doesn't recognise the separation. 
Any hackers and virus programmers reading this will soon start flooding the Internet with tokens beginning with combining characters in the hope of crashing implementations or finding back doors. Of course this wouldn't have been a problem if Unicode had never defined space plus combining character as legal and meaningful. But this is not my problem! -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
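The attribute-value normalization and name rules discussed in this message can be observed with an ordinary XML parser. The sketch below assumes Python's standard xml.etree.ElementTree (backed by expat) and no DTD, so every attribute is treated as CDATA; the helper is_well_formed is a hypothetical name introduced for the example.

```python
import xml.etree.ElementTree as ET

# A literal line break or tab in an attribute value reaches the application
# as a space; a numeric character reference keeps the line break.
literal = ET.fromstring('<e a="x\ny"/>').get("a")
referenced = ET.fromstring('<e a="x&#xA;y"/>').get("a")
# TAB followed by HEBREW POINT HOLAM: the mark now follows a SPACE.
tabbed = ET.fromstring('<e a="x\t\u05B9y"/>').get("a")

def is_well_formed(document):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# A combining character may continue a name but may not start one.
mark_inside_name = is_well_formed("<a\u0301/>")
mark_starts_name = is_well_formed("<\u0301a/>")
```

The third case is the one raised above: what was TAB plus a defective combining sequence in the source becomes SPACE plus combining mark after normalization, and a consumer that treats that pair as a unit sees something the author never wrote.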
Pre-orders of The Unicode Standard, Version 4.0
Dear Unicode and Unicore List Subscribers, The release of the Unicode Standard, Version 4.0 is right around the corner. There is still time to place your individual or group orders and to get the book sent to you directly from the publisher, fresh off the press. Anyone placing bulk orders is highly encouraged to do so by August 20 as this will substantially speed up the delivery time. Full members of the Consortium receive a 20% discount; Associate and Specialist members receive 10% off the list price of $74.99. To order, please use the book order form at http://www.unicode.org/book/bookform.html Regards, Magda Danish Administrative Director The Unicode Consortium 650-693-3921