Re: Arabic ligatures
Shawn Landden <shawnlandden at tuta dot io> wrote:

> Arabic ligatures have been deprecated[1], despite a need for both ligature and non-ligature versions of the same glyphs.

The only Arabic character that is deprecated in the standard is U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW. The Wikipedia article cited as [1] does not claim otherwise.

> Amiri uses contextual alternatives for الله. These ligatures are used in religious documents[2] via pictures, which seems to be what the current Unicode standard recommends.

What is your source for this?

> Unlike the presentation forms, there is a case for these phrases and formulas to be available both in ligature and non-ligature form.

All Arabic letters and combinations can be rendered in ligated or non-ligated forms as needed using some combination of ZWJ and ZWNJ. See TUS 8.0, Section 9.2.

> These ligatures should be non-deprecated and subject to canonical decomposition, rather than compatibility decomposition.

Section 9.2 (page 386 ff.) explains the Arabic Presentation Forms-A block (U+FB50..U+FDFF) in greater detail.

--
Doug Ewell | http://ewellic.org | Thornton, CO
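The canonical-versus-compatibility distinction raised above can be inspected directly. What follows is only a minimal Python sketch, assuming the standard `unicodedata` module and taking U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM as the example; the helper name `describe` is made up for illustration.

```python
import unicodedata

ALLAH_LIGATURE = "\uFDF2"  # ARABIC LIGATURE ALLAH ISOLATED FORM

def describe(ch):
    """Return (decomposition mapping, NFD form, NFKD form) for one character.

    Hypothetical helper for illustration, not part of any library.
    """
    return (
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFD", ch),
        unicodedata.normalize("NFKD", ch),
    )

mapping, nfd, nfkd = describe(ALLAH_LIGATURE)

# A mapping that starts with a <tag> is a compatibility decomposition;
# canonical mappings carry no tag.
print(mapping)
# NFD applies canonical mappings only, so the ligature survives it.
print(nfd == ALLAH_LIGATURE)
# NFKD also applies compatibility mappings and yields the plain letters.
print([f"U+{ord(c):04X}" for c in nfkd])
```

With the mapping as it stands, only the K forms (NFKC/NFKD) replace the presentation form by its letters; the proposal quoted above would make the plain NFC/NFD forms do so as well.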
Re: interaction of Arabic ligatures with vowel marks
Please see this page (for IE, use v 2010 and up): http://lovatasinhala.com/

The font is almost all ligatures. If you copy and inspect the text, you'll notice that it is simple romanized Singhala.

I am currently in Sri Lanka demonstrating this. The people at the president's office and one of the powerful ministers have seen it. They are elated that, after all, Singhala, the most complex of the abugidas, is much like a Western European language and amazingly computer and user friendly. This is contrary to how it was portrayed to them by local academics and technocrats, causing the poor country unnecessary debt.

The ideas of "abugida" and "complex" fade away if a font is made with a full understanding of Unicode's description of ligatures and how they are implemented by OpenType (now OpenFont). I believe that Arabic and Hebrew can follow this model so that typing the script is simplified for users without compromising orthography.

On Wed, Jun 12, 2013 at 8:39 AM, Stephan Stiller <stephan.stil...@gmail.com> wrote:

> Hi,
>
> How is the placement of vowel marks around ligatures handled in Arabic text? Does anyone have good pointers on this topic?
>
> My guess is that this does not come up often (just like the topic of pointing for handwritten Hebrew), as vowel marks are mostly not added in ordinary text. Nonetheless, any text making heavy use of ligatures will from time to time need to add vowel marks for a foreign name or as a reading aid, and (as many of us know) the Quran is traditionally printed with vowel marks.
>
> I'm also wondering how font designers normally handle this. I think there are analogous questions for various ligature-heavy abugidas, so there must be an existing body of knowledge. There should be better answers than "squeeze the vowels around the consonant clusters in whatever way seems most intuitive". Do traditional printing presses use extra metal types for such glyph clusters, or do they manually add and adjust the positioning of vowels?
>
> Stephan
Re: interaction of Arabic ligatures with vowel marks
Andreas,

Have you tried Mihail Bayaryn's Siddhanta font (or his earlier Chandas and Uttara fonts)?

http://svayambhava.org/index.php/en/fonts

This font supports many more vertical ligatures for Sanskrit than most other Devanagari fonts.

- Chris

On 13/06/2013, Andreas Prilop <apri...@freenet.de> wrote:

> On Wed, 12 Jun 2013, Richard Wordingham wrote:
>
>> While the same principle applies to Indic scripts (and indeed, to the Roman alphabet), there is only one Indic mark I can think of for which the issue of component association arises, and that is the nukta.
>
> Sanskrit requires candrabindu U+0901 inside (or on top of) two La U+0932. See http://www.unicode.org/mail-arch/unicode-ml/y2011-m06/0138.html
>
> Instead of http://www.unicode.org/mail-arch/unicode-ml/y2011-m06/att-0135/image001.png I would like to see the two La on top of each other.
Re: interaction of Arabic ligatures with vowel marks
On Tue, Jun 11, 2013 at 08:09:31PM -0700, Stephan Stiller wrote:

> Hi,
>
> How is the placement of vowel marks around ligatures handled in Arabic text?

OpenType has special support for placing nonspacing marks over ligatures (a subset of the general support for controlling the placement of nonspacing marks); it is handled entirely at the text rendering level, and the input is the same whether or not the bases will be ligated. No idea about other font technologies.

Regards,
Khaled
Re: interaction of Arabic ligatures with vowel marks
On Tue, 11 Jun 2013 20:09:31 -0700 Stephan Stiller <stephan.stil...@gmail.com> wrote:

> Hi,
>
> How is the placement of vowel marks around ligatures handled in Arabic text?

For OpenType the clue lies in the three types of GPOS (http://www.microsoft.com/typography/otspec/gpos.htm) lookup for marks: mark to base, mark to mark, and mark to ligature. As base characters get ligated, the shaper keeps track of which marks were associated with which component of the ligature, and separate vowel positions are recorded in the font for each component. There is more complicated logic to prevent various undesirable behaviour, such as marks belonging to different components interacting via mark-to-mark positioning lookups or ligature lookups. The idea is to relieve the font designer of the need to think about such issues.

I haven't found any public Microsoft documentation on these lookups, and for open source I can only suggest studying the source code and its comments. The HarfBuzz files hb-ot-layout-gdef-table.hh, hb-ot-layout-gpos-table.hh and hb-ot-layout-gsubgpos-private.hh are particularly relevant.

Obviously this will not work if the character sequence is defined in terms of presentation forms that are already ligatures.

> I'm also wondering how font designers normally handle this. I think there are analogous questions for various ligature-heavy abugidas, so there must be an existing body of knowledge.

While the same principle applies to Indic scripts (and indeed, to the Roman alphabet), there is only one Indic mark I can think of for which the issue of component association arises, and that is the nukta. That could be handled by the ligation process instead, so I would not rely on there being a large body of Indic-specific knowledge on the issues. OpenType has special handling for consonant clusters with visible internal halant.

Richard.
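To make the component bookkeeping described above concrete, here is a toy Python model. It is emphatically not HarfBuzz's algorithm or data layout: the `Glyph` class, the `ligate` function, the glyph names and the anchor coordinates are all invented for illustration, and only a single ligature rule is handled.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    name: str
    is_mark: bool = False
    component: int = 0  # for marks: index of the ligature component they belong to

def ligate(glyphs, components, lig_name):
    """Replace one base sequence by a ligature, skipping over marks.

    Each skipped mark is kept after the ligature and tagged with the
    index of the component it followed. Toy model only.
    """
    out, i = [], 0
    while i < len(glyphs):
        j, k, marks = i, 0, []
        while j < len(glyphs) and k < len(components):
            g = glyphs[j]
            if g.is_mark:
                if k == 0:
                    break  # a mark cannot start a ligature
                marks.append(Glyph(g.name, True, k - 1))
            elif g.name == components[k]:
                k += 1
            else:
                break
            j += 1
        if k == len(components):
            # Marks directly after the last base belong to the last component.
            while j < len(glyphs) and glyphs[j].is_mark:
                marks.append(Glyph(glyphs[j].name, True, k - 1))
                j += 1
            out.append(Glyph(lig_name))
            out.extend(marks)
            i = j
        else:
            out.append(glyphs[i])
            i += 1
    return out

# Hypothetical per-component anchors, as a font's mark-to-ligature data might hold.
LIG_ANCHORS = {"lam_alef": [(120, 700), (420, 760)]}

run = [Glyph("lam"), Glyph("fatha", True), Glyph("alef"), Glyph("sukun", True)]
shaped = ligate(run, ("lam", "alef"), "lam_alef")
for g in shaped:
    if g.is_mark:
        print(g.name, "-> component", g.component, "anchor", LIG_ANCHORS["lam_alef"][g.component])
```

The point of the sketch is only that the mark keeps a component index through ligation, so positioning can pick a different anchor per component.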
Re: interaction of Arabic ligatures with vowel marks
Thank you, خالد and Richard.

> there is only one Indic mark I can think of for which the issue of component association arises, and that is the nukta

That is good to know, given the complexity of the Indic scripts.

Other thoughts:

* One could simply break up Arabic ligatures in need of harakat. If someone knows whether or to what extent this is done in otherwise ligated text, I will be curious to know.
* Just now it is occurring to me that {the fact that the shadda is often used in ordinary writing} should make it easier to find data on all this, unless gemination blocks ligation in certain ways.
* If there are conventions on the relative placement of harakat in general (I mean: not necessarily print), I will be curious to know. Some letter/consonant clusters have quite a vertical appearance, and any type foundry will need to be familiar with common practice (to the extent there is any), no matter what medium or technology is used in the end to create a typeface.

Stephan
Re: interaction of Arabic ligatures with vowel marks
On Tue, 11 Jun 2013, Stephan Stiller wrote: How is the placement of vowel marks around ligatures handled in Arabic text? I'm also wondering how font designers normally handle this. Older fonts in older operating systems (like Windows XP) often failed. See http://www.unicode.org/mail-arch/unicode-ml/y2012-m03/0101.html http://www.unicode.org/mail-arch/unicode-ml/y2008-m05/thread.html#139
Re: interaction of Arabic ligatures with vowel marks
On Wed, 12 Jun 2013, Richard Wordingham wrote: While the same principle applies to Indic scripts (and indeed, to the Roman alphabet), there is only one Indic mark I can think of for which the issue of component association arises, and that is the nukta. Sanskrit requires candrabindu U+0901 inside (or on top of) two La U+0932. See http://www.unicode.org/mail-arch/unicode-ml/y2011-m06/0138.html Instead of http://www.unicode.org/mail-arch/unicode-ml/y2011-m06/att-0135/image001.png I would like to see the two La on top of each other.
interaction of Arabic ligatures with vowel marks
Hi, How is the placement of vowel marks around ligatures handled in Arabic text? Does anyone have good pointers on this topic? My guess is that this does not come up often (just like the topic of pointing for handwritten Hebrew), as vowel marks are mostly not added in ordinary text. Nonetheless, any text making heavy use of ligatures will from time to time need to add vowel marks for a foreign name or as a reading aid, and (as many of us know) the Quran is traditionally printed with vowel marks. I'm also wondering how font designers normally handle this. I think there are analogous questions for various ligature-heavy abugidas, so there must be an existing body of knowledge. There should be better answers than squeeze the vowels around the consonant clusters in whatever way seems most intuitive. Do traditional printing presses use extra metal types for such glyph clusters, or do they manually add and adjust the positioning of vowels? Stephan
Ligatures
Can you please give me a list of all the ligatures available? Thanks! - Michael Norton (a.k.a. Flarn) E-mail address: [EMAIL PROTECTED]
RE: Ligatures
I suppose one could construct such a list, but using them to encode text is a Very Bad Idea. It is better, for example, to encode the fi ligature as the letter f followed by the letter i and let rendering software, fonts, and so forth provide the ligature.

Encoding ligatures directly will make your life harder. For example, most spell checkers will fail the word "final" when it is spelled U+FB01 U+006E U+0061 U+006C (that is, fi-ligature followed by "nal").

If you are constructing a font, there are lots of good links on the Unicode website which include information on how to handle ligation without having a code point for every combination of characters you ligate.

I haven't time to write a good quality response right now, but no doubt someone will jump in with 37 pages of text about the small amount I've already written (please excuse my sarcasm, which isn't directed at you).

PS: "Flarn" isn't the reference I think it is, is it?

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com
Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. It is not a feature.

-----Original Message-----
From: [EMAIL PROTECTED] On Behalf Of Flarn
Sent: 20041127 15:46
To: [EMAIL PROTECTED]
Subject: Ligatures

Can you please give me a list of all the ligatures available? Thanks!

- Michael Norton (a.k.a. Flarn)
E-mail address: [EMAIL PROTECTED]
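The spell-checker failure mentioned above is easy to reproduce. A small Python sketch, assuming a trivial word-list lookup stands in for a real spell checker (`WORDS`, `is_known` and `fold_ligatures` are names made up for this example):

```python
import unicodedata

WORDS = {"final", "finale", "find"}  # stand-in for a spell checker's dictionary

def is_known(word):
    """Naive lookup: exact code point match against the word list."""
    return word in WORDS

def fold_ligatures(text):
    """Map compatibility characters such as U+FB01 to their plain letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)

encoded_with_ligature = "\uFB01nal"  # U+FB01 U+006E U+0061 U+006C

print(is_known(encoded_with_ligature))                  # the ligature hides the word
print(is_known(fold_ligatures(encoded_with_ligature)))  # plain f + i is found
```

Folding with NFKC rescues text that already contains U+FB01, but it changes other compatibility characters too, so it is a repair step rather than a reason to encode the ligature in the first place.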
Re: Ligatures
Hopefully not adding 37 pages...

Michael Norton (a.k.a. Flarn) <flarn2003 at megapipe dot net> wrote:

> Can you please give me a list of all the ligatures available? Thanks!

If by "available" you mean "separately encoded in precomposed form," you could start by checking the online, definitive Unicode data file:

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Upon searching this file, you would find 507 characters with the word LIGATURE in their name.

However, I'm guessing that what you are after is Latin-script ligatures, so it probably won't help much that 477 of the 507 "ligatures" are Arabic presentation forms. Of the remaining 30, six are Armenian, six are Cyrillic, five are Hebrew, and two are actually not ligatures at all, but paired combining marks intended to show that the two letters under them form a single sound.

That leaves 11 Latin ligatures encoded in Unicode. The two IJ characters, U+0132 (Ĳ) and U+0133 (ĳ), aren't really ligatures, so they don't count. If we count the OE characters, U+0152 (Œ) and U+0153 (œ), as ligatures, then we also have to count the AE characters, U+00C6 (Æ) and U+00E6 (æ). That leaves U+FB00 through U+FB06 (ﬀ ﬁ ﬂ ﬃ ﬄ ﬅ ﬆ).

The problem, as Addison pointed out, is that if you use these forms in text, most searching and sorting operations will fail to recognize them. It is better to use the regular letters and let higher-end software ligate them as appropriate.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
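The search described above can be repeated without downloading UnicodeData.txt. This is a rough Python sketch that uses the character names bundled with the interpreter's `unicodedata` module; the totals depend on the Unicode version that module ships, so they should not be expected to match the 2004 figures quoted in the message.

```python
import sys
import unicodedata
from collections import Counter

def ligature_characters():
    """Yield (code point, name) for every character whose name contains LIGATURE."""
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed/unassigned code points
        if "LIGATURE" in name:
            yield cp, name

found = dict(ligature_characters())

# Group by the first word of the name, which is usually the script.
by_first_word = Counter(name.split()[0] for name in found.values())

print(len(found), "characters with LIGATURE in their name")
print("Unicode data version:", unicodedata.unidata_version)
for word, count in by_first_word.most_common():
    print(f"{word:12} {count}")
```

Note that U+00C6 and U+00E6 do not show up, because their names say LETTER AE rather than LIGATURE, which is consistent with the count of 11 Latin entries given above.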
Re: Ligatures
At 07:44 PM 11/27/2004, Doug Ewell wrote:

> The problem, as Addison pointed out, is that if you use these forms in text, most searching and sorting operations will fail to recognize them.

That's not the only problem. In some languages other ligatures, such as fj, might be as commonly needed as fi. The set is (intentionally) not complete and you should not build your text or technology around them.

> It is better to use the regular letters and let higher-end software ligate them as appropriate.

Note that for many languages you need to use ZWNJ to prohibit ligatures where disallowed by the orthography. Without that information even fairly high-end software cannot correctly ligate these languages. There are some (in)famous word pairs that are spelled identically, except for differences in where the ligatures can go. No software can figure this out; that information must come from the author.

Getting sorting and searching operations to consistently ignore the ZWNJ is something that has a higher chance of success, compared to making such software handle long lists of ligatures.

A./
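Ignoring ZWNJ during search, as suggested above, takes very little code. A minimal Python sketch follows; the function names are invented here, and the German word "Auflage" with a ZWNJ between f and l is only an assumed example of an author marking a place where a ligature is unwanted.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def search_key(text):
    """Drop ZWNJ so that ligature-control marks do not affect matching."""
    return text.replace(ZWNJ, "")

def contains(haystack, needle):
    """Substring search that ignores ZWNJ on both sides."""
    return search_key(needle) in search_key(haystack)

# The author marks that f and l must not ligate across the word boundary.
authored = "die zweite Auf" + ZWNJ + "lage"

print("Auflage" in authored)          # naive search misses the word
print(contains(authored, "Auflage"))  # ZWNJ-insensitive search finds it
```

A real implementation would more likely rely on collation rules that treat such format characters as ignorable, but the effect on matching is the same as in this sketch.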
the length in semantic meaning for ligatures
What is the string length, in semantic terms, of a ligature? For example, what should a length(str) function return when applied to one? Do all ligatures follow the same rule, or does it differ by script (Arabic, Latin, Devanagari, Syriac, etc.)? And what if the ligature has its own code point, as with the Latin ligatures U+FB00 to U+FB06?

Thanks,
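One way to look at the question is that "length" depends on what is being counted, and that a ligature formed by the font never appears in the string at all. A small Python sketch, offered as an illustration rather than an answer from the list; the helper `lengths` is a made-up name:

```python
import unicodedata

def lengths(text):
    """Return several plausible 'lengths' of the same string."""
    return {
        "code points": len(text),
        "UTF-8 bytes": len(text.encode("utf-8")),
        "UTF-16 code units": len(text.encode("utf-16-le")) // 2,
        "code points after NFKD": len(unicodedata.normalize("NFKD", text)),
    }

precomposed_fi = "\uFB01"   # LATIN SMALL LIGATURE FI, one encoded character
plain_fi = "fi"             # two letters; a font may still draw one ligature glyph
lam_alef = "\u0644\u0627"   # Arabic lam + alef, normally rendered as a ligature

for label, text in [("U+FB01", precomposed_fi), ("f + i", plain_fi), ("lam + alef", lam_alef)]:
    print(label, lengths(text))
```

In this view a string function only sees encoded characters: the precomposed U+FB01 counts as one code point, while letters that are merely drawn as a ligature count individually, whatever the script.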
Re: Ligatures with diacritics (was: Ancient Northwest Semitic Script)
At 01:13 PM 12/30/2003, Peter Kirk wrote:

> But if it were, this ligature would be very interesting and problematic because it is a ligature between a base character and a diacritic. This is not a problem if it is always used, in a particular font, but it is problematic if the ligature is optional. This is because ZWNJ and ZWJ cannot be used between base characters and diacritics because they break the combining sequence. We came across this problem before with Hebrew script, but in a rather different (and less ambiguous) context, that of the need for a ligature between meteg and hataf vowels.

We should probably be careful to distinguish between ligation explicitly requested in text using ZWJ, which is very much a minority case, and ligation that occurs as either default rendering or as the result of a higher-level font feature request. There are lots of ligatures of bases and marks in lots of fonts: ligation is one possible method of rendering any sequence of base plus mark(s), and in some cases is preferable to dynamic mark positioning.

> OpenType etc fonts are currently able to make these distinctions consistently, with the mechanisms John described above; but these mechanisms fail if there is a need for the ligature to be optional, as ZWNJ and ZWJ cannot be used.

Again, there is the question of whether an optional ligation needs to be requested or inhibited in plain text, using these control characters, or can be handled at a higher level using markup. In OT rendering, only lookups in the Required Ligatures "rlig" feature cannot be turned off, so one would put optional ligatures in the Standard Ligatures "liga" feature if you wanted them on by default, or in the Discretionary Ligatures "dlig" feature if you wanted them off by default.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC [EMAIL PROTECTED]

What was venerated as style was nothing more than an imperfection or flaw that revealed the guilty hand. - Orhan Pamuk, _My name is red_
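The on/off behaviour of the three ligature features described above can be written down as a small table. This Python sketch only models the defaults as stated in the message; it does not call any real shaping library, and `LIGATURE_FEATURES` and `active_ligature_features` are hypothetical names.

```python
# Feature tag -> (on by default, whether a user or markup request may change it),
# following the description in the message above.
LIGATURE_FEATURES = {
    "rlig": (True, False),   # Required Ligatures: always applied
    "liga": (True, True),    # Standard Ligatures: on unless switched off
    "dlig": (False, True),   # Discretionary Ligatures: off unless switched on
}

def active_ligature_features(requests):
    """Return the set of feature tags in effect for the given on/off requests.

    `requests` maps a feature tag to True (enable) or False (disable);
    requests for features that cannot be changed are ignored.
    """
    active = set()
    for tag, (default_on, changeable) in LIGATURE_FEATURES.items():
        on = default_on
        if changeable and tag in requests:
            on = requests[tag]
        if on:
            active.add(tag)
    return active

print(sorted(active_ligature_features({})))
print(sorted(active_ligature_features({"rlig": False, "liga": False, "dlig": True})))
```

A font designer choosing where to put an optional base-plus-mark ligature is in effect choosing one row of this table.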
Re: Ligatures with diacritics
On 30/12/2003 15:44, Chris Jacobs wrote:

>> I wonder if there are other, better defined, cases of ligatures between base characters and diacritics in other scripts, i.e. cases where there is an optional alternative to base character plus diacritic which does not look like the base character plus the diacritic.
>
> Devanagari?
>
> Syllable + virama + ZWJ -> consonant.
>
> Note that the ZWJ is _after_ the virama.

Interesting. Is this actually valid at the end of a string? Would syllable, virama, ZWJ as an isolated string be rendered differently from syllable, virama?

But it strikes me that this arrangement, however sensible within its own writing system, is a distortion of the regular rules for ZWJ.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Ligatures with diacritics
See http://www.unicode.org/versions/Unicode4.0.0/ chapter 9

> Interesting. Is this actually valid at the end of a string?

Yes. Figure 9-6 is an example.

> Would syllable, virama, ZWJ as an isolated string be rendered differently from syllable, virama?

I don't know. Syllable, virama, ZWJ is rendered differently from syllable, virama, ZWNJ, but I don't know which of the two is the default. If it is not at the end of a string then the default is to try to include yet some more in the ligature; ZWJ or ZWNJ prevents this.

> But it strikes me that this arrangement, however sensible within its own writing system, is a distortion of the regular rules for ZWJ.
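For readers who want to try the sequences under discussion, here is a short Python sketch that builds them with KA as the example consonant. The labels reflect my reading of chapter 9 of the standard (ZWJ after virama asks for a half form, ZWNJ asks for a visible virama) and the actual rendering depends on the font and shaping engine, which this code does not exercise.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

sequences = {
    "default (renderer decides)": KA + VIRAMA,
    "half form requested": KA + VIRAMA + ZWJ,
    "visible virama requested": KA + VIRAMA + ZWNJ,
}

for label, text in sequences.items():
    names = [unicodedata.name(ch) for ch in text]
    print(f"{label}: {text!r}")
    for name in names:
        print("   ", name)
```

In each case the joiner comes after the virama, which is the ordering pointed out earlier in the thread.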
Ligatures with diacritics (was: Ancient Northwest Semitic Script)
On 30/12/2003 11:44, John Hudson wrote:

> At 11:15 AM 12/30/2003, Peter Kirk wrote:
>
>>> Even if it were verified, it isn't a good case for encoding a separate character *equivalent* to a combination of two existing characters: that's a glyph variant ligature.
>>
>> Actually, I don't think so. The separate character was not formed by merging the dot into the letter, rather the distinction was made in a different way.
>
> In modern digital font development, ligation refers to the mechanism of display, not the visual appearance, which is largely irrelevant. A ligature is any glyph that represents two or more characters, typically arrived at by a ligation lookup. If I wanted a special sin glyph *equivalent* to the character sequence shin, sindot, I would ligate the two characters to that single glyph, either directly
>
>     shin sindot -> sin
>
> or via a two-stage stylistic variant lookup associated with a different typographic feature
>
>     shin sindot -> shin_sindot
>
> and then
>
>     shin_sindot -> sin

I understand this, and, as I answered separately, I don't think this is the appropriate mechanism in this case as the suggested ligature is not fully equivalent to the sequence.

But if it were, this ligature would be very interesting and problematic because it is a ligature between a base character and a diacritic. This is not a problem if it is always used, in a particular font, but it is problematic if the ligature is optional. This is because ZWNJ and ZWJ cannot be used between base characters and diacritics because they break the combining sequence. We came across this problem before with Hebrew script, but in a rather different (and less ambiguous) context, that of the need for a ligature between meteg and hataf vowels.

I wonder if there are other, better defined, cases of ligatures between base characters and diacritics in other scripts, i.e. cases where there is an optional alternative to base character plus diacritic which does not look like the base character plus the diacritic.

Candidates like ø as an alternative for ö are ruled out because they are already separately encoded. I have certainly seen glyphs rather like U+0255 used for c cedilla. In the light of recent discussions, I can easily imagine a script or style like Sütterlin having a special ligated form for u umlaut, but that this ligature must not be used, rather two dots should be written above the letter as in normal Latin script, in the name Saül, in which the dots represent a diaeresis rather than an umlaut.

OpenType etc fonts are currently able to make these distinctions consistently, with the mechanisms John described above; but these mechanisms fail if there is a need for the ligature to be optional, as ZWNJ and ZWJ cannot be used. Are there any real examples where this might be necessary?

As this is a more general issue, I am copying it back to the main Unicode list.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
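The direct and two-stage lookups quoted above can be mimicked with a toy substitution function. This Python sketch works on lists of glyph names and is not how any OpenType engine stores or applies GSUB lookups; `substitute` and the lookup dictionaries are invented for illustration.

```python
def substitute(glyphs, lookup):
    """Apply one toy ligature/substitution lookup left to right.

    `lookup` maps a tuple of input glyph names to one output glyph name.
    """
    out, i = [], 0
    while i < len(glyphs):
        for sequence, replacement in lookup.items():
            if tuple(glyphs[i:i + len(sequence)]) == sequence:
                out.append(replacement)
                i += len(sequence)
                break
        else:
            out.append(glyphs[i])  # no rule matched at this position
            i += 1
    return out

text = ["shin", "sindot", "alef"]

# Direct ligation: shin sindot -> sin
direct = substitute(text, {("shin", "sindot"): "sin"})

# Two-stage: ligate first, then swap in the stylistic variant.
stage1 = substitute(text, {("shin", "sindot"): "shin_sindot"})
stage2 = substitute(stage1, {("shin_sindot",): "sin"})

print(direct)
print(stage1, "->", stage2)
```

Both routes end with the same glyph run; the two-stage version simply lets the second step hang off a separate, switchable feature.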
Re: Ligatures with diacritics (was: Ancient Northwest Semitic Script)
> I wonder if there are other, better defined, cases of ligatures between base characters and diacritics in other scripts, i.e. cases where there is an optional alternative to base character plus diacritic which does not look like the base character plus the diacritic.

Devanagari?

Syllable + virama + ZWJ -> consonant.

Note that the ZWJ is _after_ the virama.
RE: Faulty ligatures in Adobe PhotoShop
Doug Ewell wrote:

...
>> My copy of Photoshop 7 has an interesting image in its (HTML format) help file, page 1_16_4_13.html on "Using ligatures and old style numerals". It shows three examples of «Type with Ligatures option unselected and selected»: "ct", "fi" and "fh". The bad part of it is that the ligated characters shown (in the second and third examples) seem to include a long "s" instead of an "f"... ty_06.gif attached for reference.
>
> There is no "fh" ligature in Unicode,

No, but it is perfectly permissible to ligate f and h anyway, just like you can (or should) ligate f and j, and g and j (if the glyphs would overlap).

> so Photoshop may have been trying to substitute the closest available ligature to the one you wanted (which is wrong, of course).
>
> Substituting an unligated "ſi" (U+017F + U+0069) for "fi" (U+0066 + U+0069) makes no sense at all. If the current font doesn't contain an "ﬁ" ligature (U+FB01), Photoshop should just leave the combination alone.

U+FB01 is a compatibility character that is best not used at all. Formation of an f and i ligature should not depend on whether the character U+FB01 is supported or not (though it is likely to be supported if f and i are ligated).

/kent k
Re: Faulty ligatures in Adobe PhotoShop
Doug Ewell wrote:

> António Martins-Tuválkin <antonio at tuvalkin dot web dot pt> wrote:
>
>> The bad part of it is that the ligated characters shown (in the second and third examples) seem to include a long "s" instead of an "f"... ty_06.gif attached for reference.

Thanks for the report, I'll forward it to the Photoshop guys. By the way, the font is apparently Adobe Caslon Pro.

> Substituting an unligated "ſi" (U+017F + U+0069) for "fi" (U+0066 + U+0069) makes no sense at all. If the current font doesn't contain an "ﬁ" ligature (U+FB01), Photoshop should just leave the combination alone.

More likely, the image was created in Illustrator or some such, and the glyph selected manually by the author. I did not check explicitly, but I am ready to bet a whole lot that the font does the correct thing.

Eric.
Faulty ligatures in Adobe PhotoShop
My copy of Photoshop 7 has an interesting image in its (HTML format) help file, page 1_16_4_13.html on "Using ligatures and old style numerals". It shows three examples of «Type with Ligatures option unselected and selected»: "ct", "fi" and "fh". The bad part of it is that the ligated characters shown (in the second and third examples) seem to include a long "s" instead of an "f"... ty_06.gif attached for reference.

I note that Adobe Photoshop has, OTOH, quite deep and (apparently) well designed support for some relatively complex font manipulations, such as East Asian width and composing oddities.

--
António MARTINS-Tuválkin
[EMAIL PROTECTED]
http://www.tuvalkin.web.pt/bandeira/
http://pagina.de/bandeiras/

attachment: ty_06.gif
Re: Faulty ligatures in Adobe PhotoShop
António Martins-Tuválkin <antonio at tuvalkin dot web dot pt> wrote:

> My copy of Photoshop 7 has an interesting image in its (HTML format) help file, page 1_16_4_13.html on "Using ligatures and old style numerals". It shows three examples of «Type with Ligatures option unselected and selected»: "ct", "fi" and "fh". The bad part of it is that the ligated characters shown (in the second and third examples) seem to include a long "s" instead of an "f"... ty_06.gif attached for reference.

There is no "fh" ligature in Unicode, so Photoshop may have been trying to substitute the closest available ligature to the one you wanted (which is wrong, of course).

Substituting an unligated "ſi" (U+017F + U+0069) for "fi" (U+0066 + U+0069) makes no sense at all. If the current font doesn't contain an "ﬁ" ligature (U+FB01), Photoshop should just leave the combination alone.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Re: Faulty ligatures in Adobe PhotoShop
At 02:59 AM 8/26/2003, António Martins-Tuválkin wrote:

> My copy of Photoshop 7 has an interesting image in its (HTML format) help file, page 1_16_4_13.html on "Using ligatures and old style numerals". It shows three examples of «Type with Ligatures option unselected and selected»: "ct", "fi" and "fh". The bad part of it is that the ligated characters shown (in the second and third examples) seem to include a long "s" instead of an "f"... ty_06.gif attached for reference.

Whoever made the image probably made a mistake; either that or the font used has faulty lookups. Photoshop 7 uses OpenType glyph substitution, so what you are seeing is not character mapping but glyph-space processing.

John Hudson

Tiro Typeworks www.tiro.com
Vancouver, BC [EMAIL PROTECTED]

You need a good operator to make type. If it were a DIY affair the caster would only run for about five minutes before the DIYer burned his butt off. - Jim Rimmer
Re: Accented ij ligatures (and yery)
On 2003.07.07, 00:25, Peter Kirk [EMAIL PROTECTED] wrote:

> Maybe originally U+044B (cyrillic y, yery) was two separate letters,

It sure was (though I should provide some references to back this up? Hm, later...)

> but it is certainly considered and used as one letter in Cyrillic languages today. Encoding it as two letters would be about as sensible as insisting that w should be encoded as two u's or that i should be encoded as dotless i plus combining dot.

Well, that was precisely my point when asking how much Dutch "ij" (as in "rijk", not as in "bijectie") is an analogous case.

> Note that yery is also sometimes written with an acute accent centred over the two elements, to indicate stress.

Indeed, in (at least) Russian dictionaries and school books. It can also receive an umlaut in Mari (precomposed as U+04F9), again centred over the ensemble of both elements.

--
António MARTINS-Tuválkin
[EMAIL PROTECTED]
http://www.tuvalkin.web.pt/bandeira/
http://pagina.de/bandeiras/
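How these letters are actually encoded can be checked with a few lines of Python. The sketch assumes only the standard `unicodedata` module and looks at the three cases mentioned above: yery with diaeresis, yery with a stress accent, and the Dutch ij character.

```python
import unicodedata

YERY = "\u044B"             # CYRILLIC SMALL LETTER YERU, one code point
YERY_DIAERESIS = "\u04F9"   # precomposed yery with diaeresis
ACUTE = "\u0301"            # COMBINING ACUTE ACCENT
IJ = "\u0133"               # LATIN SMALL LIGATURE IJ

# The precomposed letter is canonically equivalent to yery + combining diaeresis.
decomposed = unicodedata.normalize("NFD", YERY_DIAERESIS)
print([f"U+{ord(c):04X}" for c in decomposed])

# There is no precomposed yery with acute, so the stress mark stays a
# separate combining character even after composition.
stressed = unicodedata.normalize("NFC", YERY + ACUTE)
print(len(stressed))

# U+0133 only has a compatibility mapping to i + j.
print(unicodedata.normalize("NFC", IJ) == IJ)
print(unicodedata.normalize("NFKC", IJ))
```

So yery is a single character that takes one combining mark over the whole letter, whereas Dutch ij is normally typed as two letters, with U+0133 kept as a compatibility character.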
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish andAzeri, was: Accented ij ligatures)
Philippe wrote:

> I have tried several times to do it. It does not work: you may effectively remove some tables you don't need, but trying to extract just the normalizer is a real nightmare. I tried it in the past, and abandoned it: too tricky to maintain, and I retried it recently (one month ago, from its CVS source) and this was even worse than the first time.

webMethods includes the ICU normalizer in a couple of our products. The code for one of these products requires JDK 1.2.2, so, since I had to compile ICU anyway, I took the time to figure out the dependencies and build only what I needed. The list of classes required for the normalizer is actually quite small. Of the 1.3MB ICU4j.jar, only 400K are required for the normalizer to operate correctly. Source changes are not required. I will gladly send a complete list of classes to anyone who would like it. It took me a day to do the work (it took longer to test it than to build it).

Adding the normalizer to the JDK itself would also not be a difficult thing for Sun to do: that's because a version of the normalizer is already in the JDK, but private.

I will admit that it used to be quite difficult, back in the ICU 1.x days, to separate out the normalizer, but I've done that too (for reasons I shan't enumerate). I had to modify some source code to make it work, but that was mostly because I needed JDK 1.1.x. That JAR file is even smaller, at 161K. Building updated data tables is actually easier with the old source code...

In any event, you really ought to try the newer versions of ICU4J out. They are a lot easier to work with. And a "light" version isn't that hard to create, if that's what you want.

Best Regards,

Addison

--
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487
mailto:[EMAIL PROTECTED]

Internationalization is an architecture. It is not a feature.

Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws
Re: Ligatures in Turkish and Azeri
On 2003.07.12, 20:59, Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED] wrote: Just browsed some old book with that in mind I here meant rather books, plural. And I'll keep an eye for this in the future. -- . António MARTINS-Tuválkin, | ()| [EMAIL PROTECTED] || R. Laureano de Oliveira, 64 r/c esq. | PT-1885-050 MOSCAVIDE (LRS) Não me invejo de quem tem | +351 934 821 700 carros, parelhas e montes | http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe | http://pagina.de/bandeiras/ a água em todas as fontes |
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
On Sunday, July 13, 2003 10:21 PM, John Cowan [EMAIL PROTECTED] wrote:

> Michael Everson scripsit:
>
> > > A good choice if you don't slash your DIGIT SEVENs and can make your DIGIT ONEs sufficiently distinct.
> >
> > Eh? I *do* slash my DIGITs SEVEN and I use a single vertical stroke from my DIGITs ONE. The TIRONIAN SIGN ET as used in Ireland has no horizontal stroke.
>
> I should have said "do slash your DIGIT SEVENs". So the glyph in the Unicode 3.0 book is not typical of Irish practice? It seems to have a horizontal stroke all right.

In French too: children at school learn to use a horizontal stroke when drawing a digit seven, and the oblique stroke is often curved to become vertical at its central base (not placed at the left corner), with a small loop connecting it to the top horizontal stroke. I have always used a medial horizontal stroke on my sevens, often starting at the top left corner with a tiny loop too, to create a vertical serif...

-- Philippe.
Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
First, you should check again, since a significant amount of work was done on modularization in 2.6.

Second, the phrase "IBM forgot to modularize ICU" is misleading, at the least. Unlike some people, who appear to have unbounded time and energy for, say, writing emails, we have to carefully pick and choose where we spend our time. Whether very fine-grained modularization is important depends a great deal on the client's requirements, and must be traded off against the many other things we could be doing with our time.

Third, ICU4J is a source product. Saying that "it is impossible to integrate the ICU's Normalize..." is also misleading, since one can clearly modify source to remove dependencies on code one doesn't want to include, if it is not core to the functionality. (Of course, the amount of effort required may vary.) And transliterators are not, in any event, required for Normalization.

Mark
__
http://www.macchiato.com
Eppur si muove

----- Original Message -----
From: Philippe Verdy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, July 14, 2003 11:13
Subject: Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

On Monday, July 14, 2003 5:34 AM, Mark Davis [EMAIL PROTECTED] wrote:

> > ... Of course Java already includes some parts of ICU, but other things in ICU4J are difficult now to integrate in Java, simply because IBM forgot to modularize ICU so that it can be integrated slowly. Accepting ICU4J as part of the core is a big decision, because ICU4J is quite large, and there are certainly Java developers who would not accept having 1 additional MB of data and classes loaded in each JVM (particularly because the integration of ICU would affect a lot of core classes for the Java2 platform now also used for small devices).
> >
> > ... For example, it is impossible to integrate the ICU's Normalizer class in Java without also importing the UChar class and all its related services for UString, such as transliterators, and ...
>
> You are very misinformed about ICU4J.

I have tried several times to do it. It does not work: you may effectively remove some tables you don't need, but trying to extract just the normalizer is a real nightmare. I tried it in the past and abandoned it as too tricky to maintain; I retried it recently (one month ago, from its CVS source) and this was even worse than the first time.

I know that there was a recent announcement (less than 1 month ago) of its modularization, but it's true that I did not check the new modularized sources. So I still use ICU4J only when I can accept the whole package, as maintaining a stripped-down customization is too tricky. But maybe this has changed; I just updated my ICU sources from CVS. I'll recheck to see if an ICU Light version can be created (which would only keep the core features, without the support for tailoring rules compiled at run-time).

-- Philippe.
Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
Jim Allan scripsit:

> What this doesn't indicate is that sometimes in medieval text the ampersand ligature is used to spell _et_ as part of a longer word.

Not just mediaeval text; "&c." for "etc." (= et cetera) was common right through the 19th century if not later.

--
John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com
In the sciences, we are now uniquely privileged to sit side by side with the giants on whose shoulders we stand. --Gerald Holton
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
Jim Allan scripsit:

> See http://www.adobe.com/type/topics/theampersand.html for a short history of the ampersand and some of its variations in modern computer fonts.

Unfortunately the explanation of the name "ampersand" given there is exactly backwards: it is not "& per se and", but "and per se &". Anglophones used to recite the alphabet by saying "... x, y, z, and per se [by itself] &", pronounced of course "and per se and" and later "ampersand".

> Check common fonts like Trebuchet MS, Berkeley Book, Goudy Sans, Korinna and Univers for recognizable _Et_ ampersands.

I hand-write & by making a tall lower-case epsilon glyph and then drawing a solidus over it.

--
John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan
I am expressing my opinion. When my honorable and gallant friend is called, he will express his opinion. This is the process which we call Debate. --Winston Churchill
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
At 01:21 -0400 2003-07-13, John Cowan wrote:

> I hand-write & by making a tall lower-case epsilon glyph and then drawing a solidus over it.

I just use the TIRONIAN SIGN ET.

--
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
>>>>> "John" == John Cowan [EMAIL PROTECTED] writes:

John> Not just mediaeval text; "&c." for "etc." (= et cetera) was
John> common right through the 19th century if not later.

And picked up steam again online in the 1980s; groups.google.com should have lots of examples of "&c."

-JimC
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
Michael Everson scripsit:

> > I hand-write & by making a tall lower-case epsilon glyph and then drawing a solidus over it.
>
> I just use the TIRONIAN SIGN ET.

A good choice if you don't slash your DIGIT SEVENs and can make your DIGIT ONEs sufficiently distinct.

--
John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com
Dream projects long deferred usually bite the wax tadpole. --James Lileks
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
John Cowan posted:

> Not just mediaeval text; "&c." for "etc." (= et cetera) was common right through the 19th century if not later.

The combination _&c_ is still used. Search for "&c" in http://www.scotland.gov.uk/consultations/environment/tacnh-00.asp for example.

But in mentioning medieval use I was thinking of use of the ampersand as a replacement for _et_ in words where _et_ is not the Latin word _et_.

An article I read some years back discussed a medieval listing and explanation of the Icelandic alphabet which included the _&_ as a letter. The author of the article explained this by noting that _&_ was used occasionally in manuscripts to spell _et_ in Icelandic words.

Jim Allan
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
At 14:09 -0400 2003-07-13, John Cowan wrote: Michael Everson scripsit: I hand-write by making a tall lower-case epsilon glyph and then drawing a solidus over it. I just use the TIRONIAN SIGN ET. A good choice if you don't slash your DIGIT SEVENs and can make your DIGIT ONEs sufficiently distinct. Eh? I *do* slash my DIGITs SEVEN and I use a single vertical stroke from my DIGITs ONE. The TIRONIAN SIGN ET as used in Ireland has no horizontal stroke. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
Michael Everson scripsit:

> > A good choice if you don't slash your DIGIT SEVENs and can make your DIGIT ONEs sufficiently distinct.
>
> Eh? I *do* slash my DIGITs SEVEN and I use a single vertical stroke from my DIGITs ONE. The TIRONIAN SIGN ET as used in Ireland has no horizontal stroke.

I should have said "do slash your DIGIT SEVENs". So the glyph in the Unicode 3.0 book is not typical of Irish practice? It seems to have a horizontal stroke all right.

--
John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan
Where the wombat has walked, it will inevitably walk again.
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
At 16:21 -0400 2003-07-13, John Cowan wrote: I should have said do slash your DIGIT SEVENs. So the glyph in the Unicode 3.0 book is not typical of Irish practice? It seems to have a horizontal stroke all right. It is utterly typical of Irish practice. I meant that it doesn't have an additional horizontal stroke as a slashed 7 does. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
Philippe Verdy verdy_p at wanadoo dot fr wrote:

> All this discussion shows that there is an extremely large number of glyph variations for the ampersand, which is both (at the abstract level) a symbol character and a ligature of two lowercase abstract characters. But ligatures for the uppercase ET and titlecase Et exist as well. For Unicode, only the abstract symbol is encoded, but not the ligatures, even though they share a common set of glyphs.

That is one of the essential features of Unicode. Abstract characters are encoded; glyph variants (in general) are not.

> Could the variation selectors be used? I see that Unicode does not allow free use of variation selectors, which are defined only for cases where it would be important to preserve the precise semantics of the encoded text, but not as a way to preserve glyphic information (so character variants are strictly limited).

That's correct. The difference between the Arial-style glyph that looks a bit like a tilted treble clef (U+1D11E) and John's epsilon-with-solidus and Philippe's e-with-small-attached-t is one of style only. The distinction does not need to be encoded in plain text, any more than the distinction between a lowercase g with one bowl versus two.

Apparently the math experts really, really needed to make a distinction in plain text between (e.g.) a less-than-or-equal sign with a horizontal bottom stroke and one with a slanted bottom stroke. We can take it on faith that that distinction is important in plain text, but we don't need to add more distinctions that probably aren't.

> I don't see a solution for this problem within Unicode itself (nor in ISO/IEC 10646), unless a separate standard is started to encode glyphs mapped to characters (in the UCS-4 space, outside its first 17 planes?). For now the safest way is to use specific fonts encoding these glyphs in PUA positions, and bind these fonts to the abstract text using stylesheets, meta information, or markup languages.
>
> But with such a technique, the abstract text would be modified. A way to avoid it is to surround the text with markup that specifies an explicit substitution, like this in XML: <typo as="&#xF001;">et</typo>,

You probably don't want to start down the slippery slope of encoding Latin glyph variants as PUA characters. Check the archives of this mailing list; you will find that proposals to use the PUA to turn Unicode into a glyph registry are generally not well received.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
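Doug's point that abstract characters are encoded while glyph variants are not can be checked against the character database. What follows is only a minimal sketch in Python using the standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of any library.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The three characters mentioned in this thread: the ampersand, the
# Tironian sign et, and the "treble clef" Doug compares a glyph to.
for ch in ("&", "\u204A", "\U0001D11E"):
    print(describe(ch))
```

Whatever shape a font gives it, the ampersand is one encoded character; the epsilon-with-solidus and _Et_ shapes have no code points of their own, so any choice between them has to live in the font or in markup.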
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
> ... Of course Java already includes some parts of ICU, but other things in ICU4J are difficult now to integrate in Java, simply because IBM forgot to modularize ICU so that it can be integrated slowly. Accepting ICU4J as part of the core is a big decision, because ICU4J is quite large, and there are certainly Java developers who would not accept having 1 additional MB of data and classes loaded in each JVM (particularly because the integration of ICU would affect a lot of core classes for the Java2 platform now also used for small devices).
>
> ... For example, it is impossible to integrate the ICU's Normalizer class in Java without also importing the UChar class and all its related services for UString, such as transliterators, and ...

You are very misinformed about ICU4J.

Mark
__
http://www.macchiato.com
Eppur si muove

----- Original Message -----
From: Philippe Verdy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, July 12, 2003 14:45
Subject: Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

On Saturday, July 12, 2003 4:17 PM, Jony Rosenne [EMAIL PROTECTED] wrote:

> What has "iw" to do with Hebrew? I wasn't involved with the change, but I'm glad it was done. Java and other systems probably still use it because they never bothered to check the latest version of 639. I know for certain that this was the case with one of the major computer vendors.

In the case of Java, I don't think so. Sun has certainly kept the language code simply to avoid breaking existing Hebrew localizations of Java-written software, probably while waiting for a better way to locate locales than the fixed locale path resolution algorithm which has been part of its core classes since the beginning.

As long as the Java core classes do not use a locale resolver that allows tuning the resolution algorithm used to load resource bundles, while also maintaining compatibility with existing software that assumes that Hebrew resources are loaded with the "iw" language code, Sun will not change this code. In IBM ICU4J there is such an extended resolver, but Sun takes a long time to approve such proposals, which must first be accepted, documented, balloted and voted on in its JCP program.

Of course Java already includes some parts of ICU, but other things in ICU4J are difficult now to integrate in Java, simply because IBM forgot to modularize ICU so that it can be integrated slowly. Accepting ICU4J as part of the core is a big decision, because ICU4J is quite large, and there are certainly Java developers who would not accept having 1 additional MB of data and classes loaded in each JVM (particularly because the integration of ICU would affect a lot of core classes for the Java2 platform now also used for small devices).

For example, it is impossible to integrate the ICU's Normalizer class in Java without also importing the UChar class and all its related services for UString, such as transliterators, and advanced features such as the UCA tailoring rules run-time compiler.

Some ICU open-sourcers, as well as its users, now seem to think that the modularization of ICU is an important but complex project.

-- Philippe.
Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Where does the fact of saying that a Grapheme Disjoiner... The character you should be referring to is not a new character GDJ, but rather is the existing ZWNJ, the functions of which include prevention of a ligature. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
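As a rough illustration of Peter's remark, the existing ZWNJ (U+200C) can simply be placed between the two letters whose ligature should be suppressed. The sketch below is in Python; the function name `break_ligature` and its default pair are invented for this example, and whether a ligature is actually suppressed still depends on the font and rendering engine.

```python
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

def break_ligature(text, pair="fi"):
    """Insert ZWNJ between the two letters of every occurrence of `pair`."""
    return text.replace(pair, pair[0] + ZWNJ + pair[1])

word = break_ligature("film")
print([unicodedata.name(c) for c in word])
```

The inserted character is invisible and has no width of its own; it only asks the renderer not to join or ligate its neighbours.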
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
On Saturday, July 12, 2003 6:51 AM, Doug Ewell [EMAIL PROTECTED] wrote:

> Philippe Verdy verdy_p at wanadoo dot fr wrote:
>
> > Good luck with ISO language codes, which do not even define them, and contain many duplicate codes even in the Alpha-2 space (he/iw, in/id), or imprecise codes that sometimes match very imprecise families of languages overlapping other language codes...
>
> The codes "iw" for Hebrew and "in" for Indonesian were deprecated FOURTEEN YEARS AGO. It is not accurate or fair to refer to them as duplicates of "he" and "id". The Registration Authority deprecates such codes, rather than deleting them, for backward compatibility with any data that might contain the old codes.

I was also sure that "iw" was not used today, until I found that it is still used in Java on Windows, for legacy reasons... Creating a resource bundle in Hebrew with the code "he" was simply... ignored. So I had to rename it to "iw". Shamefully, on Linux and various Unixes the correct code to use for locales varies, and it comes from the user-environment settings, which are set up by a system profile most of the time... Users who want to benefit from existing locales for Hebrew will constantly need to switch between "he" and "iw". The normal installation solution is still today to create a file link between the "he" and "iw" resources, so that both can be used.

I was really disappointed when I saw that these legacy language codes could not be simplified the way one would think, by ignoring "iw" and "in". Still today, Java does not offer a way to create links at run time to resolve locales with equivalent ids without duplicating resources or creating special rules such as: if (code.equals("he") || code.equals("iw")) (don't forget that Java also has run-time resources with no files)...
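The kind of run-time alias resolution Philippe is asking for could look roughly like the following. This is only a sketch in Python, not how Java resolves locales; the table and the function names `canonical_language` and `bundle_candidates` are assumptions made for illustration. The table holds the deprecated two-letter codes that come up in this thread ("iw", "in" and "ji").

```python
# Deprecated ISO 639 two-letter codes mentioned in this thread, mapped to
# their replacements. Illustrative table only, not a complete registry.
DEPRECATED_LANGUAGE_CODES = {"iw": "he", "in": "id", "ji": "yi"}

def canonical_language(code):
    """Map a deprecated language code to its current equivalent."""
    code = code.lower()
    return DEPRECATED_LANGUAGE_CODES.get(code, code)

def bundle_candidates(code):
    """Codes to try when loading a resource: canonical first, then legacy aliases."""
    canonical = canonical_language(code)
    aliases = sorted(old for old, new in DEPRECATED_LANGUAGE_CODES.items()
                     if new == canonical)
    return [canonical] + aliases

print(bundle_candidates("iw"))
print(bundle_candidates("fr"))
```

A loader that walks `bundle_candidates` would find a Hebrew resource whether it was stored under "he" or "iw", without file links or duplicated resources.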
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 11/07/2003 11:18, Philippe Verdy wrote: # T: special case for uppercase I and dotted uppercase I #- For non-Turkic languages, this mapping is normally not used. #- For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters. snip Is that what is called a character subset for a scripted language family? Well I don't like the term Turkic to name it. I prefer the more common Altaic Latin alphabet, seen as a standard subset of the Latin script, with additional properties. May be Unicode should not try to use language codes for families of languages, but it could define representative subsets of characters which may contain characters from several scripts, but would be minimized according to the tradition of a family of languages. Such families seem evident from the current ISO-8859-* and Mac/Windows/DOS charsets. -- Philippe. Thank you, Philippe. Well, I am glad to read not normally used rather than must not be used as this allows mapping T to be used for other languages when appropriate. I also don't like the word Turkic here. This is a linguistic term for a language family, see http://www.ethnologue.com/show_family.asp?subid=710. Turkish and Azeri are Turkic languages, but there are many Turkic languages which don't use this case mapping, either because they use other alphabets (Cyrillic, Arabic, occasionally Hebrew, perhaps even Greek) or because they use a Latin alphabet with the regular case mapping as in Uzbek and Turkmen. There are also some non-Turkic minority languages which need the T case mapping. Altaic Latin alphabet is a reasonable alternative, although again Altaic is a language family name, covering Turkic, Mongolian and Tungus, see http://www.ethnologue.com/show_family.asp?subid=709, and as far as I know mapping T is not needed for any Mongolian or Tungusic languages. Does anyone know of a good resource on the web, or elsewhere, listing the alphabets used for different languages around the world? 
I know a project was attempted a few years ago at least for Europe. It would be useful to have this kind of data available somewhere even with no official status. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
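The "T" mapping discussed above can be sketched in a few lines. This is a simplified illustration in Python, not the full conditional rules from SpecialCasing.txt (for instance, it ignores a capital I followed by a combining dot above); the function names and the set `DOTTED_I_LANGUAGES` are invented for the example, and other languages could be added to the set where appropriate, as Peter notes.

```python
DOTLESS_I = "\u0131"      # LATIN SMALL LETTER DOTLESS I
CAPITAL_DOTTED_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Languages assumed here to use the "T" mapping; extend as needed.
DOTTED_I_LANGUAGES = {"tr", "az"}

def to_lower(text, lang=""):
    """Lowercase `text`, pairing I with dotless i for the listed languages."""
    if lang in DOTTED_I_LANGUAGES:
        text = text.replace("I", DOTLESS_I).replace(CAPITAL_DOTTED_I, "i")
    return text.lower()

def to_upper(text, lang=""):
    """Uppercase `text`, pairing i with dotted capital I for the listed languages."""
    if lang in DOTTED_I_LANGUAGES:
        text = text.replace("i", CAPITAL_DOTTED_I).replace(DOTLESS_I, "I")
    return text.upper()

print(to_upper("istanbul", "tr"), to_upper("istanbul"))
print(to_lower("ISPARTA", "tr"), to_lower("ISPARTA"))
```

Keying the behaviour on a language tag rather than on a language family avoids the naming problem raised above: Uzbek or Turkmen text would simply not be tagged with a code in the set.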
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
At 03:25 -0700 2003-07-12, Peter Kirk wrote: Does anyone know of a good resource on the web, or elsewhere, listing the alphabets used for different languages around the world? I know a project was attempted a few years ago at least for Europe. It would be useful to have this kind of data available somewhere even with no official status. http://www.evertype.com/alphabets -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
On Saturday, July 12, at 6:51, Doug Ewell [EMAIL PROTECTED] wrote:

> The codes "iw" for Hebrew and "in" for Indonesian were deprecated FOURTEEN YEARS AGO. It is not accurate or fair to refer to them as duplicates of "he" and "id". The Registration Authority deprecates such codes, rather than deleting them, for backward compatibility with any data that might contain the old codes.

Just out of curiosity, why was "iw" deprecated? Seems perfectly fine to me. And why was "he" chosen (Herero, Hemba, Hellenic Greek)?

P.A.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 12/07/2003 04:18, Michael Everson wrote: At 03:25 -0700 2003-07-12, Peter Kirk wrote: Does anyone know of a good resource on the web, or elsewhere, listing the alphabets used for different languages around the world? I know a project was attempted a few years ago at least for Europe. It would be useful to have this kind of data available somewhere even with no official status. http://www.evertype.com/alphabets Thank you, Michael. I knew you had this information, of course, as I helped to provide it, but I didn't know where it was now. This is of course restricted to Europe as you have defined it, and is not exhaustive for Turkey. Also it doesn't include recent Latin alphabets for minority languages of Azerbaijan, as used in schools to a rather limited extent, perhaps because I never sent you the data. The link to http://www.evertype.com/alphabets/azerbaijan.pdf is broken; and in http://www.evertype.com/alphabets/turkish.pdf the dotted capital I is missing, as viewed in Acrobat Reader 5.1 on Windows 2000. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish andAzeri, was: Accented ij ligatures)
At 08:11 -0400 2003-07-12, Patrick Andries wrote: Just out of curiosity, why was « iw » deprecated ? Seems perfectly fine to me. And why was « he » chosen (Herero, Hemba, Hellenic Greek) ? Iwrit (iw), being a German transliteration of the name of the Hebrew language, and Jiddisch (ji) were both thought (by someone) to be less suitable than the English-based he and yi which replaced them. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
What has "iw" to do with Hebrew? I wasn't involved with the change, but I'm glad it was done. Java and other systems probably still use it because they never bothered to check the latest version of 639. I know for certain that this was the case with one of the major computer vendors.

Jony

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Andries
Sent: Saturday, July 12, 2003 2:12 PM
To: Philippe Verdy; Doug Ewell
Cc: [EMAIL PROTECTED]
Subject: Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

On Saturday, July 12, at 6:51, Doug Ewell [EMAIL PROTECTED] wrote:

> The codes "iw" for Hebrew and "in" for Indonesian were deprecated FOURTEEN YEARS AGO. It is not accurate or fair to refer to them as duplicates of "he" and "id". The Registration Authority deprecates such codes, rather than deleting them, for backward compatibility with any data that might contain the old codes.

Just out of curiosity, why was "iw" deprecated? Seems perfectly fine to me. And why was "he" chosen (Herero, Hemba, Hellenic Greek)?

P.A.
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
Michael Everson [EMAIL PROTECTED] wrote:

> At 08:11 -0400 2003-07-12, Patrick Andries wrote:
>
> > Just out of curiosity, why was "iw" deprecated? Seems perfectly fine to me. And why was "he" chosen (Herero, Hemba, Hellenic Greek)?
>
> Iwrit (iw), being a German transliteration of the name of the Hebrew language, and Jiddisch (ji) were both thought (by someone) to be less suitable than the English-based "he" and "yi" which replaced them.

This is also what I concluded, but "iv" for ivrit could have pleased those who thought the transliteration must be English-based (what a strange idea!).

P. A.
Re: ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
On Saturday, July 12, 2003 4:17 PM, Jony Rosenne [EMAIL PROTECTED] wrote:

> What has "iw" to do with Hebrew? I wasn't involved with the change, but I'm glad it was done. Java and other systems probably still use it because they never bothered to check the latest version of 639. I know for certain that this was the case with one of the major computer vendors.

In the case of Java, I don't think so. Sun has certainly kept the language code simply to avoid breaking existing Hebrew localizations of Java-written software, probably while waiting for a better way to locate locales than the fixed locale path resolution algorithm which has been part of its core classes since the beginning.

As long as the Java core classes do not use a locale resolver that allows tuning the resolution algorithm used to load resource bundles, while also maintaining compatibility with existing software that assumes that Hebrew resources are loaded with the "iw" language code, Sun will not change this code. In IBM ICU4J there is such an extended resolver, but Sun takes a long time to approve such proposals, which must first be accepted, documented, balloted and voted on in its JCP program.

Of course Java already includes some parts of ICU, but other things in ICU4J are difficult now to integrate in Java, simply because IBM forgot to modularize ICU so that it can be integrated slowly. Accepting ICU4J as part of the core is a big decision, because ICU4J is quite large, and there are certainly Java developers who would not accept having 1 additional MB of data and classes loaded in each JVM (particularly because the integration of ICU would affect a lot of core classes for the Java2 platform now also used for small devices).

For example, it is impossible to integrate the ICU's Normalizer class in Java without also importing the UChar class and all its related services for UString, such as transliterators, and advanced features such as the UCA tailoring rules run-time compiler.

Some ICU open-sourcers, as well as its users, now seem to think that the modularization of ICU is an important but complex project.

-- Philippe.
Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
On Saturday, July 12, 2003 9:59 PM, Anto'nio Martins-Tuva'lkin [EMAIL PROTECTED] wrote:

> On 2003.07.10, 20:34, John Cowan [EMAIL PROTECTED] wrote:
>
> > IIRC, Portuguese traditional typography also avoids the fi-ligature, even though the language has no dotless-i.
>
> Just browsed some old book with that in mind and I cannot really corroborate. I've even seen some other more exotic ligatures, such as "st" and "ct". Maybe there was such a recommendation in some Portuguese type-setting manual, but its result doesn't show...

In French typography, we also find special ligatures for the French (and Roman Latin) word "et" (meaning "and"), using old alternate forms for the lowercase letter e, looking mostly like a Greek epsilon (or the Latin Small Open E, still used in Tamazight as a letter distinct from the standard Latin Small E).

The resulting ligature glyph is very close to the ASCII ampersand character, and I just wonder whether the ampersand is not a variation of this French or Latin ligature, which belongs to the same typographic tradition as the s+t, c+t and long-s+t ligatures (and probably the long-s+s ligature too, in German's sharp s).

In French text, using the & character to replace the word "et" would seem ugly (or lazy), even today, when it looks like a technical symbol imported from English or used in trademarks (such as the new France Telecom Orange logo, which clearly uses the common association of this character with the Internet); it is called esperluète, éperluète, or commonly "et commercial".

By contrast, the use of the "et" ligature (which really represents the French word "et" with its two letters) is quite common even in recent books and publications, and it looks pretty good typographically, notably in its titlecase version at the beginning of sentences.

There are many examples in various languages where what was a typographic ligature of two letters came to be used as a separate letter or character in another language...

Now that computers can generate these ligatures more easily, I think there is a renewal of their use and creation, which will probably mean more ligatures converted to plain letters in written languages in the future.

-- Philippe.
Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
Philippe Verdy posted: In French typography, we also find the special ligatures for the French (and Roman Latin) word et (means and), using old alternate forms for the lowercase letter e, looking mostly like a Greek epsilon (or the Latin Small Open E, still used in Tamazigh as a letter distinct from the standard Latin Small E). See http://www.adobe.com/type/topics/theampersand.html for a short history of the ampersand and some of its variations in modern computer fonts. What this doesn't indicate is that sometimes in medieval text the ampersand ligature is used to spell _et_ as part of a longer word. So perhaps it should be considered a letter with alphabetic properties? The forms you describe seems like some of those shown in my link and all but the two earliest would be recognized by English readers as acceptable modern ampersand forms. Check common fonts like Trebuchet MS, Berkeley Book, Goudy Sans, Korinna and Univers for recognizable _Et_ ampersands. In common proofreading practice in English, at least in my experience, the ampersand is often pronounced as et. On the opposite, the use of the et ligature (which is really representing the French word et with its two letters) is quite common even in recent books and publications, and it looks pretty good typographically, notably for its titlecase version at at the beginning of sentences. Possibly a capital ampersand is needed? Jim Allan
Re: Ligatures in Portuguese, French (was: ... Turkish and Azeri)
- Original Message - From: Jim Allan [EMAIL PROTECTED] See http://www.adobe.com/type/topics/theampersand.html for a short history of the ampersand and some of its variations in modern computer fonts. Whole article (17 pages) about ampersand ligature in French (and other languages) : http://www.gutenberg.eu.org/pub/GUTenberg/publicationsPDF/22-blanchard.pdf
RE: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Note also: the Soft_Dotted property was created and considered specially for Turkish and Azeri. Adding to the long, and unfortunately getting longer, list of misleading statements from Philippe! No, the reason for the Soft_Dotted property was/is to mark which characters (regardless of language) that don't display intrinsic dot(s) above subglyph(s) when (another) combining character above is applied to it (and to then keep the dot(s) a combining dot above or a combining diaeresis, as appropriate, must be used explicitly). In this language context the ASCII i is always rendered with a dot, kept also for uppercases. I hope you don't mean to use a dotted glyph for U+0069! B.t.w. It is perfectly legal to use a ligature (in the TECHNICAL sense, perhaps not the typographic sense) for f, i also for Turkish and related languages, especially if the f and i would otherwise overlap. The point is that f, i and f, dotless i must be clearly distinguishable for these languages, and that may mean that one has to use a TECHNICAL ligature for f, i having a glyph where the dot on the i is clearly visible (the horizontal bar of the f and the top serif of the i may still merge). That may be done by whatever means that is better-looking for that particular font, e.g. moving the loop of the f to the left, right, or up. (Using ZWNJ should not do that, if correctly implemented, but can instead, mistakenly, result in overlapping f and dot-of-i glyphs, since not even a technical ligature, IIUC (correct me if I'm wrong), would be allowed...) /kent k
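Kent's description of Soft_Dotted can be made concrete with code point sequences. The sketch below is in Python; `SOFT_DOTTED_SAMPLE` is a deliberately partial, hand-picked list (the authoritative set is the Soft_Dotted property in the Unicode Character Database, which Python's `unicodedata` does not expose), and `accent_keeping_dot` is a made-up helper name.

```python
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"
COMBINING_ACUTE = "\u0301"

# Partial sample of Soft_Dotted characters, for illustration only:
# i, j, i with ogonek, Cyrillic i, Cyrillic je.
SOFT_DOTTED_SAMPLE = {"i", "j", "\u012F", "\u0456", "\u0458"}

def accent_keeping_dot(base, mark):
    """Attach `mark` to `base`; add an explicit dot above if the base is soft-dotted."""
    if base in SOFT_DOTTED_SAMPLE:
        return base + COMBINING_DOT_ABOVE + mark
    return base + mark

plain = "i" + COMBINING_ACUTE                       # dot is replaced by the accent
dotted = accent_keeping_dot("i", COMBINING_ACUTE)   # dot kept explicitly

for s in (plain, dotted):
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", s)])
```

The first sequence composes to the single character U+00ED, whose glyph carries no dot; the second stays as three code points, which is the explicit "dot plus accent" form Kent describes for cases where the dot must remain visible.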
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Friday, July 11, 2003 1:12 PM, Kent Karlsson [EMAIL PROTECTED] wrote: Note also: the Soft_Dotted property was created and considered specially for Turkish and Azeri. Adding to the long, and unfortunately getting longer, list of misleading statements from Philippe! No, the reason for the Soft_Dotted property was/is to mark which characters (regardless of language) that don't display intrinsic dot(s) above subglyph(s) when (another) combining character above is applied to it (and to then keep the dot(s) a combining dot above or a combining diaeresis, as appropriate, must be used explicitly). I don't know how I can say, with my limited English, things without being always accused of creating misleading things. Correct things if you think my words create possible confusion in their interpretation, but please don't over-exhibit them. I don't know how non-English native writers can participate here if all differences of interpretations caused by possible use of inappropriate English terms are answered with flame. This is really frustrating... The important words in my sentence is considered specially, where specially does not imply only. It's just that Turkish and Azeri are already given special treatment in Unicode, which already includes language exceptions in its technical algorithms (notably for character foldings). And according to this treatment, the U+0069 character is already intended to have a semantic value of a dotted i and not a dotless i in languages where this creates a semantic difference, so the question of the Soft_Dotted property is more glyphic than purely semantic, and it has a semantic behavior (at the abstract text level where Unicode is supposed to standardize things) mostly in case folding operations where the actual encoding of the converted abstract text is important. 
The rest of the description of the Soft_Dotted property is mostly a recommendation for authors of fonts and text renderers, so that they *preserve this semantic difference* in the rendered text between the abstract letters dotted and dotless i. This does not affect the encoding of the abstract text or any algorithmic transformation of the encoded abstract text. By saying *preserve this semantic difference*, I do not imply that U+0069 must or should have a dot above: that remains a font design problem, out of the scope of Unicode. There are certainly many ways to preserve the semantic difference in the rendered text when this is really appropriate (for example in Turkish and Azeri, or with a distinct and emphasized rendering of the Turkish dot, including in possible ligatures with other letters). FLAME-OFF And please, do not flame me if this message contains new terms that also create confusion. I reread my messages as best I can, and there are certainly better ways to say the same thing in English without these unintentionally confusing interpretations; I apologize in advance if such confusion still persists. Accept the fact that I'm not a Unicode member, that Unicode is only one of my interests, and that I have a lot of other terminologies to work with. If you can't accept that approximate English may be used by participants here, and refuse to understand the real intent of users when they write here, then have this group be moderated, but don't say it is open to discussions from anybody using Unicode. For normative aspects, with all exact terms, Unicode has its web site, its publications, its data files, its working draft documents, its technical committees, its permanent members, its chairmen, and even bug/comment report forms to interact with users at the normative level.
And I am sure that permanent Unicode members do not even need this newsgroup to exchange their work on normative documents that are directly sent to the working committee bureaus, or via private email, phone calls, snail letters, or their own web sites. Please don't expect the same linguistic level quality here. Also don't complain if my messages are long, but the constant critics about what I am supposed to imply, gives me no other choice than explaining always what I mean, and this is particularly lengthy, and really boring in a newsgroup. /FLAME-OFF Thanks for your patience. -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 11/07/2003 05:56, Philippe Verdy wrote: Note also: the Soft_Dotted property was created and considered specially for Turkish and Azeri. Whatever it was that was specially created or adjusted for Turkish and Azeri, was it specifically restricted to these two languages? These are I think the only relatively major languages which use the special dotted and dotless i case mappings. But they are also used, at least in a small way, for minority languages of Turkey and Azerbaijan. (Use of these minority languages in Turkey is illegal, but that's another matter.) They were used in the 1930's for many Central Asian languages, and were at least proposed in the 1990's for newly introduced Latin alphabets. So I hope that what is fixed by Unicode is the name not of two languages but of an extensible family of scripts. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Friday, July 11, 2003 3:50 PM, Peter Kirk [EMAIL PROTECTED] wrote: So I hope that what is fixed by Unicode is the name not of two languages but of an extensible family of scripts. I think you speak about a family of languages? Good luck with ISO language codes, which do not even define them and contain many duplicate codes even in the Alpha-2 space (he/iw, in/id), or imprecise codes sometimes matching very imprecise families of languages overlapping other language codes... Until it is demonstrated that a language needs such a fix in the Unicode support tables, it's best to just say that these fixes are needed for some recognized language codes, that applications are allowed to add their own fixes or language tailorings, and that the existing language tailorings in the Unicode databases are just non-normative samples. -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 11/07/2003 08:51, Philippe Verdy wrote: On Friday, July 11, 2003 3:50 PM, Peter Kirk [EMAIL PROTECTED] wrote: So I hope that what is fixed by Unicode is the name not of two languages but of an extensible family of scripts. I think you speak about a family of languages? Not really. A set of languages, but they are not all related in any way, and many of them have more than one script or alphabet, so this is not really a property of the languages. Perhaps "set of alphabets" would be a better way to put it. Good luck with ISO language codes, which do not even define them and contain many duplicate codes even in the Alpha-2 space (he/iw, in/id), or imprecise codes sometimes matching very imprecise families of languages overlapping other language codes... Until it is demonstrated that a language needs such a fix in the Unicode support tables, ... If necessary I can collect some data to demonstrate this, at least for some languages. ... it's best to just say that these fixes are needed for some recognized language codes, that applications are allowed to add their own fixes or language tailorings, and that the existing language tailorings in the Unicode databases are just non-normative samples. -- Philippe. Agreed. But does Unicode actually treat them as non-normative samples? -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Friday, July 11, 2003 6:43 PM, Peter Kirk [EMAIL PROTECTED] wrote: Agreed. But does Unicode actually treat them as non-normative samples?

Not clear here: the reference documents say that these tables are normative for applications that want to implement a conforming case folding. But UTR#30 (character foldings) still contains many areas marked as "to be done", so it is not clear that all folding issues have been solved. It seems reasonable, however, that the non-language-specific elements in the CaseFolding table are normative, as they are computed from the UCD... I see this comment:

[quote]
# The entries in this file are in the following machine-readable format:
#
# <code>; <status>; <mapping>; # <name>
#
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
#      Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
#      See the discussions of case mapping in the Unicode Standard for more information.
#
# Usage:
#  A. To do a simple case folding, use the mappings with status C + S.
#  B. To do a full case folding, use the mappings with status C + F.
#
#    The mappings with status T can be used or omitted depending on the desired case-folding
#    behavior. (The default option is to exclude them.)
#
[/quote]

Simple Case Mapping (C+S) is not marked as "to be done" in UTR#30, but the other special mappings with status T are off by default (so they depend on a specific tailoring, which is non-normative behavior if I interpret it correctly, since applications are free to use them or not, under unspecified conditions, i.e. here "the desired behavior"). This concerns many more characters than just the Turkish/Azeri uses, and there is some overlap with the informative and unfinished UTR#30 reference:

(1) Simple mappings (are they normative?):
1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F89; S; 1F81; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
1F8A; S; 1F82; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F8B; S; 1F83; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F8C; S; 1F84; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F8D; S; 1F85; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F8E; S; 1F86; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F8F; S; 1F87; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1F98; S; 1F90; # GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
1F99; S; 1F91; # GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
1F9A; S; 1F92; # GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F9B; S; 1F93; # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F9C; S; 1F94; # GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F9D; S; 1F95; # GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F9E; S; 1F96; # GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F9F; S; 1F97; # GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FA8; S; 1FA0; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
1FA9; S; 1FA1; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
1FAA; S; 1FA2; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1FAB; S; 1FA3; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1FAC; S; 1FA4; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1FAD; S; 1FA5; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1FAE; S; 1FA6; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1FAF; S; 1FA7; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FBC; S; 1FB3; # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
1FCC; S; 1FC3; # GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FFC; S; 1FF3; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI

(2) Full mappings (clearly optional):
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0149; F; 02BC 006E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON
0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
0587; F; 0565 0582; # ARMENIAN SMALL LIGATURE ECH YIWN
1E96; F; 0068 0331; # LATIN SMALL LETTER H WITH LINE BELOW
1E97;
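The status letters in the quoted file header amount to a small selection rule: C + S for simple folding, C + F for full folding, with T layered on top only when Turkic behavior is wanted. The Python sketch below is an illustration of that rule and not how any particular Unicode library implements it. The names `parse_case_folding` and `fold` are hypothetical, the table is a handful of sample entries rather than the real data file, and the `0049; C` line and the two status-T lines are not part of the excerpt quoted in this message; they are written in the same format as an assumption about what the data file contains.

```python
# Sample entries in the "<code>; <status>; <mapping>; # <name>" format.
# The C and T lines are assumptions (not in the quoted excerpt).
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1FBC; S; 1FB3; # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
"""


def parse_case_folding(text):
    """Return {status: {source_char: folded_string}} from CaseFolding-style lines."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing name comment
        if not data:
            continue
        code, status, mapping = [field.strip() for field in data.split(";")[:3]]
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        table.setdefault(status, {})[chr(int(code, 16))] = target
    return table


def fold(s, table, full=True, turkic=False):
    """Fold s using C+F (full) or C+S (simple); add T only if turkic is set."""
    mapping = {}
    mapping.update(table.get("C", {}))
    mapping.update(table.get("F" if full else "S", {}))
    if turkic:
        # Status T replaces the normal mapping for I and dotted capital I.
        mapping.update(table.get("T", {}))
    return "".join(mapping.get(ch, ch) for ch in s)


TABLE = parse_case_folding(SAMPLE)
for label, kwargs in [("full", {}), ("full+T", {"turkic": True})]:
    folded = fold("I\u0130", TABLE, **kwargs)
    print(label, [f"U+{ord(c):04X}" for c in folded])
```

With T excluded (the default the file header describes), capital I folds to plain i; with T included it folds to dotless i, which is why the choice has to come from outside the table itself.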
ISO 639 duplicate codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
Philippe Verdy verdy_p at wanadoo dot fr wrote: Good luck with ISO language codes, which do not even define them and contain many duplicate codes even in the Alpha-2 space (he/iw, in/id), or imprecise codes sometimes matching very imprecise families of languages overlapping other language codes... The codes "iw" for Hebrew and "in" for Indonesian were deprecated FOURTEEN YEARS AGO. It is not accurate or fair to refer to them as duplicates of "he" and "id". The Registration Authority deprecates such codes, rather than deleting them, for backward compatibility with any data that might contain the old codes. The part about codes for language families overlapping other codes for specific languages is, regrettably, true. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
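Because the old codes are deprecated rather than deleted, software that reads legacy data typically has to map them to their replacements. A minimal Python sketch of that idea follows; the function name `canonical_language_code` is hypothetical, and the table holds only the two pairs mentioned in this thread, not the full registry.

```python
# Deprecated ISO 639 two-letter codes named in this thread and their
# current replacements. Illustrative only, not a complete registry.
DEPRECATED_TO_CURRENT = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
}


def canonical_language_code(code):
    """Return the current code for a possibly deprecated two-letter code."""
    normalized = code.strip().lower()
    return DEPRECATED_TO_CURRENT.get(normalized, normalized)


for tag in ["iw", "he", "IN", "tr"]:
    print(tag, "->", canonical_language_code(tag))
```

Old data keeps working, and new data only ever sees the current code.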
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 12:08 PM, Peter Kirk [EMAIL PROTECTED] wrote: On 1st July Philippe Verdy wrote: If fonts still want to display dots on these characters, that's a rendering problem: there already exists a lot of fonts used for languages other than Turkish and Azeri, which do not display any dot on a lowercase ASCII i or j (dotted), and display a dot on their uppercase ASCII versions (normally not dotted with classic fonts)... The absence or presence of these dots is then seen as decorative even if these fonts are not suitable for Turkish and Azeri, but this is clearly not an encoding problem in the Unicode encoded text, and not a problem either for case conversions. Turkish and Azeri do not use the ij ligature. The sequences i - j and dotless i - j do occur (rarely, as j is a rare letter in both languages) but are treated as separate letters. I know, and the quoted paragraph did not speak about the ij ligature but effectively about the separate dotted/dotless i/I letters, for which decorated fonts where the lowercase ASCII (dotted) i codepoint uses a dotless glyph, or the uppercase ASCII (dotless) I codepoint uses a dotted glyph (some fonts are ligating the dot with decorative curves). These fonts are effectively not suitable for Turkish and Azeri. In Turkish and Azeri the sequences f - i and f - dotless i both occur, and are fairly frequent. So it is inappropriate in these languages to use fi ligatures in which the dot on the i is lost or invisible, at least where the second character is a dotted i. Has any thought been given to this issue? Is it possible to block such ligation on a language-dependent basis? Isn't there a Grapheme Disjoiner format control character to force the absence of a ligature like fi, i.e. f, GDJ, i? Also it is certainly possible that in dictionaries etc in these languages stress might be marked by an accent on the vowel - as certainly in the older Cyrillic Azeri just as in Bulgarian as just posted. 
In this case the dot should not be removed from the dotted i when the stress mark is added, so that the distinction from dotless i is not lost. Has that issue been addressed? (In my Latin script Azeri dictionary stress is marked by a spacing grave accent before the vowel, but this may have been done precisely to work around this problem.) This is part of the proposal for review: an explicit combining dot-above diacritic can be inserted between the normal (soft-dotted) base letter and the above diacritic (with class 230):
latin-small-i, dot-above, acute-accent
cyrillic-small-je, dot-above, grave-accent
-- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 10/07/2003 08:21, Philippe Verdy wrote: In Turkish and Azeri the sequences f - i and f - dotless i both occur, and are fairly frequent. So it is inappropriate in these languages to use fi ligatures in which the dot on the i is lost or invisible, at least where the second character is a dotted i. Has any thought been given to this issue? Is it possible to block such ligation on a language-dependent basis? Isn't there a Grapheme Disjoiner format control character to force the absence of a ligature like fi, i.e. f, GDJ, i? Maybe, but it is hardly realistic to expect all existing Turkish and Azeri text to be recoded to insert a character in the middle of each f - i sequence. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 5:41 PM, Peter Kirk [EMAIL PROTECTED] wrote: Isn't there a Grapheme Disjoiner format control character to force the absence of a ligature like fi, i.e. f, GDJ, i? Maybe, but it is hardly realistic to expect all existing Turkish and Azeri text to be recoded to insert a character in the middle of each f - i sequence. Note also: the Soft_Dotted property was created and considered specially for Turkish and Azeri. In this language context the ASCII i is always rendered with a dot, which is also kept in uppercase. The other solution would be to use f, i, dot-above: the forced dot-above diacritic avoids the ligature, and the sequence is rendered by two glyphs for f and i, dot-above, i.e. the glyph for f, and the force-dotted glyph for i. Its uppercase conversion causes no problem:
F, I, dot-above = F + I, dot-above = F + I-dot-above
The same holds with additional stress diacritics:
f, i, dot-above, acute-accent = f + i, dot-above, acute-accent
F, I, dot-above, acute-accent = F + I, dot-above, acute-accent = F + I-dot-above, acute-accent
-- Philippe.
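The uppercase conversions listed in this message can be checked with a short Python sketch. It is only an illustration: it relies on Python's default (not Turkic-tailored) `str.upper` and on NFC composition from the standard `unicodedata` module, and the helper name `upper_nfc` is hypothetical. It shows what happens to the code points, not how a font would draw the result.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT


def upper_nfc(text):
    """Uppercase with the default mapping, then compose to NFC."""
    return unicodedata.normalize("NFC", text.upper())


def code_points(text):
    return [f"U+{ord(c):04X}" for c in text]


plain = "fi" + DOT_ABOVE             # f, i, dot-above
stressed = "fi" + DOT_ABOVE + ACUTE  # f, i, dot-above, acute-accent

print(code_points(upper_nfc(plain)))
print(code_points(upper_nfc(stressed)))
```

In both cases the I and the explicit dot-above compose to U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, and the acute accent, where present, stays as a separate combining mark after it.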
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 10/07/2003 09:34, Stefan Persson wrote: Peter Kirk wrote: Maybe, but it is hardly realistic to expect all existing Turkish and Azeri text to be recoded to insert a character in the middle of each f - i sequence. Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar code pages? In that case, it would be enough to add the proper disjoiners to the proper Unicode conversion tables. Stefan There is no existing code page covering Azeri Latin, so everything is in Unicode or in one of a huge variety of custom solutions. See http://www.azer.com/aiweb/categories/magazine/81_folder/81_articles/81_standardfonts.html, and the article "The Land of Azeri Fonts: It's a Jungle Out There" in the same magazine issue, unfortunately not online, which summarises 20 or so custom encodings all in current use. Anyway, I understood from the recent discussion of Hebrew that it is Unicode policy not to do anything which could theoretically invalidate existing text even if it could be proved that no such text existed. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Peter Kirk wrote: Maybe, but it is hardly realistic to expect all existing Turkish and Azeri text to be recoded to insert a character in the middle of each f - i sequence. Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar code pages? In that case, it would be enough to add the proper disjoiners to the proper Unicode conversion tables. Stefan
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 6:42 PM, Peter Kirk [EMAIL PROTECTED] wrote: Anyway, I understood from the recent discussion of Hebrew that it is Unicode policy not to do anything which could theoretically invalidate existing text even if it could be proved that no such text existed. Where does the fact of saying that a Grapheme Disjoiner can be used in Turkish to avoid that the f collapses the dot above a next lowercase i? This does not change anything: existing texts can still produce ligatures in a renderer, unless explicitly said to not do so with a Grapheme Disjoiner, or the renderer is specially tuned to support the Turkish/Azeri languages. Existing texts do not need to be reencoded, if they are already correctly labelled with their language. The absence of such language specifier will never forbid a renderer to choose a fi ligature if available, unless these renderers are made conforming by correctly interpreting the Grapheme Disjoiner to mean break the grapheme cluster here, and display the previous character(s), then the Grapheme Disjoiner can be rendered itself as a non-spacing empty glyph, then the rest of the string... I'm still convinced that a ligature is still possible for a turkish f, dotted-i sequence, using f, i, dot-above. The ligature would apply to the middle bar of the f joined with the top serif of the i, but the top-right loop of the f would simply be a small horital bar, disjoined from the dot still present on the i. The same ligature could be used for the encoded sequence f, dotless-i, so an actual font would render the glyphs for f, i, dot-above as a base ligature glyph for f, dotless-i (with a top horizontal bar for the f part), and add separately the dot-above glyph kerned into the existing f-dotless-i ligature. To force disable this last ligature, we would use the encoded sequence f, GDJ, dot-less-i According to unicode the sequence i, dot-above has always been valid, despite it apparently has the same dotted glyph for all languages. 
It differs only in that the explicit dot-above overrides the soft-dotted behavior of the preceding i: the i is rendered dotless and is then followed by a forced diacritic. So the encoded sequence i, dot-above is now made equivalent (for rendering purposes) to dotless-i, dot-above (although the two are not canonically equivalent per UAX#15: NFC/D) and not equivalent to an isolated i (one not followed by any above diacritic)... -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Peter Kirk asked: In Turkish and Azeri the sequences f - i and f - dotless i both occur, and are fairly frequent. So it is inappropriate in these languages to use fi ligatures in which the dot on the i is lost or invisible, at least where the second character is a dotted i. Has any thought been given to this issue? Is it possible to block such ligation on a language-dependent basis? and Philippe Verdy responded with another question: Isn't there a Grapheme Disjoiner format control character to force the absence of a ligature like fi, i.e. f, GDJ, i? The answer to Philippe's rejoinder question is no, there is not a Grapheme Disjoiner format control character. What Philippe has in mind, however, is covered in the standard by the interaction of the joiner and non-joiner characters with ligature control: "U+200C ZERO WIDTH NON-JOINER is intended to break both cursive connections and ligatures in rendering. ZWNJ requests that glyphs in the lowest available category (for the given font) be used." -- Unicode 4.0, Section 15.2, Layout Controls. The categories referred to, from lowest to highest, are: 1. unconnected 2. cursively connected 3. ligated. As Peter pointed out, however, it is neither expected nor reasonable to have to go back through and drop in ZWNJs at every relevant location in existing Turkish or Azeri text, simply to prevent fi ligation. Such use of ZWNJ is intended to be exceptional, to deal with special cases. The general solution depends on the use of fonts (or, more generally, renderers) which block such ligation across the board. It is my understanding that modern font technologies allow the choice of ligation to essentially be a style selection for the font. How well various applications take advantage of that and make the choice available easily to end users may be an open issue still, but the fundamental pieces to do this correctly are available. --Ken
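For the exceptional case described above, where a single f, i pair must not ligate, the encoded text simply carries U+200C between the two letters. The Python sketch below is a minimal illustration with hypothetical helper names (`break_fi`, `strip_zwnj`) and an arbitrary sample string. As the message says, running something like this over whole bodies of Turkish or Azeri text is not the recommended general solution; that belongs in the font or renderer.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def break_fi(text):
    """Insert ZWNJ between every f and a following i (exceptional use only)."""
    return text.replace("fi", "f" + ZWNJ + "i")


def strip_zwnj(text):
    """Remove every ZWNJ, recovering the plain letter sequence."""
    return text.replace(ZWNJ, "")


sample = "fil"
marked = break_fi(sample)
print([f"U+{ord(c):04X}" for c in marked])
print(unicodedata.name(ZWNJ), unicodedata.category(ZWNJ))
```

The marked string is one code point longer than the original and no longer compares equal to it, which is part of why sprinkling ZWNJ through existing text has a cost beyond the editing effort.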
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 8:37 PM, Kenneth Whistler [EMAIL PROTECTED] wrote: Peter Kirk asked: In Turkish and Azeri the sequences f - i and f - dotless i both occur, and are fairly frequent. So it is inappropriate in these languages to use fi ligatures in which the dot on the i is lost or invisible, at least where the second character is a dotted i. Has any thought been given to this issue? Is it possible to block such ligation on a language-dependent basis? and Philippe Verdy responded with another question: Isn't there a Grapheme Disjoiner format control character to force the absence of a ligature like fi, i.e. f, GDJ, i? The answer to Philippe's rejoinder question is no, there is not a Grapheme Disjoiner format control character. I did not refer to a specific Unicode character, I knew that there is one already dedicated, but I did not want to comment about this choice. There's no contradiction. The Grapheme Disjoiner, for you, is ZWNJ. OK. And I did not want to promote any change to any legally encoded legacy text, only to suggest ways to solve the apparent rendering problem in Turkish, where the encoded character pair f, i may be badly rendered. For the actual rendering, selecting a fi ligature is not appropriate for Turkish, and in fact the canonically decomposed character has no linguistic ambiguity in Turkish. So whatever the fi encoded codepoint designates, it is not the fi ligature glyph but really two characters, whose ligation may still be performed according to language context. A font that would automatically select a fi ligature to represent a sequence of f, i codepoints, from the fact that the fi codepoint is canonically equivalent, is probably defective and not conforming.
Such selection of ligature must be put under the control of the renderer with additional markup, which can in fact select among three ligatures in Turkish: the fi ligature glyph where the f is ligated with the dot above the i (the normal ligature for languages other than Turkish/Azeri), and the f-dotted-i and f-dotless-i ligatures for Turkish/Azeri. Markup is necessary to select the appropriate glyph, or this can be selected by using the Grapheme Disjoiner (ZWNJ) or the Grapheme Joiner (ZWJ) in addition to the use of an i or dotless-i codepoint eventually followed by the dot-above diacritic. All this enrichment of text is assumed to be under the control of the markup added to the original text, which does not need to specify whether ligatures should or should not be used.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Philippe Verdy scripsit: Where does the fact of saying that a Grapheme Disjoiner can be used in Turkish to avoid that the f collapses the dot above a next lowercase i? It is settled that ZWNJ is the correct character to break ligatures. ZWJ means make a ligature if you can; if not, shape characters to joining forms if you can; if not that either, do nothing. ZWNJ means break ligatures, if any, and shape characters to non-joining forms, if possible. I'm still convinced that a ligature is still possible for a turkish f, dotted-i sequence, using f, i, dot-above. The ligature would apply to the middle bar of the f joined with the top serif of the i, but the top-right loop of the f would simply be a small horital bar, disjoined from the dot still present on the i. Yes, theoretically. Whether that is good Turkish typography is a different question, which AFAIK prefers simply an f-glyph followed by an i-glyph with no ligaturing. IIRC, Portuguese traditional typography also avoids the fi-ligature, even though the language has no dotless-i. The same ligature could be used for the encoded sequence f, dotless-i, I doubt that any font has a ligature for this combination at all. So the encoded sequence i, dot-above is now made equivalent (for rendering purpose) to dotless-i, dot-above (despite they are not canonically equivalent per UAX#15: NFC/D) and not equivalent to an isolated i (not followed above diacritics)... There is no guarantee that the native i dot looks the same as the dot above in a given font (it may have different vertical kerning or even a different shape), nor is there any guarantee that the i with its dot removed looks the same as the dotless-i. 
-- John Cowan www.ccil.org/~cowan www.reutershealth.com [EMAIL PROTECTED] 'My young friend, if you do not now, immediately and instantly, pull as hard as ever you can, it is my opinion that your acquaintance in the large-pattern leather ulster' (and by this he meant the Crocodile) 'will jerk you into yonder limpid stream before you can say Jack Robinson.' --the Bi-Coloured-Python-Rock-Snake
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 10/07/2003 11:37, Kenneth Whistler wrote: As Peter pointed out, however, it is neither expected nor reasonable to have to go back through and drop in ZWNJs at every relevant location in existing Turkish or Azeri text, simply to prevent fi ligation. Such use of ZWNJ is intended to be exceptional, to deal with special cases. The general solution depends on the use of fonts (or, more generally, renderers) which block such ligation across the board. It is my understanding that modern font technologies allow the choice of ligation to essentially be a style selection for the font. How well various applications take advantage of that and make the choice available easily to end users may be an open issue still, but the fundamental pieces to do this correctly are available. Thank you, Ken. I think you get my point. I am not so interested in character-level mechanisms for disabling the ligature as in higher-level features. But I guess I am really thinking in terms of markup, which might disable ligation, and so outside the domain of Unicode. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
See also http://www.microsoft.com/typography/developers/opentype/detail.htm which explains how ligatures can be turned off on a language-dependent basis. Laurentiu Peter Kirk asked: In Turkish and Azeri the sequences f - i and f - dotless i both occur, and are fairly frequent. So it is inappropriate in these languages to use fi ligatures in which the dot on the i is lost or invisible, at least where the second character is a dotted i. Has any thought been given to this issue? Is it possible to block such ligation on a language-dependent basis?
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
and Philippe Verdy responded with another question: Isn't there a Grapheme Disjoiner format control character to force the absence of a ligature like fi, i.e. f, GDJ, i? The answer to Philippe's rejoinder question is no, there is not a Grapheme Disjoiner format control character. I did not refer to a specific unicode character, I knew that there is one already dedicated, but I did not want to comment about this choice. There's no contractiction. The Grapheme Disjoiner, for you is ZWNJ. OK. ad hominem Every so often, Philippe, it would be refreshing if, when someone points out in error in your claims about the Unicode Standard, that you would simply acknowledge the error and discontinue making the claim, instead of coming back trying to claim that the error was just another way of being right. /ad hominem There is a separate character, U+034F COMBINING GRAPHEME JOINER, which is the grapheme joiner, abbreviation CGJ in the standard. That character has nothing to do with ligation control. There has also been debate, on several occasions, within the UTC, regarding the advisability of encoding a grapheme non-joiner, as a pair with the grapheme joiner. But again, such a grapheme non-joiner -- which has *not* been encoded, by the way -- would have nothing to do with ligation control. So it is a disservice to the list, perpetuating confusion, to invent the term Grapheme Disjoiner and use it in a series of notes regarding ligation control, when the standard already designates the ZWJ and the ZWNJ as the relevant controls related to ligation control. So it is not that for me the Grapheme Disjoiner is the ZWNJ; rather, it is for the Unicode Standard that the ZWNJ is the designated, standardized format control for ligation control of the sort you are talking about. Please learn the terminology and make correct use of it. 
> A font that would automatically select a fi ligature to represent a sequence of f, i codepoints, from the fact that the fi codepoint is canonically equivalent

U+FB01 LATIN SMALL LIGATURE FI is not a *canonical* equivalent to f, i; it is a *compatibility* equivalent. That is an important distinction.

> is probably defective and not conforming.

Wrong. There is nothing nonconformant about fonts automatically ligating f, i (or any other sequence). Such automatic ligation may not always be appropriate or the desired result for an end user, but that has nothing to do with the conformance requirements of the standard.

> Such selection of ligature must be put under the

Wrong. "must" should be "may".

> control of the renderer with additional markup, which can in fact select among three ligatures in Turkish: the fi ligature glyph where the f is ligated with the dot above i (normal ligature for languages other than Turkish/Azeri), the f-dotted-i and f-dotless-i ligatures for Turkish/Azeri.

It is unclear that any such ligatures are required or desirable for Turkish/Azeri, in any case.

> Markup is necessary to select the appropriate glyph, or this

Wrong. A higher-level protocol is needed, and that may involve markup. But the Turkish requirements can equally well be met by simply setting no ligature style settings for the relevant fonts.

> can be selected by using the Grapheme Disjoiner (ZWNJ)

Wrong term. See above.

> or the Grapheme Joiner (ZWJ) in addition to the use of

Wrong term. See above.

> a i or dotless-i codepoint eventually followed by the i-above diacritic.

And in any case, it is inadvisable to be suggesting use of ZWJ and ZWNJ in this way to solve the problem of assuring that Turkish texts don't ligate inappropriately on rendering.

> All this enrichment of text is assumed to be under the control of the markup added to the original text which does not need to specify whether ligatures should or should not be used.

This last clause I agree with.
But the implication that markup has to be added to Turkish text in order to get it to render correctly regarding ligature usage is incorrect. Adding markup to the text is adding to the original text as surely as adding ZWNJ format controls would be. In any case it is unnecessary, since alternatives exist which simply specify suppression (or use) of ligatures stylistically in the fonts. --Ken
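The canonical-versus-compatibility distinction drawn above for U+FB01 can be checked against the Unicode Character Database. What follows is only a minimal sketch using Python's standard unicodedata module; the variable names are illustrative and not part of any standard.

```python
import unicodedata

FI_LIGATURE = "\uFB01"  # LATIN SMALL LIGATURE FI

# Canonical normalization (NFD) leaves the ligature untouched,
# because U+FB01 has no canonical decomposition.
nfd = unicodedata.normalize("NFD", FI_LIGATURE)

# Compatibility normalization (NFKD) maps it to the two letters f, i.
nfkd = unicodedata.normalize("NFKD", FI_LIGATURE)

# The "<compat>" tag in the raw decomposition field marks a compatibility mapping.
print(unicodedata.decomposition(FI_LIGATURE))
print("unchanged under NFD:", nfd == FI_LIGATURE)
print("equals 'fi' under NFKD:", nfkd == "fi")
```

None of this says anything about what a font may or may not ligate; it only shows which kind of equivalence the standard assigns to the ligature character.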
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Peter Kirk [EMAIL PROTECTED] writes:

> Maybe, but it is hardly realistic to expect all existing Turkish and Azeri text to be recoded to insert a character in the middle of each f - i sequence.

But a lot of it already does do that. In TeX, Turkish uses f{}i to block the font's ligation. roff does something similar. I'm sure all of the other text-source publishing systems do as well. Even the WYSI(NR)WYG systems must be doing something to accomplish that.

-JimC

NR: Not Really
Re: Accented ij ligatures (and yery)
On 2003.07.01, 15:09, Pim Blokland [EMAIL PROTECTED] wrote: Maybe it was a bad idea to include ĳ as a character in Unicode at all, but now it's there, there's no reason to ignore it when refining the rules, to deprecate it practically. Food for thought: How would you compare U+0133 (ij digraph) with U+044B (Cyrillic y, yery)? Consider that the latter also consists graphically of two separate letters: U+044A (hard sign) and U+0456 (old i) -- though the first looks rather like U+044C (soft sign). This is an obvious difference, but everything else seems quite comparable. Except nobody on this list is making a big fuss about whether including U+044B in the standard was such a bad idea... ;-) -- . António MARTINS-Tuválkin, | ()| [EMAIL PROTECTED] || R. Laureano de Oliveira, 64 r/c esq. | PT-1885-050 MOSCAVIDE (LRS) Não me invejo de quem tem | +351 934 821 700 carros, parelhas e montes | http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe | http://pagina.de/bandeiras/ a água em todas as fontes |
RE: Accented ij ligatures (was: Unicode Public Review Issues update)
Believe it or not, the IJ and ij digraphs *were* included for compatibility with an 8-bit legacy character set (ISO 6937). 6937 is a multibyte encoding (one or two bytes per character). There are no combining characters at all in 6937, even though there is a common misunderstanding that there are, since the lead bytes are (almost) systematically assigned. Whether that automatically means they should have been assigned canonical instead of compatibility decompositions, I don't know. I think in this case it is correct that the decomposition is a compatibility one. It could have been: none; like for the oe and ae ligatures. This is in contrast to the MICRO SIGN which ideally should have had a canonical decomposition; but Latin-1 characters got special treatment (and ASCII characters have even more special treatment in this regard, where some spacing accents are not decomposed at all). /kent k
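The decomposition types mentioned above (compatibility for the ij digraphs and the MICRO SIGN, none for the ae and oe letters) can be read straight from the character database. A small sketch with Python's standard unicodedata module; the selection of characters is mine, for illustration only.

```python
import unicodedata

# ij, IJ, ae, oe, MICRO SIGN
SAMPLES = ("\u0133", "\u0132", "\u00E6", "\u0153", "\u00B5")

for ch in SAMPLES:
    # decomposition() returns the raw UnicodeData.txt field:
    # an empty string means "no decomposition", a leading <tag>
    # means a compatibility mapping, and no tag means canonical.
    decomp = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomp}")
```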
RE: Accented ij ligatures (was: Unicode Public Review Issues update)
> In either case, the Soft_Dotted property is probably overkill on the existing ij or IJ ligatures (should have been better

There is no point in having a soft-dotted property for the capital letter...

> named letters and not ligatures) for Dutch. Or is this update needed to document officially the expected rendering behavior for sequences ij,acute and ij,macron?

Yes. ij ligature, combining acute should give a dotless ij digraph with an acute accent centred over it; ij ligature, combining double acute should give a dotless ij digraph with an acute on top of each dotless subletter glyph; I'm by now not sure which is the correct one, but the first one can only be produced this way. (And the others are unrelated to the dotless-i and dotless-j, so keep these two out of the pot.)

> The main interest of the Soft_Dotted property is not to describe the rendering for the character,

Yes, it is. I should know, the soft-dotted property was my suggestion in the first place... And please read the note accompanying the public review issue. Not all of the characters in my initial list were actually given the property, however. This is what the current suggestion tries to correct. I know, there are Thai and Khmer letters where a glyph appendage below is removed when there are other things below, like a vowel or a subjoined consonant; and there is as yet no property for that... (But those appendages don't have any similar combining character below either.)

> but to document how case conversions (lowercase, uppercase, titlecase, folded) can be performed safely on

The soft-dotted property is not primarily defined for case mapping, even though it is used there too. Case mapping is documented in the UCD; the cases that are not always the same 1-1 mapping are documented in SpecialCasing.txt. There is no special rule for the ij/IJ combination (even for Dutch) there; and it may be unlikely that there will be one.
It's easier to just use the ij ligature characters (which do have the expected case mapping already)... /kent k
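To make that last point concrete, here is a rough sketch comparing the default, untailored case mappings for a word spelled with the ij ligature character and the same word spelled with the two letters i, j. The example word is my own choice; Python's str methods apply only the default Unicode mappings and know nothing about Dutch.

```python
lig = "\u0133sselmeer"  # spelled with U+0133 LATIN SMALL LIGATURE IJ
two = "ijsselmeer"      # spelled with the two letters i, j

# The ligature character carries its own upper/titlecase mapping (U+0132),
# so the whole digraph is capitalized as one unit.
lig_upper = lig.upper()
lig_title = lig.title()

# With two separate letters, default titlecasing touches only the first one.
two_title = two.title()

print(lig_title == "\u0132sselmeer")
print(two_title)
```

With the two-letter spelling, getting IJsselmeer requires language-specific handling on top of the default mappings.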
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
Kent Karlsson kentk at cs dot chalmers dot se wrote: Believe it or not, the IJ and ij digraphs *were* included for compatibility with an 8-bit legacy character set (ISO 6937). 6937 is a multibyte encoding (one or two bytes per character). There are no combining characters at all in 6937, even though there is a common misunderstanding that there are, since the lead bytes are (almost) systematically assigned. It's still an 8-bit character set. Characters are defined in terms of 8-bit code units; some use one, others use two. This is just like the double-byte character sets used for CJK. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
On Tuesday, July 01, 2003 1:55 PM, Kent Karlsson [EMAIL PROTECTED] wrote: My feeling is that the proposed Public Review document should exclude the ij ligature, waiting for the decision about the new dotless-ij ligature approved in the first rounds by UTC and waiting for approval by ISO JTC... There is no proposal to add any dotless ij ligature character. Please read the pipeline documents more carefully before going off imagining a character that is not being proposed, and is unlikely to be seriously proposed. Sorry, I should have written dotless-j in the last paragraph, for the proposed character at U+0237 (LATIN SMALL LETTER DOTLESS J). For me the ij ligature is mostly used for Dutch, and the few applications where ij,acute and ij,macron are used should be rendering them according to that language, where it is handled as a single letter. In all other cases, the ij ligature should be avoided, simply because there are other better choices with i/dotless-i/I/dotted-I and j/J/proposed-dotless-j, in combination with double diacritics inserted between them to produce the desired effect. In either case, the Soft_Dotted property is probably overkill on the existing ij or IJ ligatures (should have been better named letters and not ligatures) for Dutch. Or is this update needed to document officially the expected rendering behavior for sequences ij,acute and ij,macron? The main interest of the Soft_Dotted property is not to describe the rendering for the character, but to document how case conversions (lowercase, uppercase, titlecase, folded) can be performed safely on the Unicode encoded string. I'd like to know exactly why it is needed for Dutch, as such a ligature is not used in Turkish and Azeri written with the Altaic Latin alphabet...
If fonts still want to display dots on these characters, that's a rendering problem: there already exist a lot of fonts used for languages other than Turkish and Azeri, which do not display any dot on a lowercase ASCII i or j (dotted), and display a dot on their uppercase ASCII versions (normally not dotted with classic fonts)... The absence or presence of these dots is then seen as decorative even if these fonts are not suitable for Turkish and Azeri, but this is clearly not an encoding problem in the Unicode encoded text, and not a problem either for case conversions. The only reason that would justify adding a Soft_Dotted property on ij would be that it is needed to allow the correct handling of language-dependent case conversions. -- Philippe.
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
Michael Everson wrote: I think the answer is, regarding the soft dot property, please leave the ij ligature alone. And I think not. When putting accents on the ĳ (which does happen!), the dots must go. Simple as that. Maybe it was a bad idea to include ĳ as a character in Unicode at all, but now it's there, there's no reason to ignore it when refining the rules, to deprecate it practically. Pim Blokland
Re: Accented ij ligatures
Pim Blokland wrote: When putting accents on the ĳ (which does happen!), the dots must go. Simple as that. Where should the accent be placed in that case? Should the accent be centered over ij? Should there be one accent over i and then the same over j? Or should the accent only be an accent over one of the letters? Stefan
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
On Tuesday, July 01, 2003 4:09 PM, Pim Blokland [EMAIL PROTECTED] wrote: Maybe it was a bad idea to include ĳ as a character in Unicode at all, but now it's there, there's no reason to ignore it when refining the rules, to deprecate it practically. No, that was needed for correct Dutch support. Look at the case conversion of ij into IJ, even with titlecase... The character itself is not breakable in Dutch where it is definitely not a ligature, but a single character, with its own case conversion rule, exactly like the ae and AE letters (considered as ligatures or as unbreakable letters depending on the languages that use them). That's why ij and IJ are not canonically decomposable as i, j and I, J (this is just a compatibility decomposition). If it had only been a shortcut character mapped for compatibility reasons from some 8-bit encodings, it would have been normalized with a canonical decomposition. (The exception to this rule is the inclusion of Arabic ligatures which were clearly and always decomposable, but that could not be canonically decomposed because it would have required more than a character pair for the NFD equivalence, so they are only given an NFKD decomposition and their usage is strongly deprecated, and just included for an unnecessary roundtrip conversion from legacy Arabic encodings.) -- Philippe.
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
Philippe Verdy verdy_p at wanadoo dot fr wrote: Maybe it was a bad idea to include ĳ as a character in Unicode at all, but now it's there, there's no reason to ignore it when refining the rules, to deprecate it practically. No, that was needed for correct Dutch support. Look at the case conversion of ij into IJ, even with titlecase... You don't need a separate character for that. You can use special casing rules. That's why Unicode doesn't have special I and i characters for Turkish. Believe it or not, the IJ and ij digraphs *were* included for compatibility with an 8-bit legacy character set (ISO 6937). Whether that automatically means they should have been assigned canonical instead of compatibility decompositions, I don't know. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
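As a rough illustration of the special-casing approach, here is a sketch of language-tailored case conversion layered on top of the default mappings. The function names upper_tr, lower_tr and title_nl are hypothetical helpers made up for this note, not part of Python or of any Unicode library, and the sketch deliberately ignores the combining dot above sequences that SpecialCasing.txt also covers.

```python
def upper_tr(text: str) -> str:
    """Uppercase with the Turkish/Azeri tailoring: i -> U+0130.

    Dotless i (U+0131) already maps to I under the default rules.
    """
    return text.replace("i", "\u0130").upper()


def lower_tr(text: str) -> str:
    """Lowercase with the Turkish/Azeri tailoring: U+0130 -> i, I -> U+0131."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def title_nl(word: str) -> str:
    """Titlecase a single word with the Dutch tailoring: initial ij -> IJ."""
    lowered = word.lower()
    if lowered.startswith("ij"):
        return "IJ" + lowered[2:]
    return lowered[:1].upper() + lowered[1:]


print(upper_tr("fidan"), lower_tr("I\u015EIK"), title_nl("ijsselmeer"))
```

The same plain letters are stored in every language; only the conversion routine changes with the language, which is the reason no separate Turkish I and i characters were needed.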
Accented ij ligatures (was: Unicode Public Review Issues update)
Philippe Verdy wrote: Interesting issue for the Latin Small ij Ligature (U+0133): Normally the Soft_Dotted property is supposed to make one dot disappear when there's an additional diacritic above, but many applications may keep these two dots above, fitting the diacritic in the middle. This proposal would mean that this becomes illegal, and it promotes the use of an additional intermediate dot-above diacritic if the dot must be kept. I don't know of any instances where an ij digraph would keep the dots AND get additional accent marks, nor of any where the ij would appear with a dotless i and dotless j and a single dot above, centered between them. Can you give examples? Pim Blokland
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
On Monday, June 30, 2003 1:58 PM, Pim Blokland [EMAIL PROTECTED] wrote: Philippe Verdy wrote: Interesting issue for the Latin Small ij Ligature (U+0133): Normally the Soft_Dotted property is supposed to make one dot disappear when there's an additional diacritic above, but many applications may keep these two dots above, fitting the diacritic in the middle. This proposal would mean that this becomes illegal, and it promotes the use of an additional intermediate dot-above diacritic if the dot must be kept. I don't know of any instances where an ij digraph would keep the dots AND get additional accent marks, nor of any where the ij would appear with a dotless i and dotless j and a single dot above, centered between them. Can you give examples? No, of course: the only sequence I know is a dotless ij digraph with a centered acute accent. I just wonder if this public review makes it clear that the presence of an acute accent is supposed to remove both dots. For now I have seen some fonts keeping the two dots when centering an additional acute accent. The text of this update should specify that for this pair, the intended option is to remove both soft dots, if there are other diacritics. But if one wants to restore the previous visual behavior, even if it's incorrect for languages using this digraph as a letter, what would be the behavior of using the following sequence: ij, combining dot above, combining acute (i.e. should this display 1 or 2 dots?) Should the previous incorrect rendering be approximated with: ij, combining diaeresis, combining acute or ij, combining dot above, combining dot above, combining acute ??? -- Philippe.
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
Philippe Verdy [EMAIL PROTECTED] writes:

> But if one wants to restore the previous visual behavior, even if it's incorrect for languages using this digraph as a letter, what would be the behavior of using the following sequence: ij, combining dot above, combining acute (i.e. should this display 1 or 2 dots?)

Seems clear to me that if ij has soft dots (and I agree it should) then to get a pair of dots via a combining accent one should use a two-dot combining accent: U+0308 COMBINING DIAERESIS. So if you want two dots and an acute, use ij, U+0308, U+0301. Of course a given font's diaeresis will often not line up with the stems of its ij, and a custom one should be used instead. Or features and/or ligs as appropriate to the font technology could just use the ij glyph w/ an extra acute. Either way it is a glyph issue rather than a character issue. But it really seems to be just an academic issue, yes?

-JimC
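For anyone who wants to poke at the ij, U+0308, U+0301 sequence suggested above, here is a minimal sketch using Python's standard unicodedata module. It only inspects character properties and normalization; how the result is drawn remains a font matter.

```python
import unicodedata

# ij ligature, COMBINING DIAERESIS, COMBINING ACUTE ACCENT
seq = "\u0133\u0308\u0301"

for ch in seq:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)} {unicodedata.name(ch)}")

# Both marks have canonical combining class 230 (above), so canonical
# reordering never swaps them: the diaeresis stays next to the base and
# the acute stays outside it.
order_preserved = unicodedata.normalize("NFD", seq) == seq
print("order preserved under NFD:", order_preserved)
```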
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
On Monday, June 30, 2003 9:13 PM, James H. Cloos Jr. [EMAIL PROTECTED] wrote: So if you want two dots and an acute, use ij, U+0308, U+0301. Of course a given font's diaeresis will often not line up with the stems of its ij, and a custom one should be used instead. Or features and/or ligs as appropriate to the font technology could just use the ij glyph w/ an extra acute. Either way it is a glyph issue rather than a character issue. Doesn't it create a new equivalence for the sequences ij, diaeresis and ij if neither of them is followed by another combining above diacritic? If we don't want such equivalences, the Unicode standard should say then that it's illegal to use two consecutive identical combining diacritics. Or simply forbid using ij,diaeresis alone (not followed by another diacritic with CC=230). Yes, this is really tricky, and academic, I admit. But what forbids encoding two superposed arrows above any letter? Or encoding an ij,macron (with the dots removed from ij) followed by diaeresis, which could have a mathematical meaning? -- Philippe.
Re: Accented ij ligatures (was: Unicode Public Review Issues update)
I think the answer is, regarding the soft dot property, please leave the ij ligature alone. -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: List of ligatures for languages of the Indian subcontinent.
Thank you for your comments. I am not going to attempt to produce the list of ligatures myself. I am writing the paper to draw attention to the problem which exists in relation to the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system of interactive broadcasting and its application to the languages of the Indian subcontinent, and hopefully to provide a software format for resolving it.

It appears that the software requirement is essentially as follows, if one wishes to use a font-based method of display with an ordinary font. Receive a stream of input characters encoded in regular Unicode UTF-16 format suitable for processing as Java char items. Output a local stream of Java char suitable to be used in a Java drawString method with an ordinary font.

As far as I can tell at present, the eutocode typography file format could be used to produce char codes for conjunct forms and for dealing with matras by scanning whole words, in that the changes needed seem always to be within a word and that there is no carry over to a following word. http://www.users.globalnet.co.uk/~ngo/ast03300.htm

The discussion has led me to believe that it would be helpful for me to add an additional possibility to a eutocode typography file, using two presently unallocated codes. I have not yet finally decided which particular two, nor finalized their definitions, as I am open to any suggestions for improvement, yet here is the idea. For the moment I refer to them as U+EBEX and U+EBEY. A eutocode typography file could contain a line as follows.

sequence1 U+EBEX sequence2 U+EBEY sequence3

The spaces in the above line are for setting out the line clearly here; in use the spaces would not be there. Such a line would have the meaning as follows. Carry out the replacement sequence2 U+EBEF sequence3 if and only if sequence1 matches the sequence stored in the language choice string.
The sequence1 sequence is expressed using none or more characters from the range U+0020 to U+007E and is the decoded result of the latest use of a sequence of plane 14 language tags. The idea is that the plane 14 tags would be used to signal particular languages, represented as in international standards, though the eutocode typography file will only define a sequence as such, not compliance with any list of languages. Would this be sufficient to provide a way to guide a Java program to produce an output stream of Java char to use to access an ordinary font in order to render languages of the Indian subcontinent, provided that a eutocode typography file and a font were supplied? I recognize that the preparing of the eutocode typography file and the ordinary font containing the glyphs is a large task and I am not going to try to do it myself. However, if I can publish a software format which has the capability to solve the problem and can draw attention to the need to prepare the list and to prepare fonts which implement the list in part or in full together with eutocode typography files which can be used so that the fonts can be applied in applications, and can also produce a wish for the list to be a published open resource with a view to helping interoperability then I feel that that is about as far as I can go in this topic at the moment. However, I do feel that acting now may well be beneficial as a well known infrastructural method will be available for consideration when people want to produce such displays on interactive television displays. This is but one of a number of ideas for techniques to use in content authorship for the DVB-MHP platform. http://www.users.globalnet.co.uk/~ngo/ast03200.htm In relation to the font of colour codes downloadable from the following page. 
http://www.users.globalnet.co.uk/~ngo/font7001.htm I have now produced a test version which includes those colour codes and also four for point size and 28 others for various aspects of access level multimedia authoring. This includes codes for variations of object replacement character defined within the Private Use Area. One is OBJECT REPLACEMENT CHARACTER SYNONYM because trying to place a U+FFFC into some wordprocessors can cause problems if the wordprocessor also accepts graphics and uses U+FFFC for that. The others are OBJECT REPLACEMENT CHARACTER with left, centre and right alignment. The rest are mostly to do with producing a basic programmed learning capability within a plain text file, including such items as GREEN MARKER and so on so that when a push button is pushed all input characters are skipped until a marker of the corresponding colour is reached. There are also a SKIP UNTIL CONTINUE and a CONTINUE MARKER so that programmed learning layouts following simple flow charts may be expressed in a sequential manner within a file. Thank you for your interest in reading through all of this posting. I have recently produced an ornaments font, which I am hoping to write up for the web, and wonder if you
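The language-conditional replacement idea described earlier in this posting can be sketched roughly as follows. This is only an illustration under stated assumptions: it is written in Python rather than Java for brevity, it does not parse any real eutocode typography file, the function names split_language_tag and apply_rule are invented for this sketch, and the Private Use code U+E000 in the usage example is an arbitrary stand-in for a conjunct glyph. The only standard facts relied on are that U+E0001 is LANGUAGE TAG and that the plane 14 tag characters U+E0020 to U+E007E mirror U+0020 to U+007E.

```python
LANGUAGE_TAG = 0xE0001                   # U+E0001 LANGUAGE TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E   # tag counterparts of U+0020..U+007E
TAG_OFFSET = 0xE0000


def split_language_tag(text: str) -> tuple[str, str]:
    """Return (language choice string, text with the tag sequence removed).

    Simplification: the latest tag in the text is taken to apply to all of it.
    """
    language = ""
    visible = []
    in_tag = False
    for ch in text:
        cp = ord(ch)
        if cp == LANGUAGE_TAG:
            language, in_tag = "", True  # a new tag replaces the previous choice
        elif in_tag and TAG_FIRST <= cp <= TAG_LAST:
            language += chr(cp - TAG_OFFSET)
        else:
            in_tag = False
            visible.append(ch)
    return language, "".join(visible)


def apply_rule(text: str, required_language: str, target: str, replacement: str) -> str:
    """Replace target by replacement only if the language choice string matches."""
    language, visible = split_language_tag(text)
    if language == required_language:
        return visible.replace(target, replacement)
    return visible
```

For example, with the tag sequence spelling "hi" placed in front of the Devanagari sequence KA, VIRAMA, SSA, a rule requiring "hi" would swap that sequence for the stand-in code, while a rule requiring "mr" would leave the text alone:

```python
tag = chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in "hi")
conjunct = "\u0915\u094D\u0937"
print(apply_rule(tag + conjunct, "hi", conjunct, "\uE000") == "\uE000")
print(apply_rule(tag + conjunct, "mr", conjunct, "\uE000") == conjunct)
```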
RE: List of ligatures for languages of the Indian subcontinent.
Kenneth Whistler wrote:

> Dream on. The information needed exists in books and other reference source in libraries, book shops, and other collections across India -- and, for that matter, around the world. It is merely a matter of collecting the relevant information and distilling it into succinct, yet complete, statements of the relevant information needed for proper typographic practice for each script, for each style of each script, for each local typographic tradition for each style, and so on.

A couple of hints for William and other people interested in this issue:

- Akira Nakanishi, Writing Systems of the World -- Alphabets, Syllabaries, Pictograms, Tuttle 1980 (1999), ISBN 0804816549. This charming little book explores all the scripts used in the world today, giving for each one of them a table of all the signs (apart from Chinese, of course) and an explanation of how the script works. For each script, it also reproduces a page from a daily newspaper written in that script. The information is not always 100% accurate; however, the book remains an invaluable introduction to the scripts of the world, and a perfect complement to the reading of the Unicode Standard.

- The grammars in the National Integration Series by Balaji Publications, Madras, India. Each grammar in this series is a small A5-format book bearing a title like: Learn [language name] in 30 Days through English. The grammars are not very sound from a linguistic point of view (it's unlikely that the reader will actually learn an Indian language in one month!), but they all have a very interesting introduction to the script used by each language, which also normally includes a table of all the combinations of consonant+vowel, and a table of the essential consonant clusters, and of half or subjoined consonants. If you compare the grammars of languages sharing the same script (such as Sanskrit, Hindi, and Marathi, all written with the Devanagari script), you can verify how the list of required ligatures varies from one language to another. Notice that these books, too, are far from being 100% accurate.

All the above books are low-priced and are easily found in bookshops in the UK and elsewhere. Another good source for making a list of required glyphs is the existing non-Unicode fonts for Indic languages. The nicest free collection I have seen so far is the Akruti GNU TrueType fonts, which contain a set of glyphs appropriate for most modern usages: http://www.akruti.com/freedom/

_ Marco
List of ligatures for languages of the Indian subcontinent. (from Re: per-character stories in a database)
And nobody out there is volunteering to do it. I would do it gladly, but I do not have any skills at Indian languages. My opinion is that the list is important for the future of digital interactive broadcasting so I am trying to get the list done so that it is ready for use in displaying distance education texts in interactive broadcasting situations across the Indian subcontinent using my telesoftware invention. I was told that I could commission it. I described what I thought was a good design brief for the list and asked how much it would cost. I am still waiting to find out. A lot of the information needed to prepare the numbered list is apparently in files, it is just that it is not available to people. If the Unicode Consortium really does not wish to include this important project within its scope, then it will need to be achieved in some other manner. I would have thought that whether the Unicode Consortium will take this project on or not should go to a formal board meeting of the Unicode Consortium so that there can be no doubt whatsoever of the provenance of any decision. William Overington 17 March 2003
Re: List of ligatures for languages of the Indian subcontinent. (from Re: per-character stories in a database)
A few observations, so that William will understand the scope and some of the issues of what he is proposing. 1. For some Indic scripts, including Devanagari, there is no fixed set of 'ligatures' that would be normative for every typeface, or for every language using the script. So even for a single script you would be looking at multiple lists, with the same combination of characters likely represented in different ways for different languages. 2. The idea of a 'ligature', as it exists in the Latin script, is not really found in Indic scripts. This terminology derives from the application of particular typecasting and typesetting technologies to Indic scripts. So while some aspects of some Indic scripts may, with relative accuracy, be spoken of as ligatures in some font formats (e.g. the 'akhand' feature of OpenType that forms obligatory 'ligatures'), it is not necessary that Indic scripts require mapping of multiple characters to single glyphs. This is simply one model for rendering one aspect of Indic scripts. [As a parallel, consider Tom Milo's ligature-free approach to Arabic, another script widely and erroneously assumed to involve ligatures.] 3. As Rick has already alluded to re. Tibetan, it is far from necessary for all the *graphemes* of a script to be represented by individual, ligature glyphs. A grapheme may be composed of single glyphs and/or ligatures combined with dynamically positioned mark glyphs. Building or even cataloguing every possible grapheme -- every combination of base glyph, ligature and mark(s) in a script -- is an incredibly inefficient approach to Indic rendering. 4. Cataloguing and publishing known consonant conjunct forms for Indic scripts is a good idea and a worthwhile goal, which would indeed be a valuable resource for font developers. Michael Everson has indicated that he has what he considers a comprehensive list for Devanagari, and I probably have something close to comprehensive in my own files and books. 
However, William should not delude himself that such a catalogue would represent all that is necessary to rendering Indic scripts in the technologies that interest him. Once you have the conjuncts catalogued, and have identified subsets of conjuncts that are appropriate to the languages that you intend to support, you still need to implement shaping and positioning for matras relative to every base glyph and every conjunct. William writes: '...I do not have any skills at Indian languages.' While some may find his enthusiasm admirable, it would be a good idea for him to develop such skills before he starts writing papers on implementing such languages for digital interactive broadcasting or any other technology. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] Anyone who has both children and house pets has surely noticed that the children exposed to language will develop language, in turn, whereas the house pets will not. - Stephen Pinker
Re: List of ligatures for languages of the Indian subcontinent.
William Overington asked: And nobody out there is volunteering to do it. I was told that I could commission it. That statement by Michael Everson was not a *permission*, but merely a statement of fact. Anyone can commission any expert they like, under contract to produce whatever output or specification the purchaser would like. That includes you. I described what I thought was a good design brief for the list and asked how much it would cost. I am still waiting to find out. Well, the short answer is that it would cost a *lot*. But don't expect the Unicode discussion list to price out contracts for you. :-) A lot of the information needed to prepare the numbered list is apparently in files, it is just that it is not available to people. Dream on. The information needed exists in books and other reference source in libraries, book shops, and other collections across India -- and, for that matter, around the world. It is merely a matter of collecting the relevant information and distilling it into succinct, yet complete, statements of the relevant information needed for proper typographic practice for each script, for each style of each script, for each local typographic tradition for each style, and so on. And once you start down that road -- as John Hudson pointed out -- you would quickly find that the problem is not one of enumerating the list of required ligatures, but is rather more complicated than that -- and that the term ligature is not even the pertinent typographic construct of most interest to Indian rendering. If the Unicode Consortium really does not wish to include this important project within its scope, It does not. then it will need to be achieved in some other manner. Just so. --Ken
Re: Ligatures fj etc (from Re: Ligatures (qj) )
Yesterday, 13 March 2003, I wrote as follows. quote So I reasoned that the system might scan through a font when it is loaded and decide upon the lowest point for the whole font and then proceed on that basis. end quote An email correspondent has kindly written to me privately and I now know that it is not necessary for an application such as a wordprocessing package to make a complete survey of all the glyphs in a font as the font is being loaded, because the information on what are the high and low points for the font is readily available in predefined locations within the font. I expect that many readers of this list already know that, yet I feel that I should post this note in case some readers do not because I would not want to have set them off on a wrong way of looking at how a system works. William Overington 14 March 2003
Ligatures fj etc (from Re: Ligatures (qj) )
Thank you both for your responses. Yes, U+2502 or U+2503 would achieve the desired effect for which I devised U+E700 STAFF without resorting to the Private Use Area. The only reason for my not using one of those was that I was unaware of those codes as such. An interesting point is that they appear to be usable with fonts which have descenders yet still fill the entire height of the font. I suppose that when I had, some time ago, when looking through what Unicode offers, in a general context, not looking for the STAFF effect at that time, seen the box drawing characters I thought of those characters in the context of the character set of the old PET computer from the 1970s and of the way that some software on older non-graphics terminals on mainframe computers makes an attempt at message windows using such characters to construct boxes. Indeed, an interesting footnote to U+2502 states = Videotex Mosaic DG14. I cannot quite remember what Videotex was. I remember Videotext (with a t at the end) and seem to remember that Videotex (no t at the end) was a different system, possibly from the USA or maybe France. There was also a system which started called NAPLPS, which was an acronym for something like North American something and the word Presentation was in it, though I forget the exact acronym derivation. I was unaware of the VDMX table and so had a look at http://www.yahoo.com and found a couple of useful documents. However, VDMX appears to refer specifically to OpenType rather than ordinary TrueType. 
My reason for including the STAFF character, the intended effect of which I can now produce using U+2502 or U+2503, was that, being fairly new to producing fonts and just, thus far, using the Softy editor to produce ordinary TrueType fonts, I had noticed, when trying it out in 2002, that if I produce a font with a b c d e f then the font displays with lines packed together, yet that if I then add g the line spacing for all lines increases, even if there is no g in that line. So I reasoned that the system might scan through a font when it is loaded and decide upon the lowest point for the whole font and then proceed on that basis. Now, in defining Quest text I wanted to have the possibility of accents on capital letters and descenders such as y and g and always look clear, so I decided effectively to lock some leading into the font and set the maximum height right from the start. Features of Quest text are that it is designed so that characters are produced directly from drawings in the Softy editor, not from template graphics, and that Quest text is designed, as far as possible, by the application of a set of rules, such as that verticals are all 256 font units wide, with both edges at a font unit value which is a multiple of 256, that horizontals are all 168 font units in vertical height with one edge at a font unit value which is a multiple of 256, and that corners which are curved are curved with a single Bézier curve which has an action length, as I call it, of 128 font units in both horizontal and vertical directions. Some characters, such as x and k are exceptions to the general rules, yet Quest text is largely made up of horizontals and verticals, including for letters such as A O e and s. The idea is that hopefully Quest text will be very clear at both 12 point and 18 point and that, as point size increases, it will display its artistic look. 
At 300 point, Quest text looks smooth and rounded with an elegant combining of wider verticals with narrower horizontals, almost as if drawn with a pen with a nib 256 font units wide and 168 font units high. The rules do produce the effect though that capitals look lighter than lowercase letters as they are overall wider and yet use the same width verticals. I am wondering whether to consider that a fault or a feature! :-) An important part of the development process of Quest text is to display some text at 12 point in WordPad, make a Print Screen graphic and paste it into Paint and then study the graphic at 8x magnification. Hopefully Quest text combines great clarity with an artistic look. William Overington 13 March 2003
RE: Ligatures
probably didn't come out right. I never meant to say moving the characters apart was the best solution. Moving only the offending accent mark rather than the entire (composite) character might help in some cases, but this technique also should be used with care. Like in the case of "Te", if you have a very wide T and a very small e, any accent on the e would end up to the far right of it if you force avoiding collision with the T. So in this case I think you can't help putting the e and the T further apart if the e has an accent than if it doesn't. Then you have kerned the T and (unaccented) e too close to begin with, which is bad (taste)... This also depends on the font. There is no universal solution! I may agree with that. But changing the kerning (relative to what is done for the base letters) is WAY down in the list of actions that should be taken. /kent k
Re: Ligatures fj etc (from Re: Ligatures (qj) )
At 02:21 AM 3/13/2003, William Overington wrote: My reason for including the STAFF character, the intended effect of which I can now produce using U+2502 or U+2503, was that, being fairly new to producing fonts and just, thus far, using the Softy editor to produce ordinary TrueType fonts, I had noticed, when trying it out in 2002, that if I produce a font with a b c d e f then the font displays with lines packed together, yet that if I then add g the line spacing for all lines increases, even if there is no g in that line. So I reasoned that the system might scan through a font when it is loaded and decide upon the lowest point for the whole font and then proceed on that basis. Linespacing in typical Windows apps is controlled by OS/2 table vertical metrics WinAscent and WinDescent. My guess, from your description, is that Softy automatically prevents clipping by assigning OS/2 table values based on the max height of the font bounding box (the height from the lowest descent to the highest ascent). Is there no way to manually set OS/2 values in Softy? If not, you should get yourself a proper font tool. FontLab is best, but Font Creator from High Logic is a pretty good and much cheaper option. I think this is getting off topic for this list. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] It is necessary that by all means and cunning, the cursed owners of books should be persuaded to make them available to us, either by argument or by force. - Michael Apostolis, 1467
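A rough sketch of the behaviour guessed at above may help readers who have not met these metrics. It is only an illustration, not a description of what Softy actually does: it assumes a font tool derives the two Windows metrics from the font bounding box, and that an application uses their sum as the line height. The glyph extents are made-up values in font units, and the function names are hypothetical.

```python
# Illustrative only: glyph extents are invented (yMin, yMax) pairs in font units.
# Assumption: the tool sets the Windows ascent/descent from the font bounding box,
# and the application uses ascent + descent as the height of every line.

def win_metrics(glyph_extents):
    """Return (ascent, descent) as non-negative distances from the baseline."""
    y_max = max(y_hi for _, y_hi in glyph_extents.values())
    y_min = min(y_lo for y_lo, _ in glyph_extents.values())
    return max(y_max, 0), max(-y_min, 0)


def line_height(glyph_extents):
    """Line height implied by the font-wide metrics, whatever text is on the line."""
    ascent, descent = win_metrics(glyph_extents)
    return ascent + descent


# A font holding only a b c d e f: nothing goes below the baseline.
abcdef = {
    "a": (0, 1024), "b": (0, 1536), "c": (0, 1024),
    "d": (0, 1536), "e": (0, 1024), "f": (0, 1536),
}

# The same font after a g with a descender is added.
with_g = dict(abcdef, g=(-512, 1024))

print(line_height(abcdef))  # 1536
print(line_height(with_g))  # 2048
```

Under these assumptions the line height grows for every line once g is in the font, whether or not a given line contains a g, because the metrics are font-wide values stored once rather than something measured per line.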
Ligatures fj etc (from Re: Ligatures (qj) )
John Hudson wrote as follows. quote If you don't intend to use the PUA codepoint in text, there really is no point in having it at all. end quote Well, one useful scenario is as follows. Suppose please that one wishes to process incoming regular Unicode text, using a eutocode typography file to influence the process, details of the format on the http://www.users.globalnet.co.uk/~ngo/ast03300.htm web page, and then use the output Unicode format text stream as codes to look up glyphs in an ordinary TrueType font, so as to produce a display which includes using some ligature glyphs. Having a code such as U+E70B for fj and codes for other characters as part of a consistent set which is published has the advantage that if various software authors use the eutocode typography file format, and various people spend time encoding specific eutocode typography files, (such as for 18th Century English printing with long s ligatures, German Fraktur printing and the ligatures of languages of the Indian subcontinent), and various people produce ordinary TrueType fonts with ligature glyphs encoded using consistent lists of published Private Use Area code points for ligatures, then the existence of the list of Private Use Area code points may well help in interoperability, so that, for example, having looked at the result using a font produced by one artist one may have a look at the result using a font produced by another artist without needing to change the contents of the particular eutocode typography file being used for the processing and having then to reprocess the original text using that second eutocode typography file. 
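The processing step described above can be sketched roughly as follows. This is not the eutocode typography file format itself, which is documented on the web page mentioned; it is a minimal illustration that assumes the file boils down to a table from letter sequences to Private Use Area code points. Only U+E70B for fj is taken from the message; the table layout and the function name are hypothetical.

```python
# Hypothetical stand-in for the contents of a typography file: a table from
# letter sequences to single Private Use Area code points.
# Only the fj -> U+E70B entry comes from the message; the rest is assumed.
LIGATURES = {"fj": "\uE70B"}


def apply_ligatures(text, table=LIGATURES):
    """Replace each listed sequence with its PUA code point, longest match first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No listed sequence starts here, so pass the character through.
            out.append(text[i])
            i += 1
    return "".join(out)


processed = apply_ligatures("fjord")
print([f"U+{ord(ch):04X}" for ch in processed])
# ['U+E70B', 'U+006F', 'U+0072', 'U+0064']
```

The output stream can then be used to look up glyphs in any ordinary TrueType font that places its fj glyph at the same code point, which is the interoperability argument being made: the table stays the same when the font is swapped.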
Another use is that preparing some text using WordPad and other programs, not for interchange but just for, say, producing a local print of a poster, having a consistent, widely used set of Private Use Area code points for ligatures would mean that a poster designer could try out a number of fonts from various artists without needing to reset the text each time using whatever code points each font designer used for each particular ligature glyph. I would mention that my thinking on using Private Use Area codes for ligatures has gradually moved towards the use of the eutocode typography file rather than interchanging files using Private Use Area code points for ligatures, yet I do feel that, for local use such Private Use Area allocations for ligatures as the golden ligatures collection provides are potentially useful as they do provide for interoperability of fonts which contain ligatures which fonts are produced by a variety of artists. Use of the golden ligatures collection is entirely optional, yet it can be used to try to achieve some level of interoperability of fonts. Indeed, font designers who produce fonts using advanced font technologies, where the conversion tables are internal to the font rather than external as with the eutocode typography file, where the glyphs for ligatures are not accessed directly may, if they choose, make use of the code point allocations of the golden ligatures collection so as to allow the glyphs also to be accessed from other platforms with a hope of some level of interoperability. Certainly, using the code points of the golden ligatures collection is not using regular Unicode code point allocations, yet as a self-help facility amongst end users so that use of fonts containing ligatures is easier, the golden ligatures collection is perhaps of some practical use. 
I accept that the use of Private Use Area encodings does not guarantee compatibility, yet one can take care to try to make the use of Private Use Area codes for ligatures and other characters as graceful as possible. For example, although there is absolutely no requirement at all for me to do so, and no one has asked me to do so, I decided to make sure that no golden ligatures code point allocations made in the future will clash with the code points used for Phaistos Disc Script in the ConScript Registry. I am happy to point out, in addition, that I do quite like the idea of a link with traditional letterpress printing where each ligature character was cast as one piece of metal for the whole ligature and one could actually pick them up and place them in a composing stick, so the golden ligatures collection is about art and nostalgia as well as about technology and practicality of achieving a stylish display using computing equipment. I have added a new code recently, which is U+E700 STAFF which is a vertical line from the very top of the glyph and going as far below the 0 line as one chooses for a particular font. With Quest text I encoded this character early with a line going vertically from -768 font units to 2048 font units. This forces the overall display height of the font before I added either of lowercase y and g, which in fact go down to -512 font units in Quest text, so the U+E700 character within the font helps in the display process even though the character is not usually
Re: Ligatures fj etc (from Re: Ligatures (qj) )
William Overington wrote, I have added a new code recently, which is U+E700 STAFF which is a vertical line from the very top of the glyph and going as far below the 0 line as one chooses for a particular font. With Quest text I encoded this character early with a line going vertically from -768 font units to 2048 font units. Since the full height box drawing glyphs are supposed to join vertically, wouldn't adding something like U+2502 or U+2503 to a font achieve the desired effect without resorting to the PUA? Best regards, James Kass