Re: Ligatures in Turkish and Azeri
On 2003.07.10, 20:34, John Cowan <[EMAIL PROTECTED]> wrote: > IIRC, Portuguese traditional typography also avoids the fi-ligature, > even though the language has no dotless-i. Just browsed some old book with that in mind and I cannot really corroborate this. I've even seen some other more exotic ligatures, such as "st" and "ct". Maybe there was such a recommendation in some Portuguese typesetting manual, but its result doesn't show... -- . António MARTINS-Tuválkin, | ()| <[EMAIL PROTECTED]> || R. Laureano de Oliveira, 64 r/c esq. | PT-1885-050 MOSCAVIDE (LRS) Não me invejo de quem tem | +351 934 821 700 carros, parelhas e montes | http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe | http://pagina.de/bandeiras/ a água em todas as fontes |
Re: Ligatures in Turkish and Azeri
On 2003.07.12, 20:59, Anto'nio Martins-Tuva'lkin <[EMAIL PROTECTED]> wrote: > Just browsed some old book with that in mind I meant "books", plural, here. And I'll keep an eye out for this in the future. -- António MARTINS-Tuválkin
Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 1st July Philippe Verdy wrote:

> If fonts still want to display dots on these characters, that's a
> rendering problem: there already exists a lot of fonts used for
> languages other than Turkish and Azeri, which do not display any
> dot on a lowercase ASCII i or j (dotted), and display a dot on their
> uppercase ASCII versions (normally not dotted with classic fonts)...
>
> The absence or presence of these dots is then seen as "decorative"
> even if these fonts are not suitable for Turkish and Azeri, but
> this is clearly not an encoding problem in the Unicode encoded text,
> and not a problem either for case conversions.

Turkish and Azeri do not use the ij ligature. The sequences i - j and dotless i - j do occur (rarely, as j is a rare letter in both languages) but are treated as separate letters.

In Turkish and Azeri the sequences f - i and f - dotless i both occur, and are fairly frequent. So it is inappropriate in these languages to use fi ligatures in which the dot on the i is lost or invisible, at least where the second character is a dotted i. Has any thought been given to this issue? Is it possible to block such ligation on a language-dependent basis?

Also it is certainly possible that in dictionaries etc in these languages stress might be marked by an accent on the vowel - as certainly in the older Cyrillic Azeri just as in Bulgarian as just posted. In this case the dot should not be removed from the dotted i when the stress mark is added, so that the distinction from dotless i is not lost. Has that issue been addressed? (In my Latin script Azeri dictionary stress is marked by a spacing grave accent before the vowel, but this may have been done precisely to work around this problem.)

-- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 12:08 PM, Peter Kirk <[EMAIL PROTECTED]> wrote: > On 1st July Philippe Verdy wrote: > > > If fonts still want to display dots on these characters, that's a > > rendering problem: there already exists a lot of fonts used for > > languages other than Turkish and Azeri, which do not display any > > dot on a lowercase ASCII i or j (dotted), and display a dot on their > > uppercase ASCII versions (normally not dotted with classic fonts)... > > > > The absence or presence of these dots is then seen as "decorative" > > even if these fonts are not suitable for Turkish and Azeri, but > > this is clearly not an encoding problem in the Unicode encoded text, > > and not a problem either for case conversions. > > > > Turkish and Azeri do not use the ij ligature. The sequences i - j and > dotless i - j do occur (rarely, as j is a rare letter in both > languages) but are treated as separate letters. I know, and the quoted paragraph did not speak about the ij ligature but effectively about the separate dotted/dotless i/I letters, for which "decorated" fonts where the lowercase ASCII (dotted) i codepoint uses a dotless glyph, or the uppercase ASCII (dotless) I codepoint uses a dotted glyph (some fonts are ligating the dot with decorative curves). These fonts are effectively not suitable for Turkish and Azeri. > In Turkish and Azeri the sequences f - i and f - dotless i both occur, > and are fairly frequent. So it is inappropriate in these languages to > use fi ligatures in which the dot on the i is lost or invisible, at > least where the second character is a dotted i. Has any thought been > given to this issue? Is it possible to block such ligation on a > language-dependent basis? Isn't there a "Grapheme Disjoiner" format control character to force the absence of a ligature like , i.e. ? 
> Also it is certainly possible that in dictionaries etc in these > languages stress might be marked by an accent on the vowel - as > certainly in the older Cyrillic Azeri just as in Bulgarian as just > posted. In this case the dot should not be removed from the dotted i > when the stress mark is added, so that the distinction from dotless i > is not lost. Has that issue been addressed? (In my Latin script Azeri > dictionary stress is marked by a spacing grave accent before the > vowel, but this may have been done precisely to work around this > problem.) This is part of the proposal for review: an explicit combining dot-above diacritic can be inserted between the normal (soft-dotted) base letter and the above diacritic (with class 230): -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 10/07/2003 08:21, Philippe Verdy wrote:

> > In Turkish and Azeri the sequences f - i and f - dotless i both occur,
> > and are fairly frequent. So it is inappropriate in these languages to
> > use fi ligatures in which the dot on the i is lost or invisible, at
> > least where the second character is a dotted i. Has any thought been
> > given to this issue? Is it possible to block such ligation on a
> > language-dependent basis?
>
> Isn't there a "Grapheme Disjoiner" format control character to force the
> absence of a ligature like , i.e. ?

Maybe, but it is hardly realistic to expect all existing Turkish and Azeri text to be recoded to insert a character in the middle of each f - i sequence.

-- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 5:41 PM, Peter Kirk <[EMAIL PROTECTED]> wrote: > > Isn't there a "Grapheme Disjoiner" format control character to > > force the absence of a ligature like , i.e. ? > > > Maybe, but it is hardly realistic to expect all existing Turkish and > Azeri text to be recoded to insert a character in the middle of each > f - i sequence. Note also: the Soft_Dotted property was created and considered specially for Turkish and Azeri. In this language context the ASCII i is always rendered with a dot, kept also for uppercases. The other solution would be to use : the forced dot-above diacritic avoids the ligature, and the sequence is rendered by two glyphs for and , i.e. the glyph for , and the force-dotted glyph for . Its uppercase conversion cause no problem: = + = + As well as additional stress diacritics: = + = + = + -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 10/07/2003 09:34, Stefan Persson wrote:

> Peter Kirk wrote:
> > Maybe, but it is hardly realistic to expect all existing Turkish and
> > Azeri text to be recoded to insert a character in the middle of each
> > f - i sequence.
>
> Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar code
> pages? In that case, it would be enough to add the proper disjoiners to
> the proper Unicode conversion tables.
>
> Stefan

There is no existing code page covering Azeri Latin, so everything is in Unicode or in one of a huge variety of custom solutions. See http://www.azer.com/aiweb/categories/magazine/81_folder/81_articles/81_standardfonts.html, and the article "The Land of Azeri Fonts: It's a Jungle Out There" in the same magazine issue, unfortunately not online, which summarises 20 or so custom encodings all in current use.

Anyway, I understood from the recent discussion of Hebrew that it is Unicode policy not to do anything which could theoretically invalidate existing text even if it could be proved that no such text existed.

-- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Peter Kirk wrote: > Maybe, but it is hardly realistic to expect all existing Turkish and Azeri text to be recoded to insert a character in the middle of each f - i sequence. Isn't most Turkish and Azeri text coded as ISO-8859-9 and similar code pages? In that case, it would be enough to add the proper disjoiners to the proper Unicode conversion tables. Stefan
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 6:42 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:

> Anyway, I understood from the recent discussion of Hebrew that it is
> Unicode policy not to do anything which could theoretically invalidate
> existing text even if it could be proved that no such text existed.

How would saying that a Grapheme Disjoiner can be used in Turkish, to keep the f from swallowing the dot above a following lowercase i, invalidate anything? This does not change anything: existing texts can still produce ligatures in a renderer, unless the renderer is explicitly told not to do so with a Grapheme Disjoiner, or is specially tuned to support the Turkish/Azeri languages. Existing texts do not need to be reencoded if they are already correctly labelled with their language.

The absence of such a language specifier will never forbid a renderer to choose an fi ligature if available, unless these renderers are made conforming by correctly interpreting the Grapheme Disjoiner to mean "break the grapheme cluster here, and display the previous character(s)"; the Grapheme Disjoiner itself can then be rendered as a non-spacing empty glyph, followed by the rest of the string...

I'm still convinced that a ligature is still possible for a Turkish sequence, using . The ligature would apply to the middle bar of the joined with the top serif of the , but the top-right loop of the f would simply be a small horizontal bar, disjoined from the dot still present on the i. The same ligature could be used for the encoded sequence , so an actual font would render the glyphs for as a base ligature glyph for (with a top horizontal bar for the part), and add separately the glyph kerned into the existing ligature. To force disable this last ligature, we would use the encoded sequence According to Unicode the sequence has always been valid, although it apparently has the same dotted glyph for all languages. 
It differs only in the fact that the explicit removes the Soft_Dotted property of the previous to make it dotless, followed by a forced diacritic. So the encoded sequence is now made "equivalent" (for rendering purpose) to (despite they are not canonically equivalent per UAX#15: NFC/D) and not "equivalent" to an isolated (not followed above diacritics)... -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Peter Kirk asked:

> > In Turkish and Azeri the sequences f - i and f - dotless i both occur,
> > and are fairly frequent. So it is inappropriate in these languages to
> > use fi ligatures in which the dot on the i is lost or invisible, at
> > least where the second character is a dotted i. Has any thought been
> > given to this issue? Is it possible to block such ligation on a
> > language-dependent basis?

and Philippe Verdy responded with another question:

> Isn't there a "Grapheme Disjoiner" format control character to force the
> absence of a ligature like , i.e. ?

The answer to Philippe's rejoinder question is no, there is not a "Grapheme Disjoiner" format control character.

What Philippe has in mind, however, is covered in the standard by the interaction of the joiner and non-joiner characters with ligature control:

"U+200C ZERO WIDTH NON-JOINER is intended to break both cursive connections and ligatures in rendering.

"ZWNJ requests that glyphs in the lowest available category (for the given font) be used."
-- Unicode 4.0, Section 15.2, Layout Controls

The categories referred to, from lowest to highest, are:

1. unconnected
2. cursively connected
3. ligated

As Peter pointed out, however, it is neither expected nor reasonable to have to go back through and drop in ZWNJs at every relevant location in existing Turkish or Azeri text, simply to prevent fi ligation. Such use of ZWNJ is intended to be exceptional, to deal with special cases.

The general solutions depend on the use of fonts (or more generally, renderers) which block such ligation across the board. It is my understanding that modern font technologies allow the choice of ligation to essentially be a style selection for the font. How well various applications take advantage of that and make the choice available easily to end users may be an open issue still, but the fundamental pieces to do this correctly are available.

--Ken
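For anyone wanting to see what the exceptional, character-level route looks like in practice, here is a minimal sketch in Python. It assumes plain strings in which the dotted i is U+0069 and the dotless i is U+0131; the function name `block_fi_ligature` is made up for illustration and is not part of any standard API. As noted above, this is not the recommended general solution for Turkish or Azeri text; it only shows the mechanics of inserting U+200C.

```python
import re

ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER

def block_fi_ligature(text: str) -> str:
    """Insert ZWNJ between 'f' and a directly following dotted 'i' (U+0069).

    Hypothetical helper: sequences of f + dotless i (U+0131) are left alone,
    and text that already has a ZWNJ after the f is not changed again.
    """
    return re.sub(r"f(?=i)", "f" + ZWNJ, text)

if __name__ == "__main__":
    for word in ("fikir", "f\u0131rsat"):
        marked = block_fi_ligature(word)
        # Show the code points so the invisible ZWNJ can be seen.
        print(word, [f"U+{ord(c):04X}" for c in marked])
```

Whether a given renderer then actually avoids the ligature still depends on the font and the layout engine honouring ZWNJ.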
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Thursday, July 10, 2003 8:37 PM, Kenneth Whistler <[EMAIL PROTECTED]> wrote:

> Peter Kirk asked:
>
> > > In Turkish and Azeri the sequences f - i and f - dotless i both
> > > occur, and are fairly frequent. So it is inappropriate in these
> > > languages to use fi ligatures in which the dot on the i is lost
> > > or invisible, at least where the second character is a dotted i.
> > > Has any thought been given to this issue? Is it possible to block
> > > such ligation on a language-dependent basis?
>
> and Philippe Verdy responded with another question:
>
> > Isn't there a "Grapheme Disjoiner" format control character to
> > force the absence of a ligature like , i.e. ?
>
> The answer to Philippe's rejoinder question is no, there is not
> a "Grapheme Disjoiner" format control character.

I did not refer to a specific Unicode character. I knew that there is one already dedicated, but I did not want to comment about this choice.

There's no contradiction. The Grapheme Disjoiner, for you, is ZWNJ. OK.

And I did not want to promote any change in any legally and legacy encoded text, only to suggest ways to solve the apparent rendering problem in Turkish, when the encoded character pair may be badly rendered.

For the actual rendering, selecting a ligature is not appropriate for Turkish, and in fact the canonically decomposed character has no linguistic ambiguity in Turkish. So whatever the encoded codepoint designates, it is not the ligature glyph but really two characters, whose ligation may still be performed according to language context.

A font that would automatically select a ligature to represent a sequence of codepoints, from the fact that the codepoint is canonically equivalent is probably defective and not conforming. 
Such selection of ligature must be put under the control of the renderer with additional markup, which can in fact select among three ligatures in Turkish: the ligature glyph where the f is ligated with the dot above i (normal ligature for languages other than Turkish/Azeri), and the and ligatures for Turkish/Azeri. Markup is necessary to select the appropriate glyph, or this can be selected by using the "Grapheme Disjoiner" (ZWNJ) or the "Grapheme Joiner" (ZWJ) in addition to the use of a or codepoint eventually followed by the diacritic. All this enrichment of text is assumed to be under the control of the markup added to the original text, which does not need to specify whether ligatures should or should not be used.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
Philippe Verdy scripsit: > Where does the fact of saying that a Grapheme Disjoiner can be used > in Turkish to avoid that the f collapses the dot above a next lowercase i? It is settled that ZWNJ is the correct character to break ligatures. ZWJ means "make a ligature if you can; if not, shape characters to joining forms if you can; if not that either, do nothing." ZWNJ means "break ligatures, if any, and shape characters to non-joining forms, if possible." > I'm still convinced that a ligature is still possible for a turkish dotted-i> sequence, using . The ligature would apply > to the middle bar of the joined with the top serif of the , > but the top-right loop of the f would simply be a small horital bar, > disjoined from the dot still present on the i. Yes, theoretically. Whether that is good Turkish typography is a different question, which AFAIK prefers simply an f-glyph followed by an i-glyph with no ligaturing. IIRC, Portuguese traditional typography also avoids the fi-ligature, even though the language has no dotless-i. > The same ligature could be used for the encoded sequence , I doubt that any font has a ligature for this combination at all. > So the encoded sequence is now made "equivalent" > (for rendering purpose) to (despite they are > not canonically equivalent per UAX#15: NFC/D) and not "equivalent" > to an isolated (not followed above diacritics)... There is no guarantee that the native i dot looks the same as the dot above in a given font (it may have different vertical kerning or even a different shape), nor is there any guarantee that the i with its dot removed looks the same as the dotless-i. 
-- John Cowan www.ccil.org/~cowan www.reutershealth.com [EMAIL PROTECTED] "'My young friend, if you do not now, immediately and instantly, pull as hard as ever you can, it is my opinion that your acquaintance in the large-pattern leather ulster' (and by this he meant the Crocodile) 'will jerk you into yonder limpid stream before you can say Jack Robinson.'" --the Bi-Coloured-Python-Rock-Snake
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 10/07/2003 11:37, Kenneth Whistler wrote:

> As Peter pointed out, however, it is neither expected nor reasonable
> to have to go back through and drop in ZWNJs at every relevant
> location in existing Turkish or Azeri text, simply to prevent fi
> ligation. Such use of ZWNJ is intended to be exceptional, to deal
> with special cases.
>
> The general solutions depend on the use of fonts (or more generally,
> renderers) which block such ligation across the board. It is my
> understanding that modern font technologies allow the choice of
> ligation to essentially be a style selection for the font. How well
> various applications take advantage of that and make the choice
> available easily to end users may be an open issue still, but the
> fundamental pieces to do this correctly are available.

Thank you, Ken. I think you get my point. I am not so interested in character-level mechanisms for disabling the ligature as in higher-level features. But I guess I am really thinking in terms of markup, so outside the domain of Unicode, which might disable ligation.

-- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
See also http://www.microsoft.com/typography/developers/opentype/detail.htm which explains how ligatures can be turned off on a language-dependent basis. Laurentiu Peter Kirk asked: > In Turkish and Azeri the sequences f - i and f - dotless i both occur, > and are fairly frequent. So it is inappropriate in these languages to > use fi ligatures in which the dot on the i is lost or invisible, at > least where the second character is a dotted i. Has any thought been > given to this issue? Is it possible to block such ligation on a > language-dependent basis?
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
> > and Philippe Verdy responded with another question:
> >
> > > Isn't there a "Grapheme Disjoiner" format control character to
> > > force the absence of a ligature like , i.e. ?
> >
> > The answer to Philippe's rejoinder question is no, there is not
> > a "Grapheme Disjoiner" format control character.
>
> I did not refer to a specific unicode character, I knew that there
> is one already dedicated, but I did not want to comment about
> this choice.
>
> There's no contractiction. The Grapheme Disjoiner, for you is
> ZWNJ. OK.

Every so often, Philippe, it would be refreshing if, when someone points out an error in your claims about the Unicode Standard, you would simply acknowledge the error and discontinue making the claim, instead of coming back trying to claim that the error was just another way of being right.

There is a separate character, U+034F COMBINING GRAPHEME JOINER, which is the "grapheme joiner", abbreviation "CGJ" in the standard. That character has nothing to do with ligation control.

There has also been debate, on several occasions, within the UTC, regarding the advisability of encoding a "grapheme non-joiner", as a pair with the "grapheme joiner". But again, such a grapheme non-joiner -- which has *not* been encoded, by the way -- would have nothing to do with ligation control.

So it is a disservice to the list, perpetuating confusion, to invent the term "Grapheme Disjoiner" and use it in a series of notes regarding ligation control, when the standard already designates the ZWJ and the ZWNJ as the relevant controls related to ligation control.

So it is not that for me "the Grapheme Disjoiner is the ZWNJ"; rather, it is for the Unicode Standard that the ZWNJ is the designated, standardized format control for ligation control of the sort you are talking about. Please learn the terminology and make correct use of it. 
> A font that would automatically select a ligature to represent
> a sequence of codepoints, from the fact that the
> codepoint is canonically equivalent

U+FB01 LATIN SMALL LIGATURE FI is not a *canonical* equivalent to ; it is *compatibility* equivalent. That is an important distinction.

> is probably defective and not
> conforming.

Wrong. There is nothing nonconformant about fonts automatically ligating (or any other sequence). Such automatic ligation may not always be appropriate or the desired result for an end user, but that has nothing to do with the conformance requirements of the standard.

> Such selection of ligature must be put under the

Wrong. "must" --> "may"

> control of the renderer with additional markup, which can in fact
> select among three ligatures in Turkish: the ligature glyph
> where the f is ligated with the dot above i (normal ligature for
> languages other than Turkish/Azeri, the and
> ligatures for Turkish/Azeri.

It is unclear that any such ligatures are required or desirable for Turkish/Azeri, in any case.

> Markup is necessary to select the appropriate glyph, or this

Wrong. A higher-level protocol is needed, and that may involve markup. But the Turkish requirements can equally well be met by simply setting "no ligature" style settings for the relevant fonts.

> can be selected by using the "Grapheme Disjoiner" (ZWNJ)

Wrong term. See above.

> or the "Grapheme Joiner" (ZWJ) in addition to the use of

Wrong term. See above.

> a or codepoint eventually followed by the
> diacritic.

And in any case, it is inadvisable to be suggesting use of ZWJ and ZWNJ in this way to solve the problem of assuring that Turkish texts don't ligate inappropriately on rendering.

> All this enrichment of text is assumed
> to be under the control of the markup added to the original
> text which does not need to specify whever ligatures should
> or should not be used.

This last clause I agree with. 
But the implication that markup has to be added to Turkish text in order to get it to render correctly regarding ligature usage is incorrect. Adding markup to the text is "adding to the original text" as surely as adding ZWNJ format controls would be. In any case it is unnecessary, since alternatives exist which simply specify suppression (or use) of ligatures stylistically in the fonts. --Ken
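The canonical-versus-compatibility distinction made above for U+FB01 can be checked directly. The following is a small sketch using Python's standard `unicodedata` module; the two helper names are invented for illustration. It only shows what the normalization forms do with the ligature code point and says nothing about what a font may or may not ligate.

```python
import unicodedata

FI_LIGATURE = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI

def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

if __name__ == "__main__":
    # The decomposition mapping is tagged <compat>, i.e. not canonical.
    print(unicodedata.decomposition(FI_LIGATURE))
    print(canonically_equivalent(FI_LIGATURE, "fi"))
    print(compatibility_equivalent(FI_LIGATURE, "fi"))
```

So NFC and NFD leave U+FB01 untouched, while NFKC and NFKD replace it with the two letters f and i.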
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
> "Peter" == Peter Kirk <[EMAIL PROTECTED]> writes: Peter> Maybe, but it is hardly realistic to expect all existing Peter> Turkish and Azeri text to be recoded to insert a character in Peter> the middle of each f - i sequence. But a lot of it already does do that. In TeX, Turkish uses f{}i to block the (font’s) ligation. ’roff does something similar. I’m sure all of the other text-source publishing systems do as well. Even the WYSI(NR)WYG¹ systems must be doing something to accomplish that. -JimC ¹ NR ≡ Not Really
RE: Ligatures in Turkish and Azeri, was: Accented ij ligatures
> Note also: the Soft_Dotted property was created and considered > specially for Turkish and Azeri. Adding to the long, and unfortunately getting longer, list of misleading statements from Philippe! No, the reason for the Soft_Dotted property was/is to mark which characters (regardless of language) that don't display intrinsic dot(s) above subglyph(s) when (another) combining character above is applied to it (and to then keep the dot(s) a combining dot above or a combining diaeresis, as appropriate, must be used explicitly). > In this language context the ASCII i is always rendered with a dot, > kept also for uppercases. I hope you don't mean to use a dotted glyph for U+0069! B.t.w. It is perfectly legal to use a ligature (in the TECHNICAL sense, perhaps not the typographic sense) for also for Turkish and related languages, especially if the f and i would otherwise overlap. The point is that and must be clearly distinguishable for these languages, and that may mean that one has to use a TECHNICAL ligature for having a glyph where the dot on the i is clearly visible (the horizontal bar of the f and the top serif of the i may still merge). That may be done by whatever means that is better-looking for that particular font, e.g. moving the loop of the f to the left, right, or up. (Using ZWNJ should not do that, if correctly implemented, but can instead, mistakenly, result in overlapping f and dot-of-i glyphs, since not even a technical ligature, IIUC (correct me if I'm wrong), would be allowed...) /kent k
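To make the explicit-dot mechanism described above concrete, here is a rough sketch in Python using the standard `unicodedata` module. The constant names are my own. It only inspects the encoded sequences (combining classes and normalization); how the dot is actually drawn is up to the font and renderer.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # U+0307 COMBINING DOT ABOVE
ACUTE = "\u0301"      # U+0301 COMBINING ACUTE ACCENT
DOTLESS_I = "\u0131"  # U+0131 LATIN SMALL LETTER DOTLESS I

# Soft-dotted i with a stress accent: by default the dot is not displayed.
plain_stressed = "i" + ACUTE
# Same, but with the dot requested explicitly before the accent.
dotted_stressed = "i" + DOT_ABOVE + ACUTE
# Dotless i with the same accent, for comparison.
dotless_stressed = DOTLESS_I + ACUTE

if __name__ == "__main__":
    # Both marks have combining class 230 (above), so their order is kept.
    print(unicodedata.combining(DOT_ABOVE), unicodedata.combining(ACUTE))
    for label, s in (("i + acute", plain_stressed),
                     ("i + dot + acute", dotted_stressed),
                     ("dotless i + acute", dotless_stressed)):
        nfc = unicodedata.normalize("NFC", s)
        print(label, [f"U+{ord(c):04X}" for c in nfc])
```

The three sequences stay distinct under NFC: only the first composes to a single code point, so the explicit dot survives normalization and the distinction from dotless i is not lost at the encoding level.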
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Friday, July 11, 2003 1:12 PM, Kent Karlsson <[EMAIL PROTECTED]> wrote: > > Note also: the Soft_Dotted property was created and considered > > specially for Turkish and Azeri. > > Adding to the long, and unfortunately getting longer, list of > misleading statements from Philippe! No, the reason for the > Soft_Dotted property was/is to mark which characters (regardless of > language) that don't display intrinsic dot(s) above subglyph(s) > when (another) combining character above > is applied to it (and to then keep the dot(s) a combining dot above > or a combining diaeresis, as appropriate, must be used explicitly). I don't know how I can say, with my limited English, things without being always accused of creating misleading things. Correct things if you think my words create possible confusion in their interpretation, but please don't over-exhibit them. I don't know how non-English native writers can participate here if all differences of interpretations caused by possible use of inappropriate English terms are answered with flame. This is really frustrating... The important words in my sentence is "considered specially", where "specially" does not imply "only". It's just that Turkish and Azeri are already given special treatment in Unicode, which already includes language exceptions in its technical algorithms (notably for character foldings). And according to this treatment, the U+0069 character is already intended to have a semantic value of a dotted and not a dotless in languages where this creates a semantic difference, so the question of the "Soft_Dotted" property is more glyphic than purely semantic, and it has a semantic behavior (at the abstract text level where Unicode is supposed to standardize things) mostly in case folding operations where the actual encoding of the converted abstract text is important. 
The rest of the description of the Soft_Dotted property is mostly a recommendation for authors of fonts and text renderers, so that they should *preserve this semantic difference* in the rendered text between abstract letters dotted and dotless 's... And this does not affect the encoding of the abstract text or any algorithmic transformation of the encoded abstract text. By saying "preserve this semantic difference", I do not imply that the U+0069 must/should have a dot above: it remains a font design problem, out of scope of Unicode. There are certainly many ways to preserve the semantic difference in the rendered text when this is really appropriate (for example in Turkish and Azeri, or with a distinct and emphasized rendering of the Turkish dot, including in possible ligatures with other letters). And please, do not flame me if this message contains new terms that also create confusion. I reread as best I can, and there are certainly other, better ways to say the same thing in English without these unintentional confusing interpretations, and I am sorry in advance if such confusion still persists. Accept the fact that I'm not a Unicode member and Unicode is only one of my interests, and I have a lot of other terminologies to work with. If you can't accept that approximate English may be used by participants here, and refuse to understand the real intent of users when they write here, then have this group be moderated, but don't say it is open to discussions from anybody using Unicode. For normative aspects, with all exact terms, Unicode has its web site, its publications, its data files, its working draft documents, its technical committees, its permanent members, its chairmen, and even bug & comment report forms to interact with users at the normative level. 
And I am sure that permanent Unicode members do not even need this newsgroup to exchange their work on normative documents that are directly sent to the working committee bureaus, or via private email, phone calls, snail letters, or their own web sites. Please don't expect the same linguistic level quality here. Also don't complain if my messages are long, but the constant critics about what I am "supposed" to "imply", gives me no other choice than explaining always what I mean, and this is particularly lengthy, and really boring in a newsgroup. Thanks for your patience. -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 11/07/2003 05:56, Philippe Verdy wrote: Note also: the Soft_Dotted property was created and considered specially for Turkish and Azeri. Whatever it was that was specially created or adjusted for Turkish and Azeri, was it specifically restricted to these two languages? These are I think the only relatively major languages which use the special dotted and dotless i case mappings. But they are also used, at least in a small way, for minority languages of Turkey and Azerbaijan. (Use of these minority languages in Turkey is illegal, but that's another matter.) They were used in the 1930's for many Central Asian languages, and were at least proposed in the 1990's for newly introduced Latin alphabets. So I hope that what is fixed by Unicode is the name not of two languages but of an extensible family of scripts. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
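For readers who have not met the special dotted and dotless i case mappings, the following is a small Python sketch of what a tailored upper/lower conversion looks like. The helpers `turkic_upper` and `turkic_lower` are hypothetical names, not a library API, and the sketch assumes precomposed input (it does not handle an already decomposed i followed by U+0307). Python's built-in `str.upper()` and `str.lower()` apply only the default, untailored mappings.

```python
DOTTED_CAPITAL_I = "\u0130"  # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # U+0131 LATIN SMALL LETTER DOTLESS I

def turkic_upper(text: str) -> str:
    """Uppercase with the dotted/dotless i pairs kept apart (sketch)."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()

def turkic_lower(text: str) -> str:
    """Lowercase with the dotted/dotless i pairs kept apart (sketch)."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()

if __name__ == "__main__":
    for word in ("fikir", "f\u0131rsat"):
        print(word, "->", turkic_upper(word), "(default:", word.upper() + ")")
```

Nothing in this sketch is tied to two particular languages; which alphabets it should be applied to is exactly the open question raised above, and an application would have to decide that from language or locale information outside the text.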
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Friday, July 11, 2003 3:50 PM, Peter Kirk <[EMAIL PROTECTED]> wrote: > So I hope that what is fixed by Unicode is the name not > of two languages but of an extensible family of scripts. I think you mean a family of languages? Good luck with ISO language codes, which do not even define them, and contain many duplicate codes even in the Alpha-2 space (he/iw, in/id), or imprecise codes sometimes matching very imprecise families of languages overlapping other language codes... Until it is demonstrated that a language needs such a fix in Unicode support tables, it's best to just say that these fixes are needed for some recognized language codes and that applications are allowed to add their own "fixes" or language tailorings, and that the existing language tailorings in Unicode databases are just non-normative samples. -- Philippe.
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 11/07/2003 08:51, Philippe Verdy wrote:
> On Friday, July 11, 2003 3:50 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:
> > So I hope that what is fixed by Unicode is the name not of two languages but of an extensible family of scripts.
> I think you are speaking of a family of languages?

Not really. A set of languages, but they are not all related in any way, and many of them have more than one script or alphabet so this is not really a property of the languages. Perhaps "set of alphabets" would be a better way to put it.

> Good luck with the ISO language codes, which do not even define them, and contain many duplicate codes even in the Alpha-2 space (he/iw, in/id), or imprecise codes that sometimes match very imprecise families of languages overlapping other language codes... Until it is demonstrated that a language needs such a fix in the Unicode support tables, ...

If necessary I can collect some data to demonstrate this, at least for some languages.

> ... it's best to just say that these fixes are needed for some recognized language codes, that applications are allowed to add their own "fixes" or language tailorings, and that the existing language tailorings in the Unicode databases are just non-normative samples. -- Philippe.

Agreed. But does Unicode actually treat them as non-normative samples?

-- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On Friday, July 11, 2003 6:43 PM, Peter Kirk <[EMAIL PROTECTED]> wrote:
> Agreed. But does Unicode actually treat them as non-normative samples?

This is not clear: the reference documents say that these tables are normative for applications that want to implement a conforming case folding. But UTR#30 (Character Foldings) still contains many areas marked as "to be done", so it is not clear that all folding issues have been solved. It seems reasonable, however, that the non-language-specific elements in the CaseFolding table are normative, as they are computed from the UCD... I see this comment:

[quote]
# The entries in this file are in the following machine-readable format:
#
# <code>; <status>; <mapping>; # <name>
#
# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
#      Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
#      See the discussions of case mapping in the Unicode Standard for more information.
#
# Usage:
#  A. To do a simple case folding, use the mappings with status C + S.
#  B. To do a full case folding, use the mappings with status C + F.
#
#    The mappings with status T can be used or omitted depending on the desired case-folding
#    behavior. (The default option is to exclude them.)
#
[/quote]

Simple Case Mapping (C+S) is not marked "to be done" in UTR#30, but the other special mappings with status T are off by default (so they depend on a specific tailoring, which is a non-normative behavior if I interpret it correctly, as applications are free to use them or not, under unspecified conditions, i.e. here the "desired behavior"). This concerns many more characters than just those used by Turkish/Azeri, and there is some overlap with the informative and unfinished UTR#30 reference:

(1) Simple mappings (are they normative?):

1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F89; S; 1F81; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
1F8A; S; 1F82; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F8B; S; 1F83; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F8C; S; 1F84; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F8D; S; 1F85; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F8E; S; 1F86; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F8F; S; 1F87; # GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1F98; S; 1F90; # GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
1F99; S; 1F91; # GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
1F9A; S; 1F92; # GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1F9B; S; 1F93; # GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1F9C; S; 1F94; # GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1F9D; S; 1F95; # GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1F9E; S; 1F96; # GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1F9F; S; 1F97; # GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FA8; S; 1FA0; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
1FA9; S; 1FA1; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
1FAA; S; 1FA2; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
1FAB; S; 1FA3; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
1FAC; S; 1FA4; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
1FAD; S; 1FA5; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
1FAE; S; 1FA6; # GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
1FAF; S; 1FA7; # GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
1FBC; S; 1FB3; # GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
1FCC; S; 1FC3; # GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
1FFC; S; 1FF3; # GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI

(2) Full mappings (clearly optional):

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0149; F; 02BC 006E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
01F0; F; 006A 030C; # LATIN SMALL LETTER J WITH CARON
0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
03B0; F; 03C5 0308 0301; # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
0587; F; 0565 0582; # ARMENIAN SMALL LIGATURE ECH YIWN
1E96; F; 0068 0331; # LATIN SMALL LETTER H WITH LINE BELOW
1E97; F; 0074 0308;
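The usage notes quoted from CaseFolding.txt (C + S for simple folding, C + F for full folding, T optional and off by default) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SAMPLE is a small hand-picked excerpt rather than the real data file, the two status-T lines and the C line for U+0049 are included from memory of the published file rather than from the message above, and the names load_folding and fold are made up for this sketch.

```python
# Illustrative excerpt in the CaseFolding.txt entry format:
#   <code>; <status>; <mapping>; # <name>
# It is NOT the complete file; a real application would read the published data.
SAMPLE = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1F88; F; 1F00 03B9; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
1F88; S; 1F80; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
"""


def load_folding(text, full=False, turkic=False):
    """Build {character: folded string} from C+S (simple) or C+F (full) entries.

    Status-T entries are collected separately and only applied on request,
    mirroring the "default option is to exclude them" note in the file.
    """
    wanted = {"C", "F" if full else "S"}
    table, turkic_overrides = {}, {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing name comment
        if not data:
            continue
        code, status, mapping = [field.strip() for field in data.split(";")[:3]]
        source = chr(int(code, 16))
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        if status in wanted:
            table[source] = folded
        elif status == "T":
            turkic_overrides[source] = folded
    if turkic:
        table.update(turkic_overrides)  # T replaces the normal mapping
    return table


def fold(s, table):
    """Fold a string character by character; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)
```

With this excerpt, simple folding leaves U+00DF and U+0130 untouched, full folding expands them, and passing turkic=True makes I fold to dotless ı and İ fold to plain i. Whether an application should enable the T entries is exactly the tailoring question discussed in this thread.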
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
> Where does the fact of saying that a Grapheme Disjoiner... The character you should be referring to is not a new character GDJ, but rather is the existing ZWNJ, the functions of which include prevention of a ligature. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485
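As a rough illustration of the ZWNJ suggestion, the following Python sketch inserts U+200C between "f" and a following dotted "i", which is the case raised earlier in the thread for Turkish and Azeri. The function names are hypothetical, and whether a given font or renderer actually honours ZWNJ here is an assumption that has to be checked per system.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def block_fi_ligature(text):
    """Insert ZWNJ between 'f' and a following dotted 'i'.

    Sequences of 'f' + dotless 'ı' (U+0131) are deliberately left alone.
    """
    return text.replace("fi", "f" + ZWNJ + "i")


def strip_zwnj(text):
    """Remove ZWNJ again, e.g. before searching or comparing strings."""
    return text.replace(ZWNJ, "")
```

Note that the inserted character becomes part of the stored text, so plain string comparison and searching will see it; that is why the sketch also includes strip_zwnj.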
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 11/07/2003 11:18, Philippe Verdy wrote:
> # T: special case for uppercase I and dotted uppercase I
> #- For non-Turkic languages, this mapping is normally not used.
> #- For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
>
> Is that what is called a "character subset" for a scripted language family? Well, I don't like the term "Turkic" to name it. I prefer the more common "Altaic Latin alphabet", seen as a standard subset of the Latin script, with additional properties.
>
> Maybe Unicode should not try to use language codes for families of languages, but it could define "representative subsets of characters" which may contain characters from several scripts, but would be minimized according to the tradition of a family of languages. Such families seem evident from the current ISO-8859-* and Mac/Windows/DOS charsets. -- Philippe.

Thank you, Philippe. Well, I am glad to read "not normally used" rather than "must not be used" as this allows mapping T to be used for other languages when appropriate.

I also don't like the word Turkic here. This is a linguistic term for a language family, see http://www.ethnologue.com/show_family.asp?subid=710. Turkish and Azeri are Turkic languages, but there are many Turkic languages which don't use this case mapping, either because they use other alphabets (Cyrillic, Arabic, occasionally Hebrew, perhaps even Greek) or because they use a Latin alphabet with the regular case mapping as in Uzbek and Turkmen. There are also some non-Turkic minority languages which need the T case mapping.

"Altaic Latin alphabet" is a reasonable alternative, although again Altaic is a language family name, covering Turkic, Mongolian and Tungus, see http://www.ethnologue.com/show_family.asp?subid=709, and as far as I know mapping T is not needed for any Mongolian or Tungusic languages.
Does anyone know of a good resource on the web, or elsewhere, listing the alphabets used for different languages around the world? I know a project was attempted a few years ago at least for Europe. It would be useful to have this kind of data available somewhere even with no official status. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
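One way to read the point about an "extensible" set of alphabets is that the dotted/dotless i behaviour should hang off a configurable list of language tags rather than off a language family name. The Python sketch below assumes exactly that. DOTLESS_I_LANGS and lower_for are hypothetical names, only "tr" and "az" come from the CaseFolding.txt comment, and the "zz" tag mentioned in the note after the code is a placeholder, not a real language code.

```python
# Hypothetical, application-maintained set of tags whose Latin alphabet
# pairs I with dotless ı and İ with dotted i.
DOTLESS_I_LANGS = frozenset({"tr", "az"})


def lower_for(text, lang, langs=DOTLESS_I_LANGS):
    """Lowercase text, applying the dotted/dotless i pairing when lang is listed.

    Simplification: decomposed input (I followed by U+0307 COMBINING DOT ABOVE)
    is not handled here.
    """
    if lang in langs:
        # Handle the two special letters first, then let the default
        # lowercasing deal with everything else.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()
```

An application that needs the same behaviour for another alphabet would pass an extended set, for example lower_for(word, "zz", langs=DOTLESS_I_LANGS | {"zz"}), without touching the default.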
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
At 03:25 -0700 2003-07-12, Peter Kirk wrote:
> Does anyone know of a good resource on the web, or elsewhere, listing the alphabets used for different languages around the world? I know a project was attempted a few years ago at least for Europe. It would be useful to have this kind of data available somewhere even with no official status.

http://www.evertype.com/alphabets

-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures
On 12/07/2003 04:18, Michael Everson wrote:
> At 03:25 -0700 2003-07-12, Peter Kirk wrote:
> > Does anyone know of a good resource on the web, or elsewhere, listing the alphabets used for different languages around the world? I know a project was attempted a few years ago at least for Europe. It would be useful to have this kind of data available somewhere even with no official status.
> http://www.evertype.com/alphabets

Thank you, Michael. I knew you had this information, of course, as I helped to provide it, but I didn't know where it was now. This is of course restricted to Europe as you have defined it, and is not exhaustive for Turkey. Also it doesn't include recent Latin alphabets for minority languages of Azerbaijan, as used in schools to a rather limited extent, perhaps because I never sent you the data.

The link to http://www.evertype.com/alphabets/azerbaijan.pdf is broken; and in http://www.evertype.com/alphabets/turkish.pdf the dotted capital I is missing, as viewed in Acrobat Reader 5.1 on Windows 2000.

-- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
Philippe Verdy wrote:
> Good luck with the ISO language codes, which do not even
> define them, and contain many duplicate codes even in
> the Alpha-2 space (he/iw, in/id), or imprecise codes
> that sometimes match very imprecise families of languages
> overlapping other language codes...

The codes "iw" for Hebrew and "in" for Indonesian were deprecated FOURTEEN YEARS AGO. It is not accurate or fair to refer to them as "duplicates" of "he" and "id". The Registration Authority deprecates such codes, rather than deleting them, for backward compatibility with any data that might contain the old codes.

The part about codes for language families overlapping other codes for specific languages is, regrettably, true.

-Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
On Saturday, July 12, 2003 6:51 AM, Doug Ewell <[EMAIL PROTECTED]> wrote:
> Philippe Verdy wrote:
>
> > Good luck with the ISO language codes, which do not even
> > define them, and contain many duplicate codes even in
> > the Alpha-2 space (he/iw, in/id), or imprecise codes
> > that sometimes match very imprecise families of languages
> > overlapping other language codes...
>
> The codes "iw" for Hebrew and "in" for Indonesian were deprecated
> FOURTEEN YEARS AGO. It is not accurate or fair to refer to them as
> "duplicates" of "he" and "id". The Registration Authority deprecates
> such codes, rather than deleting them, for backward compatibility with
> any data that might contain the old codes.

I was also sure that "iw" was no longer used today, until I found that it is still used in Java on Windows, for legacy reasons... A Hebrew resource bundle created with the code "he" was simply... ignored. So I had to rename it to "iw".

Unfortunately, on Linux and the various Unixes the correct code to use for locales varies, and it comes from the user-environment settings, which are set up by a system profile most of the time... Users who want to benefit from existing locales for Hebrew will constantly need to switch between "he" and "iw". The "normal" installation solution is still, today, to create a file link between the "he" and "iw" resources, so that both can be used.

I was really disappointed when I saw that these legacy language codes could not be simplified the way we would think, by ignoring "iw" and "in". Still today, Java does not offer a way to create "links" at runtime to resolve locales with equivalent IDs, without duplicating resources or creating special rules such as: if (code.equals("he") || code.equals("iw")) (don't forget that Java also has run-time resources with no files)...
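The runtime "links" between equivalent locale IDs asked for above can be pictured as a small lookup layer in front of the resource table. The Python sketch below is language-neutral in spirit and does not describe any Java or ICU API: EQUIVALENT_CODES, candidates and find_bundle are hypothetical names, and the dictionary of bundles stands in for whatever resource store an application really has.

```python
# Hypothetical groups of codes that should resolve to the same resources.
# The pairs he/iw and in/id are the ones named in this thread.
EQUIVALENT_CODES = [{"he", "iw"}, {"id", "in"}]


def candidates(code):
    """Return the requested code first, followed by its equivalents."""
    for group in EQUIVALENT_CODES:
        if code in group:
            return [code] + sorted(group - {code})
    return [code]


def find_bundle(code, bundles):
    """Look a bundle up under the requested code or any equivalent code."""
    for candidate in candidates(code):
        if candidate in bundles:
            return bundles[candidate]
    return None
```

With this in place, a bundle stored only under "iw" is still found when the caller asks for "he", without duplicating the resource and without a hard-coded comparison at each call site.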
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
On Saturday, July 12 at 6:51, Doug Ewell <[EMAIL PROTECTED]> wrote:
> The codes "iw" for Hebrew and "in" for Indonesian were deprecated
> FOURTEEN YEARS AGO. It is not accurate or fair to refer to them as
> "duplicates" of "he" and "id". The Registration Authority deprecates
> such codes, rather than deleting them, for backward compatibility with
> any data that might contain the old codes.

Just out of curiosity, why was «iw» deprecated? Seems perfectly fine to me. And why was «he» chosen (Herero, Hemba, Hellenic Greek)?

P.A.
RE: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
What has "iw" to do with Hebrew? I wasn't involved with the change, but I'm glad it was done. Java and other systems probably still use it because they never bothered to check the latest version of 639. I know for certain that this was the case with one of the major computer vendors. Jony
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
"Michael Everson" <[EMAIL PROTECTED]> wrote:
> At 08:11 -0400 2003-07-12, Patrick Andries wrote:
>
> >Just out of curiosity, why was «iw» deprecated? Seems perfectly fine to
> >me. And why was «he» chosen (Herero, Hemba, Hellenic Greek)?
>
> Iwrit (iw), being a German transliteration of the name of the Hebrew
> language, and Jiddisch (ji) were both thought (by someone) to be less
> suitable than the English-based "he" and "yi" which replaced them.

This is also what I concluded, but «iv» for ivrit could have pleased those who thought the transliteration must be English-based (what a strange idea!).

P. A.
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
We did that deliberately. Faced with a situation where a registration authority changes IDs on a whim -- with no regard to the issues of stability in software and data -- the best policy is to always use the old one, and map any new locales to the old one. That way when you exchange IDs between old and new systems, it all continues to work. (We did in fact know of the latest version of the standard at the time.) (In ICU, we did add a more general-purpose aliasing mechanism, both for resource bundles and parts thereof.) Mark __ http://www.macchiato.com ► “Eppur si muove” ◄
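The policy described in this reply, "always use the old one, and map any new locales to the old one", amounts to choosing one stable key per language and normalizing every incoming ID to it. A minimal Python sketch, assuming a hand-written table: NEW_TO_OLD and stable_id are made-up names, the he/iw and id/in pairs come from this thread, and yi/ji is the pair mentioned in the discussion of the German-based transliterations.

```python
# Newer ISO 639 codes mapped back to the older codes used as stable keys.
NEW_TO_OLD = {"he": "iw", "id": "in", "yi": "ji"}


def stable_id(code):
    """Normalize a language code to the key used for storage and exchange."""
    return NEW_TO_OLD.get(code, code)
```

Because both "he" and "iw" normalize to the same key, data written by an older system and requests from a newer one meet at the same ID, which is the stability argument made above. The reverse choice (normalizing old codes to new ones) would work equally well inside a single system; it is the exchange with systems that only know the old codes that motivates this direction.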
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
On Saturday, July 12, 2003 4:17 PM, Jony Rosenne <[EMAIL PROTECTED]> wrote:
> What has "iw" to do with Hebrew?
>
> I wasn't involved with the change, but I'm glad it was done. Java and
> other systems probably still use it because they never bothered to
> check the latest version of 639. I know for certain that this was the
> case with one of the major computer vendors.

In the case of Java, I don't think so. Sun has certainly kept the language code simply to avoid breaking existing Hebrew localizations of software written in Java, probably while waiting for a better way to locate locales than the fixed "locales path resolution algorithm" that has been part of its core classes since the beginning.

As long as the Java core classes do not use a locale resolver that allows tuning the resolution algorithm used to load resource bundles, while also maintaining compatibility with existing software that assumes that Hebrew resources are loaded with the "iw" language code, Sun will not change this code.

In IBM's ICU4J there is such an extended resolver, but Sun takes a long time to approve such proposals and to have them accepted, documented, balloted and voted on in its JCP program. Of course Java already includes some parts of ICU, but other things in ICU4J are now difficult to integrate into Java, simply because IBM forgot to modularize ICU so that it could be integrated gradually. Accepting ICU4J as part of the core is a big decision, because ICU4J is quite large, and there are certainly Java developers who would not accept having 1 additional MB of data and classes loaded in each JVM (particularly because the integration of ICU would affect a lot of core classes of the Java2 platform, which is now also used for small devices).

For example, it is impossible to integrate ICU's Normalizer class into Java without also importing the UChar class and all its related services for UString, such as transliterators, and advanced features such as the UCA tailoring rules run-time compiler. Some ICU open-source contributors, as well as its users, now seem to think that the modularization of ICU is an important but complex project.

-- Philippe. Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
...
> Of course
> Java already includes some parts of ICU, but other things in
> ICU4J are now difficult to integrate into Java, simply because IBM
> forgot to modularize ICU so that it could be integrated gradually.
> Accepting ICU4J as part of the core is a big decision,
> because ICU4J is quite large, and there are certainly Java
> developers who would not accept having 1 additional MB of data and
> classes loaded in each JVM (particularly because the integration
> of ICU would affect a lot of core classes of the Java2 platform,
> which is now also used for small devices).
...
> For example, it is impossible to integrate ICU's Normalizer
> class into Java without also importing the UChar class and all its
> related services for UString, such as transliterators, and
...

You are very misinformed about ICU4J.

Mark __ http://www.macchiato.com ► “Eppur si muove” ◄
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
On Monday, July 14, 2003 5:34 AM, Mark Davis <[EMAIL PROTECTED]> wrote:
> ...
> > Of course
> > Java already includes some parts of ICU, but other things in
> > ICU4J are now difficult to integrate into Java, simply because IBM
> > forgot to modularize ICU so that it could be integrated gradually.
> > Accepting ICU4J as part of the core is a big decision,
> > because ICU4J is quite large, and there are certainly Java
> > developers who would not accept having 1 additional MB of data and
> > classes loaded in each JVM (particularly because the integration
> > of ICU would affect a lot of core classes of the Java2 platform,
> > which is now also used for small devices).
> ...
> > For example, it is impossible to integrate ICU's Normalizer
> > class into Java without also importing the UChar class and all its
> > related services for UString, such as transliterators, and
> ...
>
> You are very misinformed about ICU4J.

I have tried several times to do it. It does not work: you may indeed remove some tables you don't need, but trying to extract just the normalizer is a real nightmare. I tried it in the past and abandoned the attempt as too tricky to maintain; I retried it recently (one month ago, from the CVS source) and this was even worse than the first time.

I know that there was a recent announcement (less than 1 month ago) about its modularization, but it's true that I did not check the new "modularized" sources. So I still only use ICU4J when I can accept the whole package, as maintaining a stripped-down customization is too tricky.

But maybe this has changed; I just updated my ICU sources from CVS. I'll recheck to see if an "ICU Light" version can be created (which would only keep the core features, without the support for tailoring rules compiled at run-time).

-- Philippe.
Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)
First, you should check again, since a significant amount of work was done in modularization in 2.6.

Second, the phrase "IBM forgot to modularize ICU" is misleading, at the least. Unlike some people, who appear to have unbounded time and energy for, say, writing emails, we have to carefully pick and choose where we spend our time. Whether very fine-grained modularization is important depends a great deal on the client's requirements, and must be traded off against the many other things we could be doing with our time.

Third, ICU4J is a source product. Saying that it is "impossible to integrate the ICU's Normalize..." is also misleading, since one can clearly modify source to remove dependencies on code one doesn't want to include, if it is not core to the functionality. (Of course, it may vary in amount of effort that is required.) And transliterators are not, in any event, required for Normalization.

Mark __ http://www.macchiato.com ► “Eppur si muove” ◄