Philippe Verdy continued:

> From: "Mark Davis" <[EMAIL PROTECTED]>
> > From: "Anto'nio Martins-Tuva'lkin" <[EMAIL PROTECTED]>
> > > On 2003.05.25, 00:00, Philippe Verdy <[EMAIL PROTECTED]> wrote:
> > > > even if the Dutch language considers it as a single letter, in a
> > > > way similar to the Spanish "ch"
> > >
> > > I see one major difference: When you apply extra wide inter-char
> > > distance, you (should) get, f.i.:
> > > K o r t r ij k and not K o r t r i j k
> > > but E l c h e and not E l ch e
> > > This is common practice in both spanish and dutch typography, ISTK.
> > > I was told in this forum that the surest way to keep this working in
> > > Unicode texts is to use "i<WJ>j" for Dutch and plain "ij" for other
> > > languages.
> >
> > Well, I don't know who told you, but WORD JOINER only affects
> > linebreak behavior, not intercharacter spacing.
>
> I think he meant <ZWJ> (the zero-width joiner) used as as markup to
> create a ligated variant of a pair of characters in some languages
> that offer two very distinct forms (I think about Brahmic scripts
> such as Devanagari)...

No, not ZWJ, either.

U+2060 WORD JOINER (WJ) impacts line breaking behavior -- not the
applicable concept here.

U+200D ZERO WIDTH JOINER (ZWJ) impacts cursive connection and/or
ligation -- not the applicable concept here.

U+034F COMBINING GRAPHEME JOINER (CGJ) is the relevant character.

From Unicode 4.0:

"U+034F COMBINING GRAPHEME JOINER is used to indicate that adjacent
characters are to be treated as a unit for the purposes of
language-sensitive collation and searching."

That function was deliberately limited by the UTC to the status of
such digraphs for searching and sorting, as that was the only
well-defined requirement for the character.

However, as this thread has hinted, there could, in principle, be
multilingual contexts where there would be other legitimate reasons
for treating a digraphic ij (as for Dutch) distinct from a
non-digraphic ij sequence (as for Spanish). That is the same kind of
argument which led to encoding of U+034F for collation. One can
imagine an implementation of automatic letterspacing, such that a
sequence marked explicitly as a digraph would not expand, but that
one not so marked would expand.

But such distinctions would only need to be made in the rather
dubious conditions of:

A) Multilingual text that is also
B) marked explicitly for language and that also
C) requires different rules for letterspacing language-by-language.

Under such circumstances, you could indicate the differences for
<ij> either by making use of the U+0133 ij digraph character for
one and <i,j> for the other, or you could indicate the differences
by <i,CGJ,j> versus <i,j>. The first approach would likely work
more easily with existing software, but results in a problematical
representation of Dutch data. The second is a more generic Unicode
approach, but would likely be ignored by most software.

In any case, the much more likely situation would be software that
did letterspacing for fine typography based just on Dutch rules.
It would not *need* any markup of <i,j> sequences, since it would
be looking for and special-casing the sequences, anyway.

--Ken
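
For illustration only, here is a minimal Python sketch of the kind of
letterspacing discussed above. The function names and the dutch_rules
flag are hypothetical, not taken from any existing library. It assumes
letterspacing can be modelled as inserting a gap string between
"units", where <i,CGJ,j> is kept together as one unit, U+0133 is
already a single character, and plain <i,j> is kept together only
when Dutch rules are requested.

```python
CGJ = "\u034F"  # U+034F COMBINING GRAPHEME JOINER


def spacing_units(text, dutch_rules=False):
    """Split text into units that letterspacing should not pull apart.

    Hypothetical helper: <i,CGJ,j> is always one unit; plain <i,j> is
    one unit only when dutch_rules is True. U+0133 is a single code
    point, so it is a single unit without any special handling.
    """
    units = []
    i = 0
    while i < len(text):
        if text[i:i + 3].lower() == "i" + CGJ + "j":
            # Explicitly marked digraph: keep all three code points.
            units.append(text[i:i + 3])
            i += 3
        elif dutch_rules and text[i:i + 2].lower() == "ij":
            # Unmarked <i,j>, special-cased by language-specific rules.
            units.append(text[i:i + 2])
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


def letterspace(text, gap=" ", dutch_rules=False):
    """Join the units with a gap string, as a stand-in for real spacing."""
    return gap.join(spacing_units(text, dutch_rules))


if __name__ == "__main__":
    print(letterspace("Kortrijk"))                    # unmarked, generic rules
    print(letterspace("Kortri" + CGJ + "jk"))         # marked with CGJ
    print(letterspace("Kortrijk", dutch_rules=True))  # Dutch-only software
    print(letterspace("Elche"))                       # Spanish "ch" expands
```

The dutch_rules=True path corresponds to the "much more likely"
software described above: it needs no markup at all, because it looks
for <i,j> itself. A real layout engine would adjust glyph advances
rather than insert characters, so this is a model of the logic only.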