Re: transliteration of mjagkij znak (Cyrillic soft sign)
The prime is used for soft-sign transliteration to avoid ambiguity: the apostrophe is reserved for the apostrophe itself, a common sign in Ukrainian and Belarusian.
Re: transliteration of mjagkij znak (Cyrillic soft sign)
In Ukrainian, for example, both “ь” and “`” are used. “ь” marks a softer pronunciation of the preceding consonant (тіньовий), whilst “`” is used to separate the consonant from the following vowel, as if that vowel began a word, even when it would otherwise sound soft (пом`якшення -- the last “я” sounds softer than the former one). Regards, Konstantin 2016-02-11 18:05 GMT+04:00 QSJN 4 UKR: > I can show an example of the use of both, prime (as soft sign) and apostrophe > (hemisoft), in Cyrillic-based phonetic transcription (Orthoepic > Dictionary of Ukrainian, http://padaread.com/?book=84816=6 > http://padaread.com/?book=84816=7) >
Re: transliteration of mjagkij znak (Cyrillic soft sign)
I can show an example of the use of both, the prime (as soft sign) and the apostrophe (hemisoft), in Cyrillic-based phonetic transcription (Orthoepic Dictionary of Ukrainian, http://padaread.com/?book=84816=6 http://padaread.com/?book=84816=7)
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 2/11/2016 6:05 AM, QSJN 4 UKR wrote: I can show an example of the use of both, prime (as soft sign) and apostrophe (hemisoft), in Cyrillic-based phonetic transcription (Orthoepic Dictionary of Ukrainian, http://padaread.com/?book=84816=6 http://padaread.com/?book=84816=7) Can you give the number of the entry on that page? I've found the prime, but I do not see an apostrophe. What I see is a combining apostrophe (similar to the way CARON is rendered as a raised comma when following "d"). A./
RE: transliteration of mjagkij znak (Cyrillic soft sign)
And so it is, also in the library world, both before and after Unicode: for miagkii znak the prime is prescribed. The prime is also prescribed for some uses in standard transliteration of Tibetan and Hebrew/Arabic/Persian/Pushto. See, e.g., the relevant tables on https://www.loc.gov/catdir/cpso/roman.html: Tibetan: When two full forms of letters are stacked, as in Sanskritized Tibetan, there is no need to indicate the stacking. However, in the two cases noted here a modified letter prime should be inserted between the two consonants for the purpose of disambiguation. ཏྶ་ tʹsa ཙ་ tsa ནྱ་ nʹya ཉ་ nya Hebrew: A single prime ( ʹ ) is placed between two letters representing two distinct consonantal sounds when the combination might otherwise be read as a digraph. hisʹhid Persian: When the affix and the word with which it is connected grammatically are written separately in Persian, the two are separated in romanization by a single prime ( ʹ ). khānahʹhā Martin Heijdra -----Original Message----- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Michael Everson Sent: Tuesday, February 09, 2016 8:43 AM To: Unicode Discussion Subject: Re: transliteration of mjagkij znak (Cyrillic soft sign) On 9 Feb 2016, at 05:31, Asmus Freytag (t) <asmus-...@ix.netcom.com> wrote: > Without scouring the book I don't know whether there's another place in it > where something's unquestioningly the prime. In that case we could figure out > whether its appearance is simply the way that font does it. Alternatively, if > making double prime look different from two single primes, perhaps that's > common enough across fonts, and would help to lay any doubts to rest - but > so far, what I see is a spacing acute. Well, Asmus, it isn’t one. We linguists have been taught it’s the prime. https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics Michael Everson * http://www.evertype.com/
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 9 Feb 2016, at 05:31, Asmus Freytag (t) wrote: > Without scouring the book I don't know whether there's another place in it > where something's unquestioningly the prime. In that case we could figure out > whether its appearance is simply the way that font does it. Alternatively, if > making double prime look different from two single primes, perhaps that's > common enough across fonts, and would help to lay any doubts to rest - but > so far, what I see is a spacing acute. Well, Asmus, it isn’t one. We linguists have been taught it’s the prime. https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics Michael Everson * http://www.evertype.com/
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 2/8/2016 5:47 PM, Michael Everson wrote: It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/ Source? A./
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 2/8/2016 6:39 PM, Charlie Ruland wrote: On 09.02.2016, Asmus Freytag (t) wrote: On 2/8/2016 5:47 PM, Michael Everson wrote: It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/ Source? A./ Look at tables 27.1 (p. 348) and 27.2 (p. 351) of Paul Cubberley’s The Slavic Alphabets (=Peter T. Daniels and William Bright (eds.): The World’s Writing Systems, pp. 346–355). Obviously the soft sign <ь> is transliterated as a prime <ʹ>, and the hard sign <ъ> as a double prime <ʺ>. Also note that <ѓ> [gʲ] is Romanized as <ǵ>, which can hardly be considered an apostrophe above. I looked. The <ǵ> looks like a g-acute. However, the "ink" for that acute matches the ink for the prime for <ь>, which is otherwise at the wrong angle compared to the double prime. (Does not look like one half of the double prime - the slight difference in weight would be more typical of single/double symbols). Without scouring the book I don't know whether there's another place in it where something's unquestioningly the prime. In that case we could figure out whether its appearance is simply the way that font does it. Alternatively, if making double prime look different from two single primes, perhaps that's common enough across fonts, and would help to lay any doubts to rest - but so far, what I see is a spacing acute. A./
Re: transliteration of mjagkij znak (Cyrillic soft sign)
It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/
Re: transliteration of mjagkij znak (Cyrillic soft sign)
On 09.02.2016, Asmus Freytag (t) wrote: On 2/8/2016 5:47 PM, Michael Everson wrote: It’s what I was taught as the scientific romanization for Russian and Slavic in general. Michael Everson * http://www.evertype.com/ Source? A./ Look at tables 27.1 (p. 348) and 27.2 (p. 351) of Paul Cubberley’s /The Slavic Alphabets/ (=Peter T. Daniels and William Bright (eds.): /The World’s Writing Systems/, pp. 346–355). Obviously the soft sign <ь> is transliterated as a prime <ʹ>, and the hard sign <ъ> as a double prime <ʺ>. Also note that <ѓ> [gʲ] is Romanized as <ǵ>, which can hardly be considered an apostrophe above. Charlie
transliteration of mjagkij znak (Cyrillic soft sign)
Hello, I am wondering how U+02B9 MODIFIER LETTER PRIME made its way into the Unicode repertoire, and how it acquired its comment “transliteration of mjagkij znak (Cyrillic soft sign: palatalization)”. ISO/R 9:1954 through ISO/R 9:1986 map the mjagkij znak “ь” to the apostrophe, and so does DIN 1460:1982. The latter clearly depicts the apostrophe that later became U+02BC, while I am not sure whether ISO/R 9 also does so or rather depicts a glyph like U+0027. (All of these standards predate Unicode, so they just depict glyphs.) ISO 9:1995 maps the mjagkij znak “ь” to the prime, specifically to the modifier letter U+02B9, in accordance with the comment in the Unicode charts. Unicode archeologists, can you shed some light on the history of both U+02B9 and the mjagkij znak? And linguists, can you tell me how the mjagkij znak is normally transliterated, as an apostrophe or as a prime? Thanks for any comments, Otto
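[Editorial addendum] The ISO 9:1995 convention discussed above (ь → U+02B9, and, per Cubberley's tables cited later in this thread, ъ → double prime) can be sketched as a tiny Python lookup. This is a minimal illustration, not a full ISO 9 implementation; the function name and the capital-letter entries are assumptions:

```python
# Minimal sketch: transliterate only the Cyrillic soft and hard signs,
# per ISO 9:1995, which maps the soft sign to U+02B9 MODIFIER LETTER
# PRIME (and the hard sign, per the scientific convention cited in this
# thread, to U+02BA MODIFIER LETTER DOUBLE PRIME).
TRANSLIT = {
    'ь': '\u02B9',  # soft sign -> modifier letter prime
    'Ь': '\u02B9',
    'ъ': '\u02BA',  # hard sign -> modifier letter double prime
    'Ъ': '\u02BA',
}

def translit_signs(text: str) -> str:
    """Replace only the soft/hard signs, leaving everything else intact."""
    return ''.join(TRANSLIT.get(ch, ch) for ch in text)

print(translit_signs('область'))  # -> 'област' + U+02B9
```

Note that U+02B9/U+02BA are *modifier letters* (General Category Lm), so the transliterated string remains all-letters for word-boundary purposes, which is one practical argument for the prime over U+0027.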
Precomposed Cyrillic letters
From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, Addition of two letters from Montenegrin language, CYRILLIC script: 9. Can any of the proposed characters be encoded using a composed character sequence of either existing characters or other proposed characters? No Saying it doesn't make it so: Annex 1: Character shapes (related to section B, item 4b) Cyrillic small letter SJ с́ 0441 0301 Cyrillic capital letter SJ С́ 0421 0301 Cyrillic small letter ZJ з́ 0437 0301 Cyrillic capital letter ZJ З́ 0417 0301 Quite a few fonts don't display these well (and quite a few do), but of course that's a font problem, not an encoding problem. Cf. http://www.unicode.org/faq/char_combmark.html#11 -- Doug Ewell | http://ewellic.org | Thornton, CO
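[Editorial addendum] Doug's point, that the Montenegrin letters are representable as combining sequences rather than needing new precomposed codepoints, can be checked directly: the Unicode composition data contains no precomposed Cyrillic es-acute or ze-acute, so NFC leaves the two-codepoint sequences alone. A quick standard-library Python check:

```python
import unicodedata

# The proposed Montenegrin letters, encoded as base + U+0301 COMBINING
# ACUTE ACCENT, exactly as the proposal's own Annex 1 shows.
sj = '\u0441\u0301'  # Cyrillic small es + combining acute
zj = '\u0437\u0301'  # Cyrillic small ze + combining acute

# NFC would compose these if precomposed characters existed; none do,
# so each sequence survives normalization unchanged as two code points.
for seq in (sj, zj):
    assert unicodedata.normalize('NFC', seq) == seq
    assert len(seq) == 2
print('no precomposed forms exist; NFC keeps the combining sequences')
```

Whether a given font renders the acute well over с and з is, as Doug says, a font problem, not an encoding problem.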
Re: Precomposed Cyrillic letters
On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell d...@ewellic.org wrote: From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, Addition of two letters from Montenegrin language, CYRILLIC script: 9. Can any of the proposed characters be encoded using a composed character sequence of either existing characters or other proposed characters? No Saying it doesn't make it so: Right, although I doubt that the proposers monitor this mailing list... In case an interested party is listening: If sr-ME needs different locale data than sr, then one could contribute such data to CLDR http://cldr.unicode.org/. See the current state: http://unicode.org/cldr/trac/browser/trunk/common/main/sr_Cyrl_ME.xml markus
Re: Precomposed Cyrillic letters
On Thu, 9 Jul 2015 09:37:21 -0700 Markus Scherer markus@gmail.com wrote: On Thu, Jul 9, 2015 at 8:53 AM, Doug Ewell d...@ewellic.org wrote: From http://www.unicode.org/L2/L2015/15169-montenegro-cyrillic.pdf, Addition of two letters from Montenegrin language, CYRILLIC script: 9. Can any of the proposed characters be encoded using a composed character sequence of either existing characters or other proposed characters? No Saying it doesn't make it so: Is there a requirement to answer those questions truthfully? Right, although I doubt that the proposers monitor this mailing list... In case an interested party is listening: If sr-ME needs different locale data than sr, then one could contribute such data to CLDR http://cldr.unicode.org/. See the current state: http://unicode.org/cldr/trac/browser/trunk/common/main/sr_Cyrl_ME.xml Presumably http://cldr.unicode.org/index/survey-tool/accounts is the most relevant page for someone with credibility. However, as Montenegro has an army and a navy, you have the wrong locale. It's still waiting for a language code. See the language family panels at https://en.wikipedia.org/wiki/Eastern_Herzegovinian_dialect and https://en.wikipedia.org/wiki/Montenegrin_language for the extreme Balkanisation. But in short, yes we need the extra Cyrillic letters с́ and з́ and Latin letters ś and ź for the exemplar characters in sr_Cyrl_ME and sr_Latn_ME (or should that be sr_ME?). I can't work out the status of Montenegrin Latin {sj} and {zj}. Richard.
Re: Precomposed Cyrillic letters
Richard Wordingham richard dot wordingham at ntlworld dot com wrote: Presumably http://cldr.unicode.org/index/survey-tool/accounts is the most relevant page for someone with credibility. However, as Montenegro has an army and a navy, you have the wrong locale. It's still waiting for a language code. See the language family panels at https://en.wikipedia.org/wiki/Eastern_Herzegovinian_dialect and https://en.wikipedia.org/wiki/Montenegrin_language for the extreme Balkanisation. Montenegro could have all the military power in the world, but that doesn't make Montenegrin a distinct language. It's a dialect of Serbian. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Old Cyrillic Yest
2012/11/12 QSJN 4 UKR qsjn4ukr at gmail dot com wrote: Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... 2012/11/23 Doug Ewell d...@ewellic.org How many truly different letters, old and new, are we talking about? On November 12 you wrote, UKRAINIAN IE and BROAD YEST is the same letter in fact. It would not make sense to assign a new BROAD YEST letter if it is really the same as UKRAINIAN IE, and if existing texts already use UKRAINIAN IE to represent it.

Full picture (Meaning - Glyph - Codepoint):

Old Church Slavonic:
- Narrow Yest (regular form) - very narrow half-moon - 0404/0454 (ambiguous) and 0415/0435 (probably the wrong glyph will be rendered); there are no certain codepoints.
- Broad Yest (special form: initial, plural disambiguator) - broad half-moon, identical to Ukrainian Ie or perhaps somewhat larger (breaking the baseline) - 0404/0454 indeed.

Modern imitations of Church Slavonic, really old texts, or texts where Broad and Narrow Yest are hard to distinguish:
- Ambiguous Yest - identical to Ukrainian Ie, or like Narrow Yest in an old-style font - 0404/0454, surely.

Modern languages:
- Ie - rectangular capital / closed rounded small (identical to Latin) - 0415/0435.
- Ukrainian Ie - identical to ambiguous Yest - 0404/0454.

So there are two steps. First, required: a separate codepoint for the Narrow Yest. It is just impossible to work with Church Slavonic texts without this, because the wrong glyph is rendered almost always (you must understand, we cannot rely on language detection, since such a text is certain to contain a mix: old text with a modern translation) - or else there is no way to show the Broad Yest at all. Second, optional: a separate codepoint for the Broad Yest. That is only necessary if one part of a text contains the ambiguous Yests (coded as now, 0404/0454, without changes!) while another part contains the Broad Yests and the author can/wants to show this feature. Am I the only man in the world who thinks that Unicode is poorly adapted for Church Slavonic?
Re: Old Cyrillic Yest
2013/1/29 QSJN 4 UKR qsjn4...@gmail.com I found something terrible. Sorry, I did not take a photo. It is a modern book with this text of Meletius Smotrytsky's Grammar [http://litopys.org.ua/smotrgram/sm11.htm], but a reprint, not a facsimile like the one I refer to. It gives the rules for using BROAD YEST and NARROW YEST. The modern publisher used GREEK EPSILON and UKRAINIAN IE to show NARROW and BROAD YEST. Hah! Try to guess which is which. The funniest part is the examples: тѣм творцєм - тым творцєм (it has to be тѣм творцεм (singular) - тым творцєм (plural)). :) :) :) Or vice versa: plural - singular. I didn't get it!
Re: Old Cyrillic Yest
I found something terrible. Sorry, I did not take a photo. It is a modern book with this text of Meletius Smotrytsky's Grammar [http://litopys.org.ua/smotrgram/sm11.htm], but a reprint, not a facsimile like the one I refer to. It gives the rules for using BROAD YEST and NARROW YEST. The modern publisher used GREEK EPSILON and UKRAINIAN IE to show NARROW and BROAD YEST. Hah! Try to guess which is which. The funniest part is the examples: тѣм творцєм - тым творцєм (it has to be тѣм творцεм (singular) - тым творцєм (plural)).
Re: Old Cyrillic Yest
On 29 Nov 2012, at 08:57, QSJN 4 UKR qsjn4...@gmail.com wrote: Yes, maybe, probably. The truly different glyph is the NARROW YEST. The truly special character name belongs to the BROAD YEST, YAKORNOYE YEST, while the narrow one, like the modern UKRAINIAN є, is just IE or YEST. Well, I don't know, would you please read the Wikipedia or something: http://ru.wikipedia.org/wiki/Якорное_Е (N. B. There is only one source reference in the Wiki article. Dark night!). There are ways of making a case for disunification. Qsjn 4 Ukr has not made them. Michael Everson * http://www.evertype.com/
Old Cyrillic Yest
Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... Please regulate it! The Unicode Standard has some codepoints for other broad Cyrillic letters: U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (a misnomer; it is a broad o). Adding new codepoints for the BROAD YEST does not solve the problem: as I said, UKRAINIAN IE and BROAD YEST are in fact the same letter. Adding new codepoints for the NARROW YEST is a bad idea too, since existing texts use U+0404/0454 for NARROW YEST more often than for BROAD YEST (simply because the broad form is rarer). So we need as many as 4 new codepoints in the U+A6xx block, for CYRILLIC CAPITAL and SMALL LETTER BROAD and NARROW YEST. That way we shall be able to use both discernible letters of Old Cyrillic, and we shall not mix them with the modern Ukrainian letters, nor with each other.
Re: Old Cyrillic Yest
Telling font designers how to do their job (even if it's within Unicode's purview, which I doubt) by adding new codepoints is a novel idea, to say the least. Leo On Mon, Nov 12, 2012 at 3:32 AM, QSJN 4 UKR qsjn4...@gmail.com wrote: Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... Please regulate it! The Unicode Standard has some codepoints for other broad Cyrillic letters: U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (a misnomer; it is a broad o). Adding new codepoints for the BROAD YEST does not solve the problem: as I said, UKRAINIAN IE and BROAD YEST are in fact the same letter. Adding new codepoints for the NARROW YEST is a bad idea too, since existing texts use U+0404/0454 for NARROW YEST more often than for BROAD YEST (simply because the broad form is rarer). So we need as many as 4 new codepoints in the U+A6xx block, for CYRILLIC CAPITAL and SMALL LETTER BROAD and NARROW YEST. That way we shall be able to use both discernible letters of Old Cyrillic, and we shall not mix them with the modern Ukrainian letters, nor with each other.
Re: Old Cyrillic Yest
QSJN 4 UKR qsjn4ukr at gmail dot com wrote: Old Cyrillic letter YEST (Є) has two variants: broad (also called Yakornoye Yest) and narrow. They survive in the modern Ukrainian script (only), where U+0404/0454 UKRAINIAN IE is used for the inherited BROAD YEST and the modern, rectangular form of U+0415/0435 IE for the NARROW YEST. The Unicode Standard has a remark to use U+0404 for the Old Cyrillic YEST, but it is unclear how to distinguish the BROAD YEST and the NARROW YEST. Unfortunately some fonts use U+0404/0454 for any YEST and U+0415/0435 for the modern rectangular IE, some old-style fonts use only the old YEST but at codepoint U+0415/0435 and do not use U+0404/0454 at all, and some use U+0404/0454 for the BROAD YEST and U+0415/0435 for the NARROW YEST... Please regulate it! The Unicode Consortium does not regulate this aspect of fonts, nor should it, except to say that glyphs have to represent the true abstract character, and not display, say, a B-like glyph at the code point for the letter A. If you are saying that Chapter 7.4 of TUS needs a description of these two abstract characters, that seems fair, but that is as far as the regulating goes. The Unicode Standard has some codepoints for other broad Cyrillic letters: U+A64C/A64D BROAD OMEGA, U+047A/047B ROUND OMEGA (a misnomer; it is a broad o). Adding new codepoints for the BROAD YEST does not solve the problem: as I said, UKRAINIAN IE and BROAD YEST are in fact the same letter. Adding new codepoints for the NARROW YEST is a bad idea too, since existing texts use U+0404/0454 for NARROW YEST more often than for BROAD YEST (simply because the broad form is rarer). So we need as many as 4 new codepoints in the U+A6xx block, for CYRILLIC CAPITAL and SMALL LETTER BROAD and NARROW YEST. That way we shall be able to use both discernible letters of Old Cyrillic, and we shall not mix them with the modern Ukrainian letters, nor with each other. This would create duplicate encodings for existing text, a Bad Thing. If this is genuinely a problem, the improved explanation in Chapter 7.4 (above) would be a better solution. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
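[Editorial addendum] The "duplicate encoding" trap Doug warns about can be illustrated with an existing confusable pair: Latin A (U+0041) and Cyrillic А (U+0410) render identically in most fonts, yet no normalization form equates them, so comparison and search silently fail. A second codepoint for Yest would create the same hazard inside Cyrillic itself. A short Python illustration:

```python
import unicodedata

latin = 'A'       # U+0041 LATIN CAPITAL LETTER A
cyr = '\u0410'    # U+0410 CYRILLIC CAPITAL LETTER A

# Visually identical in most fonts, but distinct characters: no Unicode
# normalization form unifies them, so string matching breaks silently.
assert latin != cyr
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    assert unicodedata.normalize(form, latin) != unicodedata.normalize(form, cyr)

print(unicodedata.name(cyr))  # CYRILLIC CAPITAL LETTER A
```

Existing text encoded with U+0404/0454 would face exactly this problem against any newly disunified Yest codepoints.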
Re: [indic] Indic Transliteration Standards in Cyrillic & Greek
On Sat, Nov 10, 2012 at 12:49 PM, Vinodh Rajan vinodh.vin...@gmail.com wrote: Hi, There are several standards for transliterating Indic scripts to Roman characters, such as IAST, ISO 15919 etc. I would like to know if any similar standards exist for expressing the Indic set in Greek & Cyrillic with special diacritics. If they do exist, any pointers to their Unicode representations? Thanks V -- http://www.virtualvinodh.com Vinodh, These resources will help: http://transliteration.eki.ee/pdf/Russian.pdf http://en.wikipedia.org/wiki/Scientific_transliteration_of_Cyrillic http://learningrussian.net/pronunciation/transliteration.php N. Ganesan
Re: [indic] Re: Indic Transliteration Standards in Cyrillic & Greek
On Sat, Nov 10, 2012 at 3:02 PM, John Hudson j...@tiro.ca wrote: I'm sorry, I misread the original question. I'm not aware of particular Cyrillic or Greek transcription systems for Indic scripts or languages. My suspicion is that Russian systems exist, given the historic interests of Russian linguistic studies. I'm doubtful that Greek systems exist, but would be happy to be proven wrong. JH My guess is that Vinodh wants to add the capacity to convert from Indic to Cyrillic scripts: one way would be to use the Latin letters and then convert to Cyrillic, http://transliteration.eki.ee/pdf/Russian.pdf But for some letters, say in Tamil, there won't be equivalents in Cyrillic. N. Ganesan
Indic Transliteration Standards in Cyrillic & Greek
Hi, There are several standards for transliterating Indic scripts to Roman characters, such as IAST, ISO 15919 etc. I would like to know if any similar standards exist for expressing the Indic set in Greek & Cyrillic with special diacritics. If they do exist, any pointers to their Unicode representations? Thanks V -- http://www.virtualvinodh.com
Re: Indic Transliteration Standards in Cyrillic & Greek
At the least, there should exist conventions in all languages for transliterating into their own script an IPA representation (used as a central phonetic transcription, where the source language would be noted using its subset of IPA, representing its underlying phonology rather than one particular phonetic realization). These phonologic IPA representations should then find a good approximation in the target (script/language) pair, in order to produce consistent phonologic transcriptions that are read correctly in the target language. Pure transliterations are most often unreadable, or read very incorrectly (even if the target language has good support for representing the most frequent realizations of a phonologic phoneme of the source language). This scheme could also help transcriptions from one language to another that share the same script (e.g. English cheese transcribed in French as tchise, ignoring the representation of long vowels, which are not heard in the target French, or tchiise, but not tchīse, as the macron is not read distinctly in French). You may argue that we don't need this because we already have IPA, but IPA is unreadable by most people, and there is still the need to use more conventional symbols (and IPA is completely unreadable for readers of scripts other than Latin, Greek or Cyrillic). The application would be to transliterate people's names or toponyms in postal addresses, contact lists or administrative forms to be used in foreign countries where people can't decipher other scripts (such as Arabic or sinograms), or in airports for travelling, or to avoid people inventing their own choice of name in another script, in such a way that the chosen name is not registered and verifiable anywhere (unless these people have officially registered their alternate usage names in their own country, but very few countries permit the registration of such usage names by individual people). As for those countries that allow registration of people's names in scripts other than the national one, most often they will only allow the use of the Latin script (and frequently a very restricted subset of it), but not Arabic, Greek, Cyrillic, or Japanese kanas. To help this process, those countries use their own national standard of transliterators to the Latin script (i.e. romanizations), simply because it is the most widely known and used internationally (and in all computer applications), and they have no other support for registering additional usage names in other scripts, or for registering additional usage names that would depend on the target language (so the single supported romanization will also be read incorrectly in many target languages, or could be offensive in those target languages, and travellers may want to use another usage name in those target countries). 2012/11/10 Vinodh Rajan vinodh.vin...@gmail.com: Hi, There are several standards for transliterating Indic scripts to Roman characters, such as IAST, ISO 15919 etc. I would like to know if any similar standards exist for expressing the Indic set in Greek & Cyrillic with special diacritics. If they do exist, any pointers to their Unicode representations? Thanks V -- http://www.virtualvinodh.com
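[Editorial addendum] Philippe's two-step scheme (source orthography → phonemic IPA → target orthography) can be sketched as two lookup passes. Everything below is a toy: the tiny tables cover only the "cheese" → "tchise" example from his message, and every mapping in them is an illustrative assumption, not a real transcription system:

```python
# Toy two-stage transcription: source orthography -> phonemic IPA ->
# target orthography. The tables below cover just the English word
# "cheese"; all mappings are illustrative assumptions.
ENGLISH_TO_IPA = {'ch': 'tʃ', 'ee': 'iː', 'se': 'z'}
IPA_TO_FRENCH = {'tʃ': 'tch', 'iː': 'i', 'z': 'se'}

def transcribe(word: str, to_ipa: dict, from_ipa: dict) -> str:
    # Greedy longest-match segmentation over the source table, then a
    # second lookup from the IPA pivot into the target orthography.
    out, i = [], 0
    while i < len(word):
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in to_ipa:
                out.append(from_ipa[to_ipa[chunk]])
                i += length
                break
        else:
            out.append(word[i])  # pass through anything unmapped
            i += 1
    return ''.join(out)

print(transcribe('cheese', ENGLISH_TO_IPA, IPA_TO_FRENCH))  # tchise
```

The point of the pivot is that each language pair needs only two tables (orthography↔IPA) rather than a table per pair of languages.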
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
We've got the example of the ISO 9 standard itself. On 5 March 2012 at 22:46, Michael Everson ever...@evertype.com wrote: On 5 Mar 2012, at 20:13, Benjamin M Scarborough wrote: There is a clear precedent here that the unifications of N2463 are not necessarily the final fate of any of these characters. If the О Е letter for Selkup should be disunified from U+0152/U+0153, then a proposal needs to be submitted calling for the addition of the two letters to the UCS. Have you got examples, Ben? Michael Everson * http://www.evertype.com/
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
On Tue, Feb 28, 2012 at 4:00 AM, Philippe Verdy verd...@wanadoo.fr wrote: I am looking for the codes or assignment status of the Cyrillic letters OE/oe (ligatured) as used in Selkup (exactly similar to the Latin pair). This character pair has been part of registration nr. 223 (1998) by ISO of the (8-bit) extended Cyrillic character set for non-Slavic languages for bibliographic information interchange: http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf According to this document, this character set had also been standardized as ISO 10756:1996. Note that it contains many other characters for which it did not document any mapping to the UCS in the then-emerging ISO 10646 standard. It was even part of proposals at the UTC and ISO that same year for inclusion in the UCS, along with other characters (at that time, Michael Everson wrote a proposal placing them at U+04EC, U+04ED, but since then those slots have been used for other characters; that block is now full). It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard. Still, there's no such Cyrillic character I can find in the encoded UCS in the other Cyrillic extended blocks that are not full (for example, the CYRILLIC SUPPLEMENT block at U+0500-052F). Where are those characters? And what about the remaining characters found in Registration nr. 223 and ISO 10756:1996? And their status in the ISO 9 standard itself? Thanks. -- Philippe. According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the Cyrillic Selkup OE is mapped to Latin OE: CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE Several other of those missing Cyrillic characters are simply mapped to Latin ones or sort of decomposed. - Denis Moyogo Jacquerye
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
On Mon, Mar 5, 2012 at 19:35, Denis Jacquerye wrote: According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the Cyrillic Selkup OE is mapped to Latin OE: CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE Several other of those missing Cyrillic characters are simply mapped to Latin ones or sort of decomposed. N2463 also maps twelve characters from ISO 10574 that have been disunified since 2002, namely: 04/06 CYRILLIC SMALL LETTER KURDISH QA is now U+051B CYRILLIC SMALL LETTER QA 04/09 CYRILLIC SMALL LETTER EL WITH MIDDLE HOOK is now U+0521 CYRILLIC SMALL LETTER EL WITH MIDDLE HOOK 04/10 CYRILLIC SMALL LETTER MORDVIN EL KA is now U+0515 CYRILLIC SMALL LETTER LHA 04/14 CYRILLIC SMALL LETTER EN WITH MIDDLE HOOK is now U+0523 CYRILLIC SMALL LETTER EN WITH MIDDLE HOOK 05/06 CYRILLIC CAPITAL LETTER KURDISH QA is now U+051A CYRILLIC CAPITAL LETTER QA 05/09 CYRILLIC CAPITAL LETTER EL WITH MIDDLE HOOK is now U+0520 CYRILLIC CAPITAL LETTER EL WITH MIDDLE HOOK 05/10 CYRILLIC CAPITAL LETTER MORDVIN EL KA is now U+0514 CYRILLIC CAPITAL LETTER LHA 05/14 CYRILLIC CAPITAL LETTER EN WITH MIDDLE HOOK is now U+0522 CYRILLIC CAPITAL LETTER EN WITH MIDDLE HOOK 06/03 CYRILLIC SMALL LETTER ER KA is now U+0517 CYRILLIC SMALL LETTER RHA 06/08 CYRILLIC SMALL LETTER KURDISH WE is now U+051D CYRILLIC SMALL LETTER WE 07/03 CYRILLIC CAPITAL LETTER ER KA is now U+0516 CYRILLIC CAPITAL LETTER RHA 07/08 CYRILLIC CAPITAL LETTER KURDISH WE is now U+051C CYRILLIC CAPITAL LETTER WE There is a clear precedent here that the unifications of N2463 are not necessarily the final fate of any of these characters. If the О Е letter for Selkup should be disunified from U+0152/U+0153, then a proposal needs to be submitted calling for the addition of the two letters to the UCS. It is worth noting that N2463 also decomposes four characters using U+0335, a practice which hasn't been used for decompositions since Unicode 1.1. 
I also don't understand the mapping of 04/05 CYRILLIC SMALL LETTER CHECHEN KA and 05/05 CYRILLIC CAPITAL LETTER CHECHEN KA to U+043A CYRILLIC SMALL LETTER KA + U+030A COMBINING RING ABOVE and U+041A CYRILLIC CAPITAL LETTER KA + U+030A COMBINING RING ABOVE, respectively. Is the character shown in ISO 10574 just a glyph variant of this combining sequence? —Ben Scarborough
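[Editorial addendum] The disunifications Ben lists can be spot-checked against any current copy of the Unicode Character Database, for instance via Python's `unicodedata` module (the names below are taken directly from his list; the script is just a convenience check):

```python
import unicodedata

# Spot-check a few of the disunified Cyrillic Supplement characters
# listed above against the UCD shipped with Python.
expected = {
    '\u051A': 'CYRILLIC CAPITAL LETTER QA',
    '\u051B': 'CYRILLIC SMALL LETTER QA',
    '\u0514': 'CYRILLIC CAPITAL LETTER LHA',
    '\u0515': 'CYRILLIC SMALL LETTER LHA',
    '\u051C': 'CYRILLIC CAPITAL LETTER WE',
    '\u051D': 'CYRILLIC SMALL LETTER WE',
}
for ch, name in expected.items():
    assert unicodedata.name(ch) == name
print('all names match the UCD')
```

This confirms that the once-unified Kurdish and Mordvin letters now have their own Cyrillic codepoints, which is the precedent Ben cites for a possible Selkup OE disunification.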
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
Le 5 mars 2012 19:35, Denis Jacquerye moy...@gmail.com a écrit : On Tue, Feb 28, 2012 at 4:00 AM, Philippe Verdy verd...@wanadoo.fr wrote: I am looking for the codes or assignements status of the Cyrillic letter OE/oe (ligatured) as used in Selkup (exactly similar to the Latin pair). This character pair has been part of the registration nr. 223 (in 1998) by ISO of the (8-bit) extended Cyrillic character set for non-Slavic languages for bibliographic information interchange : http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf According to this document, this character set had also been standardized as ISO 10756:1996. Note that it contains many other characters for which it did not document any mapping to the UCS in the then emerging ISO 10646 standard. It has even been part of proposals at the UTC and ISO the same year for including in the UCS, along with other characters (at that time, Michael Everson wrote a proposal, placing them in U+04EC, U+04ED, but since the, the slots have been used for other characters (that block is now full). It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard. Still, there's no Cyrillic character I can find in the encoded UCS in other Cyrillic extended blocks that are not full (for example, the CYRILLIC SUPPLEMENT block at U+0500-052F). Where are those characters ? And what about the remaining characters found in the Registration nr. 223 and ISO 10756:1996 ? And their status in the ISO 9 standard itself ? Thanks. -- Philippe. According to ftp://std.dkuug.dk/jtc1/sc2/WG2/docs/n2463.doc the Cyrillic Selkup OE is mapped to Latin OE: CYRILLIC SMALL LETTER SELKUP O E to U+0153 LATIN SMALL LIGATURE OE CYRILLIC CAPITAL LETTER SELKUP O E to U+0152 LATIN CAPITAL LIGATURE OE Several other of those missing Cyrillic characters are simply mapped to Latin ones or sort of decomposed. Apparently this document is obsolete. 
Some of the proposed mappings to Latin have been encoded as plain Cyrillic letters, such as CYRILLIC SMALL LETTER KURDISH QA (not the initially proposed mapping to LATIN SMALL LETTER Q). This document was still a draft, not a decision. The document specifically says: "The issue with these letters is whether they should be deunified from Latin, and encoded in the Cyrillic block."
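Both outcomes mentioned above can be verified against current Unicode data: the proposed Selkup OE mapping targets are the Latin OE ligature code points, while the Kurdish letters ended up disunified as Cyrillic QA and WE in the Cyrillic Supplement block. A small Python check with the standard `unicodedata` module:

```python
import unicodedata

# N2463's proposed mapping targets for the Selkup OE pair:
assert ord(unicodedata.lookup("LATIN CAPITAL LIGATURE OE")) == 0x0152
assert ord(unicodedata.lookup("LATIN SMALL LIGATURE OE")) == 0x0153

# By contrast, the Kurdish letters were eventually disunified from
# Latin Q and W and given their own Cyrillic code points:
for cp in (0x051A, 0x051B, 0x051C, 0x051D):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
# U+051A  CYRILLIC CAPITAL LETTER QA
# U+051B  CYRILLIC SMALL LETTER QA
# U+051C  CYRILLIC CAPITAL LETTER WE
# U+051D  CYRILLIC SMALL LETTER WE
```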
Re: CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
On 5 Mar 2012, at 20:13, Benjamin M Scarborough wrote: There is a clear precedent here that the unifications of N2463 are not necessarily the final fate of any of these characters. If the О Е letter for Selkup should be disunified from U+0152/U+0153, then a proposal needs to be submitted calling for the addition of the two letters to the UCS. Have you got examples, Ben? Michael Everson * http://www.evertype.com/
CYRILLIC SMALL/CAPITAL LETTER SELKUP OE (ISO 10756:1996)
I am looking for the codes or assignment status of the Cyrillic letter OE/oe (ligatured) as used in Selkup (exactly like the Latin pair). This character pair has been part of registration nr. 223 (1998) by ISO of the (8-bit) extended Cyrillic character set for non-Slavic languages for bibliographic information interchange: http://www.itscj.ipsj.or.jp/sc2/open/02n3136.pdf According to this document, this character set had also been standardized as ISO 10756:1996. Note that it contains many other characters for which it did not document any mapping to the UCS in the then-emerging ISO 10646 standard. It was even part of proposals at the UTC and ISO the same year for inclusion in the UCS, along with other characters (at that time, Michael Everson wrote a proposal placing them at U+04EC, U+04ED, but since then, those slots have been used for other characters; that block is now full). It is also referenced in the ISO 9 Cyrillic/Latin transliteration standard. Still, there's no such Cyrillic character I can find in the encoded UCS in other Cyrillic extended blocks that are not full (for example, the CYRILLIC SUPPLEMENT block at U+0500-052F). Where are those characters? And what about the remaining characters found in registration nr. 223 and ISO 10756:1996? And their status in the ISO 9 standard itself? Thanks. -- Philippe.
Re: Are Latin and Cyrillic essentially the same script?
On 22 Nov 2010, at 18:55, Asmus Freytag wrote: That seems to be true for IPA as well - because already, if you use the font binding for IPA, your a's and g's will not come out right, which means you don't even have to worry about betas and chis. Not so. There is already a convention (going back to the late 19th or early 20th century) about handling this. In an ordinary Times-like font, a slopes and loses its hat when italicized. In an ordinary Times-like font, ɑ is replaced by an italic Greek α (alpha). Michael Everson * http://www.evertype.com/
Re: Are Latin and Cyrillic essentially the same script?
On 19 Nov 2010, at 07:15, Peter Constable wrote: And while IPA is primarily based on Latin script, not all of its characters are Latin characters: bilabial and interdental fricative phonemes are represented using Greek letters beta and theta. IPA beta and chi behave very differently from their Greek antecedents and should not remain unified. The case for theta is messier because theta is so very messy. Michael Everson * http://www.evertype.com/
Re: Are Latin and Cyrillic essentially the same script?
On 19 Nov 2010, at 17:09, Peter Constable wrote: And historic texts aren’t as likely or unlikely to require specialized fonts? Twenty years of historic text in Tatar isn't irrelevant. It's also a notational system that requires specific training in its use, And working with historic texts doesn’t require specific training? Not in terms of Jaŋalif. The training you need there is just learning to read the language in another alphabet. IPA is more complex than that, especially if you go for close transcription. While several orthographies have been based on IPA, my understanding is that some of them saw the encoding of additional characters to make them work as orthographies. Again, I don’t see how that impacts this particular case. This particular case is analogous to the borrowing of Q and W into Cyrillic from Latin. By the way, I understand that there are many people who would like to revert to the Latin orthography for these Turkic languages. At present Russian law forbids this, but it is not the case that one may expect that this orthography will always remain historic. It boils down to this: just as there aren’t technical or usability reasons that make it problematic to represent IPA text using two Greek characters in an otherwise-Latin system, Yes there are. Sorting multilingual text including Greek and IPA transcriptions, for one. The glyph shape for IPA beta is practically unknown in Greek. Latin capital Chi is not the same as Greek capital chi. so also there are no technical or usability reasons I’m aware of why it is problematic to represent this historic Janalif orthography using two Cyrillic characters. They are the same technical and usability reasons which led to the disunification of Cyrillic Ԛ and Ԝ from Latin Q and W. Michael Everson * http://www.evertype.com/
Re: Are Latin and Cyrillic essentially the same script?
On 11/22/2010 4:15 AM, Michael Everson wrote: It boils down to this: just as there aren’t technical or usability reasons that make it problematic to represent IPA text using two Greek characters in an otherwise-Latin system, Yes there are. Sorting multilingual text including Greek and IPA transcriptions, for one. The glyph shape for IPA beta is practically unknown in Greek. Latin capital Chi is not the same as Greek capital chi. so also there are no technical or usability reasons I’m aware of why it is problematic to represent this historic Janalif orthography using two Cyrillic characters. They are the same technical and usability reasons which led to the disunification of Cyrillic Ԛ and Ԝ from Latin Q and W. The sorting problem I think I understand. Because scripts are kept together in sorting, when you have a mixed-script list, you normally override just the sorting for the script to which the (sort-)language belongs. A mixed French-Russian list would use French ordering for the Latin characters, but the Russian words would all appear together (and be sorted according to some generic sort order for Cyrillic characters - except that for a bilingual list, sorting the Cyrillic according to Russian rules might also make sense). Same for a French-Greek list. The Greek characters will be together and sorted either by a generic Greek (script) sort, or a specific Greek (language) sort. When you sort a mixed list of IPA and Greek, the beta and chi will now sort with the Latin characters, in whatever sort order applies for IPA. That means the order of all Greek words in the list will get messed up. It will neither be a generic Greek (script) sort, nor a specific Greek (language) sort, because you can't tailor the same characters two different ways in the same sort. That's the problem I understand is behind the issue with the Kurdish Q and W, and with the character pair proposed for disunification for Janalif. 
Perhaps, it seems, there are some technical problems that would make the support for such mixed-script orthographies not as seamless as for regular orthographies after all. In that case, a decision would boil down to whether these technical issues are significant enough (given the usage). In other words, it becomes a cost-benefit analysis. Duplication of characters (except where their glyphs have acquired a different appearance in the other context) always has a cost in added confusability. Users can select the wrong character accidentally, spoofers can do so intentionally to try to cause harm. But Unicode was never just a list of distinct glyphs, so duplication between Latin and Greek, or Latin and Cyrillic is already widespread, especially among the capitals. Unlike what Michael claims for IPA, the Janalif characters don't seem to have a very different appearance, so there would not be any technical or usability issue there. Minor glyph variations can be handled by standard technologies, like OpenType, as long as the overall appearance remains legible should language binding of a text have gotten lost. That seems to be true for IPA as well - because already, if you use the font binding for IPA, your a's and g's will not come out right, which means you don't even have to worry about betas and chis. IPA being a notation, I would not be surprised to learn that mixed lists with both IPA and other terms are a rare thing. But for Janalif it would seem that mixed Janalif/Cyrillic lists would be rather common, relative to the size of the corpus, even if it's a dead (or currently out of use) orthography. I'd like to see this addressed a bit more in detail by those who support the decision to keep the borrowed characters unified. A./
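The tailoring conflict described above can be sketched with a toy sort. The script classifier below is a deliberately crude stand-in based on block ranges (real collation would use the Unicode Script property and UCA tailorings, e.g. via ICU), but it shows the core problem: a Greek word and an IPA transcription both beginning with U+03B2 β necessarily land in the same group, because one code point cannot be tailored two different ways in the same sort.

```python
# Crude block-based script classifier (illustration only, not real collation).
def script_of(ch: str) -> int:
    cp = ord(ch)
    if 0x0370 <= cp <= 0x03FF:    # Greek and Coptic block
        return 1
    if 0x0400 <= cp <= 0x052F:    # Cyrillic + Cyrillic Supplement
        return 2
    return 0                      # everything else counted as Latin here

def sort_key(word: str):
    # Group words by the script of their first letter, then by raw code point.
    return (script_of(word[0]), word)

# A Latin word, a Greek word, and an IPA transcription starting with β.
words = ["beta", "\u03b2\u03ac\u03c3\u03b7", "\u03b2\u0251t"]
print(sorted(words, key=sort_key))
```

The IPA entry is pulled into the Greek group and interleaves with the Greek words there, in an order that follows neither Greek nor IPA conventions.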
Re: Are Latin and Cyrillic essentially the same script?
On 11/18/2010 11:15 PM, Peter Constable wrote: If you'd like a precedent, here's one: Yes, I think discussion of precedents is important - it leads to the formulation of encoding principles that can then (hopefully) result in more consistency in future encoding efforts. Let me add the caveat that I fully understand that character encoding doesn't work by applying cook-book style recipes, and that principles are better phrased as criteria for weighing a decision rather than as formulaic rules. With these caveats, then: IPA is a widely-used system of transcription based primarily on the Latin script. In comparison to the Janalif orthography in question, there is far more existing data. Also, whereas that Janalif orthography is no longer in active use--hence there are not new texts to be represented (there are at best only new citations of existing texts), IPA is a writing system in active use with new texts being created daily; thus, the body of digitized data for IPA is growing much more than is data in the Janalif orthography. And while IPA is primarily based on Latin script, not all of its characters are Latin characters: bilabial and interdental fricative phonemes are represented using Greek letters beta and theta. IPA has other characteristics in both its usage and its encoding that you need to consider to make the comparison valid. First, IPA requires specialized fonts because it relies on glyphic distinctions that fonts not designed for IPA use will not guarantee. (Latin a with and without hook, g with hook vs. two stories are just two examples). It's also a notational system that requires specific training in its use, and it is caseless - in distinction to ordinary Latin script. While several orthographies have been based on IPA, my understanding is that some of them saw the encoding of additional characters to make them work as orthographies. 
Finally, IPA, like other phonetic notations, uses distinctions between letter forms on the character level that would almost always be relegated to styling in ordinary text. Because of these special aspects of IPA, I would class it in its own category of writing systems, which makes it less useful as a precedent against which to evaluate general Latin-based orthographies. Given a precedent of a widely-used Latin writing system for which it is considered adequate to have characters of central importance represented using letters from a different script, Greek, it would seem reasonable if someone made the case that it's adequate to represent an historic Latin orthography using Cyrillic soft sign. I think the question can and should be asked, what is adequate for a historic orthography. (I don't know anything about the particulars of Janalif, beyond what I read here, so for now, I accept your categorization of it as if it were fact). The precedent for historic orthographies is a bit uneven in Unicode. Some scripts have extensive collections of characters (even duplicates or near duplicates) to cover historic usage. Other historic orthographies cannot be fully represented without markup. And some are now better supported than at the beginning because the encoding has plugged certain gaps. A helpful precedent in this case would be that of another minority or historic orthography, or historic minority orthography, for which the use of Greek or Cyrillic characters with Latin was deemed acceptable. I don't think Janalif is totally unique (although the others may not be dead). I'm thinking of the Latin OU that was encoded based on a Greek ligature, and the perennial question of the Kurdish Q and W (Latin borrowings into Cyrillic - I believe these are now 051A and 051C). Again, these may be for living orthographies. 
Against this backdrop, it would help if WG2 (and UTC) could point to agreed-upon criteria that spell out what circumstances should favor, and what circumstances should disfavor, formal encoding of borrowed characters, in the LGC script family or in the general case. That's the main point I'm trying to make here. I think it is not enough to somehow arrive at a decision for one orthography, but it is necessary for the encoding committees to grab hold of the reasoning behind that decision and work out how to apply consistent reasoning like that in future cases. This may still feel a little bit unsatisfactory for those whose proposal is thus becoming the test-case to settle a body of encoding principles, but to that I say, there's been ample precedent for doing it that way in Unicode and 10646. So let me ask these questions: A. What are the encoding principles that follow from the disposition of the Janalif proposal? B. What precedents are these based on, and what precedents are consciously established by this decision? A./
RE: Are Latin and Cyrillic essentially the same script?
From: Asmus Freytag [mailto:asm...@ix.netcom.com] IPA has other characteristics in both its usage and its encoding that you need to consider to make the comparison valid. First, IPA requires specialized fonts because it relies on glyphic distinctions that fonts not designed for IPA use will not guarantee. And historic texts aren’t as likely or unlikely to require specialized fonts? It's also a notational system that requires specific training in its use, And working with historic texts doesn’t require specific training? and it is caseless - in distinction to ordinary Latin script. I could understand how that might be relevant if we were discussing a character borrowed from another script but with different casing behaviour in the original script. (E.g., the character is caseless in the original script, or it is cased but only the lowercase was borrowed and a novel uppercase character was created in the receptor script. This was a valid consideration in the encoding of Lisu, for instance.) I don’t really see how that impacts the discussion in this particular case. While several orthographies have been based on IPA, my understanding is that some of them saw the encoding of additional characters to make them work as orthographies. Again, I don’t see how that impacts this particular case. Finally, IPA, like other phonetic notations, uses distinctions between letter forms on the character level that would almost always be relegated to styling in ordinary text. And again, I don’t see how this impacts the particular case under discussion. Because of these special aspects of IPA, I would class it in its own category of writing systems which makes it less useful as a precedent against which to evaluate general Latin-based orthographies. Perhaps in general it cannot serve as a precedent for all things. But as noted, I think several of the things you noted have no particular bearing in this case. 
For the specific issue of borrowing a character from another script in a historic orthography, I think it’s a perfectly valid precedent. It boils down to this: just as there aren’t technical or usability reasons that make it problematic to represent IPA text using two Greek characters in an otherwise-Latin system, so also there are no technical or usability reasons I’m aware of why it is problematic to represent this historic Janalif orthography using two Cyrillic characters. Btw, I suspect that calling these Latin characters is completely revisionist: if we could ask anyone that taught or used this orthography in 1930 about these characters, I suspect they would say that they are Cyrillic characters. I think the question can and should be asked, what is adequate for a historic orthography. Clearly you’re trying to have a discussion about general principles, not about the specific characters. At the moment, I’m prepared to discuss general principles to the extent that they impinge on the particular case at hand. Others may wish to engage in a broader discussion of general principles (though, hopefully, under a different subject). Against this backdrop, it would help if WG2 (and UTC) could point to agreed-upon criteria that spell out what circumstances should favor, and what circumstances should disfavor, formal encoding of borrowed characters, in the LGC script family or in the general case. That's the main point I'm trying to make here. I think it is not enough to somehow arrive at a decision for one orthography, but it is necessary for the encoding committees to grab hold of the reasoning behind that decision and work out how to apply consistent reasoning like that in future cases. These are not unreasonable requests. I don’t see any inconsistency in practice as it relates to this particular case, however. So let me ask these questions: A. What are the encoding principles that follow from the disposition of the Janalif proposal? 
I think one principle is that we do not always have to maintain a principle of orthographic script purity. In particular, in the case of historic orthographies no longer in active use that borrowed characters from another script in the LGC family, if there are no technical or usability reasons that make it problematic to represent those text elements using existing characters from the source script, then it is not necessary to encode equivalents in the receptor script so that we can say that the historic orthography is a pure-Latin / pure-Greek / pure-Cyrillic orthography (which, in terms of social history rather than character encoding, would likely be a revisionist perspective). B. What precedents are these based on resp. what precedents are consciously established by this decision? I'm not sure I fully understand the question so won't venture a comment. Peter
RE: Are Latin and Cyrillic essentially the same script?
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. Peter
Re: Are Latin and Cyrillic essentially the same script?
On 11/18/2010 8:04 AM, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. When one language borrows a word from another, there are several stages of foreignness, ranging from treating the foreign word as a short quotation in the original language to treating it as essentially fully native. Now words are very complex in behavior and usage compared to characters. You can check pronunciation, spelling and adaptation to the host grammar to determine which stage of adaptation a word has reached. When a script borrows a letter from another, you are essentially limited in what evidence you can use to document objectively whether the borrowing has crossed over the script boundary and the character has become native. With typographically closely related scripts, getting tell-tale typographical evidence is very difficult. After all, these scripts started out from the same root. So, you need some other criteria. You could individually compare orthographies and decide which ones are important enough (or established enough) to warrant support. Or you could try to distinguish between orthographies for general use within the given language, vs. other systems of writing (transcriptions, say). But whatever you do, you should be consistent and take account of existing precedent. 
There are a number of characters encoded as nominally Latin in Unicode that are borrowings from other scripts, usually Greek. A discussion of the current issue should include explicit explanation of why these precedents apply or do not apply, and, in the latter case, why some precedents may be regarded as examples of past mistakes. By explicitly analyzing existing precedents, it should be possible to avoid the impression that the current discussion is focused on the relative merits of a particular orthography based on personal and possibly arbitrary opinions by the work group experts. If it can be shown that all other cases where such borrowings were accepted into Unicode are based on orthographies that are more permanent, more widespread or both, or where other technical or typographical reasons prevailed that are absent here, then it would make any decision on the current request seem a lot less arbitrary. I don't know where the right answer lies in the case of Janalif, or which point of view, in Peter's phrasing, would make the most sense, but having this discussion without clear understanding of the precedents will lead to inconsistent encoding. A./
pupil's comment: Are Latin and Cyrillic essentially the same script?
Dear all, I still see myself as a pupil reading the introduction chart of Unicode, but I am happy to join the discussion on Russian: it is quite different from Latin. Apart from the 33 characters in the Russian alphabet (= more characters) and apart from quite a few characters that as an English speaker you clearly do not know, Latin and Russian indeed contain some similar characters. But watch out. There are, if I am correct, 3 a's in the world; in this email a (Latin) looks like a (Russian) but they are different. So the Russian a is quite suited for a homograph attack (I will try ontslag.com, which is Dutch for dismissal.com, to see how search engines react. With a Russian a. The Punycode of the word as a whole is different). Similar example: the Ukrainian i - it looks like ours, but you can't register it on .rf (Russian Federation). An experiment 1 year ago with Reïntegratie.com, being correct Dutch for reintegration but impossible as a domain name because SIDN.nl (supposed to be nic.nl) is very conservative and does not even allow such signs, gave as result: in the beginning Google appreciated it; after a few months the hosted and filled site 'sank'. (I borrowed the ï from Catalan, amidst Latin characters.) 
News about ss / sz for whomever is interested: most Germans were alert (ss-holders had priority for ß), so no Fußball for me, only the experimental domain names IDNexpress.de and IDNexpreß.de. It was a mini-landrush from Nov. 16, 2010, 10:00 German time onwards (Denic.de). Very busy with the .rf auction now; in December I will put 2 different sites on these ss and sz names so people can wonder at their screens to see what is happening. The above reaction came more out of domain names and practical experience than chart UTFxyz - but definitely: a different script. Br, Philippe On 18-11-2010 20:04, Asmus Freytag wrote: On 11/18/2010 8:04 AM, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. When one language borrows a word from another, there are several stages of foreignness, ranging from treating the foreign word as a short quotation in the original language to treating it as essentially fully native. Now words are very complex in behavior and usage compared to characters. You can check pronunciation, spelling and adaptation to the host grammar to determine which stage of adaptation a word has reached. When a script borrows a letter from another, you are essentially limited in what evidence you can use to document objectively whether the borrowing has crossed over the script boundary and the character has become native. 
With typographically closely related scripts, getting tell-tale typographical evidence is very difficult. After all, these scripts started out from the same root. So, you need some other criteria. You could individually compare orthographies and decide which ones are important enough (or established enough) to warrant support. Or you could try to distinguish between orthographies for general use within the given language, vs. other systems of writing (transcriptions, say). But whatever you do, you should be consistent and take account of existing precedent. There are a number of characters encoded as nominally Latin in Unicode that are borrowings from other scripts, usually Greek. A discussion of the current issue should include explicit explanation of why these precedents apply or do not apply, and, in the latter case, why some precedents may be regarded as examples of past mistakes. By explicitly analyzing existing precedents, it should be possible to avoid
RE: Are Latin and Cyrillic essentially the same script?
If you'd like a precedent, here's one: IPA is a widely-used system of transcription based primarily on the Latin script. In comparison to the Janalif orthography in question, there is far more existing data. Also, whereas that Janalif orthography is no longer in active use--hence there are not new texts to be represented (there are at best only new citations of existing texts), IPA is a writing system in active use with new texts being created daily; thus, the body of digitized data for IPA is growing much more than is data in the Janalif orthography. And while IPA is primarily based on Latin script, not all of its characters are Latin characters: bilabial and interdental fricative phonemes are represented using Greek letters beta and theta. Given a precedent of a widely-used Latin writing system for which it is considered adequate to have characters of central importance represented using letters from a different script, Greek, it would seem reasonable if someone made the case that it's adequate to represent an historic Latin orthography using Cyrillic soft sign. Peter -Original Message- From: Asmus Freytag [mailto:asm...@ix.netcom.com] Sent: Thursday, November 18, 2010 11:05 AM To: Peter Constable Cc: André Szabolcs Szelp; Karl Pentzlin; unicode@unicode.org; Ilya Yevlampiev Subject: Re: Are Latin and Cyrillic essentially the same script? On 11/18/2010 8:04 AM, Peter Constable wrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of André Szabolcs Szelp AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. 
It is a Latin letter adapted from the Cyrillic soft sign, There's another possible point of view: that it's a Cyrillic character that, for a short period, people tried using as a Latin character but that never stuck, and that it's completely adequate to represent Janalif text in that orthography using the Cyrillic soft sign. When one language borrows a word from another, there are several stages of foreignness, ranging from treating the foreign word as a short quotation in the original language to treating it as essentially fully native. Now words are very complex in behavior and usage compared to characters. You can check pronunciation, spelling and adaptation to the host grammar to determine which stage of adaptation a word has reached. When a script borrows a letter from another, you are essentially limited in what evidence you can use to document objectively whether the borrowing has crossed over the script boundary and the character has become native. With typographically closely related scripts, getting tell-tale typographical evidence is very difficult. After all, these scripts started out from the same root. So, you need some other criteria. You could individually compare orthographies and decide which ones are important enough (or established enough) to warrant support. Or you could try to distinguish between orthographies for general use within the given language, vs. other systems of writing (transcriptions, say). But whatever you do, you should be consistent and take account of existing precedent. There are a number of characters encoded as nominally Latin in Unicode that are borrowings from other scripts, usually Greek. A discussion of the current issue should include explicit explanation of why these precedents apply or do not apply, and, in the latter case, why some precedents may be regarded as examples of past mistakes. 
By explicitly analyzing existing precedents, it should be possible to avoid the impression that the current discussion is focused on the relative merits of a particular orthography based on personal and possibly arbitrary opinions by the work group experts. If it can be shown that all other cases where such borrowings were accepted into Unicode are based on orthographies that are more permanent, more widespread or both, or where other technical or typographical reasons prevailed that are absent here, then it would make any decision on the current request seem a lot less arbitrary. I don't know where the right answer lies in the case of Janalif, or which point of view, in Peter's phrasing, would make the most sense, but having this discussion without clear understanding of the precedents will lead to inconsistent encoding. A./
Re: Are Latin and Cyrillic essentially the same script?
AFAIR the reservations of WG2 concerning the encoding of Jangalif Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but rather in view of its potential identity with the tone sign mentioned by you as well. It is a Latin letter adapted from the Cyrillic soft sign, like the Jangalif character. Function, as you point out, is not a distinctive feature. The different serif style which you pointed out cannot be seen as a discriminating feature of character identity, especially not in a time of bad typography (and an actual lack of Latin typographic tradition in the China of the time). /Sz On Wed, Nov 10, 2010 at 5:08 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf = L2/10-356, there exists a Latin letter which resembles the Cyrillic soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif variant of the alphabet, which was used for several languages in the former Soviet Union (e.g. Tatar), and was developed in parallel to the alphabet nowadays in use for Turkish and Azerbaijani, see: http://en.wikipedia.org/wiki/Janalif . In fact, it was proposed on this basis, being the only Jaꞑalif letter missing so far, since the ꞑ (occurring in the alphabet name itself) was introduced with Unicode 6.0. The letter is not a soft sign; it is the exact Tatar equivalent of the Turkish dotless i, thus it has a use similar to that of the Cyrillic yeru Ы/ы (U+042B/U+044B). In this function, it is a part of the adaptation of the Latin alphabet for a lot of non-Russian languages in the Soviet Union in the 1920s, see e.g.: Юшманов, Н. В.: Определитель Языков. Москва/Ленинград 1941, http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 . (A proposal regarding this subject is expected for 2011.) Thus, it shares with the Cyrillic soft sign its form and partly the geographical area of its use, but in no case its meaning. The same can be said, e.g., 
for P/p (U+0050/U+0070, Latin letter P) and Р/р (U+0420/U+0440, Cyrillic letter ER). According to the pre-preliminary minutes of UTC #125 (L2/10-415), the UTC has not accepted the Latin Ь/ь. It is an established practice for the European alphabetic scripts to encode a new letter only if it has a different shape (in at least one of the capital and small forms) relative to all already encoded letters of the same script. The Y/y is well known to denote completely different pronunciations, used as a consonant as well as a vowel, even within the same language. Thus, if somebody unearths a Latin letter E/e in some obscure minority language which has no E-like vowel, used to denote an M-like sound and in fact collated after the M in the local alphabet, this will probably not lead to a new encoding. But Latin and Cyrillic are different scripts (the question in the subject of this mail is rhetorical, of course). Admittedly, there is also a precedent for using Cyrillic letters in Latin text: the use of U+0417/U+0437 and U+0427/U+0447 as tone letters in Zhuang. However, the orthography using them was short-lived, being superseded by another Latin orthography which uses genuine Latin letters as tone marks (J/j and X/x, in this case). On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь did not lose the Ь/ь through an improvement of the orthography, but were completely deprecated by an ukase of Stalin. Thus, they continue to be the Latin alphabets of the respective languages. Whether a revival is formally requested or not, they are regarded as valid by the members of the cultural group (even if only to access their cultural heritage). Especially, it cannot be excluded that people want to create Latin domain names or e-mail addresses without being accused of script mixing. Taking this into account, not to mention the technical problems regarding collation etc.
and the typographical issues when it comes to subtle differences between Latin and Cyrillic in high-quality typography, it is really hard to understand why the UTC refuses to encode the Latin Ь/ь. A quick glance at the Юшманов table mentioned above shows that there is absolutely no request to duplicate the whole Cyrillic alphabet in Latin, as someone may have feared. - Karl Pentzlin -- Szelp, André Szabolcs +43 (650) 79 22 400
Are Latin and Cyrillic essentially the same script?
As shown in N3916: http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3916.pdf = L2/10-356, there exists a Latin letter which resembles the Cyrillic soft sign Ь/ь (U+042C/U+044C). This letter is part of the Jaꞑalif variant of the alphabet, which was used for several languages in the former Soviet Union (e.g. Tatar), and was developed in parallel to the alphabets nowadays in use for Turkish and Azerbaijani, see: http://en.wikipedia.org/wiki/Janalif . In fact, it was proposed on this basis, being the only Jaꞑalif letter missing so far, since the ꞑ (occurring in the alphabet name itself) was introduced with Unicode 6.0. The letter is no soft sign; it is the exact Tatar equivalent of the Turkish dotless i, thus it has a use similar to that of the Cyrillic yeru Ы/ы (U+042B/U+044B). In this function, it is part of the adaptation of the Latin alphabet for many non-Russian languages of the Soviet Union in the 1920s, see e.g.: Юшманов, Н. В.: Определитель Языков. Москва/Ленинград 1941, http://fotki.yandex.ru/users/ievlampiev/view/155697?page=3 . (A proposal regarding this subject is expected for 2011.) Thus, it shares with the Cyrillic soft sign its form and partly the geographical area of its use, but in no case its meaning. The same can be said e.g. for P/p (U+0050/U+0070, Latin letter P) and Р/р (U+0420/U+0440, Cyrillic letter ER). According to the pre-preliminary minutes of UTC #125 (L2/10-415), the UTC has not accepted the Latin Ь/ь. It is an established practice for the European alphabetic scripts to encode a new letter only if it has a different shape (in at least one of the capital and small forms) relative to all already encoded letters of the same script. The Y/y is well known to denote completely different pronunciations, used as a consonant as well as a vowel, even within the same language.
Thus, if somebody unearths a Latin letter E/e in some obscure minority language which has no E-like vowel, used to denote an M-like sound and in fact collated after the M in the local alphabet, this will probably not lead to a new encoding. But Latin and Cyrillic are different scripts (the question in the subject of this mail is rhetorical, of course). Admittedly, there is also a precedent for using Cyrillic letters in Latin text: the use of U+0417/U+0437 and U+0427/U+0447 as tone letters in Zhuang. However, the orthography using them was short-lived, being superseded by another Latin orthography which uses genuine Latin letters as tone marks (J/j and X/x, in this case). On the other hand, Jaꞑalif and the other Latin alphabets which use Ь/ь did not lose the Ь/ь through an improvement of the orthography, but were completely deprecated by an ukase of Stalin. Thus, they continue to be the Latin alphabets of the respective languages. Whether a revival is formally requested or not, they are regarded as valid by the members of the cultural group (even if only to access their cultural heritage). Especially, it cannot be excluded that people want to create Latin domain names or e-mail addresses without being accused of script mixing. Taking this into account, not to mention the technical problems regarding collation etc. and the typographical issues when it comes to subtle differences between Latin and Cyrillic in high-quality typography, it is really hard to understand why the UTC refuses to encode the Latin Ь/ь. A quick glance at the Юшманов table mentioned above shows that there is absolutely no request to duplicate the whole Cyrillic alphabet in Latin, as someone may have feared. - Karl Pentzlin
Re: Are Latin and Cyrillic essentially the same script?
2010-11-10 10:08, I wrote: KP As shown in N3916 ... Please read vowel instead of vocal throughout the mail. Sorry.
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Thank you for replying. On Saturday, 7 August 2010, Doug Ewell d...@ewellic.org wrote: I think the alternate ending glyph is supposed to be specified in more detail than that. The example Asmus gave was U+222A UNION with serifs. Even though the exact proportions of the serifs may differ from one font to the next, this is still a relatively precise and constrained definition, unlike Latin small letter e with some 'alternate ending' which is completely up to the discretion of the font designer. Because of stylistic differences among calligraphers—this is a calligraphy question, not a poetry question—it is hard to imagine how this aspect of the proposal would not result in an unbounded number of glyphic variations. 'e' is not the only letter to which calligraphers like to attach special endings, and a swash cross-stroke is not the only special ending that calligraphers like to attach to 'e'. It seems to me that there are at least two ways to have an alternate ending e. One is to extend the cross-stroke to the right beyond the e and end the extension with a flourish of some sort, another is to extend the lower line out to the right and end that extension in some way. I can imagine that a proposal would lead to wanting to be able to express a choice of the two, or more, possible variants of a letter, should the font have alternate glyphs of both types. Then there is the question of what is to happen if the requested one is not available in the font: does the other alternate glyph become displayed or does the basic character glyph become displayed? I'd like to see an FAQ page on What is Plain Text? written primarily by UTC officers. That might go a long way toward resolving the differences between William's interpretation of what plain text is, which people like me think is too broad, and mine, which some people have said is too narrow. That is a good idea. Thank you also for the careful precision with which you describe the situation of who thinks what. 
Yet is producing such a document an impossible task? Some years ago there was a suggestion on this mailing list to produce a Frequently Asked Questions (FAQ) page about what should not be encoded. Is the document that is now suggested effectively the same thing? I thought of an analogy of trying to produce a FAQ document on What is art?. Such a document produced in 1550 might well have been very different from one produced in 1910, and those different again from one produced in 1995 or 2010. Maybe the analogy is not perfect, but to me it conveys that if a What is Plain Text? document is produced, with a view to deciding what could and could not be encoded in Unicode as plain text in the future, then it could quickly become either out of date or a restriction on progress in technology. The recent encoding of the emoticons shows a dramatic change in what can be encoded as plain text compared with the situation some years ago. Some of my ideas have been refuted as not being suitable for encoding in plain text. Yet the refutation all seems to be based on unchangeable rules from about twenty years ago, and change is part of progress. I remember once being referred, in this mailing list, to an ISO document about encoding. The document made reference to a definition of character within the same document. The document was ISO/IEC TR 15285. I have found that the document is available here (the link used at the previous time no longer works). http://openstandards.dk/jtc1/sc2/wg2/docs/TR%2015285%20-%20C027163e.pdf The introduction includes the following. quote This Technical Report is written for a reader who is familiar with the work of SC 2 and SC 18. Readers without this background should first read Annex B, “Characters”, and Annex C, “Glyphs”. end quote Annex B has the following.
quote In ISO/IEC 10646-1:1993, SC 2 defines a character as: A member of a set of elements used for the organisation, control, and representation of data. end quote On the accessing of alternate glyphs from plain text, I feel that, since there are 256 variation selectors that could be used with each of the Latin letters, and provided that no harm is done to those who choose not to use them, some sequences should be encoded so that alternate glyphs can be accessed from fonts. Some readers might find the following of interest. http://forum.high-logic.com/viewtopic.php?f=36t=2229 It is a thread entitled An unusual glyph of an Esperanto character in the Arno font. I had been looking through the following document. http://store1.adobe.com/type/browser/pdfs/ARNP/ArnoPro-Italic.pdf I had found an alternate ending glyph for the h circumflex character and had then tried to produce some text where it could be used. I felt that it was a situation of typography inspiring creative writing. Readers who enjoyed that thread might also
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Aug 7, 2010, at 10:40 AM, Doug Ewell wrote: I'd like to see an FAQ page on What is Plain Text? written primarily by UTC officers. That might go a long way toward resolving the differences between William's interpretation of what plain text is, which people like me think is too broad, and mine, which some people have said is too narrow. Well, we do have http://www.unicode.org/faq/ligature_digraph.html#10 and related FAQs? The basic idea is that plain text is the minimum amount of information to process the given language in a normal way. FOR EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY ISN'T, AND DOING IT LOOKS WRONG. We therefore have both upper- and lower-case letters for English. On the other hand, although English *is* usually written with some facility to provide emphasis, different media have different ways of providing that facility (asterisks, underlining, italicizing), and English written without any of these looks perfectly fine. Arabic, on the other hand, absolutely must have some way of allowing for different letter shapes in different contexts, or it looks just wrong, so Arabic plain text must have facility to allow for that, either by explicitly having different characters for the different shapes the letters take, or by providing a default layout algorithm that defines them. Beyond rendering, there are also considerations as to the minimal amount of information necessary for other text-based processes, such as sorting, searching, and text-to-speech. Yes, there are issues which end up being judgment calls, and it's easy to come up with cases where you can't really capture the full semantic intent of the author without what Unicode calls rich text. My favorite example is The Mouse's Tale in _Alice in Wonderland_. Plain text isn't intended to capture all the nuances of the original's semantics, but to provide at the least a very close approximation. 
Variation selectors are intended to cover cases where more information is needed for rendering than is required for other processes such as searching (Mongolian), or cases where different user communities disagree on whether two forms must be unified or must be deunified. = Hoani H. Tinikini John H. Jenkins jenk...@apple.com
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
John H. Jenkins wrote: The basic idea is that plain text is the minimum amount of information to process the given language in a normal way. That's a bit vague. We don't normally process languages; we read texts. Whether font or color variation is essential for understanding really depends on the author's purposes and choices, not on language, FOR EXAMPLE, ALTHOUGH ENGLISH CAN BE WRITTEN IN ALL-CAPS, IT USUALLY ISN'T, AND DOING IT LOOKS WRONG. I wouldn't say it looks wrong. Surely it is often typographically poor or just stupid, but it might be a consequence of technical limitations (there are still loads of systems that make no case distinction in texts, so in any relevant aspect, they are effectively uppercase-only), and all-caps English is quite understandable, though boring to read, provided that some precautions are taken by writers. We therefore have both upper- and lower-case letters for English. It's just a distinction that you _can_ (and usually do) make in plain text English. It's not an inherent distinction: all-caps English is still English, though poorly written by modern standards. Arabic, on the other hand, absolutely must have some way of allowing for different letter shapes in different contexts, or it looks just wrong, so Arabic plain text must have facility to allow for that, either by explicitly having different characters for the different shapes the letters take, or by providing a default layout algorithm that defines them. But layout algorithms are not part of character encoding or part of the definition of plain text. It's not OK to render plain text Arabic, encoded at the logical level (i.e., letters encoded abstractly and not as contextual forms), in a simplistic manner that uses a one letter - one glyph model. But that's not part of the definition of plain text at all.
Yes, there are issues which end up being judgment calls, and it's easy to come up with cases where you can't really capture the full semantic intent of the author without what Unicode calls rich text. We don't need to invent contrived examples for that. Every time an author uses italics or bolding to make an essential point in emphasizing something he does something that cannot be captured in a plain version of the text. To make an even simpler point, if you insert an essential content image into a document you step outside the realm of plain text. I don't see any better definition for plain text than a negative one: it is text without formatting, except to the extent that forced line breaks and the choice of alternative forms for a character (to the extent that such differences are encoded in the character code) can be considered as formatting. Plain text, though apparently a very simple concept, is a very abstract one. I don't think you can explain the concept to your neighbor while standing on one foot, if at all. Human writing did not originate as plain text, and at the surface level, it is never plain text: it always has some specific physical appearance, and abstract plain text can only be found below the surface, as the underlying data format where only character identities (character numbers in a specific code) are encoded, with no reference to a particular rendering. -- Yucca, http://www.cs.tut.fi/~jkorpela/
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
karl-pentz...@acssoft.de wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters There are 256 selectors, but the proposal only suggests numbering up to 16, effectively deprecating the others. Surely we want all 256? The Mongolian selectors alter the appearance of the glyph displayed after the character has been evaluated for position in the word and a series of complex rules applied. The user will normally only have to use the selectors in exceptional cases. The selectors are only valid in certain positional cases and have been somewhat arbitrarily assigned. It is not the case that selector 1 selects the same alternative form in all positions. A typical user will see most of the variations in use arise from the built-in rules being applied. There is no user entity which would be considered variant 1 and which is used by a separate community. I regard the proposal to give names like VARIANT-M1 as confusing, as they have no basis in reality. I also have some concerns from a security point of view, as the proposal makes variation selectors valid for Latin characters for the first time. The selectors which produce a default behaviour, or make one character look like another already encoded, seem unneeded and introduce yet more clones of common characters. I also have concerns about the proposal to give the non-ideographic variants names like VARIANT-1. Surely it is possible to give them descriptive names which would make it easier to understand what is meant? It is not as if we will have thousands of these. Some parts of the proposal have merit, but I would urge the UTC to hold a public consultation on the matter to allow more time for feedback to be gathered. Tim Partridge
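Tim's count of 256 selectors is concrete in the standard: VS1 through VS16 sit at U+FE00..U+FE0F, and VS17 through VS256 sit at U+E0100..U+E01EF. A minimal sketch of the mapping, using only the Python standard library (the helper function name is my own, not anything from the proposal):

```python
import unicodedata

# The 256 variation selectors live in two blocks:
#   VS1..VS16   -> U+FE00..U+FE0F
#   VS17..VS256 -> U+E0100..U+E01EF
def variation_selector(n: int) -> str:
    """Return the character for VARIATION SELECTOR-n (n in 1..256)."""
    if not 1 <= n <= 256:
        raise ValueError("variation selectors are numbered 1..256")
    if n <= 16:
        return chr(0xFE00 + n - 1)
    return chr(0xE0100 + n - 17)

# The character names confirm the numbering:
assert unicodedata.name(variation_selector(1)) == "VARIATION SELECTOR-1"
assert unicodedata.name(variation_selector(16)) == "VARIATION SELECTOR-16"
assert unicodedata.name(variation_selector(17)) == "VARIATION SELECTOR-17"
assert unicodedata.name(variation_selector(256)) == "VARIATION SELECTOR-256"
```

Note that which sequences a selector may appear in is not free-form: valid sequences are those registered in the standard's variation-sequence data files.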
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Thank you for replying. On Friday 6 August 2010, Asmus Freytag asm...@ix.netcom.com wrote: What you mean are artistic or stylistic variants. These have certain problems, see here for an explanation: http://www.unicode.org/forum/viewtopic.php?p=221#p221 A./ I have read and reread the forum post to which you refer. I cannot understand from that text, or otherwise at the time of writing this reply, why it would not be possible to have an alternate ending glyph for a letter e accessible from plain text using an advanced font technology font (for example, an OpenType font) using the two character sequence U+0065 U+FE0F. The specific design of an alternate ending e glyph would vary from font to font, yet that it is an alternate ending e would be clear: the encoding U+0065 U+FE0F would allow the intention that an alternate ending glyph for a letter e is requested to be carried within a plain text document. I accept that I might be missing something here. If so I would be happy to learn: at the moment, however, it still seems to me to be a good idea for an encoding. William Overington 7 August 2010
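William's proposed sequence was never standardized (U+FE0F, VARIATION SELECTOR-16, is today registered mainly for emoji presentation sequences), but what it would look like at the code point level is easy to show. A sketch, with the sequence itself hypothetical:

```python
import unicodedata

# Hypothetical sequence from the proposal: e followed by VARIATION SELECTOR-16.
seq = "\u0065\uFE0F"
plain = "e"

# Two code points, base letter first:
assert [f"U+{ord(c):04X}" for c in seq] == ["U+0065", "U+FE0F"]

# Variation selectors survive canonical normalization, so the sequence
# stays distinct from the bare letter; a search process that should
# treat the two alike must strip the selectors itself.
assert unicodedata.normalize("NFC", seq) != plain
stripped = "".join(c for c in seq if not ("\uFE00" <= c <= "\uFE0F"))
assert stripped == plain
```

This is part of what makes such proposals contentious: every process that compares or searches text, not just the renderer, has to decide what to do with the selector.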
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Thank you for replying. On Friday 6 August 2010, John H. Jenkins jenk...@apple.com wrote: This is another case of a solution in search of a problem. No, the problem is that one cannot at present, as far as I know, access alternate glyphs of an advanced format font from a plain text file. It isn't Unicode's business to advance typography, and in any event, typesetting plain text isn't the path to good typography. Those are interesting claims. I hope that if Unicode can advance typography by providing a facility such as I am suggesting that it would be pleased to do so. Other technologies, such as OpenType, AAT, and Graphite, *do* have the job of making good typography easy and accessible. Fonts are an important part of the whole process. And, mirabile dictu, they can already do what you are suggesting here for plain text. I am unaware of how an application program using an OpenType font can be made to display alternate glyphs requested from a plain text file. Can it be done? Unicode's responsibility is to deal with existing needs. Well, for me it is a need to be able to request the display of an alternate glyph of an advanced format font from a plain text file. If it is common for poets to use various letter shapes at the end of words to convey some semantic meaning, and if they do this in their emails or tweets, or if they're complaining that this is something that they want to do but can't, then Unicode and plain text provide a proper way to help them. Alas, a paradox. If the facility becomes available, they might well use it. Yet, unlike a ROASTED SWEET POTATO glyph becoming available on some mobile telephones then later becoming encoded in Unicode because it was available on some mobile telephones, it is not, as far as I am presently aware, possible for that to happen in relation to requesting an alternate ending glyph for a letter e from a plain text file whilst still producing an ordinary e if that request cannot be fulfilled by the particular font being used. 
Fonts themselves are used to convey semantic meaning. I am unsure of quite how it all works, yet it seems to work partly by association with cultural knowledge of where fonts or handwriting or signwriting of that type have been used previously and partly with design aspects of the font, such as angularity or smoothness or ornateness and perhaps other factors as well. William Overington 7 August 2010
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
William_J_G Overington wjgo underscore 10009 at btinternet dot com wrote: I cannot understand from that text, or otherwise at the time of writing this reply, why it would not be possible to have an alternate ending glyph for a letter e accessible from plain text using an advanced font technology font (for example, an OpenType font) using the two character sequence U+0065 U+FE0F. The specific design of an alternate ending e glyph would vary from font to font, yet that it is an alternate ending e would be clear: the encoding U+0065 U+FE0F would allow the intention that an alternate ending glyph for a letter e is requested to be carried within a plain text document. I think the alternate ending glyph is supposed to be specified in more detail than that. The example Asmus gave was U+222A UNION with serifs. Even though the exact proportions of the serifs may differ from one font to the next, this is still a relatively precise and constrained definition, unlike Latin small letter e with some 'alternate ending' which is completely up to the discretion of the font designer. Because of stylistic differences among calligraphers—this is a calligraphy question, not a poetry question—it is hard to imagine how this aspect of the proposal would not result in an unbounded number of glyphic variations. 'e' is not the only letter to which calligraphers like to attach special endings, and a swash cross-stroke is not the only special ending that calligraphers like to attach to 'e'. I'd like to see an FAQ page on What is Plain Text? written primarily by UTC officers. That might go a long way toward resolving the differences between William's interpretation of what plain text is, which people like me think is too broad, and mine, which some people have said is too narrow. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Michael Everson On 6 Aug 2010, at 22:20, Karl Pentzlin wrote: On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. As I have outlined in the revised introduction of my proposal, there are *no* implications for Serbian orthography. Admittedly, this was a little bit implicit in my first draft. Yeah, well, I am not convinced of the merits of your proposal. Sorry. I am not convinced either. Because all this proposal is supposed to solve is to allow an automated change of orthography, so that SOME long s in old documents using Fraktur style will become round s in some other intermediate style (like Antiqua), and then all of them will become round s later. It's a matter of orthographic adaptation, i.e. modernization of old texts. But any modernization of old orthographies implies more than just changing some glyphs. For example, the modernization of medieval French texts requires knowing when a text was written (to correctly infer its semantics), then knowing for which period of time the modernized version was created, and then knowing what other orthographic changes were necessary, such as replacing s (long or round) with circumflexes, or changing tildes into circumflexes or newer (distinct) modern accents, or dropping some other letters. Unicode is not made to adapt to orthographic changes. My opinion is that it just has to encode the orthography AS IT IS, ignoring all possible other adaptations due to modernization (and the evolution of the written language). In other words, the existing long s and common round s are just enough to preserve the original orthography and its semantics, as they were in the original text (even if it was ambiguous or incoherent).
The variation selectors are not intended to convey the additional semantics needed for adaptations to newer orthographies, but ONLY the additional semantics that existed in a written language at the time when it was effectively written. Text modernizers will really need something else, notably lexical and grammatical analysis (under human supervision), and these are completely out of scope for Unicode and ISO 10646. They work by effectively correcting the text, i.e. changing its original orthography and semantics. This process is mostly like many transliteration schemes or like all translation processes: the resulting text is obviously different and intended for different readers. The only case where we really need variation selectors is when we can demonstrate that there are opposable pairs where a glyphic variant (within a unified abstract character) in the SAME text by the SAME author conveys a distinct semantic. For everything else, variation selectors should not be used at all, and an encoded round s will still mean the same, even if it's rendered with a Fraktur font or a Bodoni- or Antiqua-like font. Philippe.
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
verdy_p verdy underscore p at wanadoo dot fr wrote: I am not convinced too. Because all what this proposal is supposed to solve is to allow an automted change of orthography so that SOME long s in old doucments using Fraktur style will become round s in some other antermediate style (like Antiqua) and then all of them will become round s later. You missed some e-mails. The long s/round s sequences are gone from the latest proposal. -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote: I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. Why can't a poet find a poetic means of doing that, instead of depending on a standards organization to provide a standard means of doing so in plain text? Seems kind of anti-poetic to me. ;-) --Ken Well, I was just suggesting an example. I am not an expert on poetry. It would not be a matter of a poet depending on a standards organization; it would be a matter of a standards organization noting that adding alternate glyphs to fonts is a modern trend, and doing what it can to facilitate access to those alternate glyphs from plain text in a standardized way. For example, suppose that an alternate ending glyph for a letter e is desired at the end of a line of a poem. I am thinking that U+0065 U+FE0F could be used to do that. It seems to me that, as U+0065 U+FE0F is presently unused and there are also other variation selectors not used with U+0065, it would do no harm and would be useful for U+0065 U+FE0F to be officially standardized as requesting an alternate ending glyph for the letter e, while using the ordinary glyph of U+0065 of the font if an alternate ending glyph of the letter e is not available within the font. The standards organizations have a great opportunity to advance typography by defining some of the Latin letter plus variation selector pairs so that alternate glyphs within a font may be accessed directly from plain text. William Overington 6 August 2010
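The fallback William asks for (show the ordinary e when the font has no alternate) matches how unsupported variation selectors are meant to degrade: a process that does not recognize a sequence simply displays the base character. A toy simulation of that renderer behaviour, with the glyph names and the font's sequence table invented for illustration:

```python
# Hypothetical sketch: a renderer that knows some (base, selector) pairs
# uses the alternate glyph, and otherwise silently shows the base letter.
VS16 = "\uFE0F"

def glyphs_for(text: str, supported: set) -> list:
    """Map text to glyph names, honouring supported (base, selector) pairs."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if (ch, nxt) in supported:
            out.append(ch + ".alt")   # font has an alternate glyph
            i += 2
        else:
            if nxt == VS16:
                i += 2                # unsupported sequence: ignore selector
            else:
                i += 1
            out.append(ch)            # default glyph
    return out

font_with_alt = {("e", VS16)}
assert glyphs_for("fin\u0065\uFE0F", font_with_alt) == ["f", "i", "n", "e.alt"]
assert glyphs_for("fin\u0065\uFE0F", set()) == ["f", "i", "n", "e"]
```

The second assertion is the point of the proposal: text carrying the selector still reads as ordinary text on systems that know nothing about it.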
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 2010/08/05 2:56, Asmus Freytag wrote: On 8/2/2010 5:04 PM, Karl Pentzlin wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin This is an interesting proposal to deal with the glyph selection problem caused by the unification process inherent in character encoding. When Unicode was first contemplated, the web did not exist and the expectation was that it would nearly always be possible to specify the font to be used for a given text and that selecting a font would give the correct glyph. The Web may finally get to solve this problem, although it may still take some time to be fully deployed. Please see http://www.w3.org/Fonts/ for more details and pointers. Regards, Martin. -- #-# Martin J. Dürst, Professor, Aoyama Gakuin University #-# http://www.sw.it.aoyama.ac.jp mailto:due...@it.aoyama.ac.jp
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On 8/6/2010 2:03 AM, William_J_G Overington wrote: On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote: I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. Why can't a poet find a poetic means of doing that, instead of depending on a standards organization to provide a standard means of doing so in plain text? Seems kind of anti-poetic to me. ;-) --Ken Well, I was just suggesting an example. I am not an expert on poetry. What you mean are artistic or stylistic variants. These have certain problems, see here for an explanation: http://www.unicode.org/forum/viewtopic.php?p=221#p221 A./
Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Aug 6, 2010, at 3:03 AM, William_J_G Overington wrote: The standards organizations have a great opportunity to advance typography by defining some of the Latin letter plus variation selector pairs so that alternate glyphs within a font may be accessed directly from plain text. This is another case of a solution in search of a problem. It isn't Unicode's business to advance typography, and in any event, typesetting plain text isn't the path to good typography. Other technologies, such as OpenType, AAT, and Graphite, *do* have the job of making good typography easy and accessible. And, mirabile dictu, they can already do what you are suggesting here for plain text. Unicode's responsibility is to deal with existing needs. If it is common for poets to use various letter shapes at the end of words to convey some semantic meaning, and if they do this in their emails or tweets, or if they're complaining that this is something that they want to do but can't, then Unicode and plain text provide a proper way to help them. = Hoani H. Tinikini John H. Jenkins jenk...@apple.com
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 02:04, I wrote: KP I have compiled a draft proposal: KP Proposal to add Variation Sequences for Latin and Cyrillic letters In the meantime, I have submitted a final version to the UTC (L2/10-280), as the UTC meeting starts this coming Monday (2010-08-09). For those who do not have access to L2, it is also available at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic.pdf (4.4 MB). Thank you to all who participated in the discussions on this list. Following your suggestions, I have: · dropped the proposed variants for Latin small letter s (addressing Fraktur/Blackletter), as their special aspects are to be handled in a separate proposal (if one is written), · dropped the unspecific variants for Latin small letters a and g, · rewritten substantial parts of the introduction to be more concise at the points which had raised questions on this list and elsewhere. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Friday, 6 August 2010 at 11:08, Martin J. Dürst wrote: MJD The Web may finally get to solve this problem, although it may still MJD take some time to be fully deployed. Please see http://www.w3.org/Fonts/ MJD for more details and pointers. Variation sequences are a means to support this goal, as they provide font developers with a standardized and easily understandable mechanism, one which unburdens font designers as well as the site designers who decide which fonts to offer to the intended users of their content. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. As I have outlined in the revised introduction of my proposal, there are *no* implications for Serbian orthography. Admittedly, this was somewhat implicit in my first draft. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Thursday, 5 August 2010 at 12:31, William_J_G Overington wrote: WO Yet what if one wants to use the precomposed g circumflex character? Searching the text of the Unicode Standard for "canonical equivalence" is helpful in this case, for end users as well as for font designers and for programmers of rendering systems. - Karl Pentzlin
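As a concrete illustration of the canonical equivalence Karl points to: the precomposed ĝ (U+011D) and the sequence g plus U+0302 COMBINING CIRCUMFLEX ACCENT are canonically equivalent, so Unicode normalization converts between them. A minimal Python check:

```python
import unicodedata

precomposed = "\u011D"   # ĝ LATIN SMALL LETTER G WITH CIRCUMFLEX
decomposed = "g\u0302"   # g + COMBINING CIRCUMFLEX ACCENT

# NFC composes the sequence into the precomposed character;
# NFD decomposes the precomposed character back into the sequence.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

This is why a rendering system can treat the two spellings as the same character for glyph selection purposes.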
Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Wednesday, 4 August 2010 at 22:44, I wrote: KP However, in my next version, I will replace the s variants by long s variants: KP 017F FE00 ...LONG S VARIANT-1 ... STANDARD FORM KP · will be displayed long in any script variants KP 017F FE01 ...LONG S VARIANT-1 FLEXIBLE FORM (naming provisional) KP · will be displayed long in Fraktur, Gaelic, and similar script variants KP · will usually be displayed round when used with Roman type KP This has the advantage that, especially when implicit application of variation sequences KP is possible, it can be applied to existing data without change. In the final version of my proposal, I have completely dropped this, as the subject obviously needs a separate discussion in a separate proposal. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Yeah, well, I am not convinced of the merits of your proposal. Sorry. On 6 Aug 2010, at 22:20, Karl Pentzlin wrote: On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. As I have outlined in the revised introduction of my proposal, there are *no* implications for Serbian orthography. Admittedly, this was somewhat implicit in my first draft. - Karl Pentzlin Michael Everson * http://www.evertype.com/
Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
For the standard form you probably don't need to add a variation selector. The code point for long s itself expresses exactly the semantic of representing this character as long s in ANY type style. While I'm not convinced of your variation proposal at all (on the contrary), if you write it, write it properly. :-) /Sz 2010/8/4 Karl Pentzlin karl-pentz...@acssoft.de On Tuesday, 3 August 2010 at 19:11, Janusz S. Bień wrote: JSB I see no reason why, if I understand correctly, the long s variant is JSB to be limited to Fraktur-like styles. The *variant* is applicable to situations where the character is to be displayed long when Fraktur-like styles are in effect, while it is to be displayed round when modern styles are in effect. The plain *character* long s is intended to be displayed long in all circumstances. However, in my next version, I will replace the s variants by long s variants: 017F FE00 ...LONG S VARIANT-1 STANDARD FORM · will be displayed long in any script variants 017F FE01 ...LONG S VARIANT-1 FLEXIBLE FORM (naming provisional) · will be displayed long in Fraktur, Gaelic, and similar script variants · will usually be displayed round when used with Roman type This has the advantage that, especially when implicit application of variation sequences is possible, it can be applied to existing data without change. - Karl Pentzlin -- Szelp, André Szabolcs +43 (650) 79 22 400
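At the code point level, the two sequences Karl lists would look like this. A minimal Python sketch; note that these sequences were only *proposed* (L2/10-280) and were never standardized, so the code merely shows how such strings would be built:

```python
# Proposed (never standardized) long-s variation sequences, code point view only.
LONG_S = "\u017F"  # LATIN SMALL LETTER LONG S (ſ)
VS1 = "\uFE00"     # VARIATION SELECTOR-1
VS2 = "\uFE01"     # VARIATION SELECTOR-2

standard_form = LONG_S + VS1   # proposed: display long in ANY type style
flexible_form = LONG_S + VS2   # proposed: long in Fraktur/Gaelic, round otherwise

print([f"U+{ord(c):04X}" for c in standard_form])  # ['U+017F', 'U+FE00']
```

As Szelp notes, the standard form would duplicate what the bare U+017F code point already means.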
Re: Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
will decide to reunite their cultural efforts [...] and increasing their mutual cultural exchanges instead of wasting them for old nationalist reasons You're either an utmost optimist, or you really have no idea of Eastern European history, culture and spirit. :-) I doubt your described scenario will come true in our lifetimes. /Sz On Wed, Aug 4, 2010 at 11:10 PM, verdy_p verd...@wanadoo.fr wrote: Doug Ewell wrote: There is no formal model in the sense of a standard N-letter subtag for dialects, because the concept of a dialect is too open-ended and unsystematic. The word means different things to different people. What may be a dialect to one person might be a full-blown National Language to another, or just a funny accent to a third. The formal model already exists in ISO 639, which has decided to unify all dialectal variants under the same language code. Yes, the concept is fuzzy, but as long as ISO 639 does not contain a formal model of how the various languages are grouped into families and subfamilies, it will be impossible to use dialectal variant specifiers with accurate fallbacks without using subtags for the language variants. One known problem is for example Norman, which ISO 639 still considers a dialect of French, even though it is just ANOTHER Oïl language (from which Standard French emerged by merging, modifying and extending several dialects). But Jersiais is now a language with official status in Jersey, and it is clearly part of the Norman family. And it still needs to be distinguished from French. Still, there's no ISO 639 code for Norman (as a family, or as the residual language in continental Normandy in France), and no code for Jersiais either. And French is considered in ISO 639 an isolated language, not a macrolanguage. So it allows no further precision. 
If something is added, it can only be a variant for the dialectal difference, such as fr-norman for the Norman family, or fr-jersiais for Jersiais, unless Jersiais gets its own ISO 639-3 code as an isolated language (leaving continental Norman still as a dialectal variant of French). The formal definition of languages is the definition of ISO 639-3 isolated languages. Everything below is dialectal (and ISO 639 has clearly stated that it plans, for much later, a comprehensive encoding of dialectal differences, most probably by defining a standard list of variant codes, even if these dialects may qualify as languages for some users). It's remarkable that for most linguists, Serbian, Croatian, and Bosnian are only one language, with only dialectal differences (in the spoken language and with some grammatical derivations, and some minor lexical differences that are understood by all Serbo-Croatian speakers), orthographic differences (mostly based on their default script, even if Serbian still uses the two scripts but defines a strict transliteration system that helps define a unified orthography for both scripts, orthographies that are simplified in Croatian and Bosnian). So yes, the concept of dialect vs. language is fuzzy for linguists and users (and nationals that prefer to see their dialect, named after their country, as a full language instead of a dialect), but ISO 639 defines a formal model by its technical encoding: if there's an authority defending the position of a distinct language and defining an official lexicon and orthography, it becomes a de facto language for ISO 639. 
Such splitting of languages, with dialectal differences promoted to isolated languages, has occurred and was endorsed by ISO 639, even if it was probably not in the interest of these countries to split their common language and to reduce its audience and cultural influence in other parts of the world (and many of their own citizens won't care much about these formal official differences, as long as they understand the language and can read and write it in a script that they can decipher without difficulty, if only because they will constantly live near other peoples sharing the same language under a different name). Serbian is still perceived and encoded as a single language, even though it still uses two scripts, depending on the region of use (but it is now rapidly converging to the Latin script). Maybe the linguistic and cultural authorities of the four concerned countries (or five, now with Kosovo, whose independence was recently validated by an international court?) will decide to reunite their cultural efforts, if they finally all use the same Latin script, by adopting a new neutral name (Dolmoslavic, Adriatic, Adrislavic? Or even Yugoslavic?) and increasing their mutual cultural exchanges instead of wasting them for old nationalist reasons (this will be even more important when they finally ALL join the European Union, with increased exchanges between them). Philippe.
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Thank you for your reply. On Wednesday 4 August 2010, Karl Pentzlin karl-pentz...@acssoft.de wrote: WO Why is it not possible specifically to request a one-storey form of lowercase letter a? I did not do this, as I do not know of a cultural context where the two-storey form is to be suppressed to prevent an a from being mistaken for any letter too similar to a two-storey a. Well, I was intending this as a straightforward way to access glyph alternates. Noticing that you mentioned cultural context, I have now remembered a situation that might perhaps be of interest. It was in a thread about fonts for teaching children in the United Kingdom how to read and write. http://forum.high-logic.com/viewtopic.php?f=10&t=296 WO What happens in relation to a character such as g circumflex? Would one be able to access a glyph alternate for g circumflex? The variation selector can be followed by any diacritic, which then is applied to the base character. Yet what if one wants to use the precomposed g circumflex character? WO Could there be variants for lowercase e, ... I have found none, which of course is no proof of non-existence, WO for a horizontal line glyph design, and for an angled line, Not according to the principles outlined in my proposal, WO Venetian-style font, glyph design please? No. I was looking for a way to access a glyph alternate for typography, not for any cultural meaning. Maybe one might choose to use an e with an angled line in the words Venice and Venetian, for subtle effect in the typography. I find that adding alternate glyphs to fonts is a modern trend. There seems to be no current way to access them from plain text. WO Would it be possible to define U+FE0F VARIATION SELECTOR-16 to indicate an end of word alternate glyph for each lowercase Latin character? No. Even if you find a cultural context where such things are required, such things are positional variants which are to be handled by the proven mechanisms developed for scripts like Arabic. 
I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. William Overington 5 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote: However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Are you here talking about such things as alternate glyph styles? It depends what one means by need. Adding alternate glyphs to a font is a trend in modern font design. One approach is to use Private Use Area mappings, which can be used to produce stylish hardcopy printouts and stylish graphics for the web, yet there are the well-known problems of spell-checking and so on if Private Use Area mappings are used for much more than those application areas. The other approach is to use an alternate glyph model, where the underlying plain text is conserved. However, this, today, often means using expensive software packages with a proprietary file format in order to store the information about which glyph to use in each case. I remember those advertisements that CNN used to run promoting the concept of advertising. Advertising - your right to choose. One of the advertisements distinguished between what people need and what people want. So, maybe people do not need to use alternate glyphs in typography, yet maybe they want to do so, maybe they enjoy doing so. I feel that it is entirely reasonable that Unicode and ISO 10646 encode things that help people do what they want to do and what they enjoy doing as well as what they need to do. William Overington 5 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 8/5/2010 3:47 AM, William_J_G Overington wrote: On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote: However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Are you here talking about such things as alternate glyph styles? No, I am referring to the element of the proposal that proposes to have a variation sequence that selects the unspecified form for lower case a. It depends what one means by need. I've written a longer answer here: http://www.unicode.org/forum/viewtopic.php?f=9t=83start=0 A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
I am thinking of where a poet might specify an ending version of a glyph at the end of the last word on some lines, yet not on others, for poetic effect. I think that it would be good if one could specify that in plain text. Why can't a poet find a poetic means of doing that, instead of depending on a standards organization to provide a standard means of doing so in plain text? Seems kind of anti-poetic to me. ;-) --Ken
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday 3 August 2010, Karl Pentzlin karl-pentz...@acssoft.de wrote: Any comments are welcome. Firstly, thank you for making the document available. I have made a few comments regarding matters that I noticed. Please know that, whilst I comment on various matters, I am enthusiastic for the general thrust of your suggestion regarding access to alternate glyphs for Latin characters using Variation Selectors. This could produce a renaissance for typography. In the document, on page 2, there is the following. quote But while the general mechanisms for doing so are standardized (i.e. OpenType features), the concrete selection of a specific glyph is not. end quote It is important that the Unicode specification does not regard any particular font technology as being the standard font technology. This issue was discussed in this mailing list some years ago. http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0106.html The last two paragraphs of the following post put that post in context. http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0095.html Why is it not possible specifically to request a one-storey form of lowercase letter a? It seems to me that being able to request either a one-storey form or a two-storey form of lowercase letter a would be better. In relation to lowercase g, would it be better to be able to request any one of open descender, closed loop descender and unclosed loop descender? For example, the lowercase letters g in the fonts Arial, Times New Roman and Trebuchet MS show the three types. What happens in relation to a character such as g circumflex? Would one be able to access a glyph alternate for g circumflex? Could there be variants for lowercase e, for a horizontal line glyph design and for an angled line, Venetian-style font, glyph design please? Would it be possible to define U+FE15 VARIATION SELECTOR-16 to indicate an end of word alternate glyph for each lowercase Latin character? 
Certainly, some usages would be more likely than others, with d, e, h, m, n, t, z being more likely to have an end of word alternate glyph than would some other letters, yet a general usage for all Latin characters would, in my opinion, be good. William Overington 4 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
On Tuesday, 3/8/10, Janusz S. Bień jsb...@mimuw.edu.pl wrote: I see no reason why, if I understand correctly, the long s variant is to be limited to Fraktur-like styles. Long s was used with ordinary Roman type in England for English text in at least part of the 17th and 18th centuries. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the two character sequence sh? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s h ligature available please use that instead. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the three character sequence ssi? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s long s i ligature available please use that instead. William Overington 4 August 2010
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
On 4 August 2010 09:19, William_J_G Overington wjgo_10...@btinternet.com wrote: Answering the two questions below on the assumption that s-VS1 0073 FE00 were to be defined as a variation sequence for long s in all type styles, and without giving any opinion on the merits or otherwise of Karl's proposal in general, or specifically the merits of double-encoding long s as a variation sequence. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the two character sequence sh? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s h ligature available please use that instead. s-VS1-ZWJ-h Note that there must be no character between a variation selector and the base character it applies to, so the ZWJ must go after VS1. How could one express the following please using variation selectors and the Zero Width Joiner ZWJ in relation to the three character sequence ssi? If you have a long s available, please use it, otherwise please use an ordinary s: furthermore, if you have a long s long s i ligature available please use that instead. The use of long s versus short s and ligaturing of these letters varies widely geographically and historically and depending upon typeface. The following examples would all be valid *if* s-VS1 were to be defined as a variation sequence for long s (in all type styles): s-VS1-ZWJ-s-VS1-ZWJ-i -- for a ligatured ſſi as in miſſion (usual in 18th century English typography) s-VS1-s-i -- for a non-ligatured ſsi as in illuſtriſsimos (usual in 18th century Spanish typography) s-VS1-ZWJ-s-i -- for a ligatured ſs plus i as in bleſsings (usual for italics only in 16th and early 17th century English and French typography) s-s-VS1-ZWJ-i -- for s plus a ligatured ſi as in utilisſima (sometimes in 16th century Italian typography) Andrew
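Andrew's sequences can be spelled out at the code point level. A hedged Python sketch assuming the hypothetical s+VS1 sequence from this thread (it was never standardized), mainly to show the ordering rule he notes: the variation selector must directly follow its base character, with ZWJ after it:

```python
# Hypothetical: s+VS1 as "long s" was only a proposal, never standardized.
S, VS1, ZWJ = "\u0073", "\uFE00", "\u200D"

# "ligature preferred" ſh: VS1 follows its base s directly, then ZWJ, then h.
sh = S + VS1 + ZWJ + "h"

# Ligatured ſſi as in "miſſion" (usual in 18th-century English typography).
ssi = S + VS1 + ZWJ + S + VS1 + ZWJ + "i"

print([f"U+{ord(c):04X}" for c in sh])
# ['U+0073', 'U+FE00', 'U+200D', 'U+0068']
```

The other three ſ/s/i combinations Andrew lists follow the same pattern, varying only where VS1 and ZWJ are placed.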
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Am 03.08.2010 um 02:47 schrieb David Starner: Fraktur and Antiqua are different writing systems with slightly different orthographies No. Fraktur and Antiqua are two (of many) different renderings of the Latin writing system. Regards, A. Stötzner.
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
On Wed, Aug 4, 2010 at 05:19, William_J_G Overington Long s was used with ordinary Roman type in England for English text in at least part of the 17th and 18th centuries. More on that by babelstone: http://babelstone.blogspot.com/2006/06/rules-for-long-s.html (Sorry for the duplicate email William, my mistake.) -- Leonardo Boiko
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters (was Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters))
In my opinion, adding the s+VS1 variation sequence is completely unneeded. If you really want a long s, use the code point assigned to the long s. Fonts or renderers should still provide a reasonable fallback to s if the glyph is missing. This means that all existing ligatures with long s will continue to be encoded with long s and ZWJ. The s+VS1 proposal is an attempt to disunify the long s, when it is NOT needed at all. The only convenient variation sequence would be to add S+VS1 for the capital (because long s has no capital), only to preserve the long s semantic when converting it to uppercase or titlecase, in which case the mapping of S+VS1 to lowercase will again give the standard long s.
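The case-conversion problem described here can be checked against the standard Unicode case mappings, for example via Python's built-in string methods: long s has no capital form, so an uppercase/lowercase round trip silently loses it.

```python
long_s = "\u017F"  # ſ LATIN SMALL LETTER LONG S

# Uppercasing maps ſ to plain S (Unicode defines no capital long s)...
assert long_s.upper() == "S"
# ...so lowercasing again yields ordinary s: the long-s semantic is lost.
assert "S".lower() == "s"
assert long_s.upper().lower() != long_s
```

This is the round-trip loss that the proposed (hypothetical, never standardized) S+VS1 sequence was meant to avoid.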
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Aug 4, 2010, at 8:20 AM, Andreas Stötzner wrote: On 3 August 2010 at 02:47, David Starner wrote: Fraktur and Antiqua are different writing systems with slightly different orthographies No. Fraktur and Antiqua are two (of many) different renderings of the Latin writing system. The two propositions are not mutually exclusive. And it /is/ true that, at least at some times, Fraktur and Antiqua have had different orthographies. -- John W Kennedy There are those who argue that everything breaks even in this old dump of a world of ours. I suppose these ginks who argue that way hold that because the rich man gets ice in the summer and the poor man gets it in the winter things are breaking even for both. Maybe so, but I'll swear I can't see it that way. -- The last words of Bat Masterson
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 8/2/2010 5:04 PM, Karl Pentzlin wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin This is an interesting proposal to deal with the glyph selection problem caused by the unification process inherent in character encoding. When Unicode was first contemplated, the web did not exist and the expectation was that it would nearly always be possible to specify the font to be used for a given text and that selecting a font would give the correct glyph. As the proposal noted, universal fonts and viewing documents on other platforms and systems across the web have made this solution unattractive for general texts. We are left then with these five scenarios: 1) Free variation 2) Orthographic variation of isolated characters (by language, e.g. different capitals) 3) Orthographic variation of entire texts (e.g. italic Cyrillic forms, by language) 4) Orthographic variation by type style (e.g. Fraktur conventions) 5) Notational conventions (e.g. IPA) For free variation of a glyph, the only possible solutions are either font selection or use of a variation sequence. I concur with Karl that in this case, where notable variations have been unified, adding variation selectors is a much more viable means of controlling authorial intent than font selection. If text is language tagged, then OpenType mechanisms exist in principle to handle scenarios 2 and 3. For full texts in a certain language, using variation selectors throughout is unappealing as a solution. 
However, it may be a viable solution for being able to embed correctly rendered citations in other text, given that language tagging can be separated from the document and that automatic language tagging may detect large chunks of text, but not short runs. The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense. Notational conventions are addressed in Unicode by duplicate encoding (IPA) or by variation sequences. The scheme has holes, in that it is not possible in a few cases to select one of the variants explicitly; instead, the ambiguous form has to be used, in the hope that a font is used that will have the proper variant in place for the ambiguous form. Adding a few variation sequences (like the one to allow the a at 0061 to be the two-storey one needed for IPA) would fill the gap for times when controlling the precise display font is not available. However, there's no need to add variation sequences to select an *ambiguous* form. Those sequences should be removed from the proposal. Overall a valuable starting point for a necessary discussion. A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
John W Kennedy wrote: On Aug 4, 2010, at 8:20 AM, Andreas Stötzner wrote: On 3 August 2010 at 02:47, David Starner wrote: Fraktur and Antiqua are different writing systems with slightly different orthographies No. Fraktur and Antiqua are two (of many) different renderings of the Latin writing system. The two propositions are not mutually exclusive. And it /is/ true that, at least at some times, Fraktur and Antiqua have had different orthographies. And it is probably the main reason for the inclusion of Latf in ISO 15924, not just because it is a script variant, but really because it defines a distinct orthography, which should be specifiable in BCP 47 language tags. I think you could apply the same rationale to Hans and Hant as well (not really different scripts for the UCS, but distinct orthographies). Really, Hans, Hant, Latf, Latg could have been avoided in ISO 15924, if orthographic variants of the same languages had been encoded in the IANA database for BCP 47, independently of the effective font style. But for now there's still no formal model for encoding language dialects, so BCP 47 language tags still need to use tags for ISO 3166-1 region codes and for the script variant, when it should just qualify the generic script code (or it could even drop this ISO 15924 code if there was a formal code for the dialect written in a specific orthography: we would also deprecate Jpan, Hrkt in ISO 15924). Orthographic variants would also include: - the various romanization systems (for example Pinyin) and phonetic transcriptions (IPA phonetic, simplified IPA phonology), - the simplified orthographies (e.g. orthographic reforms in French and German), - and some other minor variants (like the vertical presentation for East-Asian scripts, or Boustrophedon presentation for Ancient Greek, if this alters the orientation of characters that had to be encoded differently, and the default mirroring properties are not applicable to the encoded characters in the basic language). 
For now these dialectal/orthographic variants of written languages can be registered in the IANA database for BCP 47, using codes with at least 5 letters (or with at least 4 letters or digits if there's at least one digit), but ideally the dialectal variant should be encoded as a tag BEFORE the orthographic variant. The font style preferred for each orthographic variant is still left to the rendering system, which will apply stylesheets according to the language tag. It should not be invalid to use a fallback style that ignores the orthographic variants for which there's no font support, or no support in the font rendering system or page layout system. Philippe.
Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
verdy_p verdy underscore p at wanadoo dot fr wrote: Really, Hans, Hant, Latf, Latg could have been avoided in ISO 15924, if orthographic variants of the same languages had been encoded in the IANA database for BCP 47, independently of the effective font style. Actually it was the opposite; the ability to use standardized ISO 15924 code elements to express concepts like Simplified Han was one of the driving forces behind RFC 4646 and its shift in focus from whole tags to subtags. In any case, the bibliographers and others who use ISO 15924 but not BCP 47 might need to make these distinctions as well. But for now there's still no formal model for encoding language dialects, so BCP 47 language tags still need to use tags for ISO 3166-1 region codes and for the script variant, when it should just qualify the generic script code (or it could even drop this ISO 15924 code if there was a formal code for the dialect written in a specific orthography: we would also deprecate Jpan, Hrkt in ISO 15924). There is no formal model in the sense of a standard N-letter subtag for dialects, because the concept of a dialect is too open-ended and unsystematic. The word means different things to different people. What may be a dialect to one person might be a full-blown National Language to another, or just a funny accent to a third. BCP 47 tags never *need* to use either the region subtag or the script subtag, unless they are necessary to convey the intended meaning. A tag like ja-Jpan-JP is almost never needed, because almost all written Japanese is using the Japanese writing system ('Jpan') and as used in Japan ('JP'). I'm not sure what dialect is being posited here that would make the difference between having to specify a script subtag and not having to. Orthographic variants would include also: - the various romanization systems (for example Pinyin) and phonetic transcriptions (IPA phonetic, simplified IPA phonology), 'pinyin', 'fonipa' - the simplified orthographies (e.g. 
orthographic reforms in French and German), '1606nict', '1694acad', '1901', '1996' - and some other minor variants (like the vertical presentation for East-Asian scripts, or Boustrophedon presentation for Ancient Greek, if this alters the orientation of characters that had to be encoded differently, and the default mirroring properties are not applicable to the encoded characters in the basic language). For now these dialectal/orthographic variants of written languages can be registered in the IANA database for BCP 47, using codes with at least 5 letters (or with at least 4 letters or digits if there's at least one digit), A 4-character variant subtag must *begin* with a digit. but ideally the dialectal variant should be encoded as a tag BEFORE the orthographic variant. Why is this important? -- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
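Doug's correction about 4-character variant subtags can be checked mechanically. Here is a minimal sketch (Python, following the RFC 5646 ABNF for variant subtags: `5*8alphanum / (DIGIT 3alphanum)`) showing why '1901' is a legal variant while a four-letter code without a leading digit is not:

```python
import re

# RFC 5646: variant = 5*8alphanum / (DIGIT 3alphanum)
# i.e. 5-8 alphanumeric characters, or exactly 4 beginning with a digit.
VARIANT_RE = re.compile(r"^(?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3})$")

def is_valid_variant(subtag: str) -> bool:
    """Check whether a string is syntactically a BCP 47 variant subtag."""
    return bool(VARIANT_RE.match(subtag))

# Registered variants mentioned in the thread all pass:
for v in ["pinyin", "fonipa", "1606nict", "1694acad", "1901", "1996"]:
    assert is_valid_variant(v)

# "1901" is valid (4 chars starting with a digit); a 4-letter code is not.
assert not is_valid_variant("abcd")
```

Note this checks only the syntax; whether a subtag is actually registered is a separate lookup against the IANA Language Subtag Registry.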
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wednesday, 4 August 2010 at 00:31, Christoph Päper wrote: CP ... than making sure every instance of a letter is CP accompanied by the appropriate VS? My proposal contains the idea of implicit application of variation sequences by higher-level protocols. I will make this clearer in my next version. CP How did you decide what to include in your proposal ... I will also make this clearer in my next version, which will contain a paragraph on characters vs. variants vs. glyphs. - Karl Pentzlin
Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Asmus Freytag wrote: The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense.

I don't think so. If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this case, the conversion to long s will be inappropriate. So use the Fraktur round s directly. If a text in Fraktur absolutely requires the long s, it's only because the original text was already using this long s. In that case, encode the long s: the text will render with a long s both in modern Latin font styles like Bodoni (with a possible fallback to modern round s if that font does not have a long s), and in classic Fraktur font styles (with, here also, a possible fallback to Fraktur round s if the Fraktur font omits the long s from its repertoire of supported glyphs).

In other words, you don't need any variation sequence: s+VS1 would be strictly encoding the same thing as the existing encoded long s. Adding this variation selector would just be pollution (an unjustified disunification). The two existing characters already clearly state their semantic differences, so we should continue to use them. This does not mean that fonts should not continue to be enhanced, or that font renderers and text-layout engines should not be corrected to support more fallbacks (in fact it will be simpler to implement these fallbacks within text renderers, instead of requiring a new font version).
You can apply the same policy to the French narrow no-break space NNBSP (aka « fine » in French) that fonts do not need to map, provided that the font renderers or text layout engines correctly infer its best fallback as THIN SPACE, before retrying with the FOUR-PER-EM SPACE or SIX-PER-EM SPACE characters, then with a standard SPACE with a reduced metric. That's because fonts never care about line-breaking properties, which are implemented only in text layout engines. The same should apply as well to NBSP, if a font does not map it (the text renderer just has to use the fallback to SPACE to find the glyph in the selected font), and to the NON-BREAKING HYPHEN (just infer the fallback to the standard HYPHEN, then to HYPHEN-MINUS).

In fact, it would be more elegant if Unicode provided a new property file suggesting the best fallbacks (ordered by preference) for each character (these fallbacks possibly having their own fallbacks, to be retried if all the suggested ordered fallbacks have failed). In most cases, only one fallback will be needed (in very few cases, several ordered fallbacks should be listed if the implied sub-fallbacks are not in the correct order of resolution). It would avoid selecting glyphs from other fallback fonts with very different metrics.

Some of these fallbacks are already listed in the main UCD file, but they are too generic (because the compatibility mappings must resolve ONLY to non-compatibility-decomposable characters). For example, NNBSP has a compatibility decomposition to 0020, just like many other whitespace characters, so it completely loses the width information. If we had standardized fallback resolution sequences implemented in text renderers, we would not need to update complex fonts, the job for font designers would be much simpler, and users of existing fonts could continue to use them, even if new characters are encoded.
I took the example of NNBSP because it is a character that has been encoded for a long time now, but vendors still forget to provide a glyph mapping for it (for example in core fonts of Windows 7 such as the new Segoe UI font, even though Microsoft included an explicit mapping for NNBSP in Times New Roman). It's one of the frequent cases where this can be solved very simply by the text renderer itself. The same should be done to provide a correct fallback to round s if ever any font does not map the long s.

I also suggest that the lists of standard character fallbacks be scanned within the first selected font, without trying other fallback fonts (including multiple font families specified in a stylesheet or generic CSS fonts), unless the list of fallback characters includes a specifier in the middle of the list that would indicate that all the characters (the original or the fallback characters already specified before) should be searched (this will be useful mostly for
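The fallback-property idea described above could be sketched roughly as follows (Python; the fallback table is purely illustrative, not any actual Unicode data file, and a real renderer would consult the font's cmap rather than a predicate function):

```python
# Hypothetical per-character fallback table, ordered by preference.
# Each fallback may itself have fallbacks, tried recursively.
FALLBACKS = {
    "\u202F": ["\u2009", "\u2005", "\u2006", "\u0020"],  # NNBSP -> THIN SPACE, FOUR-PER-EM, SIX-PER-EM, SPACE
    "\u00A0": ["\u0020"],                                # NBSP -> SPACE
    "\u2011": ["\u2010", "\u002D"],                      # NON-BREAKING HYPHEN -> HYPHEN, HYPHEN-MINUS
    "\u017F": ["\u0073"],                                # LONG S -> round s
}

def resolve_glyph(ch, font_has_glyph, seen=None):
    """Return the first character in the fallback chain that the font maps,
    trying each suggested fallback's own fallbacks recursively."""
    seen = seen if seen is not None else set()
    if ch in seen:                    # guard against cyclic fallback chains
        return None
    seen.add(ch)
    if font_has_glyph(ch):
        return ch
    for fb in FALLBACKS.get(ch, []):
        found = resolve_glyph(fb, font_has_glyph, seen)
        if found is not None:
            return found
    return None

# A font that lacks NNBSP and THIN SPACE but has plain SPACE and round s:
coverage = {"\u0020", "\u0073"}
assert resolve_glyph("\u202F", coverage.__contains__) == "\u0020"
assert resolve_glyph("\u017F", coverage.__contains__) == "\u0073"
```

The point of the ordered list is exactly what the message argues: the renderer degrades within the selected font first, preserving metrics, instead of jumping straight to a fallback font.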
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 02:47, David Starner wrote: DS ... I don't see why DS unspecific forms should be encoded; if you want a nonspecific a, 0061 DS is the character. This is because I take into account the implicit application of a variation sequence on a base character by a higher-level protocol, which must be overridable in some way. In the next version of my proposal, I hope to make this clearer; probably I will also put another name on the unspecific variants. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wednesday, 4 August 2010 at 08:52, William_J_G Overington wrote: WO Please know that, whilst I comment on various matters, I am WO enthusiastic about the general thrust of your suggestion regarding WO access to alternate glyphs for Latin characters using Variation WO Selectors. This could produce a renaissance for typography. Admittedly, I explicitly do not want to introduce glyph encoding into Unicode through the back door. In the next version of my proposal, you will find some words about what variation sequences are *not* intended for. WO But while the general mechanisms for doing so are standardized WO (i.e. OpenType features), the concrete selection of a specific glyph is not. WO It is important that the Unicode specification does not regard WO any particular font technology as being the standard font technology. This is correct. I mention OpenType only as an example. WO Why is it not possible specifically to request a one-storey form of lowercase letter a? I did not include this, as I do not know a cultural context where the two-storey form is to be suppressed to prevent an a from being mistaken for any letter too similar to a two-storey a. WO What happens in relation to a character such as g circumflex? WO Would one be able to access a glyph alternate for g circumflex? The variation selector can be followed by any diacritic, which then is applied to the base character. WO Could there be variants for lowercase e, ... I have found none, which of course is no proof of non-existence. WO for a horizontal line glyph design, and for an angled line, Not according to the principles outlined in my proposal. WO Venetian-style font, glyph design please? No. WO Would it be possible to define U+FE0F VARIATION SELECTOR-16 to WO indicate an end-of-word alternate glyph for each lowercase Latin WO character? No.
Even if you find a cultural context where such things are required, such things are positional variants which are to be handled by the proven mechanisms developed for scripts like Arabic. - Karl Pentzlin
Re: long s (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On Tuesday, 3 August 2010 at 19:11, Janusz S. Bień wrote: JSB I see no reason why, if I understand correctly, the long s variant is JSB to be limited to Fraktur-like styles. The *variant* is applicable to situations where the character is to be displayed long when Fraktur-like styles are in effect, while it is to be displayed round when modern styles are in effect. The plain *character* long s is intended to be displayed long in all circumstances. However, in my next version, I will replace the s variants by long s variants:

017F FE00 ... LONG S VARIANT-1 STANDARD FORM
· will be displayed long in any script variant

017F FE01 ... LONG S VARIANT-1 FLEXIBLE FORM (naming provisional)
· will be displayed long in Fraktur, Gaelic, and similar script variants
· will usually be displayed round when used with Roman type

This has the advantage that, especially when implicit application of variation sequences is possible, it can be applied to existing data without change. - Karl Pentzlin
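A renderer implementing the two sequences proposed here (hypothetical, since these sequences are not actually encoded) might decide display forms roughly like this sketch; the function name and style flag are illustrative:

```python
# Sketch of the proposed (not encoded) long-s variation sequences:
# U+017F + U+FE00 "standard form" stays long in every style;
# U+017F + U+FE01 "flexible form" is long only in Fraktur/Gaelic-like
# styles and rounds to "s" in Roman type.
LONG_S, VS1, VS2 = "\u017F", "\uFE00", "\uFE01"

def display_form(base, selector, fraktur_style):
    """Return the character form a renderer would display."""
    if base != LONG_S:
        return base
    if selector == VS2 and not fraktur_style:
        return "s"          # flexible form rounds in Roman type
    return LONG_S           # plain long s and standard form stay long

assert display_form(LONG_S, VS1, fraktur_style=False) == LONG_S
assert display_form(LONG_S, VS2, fraktur_style=False) == "s"
assert display_form(LONG_S, VS2, fraktur_style=True) == LONG_S
```

This captures the proposal's asymmetry: the variation selector only ever *permits* rounding; it never forces a long form that the plain character would not already have.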
re: Dialects and orthographies in BCP 47 (was: Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Doug Ewell wrote: There is no formal model in the sense of a standard N-letter subtag for dialects, because the concept of a dialect is too open-ended and unsystematic. The word means different things to different people. What may be a dialect to one person might be a full-blown National Language to another, or just a funny accent to a third.

The formal model already exists in ISO 639, which has decided to unify all dialectal variants under the same language code. Yes, the concept is fuzzy, but as long as ISO 639 does not contain a formal model for how the various languages are grouped in families and subfamilies, it will be impossible to use dialectal variant specifiers with accurate fallbacks, without using subtags for the language variants.

One known problem is for example Norman, which ISO 639 still considers a dialect of French, even though it is just ANOTHER Oïl language (from which Standard French emerged by merging, modifying and extending several dialects). But Jersiais is now a language with official status in Jersey, and it is clearly part of the Norman family. And it still needs to be distinguished from French. Still, there's no ISO 639 code for Norman (as a family or as the residual language in continental Normandy in France), and no code for Jersiais either. And French is considered in ISO 639 an isolated language, not a macrolanguage, so it allows no further precision. If something is added, it can only be a variant for the dialectal difference, such as fr-norman for the Norman family, or fr-jersiais for Jersiais, unless Jersiais gets its own ISO 639-3 code as an isolated language (leaving continental Norman still as a dialectal variant of French). The formal definition of languages is the definition of ISO 639-3 isolated languages.
Everything below that is dialectal (and ISO 639 has clearly stated that it plans, for much later, a comprehensive encoding of dialectal differences, most probably by defining a standard list of variant codes, even if these dialects may qualify as languages for some users).

It's remarkable that for most linguists, Serbian, Croatian, and Bosnian are only one language, with only dialectal differences (in the spoken language, with some grammatical derivations and some minor lexical differences that are understood by all Serbo-Croatian speakers) and orthographic differences (mostly based on their default script, even if Serbian still uses the two scripts but defines a strict transliteration system that helps define a unified orthography for both scripts, orthographies that are simplified in Croatian and Bosnian).

So yes, the concept of dialect vs. language is fuzzy for linguists and users (and for nationals who prefer to see their dialect named after their country as a full language instead of a dialect), but ISO 639 defines a formal model by its technical encoding: if there's an authority defending the position of a distinct language and defining an official lexicon and orthography, it becomes a de facto language for ISO 639. Such a split of languages along their dialectal differences, promoted to isolated languages, has occurred and was endorsed by ISO 639, even if it was probably not in the interest of these countries to split their common language and to reduce its audience and cultural influence in other parts of the world (and many of their own citizens won't care much about these formal official differences, as long as they understand the language and can read and write it in a script they can decipher without difficulty, if only because they will constantly live near other peoples sharing the same language under a different name).
Serbian is still perceived and encoded as a single language, despite still using two scripts, depending on the region of use (but it is now rapidly converging to the Latin script). Maybe the linguistic and cultural authorities of the four concerned countries (or five, now with Kosovo, whose independence was recently validated by an international court?) will decide to reunite their cultural efforts, if they finally all use the same Latin script, by adopting a new neutral name (Dolmoslavic, Adriatic, Adrislavic? Or even Yugoslavic?) and increasing their mutual cultural exchanges instead of wasting them for old nationalist reasons (this will be even more important when they finally ALL join the European Union, with increased exchanges between them). Philippe.
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
On 8/4/2010 1:30 PM, verdy_p wrote: Asmus Freytag wrote: The Fraktur problem is one where one typestyle requires additional information (e.g. when to select long s) that is not required for rendering the same text in another typestyle. If it is indeed desirable (and possible) to create a correctly encoded string that can be rendered without further change automatically in both typestyles, then adding any necessary variation sequences to ensure that ability might be useful. However, that needs to be addressed in the context of a precise specification of how to encode texts so that they are dual renderable. Only addressing some isolated variation sequences makes no sense.

I don't think so. If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this case, the conversion to long s will be inappropriate. So use the Fraktur round s directly.

This statement makes clear that you don't understand the rules of typesetting text in Fraktur.

If a text in Fraktur absolutely requires the long s, it's only because the original text was already using this long s.

This statement is also incorrect. The rules for when to use long s in Fraktur and when to use round s depend on the position of the character within the word in complicated ways. The same word, typeset in Antiqua style, will not usually have the long s. For German, there exist a large number of texts that were typeset in both formats, so you can compare for yourself. Even in France, I suspect that research libraries would have editions of 19th-century German classics in both formats.

In that case, encode the long s: the text will render with a long s both in modern Latin font styles like Bodoni (with a possible fallback to modern round s if that font does not have a long s), and in classic Fraktur font styles (with, here also, a possible fallback to Fraktur round s if the Fraktur font omits the long s from its repertoire of supported glyphs).
I'm skipping the rest of your message because you've started from a wrong premise, and sorting out which bits still apply even after accounting for the wrong premise is not something I have the time, energy or inclination for. Sorry, A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Wed, Aug 4, 2010 at 4:33 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: Am Dienstag, 3. August 2010 um 02:47 schrieb David Starner: DS ... I don't see why DS unspecific forms should be encoded; if you want a nonspecific a, 0061 DS is the character. This is because I take into account the implicit application of a variation sequence on a base character by a higher-level protocol, which must be overridable in some way. I don't see why it must be overridable. By not including a variation sequence, you've left it up to the system to pick a glyph. Whatever glyph it picks, you have no right to complain. There is no reason for the system to do anything with the unspecific form variation sequence. -- Kie ekzistas vivo, ekzistas espero.
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Asmus Freytag: If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this case, the conversion to long s will be inappropriate. So use the Fraktur round s directly. This statement makes clear that you don't understand the rules of typesetting text in Fraktur. If a text in Fraktur absolutely requires the long s, it's only because the original text was already using this long s. This statement is also incorrect. The rules for when to use long s in Fraktur and when to use round s depend on the position of the character within the word in complicated ways. The same word, typeset in Antiqua style, will not usually have the long s.

So you just demonstrated that IF such a rule exists and is enforceable, then you DON'T need the separate encoding. In that case you can safely use a round s everywhere, and let all the appropriate round s be converted automatically to long s according to this rule. Your false assumption is, in my opinion, that such a rule exists and is enforceable for typesetting into Fraktur. Everything demonstrates that this is NOT the case: just look into actual manuscripts and old books, and you'll very frequently find that the same book used the rules inconsistently, either because of a typo made by the printer (or its typists composing the pages), or because the printer wanted to respect the original orthography used in the manuscript by the author (the printer decides NOT to decide and maintains that orthography, even if it's inconsistent).

Now if you're faced with an original book that was initially typeset in Fraktur, and want to preserve its characters as they are, just use standard round s and standard long s. You don't need ANY variation selector. You'll only be interested in adding ZWJ for encoding the ligatures that you see in the original document. Render it with a Fraktur font and you've done the work correctly; nothing more is needed.
Now render it with a Bodoni font, and all the long s will be converted to a fallback round s, if you use a correct typesetting program that will not display squares for missing glyphs. Render it on the web in HTML, and the default text renderers of browsers will use any font they have (even if you specified one, there's no guarantee that it will be available, or that the user will not have applied a personal stylesheet for their own preferred fonts, so fallback fonts will still be used); in that case the browsers will make all the efforts they can to reproduce the original distinctions between long s and round s. Now if you want to render it as a high-quality Bodoni text, you'll use a font or renderer that will either display ALL the existing distinctions as they are encoded in the text (no need of any variation selector for that), or NONE of them (all long s will be rendered like round s).

For German, there exist a large number of texts that were typeset in both formats, so you can compare for yourself. Even in France, I suspect that research libraries would have editions of 19th century German classics in both formats.

Yes, but this is not relevant to the issue. You DON'T need any variation sequence to encode the differences WHERE THEY EXIST. If you want the correct long s in the Fraktur-rendered text, use the standard long s where it occurs and nothing else. The same text will still render with round s in a Bodoni-like font, and will display the Fraktur differences when using a modern font mapping the two characters to two distinct glyphs.

And then only one case remains useful: if you still want some long s in the original Fraktur text to convert to long s in a modern style, while others will still convert to round s, using the SAME font. Only for this case, what you'll need is NOT ⟨long s⟩ but REALLY ⟨long s, VS1⟩, so that the renderer will know (from the presence of VS1) that the ⟨long s, VS1⟩ is safely convertible to ⟨round s⟩ when using a modern font that has mappings for both characters.
In other words, the modern font will add a mapping of ⟨long s, VS1⟩ to the same glyph as ⟨round s⟩, instead of just to ⟨long s⟩ when ignoring the variation selector. This VS1 will encode those long s that are not absolutely long when rendering in styles other than the original Fraktur (such as Antiqua). For the reverse conversion (from modern texts to Fraktur), which you would use for fancy new creations, you won't need to encode anything other than ⟨round s⟩ (which will be converted automatically to ⟨long s⟩ where appropriate, applying the strict rules automatically and consistently), unless you still want to force some others (for fancy reasons) into the document rendered in a Fraktur-like style (but remember that the original was not using ⟨long s⟩, except where it was forced in the original). With this scheme you'll still be able to preserve the original modern non-Fraktur text. Philippe.
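The convention argued for here can be sketched as a one-line text transformation (a hypothetical convention, not an encoded Unicode semantic; the sample word is illustrative):

```python
# Sketch of the scheme: a plain U+017F LATIN SMALL LETTER LONG S stays
# long in every style, while U+017F followed by VS1 (U+FE00) is marked
# as convertible to round "s" when a modern (Antiqua) style is in effect.
LONG_S, VS1 = "\u017F", "\uFE00"

def to_antiqua(text: str) -> str:
    """Map the 'flexible' long s (long s + VS1) to round s; keep plain long s."""
    return text.replace(LONG_S + VS1, "s")

word = "Wach" + LONG_S + VS1 + "tube"      # long s marked as convertible
assert to_antiqua(word) == "Wachstube"
assert to_antiqua(LONG_S + "o") == LONG_S + "o"   # unmarked long s preserved
```

The Fraktur rendering path would simply ignore VS1, so both encodings display identically in a Fraktur font.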
Re: Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)
Philippe, Text typeset in Fraktur contains more information than text typeset in Antiqua. That means there are some places where there are some (mild) ambiguities of representation in the Antiqua version. Not enough to bother a human reader, who can use deep context to read the text correctly, but enough that a mere typesetting system cannot correctly render such a text in Fraktur. I'm not currently aware of anything that would prevent an automated system from converting a text encoded for Fraktur to one encoded for Antiqua, because you are merely throwing away information. So far we agree.

The question is whether it would be possible to make this process work by default in common, unmodified rendering engines, and whether that is desirable. (I don't treat either of these questions as settled one way or the other, so please don't attribute a position to me on that subject.) What I do know is that there are historic documents using Antiqua fonts that do use the long s. Therefore, in principle, you don't necessarily want to create fonts that map long s to round s. And, as an author, you can't rely on such a font being present on the reader's end; it might equally likely be one that does implement the long s. So, whatever automatic rendering of Fraktur-ready text with non-Fraktur general-purpose fonts you have in mind should not rely on this kind of non-standard glyph substitution. That would be a terrible hack, imperiling the ability of people to use the long s outside the context of the Fraktur tradition.

All I had argued for was that Karl should take the consideration of rendering text encoded for Fraktur out of his proposal and make it part of a separate document that addresses ALL issues of this type of rendering, making it a complete specification; that would be something that allows review on its own merits. A./
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 3 Aug 2010, at 01:04, Karl Pentzlin wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. I don't think it is a good idea. In particular the implications for Serbian orthography would be most unwelcome. Michael Everson * http://www.evertype.com/
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Tuesday, 3 August 2010 at 09:45, Michael Everson wrote: ME ... In particular the implications ME for Serbian orthography would be most unwelcome. Which kind of implications do you refer to? The proposed variation sequences simply provide more general access to typographic details which can now be accomplished only by more complicated means, like implementing locale-specific glyph selection within a font and relying on a higher-level protocol to supply the correct locale information. (Anyway, such means may stay in effect in parallel with the use of variation sequences.) One of the advantages of variation sequences is that the glyph selection is transparent to the user, instead of being implemented in each font in a non-standardized way. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
Karl Pentzlin: The proposed variation sequences simply provide a more general access to typographic details, which now can be accomplished by more complicated means like implementing locale-specific glyph selection within a font, and relying on a higher-level protocol supplying the correct locale information. How is selecting and setting once a locale (vulgo language) more complicated than making sure every instance of a letter is accompanied by the appropriate VS? They don’t seem very handy for runs of text, but VS are probably the right tool for reference work, e.g. http://en.wikipedia.org/wiki/Cyrillic_alphabet#Letterforms_and_typography. So it makes sense to specify combinations. How did you decide what to include in your proposal, though? There are many more variants, even when not taking handwritten forms into account, e.g. ‘u’- or ‘v’-based ‘y’ and ‘w’ or uppercase letters with diacritics above rendered lower so they’re not using more vertical space than the base letters.
Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
0073 FE00/FE01 - must be LATIN SMALL LETTER S, not LETTER B. Leo On Mon, Aug 2, 2010 at 5:04 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On Mon, Aug 2, 2010 at 8:04 PM, Karl Pentzlin karl-pentz...@acssoft.de wrote: I have compiled a draft proposal: Proposal to add Variation Sequences for Latin and Cyrillic letters The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Two things jumped out at me on a quick glance. First, I don't see why unspecific forms should be encoded; if you want a nonspecific a, 0061 is the character. Secondly, Fraktur and Antiqua are different writing systems with slightly different orthographies; instead of messing around with variation sequences, just accept that. If they must be distinguished, surely the long-s variation sequence could be used in non-Fraktur fonts, like Blackletter and 18th century-style fonts. -- Kie ekzistas vivo, ekzistas espero.
Romanian and Cyrillic
I posted this message to the message boards of Distributed Proofreaders-Europe dp.rastko.net (a joint effort of Project Rastko www.rastko.net and Project Gutenberg www.gutenberg.net), and got this response from one of the site admins.

nikola wrote: Haha, Romanian used Cyrillic up to the 19th century, so sooner or later we WILL have Romanian books in Cyrillic here. Nikola, David refers to the Moldavian situation, which is a little bit different compared to the situation in the modern Romanian state since its formation. David, here are some preliminary thoughts:

Prosfilaes wrote: From the Unicode mailing list: Quote: Since we're talking about Romanian... Prior to 1991, the Soviet-controlled administration attempted to create a distinct linguistic identity, Moldovan, which as I understand it basically amounted to Romanian written in Cyrillic script. (They tried to introduce some archaic Romanian forms and Russian loans, but apparently none of it stuck.)

I expect a gradual influx of Romanian, Moldavian, Tzintzar and Vlach members after May 24. I'm in almost daily contact with our friends and collaborators from Bucharest and Timisoara these days, regarding our Romanian NGO which is under registration at the moment, and they'll also serve as the medium of our future local Moldavian network. Before their more detailed opinion, I can offer some analogies from similar cases. A bi- or tri-alphabet situation is not rare in SE European or Eurasian cultures. In previous centuries we find all combinations of parallel use of Cyrillic, Glagolitic, Latin, Greek or Arabic scripts among Serbs, Croats, Romanians, Albanians etc. Religious or ideological affiliations are to blame for the very recent and oppressive reduction to just one major script, but even now we have the Serbian case with Cyrillic as the only standard script, yet the Latin script widely used on a daily social level without prejudice even in the core of Serbian culture.
Project Rastko's general policy is more or less to OCR/publish the version in the original script, but also to provide transliterated versions into other commonly used scripts. Although we are proponents of having one official script, we publish Serbian works in an additional Latin version in order for them to be easily read also in Muslim or Croat areas of former Yugoslavia (which share a common language with Serbian culture). For Romanian and Moldavian books printed in Cyrillic, I suppose the only logical solution is to apply Rastko's rules: to process them in the original script but to publish in parallel a Latin-script version which modern Romanian readers could read.

Prosfilaes wrote: Quote: How relevant is Romanian in Cyrillic script at this point? For instance, what's the likelihood that someone might want to put Romanian-Cyrillic content on the web? Already being done? A reasonable possibility? Extremely unlikely?

It is a reasonable possibility. The phenomenon of script is supranational and for academic purposes should also be treated as supraconfessional or supraideological.

Prosfilaes wrote: I know DP-EU plans to do it sometime, but do we have stuff that could be uploaded tomorrow, or is there something in our plans, or is it something that we'll do if and when something clearable comes along (which will be hard, as this is strictly post-1945)?

Tomorrow? Yes, if it is desperately needed, it could be uploaded in less than 48 hours by the Bucharest guys. More realistically speaking, the end of the summer or the last quarter should be a more systematic phase for the Moldavian case. Copyright clearability is not an issue, since Rastko's material is mostly from modern authors who gave non-exclusive rights to publish their works on the Net for free.
David, please let us know anything new you learn about this subject, for it could be important for several publishing projects our network is preparing. [We have in our computers perhaps 100 eBooks about Romanian culture, processed in 2003, waiting to be posted this year.]
Re: Romanian and Cyrillic
On Tue, Apr 27, 2004 at 11:29:58PM -0700, Peter Constable wrote: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]

Would you need to have the same web text [in HTML] displayed in Romanian as well as in Cyrillic script, according to the reader's wishes?

It could perhaps be put that way: yes, what I want to know is whether there is any potential need for Romanian-language content, such as web pages, to be provided (whether according to a reader's wish, or to reflect the form of a historic document) in Cyrillic script rather than Latin script.

I did download pages in _Moldavian_ some time ago. There is a singer called Sofia Rotaru; she was rather popular in the Soviet Union, and she used to sing in Russian, Ukrainian and Moldavian (and still does: I saw her recently performing on Russian TV, singing songs in all three languages, although I do not know what that last language is called now). Anyway, I was looking for the lyrics of some songs and got to a web page with the texts of some of her songs. The page itself was in Russian, but the lyrics were in the respective languages, including Moldavian. The page seemed to be rather recent, with regular updates, etc.

--
Radovan Garabík http://melkor.dnp.fmph.uniba.sk/~garabik/ | garabik @ melkor.dnp.fmph.uniba.sk
Antivirus alert: file .signature infected by signature virus. Hi! I'm a signature virus! Copy me into your signature file to help me spread!
Re: Question on Unicode-prevalence (general and for Cyrillic)
Peter Kirk wrote: 2. A graduate student mentioned that it was her impression that most Cyrillic webpages (at least for Russian, her interest) are still not encoded in Unicode. (She is doing some research on the use of certain words in Russian and wanted to know how best to do the search.)

Google finds matches not just in Unicode-encoded pages, but also in pages in other Cyrillic encodings. On the other hand, if the student is willing to write some kind of spider herself, it is very likely she will have to deal with all those encodings herself, won't she?

Antoine
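[Editorially added sketch.] Antoine's point, that a self-written spider must cope with all the Cyrillic encodings, can be made concrete. The candidate list and the scoring heuristic below are illustrative assumptions, not anything prescribed in the thread. Note a real limitation: the legacy Cyrillic codepages (KOI8-R, CP1251, ISO 8859-5, CP866) all map most of the high byte range to letters, so a simple "share of Cyrillic letters" score often cannot tell them apart; production detectors use letter-frequency statistics on top of this.

```python
# Sketch: guess the encoding of a fetched page body whose charset is unknown.
# Candidate list and scoring are illustrative assumptions.

CANDIDATES = ["utf-8", "koi8-r", "cp1251", "iso8859-5", "cp866"]

def guess_decode(raw: bytes) -> tuple[str, str]:
    """Return (decoded_text, encoding_name) for common Cyrillic encodings."""
    # Strict UTF-8 rarely succeeds by accident on legacy 8-bit Cyrillic text,
    # so a clean strict decode is strong evidence of UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Fall back: score each single-byte codec by the share of letters that
    # land in the Cyrillic block U+0400..U+04FF after decoding.
    best = ("", "", -1.0)
    for enc in CANDIDATES[1:]:
        text = raw.decode(enc, errors="replace")
        letters = [c for c in text if c.isalpha()]
        if not letters:
            continue
        score = sum("\u0400" <= c <= "\u04FF" for c in letters) / len(letters)
        if score > best[2]:
            best = (text, enc, score)
    return best[0], best[1]
```

A spider built along these lines would decode each fetched body with `guess_decode` and then normalize everything to Unicode for searching, which is exactly what lets one index pages regardless of their original legacy encoding.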
Question on Unicode-prevalence (general and for Cyrillic)
Two questions:

1. Is there a way to determine the prevalence of Unicode in electronic documents (vs. documents not in Unicode)? At least for the Web, has anyone done a statistical sampling to determine the percentage of Unicode-encoded webpages?

2. A graduate student mentioned that it was her impression that most Cyrillic webpages (at least for Russian, her interest) are still not encoded in Unicode. (She is doing some research on the use of certain words in Russian and wanted to know how best to do the search.) Again: has anyone looked into the situation with Cyrillic in terms of the percentage of Web documents in Unicode?

With thanks,
Debbie Anderson

Deborah Anderson
Researcher, Dept. of Linguistics, UC Berkeley
Email: [EMAIL PROTECTED] or [EMAIL PROTECTED]
Script Encoding Initiative: www.linguistics.berkeley.edu/~dwanders
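[Editorially added sketch.] On question 1, assuming one already has a corpus of fetched page bodies (the sampling and fetching machinery is omitted here), one crude byte-level check is whether the raw content even forms valid UTF-8. This is only a sketch and it overcounts: pure-ASCII pages pass regardless of their declared charset, so a real survey would also inspect HTTP `Content-Type` headers and `<meta>` charset declarations.

```python
# Sketch: estimate the share of valid-UTF-8 pages in a sample of page bodies.
# Caveat: ASCII-only pages count as valid UTF-8 whatever their declared charset.

def is_valid_utf8(raw: bytes) -> bool:
    """True if the bytes decode as strict UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

def utf8_share(pages: list[bytes]) -> float:
    """Fraction of sampled page bodies that are valid UTF-8."""
    if not pages:
        return 0.0
    return sum(map(is_valid_utf8, pages)) / len(pages)
```

For non-ASCII Cyrillic text the check is quite discriminating, since legacy single-byte Cyrillic encodings almost never produce byte sequences that happen to be well-formed UTF-8.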