Re: Pashto yeh characters
This is a rather late reply, but I think this document should be useful: http://www.evertype.com/standards/af/af-locales.pdf

The first few pages discuss the various Yeh forms, recommend which ones to use, and recommend avoiding some in certain forms.

Roozbeh

On Thu, 2010-07-22 at 12:17 -0500, lingu...@artstein.org wrote:

Hi,

This is a query I had originally sent to the Linguist List, modified based on feedback I got there. I am hoping that someone in the Unicode community can help resolve this.

I'm interested in knowing if there is a standard way to encode the various Pashto yeh-characters in Unicode, and if so, what it is. This question is a bit more complicated than it sounds, so here's the background.

Pashto is written using a derivative of the Arabic script. The Arabic language uses a single character for both /j/ and /i:/ sounds. Like many Arabic characters, this one is composed of a base form (which changes shape based on its position in a word) and dots (in this case, two dots below the base form). In most of the Arabic-speaking world the dots are present with both the medial and final form, though in Egypt (and possibly other places) the convention is to have two dots on the medial form but leave them off the final form. The standard arrangement of the two dots is horizontal, but they can be placed vertically or diagonally with no change in meaning.

Persian also uses a single character for /j/ and /i:/, with the convention of two dots on the medial form, no dots on the final form (same as in Egypt). The two conventions for the /j/-/i:/ character were given distinct code points in Unicode despite the fact that they do not contrast; documentation is scarce, but presumably this was done in order to allow writing both Arabic and Persian in the same document. Therefore, Unicode has the following code points (I'm not giving the names, but rather the typical visual representation of the glyphs and typical use).

U+064A  two dots medially and finally (/j/-/i:/ Arabic convention)
U+06CC  two dots medially, none finally (/j/-/i:/ Persian convention)

There are a few additional yeh-base code points defined, some of which are relevant to Pashto (see below).

U+0649  no dots medially or finally (Arabic /a/ from etymological /j/)
U+0626  hamza above medially and finally (Arabic glottal stop in certain contexts)
U+06D0  two dots medially and finally in vertical arrangement
U+06CD  tail and no dots in final position

As it so happens, there is much confusion in how these characters are used in actual electronic documents, which is not surprising given that U+06CC looks like U+064A in medial position but like U+0649 in final position. There is an excellent article by Jonathan Kew that sorts out what this means for various languages that use derivatives of the Arabic script.

http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=arabicletterusagenotes&filename=ArabicLetterUsageNotes.pdf

Unfortunately, this article does not discuss Pashto. I have little knowledge of the language, but here's what I managed to understand from the inspection of a few documents and with the help of friendly people on the Linguist List (and please correct me if I'm wrong).

Traditionally, Pashto used a single character with the same convention as in Persian, of two dots in the medial form and none on the final form, and with no significance attached to the visual arrangement of the dots. The character was three-ways ambiguous between the sounds /j/, /i:/ and /e/. In recent decades (probably since the 1970s or 1980s) there has been some differentiation, partly due to changes in the typesetting process and partly due to a deliberate effort of the Pashto Academy at the University of Peshawar, Pakistan.

One convention that has gained fairly wide acceptance is a distinction between a horizontal arrangement of the dots, representing /j/ or /i:/ as in Arabic and Persian, and a vertical arrangement representing the sound /e/. This distinction is the same as in Uighur, and the character with vertical dots has been codified as U+06D0. Additional conventions include a hamza (U+0626) or tail (U+06CD) to represent /j/ at the end of a word in certain grammatical markers. All of these are quite standard by now and do not pose much of a problem.

However, a further convention appears to have arisen, which as far as I can tell is unique to Pashto in that it distinguishes between /j/ and /i:/ (though only in word-final position):

/j/ is written with two dots medially, none finally
/i:/ is written with two dots both medially and finally

I have never seen this codified explicitly, but this is the impression I get from examining a few recent Pashto documents. Which brings me to my original question, of how to represent these characters in
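The query deliberately lists the yeh code points by appearance rather than by name. As a small aid, the official character names can be looked up with Python's standard `unicodedata` module. This is only a sketch; the list `YEH_CODE_POINTS` and the helper `describe` are illustrative names, not anything from the thread.

```python
import unicodedata

# The six yeh-like code points listed in the query (illustrative constant).
YEH_CODE_POINTS = [0x064A, 0x06CC, 0x0649, 0x0626, 0x06D0, 0x06CD]


def describe(code_point: int) -> str:
    """Return 'U+XXXX  OFFICIAL UNICODE NAME' for a single code point."""
    return f"U+{code_point:04X}  {unicodedata.name(chr(code_point))}"


if __name__ == "__main__":
    for cp in YEH_CODE_POINTS:
        print(describe(cp))
```

The names describe identity, not usage: as the thread goes on to argue, a name such as ALEF MAKSURA or FARSI YEH does not by itself settle which code point a given language should use.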
Re: Pashto yeh characters
Persian and Urdu write [g] using a kaf character with a line above U+06AF, while Pashto uses kaf with a ring U+06AB. It really should be that simple.

I seem to remember that Persian used kaf with three dots above (like your Moroccan example) at least in the 19th century. No idea when they switched to the double-lined version. (I can well imagine how the three dots would have merged into a line, though this might just as well not be the origin of that character.)

Szabolcs
Re: Pashto yeh characters
On Wed, Jul 28, 2010 at 7:20 PM, Murray Sargent murr...@exchange.microsoft.com wrote:

Andreas Prilop commented: A native speaker of English does not /automatically/ know better about English grammar, English punctuation than an informed Frenchman.

So true, so true. Most native speakers of English have only limited understanding of English grammar.

I very recently read an anecdote about Radloff, the Russian Turkologist. One day a Turk visited him and told him his theories and ideas about the Turkic languages. It quite soon became apparent that he was not to be taken seriously. So Radloff asked:

"Why do you think your ideas are right?"
"Because I'm a Turkologist," the man replied.
"And what makes you a Turkologist?"
"Well, I'm a Turk, and my mother tongue is Turkish."
"Oh no, my friend, a bird is not an ornithologist either..."

... Actually, in general birds know pretty little about birds :-)

Szabolcs
Re: Pashto yeh characters
On Tue, 27 Jul 2010, Arno Schmitt wrote:

Since U+0649 is called alif maqsura it should be used for alif maqsura.

By that argument, you must use U+0027 for an apostrophe instead of U+2019. The Unicode names for characters are often historical and you should not infer anything from such names.

Please note that in the Qur'an it occurs not only at the end of words.

If you argue with archaic spelling, then ð and þ are English letters.

Or do you use small l for capital I when using Helvetica?

They don't even have the same stroke width.
Re: Pashto yeh characters
Hi Kamal,

Thanks for the helpful comment -- especially the URLs. A quick check showed that at least on the BBC, U+064A and U+06CC are used interchangeably, even in final position where the glyphs differ. My Pashto is extremely weak, but even I can recognize that in the following article, both 06A9 0631 0632 06CC (in the headline) and 06A9 0631 0632 064A (in the first line of text) spell the name of the Afghan president.

http://www.bbc.co.uk/pashto/afghanistan/2010/04/100411_hh-kandahar-clash.shtml

The pattern I thought I had noticed, with an emerging distinction between yeh with and without dots in final position, appears to be a fluke of the data I had examined. In a broader sampling of texts, writers use both U+064A and U+06CC and don't care much about whether dots appear on the final forms.

I'm still a bit flummoxed as to how a single writer can produce U+064A and U+06CC in such an apparently random fashion, given that they require distinct keystrokes. The Mac on which I am presently writing (actually my wife's computer) has an Afghan Pashto keyboard layout where U+06CC is produced by the d key in the QWERTY layout, and U+064A is produced by shift+d (this is the same as in the keyboard layouts set by Iranian standards ISIRI 2901 and ISIRI 9147). Are the BBC typists randomly pressing shift when typing yeh?

On a similar note, it didn't take me too long to find an article where the word Pentagon had two variants for the g character -- U+06AB in the headline, U+06AF in the first line of text.

http://www.dw-world.de/dw/article/0,,5842070,00.html

In my Afghan Pashto keyboard layout, these characters are ' and option+' respectively. Are the Deutsche Welle typists randomly pressing option when typing gaf?

(These are intended as rhetorical questions, but if someone has an answer I'd be happy to hear.)

-Ron.
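The kind of check described in this message (the same word spelled with different yeh or gaf code points within one article) can be automated. The following is a minimal sketch, not the method actually used in the thread; the grouping in `VARIANT_GROUPS` and the function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical grouping of the look-alike code points mentioned in the message.
VARIANT_GROUPS = {
    "yeh": {"\u064A", "\u06CC", "\u0649"},
    "gaf": {"\u06AB", "\u06AF"},
}


def variant_counts(text: str) -> dict:
    """Count, per group, how often each variant code point occurs in text."""
    counts = {group: Counter() for group in VARIANT_GROUPS}
    for ch in text:
        for group, members in VARIANT_GROUPS.items():
            if ch in members:
                counts[group][f"U+{ord(ch):04X}"] += 1
    return counts


def mixed_groups(text: str) -> list:
    """Return the groups for which more than one variant appears in text."""
    return sorted(g for g, c in variant_counts(text).items() if len(c) > 1)


if __name__ == "__main__":
    # The two spellings quoted in the message: 06A9 0631 0632 06CC / 064A.
    headline = "\u06A9\u0631\u0632\u06CC"
    first_line = "\u06A9\u0631\u0632\u064A"
    print(mixed_groups(headline + " " + first_line))
```

Run over a whole article, a non-empty result from `mixed_groups` flags documents where variants are mixed; it says nothing about whether the mixing is meaningful or accidental.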
Re: Pashto yeh characters
On Tue, 27 Jul 2010, David Starner wrote:

MacArabic, Windows-1256 and ISO-8859-6 are all standards for the encoding of Arabic. Thus U+0649 must be an Arabic character; existing use in both those sets and in Unicode says that it is.

By that circular logic, S with cedilla and T with cedilla must be Romanian letters because they are included in ISO-8859-2.

Arabic 8-bit character sets go back to Arab ASMO standards

http://www.itscj.ipsj.or.jp/ISO-IR/089.pdf
http://www.itscj.ipsj.or.jp/ISO-IR/127.pdf

which are several decades old. These standards had isolated letters ya with and without dots. I have no evidence that these ASMO standards specified initial and medial forms of ya without dots. The two letters were taken into Unicode as 0649 and 064A.

The question is: When and why was it specified (in Unicode) that U+0649 should have four glyphs all without dots? The Arabic fonts in Windows XP (as well as other fonts I saw) have only isolated and final glyphs for U+0649.
Re: Pashto yeh characters
On Wed, Jul 28, 2010 at 04:33:12PM +0200, Andreas Prilop wrote:

On Tue, 27 Jul 2010, Khaled Hosny wrote:

it just happens not to occur in those two positions in modern orthography, but it can be seen in the Quran, which is still written in the old, early Islamic orthography.

If you argue with archaic spelling, then ð and þ are English letters.

Except we are talking about a letter that is still in contemporary use, just not occurring at certain positions of the word.

| http://www.unicode.org/mail-arch/unicode-ml/y2010-m07/att-0295/01-U_0649.jpg
| http://www.unicode.org/mail-arch/unicode-ml/y2010-m07/att-0295/01-U_0649.jpg

According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer, page 9, the ya is written with two dots in such cases, too.

Except that this is not a Yaa and not pronounced like a Yaa; it is an Alef (note the small dagger Alef above it).

I doubt such questions can be solved with reference to the Quran, which originally had no dots at all.

Those are two scans from contemporary prints of the Quran, where regular Yaa have dots. Just because Uyghur still follows the old orthography of placing Alef Maqsura in the middle of the word doesn't suddenly make it a non-Arabic character.

Regards,
Khaled

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer
Re: Pashto yeh characters
On Wed, Jul 28, 2010 at 10:51 AM, Andreas Prilop prilop4...@trashmail.net wrote:

On Tue, 27 Jul 2010, David Starner wrote:

MacArabic, Windows-1256 and ISO-8859-6 are all standards for the encoding of Arabic. Thus U+0649 must be an Arabic character; existing use in both those sets and in Unicode says that it is.

By that circular logic, S with cedilla and T with cedilla must be Romanian letters because they are included in ISO-8859-2.

They are the exception that proves the rule. I would say that, and the counter-argument would be that Romania, specifically and overtly, demanded that S with comma and T with comma be created for Romanian. The reason S with cedilla and T with cedilla aren't considered the right characters for Romanian is nothing more and nothing less than an overt act by ISO and Unicode. (There's certainly nothing about the characters; S with cedilla and S with comma are in free variation in the Romanian texts I've seen, including one designed to teach young children how to read and write.)

--
Kie ekzistas vivo, ekzistas espero.
Re: Pashto yeh characters
On Tue, 27 Jul 2010, Khaled Hosny wrote:

According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer, page 9, the ya is written with two dots in such cases, too.

Except that this is not a Yaa and not pronounced like a Yaa; it is an Alef (note the small dagger Alef above it).

That is exactly what I meant and exactly what is written in W. Fischer. My point is that there are two dots below.
Re: Pashto yeh characters
On Wed, Jul 28, 2010 at 05:32:21PM +0200, Andreas Prilop wrote:

On Tue, 27 Jul 2010, Khaled Hosny wrote:

According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer, page 9, the ya is written with two dots in such cases, too.

Except that this is not a Yaa and not pronounced like a Yaa; it is an Alef (note the small dagger Alef above it).

That is exactly what I meant and exactly what is written in W. Fischer. My point is that there are two dots below.

No, there aren't, at least in orthographies that differentiate between Yaa and Alef Maqsura by dots.

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer
Re: Pashto yeh characters
Quoting Andreas Prilop prilop4...@trashmail.net:

Hi Andreas,

Thanks for the references to the old 7-bit and 8-bit Arabic character sets.

http://www.itscj.ipsj.or.jp/ISO-IR/089.pdf
http://www.itscj.ipsj.or.jp/ISO-IR/127.pdf

I think these clearly show that alef maksura was the intention behind the dotless code point immediately preceding yeh, which later got incorporated into Unicode as U+0649.

In terms of practice, Arabic-language documents are fairly consistent about using U+064A for yeh and U+0649 for alef maksura -- except in Egypt, which has a tradition of not distinguishing between alef maksura and yeh in final position (both are written without dots). Here's an arbitrary page from today's Al-Ahram newspaper, where both yeh and alef maksura are encoded as U+064A (the same holds for other pages of the site).

http://www.ahram.org.eg/241/2010/07/28/25/31443.aspx

On my computer this looks particularly jarring, because two dots are displayed on alef maksura in words like 'ila ("to") and `ala ("on"). My locale is set to en_US; I wonder if an Egyptian locale setting would cause U+064A to display without dots.

Going back to my original question about Pashto, unfortunately I cannot use the advice you gave in your initial reply, "Use whatever you want." I am not creating Pashto documents for print or electronic distribution, but rather working on automated language-processing tasks. It seems that the only workable solution would be to unify all U+064A and U+06CC characters found in Pashto documents into a single character for processing (and also U+0649 if we encounter it). It is unfortunate that a distinction between the characters cannot be used for disambiguating unvocalized Pashto text, but this appears to be the current state of affairs.

-Ron.
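The unification proposed at the end of this message can be sketched as a simple character fold applied before any further processing. The message does not say which code point should be the target, so choosing U+06CC here is an assumption, as are the names `CANONICAL_YEH`, `YEH_FOLD` and `fold_yeh`.

```python
# Assumption: U+06CC is used as the single internal representation;
# the thread does not prescribe a target code point.
CANONICAL_YEH = "\u06CC"

# Map U+064A and U+0649 onto the canonical yeh; everything else is untouched.
YEH_FOLD = str.maketrans({
    "\u064A": CANONICAL_YEH,
    "\u0649": CANONICAL_YEH,
})


def fold_yeh(text: str) -> str:
    """Collapse the interchangeable yeh code points into one for processing."""
    return text.translate(YEH_FOLD)


if __name__ == "__main__":
    headline = "\u06A9\u0631\u0632\u06CC"
    first_line = "\u06A9\u0631\u0632\u064A"
    # After folding, the two spellings compare equal.
    print(fold_yeh(headline) == fold_yeh(first_line))
```

U+06D0 and U+06CD are deliberately left out of the fold, since the thread treats them as carrying distinct values in Pashto. The fold is meant for internal comparison and indexing only, not for rewriting the documents themselves.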
Re: Pashto yeh characters
On Tue, 27 Jul 2010, Khaled Hosny wrote:

According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer, page 9, the ya is written with two dots in such cases, too.

Except that this is not a Yaa and not pronounced like a Yaa; it is an Alef (note the small dagger Alef above it).

That is exactly what I meant and exactly what is written in W. Fischer. My point is that there are two dots below.

Dear Mr. Prilop,

your point is that this form of alef has two dots below??? I didn't get it.

Allow me a general remark: Yes, sometimes an outside view catches something, e.g. some more theoretical aspect, but most of the time a native writer knows his/her language better than you.

Alif maqsura and Egyptian/Persian/Quranic ya' look the same in final and isolated position, but the underlying letters are not the same. Although Unicode, when it comes to the Arabic script, pays much attention to the shape of the letter, it does not ignore the logical structure, and in the case under discussion we have two different letters in Arabic, and in Unicode two different characters representing them.
Re: Pashto yeh characters
All three Pashto Yeh characters represent significant phonetic differences. 06CC is used for the /i/ sound while 06D0 (with two vertical dots below) stands for /e/. According to some sources, the third one (06CD) represents /ej/ and is not consistently used for all dialects.

I think the inconsistency you are seeing between 06CC and 06D0 is due to carelessness. I seem to remember that the DW site was more consistent in their use. They also use 06CD while the BBC does not.

As to the repertoire offered by different keyboard layouts, it's become relatively easy to customize any layout.

Kamal
Re: Pashto yeh characters
On Wed, 28 Jul 2010, lingu...@artstein.org wrote: Here's an arbitrary page from today's Al-Ahram newspaper, [...] On my computer this looks particularly jarring, You can find enough pages from Continental Europe and Latin America that have an acute accent instead of an apostrophe due to ill-designed keyboard layouts. http://www.tut.fi/library/dlib/faq.htm
RE: Pashto yeh characters
Andreas Prilop commented A native speaker of English does not /automatically/ know better about English grammar, English punctuation than an informed Frenchman. So true, so true. Most native speakers of English have only limited understanding of English grammar. At least in my country. They regularly confuse she and her, he and him, adverbs and adjectives, etc. Sigh. Murray
Re: Pashto yeh characters
Quoting CE Whitehead cewcat...@hotmail.com:

'g' is a non-Arabic sound ... and there is no g in Standard Arabic although there are two ways to write it ...

Oh, there are many more than two ways to write the [g] sound in Arabic. Standard Arabic traditionally transcribes foreign [g] as ghain U+063A, as in Granada. But particular locales have devised their own characters:

Morocco: kaf with 3 dots U+0763, as in Agadir:
http://www.casafree.com/modules/xcgal/albums/userpics/10070/normal_DSCN5410.JPG

Tunisia: qaf with 3 dots U+06A8, as in Gafsa:
http://i4.photobucket.com/albums/y131/LuXuS3000/Tunisia%20Airliners/Gafsa-Ksar.jpg

Israel: jeem with 3 dots U+0686, as in Giv'at Shemuel:
http://upload.wikimedia.org/wikipedia/en/6/62/Givat_shmuel_sign.png

Then there are dialects of Arabic that do have the [g] sound -- in Egypt jeem U+062C is pronounced as [g] (think of Gamal Abdel Nasser), and in many other places qaf U+0642 is pronounced as [g] (think of Muammar al-Gaddafi).

And that's just Arabic... Persian and Urdu write [g] using a kaf character with a line above U+06AF, while Pashto uses kaf with a ring U+06AB. It really should be that simple. You might expect a substitution if someone does not have a character in their font or doesn't know how to access it from a keyboard. However, I noticed the Persian character alongside the Pashto one inside a single Pashto document, and that's just strange.

-Ron.
Re: Pashto yeh characters
On Thu, 22 Jul 2010, lingu...@artstein.org wrote:

[...] To wrap up, are my observations about the Pashto writing conventions correct? And is there a standard for assigning the Pashto characters representing /j/ and /i:/ to Unicode code points?

Practical answer:

U+0649 and U+064A are included in MacArabic/MacFarsi and Windows-1256; but U+06CC is not. Support for 0649 and 064A in fonts is still better than for 06CC. For example, try the various Arabic fonts in Windows XP:
http://www.user.uni-hannover.de/nhtcapri/temp/ya.arabic.html

Therefore you should use only U+0649 and U+064A for Arabic, Persian, Urdu if you want your documents to be displayed on other computers. I have done so in
http://www.user.uni-hannover.de/nhtcapri/arabic-alphabet.html
http://www.user.uni-hannover.de/nhtcapri/persian-alphabet.html
http://www.user.uni-hannover.de/nhtcapri/mac-urdu-alphabet.html

However, for Pashto you need characters outside Windows-1256 anyway.

* * * * * *

Theoretical answer:

U+0649 has (should have) four glyphs without any dots. This is no Arabic letter, but an Uighur letter. Therefore you should not use U+0649 for Arabic, Persian, Pashto, Urdu but only U+06CC. I have done so in
http://www.user.uni-hannover.de/nhtcapri/urdu-alphabet.html
http://www.user.uni-hannover.de/nhtcapri/pashto-alphabet.html

U+0649 has the traditional name alif maqsura because it was taken from ISO-8859-6. But I see no objection to using U+06CC for alif maqsura.

You cannot distinguish the initial and middle glyphs of 064A and 06CC. Use whatever you want. Given the practical answer above, you might prefer U+064A. But if you don't have U+06CC in your font, you probably don't have Pashto letters either.
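The claim in the practical answer, that Windows-1256 covers U+0649 and U+064A but not U+06CC, can be checked against the legacy codecs shipped with Python. This is a hedged sketch: `cp1256` and `iso8859_6` are Python's codec names for Windows-1256 and ISO-8859-6, and the helper `encodable` is an illustrative name. Font support, which the message also mentions, is a separate matter that this does not test.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if the character has a mapping in the given legacy codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Yeh variants from the thread plus two Pashto-specific letters.
    for cp in (0x0649, 0x064A, 0x06CC, 0x06D0, 0x06CD):
        ch = chr(cp)
        print(f"U+{cp:04X}",
              "cp1256:", encodable(ch, "cp1256"),
              "iso8859_6:", encodable(ch, "iso8859_6"))
```

A character that fails here can still be stored in UTF-8 without difficulty; the check only shows why documents that had to round-trip through these 8-bit character sets were limited to U+0649 and U+064A.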
Re: Pashto yeh characters
On Tue, Jul 27, 2010 at 12:43 PM, Andreas Prilop prilop4...@trashmail.net wrote:

U+0649 has (should have) four glyphs without any dots. This is no Arabic letter, but an Uighur letter. Therefore you should not use U+0649 for Arabic, Persian, Pashto, Urdu but only U+06CC.

That's wrong, though. MacArabic, Windows-1256 and ISO-8859-6 are all standards for the encoding of Arabic. Thus U+0649 must be an Arabic character; existing use in both those sets and in Unicode says that it is.

--
Kie ekzistas vivo, ekzistas espero.
Re: Pashto yeh characters
On Tue, Jul 27, 2010 at 06:43:19PM +0200, Andreas Prilop wrote:

[...] U+0649 has (should have) four glyphs without any dots. This is no Arabic letter, but an Uighur letter. Therefore you should not use U+0649 for Arabic, Persian, Pashto, Urdu but only U+06CC.

I'm not sure what the basis of this conclusion is, but U+0649 has no dots in its initial/medial forms in Arabic too. It just happens not to occur in those two positions in modern orthography, but it can be seen in the Quran, which is still written in the old, early Islamic orthography. See the attached image showing the words فسوىهن and ميكىل.

Regards,
Khaled

--
Khaled Hosny
Arabic localiser and member of Arabeyes.org team
Free font developer

attachment: U+0649.jpg
Re: Pashto yeh characters
Andreas Prilop:

U+0649 has the traditional name alif maqsura because it was taken from ISO-8859-6. But I see no objection to use U+06CC for alif maqsura.

I beg to differ. Since U+0649 is called alif maqsura it should be used for alif maqsura. Please note that in the Qur'an it occurs not only at the end of words.

That two glyphs are the same does not mean that the letters are the same. Or do you use small l for capital I when using Helvetica?
RE: Pashto yeh characters
Hi, Khaled, Arno, Andreas:

All the Arabic characters (consonants, hamzas, but not vowel diacritics or numbers) that I need are between U+0621 (hamza) and U+064A; there are vowel diacritics that can be used immediately following these and then the Arabic numbers. So I would concur with Khaled and Arno here that U+0649 is Arabic aleph maqsura.

(Would any of these look-alikes be security issues? Both these characters are allowed in IDNs; see: http://unicode.org/reports/tr36/idn-chars.html)

Thanks all.

Best,
C. E. Whitehead
cewcat...@hotmail.com
Re: Pashto yeh characters
David Starner: On Tue, Jul 27, 2010 at 12:43 PM, Andreas Prilop [U+0649] is no Arabic letter, but an Uighur letter. That's wrong, though. […] U+0649 must be an Arabic character; Andreas probably meant that U+0649 is not part of the Arabic writing system, i.e. the Arabic script as used in writing the Arabic language (with some recognised orthography). You probably mean that U+0649 is part of the Arabic script, which it certainly is. No contradiction here, just not a good idea to use ‘Arabic’ as an adjective with ‘letter’ or ‘character’, unless you make sure everyone agrees – I would – that letters are constituents of writing systems, whereas characters form scripts. Manywhere, though, ‘writing system’, ‘script’, ‘orthography’, ‘alphabet’ and even ‘language’ tend to be synonyms (and may share a name with people and religion, too), as do ‘character’, ‘letter’, ‘glyph’, ‘grapheme’, ‘sign’ and ‘symbol’. Some scholars like to use (or invent) alternative names to aid the distinction, e.g. I’ve seen – I think in one of Coulmas’ books – Latin/Roman and – elsewhere – Arabic/Arabetic/Arabian, but that would only really help if enough people understood and did it.
Re: Pashto yeh characters
Ron, as you've already noticed, there can be multiple conventions for the orthography of a single language. For the Yeh repertoire, typically the following are used:

U+06CC
U+06CD
U+06D0

For a current corpus, have a look at BBC News (http://www.bbc.co.uk/pashto) and Deutsche Welle (http://www.dw-world.de/).

Kamal

On 2010.7.22 10:17, lingu...@artstein.org wrote:

Hi,

This is a query I had originally sent to the Linguist List, modified based on feedback I got there. I am hoping that someone in the Unicode community can help resolve this. I'm interested in knowing if there is a standard way to encode the various Pashto yeh-characters in Unicode, and if so, what it is. This question is a bit more complicated than it sounds, so here's the background.

Pashto is written using a derivative of the Arabic script. The Arabic language uses a single character for both /j/ and /i:/ sounds. Like many Arabic characters, this one is composed of a base form (which changes shape based on its position in a word) and dots (in this case, two dots below the base form). In most of the Arabic-speaking world the dots are present with both the medial and final form, though in Egypt (and possibly other places) the convention is to have two dots on the medial form but leave them off the final form. The standard arrangement of the two dots is horizontal, but they can be placed vertically or diagonally with no change in meaning.

Persian also uses a single character for /j/ and /i:/, with the convention of two dots on the medial form, no dots on the final form (same as in Egypt). The two conventions for the /j/-/i:/ character were given distinct code points in Unicode despite the fact that they do not contrast; documentation is scarce, but presumably this was done in order to allow writing both Arabic and Persian in the same document.
Therefore, Unicode has the following code points (I'm not giving the names, but rather the typical visual representation of the glyphs and typical use).

U+064A  two dots medially and finally (/j/-/i:/ Arabic convention)
U+06CC  two dots medially, none finally (/j/-/i:/ Persian convention)

There are a few additional yeh-base code points defined, some of which are relevant to Pashto (see below).

U+0649  no dots medially or finally (Arabic /a/ from etymological /j/)
U+0626  hamza above medially and finally (Arabic glottal stop in certain contexts)
U+06D0  two dots medially and finally in vertical arrangement
U+06CD  tail and no dots in final position

As it so happens, there is much confusion in how these characters are used in actual electronic documents, which is not surprising given that U+06CC looks like U+064A in medial position but like U+0649 in final position. There is an excellent article by Jonathan Kew that sorts out what this means for various languages that use derivatives of the Arabic script.

http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=arabicletterusagenotes&filename=ArabicLetterUsageNotes.pdf

Unfortunately, this article does not discuss Pashto. I have little knowledge of the language, but here's what I managed to understand from the inspection of a few documents and with the help of friendly people on the Linguist List (and please correct me if I'm wrong).

Traditionally, Pashto used a single character with the same convention as in Persian, of two dots in the medial form and none on the final form, and with no significance attached to the visual arrangement of the dots. The character was three-way ambiguous between the sounds /j/, /i:/ and /e/.
In recent decades (probably since the 1970s or 1980s) there has been some differentiation, partly due to changes in the typesetting process and partly due to a deliberate effort of the Pashto Academy at the University of Peshawar, Pakistan. One convention that has gained fairly wide acceptance is a distinction between a horizontal arrangement of the dots, representing /j/ or /i:/ as in Arabic and Persian, and a vertical arrangement representing the sound /e/. This distinction is the same as in Uighur, and the character with vertical dots has been codified as U+06D0. Additional conventions include a hamza (U+0626) or tail (U+06CD) to represent /j/ at the end of a word in certain grammatical markers. All of these are quite standard by now and do not pose much of a problem.

However, a further convention appears to have arisen, which as far as I can tell is unique to Pashto in that it distinguishes between /j/ and /i:/ (though only in word-final position):

/j/ is written with two dots medially, none finally
/i:/ is written with two dots both medially and finally

I have never seen this codified
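One way to find out which of these conventions a given electronic text actually follows is to tally the yeh-based code points it contains. The following is only a rough Python sketch: the function name `count_yeh_variants` is hypothetical, the glosses paraphrase the list in the query, and the sample string is an arbitrary sequence of code points, not a real Pashto word.

```python
from collections import Counter

# Yeh-based code points named in the query; glosses paraphrase its descriptions.
YEH_CODE_POINTS = {
    "\u0626": "hamza above",
    "\u0649": "no dots",
    "\u064A": "two dots medially and finally",
    "\u06CC": "two dots medially, none finally",
    "\u06CD": "tail, no dots (final)",
    "\u06D0": "two dots in vertical arrangement",
}

def count_yeh_variants(text):
    """Count how often each yeh-based code point occurs in text."""
    counts = Counter(ch for ch in text if ch in YEH_CODE_POINTS)
    return {f"U+{ord(ch):04X}": n for ch, n in sorted(counts.items())}

# Arbitrary sample for illustration only (not a real word).
sample = "\u06CC\u06D0 \u06CD\u06CC"
for label, n in count_yeh_variants(sample).items():
    print(label, n)
```

Run over a larger corpus, a text that contains U+064A alongside U+06CC would be a candidate for the word-final /j/ versus /i:/ distinction described above, though a mix could equally come from inconsistent keyboards or fonts.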
Re: Pashto yeh characters
On Tue, Jul 27, 2010 at 5:07 PM, Christoph Päper christoph.pae...@crissov.de wrote:

David Starner:

On Tue, Jul 27, 2010 at 12:43 PM, Andreas Prilop wrote:

[U+0649] is no Arabic letter, but an Uighur letter.

That's wrong, though. […] U+0649 must be an Arabic character;

Andreas probably meant that U+0649 is not part of the Arabic writing system, i.e. the Arabic script as used in writing the Arabic language (with some recognised orthography). You probably mean that U+0649 is part of the Arabic script, which it certainly is.

No, what I meant was that MacArabic, Windows-1256 and ISO-8859-6 are designed to write the Arabic language. If U+0649 is in these character sets, to say that it's really a Uighur character is like saying that U+0041 is really a Greek character; it spits in the face of how the character has been used and how fonts have been designed for the character.

--
Kie ekzistas vivo, ekzistas espero.
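The claim about the legacy character sets can be checked with a short Python sketch. It assumes that Python's bundled codecs `iso8859_6`, `cp1256` and `mac_arabic` are faithful stand-ins for ISO-8859-6, Windows-1256 and MacArabic; the helper names `encodable` and `coverage` are made up for this example.

```python
# Python codec names assumed to correspond to the three legacy character sets.
LEGACY_CODECS = ["iso8859_6", "cp1256", "mac_arabic"]

def encodable(ch, codec):
    """True if the character has a byte representation in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

def coverage(ch):
    """Map each legacy codec name to whether it can encode the character."""
    return {codec: encodable(ch, codec) for codec in LEGACY_CODECS}

# U+0649 is present in all three Arabic-language character sets.
print("U+0649", coverage("\u0649"))
# U+06D0 (vertical dots) is shown for comparison.
print("U+06D0", coverage("\u06D0"))
```

This only shows which characters the encodings contain; it says nothing about how fonts shape them.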