Re: Pashto yeh characters

2010-10-01 Thread Roozbeh Pournader
This is a rather late reply, but I think this document should be useful:

http://www.evertype.com/standards/af/af-locales.pdf

The first few pages discuss and recommend various Yeh forms to be used,
and include a recommendation for avoiding some in certain forms.

Roozbeh

On Thu, 2010-07-22 at 12:17 -0500, lingu...@artstein.org wrote:
 Hi,
 
 This is a query I had originally sent to the Linguist List, modified  
 based on feedback I got there. I am hoping that someone in the Unicode  
 community can help resolve this.
 
 I'm interested in knowing if there is a standard way to encode the  
 various Pashto yeh-characters in Unicode, and if so, what it is. This  
 question is a bit more complicated than it sounds, so here's the  
 background.
 
 Pashto is written using a derivative of the Arabic script. The Arabic  
 language uses a single character for both /j/ and /i:/ sounds. Like  
 many Arabic characters, this one is composed of a base form (which  
 changes shape based on its position in a word) and dots (in this case,  
 two dots below the base form). In most of the Arabic-speaking world  
 the dots are present with both the medial and final form, though in  
 Egypt (and possibly other places) the convention is to have two dots  
 on the medial form but leave them off the final form. The standard  
 arrangement of the two dots is horizontal, but they can be placed  
 vertically or diagonally with no change in meaning.
 
 Persian also uses a single character for /j/ and /i:/, with the  
 convention of two dots on the medial form, no dots on the final form  
 (same as in Egypt).
 
 The two conventions for the /j/-/i:/ character were given distinct  
 code points in unicode despite the fact that they do not contrast;  
 documentation is scarce, but presumably this was done in order to  
 allow writing both Arabic and Persian in the same document. Therefore,  
 Unicode has the following code points (I'm not giving the names, but  
 rather the typical visual representation of the glyphs and typical use).
 
 U+064A two dots medially and finally (/j/-/i:/ Arabic convention)
 U+06CC two dots medially, none finally (/j/-/i:/ Persian convention)
 
 There are a few additional yeh-base code points defined, some of which  
 are relevant to Pashto (see below).
 
 U+0649 no dots medially or finally (Arabic /a/ from etymological /j/)
 U+0626 hamza above medially and finally (Arabic glottal stop in  
 certain contexts)
 U+06D0 two dots medially and finally in vertical arrangement
 U+06CD tail and no dots in final position
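
For readers who want the formal character names alongside the visual descriptions above, here is a minimal Python sketch. It only looks up the six code points listed in this message in the standard unicodedata module; the helper name describe is made up for illustration.

```python
import unicodedata

# Yeh-based code points listed in this message.
YEH_CODE_POINTS = [0x064A, 0x06CC, 0x0649, 0x0626, 0x06D0, 0x06CD]

def describe(code_points):
    """Return (U+XXXX label, Unicode character name) pairs."""
    return [(f"U+{cp:04X}", unicodedata.name(chr(cp))) for cp in code_points]

for label, name in describe(YEH_CODE_POINTS):
    print(label, name)
```

The names are historical labels and say nothing about which language should use which code point.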
 
 As it so happens, there is much confusion in how these characters are  
 used in actual electronic documents, which is not surprising given  
 that U+06CC looks like U+064A in medial position but like U+0649 in  
 final position. There is an excellent article by Jonathan Kew that  
 sorts out what this means for various languages that use derivatives  
 of the Arabic script.
 
 http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=arabicletterusagenotes&filename=ArabicLetterUsageNotes.pdf
 
 Unfortunately, this article does not discuss Pashto. I have little  
 knowledge of the language, but here's what I managed to understand  
 from the inspection of a few documents and with the help of friendly  
 people on the Linguist List (and please correct me if I'm wrong).
 
 Traditionally, Pashto used a single character with the same convention  
 as in Persian, of two dots in the medial form and none on the final  
 form, and with no significance attached to the visual arrangement of  
 the dots. The character was 3-ways ambiguous between the sounds /j/,  
 /i:/ and /e/. In recent decades (probably since the 1970s or 1980s)  
 there has been some differentiation, partly due to changes in the  
 typesetting process and partly due to a deliberate effort of the  
 Pashto Academy at the University of Peshawar, Pakistan.
 
 One convention that has gained fairly wide acceptance is a distinction  
 between a horizontal arrangement of the dots, representing /j/ or /i:/  
 as in Arabic and Persian, and a vertical arrangement representing the  
 sound /e/. This distinction is the same as in Uighur, and the  
 character with vertical dots has been codified as U+06D0. Additional  
 conventions include a hamza (U+0626) or tail (U+06CD) to represent /j/  
 at the end of a word in certain grammatical markers. All of these are  
 quite standard by now and do not pose much of a problem.
 
 However, a further convention appears to have arisen, which as far as  
 I can tell is unique to Pashto in that it distinguishes between /j/  
 and /i:/ (though only in word-final position):
 
 /j/ is written with two dots medially, none finally
 /i:/ is written with two dots both medially and finally
 
 I have never seen this codified explicitly, but this is the impression  
 I get from examining a few recent Pashto documents. Which brings me to  
 my original question, of how to represent these characters in 

Re: Pashto yeh characters

2010-07-29 Thread André Szabolcs Szelp
Persian and Urdu write [g] using a kaf character with a line above U+06AF,
while Pashto uses kaf with a ring U+06AB. It really should be that simple.

I seem to remember that Persian used kaf with three dots above (like your
Moroccan example) at least in the 19th century. No idea when they switched
to the double-lined version. (I can well imagine how the three dots
would have merged into a line, though this might not be the origin of
that character.)

Szabolcs


Re: Pashto yeh characters

2010-07-29 Thread André Szabolcs Szelp
On Wed, Jul 28, 2010 at 7:20 PM, Murray Sargent 
murr...@exchange.microsoft.com wrote:
Andreas Prilop commented "A native speaker of English does not
/automatically/ know better about English grammar, English punctuation than
an informed Frenchman." So true, so true. Most native speakers of English
have only limited understanding of English grammar.

I very recently read an anecdote about Radloff, the Russian Turkologist.
One day a Turk visited him and told him his theories and ideas about the
Turkic languages. It soon became apparent that he was not to be taken
seriously. So Radloff asked:
- Why do you think your ideas are right?
- Because I'm a Turkologist, the man replied.
- And what makes you a Turkologist?
- Well, I'm a Turk, and my mother tongue is Turkish.
- Oh no, my friend, a bird is not an ornithologist either...

... Actually, in general birds know pretty little about birds :-)

Szabolcs


Re: Pashto yeh characters

2010-07-28 Thread Andreas Prilop
On Tue, 27 Jul 2010, Arno Schmitt wrote:

 Since U+0649 is called alif maqsura it should be used for alif maqsura.

By that argument, you must use U+0027 for an apostrophe instead
of U+2019.

The Unicode names for characters are often historical and
you should not infer anything from such names.

 Please note that in the Qur'an it occurs not only at the end of words.

If you argue with archaic spelling, then ð and þ are English letters.

 Or do you use small l for capital I when using Helvetica?

They don't even have the same stroke width.



Re: Pashto yeh characters

2010-07-28 Thread linguist

Hi Kamal,

Thanks for the helpful comment -- especially the URLs. A quick check  
showed that at least on the BBC, U+064A and U+06CC are used  
interchangeably, even in final position where the glyphs differ. My  
Pashto is extremely weak, but even I can recognize that in the  
following article, both 06A9 0631 0632 06CC (in the headline) and 06A9  
0631 0632 064A (in the first line of text) spell the name of the  
Afghan president.


http://www.bbc.co.uk/pashto/afghanistan/2010/04/100411_hh-kandahar-clash.shtml
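
The difference between the two spellings is invisible in many fonts except in final position, so it helps to dump the code points. A small Python sketch, using only the two sequences quoted above (the function name code_points is made up for illustration):

```python
def code_points(text):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

# The two spellings quoted above: headline vs. first line of text.
headline = "\u06A9\u0631\u0632\u06CC"
body = "\u06A9\u0631\u0632\u064A"

print(code_points(headline))  # 06A9 0631 0632 06CC
print(code_points(body))      # 06A9 0631 0632 064A
print(headline == body)       # False
```

A plain string comparison treats the two spellings as different words, which is the practical problem for any text processing.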

The pattern I thought I had noticed, with an emerging distinction  
between yeh with and without dots in final position, appears to be a  
fluke of the data I had examined. In a broader sampling of texts,  
writers use both U+064A and U+06CC and don't care much about whether  
dots appear on the final forms.


I'm still a bit flummoxed as to how a single writer can produce U+064A  
and U+06CC in such an apparently random fashion, given that they  
require distinct keystrokes. The Mac on which I am presently writing  
(actually my wife's computer) has an Afghan Pashto keyboard layout  
where U+06CC is produced by the d key in the QWERTY layout, and
U+064A is produced by shift+d (this is the same as in the keyboard  
layouts set by Iranian standards ISIRI 2901 and ISIRI 9147). Are the  
BBC typists randomly pressing shift when typing yeh?


On a similar note, it didn't take me too long to find an article where  
the word Pentagon had two variants for the g character -- U+06AB  
in the headline, U+06AF in the first line of text.


http://www.dw-world.de/dw/article/0,,5842070,00.html

In my Afghan Pashto keyboard layout, these characters are ' and  
option+' respectively. Are the Deutsche Welle typists randomly  
pressing option when typing gaf?


(These are intended as rhetorical questions, but if someone has an  
answer I'd be happy to hear.)


-Ron.


Quoting Mansour, Kamal kamal.mans...@monotypeimaging.com:

Ron, as you've already noticed, there can be multiple conventions   
for the orthography of a single language.


For the Yeh repertoire, typically the following are used:
u+06CC
u+06CD
u+06D0

For a current corpus, have a look at BBC News   
(http://www.bbc.co.uk/pashto) and Deutsche Welle   
(http://www.dw-world.de/)


Kamal



Re: Pashto yeh characters

2010-07-28 Thread Andreas Prilop
On Tue, 27 Jul 2010, David Starner wrote:

 MacArabic, Windows-1256 and ISO-8859-6 are all standards for
 the encoding of Arabic. Thus U+0649 must be an Arabic character;
 existing use in both those sets and in Unicode says that it is.

By that circular logic, S with cedilla and T with cedilla
must be Romanian letters because they are included in ISO-8859-2.

Arabic 8-bit character sets go back to Arab ASMO standards
 http://www.itscj.ipsj.or.jp/ISO-IR/089.pdf
 http://www.itscj.ipsj.or.jp/ISO-IR/127.pdf
which are several decades old. These standards had isolated letters
ya with and without dots. I have no evidence that these ASMO
standards specified initial and medial forms of ya without dots.
The two letters were taken into Unicode as 0649 and 064A.

The question is:
When and why was it specified (in Unicode) that U+0649 should have
four glyphs all without dots?

The Arabic fonts in Windows XP (as well as other fonts I saw)
have only isolated and final glyphs for U+0649.



Re: Pashto yeh characters

2010-07-28 Thread Khaled Hosny
On Wed, Jul 28, 2010 at 04:33:12PM +0200, Andreas Prilop wrote:
 On Tue, 27 Jul 2010, Khaled Hosny wrote:
 
  it just happen not to get in those two positions
  in modern orthography, but it can be seen in Quran
  which is still written in the old, early Islamic orthography.
 
 If you argue with archaic spelling, then ð and þ are English letters.

Except we are talking about a letter that is still in contemporary use,
just not occurring at certain positions of the word.

 | http://www.unicode.org/mail-arch/unicode-ml/y2010-m07/att-0295/01-U_0649.jpg
 | http://www.unicode.org/mail-arch/unicode-ml/y2010-m07/att-0295/01-U_0649.jpg
 
 According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer,
 page 9, the ya is written two dots in such cases, too.

Except that this is not a Yaa and not pronounced like a Yaa, it is an
Alef (note the small dagger Alef above it).

 I doubt such questions can be solved with reference to the Quran,
 which originally had no dots at all.

Those are two scans from contemporary prints of the Quran, where regular
Yaa has dots.

Just because Uyghur is still following the old orthography of placing
Alef Maqsura in the middle of the word doesn't suddenly make it a
non-Arabic character.

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer



Re: Pashto yeh characters

2010-07-28 Thread David Starner
On Wed, Jul 28, 2010 at 10:51 AM, Andreas Prilop
prilop4...@trashmail.net wrote:
 On Tue, 27 Jul 2010, David Starner wrote:

 MacArabic, Windows-1256 and ISO-8859-6 are all standards for
 the encoding of Arabic. Thus U+0649 must be an Arabic character;
 existing use in both those sets and in Unicode says that it is.

 By that circular logic, S with cedilla and T with cedilla
 must be Romanian letters because they are included in ISO-8859-2.

They are the exception that proves the rule. I would say that, and the
counter-argument would be that Romania, specifically and overtly,
demanded that S with comma and T with comma be created for Romanian.
The reason S with cedilla and T with cedilla aren't considered the
right characters for Romanian is nothing more and nothing less than an
overt act by ISO and Unicode. (There's certainly nothing about the
characters themselves; S with cedilla and S with comma are in free
variation in the Romanian texts I've seen, including one designed to teach
young children how to read and write.)

-- 
Kie ekzistas vivo, ekzistas espero.



Re: Pashto yeh characters

2010-07-28 Thread Andreas Prilop
On Tue, 27 Jul 2010, Khaled Hosny wrote:

 According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer,
 page 9, the ya is written two dots in such cases, too.

 Except that this is not a Yaa and not pronounced like a Yaa, it is an
 Alef (note the small dagger Alef above it).

That is exactly what I meant and exactly what is written in W. Fischer.

My point is that there are two dots below.



Re: Pashto yeh characters

2010-07-28 Thread Khaled Hosny
On Wed, Jul 28, 2010 at 05:32:21PM +0200, Andreas Prilop wrote:
 On Tue, 27 Jul 2010, Khaled Hosny wrote:
 
  According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer,
  page 9, the ya is written two dots in such cases, too.
 
  Except that this is not a Yaa and not pronounced like a Yaa, it is an
  Alef (note the small dagger Alef above it).
 
 That is exactly what I meant and exactly what is written in W. Fischer.
 
 My point is that there are two dots below.

No, there aren't, at least in orthographies that differentiate between
Yaa and Alef Maqsura by dots.

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer



Re: Pashto yeh characters

2010-07-28 Thread linguist

Quoting Andreas Prilop prilop4...@trashmail.net:

Hi Andreas,

Thanks for the references to the old 7-bit and 8-bit Arabic character sets.


 http://www.itscj.ipsj.or.jp/ISO-IR/089.pdf
 http://www.itscj.ipsj.or.jp/ISO-IR/127.pdf


I think these clearly show that alef maksura was the intention behind  
the dotless code point immediately preceding yeh, which later got  
incorporated into Unicode as U+0649.


In terms of practice, Arabic-language documents are fairly consistent  
about using U+064A for yeh and U+0649 for alef maksura -- except in  
Egypt, which has a tradition of not distinguishing between alef  
maksura and yeh in final position (both are written without dots).  
Here's an arbitrary page from today's Al-Ahram newspaper, where both  
yeh and alef maksura are encoded as U+064A (the same holds for other  
pages of the site).


http://www.ahram.org.eg/241/2010/07/28/25/31443.aspx

On my computer this looks particularly jarring, because two dots are
displayed on alef maksura in words like 'ila "to" and `ala "on". My
locale is set to en_US; I wonder if an Egyptian locale setting would
cause U+064A to display without dots.


Going back to my original question about Pashto, unfortunately I
cannot use the advice you gave in your initial reply, "Use whatever
you want." I am not creating Pashto documents for print or electronic
distribution, but rather working on automated language-processing
tasks. It seems that the only workable solution would be to unify all  
U+064A and U+06CC characters found in Pashto documents into a single  
character for processing (and also U+0649 if we encounter it). It is  
unfortunate that a distinction between the characters cannot be used  
for disambiguating unvocalized Pashto text, but this appears to be the  
current state of affairs.
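
The unification described above could be sketched in Python roughly as follows. This is only an illustration: the choice of U+06CC as the target is an assumption (any of the three would do for internal processing), and the names YEH_FOLD and fold_yeh are made up. U+06D0 and U+06CD are deliberately left alone, since they are distinct Pashto letters.

```python
# Assumption: fold everything to U+06CC; the target is arbitrary as long
# as it is applied consistently before any comparison or indexing.
YEH_FOLD = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA
})

def fold_yeh(text):
    """Map U+064A and U+0649 to U+06CC; leave all other characters as-is."""
    return text.translate(YEH_FOLD)

a = "\u06A9\u0631\u0632\u06CC"
b = "\u06A9\u0631\u0632\u064A"
print(fold_yeh(a) == fold_yeh(b))  # True
```

The folding is lossy, so it belongs in a processing copy of the text, not in the stored original.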


-Ron.




Re: Pashto yeh characters

2010-07-28 Thread Arno Schmitt
 On Tue, 27 Jul 2010, Khaled Hosny wrote:
 According to Grammatik des klassischen Arabisch by Wolfdietrich Fischer,
 page 9, the ya is written two dots in such cases, too.

 Except that this is not a Yaa and not pronounced like a Yaa, it is an
 Alef (note the small dagger Alef above it).

 That is exactly what I meant and exactly what is written in W. Fischer.

 My point is that there are two dots below.

Dear Mr. Prilop,

your point is that this form of alef has two dots below ???
I didn't get it.

Allow me a general remark:
Yes, sometimes an outside view catches something -- e.g. some
more theoretical aspect,
but:
most of the time, a native writer knows his/her language
better than you.

Alif maqsura and Egyptian/Persian/Quranic ya' look
the same in final and isolated position, but the underlying
letters are not the same.
Although Unicode -- when it comes to the Arabic script --
pays much attention to the shape of the letter it does
not ignore the logical structure, and in the case under
discussion we have two different letters in Arabic, and
in Unicode two different chars representing them.







Re: Pashto yeh characters

2010-07-28 Thread Mansour, Kamal

All three Pashto Yeh characters represent significant phonetic differences.

06CC is used for the /i/ sound while 06D0 (with two vertical dots below) stands 
for /e/.
According to some sources, the third one (06CD) represents /ej/ and is not 
consistently used for all dialects.

I think the inconsistency you are seeing between 06CC and 06D0 is due to 
carelessness. I seem to remember that the DW site was more consistent in their 
use. They also use 06CD while the BBC does not.

As to the repertoire offered by different keyboard layouts, it's become 
relatively easy to customize any layout.

Kamal



Re: Pashto yeh characters

2010-07-28 Thread Andreas Prilop
On Wed, 28 Jul 2010, lingu...@artstein.org wrote:

 Here's an arbitrary page from today's Al-Ahram newspaper,
 [...]
 On my computer this looks particularly jarring,

You can find enough pages from Continental Europe and Latin
America that have an acute accent instead of an apostrophe
due to ill-designed keyboard layouts.
 http://www.tut.fi/library/dlib/faq.htm



RE: Pashto yeh characters

2010-07-28 Thread Murray Sargent
Andreas Prilop commented "A native speaker of English does not /automatically/
know better about English grammar, English punctuation than an informed
Frenchman." So true, so true. Most native speakers of English have only limited
understanding of English grammar. At least in my country. They regularly
confuse she and her, he and him, adverbs and adjectives, etc. Sigh.

Murray






Re: Pashto yeh characters

2010-07-28 Thread linguist

Quoting CE Whitehead cewcat...@hotmail.com:

'g' is  a non-Arabic sound ... and there is no g in Standard  
Arabic although there are two ways to write it ...


Oh, there are many more than two ways to write the [g] sound in  
Arabic. Standard Arabic traditionally transcribes foreign [g] as ghain  
U+063A, as in Granada. But particular locales have devised their own  
characters:


Morocco: kaf with 3 dots U+0763, as in Agadir:  
http://www.casafree.com/modules/xcgal/albums/userpics/10070/normal_DSCN5410.JPG


Tunisia: qaf with 3 dots U+06A8, as in Gafsa:  
http://i4.photobucket.com/albums/y131/LuXuS3000/Tunisia%20Airliners/Gafsa-Ksar.jpg


Israel: jeem with 3 dots U+0686, as in Giv'at Shemuel:  
http://upload.wikimedia.org/wikipedia/en/6/62/Givat_shmuel_sign.png


Then there are dialects of Arabic that do have the [g] sound -- in
Egypt jeem U+062C is pronounced as [g] (think of Gamal Abdel Nasser),
and in many other places qaf U+0642 is pronounced as [g] (think of
Muammar al-Gaddafi). And that's just Arabic...


Persian and Urdu write [g] using a kaf character with a line above  
U+06AF, while Pashto uses kaf with a ring U+06AB. It really should be  
that simple. You might expect a substitution if someone does not have  
a character in their font or doesn't know how to access it from a  
keyboard. However, I noticed the Persian character alongside the  
Pashto one inside a single Pashto document, and that's just strange.
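
A quick way to spot this kind of mixing in a document is to count the variants. The Python sketch below is only an illustration; the names GAF_VARIANTS, variant_counts and is_mixed are made up, and the same idea works for the yeh variants.

```python
from collections import Counter

# The two [g] letters mentioned above.
GAF_VARIANTS = {
    "\u06AB": "kaf with ring (Pashto)",
    "\u06AF": "gaf (Persian/Urdu)",
}

def variant_counts(text, variants=GAF_VARIANTS):
    """Count how often each variant character occurs in the text."""
    counts = Counter(ch for ch in text if ch in variants)
    return {f"U+{ord(ch):04X}": n for ch, n in counts.items()}

def is_mixed(text, variants=GAF_VARIANTS):
    """True if more than one variant occurs in the same text."""
    return len(variant_counts(text, variants)) > 1
```

Running is_mixed over the text of a page would flag documents like the one described, where both letters occur.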


-Ron.




Re: Pashto yeh characters

2010-07-27 Thread Andreas Prilop
On Thu, 22 Jul 2010, lingu...@artstein.org wrote:

 [...]
 To wrap up, are my observations about the Pashto writing conventions
 correct? And is there a standard for assigning the Pashto characters
 representing /j/ and /i:/ to Unicode code points?

Practical answer:

U+0649 and U+064A are included in MacArabic/MacFarsi and Windows-1256;
but U+06CC is not. Support for 0649 and 064A in fonts is still better
than for 06CC. For example, try the various Arabic fonts in Windows XP:
 http://www.user.uni-hannover.de/nhtcapri/temp/ya.arabic.html

Therefore you should use only U+0649 and U+064A for Arabic, Persian, Urdu
if you want your documents to be displayed on other computers.
I have done so in
 http://www.user.uni-hannover.de/nhtcapri/arabic-alphabet.html
 http://www.user.uni-hannover.de/nhtcapri/persian-alphabet.html
 http://www.user.uni-hannover.de/nhtcapri/mac-urdu-alphabet.html

However, for Pashto you need characters outside Windows-1256 anyway.
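
The legacy-encoding coverage can be checked directly with Python's built-in codecs. This is a minimal sketch: cp1256 and iso8859_6 are Python's codec names for Windows-1256 and ISO-8859-6, the helper encodable is made up, and the Mac encodings are left out.

```python
def encodable(ch, encoding):
    """True if the character can be represented in the given encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# The three yeh variants discussed above, plus three Pashto-specific letters.
for cp in (0x0649, 0x064A, 0x06CC, 0x06CD, 0x06D0, 0x06AB):
    support = {enc: encodable(chr(cp), enc) for enc in ("cp1256", "iso8859_6")}
    print(f"U+{cp:04X}", support)
```

U+0649 and U+064A encode in both, while U+06CC and the Pashto-specific letters do not.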

   * * * * * *

Theoretical answer:

U+0649 has (should have) four glyphs without any dots. This is no
Arabic letter, but an Uighur letter. Therefore you should not use
U+0649 for Arabic, Persian, Pashto, Urdu but only U+06CC.
I have done so in
 http://www.user.uni-hannover.de/nhtcapri/urdu-alphabet.html
 http://www.user.uni-hannover.de/nhtcapri/pashto-alphabet.html

U+0649 has the traditional name alif maqsura because it was
taken from ISO-8859-6. But I see no objection to use U+06CC
for alif maqsura.

You cannot distinguish the initial and middle glyphs of 064A and 06CC.
Use whatever you want. Given the practical answer above, you might
prefer U+064A. But if you don't have U+06CC in your font, you
probably don't have Pashto letters either.



Re: Pashto yeh characters

2010-07-27 Thread David Starner
On Tue, Jul 27, 2010 at 12:43 PM, Andreas Prilop
prilop4...@trashmail.net wrote:
 U+0649 has (should have) four glyphs without any dots. This is no
 Arabic letter, but an Uighur letter. Therefore you should not use
 U+0649 for Arabic, Persian, Pashto, Urdu but only U+06CC.

That's wrong, though. MacArabic, Windows-1256 and ISO-8859-6 are all
standards for the encoding of Arabic. Thus U+0649 must be an Arabic
character; existing use in both those sets and in Unicode says that it is.

-- 
Kie ekzistas vivo, ekzistas espero.



Re: Pashto yeh characters

2010-07-27 Thread Khaled Hosny
On Tue, Jul 27, 2010 at 06:43:19PM +0200, Andreas Prilop wrote:

[...]

 U+0649 has (should have) four glyphs without any dots. This is no
 Arabic letter, but an Uighur letter. Therefore you should not use
 U+0649 for Arabic, Persian, Pashto, Urdu but only U+06CC.

I'm not sure what the basis of this conclusion is, but U+0649 has no
dots in its initial/medial forms in Arabic too; it just happens not to
occur in those two positions in modern orthography, but it can be seen in
the Quran, which is still written in the old, early Islamic orthography.

See the attached image showing the words فسوىهن and ميكىل.

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer
attachment: U+0649.jpg

Re: Pashto yeh characters

2010-07-27 Thread Arno Schmitt
Andreas Prilop:
 U+0649 has the traditional name alif maqsura because it was
 taken from ISO-8859-6. But I see no objection to use U+06CC
 for alif maqsura.

I beg to differ
Since U+0649 is called alif maqsura
it should be used for alif maqsura.

Please note that in the Qur'an
it occurs not only at the end of words.

That two glyphs are the same
does not mean that the letters are the same.
Or do you use small l for capital I
when using Helvetica?





RE: Pashto yeh characters

2010-07-27 Thread CE Whitehead

Hi, Khaled, Arno, Andreas:

 

All the Arabic characters (consonants, hamzas, but not vowel diacritics or
numbers) that I need are between U+0621 (hamza) and U+064A; there are vowel
diacritics that can be used immediately following these and then the Arabic
numbers.  (Would any of these look-alikes be security issues?  Both these
characters are allowed in IDNs; see:

http://unicode.org/reports/tr36/idn-chars.html)

 

Thanks all.

 

Best,

 

C. E. Whitehead

cewcat...@hotmail.com

So I would concur with Khaled and Arno here that U+0649 is Arabic aleph maqsura (


 
 Date: Tue, 27 Jul 2010 20:09:21 +0200
 From: a...@zedat.fu-berlin.de
 To: prilop4...@trashmail.net
 CC: unicode@unicode.org; lingu...@artstein.org
 Subject: Re: Pashto yeh characters
 
 Andreas Prilop:
  U+0649 has the traditional name alif maqsura because it was
  taken from ISO-8859-6. But I see no objection to use U+06CC
  for alif maqsura.
 
 I beg to differ
 Since U+0649 is called alif maqsura
 it should be used for alif maqsura.
 
 Please note that in the Qur'an
 it occurs not only at the end of words.
 
 That two glyphs are the same
 does not mean that the letters are the same.
 Or do you use small l for capital I
 when using Helvetica?
 
 
 
  

Re: Pashto yeh characters

2010-07-27 Thread Christoph Päper
David Starner:
 On Tue, Jul 27, 2010 at 12:43 PM, Andreas Prilop
 [U+0649] is no Arabic letter, but an Uighur letter.
 
 That's wrong, though. […] U+0649 must be an Arabic character;

Andreas probably meant that U+0649 is not part of the Arabic writing system, 
i.e. the Arabic script as used in writing the Arabic language (with some 
recognised orthography).

You probably mean that U+0649 is part of the Arabic script, which it certainly 
is.

No contradiction here, just not a good idea to use ‘Arabic’ as an adjective 
with ‘letter’ or ‘character’, unless you make sure everyone agrees – I would – 
that letters are constituents of writing systems, whereas characters form 
scripts. 

In many places, though, ‘writing system’, ‘script’, ‘orthography’, ‘alphabet’ and 
even ‘language’ tend to be synonyms (and may share a name with people and 
religion, too), as do ‘character’, ‘letter’, ‘glyph’, ‘grapheme’, ‘sign’ and 
‘symbol’. Some scholars like to use (or invent) alternative names to aid the 
distinction, e.g. I’ve seen – I think in one of Coulmas’ books – Latin/Roman 
and – elsewhere – Arabic/Arabetic/Arabian, but that would only really help if 
enough people understood and did it.



Re: Pashto yeh characters

2010-07-27 Thread Mansour, Kamal
Ron, as you've already noticed, there can be multiple conventions for the 
orthography of a single language.

For the Yeh repertoire, typically the following are used:
U+06CC
U+06CD
U+06D0

For a current corpus, have a look at BBC News (http://www.bbc.co.uk/pashto) and 
Deutsche Welle (http://www.dw-world.de/)
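
To see which yeh code points a given page actually uses, a small tally along the following lines may help. This is only a sketch: the function names `count_yehs` and `report` are made up for illustration, the sample string is a placeholder rather than real corpus data, and the list of code points is simply the set discussed in this thread.

```python
import unicodedata
from collections import Counter

# Code points discussed in this thread: the first three are the Pashto
# repertoire listed above, the rest are other yeh-based letters that
# turn up in electronic documents.
YEH_CODE_POINTS = [0x06CC, 0x06CD, 0x06D0, 0x0626, 0x0649, 0x064A]


def count_yehs(text: str) -> Counter:
    """Count occurrences of each yeh-type code point in text."""
    wanted = set(YEH_CODE_POINTS)
    return Counter(ord(ch) for ch in text if ord(ch) in wanted)


def report(text: str) -> None:
    """Print one line per code point with its Unicode name and count."""
    counts = count_yehs(text)
    for cp in YEH_CODE_POINTS:
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {counts[cp]}")


# Placeholder sample, not real corpus data: replace with text copied
# from a page you want to inspect.
sample = "\u06CC\u06D0\u06CC\u064A"
report(sample)
```

A nonzero count for U+0649 or U+064A in Pashto text would suggest the source is mixing conventions, which is the confusion the original query describes.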

Kamal


On 2010.7.22 10:17, lingu...@artstein.org wrote:

Hi,

This is a query I had originally sent to the Linguist List, modified
based on feedback I got there. I am hoping that someone in the Unicode
community can help resolve this.

I'm interested in knowing if there is a standard way to encode the
various Pashto yeh-characters in Unicode, and if so, what it is. This
question is a bit more complicated than it sounds, so here's the
background.

Pashto is written using a derivative of the Arabic script. The Arabic
language uses a single character for both /j/ and /i:/ sounds. Like
many Arabic characters, this one is composed of a base form (which
changes shape based on its position in a word) and dots (in this case,
two dots below the base form). In most of the Arabic-speaking world
the dots are present with both the medial and final form, though in
Egypt (and possibly other places) the convention is to have two dots
on the medial form but leave them off the final form. The standard
arrangement of the two dots is horizontal, but they can be placed
vertically or diagonally with no change in meaning.

Persian also uses a single character for /j/ and /i:/, with the
convention of two dots on the medial form, no dots on the final form
(same as in Egypt).

The two conventions for the /j/-/i:/ character were given distinct
code points in unicode despite the fact that they do not contrast;
documentation is scarce, but presumably this was done in order to
allow writing both Arabic and Persian in the same document. Therefore,
Unicode has the following code points (I'm not giving the names, but
rather the typical visual representation of the glyphs and typical use).

U+064A two dots medially and finally (/j/-/i:/ Arabic convention)
U+06CC two dots medially, none finally (/j/-/i:/ Persian convention)

There are a few additional yeh-base code points defined, some of which
are relevant to Pashto (see below).

U+0649 no dots medially or finally (Arabic /a/ from etymological /j/)
U+0626 hamza above medially and finally (Arabic glottal stop in
certain contexts)
U+06D0 two dots medially and finally in vertical arrangement
U+06CD tail and no dots in final position

As it so happens, there is much confusion in how these characters are
used in actual electronic documents, which is not surprising given
that U+06CC looks like U+064A in medial position but like U+0649 in
final position. There is an excellent article by Jonathan Kew that
sorts out what this means for various languages that use derivatives
of the Arabic script.

http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=arabicletterusagenotes&filename=ArabicLetterUsageNotes.pdf

Unfortunately, this article does not discuss Pashto. I have little
knowledge of the language, but here's what I managed to understand
from the inspection of a few documents and with the help of friendly
people on the Linguist List (and please correct me if I'm wrong).

Traditionally, Pashto used a single character with the same convention
as in Persian, of two dots in the medial form and none on the final
form, and with no significance attached to the visual arrangement of
the dots. The character was 3-ways ambiguous between the sounds /j/,
/i:/ and /e/. In recent decades (probably since the 1970s or 1980s)
there has been some differentiation, partly due to changes in the
typesetting process and partly due to a deliberate effort of the
Pashto Academy at the University of Peshawar, Pakistan.

One convention that has gained fairly wide acceptance is a distinction
between a horizontal arrangement of the dots, representing /j/ or /i:/
as in Arabic and Persian, and a vertical arrangement representing the
sound /e/. This distinction is the same as in Uighur, and the
character with vertical dots has been codified as U+06D0. Additional
conventions include a hamza (U+0626) or tail (U+06CD) to represent /j/
at the end of a word in certain grammatical markers. All of these are
quite standard by now and do not pose much of a problem.

However, a further convention appears to have arisen, which as far as
I can tell is unique to Pashto in that it distinguishes between /j/
and /i:/ (though only in word-final position):

/j/ is written with two dots medially, none finally
/i:/ is written with two dots both medially and finally

I have never seen this codified 

Re: Pashto yeh characters

2010-07-27 Thread David Starner
On Tue, Jul 27, 2010 at 5:07 PM, Christoph Päper
christoph.pae...@crissov.de wrote:
 David Starner:
 On Tue, Jul 27, 2010 at 12:43 PM, Andreas Prilop
 [U+0649] is no Arabic letter, but an Uighur letter.

 That's wrong, though. […] U+0649 must be an Arabic character;

 Andreas probably meant that U+0649 is not part of the Arabic writing system, 
 i.e. the Arabic script as used in writing the Arabic language (with some 
 recognised orthography).

 You probably mean that U+0649 is part of the Arabic script, which it 
 certainly is.

No, what I meant was that MacArabic, Windows-1256 and ISO-8859-6 are
designed to write the Arabic language. If U+0649 is in these character
sets, to say that it's really a Uighur character is like saying that
U+0041 is really a Greek character; it spits in the face of how the
character has been used and how fonts have been designed for the
character.

-- 
Kie ekzistas vivo, ekzistas espero.