This is a rather late reply, but I think this document should be useful: http://www.evertype.com/standards/af/af-locales.pdf
The first few pages discuss the various Yeh forms, recommend which ones to use, and advise avoiding some of them in certain forms.

Roozbeh

On Thu, 2010-07-22 at 12:17 -0500, lingu...@artstein.org wrote:
> Hi,
>
> This is a query I had originally sent to the Linguist List, modified
> based on feedback I got there. I am hoping that someone in the Unicode
> community can help resolve this.
>
> I'm interested in knowing if there is a standard way to encode the
> various Pashto yeh-characters in Unicode, and if so, what it is. This
> question is a bit more complicated than it sounds, so here's the
> background.
>
> Pashto is written using a derivative of the Arabic script. The Arabic
> language uses a single character for both /j/ and /i:/ sounds. Like
> many Arabic characters, this one is composed of a base form (which
> changes shape based on its position in a word) and dots (in this case,
> two dots below the base form). In most of the Arabic-speaking world
> the dots are present with both the medial and final form, though in
> Egypt (and possibly other places) the convention is to have two dots
> on the medial form but leave them off the final form. The standard
> arrangement of the two dots is horizontal, but they can be placed
> vertically or diagonally with no change in meaning.
>
> Persian also uses a single character for /j/ and /i:/, with the
> convention of two dots on the medial form, no dots on the final form
> (same as in Egypt).
>
> The two conventions for the /j/-/i:/ character were given distinct
> code points in Unicode despite the fact that they do not contrast;
> documentation is scarce, but presumably this was done in order to
> allow writing both Arabic and Persian in the same document. Therefore,
> Unicode has the following code points (I'm not giving the names, but
> rather the typical visual representation of the glyphs and typical use).
>
> U+064A two dots medially and finally (/j/-/i:/ Arabic convention)
> U+06CC two dots medially, none finally (/j/-/i:/ Persian convention)
>
> There are a few additional yeh-base code points defined, some of which
> are relevant to Pashto (see below).
>
> U+0649 no dots medially or finally (Arabic /a/ from etymological /j/)
> U+0626 hamza above medially and finally (Arabic glottal stop in
>        certain contexts)
> U+06D0 two dots medially and finally in vertical arrangement
> U+06CD tail and no dots in final position
>
> As it so happens, there is much confusion in how these characters are
> used in actual electronic documents, which is not surprising given
> that U+06CC looks like U+064A in medial position but like U+0649 in
> final position. There is an excellent article by Jonathan Kew that
> sorts out what this means for various languages that use derivatives
> of the Arabic script.
>
> http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi&format=file&media_id=arabicletterusagenotes&filename=ArabicLetterUsageNotes.pdf
>
> Unfortunately, this article does not discuss Pashto. I have little
> knowledge of the language, but here's what I managed to understand
> from the inspection of a few documents and with the help of friendly
> people on the Linguist List (and please correct me if I'm wrong).
>
> Traditionally, Pashto used a single character with the same convention
> as in Persian, of two dots in the medial form and none on the final
> form, and with no significance attached to the visual arrangement of
> the dots. The character was 3-ways ambiguous between the sounds /j/,
> /i:/ and /e/. In recent decades (probably since the 1970s or 1980s)
> there has been some differentiation, partly due to changes in the
> typesetting process and partly due to a deliberate effort of the
> Pashto Academy at the University of Peshawar, Pakistan.
>
> One convention that has gained fairly wide acceptance is a distinction
> between a horizontal arrangement of the dots, representing /j/ or /i:/
> as in Arabic and Persian, and a vertical arrangement representing the
> sound /e/. This distinction is the same as in Uighur, and the
> character with vertical dots has been codified as U+06D0. Additional
> conventions include a hamza (U+0626) or tail (U+06CD) to represent /j/
> at the end of a word in certain grammatical markers. All of these are
> quite standard by now and do not pose much of a problem.
>
> However, a further convention appears to have arisen, which as far as
> I can tell is unique to Pashto in that it distinguishes between /j/
> and /i:/ (though only in word-final position):
>
> /j/ is written with two dots medially, none finally
> /i:/ is written with two dots both medially and finally
>
> I have never seen this codified explicitly, but this is the impression
> I get from examining a few recent Pashto documents. Which brings me to
> my original question, of how to represent these characters in Unicode.
> The linguist in me notices a correspondence between sounds and Unicode
> code points (which, given the history I have just described, is most
> certainly accidental):
>
> /j/ corresponds to U+06CC
> /i:/ corresponds to U+064A
>
> The Wikipedia article on the Pashto alphabet
> http://en.wikipedia.org/wiki/Pashto_alphabet gives a different
> correspondence, based on visual appearance:
>
> forms with dots: U+064A (/i:/ and /j/ medially, /i:/ finally)
> forms without dots: U+0649 (only /j/ in word-final position)
>
> And there is yet a third convention, which I encountered in an
> electronic lexicon and also appears in the following document:
> http://www.afghanan.net/pashto/pashto%20alifba.pdf
>
> U+06CC: medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
> U+064A: final form with dots (/i:/)
>
> To wrap up, are my observations about the Pashto writing conventions
> correct? And is there a standard for assigning the Pashto characters
> representing /j/ and /i:/ to Unicode code points?
>
> -Ron.
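
For anyone who wants to check which of these conventions a given document follows, here is a minimal Python sketch using only the standard unicodedata module. The convention labels ("phonetic", "visual", "lexicon"), the helper names, and the word-final test (no letter follows) are assumptions made for illustration and are not part of any standard. The tables only restate the three correspondences described in the quoted message; applying the "phonetic" one to non-final positions as well is also an assumption.

```python
import unicodedata
from collections import Counter

# Yeh-like code points listed in the quoted message.
YEH_CODE_POINTS = (0x064A, 0x06CC, 0x0649, 0x0626, 0x06D0, 0x06CD)

# Three candidate encodings for Pashto /j/ and /i:/, keyed by
# (sound, is_word_final). The labels are hypothetical names.
CONVENTIONS = {
    # Sound-based: /j/ -> U+06CC, /i:/ -> U+064A (assumed for all positions).
    "phonetic": {("j", False): 0x06CC, ("j", True): 0x06CC,
                 ("i:", False): 0x064A, ("i:", True): 0x064A},
    # Appearance-based (Wikipedia): dotted forms U+064A, dotless final U+0649.
    "visual": {("j", False): 0x064A, ("j", True): 0x0649,
               ("i:", False): 0x064A, ("i:", True): 0x064A},
    # Lexicon: U+06CC everywhere except the dotted final form, U+064A.
    "lexicon": {("j", False): 0x06CC, ("j", True): 0x06CC,
                ("i:", False): 0x06CC, ("i:", True): 0x064A},
}


def describe(cp):
    """Return 'U+XXXX NAME' for a code point, using the Unicode name."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"


def yeh_profile(text):
    """Count yeh-like letters in text, split into final and non-final uses.

    Simplification: a letter counts as word-final when the next character
    is not a letter; combining marks after the yeh are not skipped.
    """
    counts = Counter()
    for i, ch in enumerate(text):
        if ord(ch) not in YEH_CODE_POINTS:
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        is_final = not (nxt and unicodedata.category(nxt).startswith("L"))
        counts[(f"U+{ord(ch):04X}", "final" if is_final else "non-final")] += 1
    return counts


if __name__ == "__main__":
    for cp in YEH_CODE_POINTS:
        print(describe(cp))
```

Running yeh_profile over a Pashto text gives a rough picture: many word-final U+0649 would suggest the appearance-based encoding, while U+06CC in non-final position alongside word-final U+064A would suggest the lexicon-style one. This is only a diagnostic heuristic, not a way to decide which encoding is correct.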