Dear Doug Ewell, William Overington, James E. Agenbroad, and Maurice Bauhahn,
Thank you all for the replies. May I take U+0B85 as the official code point?

Some explanation of the need for a visible "a". In Tamil:

a/ The dependent vowels "ai" and "au" have ligatures. In fact "au" and "ou" at present use the same ligature. (Additionally, "ai" and "au" are expected to be introduced in a form that is not a ligature.)
b/ Diphthongs(?) such as ae and ao need to be represented linearly (without ligatures).
d/ The use of a visible "a" with consonants is a necessity for educational purposes.
e/ A design plan needs to be implemented, anticipating the possible use of a visible "a" instead of the inherent "a" in the distant future.

Regards
Sinnathurai Srivas

>>

Avarangal

I'm not sure why you want a character for an "inherent a" which, in Indic scripts, "exists" in any consonant unmarked by a vowel sign or virama - perhaps you could describe your application.

You could use 17B4 in the Khmer block. Since Khmer is also an Indic script, this character essentially has the properties you're looking for - though it's in a different block.

Another idea might be to use 0B85 (TAMIL LETTER A) + 093C (NUKTA) or maybe 0B85 + 0F9D (VIRAMA). I don't know Tamil but I think these combinations would not normally occur.

There was once a proposal to encode an "inherent a" or "root marker" at 0F70 in the Tibetan block, as some people thought this was necessary because Tibetan syllables often contain silent prefixes and suffixes, while the primary collation is on the main letter in a syllable (which may be the second or third character in the string). In Tibetan the first consonant (or consonant stack) marked by a vowel sign is the root of a syllable. Where there is no vowel sign (i.e. an "inherent a") there is no "flag" to indicate the root consonant, so some thought it would simplify processing to have one.
A problem with this is that there would be no visible glyph for such a character, and if the consonant marked by such an invisible character was deleted, the inherent a character might get left behind and consequently flag an adjoining character where it might not be wanted. Also, since such a character is not necessary to display Tibetan properly, chances are you'd wind up with some people/applications making use of this character and others not using it - so you'd get two different strings for the same word.

In the case of Tibetan, the root consonant in a syllable or word can be determined by rules or a lookup - and in the end it was thought better to leave it to applications to determine unmarked root consonants when they needed to, rather than having an inherent a character to mark them (which in any case would require a rule-based system or lookup to insert reliably - unless you leave it to users to type in).

IMO in general use such a character would probably cause more problems than it solved - though it might sometimes be useful in private data.

- Chris

>>>

>
> While we're waiting for someone with better knowledge of Indic scripts to reply...
>
> 1. An *inherent* A wouldn't have its own code point, would it? I don't think of it as having an existence outside of the consonant it goes with. Tamil KA is U+0B95, which represents K plus the inherent A. If you wanted to represent only the K, you would use U+0B95 plus the Tamil virama, U+0BCD, to kill the A. But how could you represent an inherent vowel by itself?
>
> 2. Assuming you have an answer to #1 above, the only way "you" could allocate a Unicode code point for this character would be to use the Private Use Area. You could choose any code point from U+E000 to U+F8FF for this purpose. (There are unofficial assignments for some of these, but you are perfectly free to ignore them.)
> Do *not* assign a code point in the Tamil block, or anywhere else except the Private Use Area, even if it's only for temporary and internal use. To do so would be very non-conformant.
>
> -Doug Ewell
> Fullerton, California
>

Monday, April 1, 2002

There is always 0B85 for this vowel when it is not "inhering" to a consonant.

Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )
"It is not true that people stop pursuing their dreams because they grow old, they grow old because they stop pursuing their dreams." Adapted from a letter by Gabriel Garcia Marquez.
The above are purely personal opinions, not necessarily the official views of any government or any agency of any.
Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.

>>>>

I have no knowledge of the Tamil language and I am neither a member of the Unicode Consortium nor a representative of the Unicode Consortium.

However, as a specific suggestion is sought for a temporary code point to use for "inherent a", then as an individual end user of the Unicode system, I suggest the following code point within the Private Use Area.

U+E7C0

This will not provide any exclusivity of definition for this code point. However, it is in the lower part of the Private Use Area and is therefore less likely to clash with code points in the Private Use Area used internally within commercially available software, which uses tend to be in the upper part of the Private Use Area in accordance with guidance notes in the Unicode specification.
The fact that a specific code point is being suggested within this forum may possibly also mean that various people will make a note of its use in their own lists of characters, so that, although its use will not be an official Unicode Consortium allocation, people interested in the use of the Private Use Area may well make a note of the usage.

My reason for suggesting this particular code point is that I am producing some code points for research, and hopefully application, and have a block of special characters from U+E700 through to U+E7FF, including U+E707 for a ct ligature. I am looking at including a set of long s ligatures such as LONG S B and LONG S L and so on. I have not yet finalized the codes, yet they will be above U+E707, as I am not using U+E700 through to U+E706 at all, so that this section of ct and long s characters dovetails with the Alphabetic Presentation Forms, U+FB00 through to U+FB06, in the hope that the ct and long s characters that I suggest might one day be promoted to the Alphabetic Presentation Forms section.

The upper part of the 256 code point block from U+E700 through to U+E7FF is presently unused in my use of the Private Use Area, and so a section from U+E7C0 through to U+E7FF would seem a good place to have a section for various code points used for research.

I too am interested in how the inherent "a" character would be used. Does it have its own glyph or is it a code that modifies something else, or what?

William Overington
1 April 2002
www.users.globalnet.co.uk/~ngo

-----Original Message-----
From: Avarangal <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: Friday, March 29, 2002 9:33 PM
Subject: Inherant "a"

>I need to allocate a U+codepoint for inherent "a", to be used for Tamil research. Can anyone suggest a temporary location or is it possible to find such code point within the existing code point for Tamil.

>>>>>>

Why do you need to have a code for 'inherent a' in Tamil?
There is some imprecision concerning what constitutes an 'inherent' vowel. In this note I am referring to normally unwritten vowels that are nevertheless pronounced. I know nothing about Tamil, but in Khmer Unicode there are two such inherent "a" characters: a long inherent (native Khmer language) at U+17B5 and a short inherent (Sanskrit/Pali) at U+17B4. Their encoding has raised some outcry (in fact some parties are trying to deprecate them), but the more I analyse grammars, dictionaries, and round-trip transliteration, the more importance they assume.

(1) If you look at a dependent vowel series in an Indic script...they often start with an unwritten 'inherent a' character, recognising their unique existence.
(2) If you transliterate between an Indic script and a Latin [or other phonetic] transliteration, the inherent vowel must become explicit in the transliteration (hence it would be extremely useful for round-trip conversion reasons to have a code in the Indic encoding to match that). Dependable round-trip conversion of text is becoming increasingly important when a single minority language spans national borders, where government authorities on opposite sides of the border insist the 'national' script of their respective country be used to render that language.
(3) Not every consonant cluster that lacks an explicit dependent vowel also contains an 'inherent a' (in particular in Khmer it is unpredictable from the context [i.e., without a lookup] whether a final consonant cluster without a dependent vowel has a pronounced inherent or not).
(4) Non-final clusters lacking an explicit dependent vowel 'always' (a dangerous word to use!) have an 'inherent a', possibly short or long.
(5) Depending on the foreignness of the word, an 'inherent a' in Khmer may be short (foreign) or long (Khmer language).
(6) Dictionaries have to make the short 'inherent a' vowel explicit in their pronouncing sections (usually borrowing U+17C8 to display it; however you would not want to raise ambiguity by using that code both when it is normally displayed and when it is there for making pronunciation clear).
(7) For phonetic rendering of an Indic script, therefore, it would be very useful to selectively encode it. In the future data input and output will increasingly move to verbal/aural, rather than keyboard, means. This would be quite an exciting development for Khmer...because Khmer is difficult to keyboard and presumably relatively easy for a computer to recognise (what with about fifty vowel/vowel-sign combinations that are easier for computers to recognise than consonants). Hence, I would assume that codes to capture verbal data converted to Unicode text will similarly become increasingly important.
(8) 'Inherent a' is often used in combination with vowel-like signs such as U+17C6 NIKAHIT, U+17C7 REAHMUK, U+17C8 YUUKALEAPINTU to generate vowels with consonantal final sounds. Failure to recognise the 'inherent a' results in wrongly interpreting those consonant-like signs as vowels. These vowel+sign ligatures are in fact treated like unique vowels in sorting.

There are arguments against using 'inherent' vowels.

(a) Unwritten characters tend to not be typed! And if they were, the data stream length would grow remarkably.
(b) Binary comparison of words with and words without 'inherent' vowels would be problematic.
(c) The average user would probably not gain advantage from the inclusion of 'inherent' vowels in the text stream.
(d) I could not find more than one instance in the authoritative Chuon Nath Khmer dictionary where two words otherwise spelled the same were distinguished by the length of their inherent vowels.
It is hard to write a sorting rule on one data point ;-)
(e) Rendering mechanisms may not recognise the (rarely used) inherent code and cause problems when it is used.

Hence, it would be preferred that the use of inherent vowels be sharply circumscribed...but not eliminated altogether.

In summary, inherent vowels:

(1) Are characters in their own right
(2) Are needed for round trip script conversion (transliteration)
(3) Are not a trivial case: they are not contained in every consonant cluster even when that cluster does not contain a visual dependent vowel
(4) Are useful for preserving phonetic value in dictionaries or text-to-speech applications

Interested,
Maurice Bauhahn

-----Original Message-----
From:
Sent: 29 March 2002 19:38
Subject: Inherant "a"

I need to allocate a U+codepoint for inherent "a", to be used for Tamil research. Can anyone suggest a temporary location or is it possible to find such code point within the existing code point for Tamil.
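
As a rough illustration of two points raised in this thread (not code from any of the posters), the Python sketch below shows how a transliterator has to make the inherent "a" explicit when no vowel sign or virama follows a consonant, and how one might check that a candidate code point such as U+E7C0 really lies in the Private Use Area. The tables cover only a handful of Tamil letters, the romanisation is an arbitrary assumption, and the function names `to_latin` and `is_private_use` are hypothetical.

```python
import unicodedata

VIRAMA = "\u0bcd"    # TAMIL SIGN VIRAMA: suppresses the inherent "a"
LETTER_A = "\u0b85"  # TAMIL LETTER A: the independent, visible vowel

# Tiny illustrative tables (assumption): not a complete romanisation scheme.
CONSONANTS = {"\u0b95": "k", "\u0bae": "m", "\u0bb2": "l"}
VOWEL_SIGNS = {"\u0bbe": "aa", "\u0bbf": "i"}
INDEPENDENT_VOWELS = {LETTER_A: "a"}


def is_private_use(ch):
    """True if ch has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"


def to_latin(text):
    """Romanise text, writing out the inherent 'a' that Tamil leaves unwritten."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt == VIRAMA:
                i += 1  # virama kills the vowel: bare consonant
            elif nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])
                i += 1  # explicit dependent vowel replaces the inherent one
            else:
                out.append("a")  # nothing follows: the inherent "a" is pronounced
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        else:
            out.append(ch)  # pass through anything not in the tables
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_latin("\u0b95"))           # KA alone
    print(to_latin("\u0b95\u0bcd"))     # KA + virama
    print(is_private_use("\ue7c0"))     # candidate code point from the thread
```

Going the other way (Latin back to Tamil) is where the lack of an encoded inherent vowel shows up: the explicit "a" in the Latin text has nothing to map to after a consonant, so the converter has to drop it by rule rather than by a one-to-one mapping. This sketch does not attempt that direction.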