Re: compatibility between unicode 2.0 and 3.0
--- Kenneth Whistler [EMAIL PROTECTED] wrote:

> This depends greatly on what implementation you did for sorting and searching, and how it handles unassigned code points in your Unicode 2.0 code. If the code was designed to be forward compatible, it should do reasonable things with unassigned code points, and getting Unicode 3.0 data which is actually using those code points should not disturb your existing code. But, on the other hand, if you have built in a bunch of range checks or have used tables which cannot gracefully handle the appearance of unassigned code points in your data, then it could well blow up.

Can you please explain the best practice for handling unassigned code points so that applications can easily become forward compatible? If we just ignore unassigned code points, will that make it easier for an application to migrate to a later version of Unicode?

- Keyur

__
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com
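As a rough illustration of the forward-compatible behaviour Kenneth Whistler describes, the sketch below builds a sort key from a weight table and gives any code point missing from that table a fallback weight instead of failing. The table `KNOWN_WEIGHTS` and the function `sort_key` are hypothetical names invented for this example; real collation is considerably more involved.

```python
# Hypothetical weight table built from an older version of the character data.
KNOWN_WEIGHTS = {"a": 1, "b": 2, "c": 3}


def sort_key(text):
    """Return a sort key that tolerates code points missing from the table."""
    key = []
    for ch in text:
        if ch in KNOWN_WEIGHTS:
            # Known character: use its tabulated weight.
            key.append((0, KNOWN_WEIGHTS[ch]))
        else:
            # Unknown (possibly newly assigned) code point: do not raise.
            # Sort it after all known characters, in code point order.
            key.append((1, ord(ch)))
    return key


words = ["b\u0220", "c", "a", "b"]
print(sorted(words, key=sort_key))
```

The point of the design is that no range check rejects a character the table has never seen, so data using newly assigned code points still sorts deterministically.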
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson [EMAIL PROTECTED] wrote:

> Without that dotted circle appearing, the e-matra would appear to have been properly encoded,

> No, with proper reordering (and normal display mode), the e-matra at the beginning of the second word would appear to be the last glyph of the first word. Similarly, for the second case, the e-matra glyph would have come to the left of the pa. The fluent reader (ok, not me...) would then see those errors anyway, just like I can find spelling errors in Swedish, most often without any kind of special marking. (I'm assuming throughout that reordrant combining characters are reordered.)

Illegal sequences are not reordered as you indicated. Also, as far as I know, the Unicode standard makes no mention of reordering an illegal input sequence (or an invalid combining mark). Consider the last set of glyphs (left-to-right, top-to-bottom) in the attached image. It shows the rendering of the illegal input sequence Devanagari Vowel Sign I [U+093F] + Devanagari Letter Ka [U+0915], without any dotted circle. As you may know, the correct input sequence is U+0915 followed by U+093F. In that case the result would have been similar to what appears right now. (A more sophisticated font/application may, however, want to substitute the glyph for U+093F with another glyph that has a proper attachment point.) Without the dotted circle, there is no way for the user to identify this illegal input sequence. In the worst case, the rendered glyph is even attached to a character from a class (for example, the consonant cluster Ka Virama Ma) that the glyph was designed to render with. In such a case even a fluent reader cannot identify the error.

> There are spelling errors, yes. But there are other ways of indicating spelling errors, that are (by now) fairly conventional for any language (as long as there is an appropriate dictionary installed), and that also are more general (in catching more spelling errors) and less obtrusive (the author really wants to write it that way, for some reason).

> Apparently, Michka used a non-OpenType Bengali Unicode font when he embedded the fonts into the page. As long as you are looking at the page on-line, with the embedded fonts, these errors are invisible.

> It may be typographically horrible.

> It *should* be typographically horrible in order to illustrate bad sequences clearly.

> I'd prefer little red wiggly lines under the word, or yellow background or some such (just for screen display, not for printing; screen grabs not counted). And that for any spelling error.

Spelling mistakes fall into two classes. The first arises from an illegal input sequence (e.g., Vowel Sign E as the first character in a word); the second is a legal input sequence that has no contextual meaning in the dictionary. While the second type of mistake is generally flagged only in sophisticated applications such as word processors, everyone wants to know about the first kind. With your explanation, it seems that even a plain text editor is of no use in identifying such common typing mistakes!

- Keyur

inline: img1.jpg
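The first class of mistake, an illegal input sequence, can be detected without a dictionary. The sketch below flags Devanagari dependent vowel signs that are not preceded by a consonant. It is a deliberate simplification: the function name `find_orphan_matras` is made up for this example, only the consonant and vowel-sign ranges of the Devanagari block are considered, and a real syllable validator would have to handle many more cases.

```python
NUKTA = "\u093C"  # DEVANAGARI SIGN NUKTA


def is_consonant(ch):
    # KA..HA, plus the precomposed nukta consonants QA..YYA.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def is_vowel_sign(ch):
    # Dependent vowel signs (matras) AA..AU.
    return "\u093E" <= ch <= "\u094C"


def find_orphan_matras(text):
    """Return the indexes of vowel signs that lack a consonant base."""
    orphans = []
    for i, ch in enumerate(text):
        if not is_vowel_sign(ch):
            continue
        j = i - 1
        # A nukta may sit between the consonant and the vowel sign.
        if j >= 0 and text[j] == NUKTA:
            j -= 1
        if j < 0 or not is_consonant(text[j]):
            orphans.append(i)
    return orphans


print(find_orphan_matras("\u093F\u0915"))  # Vowel Sign I typed before KA
print(find_orphan_matras("\u0915\u093F"))  # KA followed by Vowel Sign I
```

An editor could use such a check to decide where to draw a marker, whether that marker is a dotted circle or some other highlight.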
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson [EMAIL PROTECTED] wrote:

> No fallback rendering is coming into picture with your explanation.

> Yes, there is. A character sequence FULL STOP, VOWEL SIGN E (say) is very unlikely to have a ligature, specially adapted (and fitting) adjustment points, or similar. The rendering would in that sense need to use a fallback mechanism that renders an approximation for this rare combination.

Do you mean to say that an application's fallback mechanism has to handle the combination of every other Unicode character with each combining mark in order to produce such an approximation? Can you count the number of combinations, which may run into millions!?

- Keyur
Re: Indic Devanagari Query
Hi Aditya,

--- Aditya Gokhale [EMAIL PROTECTED] wrote:

> I had a few queries regarding the representation of the Devanagari script in Unicode (Code page - 0x0900 - 0x097F). Devanagari is a writing script used for the Hindi, Marathi and Sanskrit languages. I have the following questions -

> In the same script code page, how do I use these two different glyphs to represent the same character? Is there any way by which I can do it in an Open type font and Free type font implementation?

Yes, it is certainly possible with an OpenType font. Please note that FreeType is not a font format; it is a rendering library used to rasterize different kinds of fonts, including TrueType and OpenType fonts. In an OpenType font, you can include all the glyphs with alternate shapes and then select one of them depending on the script and language. The application should specify the script and language tags when sending character codes to the OpenType rendering library/engine. All substitutions then take place according to the language and/or script selection. There should be a default script in the font. Similarly, each script has a default language, which is used as a fallback if the application does not specify which language to use for processing. From the list of alternate glyphs, you may want to use the glyph for the default language as the entry in the cmap table. This default glyph can then be substituted by an alternate glyph depending on the language specified. You have to use the GSUB table and write language-dependent lookups for the substitution.

> 2. Implementation Query - In an implementation where I need to send / process Hindi, Marathi and Sanskrit data, how do I differentiate between languages (Hindi, Marathi and Sanskrit)? Say, for example, I am writing a translation engine, and I want to translate a document having Hindi, Marathi and Sanskrit text in it; how do I know from the code points between 0x0900 and 0x097F that the data under perusal is Hindi / Marathi / Sanskrit?
Unicode is not divided into code pages. Unlike some older encodings, there is only one code page for the entire Unicode standard. However, for better readability and quick reference, the code chart has been divided into sections, which you might interpret as code pages.

> I would suggest that we should give different code pages for Marathi, Hindi and Sanskrit. Maybe the current code page of Devanagari can be treated as Hindi, and two new code pages for Marathi and Sanskrit be added. This could solve these issues. If there is any better way of solving this, anyone please suggest.

Unicode gives code points to scripts only, not to languages. In fact, it is not desirable to give code points to individual languages that share the same script. Also, Unicode encodes characters, which have abstract meaning and properties. Unicode does not encode glyphs. The glyph shapes shown in the Unicode charts are given just for convenience and do not prescribe the shapes to be used in a font. The shape of the glyph for a Unicode character may vary from one font to another. Since it is already possible to select the proper glyph(s) depending on the language selection, this scheme is suitable for all Indian languages.

> 3. Character codes for jna, shra, ksh - In Sanskrit and Marathi, jna, shra and ksh are considered separate characters and not ligatures. How do we take care of this? Can I get overall views on the matter from the group? In my opinion they should be given different code points in the specific language code page. Please find below the character glyphs - jna shra ksh

All of the above can be composed through the following consonant clusters:

  jna  - ja halant nya
  shra - sha halant ra
  ksh  - ka halant ssha

The point that the above sequences are considered characters in some of the Indian languages has merit. If there is demand from native speakers, a proposal can be submitted to Unicode. There is a predefined procedure for proposal submission.
Once this has been discussed with the people concerned and agreed upon, these ligatures can be added to the Devanagari script itself, because the Devanagari script represents all three languages you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile, you can write rules for composing them from the consonant clusters.

Regards,
Keyur
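For reference, the three clusters listed in this message can be written out as code point sequences. This is only an illustration; the dictionary name `CLUSTERS` and the romanized keys are choices made for this example.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)

# Each cluster is a plain sequence: consonant + virama + consonant.
CLUSTERS = {
    "jna": "\u091C" + VIRAMA + "\u091E",   # JA + virama + NYA
    "shra": "\u0936" + VIRAMA + "\u0930",  # SHA + virama + RA
    "ksh": "\u0915" + VIRAMA + "\u0937",   # KA + virama + SSA
}

for name, seq in CLUSTERS.items():
    # Print each cluster as a list of U+XXXX code points.
    print(name, " ".join(f"U+{ord(ch):04X}" for ch in seq))
```

Whether the result is displayed as a single conjunct shape depends on the font and rendering engine, not on the encoding.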
Re: Indic Devanagari Query
Hi,

I forgot to reply to the implementation query. The reply is inline.

--- Aditya Gokhale [EMAIL PROTECTED] wrote:

> 2. Implementation Query - In an implementation where I need to send / process Hindi, Marathi and Sanskrit data, how do I differentiate between languages (Hindi, Marathi and Sanskrit)? Say, for example, I am writing a translation engine, and I want to translate a document having Hindi, Marathi and Sanskrit text in it; how do I know from the code points between 0x0900 and 0x097F that the data under perusal is Hindi / Marathi / Sanskrit?

> I would suggest that we should give different code pages for Marathi, Hindi and Sanskrit. Maybe the current code page of Devanagari can be treated as Hindi, and two new code pages for Marathi and Sanskrit be added. This could solve these issues. If there is any better way of solving this, anyone please suggest.

Instead of changing (or recommending a change to) an encoding standard, your problem can best be solved in your application. You can use tags in your text to specify the language. Unicode also provides a facility for tagging text, but its use is strongly discouraged. So you can use a markup language similar to XML or HTML to mark language boundaries. Then parse your text, identify the language boundaries, and do further processing depending on the language.

If you don't want to use tags in your text, you can predict the language with a heuristic based on properties that differ among the three languages. In this case your processing is divided into two phases: the first applies the heuristic rules to identify language boundaries in the plain text, and the second actually processes the text for translation. But beware that such heuristic processing will not be accurate all the time. Hence the use of tags is recommended.

Regards,
Keyur
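A minimal sketch of the tagging approach follows. The markup (a `doc` root containing `span` elements with a `lang` attribute) is a hypothetical format invented for this example, not something prescribed in the thread, and the sample words are only placeholders. The parser is Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each run of text is wrapped in a span with a language tag.
MARKUP = (
    "<doc>"
    '<span lang="hi">\u0928\u092E\u0938\u094D\u0924\u0947</span>'
    '<span lang="mr">\u0928\u092E\u0938\u094D\u0915\u093E\u0930</span>'
    "</doc>"
)


def language_runs(markup):
    """Return (language, text) pairs, one per tagged run, in document order."""
    root = ET.fromstring(markup)
    return [(span.get("lang"), span.text or "") for span in root.iter("span")]


for lang, text in language_runs(MARKUP):
    # A translation engine would dispatch on `lang` here.
    print(lang, len(text))
```

With explicit tags the language boundaries are read directly from the markup, so no guessing from the code points is needed.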
Re: Indic Devanagari Query
--- Asmus Freytag [EMAIL PROTECTED] wrote:

> All of the above can be composed through following consonant clusters: jna - ja halant nya, shra - sha halant ra, ksh - ka halant ssha. The point that the above sequences are considered as characters in some of the Indian languages has merit. If there is demand from native speakers then a proposal can be submitted to Unicode. There is a predefined procedure for proposal submission. Once this is discussed with concerned people and agreed upon then these ligatures can be added in Devanagari script itself because Devanagari script represents all three languages you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write rules for composing them from the consonant clusters.

> I wouldn't go so far. The fact that clusters belong together is something that can be handled by the software. Collation and other data processing needs to deal with such issues already for many other languages. See http://www.unicode.org/reports/tr10 on the collation algorithm.

I beg to differ with you on this point. Merely having some provision for composing a character doesn't mean that the character is not a candidate for inclusion as a separate code point. India is a big country, with millions of people who are geographically divided and speak a variety of languages. Sentiments are attached to cultures, which may vary from one geographical area to another. So when one of the many languages sharing a script dominates the entire encoding for that script, other groups of people may feel that their language has not been represented properly in the encoding. While Unicode encodes scripts only, the aim was to provide sufficient representation to as many languages as possible.

In Unicode, many characters have been given code points regardless of the fact that the same character could have been rendered through some composition mechanism. This includes Indic scripts as well as other scripts.
For example, in the Devanagari script some code points are allocated to characters (Consonant+Nukta) even though the same characters could be produced with a combination of the consonant and Nukta. Similarly, in the Latin-1 range [U+0080-U+00FF] there are a few characters which can be produced otherwise. That is why text should be normalized to either the pre-composed or the de-composed character sequence before further processing in operations like searching and sorting.

Also, the processing of text often depends on the smallest addressable unit of the language. Again, as discussed in earlier e-mails, this may vary from one language to another within the same script. Consider a case where a language processor/application wants to count the number of characters in some text in order to find the number of keystrokes required to input the text. Further assume that the API functions used for this purpose are based on either WChar (wide characters) or UTF-8. In this case it is necessary to assign the character, say Kssha, to the class consonant. Since assignment to the class consonant applies to a single code point (the smallest addressable unit) and not to a sequence of codes, it is necessary to have a single code point for the character Kssha.

This is my understanding. Please enlighten me if I am wrong.

Regards,
Keyur
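The normalization step mentioned above can be sketched with Python's standard `unicodedata` module. The helper name `same_text` is made up for this example; it compares two strings after normalizing both to the decomposed form (NFD). The last lines show how a single Kssha cluster counts as three code points, or nine UTF-8 bytes, when counted naively.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after normalizing both to the decomposed form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Latin-1: precomposed e-acute versus "e" + COMBINING ACUTE ACCENT.
print(same_text("\u00E9", "e\u0301"))

# Devanagari: precomposed QA versus KA + NUKTA.
print(same_text("\u0958", "\u0915\u093C"))

# Naive counts for the cluster KA + virama + SSA (Kssha).
kssha = "\u0915\u094D\u0937"
print(len(kssha), len(kssha.encode("utf-8")))
```

Counting user-perceived characters therefore needs cluster-aware logic on top of the code point sequence, whichever way the encoding question is settled.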
Suggestions in Unicode Indic FAQ
Hello,

There are a few discrepancies in the Indic FAQ. Though they were reported earlier by Andy White, I see they are still in the FAQ. I also clarified them, but by mistake I sent the mail to the Yahoo group where this mailing list is archived, and hence my mail never reached this mailing list. You can refer to the link http://groups.yahoo.com/group/unicode/message/16352

The following are the suggestions.

SUGGESTION-1: In the FAQ http://www.unicode.org/faq/indic.html#2 it is mentioned that

  ISCII             Unicode
  Halant + Halant   Halant + ZWJ

produce similar results. This is wrong. In ISCII, Halant+Halant is known as explicit halant, and its Unicode equivalent sequence is Halant+ZWNJ. So ZWJ should be replaced by ZWNJ.

SUGGESTION-2: In the FAQ http://www.unicode.org/faq/indic.html#16 it is mentioned that the following are equivalent:

  ISCII           Unicode
  KA halant INV   KA virama ZWJ
  RA halant INV   RAsup (i.e., repha)

In fact there is no way in Unicode to produce RAsup directly, i.e., without using a base consonant. The sequence RA virama ZWJ will actually produce half-RA (or eyelash-RA), which is commonly used in Marathi. Eyelash-RA can also be produced with the sequence RA Halant Nukta, both in ISCII (where it is known as soft halant) and in Unicode (just for conformance with ISCII). Also, in the same answer the following sequence is recommended:

  ISCII           Unicode
  INV halant RA   SPACE virama RA (RAsub)

SUGGESTION-3: Use of the SPACE character as a consonant may create problems for a state machine which finds language/syllable boundaries. In fact we need a code point for an invisible consonant (similar to INV in ISCII) in Unicode, which would solve this problem. After inclusion of the INV character, the following can be recommended:

  ISCII           Unicode
  KA halant INV   KA virama INV
  RA halant INV   RA virama INV (i.e., repha)
  INV halant RA   INV virama RA (RAsub)

The INV character in Unicode can also be used for displaying dependent vowel matras without a dotted circle:

  Unicode
  INV Vowel sign O
  INV Vowel sign AI
  etc.
This can replace the existing definition of SPACE as an invisible consonant depending on the context.

Any other pointers!!?

- Keyur
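The correspondence argued for in SUGGESTION-1 and the RA virama ZWJ sequence from SUGGESTION-2 can be spelled out as code points. The helper names `explicit_halant` and `half_form` are hypothetical, chosen for this sketch, which only builds the strings; how they are displayed depends on the font and rendering engine.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
KA = "\u0915"      # DEVANAGARI LETTER KA
RA = "\u0930"      # DEVANAGARI LETTER RA


def explicit_halant(consonant):
    """Unicode counterpart of ISCII Halant + Halant, as argued in SUGGESTION-1."""
    return consonant + VIRAMA + ZWNJ


def half_form(consonant):
    """Consonant + virama + ZWJ; for RA this is the sequence said to give eyelash-RA."""
    return consonant + VIRAMA + ZWJ


for label, seq in [("explicit halant", explicit_halant(KA)),
                   ("half form", half_form(RA))]:
    # Print each sequence as a list of U+XXXX code points.
    print(label, " ".join(f"U+{ord(ch):04X}" for ch in seq))
```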
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti [EMAIL PROTECTED] wrote:

> Why not representing INV with a double ZWJ? E.g.:
>
>   ISCII           Unicode
>   KA halant INV   KA virama ZWJ ZWJ
>   RA halant INV   RA virama ZWJ ZWJ (i.e., repha)
>   INV halant RA   ZWJ ZWJ virama RA (RAsub)
>
> This has the advantage that the most common sequences will work OK also on old display engines implemented *before* the double-ZWJ convention is introduced. E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for the simple reason that the first ZWJ is enough to do the work, and the second ZWJ is invisible. Of course, an old engine will still display a RA[eyelash] for RA virama ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a white box, which is what would happen with your new INV character.

Certainly. This looks more promising, because even RAsub has two alternate forms. One form is used with the consonants KA, KHA, GHA, etc., and the other form is used with the consonants TTA, TTHA, DDA, DDHA, etc. With your ZWJ-based scheme we can insert as many ZWJs as we wish to produce all possible alternate forms!

But sometimes a user may want a visual representation of these symbols in two different ways: with a dotted circle and without a dotted circle. An example of this could be RAsup on top of a dotted circle and RAsup on top of a space character. The current use of the space character to eliminate the dotted circle is really painful and may create problems in determining language and syllable boundaries. The main problem with the space character is that, unlike ZWJ/ZWNJ/Dotted Circle, it falls within the range of another important script, Latin. Finally, it may affect all important text processing which uses Unicode characters to find language boundaries. Use of an INV character can solve all these problems in one shot. We can put it in the consonant class, which can help text processing applications.

Moreover, it will be difficult to provide upward compatibility in all possible cases, even though it is desirable.
Implementations of Unicode will need to be upgraded with every introduction of new glyphs or rules. Otherwise, applications have to explicitly declare the version of Unicode used in the implementation.

- Keyur
RE: Suggestions in Unicode Indic FAQ
--- Kent Karlsson [EMAIL PROTECTED] wrote:

> A space followed by a dependent vowel sign should display just the dependent vowel sign, no dotted circle. Indeed, (except for a show invisibles mode, or a character chart display mode) no (Indic or other) text that does not contain the *character* DOTTED CIRCLE should ever display a dotted circle as part of the displayed text. Systems that do display a dotted circle (in normal display mode) where there is no such *character* in the displayed text are buggy!

In Indic scripts, any sign that appears in text without a valid consonant base may be rendered with a dotted circle as a fallback mechanism (Section 5.14 Rendering Nonspacing Marks http://www.unicode.org/uni2book/ch05.pdf). A system implementing this as its default behaviour should not be considered buggy. What the default rendering behaviour should be (i.e., show hidden or not) may vary from one script to another and also depends on implementation policy. For scripts other than Indic scripts, it may be useful to render the nonspacing mark without a dotted circle, because even when it is rendered as an overlapping glyph the result is recognizable. For Indic scripts, however, the dotted circle is very useful as default behaviour, since it gives the user immediate feedback that there may be a defective combining character in the text. Most of the time such errors are unintentional rather than intentional.

Unicode has a provision for removing this dotted circle. The space character is used to indicate to the fallback mechanism that no dotted circle should be used when rendering a stand-alone sign which is normally attached to other characters. This is useful when the user wants to display the sign without any circle. Also, with this scheme it is possible to show some combining marks with a dotted circle and some without.

- Keyur
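The default behaviour described here can be sketched as a display-time transformation: a dotted circle is supplied for a stand-alone vowel sign unless the author has put a SPACE in front of it. This is only an illustration under simplifying assumptions. The function name `with_fallback_bases` is hypothetical, only the Devanagari consonant and vowel-sign ranges are considered, and a real engine would do this in its glyph pipeline without touching the stored text.

```python
DOTTED_CIRCLE = "\u25CC"  # DOTTED CIRCLE
SPACE = " "
NUKTA = "\u093C"          # DEVANAGARI SIGN NUKTA


def is_consonant(ch):
    # KA..HA, plus the precomposed nukta consonants QA..YYA.
    return bool(ch) and ("\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F")


def is_vowel_sign(ch):
    # Dependent vowel signs (matras) AA..AU.
    return "\u093E" <= ch <= "\u094C"


def with_fallback_bases(text):
    """Return a display string with dotted circles under stand-alone vowel signs."""
    out = []
    for i, ch in enumerate(text):
        if is_vowel_sign(ch):
            prev = text[i - 1] if i > 0 else ""
            # SPACE suppresses the circle; a consonant (or nukta) is a valid base.
            if not (prev == SPACE or prev == NUKTA or is_consonant(prev)):
                out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)


print(with_fallback_bases(".\u0947") == "." + DOTTED_CIRCLE + "\u0947")
print(with_fallback_bases(" \u0947") == " \u0947")
```

The rule is generic: it needs no table of every base and mark combination, only a test for whether the preceding character is an acceptable base.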
RE: Suggestions in Unicode Indic FAQ
--- Marco Cimarosti [EMAIL PROTECTED] wrote:

> Keyur Shroff wrote: But sometimes a user may want visual representation of these symbols in two different ways: with dotted circle and without dotted circle.

> Why not using a dotted circle character explicitly, when you want to see one?

Note that whenever I use the term combining mark, I am really talking about vowel signs (matras) and other modifiers in Indic scripts, which are script dependent. I am sorry if I have confused you with the combining diacritical marks in the block [U+0300-U+036F], which I really didn't mean.

Let me give a proper example this time. Consider a Vowel Sign E [U+0947] appearing after any non-consonant character. This sign is generally attached to consonants. It has zero advance width, with a negative left side bearing, in the font. Clearly, since in this case the sign is not preceded by any consonant base, it has to be rendered using one of the mechanisms specified for fallback rendering of non-spacing marks. If we render it with a space, as you said, then we have to insert the space character at the time of fallback rendering (which can be taken care of in the rendering pipeline) even though the space character is not present in the backing store of the application.

Now, if we introduce a dotted circle into the text before this sign in order to render it with a circle, the circle is also an invalid base for Vowel Sign E. As a result, fallback rendering will again take place, rendering the circle and the vowel sign as positionally separate. In this case the dotted circle will appear first, followed by the vowel sign (matra) on top of a space character.

If you know any other way to solve this problem, please explain. Also let me know if I have misinterpreted the text of the Unicode standard.

> Example of this could be RAsup on top of dotted circle and RAsup on top of space character.
> Current use of space character to eliminate dotted circle is really painful and may create problems in determining language and syllable boundaries.

> Languages or syllable boundaries have nothing to do with this. These special sequences should *never* be part of any syllable or word in any language: they are just a way of showing the shape of a glyph, to be used when, e.g., talking about typography or spelling.

Then how can we take care of the fallback mechanism?

Thanks for taking the pain to answer my queries :-)

- Keyur