Re: compatibility between unicode 2.0 and 3.0

2003-02-03 Thread Keyur Shroff

--- Kenneth Whistler [EMAIL PROTECTED] wrote:
 
 This depends greatly on what implementation you did for
 sorting and searching, and how it handles unassigned code points
 in your Unicode 2.0 code. If the code was designed to be
 forward compatible, it should do reasonable things with
 unassigned code points, and getting Unicode 3.0 data which
 is actually using those code points should not disturb your
 existing code. But, on the other hand, if you have built
 in a bunch of range checks or have used tables which cannot
 gracefully handle the appearance of unassigned code points
 in your data, then it could well blow up.

Can you please explain the best practice for handling unassigned code
points so that applications can easily be made forward compatible? If we
just ignore unassigned code points, will that make it easier for an
application to migrate to a later version of Unicode?
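
One common way to stay forward compatible is to treat unassigned code
points as opaque data: detect them, keep them in the text, and give them a
default behaviour instead of rejecting them with range checks. The
following is only a minimal Python sketch of that idea. It assumes the
standard unicodedata module (whose tables reflect whatever Unicode version
the interpreter ships with); the function names and the sort rule are
illustrative, not a real collation algorithm.

```python
import unicodedata

def is_unassigned(ch: str) -> bool:
    """True if this interpreter's Unicode tables do not assign ch (category Cn)."""
    return unicodedata.category(ch) == "Cn"

def sort_key(text: str):
    """Hypothetical sort key that tolerates unassigned code points.

    Assigned characters are ordered by code point; unassigned ones are kept
    (never dropped) and simply ordered after every assigned character.
    """
    return [(1 if is_unassigned(ch) else 0, ord(ch)) for ch in text]

# U+0378 has no assignment in current Unicode versions (an assumption that
# a later version of the standard could invalidate).
words = ["\u0915\u0378", "\u0915\u0916"]
ordered = sorted(words, key=sort_key)
print([ascii(w) for w in ordered])
```

The point of the design is that data written against a later Unicode
version passes through unchanged, so upgrading the tables changes the
ordering of the new characters but does not break the code.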

- Keyur


__
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com




RE: Suggestions in Unicode Indic FAQ

2003-02-02 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:

  
  Without that dotted circle appearing, the e-matra would appear to
  have been properly encoded, 
 
 No, with proper reordering (and normal display mode), the e-matra at
 the beginning of the second word would appear to be last glyph of the
 first word.  Similarly, for the second case, the e-matra glyph would
 have come to the left of the pa.  The fluent reader (ok, not me...)
 would then see those errors anyway, just like I can find spelling
 errors in Swedish, most often without any kind of special marking. (I'm
 assuming through-out that reordrant combining characters are reordered.)

Illegal sequences are not reordered as you indicated. Also, as far as I
know, the Unicode Standard makes no mention of reordering an illegal input
sequence (or an invalid combining mark).

Consider the last set of glyphs (left-to-right, top-to-bottom) in the
attached image. It shows the rendering of the illegal input sequence
Devanagari Vowel Sign I [U+093F] + Devanagari Letter Ka [U+0915] without
any dotted circle. As you may know, the correct input sequence is U+0915
followed by U+093F, and its rendering would look much the same as what
appears here. (A more sophisticated font or application might substitute
another glyph for U+093F with a proper attachment point.) Without the
dotted circle, the user has no way to identify this illegal input sequence.
In the worst case, the rendered glyph even ends up attached to a character
of a class for which the glyph was designed (for example, the consonant
cluster Ka Virama Ma). In such a case even a fluent reader cannot identify
the error.
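
To make the idea of an "illegal input sequence" concrete, here is a rough
Python sketch that reports Devanagari signs with no consonant before them.
It is deliberately simplified: only U+0915..U+0939 count as a valid base,
the helper names are made up for this example, and a real shaping engine
follows far more detailed syllable rules.

```python
import unicodedata

# Simplifying assumption: only the basic consonants KA..HA are valid bases.
CONSONANTS = range(0x0915, 0x093A)

def is_dependent_sign(ch):
    """Devanagari combining signs (matras, virama, anusvara, ...)."""
    return 0x0900 <= ord(ch) <= 0x097F and unicodedata.category(ch) in ("Mn", "Mc")

def orphan_signs(text):
    """Return the indexes of signs that have no consonant base before them."""
    orphans = []
    has_base = False
    for i, ch in enumerate(text):
        if is_dependent_sign(ch):
            if not has_base:
                orphans.append(i)
        else:
            has_base = ord(ch) in CONSONANTS
    return orphans

print(orphan_signs("\u093F\u0915"))  # Vowel Sign I typed before KA
print(orphan_signs("\u0915\u093F"))  # KA followed by Vowel Sign I
```

An editor could use the returned indexes to decide where a marker such as
the dotted circle is shown.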

 
 There are spelling errors, yes.  But there are other ways of indicating
 spelling errors, that are (by now) fairly conventional for any language
 (as long as there is an appropriate dictionary installed), and that also
 are more general (in catching more spelling errors) and less obtrusive
 (the author really wants to write it that way, for some reason).
 
  Apparently, Michka used a non-OpenType Bengali Unicode font when
  he embedded the fonts into the page.  As long as you are looking
  at the page on-line, with the embedded fonts, these errors are
  invisible.  
  
  It may be typographically horrible.  It *should* be typographically
  horrible in order to illustrate bad sequences clearly.
 
 I'd prefer little red wiggly lines under the word, or yellow background
 or some such (just for screen display, not for printing; screen grabs
 not counted).  And that for any spelling error.

Spelling mistakes fall into two classes. The first arises from an illegal
input sequence (e.g., Vowel Sign E as the first character in a word); the
second is a legal input sequence that has no meaning in the dictionary.
While the second type of mistake is generally flagged only by sophisticated
applications such as word processors, everyone wants to be told about the
first kind. By your explanation, even a plain text editor would be of no
use in identifying such common typing mistakes!

- Keyur


inline: img1.jpg

RE: Suggestions in Unicode Indic FAQ

2003-02-02 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:
  
  No fallback rendering is coming into picture with your explanation. 
 
 Yes, there is.  A character sequence FULL STOP, VOWEL SIGN E (say)
 is very unlikely to have a ligature, specially adapted (and fitting)
 adjustment points, or similar.  The rendering would in that sense
 need to use a fallback mechanism that renders an approximation
 for this rare combination.

Do you mean that, for such an approximation, the fallback mechanism of an
application has to handle every combination of each combining mark with all
other Unicode characters? Have you counted the combinations? They may run
into millions!
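
As a back-of-the-envelope check of the order of magnitude, the pairs can be
counted from the character database. This is only a sketch: it uses
Python's unicodedata tables (so the totals depend on the Unicode version
bundled with the interpreter) and counts every assigned non-mark character
as a possible base, which is an assumption made for this estimate.

```python
import sys
import unicodedata

marks = 0    # combining marks: general categories Mn, Mc, Me
others = 0   # every other assigned character, used here as a possible base

for cp in range(sys.maxunicode + 1):
    cat = unicodedata.category(chr(cp))
    if cat in ("Mn", "Mc", "Me"):
        marks += 1
    elif cat not in ("Cn", "Cs", "Co"):  # skip unassigned, surrogate, private use
        others += 1

pairs = marks * others
print(marks, others, pairs)
```

The product is far too large for per-pair handling, which is why a fallback
has to be a generic rule rather than a table of combinations.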

- Keyur






Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi Aditya,

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 I had a few queries regarding the representation of Devanagari script in
 Unicode (code page 0x0900 - 0x097F). Devanagari is a writing script used
 in the Hindi, Marathi and Sanskrit languages. I have the following
 questions - 
 
 
 In the same script code page, how do I use these two different Glyphs, to
 represent the same character ? Is there any way by which I can do it in
 an Open type font and Free type font implementation ?

Yes, it is certainly possible with an OpenType font. Please note that
FreeType is not a font format; it is a rendering library used to rasterize
different kinds of fonts, including TrueType and OpenType fonts.

In an OpenType font, you can include all the glyphs with alternate shapes
and then select one of them depending on the script and language. The
application should specify the script and language tags when sending
character codes to the OpenType rendering library/engine. All substitutions
then take place according to the language and/or script selection. The font
should have a default script. Similarly, each script has a default
language, which is used as the fallback if the application does not specify
which language to use for processing.

From the list of alternate glyphs, you may want to map the glyph for the
default language in the cmap table. This default glyph can then be
substituted by an alternate glyph depending on the language specified. You
have to use the GSUB table and write language-dependent lookups for the
substitution.
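
The selection logic described above can be modelled in a few lines of
Python. This is a toy model only, not a font API: the glyph names and table
contents are invented for illustration, and a real engine reads the cmap
and GSUB tables from the font file. The tags "deva" and "MAR " are the
registered OpenType script and language system tags for Devanagari and
Marathi.

```python
# Toy model of OpenType glyph selection. All glyph names are hypothetical.
DEFAULT_LANG = "dflt"

# cmap: code point -> glyph for the default language
CMAP = {0x0932: "la"}

# GSUB-like substitutions, keyed by (script tag, language system tag)
GSUB = {
    ("deva", DEFAULT_LANG): {},
    ("deva", "MAR "): {"la": "la.marathi"},
}

def select_glyph(codepoint, script="deva", language=None):
    """Map a code point to a glyph, then apply language-specific substitution."""
    glyph = CMAP[codepoint]
    lookups = GSUB.get((script, language))
    if lookups is None:
        # No entry for this language: fall back to the script's default.
        lookups = GSUB.get((script, DEFAULT_LANG), {})
    return lookups.get(glyph, glyph)

print(select_glyph(0x0932, "deva", "MAR "))  # Marathi alternate
print(select_glyph(0x0932, "deva", "HIN "))  # no Hindi entry, default glyph
```

Note that the character code is the same in both calls; only the language
tag supplied by the application changes the glyph.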

 
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

Unicode is not divided into code pages. Unlike some older encodings, there
is only one code page for the entire Unicode Standard. However, for
readability and quick reference, the code charts are divided into sections
(blocks), which you might interpret as code pages.

 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.


Unicode assigns code points to scripts only, not to languages. In fact, it
is not desirable to give separate code points to individual languages that
share a script. Also, Unicode encodes characters, which have abstract
meaning and properties; it does not encode glyphs. The glyph shapes shown
in the Unicode charts are given just for convenience and do not prescribe
the shapes to be used in a font. The shape of the glyph for a Unicode
character may vary from one font to another. Since it is already possible
to select the proper glyph(s) depending on the language selection, this
scheme is suitable for all Indian languages.
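
A small Python check illustrates that the code points identify the script
and nothing more. The two sample words and the helper name are chosen only
for this example; the character names come from the standard unicodedata
module.

```python
import unicodedata

def in_devanagari_block(ch):
    """True for code points in the Devanagari block U+0900..U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F

hindi_word = "\u0928\u092E\u0938\u094D\u0924\u0947"           # "namaste"
marathi_word = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930"   # "namaskar"

for word in (hindi_word, marathi_word):
    # Every character reports the script in its name, never the language.
    print([unicodedata.name(ch) for ch in word])
    print(all(in_devanagari_block(ch) for ch in word))
```

Both words pass the same block test, so the language has to come from
information outside the character codes.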


 
 
 3. Character codes for jna, shra, ksh - 
 
 In Sanskrit and Marathi jna, shra and ksh are considered as separate
 characters and not ligatures. How do we take care of this ? Can I get
 over all views on the matter from the group ? In my opinion they should
 be given different code points in the specific language code page.
 Please find below the character glyphs - 
 
 jna
 shra
 ksh

All of the above can be composed through the following consonant clusters:
  jna - ja halant nya
  shra - sha halant ra
  ksh - ka halant ssha

The point that the above sequences are considered characters in some
Indian languages has merit. If there is demand from native speakers, a
proposal can be submitted to Unicode. There is a predefined procedure for
proposal submission. Once the proposal has been discussed with the people
concerned and agreed upon, these ligatures can be added to the Devanagari
script itself, because the Devanagari script represents all three languages
you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile, you can
write rules for composing them from the consonant clusters.
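
For reference, the three clusters listed above can be written out as code
point sequences. This Python sketch only spells out the sequences and
prints the standard character names; the dictionary keys are the informal
names used in this thread.

```python
import unicodedata

HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA

CLUSTERS = {
    "jna":  "\u091C" + HALANT + "\u091E",  # JA + halant + NYA
    "shra": "\u0936" + HALANT + "\u0930",  # SHA + halant + RA
    "ksh":  "\u0915" + HALANT + "\u0937",  # KA + halant + SSA
}

for name, seq in CLUSTERS.items():
    print(name, [unicodedata.name(ch) for ch in seq])
```

Whether each sequence is displayed as a single conjunct glyph depends on
the font and the rendering engine, not on the encoding.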

Regards,
Keyur






Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi,

I forgot to reply to the implementation query. My reply is inline.

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?
 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.

Instead of changing an encoding standard (or recommending such a change),
your problem is best solved in your application. You can use tags in your
text to specify the language. Unicode also provides a way to tag text, but
its use is highly discouraged. So you can use a markup language such as XML
or HTML to mark language boundaries. Then parse your text, identify the
language boundaries, and do further processing depending on the language.
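
As an illustration of the tagging approach, here is a minimal Python sketch
that reads language-tagged runs from XML. The document structure (a doc
element with p children) is an assumption made for this example; xml:lang
is the standard XML attribute for language, and "hi", "mr" and "sa" are the
language codes for Hindi, Marathi and Sanskrit.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

doc = (
    "<doc>"
    '<p xml:lang="hi">\u0928\u092E\u0938\u094D\u0924\u0947</p>'
    '<p xml:lang="mr">\u0928\u092E\u0938\u094D\u0915\u093E\u0930</p>'
    '<p xml:lang="sa">\u0928\u092E\u0903</p>'
    "</doc>"
)

def language_runs(xml_text):
    """Return (language tag, text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(p.get(XML_LANG), p.text) for p in root.iter("p")]

for lang, text in language_runs(doc):
    print(lang, len(text))
```

A translation engine could then dispatch each run to the processing for its
language without inspecting the code points at all.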

If you don't want to use tags in your text, you can predict the language
with some heuristic. The heuristic can be based on language properties that
differ among the three languages. In this case your processing is divided
into two phases: the first applies the heuristic rules to identify language
boundaries in plain text, and the second actually processes the text for
translation. But beware that the result of such heuristic processing will
not always be accurate. Hence the use of tags is recommended.

Regards,
Keyur






Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff

--- Asmus Freytag [EMAIL PROTECTED] wrote:

 
 All of the above can be composed through following consonant clusters:
jna - ja halant nya
shra - sha halant ra
ksh - ka halant ssha
 
 The point that the above sequences are considered as characters in some
 of
 the Indian languages has merit. If there is demand from native speakers
 then a proposal can be submitted to Unicode. There is a predefined
 procedure for proposal submission. Once this is discussed with concerned
 people and agreed upon then these ligatures can be added in Devanagari
 script itself because Devenagari script represent all three languages
 you
 mentioned namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
 rules for composing them from the consonant clusters.
 
 I wouldn't go so far. The fact that clusters belong together is something
 
 that can be handled by the software. Collation and other data processing 
 needs to deal with such issues already for many other languages. See 
 http://www.unicode.org/reports/tr10 on the collation algorithm.

I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as a separate code point. India is a big country with
millions of people, geographically divided and speaking a variety of
languages. Sentiments are attached to cultures, which may vary from one
geographical area to another. So when one of the many languages sharing a
script dominates the entire encoding for that script, other groups of
people may feel that their language has not been represented properly in
the encoding. While Unicode encodes scripts only, the aim was to provide
sufficient representation to as many languages as possible.

In Unicode, many characters have been given code points even though the
same character could have been rendered through some composition
mechanism. This is true of Indic scripts as well as other scripts. For
example, in the Devanagari script some code points are allocated to
characters (consonant + Nukta) even though the same characters could be
produced by combining the consonant and the Nukta. Similarly, in the
Latin-1 range [U+0080-U+00FF] there are a few characters that can be
produced in other ways. That is why text should be normalized to either a
precomposed or a decomposed character sequence before further processing in
operations like searching and sorting.
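
The normalization step can be tried with Python's standard unicodedata
module. In this sketch, comparing the decomposed (NFD) forms is one
possible choice, and the helper name same_text is made up for the example.
U+0958 (QA) is one of the consonant + Nukta characters, and U+00E9 is a
Latin-1 letter with a canonical decomposition.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after normalizing both to decomposed form (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# DEVANAGARI LETTER QA versus KA + NUKTA
print(same_text("\u0958", "\u0915\u093C"))

# LATIN SMALL LETTER E WITH ACUTE versus e + COMBINING ACUTE ACCENT
print(same_text("\u00E9", "e\u0301"))

# Without normalization the two spellings compare as different strings.
print("\u0958" == "\u0915\u093C")
```

Searching and sorting code that normalizes its input first treats both
spellings as the same text.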

Also, processing of text often depends on the smallest addressable unit of
the language. Again, as discussed in earlier e-mails, this may vary from
one language to another within the same script. Consider a case where a
language processor/application wants to count the number of characters in
some text in order to find the number of keystrokes required to input the
text. Further assume that the API functions used for this purpose are based
on either WChar (wide characters) or UTF-8. In this case it is necessary to
assign the character, say Kssha, to the class consonant. Since assignment
to this class applies to a single code point (the smallest addressable
unit) and not to a sequence of codes, it is necessary to have a single code
point for the character Kssha.
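
To make the counting issue concrete, the Python sketch below shows how the
counts differ for Kssha depending on the unit, and what cluster-level
counting done in software (the other position in this thread) might look
like. The function count_units is a simplified illustration written for
this example, not a standard segmentation algorithm.

```python
import unicodedata

HALANT = "\u094D"
kssha = "\u0915" + HALANT + "\u0937"  # KA + halant + SSA

print(len(kssha))                           # code points
print(len(kssha.encode("utf-8")))           # UTF-8 bytes
print(len(kssha.encode("utf-16-le")) // 2)  # 16-bit wide characters

def count_units(text):
    """Count user-level units: a halant joins its neighbours, signs do not count."""
    units = 0
    after_halant = False
    for ch in text:
        if ch == HALANT:
            after_halant = True
        elif unicodedata.category(ch) in ("Mn", "Mc"):
            pass  # matras and other signs belong to the current unit
        else:
            if not after_halant:
                units += 1
            after_halant = False
    return units

print(count_units(kssha))
```

The first three counts are what code-unit based APIs report; the last is
closer to what a user would call a character.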

This is my understanding. Please enlighten me if I am wrong.

Regards,
Keyur






Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff
Hello,

There are a few discrepancies in the Indic FAQ. Though they were reported
earlier by Andy White, I see they are still present in the FAQ. I also sent
a clarification, but by mistake I sent the mail to the Yahoo group where
this mailing list is archived, and hence my mail never reached this mailing
list. You can refer to the link
http://groups.yahoo.com/group/unicode/message/16352


The following are the suggestions.

SUGGESTION-1:

In the FAQ
   http://www.unicode.org/faq/indic.html#2
it is mentioned that 

ISCII:             Unicode:
Halant + Halant    Halant + ZWJ

produce similar results. This is wrong. In ISCII, Halant+Halant is known as
explicit halant, and its Unicode equivalent sequence is Halant+ZWNJ. So ZWJ
should be replaced by ZWNJ.
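
The corrected mapping can be sketched as a small converter in Python. To
avoid depending on ISCII byte values, the input here is a list of symbolic
token names, which is an assumption of this example; the letter table is
deliberately tiny. The soft halant rule (Halant + Nukta) is the one
described under SUGGESTION-2.

```python
VIRAMA = "\u094D"
ZWNJ = "\u200C"
ZWJ = "\u200D"

# Tiny illustrative letter table; a real converter covers the whole script.
LETTERS = {"KA": "\u0915", "RA": "\u0930"}

def iscii_tokens_to_unicode(tokens):
    """Convert symbolic ISCII tokens, e.g. ["KA", "HALANT", "HALANT"]."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "HALANT" and nxt == "HALANT":
            out.append(VIRAMA + ZWNJ)  # explicit halant
            i += 2
        elif tok == "HALANT" and nxt == "NUKTA":
            out.append(VIRAMA + ZWJ)   # soft halant
            i += 2
        elif tok == "HALANT":
            out.append(VIRAMA)
            i += 1
        else:
            out.append(LETTERS[tok])
            i += 1
    return "".join(out)

print(ascii(iscii_tokens_to_unicode(["KA", "HALANT", "HALANT"])))
```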


SUGGESTION-2:

In the FAQ
   http://www.unicode.org/faq/indic.html#16

It is mentioned that the following are equivalent:

ISCII            Unicode
KA halant INV    KA virama ZWJ
RA halant INV    RAsup (i.e., repha)

In fact there is no way in Unicode to produce RAsup directly, i.e., without
using a base consonant. The sequence RA virama ZWJ will actually produce
half-RA (or eyelash-RA), which is commonly used in Marathi. Eyelash-RA can
also be produced with the sequence RA Halant Nukta, both in ISCII (where it
is known as soft halant) and in Unicode (just for conformance with ISCII).

Also, in the same answer the following sequence is recommended.

ISCII            Unicode
INV halant RA    SPACE virama RA (RAsub)



SUGGESTION-3:

Use of the SPACE character as a consonant may create problems for a state
machine that finds language/syllable boundaries. In fact we need a code
point in Unicode for an invisible consonant (similar to INV in ISCII),
which would solve this problem.

After inclusion of INV character the following can be recommended.

ISCII            Unicode
KA halant INV    KA virama INV
RA halant INV    RA virama INV (i.e., repha)
INV halant RA    INV virama RA (RAsub)

The INV character in Unicode can also be used for displaying dependent
vowel matras without dotted circle.

Unicode
INV Vowel sign O
INV Vowel sign AI

etc. This can replace the existing definition of SPACE as an invisible
consonant depending on the context.

Any other pointers!!?

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Marco Cimarosti [EMAIL PROTECTED] wrote:

 Why not representing INV with a double ZWJ? E.g.:
 
   ISCII            Unicode
   KA halant INV    KA virama ZWJ ZWJ
   RA halant INV    RA virama ZWJ ZWJ (i.e., repha)
   INV halant RA    ZWJ ZWJ virama RA (RAsub)
 
 This has the advantage that the most common sequences will work OK also
 on
 old display engines implemented *before* the double-ZWJ convention is
 introduced.
 
 E.g., sequence KA virama ZWJ ZWJ works well also on an old engine, for
 the
 simple reason that the first ZWJ is enough to do the work, and  the
 second ZWJ is invisible.
 
 Of course, an old engine will still display a RA[eyelash] for RA
 virama
 ZWJ ZWJ, but that is not worse than displaying RA+virama followed by a
 white box, which is what would happen with your new INV character.

Certainly. This looks more promising because even RAsub has two alternate
forms. One form is used with the consonants KA, KHA, GHA, etc., and the
other is used with the consonants TTA, TTHA, DDA, DDHA, etc. With your
ZWJ-based scheme we can insert as many ZWJs as we wish to produce all
possible alternate forms!

But sometimes a user may want a visual representation of these symbols in
two different ways: with a dotted circle and without one. An example is
RAsup on top of a dotted circle versus RAsup on top of a space character.
The current use of the space character to eliminate the dotted circle is
really painful and may create problems in determining language and syllable
boundaries. The main problem with the space character is that, unlike
ZWJ/ZWNJ/Dotted Circle, it falls within the range of another important
script, Latin. Ultimately it may affect all important text processing that
uses Unicode characters to find language boundaries. An INV character can
solve all these problems in one shot. We can put it in the consonant class,
which helps text processing applications. Moreover, it will not always be
possible to provide upward compatibility, even though it is desirable.
Implementations of Unicode will need to be upgraded with every introduction
of new glyphs or rules. Otherwise applications have to explicitly declare
the version of Unicode used in the implementation.

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Kent Karlsson [EMAIL PROTECTED] wrote:
 
 A space followed by a dependent vowel sign should display just the
 dependent vowel sign, no dotted circle.  Indeed, (except for a show
 invisibles mode, or a character chart display mode) no (Indic or
 other)
 text that does not contain the *character* DOTTED CIRCLE should ever
 display a dotted circle as part of the displayed text. Systems that
 do display a dotted circle (in normal display mode) where there is
 no such *character* in the displayed text are buggy!

In Indic scripts, any sign that appears in text without a valid consonant
base may be rendered with a dotted circle as a fallback mechanism (Section
5.14, Rendering Nonspacing Marks,
http://www.unicode.org/uni2book/ch05.pdf). A system implementing this as
its default behaviour should not be considered buggy. What the default
rendering behaviour should be (i.e., show hidden or not) may vary from one
script to another and also depends on implementation policy.

For scripts other than Indic scripts, it may be useful to render the
nonspacing mark without a dotted circle, because even when it is rendered
as an overlapping glyph the result is recognizable. For Indic scripts,
however, a dotted circle is very useful as the default behaviour, since it
gives the user immediate feedback that there may be a defective combining
character in the text. Most of the time such errors are unintentional.

Unicode has a provision for removing this dotted circle. The space
character is used to tell the fallback mechanism that no dotted circle
should be used when rendering a stand-alone sign that is normally attached
to other characters. This is useful when the user wants to display the sign
without any circle. Also, with this scheme it is possible to show some
combining marks with a dotted circle and some without.
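
The fallback policy described in this message can be modelled roughly as
follows. This Python sketch produces the string an engine would display,
leaving the backing store untouched. It is an illustration under simplified
assumptions (only U+0915..U+0939 are valid bases, and only a preceding
SPACE suppresses the circle), not the behaviour of any particular rendering
engine.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"
SPACE = " "
CONSONANTS = range(0x0915, 0x093A)  # simplifying assumption: KA..HA only

def is_sign(ch):
    """Devanagari combining signs (matras, virama, anusvara, ...)."""
    return 0x0900 <= ord(ch) <= 0x097F and unicodedata.category(ch) in ("Mn", "Mc")

def display_form(text):
    """Insert a dotted circle before any sign lacking a consonant or SPACE base."""
    out = []
    has_base = False
    for ch in text:
        if is_sign(ch):
            if not has_base:
                out.append(DOTTED_CIRCLE)
                has_base = True  # later signs attach to the inserted circle
        else:
            has_base = ord(ch) in CONSONANTS or ch == SPACE
        out.append(ch)
    return "".join(out)

print(ascii(display_form(".\u0947")))       # circle inserted after the full stop
print(ascii(display_form(" \u0947")))       # SPACE suppresses the circle
print(ascii(display_form("\u0915\u0947")))  # valid base, nothing inserted
```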

- Keyur






RE: Suggestions in Unicode Indic FAQ

2003-01-29 Thread Keyur Shroff

--- Marco Cimarosti [EMAIL PROTECTED] wrote:
 Keyur Shroff wrote:
  But sometimes a user may want visual representation of these 
  symbols in two different ways: with dotted circle and
  without dotted circle.
 
 Why not use a dotted circle character explicitly when you want to see
 one?

Note that whenever I use the term combining mark, I am really talking
about vowel signs (matras) and other modifiers in Indic scripts, which are
script dependent. I am sorry if I confused you with the combining
diacritical marks in the block [U+0300-U+036F], which I did not mean.

Let me give a proper example this time. Consider a Vowel Sign E [U+0947]
appearing after any non-consonant character. This sign is generally
attached to consonants. It has zero advance width and a negative left side
bearing in the font. Clearly, since in this case the sign is not preceded
by a consonant base, it has to be rendered using one of the mechanisms
specified for fallback rendering of nonspacing marks. If we render it with
a space, as you said, then we have to insert a space character at the time
of fallback rendering (which can be taken care of in the rendering
pipeline) even though no space character is present in the backing store of
the application. Now, if we instead introduce a dotted circle into the text
before this sign in order to render it with a dotted circle, the circle is
also an invalid base for Vowel Sign E. As a result, fallback rendering will
again take place, rendering the circle and the vowel sign as positionally
separate: first the dotted circle will appear, followed by the vowel sign
(matra) on top of a space character.

If you know any other way to solve this problem, please explain. Also let
me know if I have misinterpreted the text of the Unicode Standard.


 
  Example of
  this could be RAsup on top of dotted circle and RAsup on top of space
  character. Current use of space character to eliminate dotted 
  circle is really painful and may create problems in determining 
  language and syllable boundaries.
 
 Languages or syllable boundaries have nothing to do with this. These
 special sequences should *never* be part of any syllable or word in any
 language: they are just a way of showing the shape of a glyph, to be used
 when, e.g., talking about typography or spelling.

Then how can we take care of the fallback mechanism?


Thanks for taking the trouble to answer my queries :-)

- Keyur


