Re: Plane 14 Tag Deprecation Issue (was Re: VS vs. P14 (was Re: Indic Devanagari Query))

2003-02-07 Thread Asmus Freytag
At 11:54 AM 2/6/03 -0800, Kenneth Whistler wrote:

My personal opinion? The whole debate about deprecation of
language tag characters is a frivolous distraction from
other technical matters of greater import, and things would
be just fine with the current state of the documentation.
But, if formal deprecation by the UTC is what it would take
to get people to stop advocating more use of the language
tags after the UTC has long determined that their use is
strongly discouraged, then so be it.


My personal opinion is that labelling them as "restricted for
use with protocols requiring their use" is sufficient and proper.
In the context of such protocols, the use of tag characters is
a fine mechanism. They certainly have some advantages over
ASCII-style markup (e.g. lang=...) in many situations.

Where they don't have a place is in regular 'plain' text streams.

Formal deprecation would imply to me that ANY use is discouraged,
including the use with protocols that wish to make use of them.
THAT seems to be going too far in this case.

Where we have deprecated format characters in the past it has been
precisely in situations where we wanted to discourage the use of
particular 'protocols', for example for shaping and national digit
selection.

A./




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-07 Thread Andrew C. West
John H. Jenkins wrote:

 Ah, but decorative motifs are not plain text.

Ah, but it could be.




Re: Plane 14 Tag Deprecation Issue (was Re: VS vs. P14 (was Re: Indic Devanagari Query))

2003-02-07 Thread William Overington
I feel that, as the matter was put forward for Public Review, it is
reasonable for someone reading that review to respond on the basis of what
is stated as the issue in the Public Review item itself.

Kenneth Whistler now states an opinion as to what the review is about and
mentions a file PropList.txt of which I was previously unaware.

Recent discussions in the later part of 2002 in this forum about the
possibilities of using language tags only started as a direct result of the
Unicode Consortium instituting the Public Review.

The recent statement by Asmus Freytag seems fine to me.  Certainly I might
be inclined to add in a little so as to produce "Plane 14 tags are reserved
for use with particular protocols requiring, or providing facilities for,
their use", so that the possibility of using them to add facilities, rather
than simply using them when obligated to do so, is included, but that is not
a great issue: what Asmus wrote is fine.

Public Review is, in my opinion, a valuable innovation.  Two issues have so
far been resolved using the Public Review process.  Those results do seem to
indicate the value of seeking opinions by Public Review.

As I have mentioned before I have a particular interest in the use of
Unicode in relation to the implementation of my telesoftware invention using
the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system.
I feel that language tags may potentially be very useful for broadcasts of
multimedia packages which include Unicode text files, by direct broadcast
satellites across whole continents.  Someone on this list, I forget who, but
I am grateful for the comment, mentioned that even if formal deprecation
goes ahead then that does not stop the language tags being used as once an
item is in Unicode it is always there.  So fine, though it would be nice if
the Unicode Specification did allow for such possibilities within its
wording.  The wording stated by Asmus Freytag pleases me, as it seems a
good, well-rounded balance: it avoids obliging the people who make many
widely used packages to include software to process language tags,
whilst still formally recognizing the opportunity for language tags to be
used to advantage in appropriate special circumstances.  I feel that that is
a magnificent compromise wording which will hopefully be widely applauded.

In using Unicode on the DVB-MHP platform I am thinking of using Unicode
characters in a file and the file being processed by a Java program which
has been broadcast.  The file PropList.txt just does not enter into it for
this usage, so it is not a problem for me as to what is in that file.  My
thinking is that many, maybe most, multimedia packages being broadcast will
not use language tags and will have no facilities for decoding them.
However, I feel that it is important to keep open the possibility that some
such packages can use language tags provided that the programs which handle
them are appropriately programmed.  There will need to be a protocol.
Hopefully a protocol already available in general internationalization and
globalization work can be used directly.  If not, hopefully a special
Panplanet protocol can be devised specifically for DVB-MHP broadcasting.

On the matter of using Unicode on the DVB-MHP platform, readers might like
to have a look at the following about the U+FFFC character.

http://www.users.globalnet.co.uk/~ngo/ast03200.htm

Readers who are interested in uses of the Private Use Area might like to
have a look at the following.  They are particularly oriented towards the
DVB-MHP platform but do have wider applications both on the web and in
computing generally.

http://www.users.globalnet.co.uk/~ngo/ast03000.htm

http://www.users.globalnet.co.uk/~ngo/ast03100.htm

http://www.users.globalnet.co.uk/~ngo/ast03300.htm

The main index page of the webspace is as follows.

http://www.users.globalnet.co.uk/~ngo

William Overington

7 February 2003




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-07 Thread Asmus Freytag
At 01:52 AM 2/7/03 -0800, Andrew C. West wrote:

 Ah, but decorative motifs are not plain text.

Ah, but it could be.


Ah, but it wouldn't be Unicode.

A(h)./




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-06 Thread Doug Ewell
Asmus Freytag asmusf at ix dot netcom dot com wrote:

 Unicode 4.0 will be quite specific: P14 tags are reserved for
 use with particular protocols requiring their use is what the
 text will say more or less.

I didn't know the question of what to do about Plane 14 language tags
had already been resolved.

If that is the case, it might make sense to add an explanatory note to
the Public Review item on Plane 14 tags, or simply to remove the item.

-Doug Ewell
 Fullerton, California





VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-06 Thread Andrew C. West
James Kass wrote,

 (What happens if someone discovers a 257th variant? Do they
 get a prize? Or, would they be forever banished from polite
 society?)

I was thinking about that. 256 variants of a single character may seem a tad
excessive, but there is a common Chinese decorative motif (frequently seen on
trays and tea-pots and scarves and suchlike) comprising the ideograph shou4
(U+58FD, U+5900, U+5BFF) "longevity" written in 100 variant forms (called bai3
shou4 tu2 in Chinese). See
http://www.tydao.com/sxsu/shenhuo/minju/images/mj17.htm for an example.

A quick google on qian1 shou4 tu2 (the ideograph shou4 written in a thousand
different forms) came up with a piece of calligraphy by Wang Yunzhuang (b. 1942)
which comprises the ideograph shou4 written in no less than 1,256 unique variant
forms!

Googling on wan4 shou4 tu2 (the ideograph shou4 written in 10,000 forms)
also had a number of hits, but these refer to a compilation of calligraphy by
forty artists that took 16 years to create (written on a scroll 160 metres in
length), so these may not all be unique variants.

There are also a number of other auspicious characters, such as fu2 (U+798F)
"good fortune", that may be found written in a hundred variant forms as a
decorative motif.

All in all, the new variation selectors may be kept quite busy if applied to the
ideograph shou4 and its friends!

Andrew




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-06 Thread John H. Jenkins
On Thursday, February 6, 2003, at 08:47 AM, Andrew C. West wrote:


There are also a number of other auspicious characters, such as fu2 (U+798F)
good fortune that may be found written in a hundred variant forms as a
decorative motif.

Ah, but decorative motifs are not plain text.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/





Re: Indic Devanagari Query

2003-02-05 Thread Andrew C. West
On Wed, 05 Feb 2003 02:00:30 -0800 (PST), [EMAIL PROTECTED] wrote:

 If these alternate forms were needed to be displayed in a single
 multi-lingual plain-text file, wouldn't we need some method of 
 tagging the runs of Latin text for their specific languages?

Is this not what the variation selectors are available for?

And now that we are soon to have 256 of them, perhaps Unicode ought not to be shy
about using them for characters other than mathematical symbols.

Andrew




Re: Indic Devanagari Query

2003-02-05 Thread Peter_Constable

On 02/04/2003 02:52:25 PM jameskass wrote:

If these alternate forms were needed to be displayed in a single
multi-lingual plain-text file, wouldn't we need some method of
tagging the runs of Latin text for their specific languages?

The plain-text file would be legible without that -- I don't think this is
an argument in favour of plane 14 tag characters. Preserving
culturally-preferred appearance would certainly require markup of some
form, whether lang IDs or font-face and perhaps font-feature
formatting.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







Re: Indic Devanagari Query

2003-02-05 Thread Peter_Constable

On 02/05/2003 04:05:44 AM Andrew C. West wrote:

 If these alternate forms were needed to be displayed in a single
 multi-lingual plain-text file, wouldn't we need some method of
 tagging the runs of Latin text for their specific languages?

Is this not what the variation selectors are available for?

That is a possible technical solution to such variations, though specific
character+variant combinations would have to be approved and documented by
UTC. It's not the only solution, and might or might not be the best.
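For concreteness, a variation sequence is simply the base character immediately followed by a variation selector code point; a sketch in Python (the specific pairing below is hypothetical and chosen only for illustration -- only combinations approved by the UTC are meaningful):

```python
# A variation sequence: base character followed by a variation selector.
# Here shou4 (U+58FD) followed by VS17 (U+E0100) from the supplementary
# variation-selector block -- a hypothetical pairing, not an approved one.
seq = "\u58FD\U000E0100"

# The sequence is two code points; a renderer that does not know the
# pairing simply ignores the selector and shows the base character.
print(len(seq))  # 2
```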




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Andrew C. West wrote,

 Is this not what the variation selectors are available for?

 And now that we are soon to have 256 of them, perhaps Unicode ought not to be shy
 about using them for characters other than mathematical symbols.


Yes, there seem to be additional variation selectors coming in 
Unicode 4.0 as part of the 1207 (is that number right?) new
characters.

(What happens if someone discovers a 257th variant?  Do they
get a prize?  Or, would they be forever banished from polite
society?)

The variation selectors could be a practical and effective method 
of handling different glyph forms.

But, consider the burden of incorporating a large number of
variation selectors into a text file and contrast that with the
use of Plane Fourteen language tags.  With the P14 tags, it's
only necessary to insert two special characters, one at the
beginning of a text run, the other at the end.
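For concreteness, here is a sketch of that tagging (my own illustration; the code points are from the Standard's Tags block, U+E0000..U+E007F, where tag characters mirror printable ASCII):

```python
# Sketch: wrapping a text run in Plane 14 language tags.
# U+E0001 LANGUAGE TAG introduces the tag; each tag character is the
# ASCII character plus 0xE0000; U+E007F CANCEL TAG closes the scope.
def tag_language(run: str, lang: str) -> str:
    prefix = "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)
    return prefix + run + "\U000E007F"

tagged = tag_language("bonjour", "fr")
# Only two tag sequences are added, however long the run is.
```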

Jim Allan wrote,

 One could start with indications as to whether the text was traditional 
 Chinese, simplified Chinese, Japanese, Korean, etc. :-(
 
 But I don't see that there is anything particularly wrong with citing or 
 using a language in a different typographical tradition.
 ...

Neither do I.  I kind of like seeing variant glyphs in runs of text and
am perfectly happy to accept unusual combinations.

Perhaps those of us who deal closely with multilingual material
and are familiar with variant forms are simply more tolerant
and accepting.

 ... A linguistic 
 study of the distribution of the Eng sound might cite written forms with 
 capital letters from Sami and some from African languages, but need not 
 and probably should not be concerned about matching exactly the 
 typographical norms in those tongues, for _eng_ or for any other letter.

On the one hand, there's a feeling that insistence upon variant glyphs
for a particular language is provincial.  On the other hand, everyone
has the right to be provincial (or not).  IMO, it's the ability to
choose that is paramount.

If anyone wishes to distinguish different appearances of an acute
accent between, say, French and Spanish... or the difference of the
ogonek between Polish and Navajo... or the variant forms of
capital eng, then there should be a mechanism in place enabling 
them to do so.

Variation selectors would be an exact method with the V.S. characters
manually inserted where desired.  P14 tags would also work for this;
entire runs of text could be tagged and those runs could be properly
rendered once the technology catches up to the Standard.

Neither V.S. nor P14 tags should interfere with text processing
or break any existing applications.  There are pros and cons for
either approach.

Best regards,

James Kass
.




VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Peter Constable wrote,

 The plain-text file would be legible without that -- I don't think this is
 an argument in favour of plane 14 tag characters. Preserving
 culturally-preferred appearance would certainly require markup of some
 form, whether lang IDs or for font-face and perhaps font-feature
 formatting.

Any Unicode formatting character can be considered as mark-up,
even P14 tags or VSs.

The advantage of using P14 tags (the equivalent of lang-ID mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain text.

Best regards,

James Kass
.




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Asmus Freytag
At 06:24 PM 2/5/03 +, [EMAIL PROTECTED] wrote:

The advantage of using P14 tags (the equivalent of lang-ID mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain text.


The minute you have scoped tagging, you are no longer using
plain text.

The P14 tags are no different than HTML markup in that regard,
however, unlike HTML markup they can be filtered out by a
process that does not implement them. (In order to filter
out HTML, you need to know the HTML syntax rules. In order
to filter out P14 tags you only need to know their code point
range.)
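That difference is easy to see in code; a minimal sketch (mine, not from the thread) of filtering P14 tags purely by code point range:

```python
# Strip all Plane 14 tag characters (U+E0000..U+E007F) from a string.
# No knowledge of tag syntax is needed, unlike stripping HTML markup.
def strip_p14(text: str) -> str:
    return "".join(ch for ch in text if not (0xE0000 <= ord(ch) <= 0xE007F))

# "Hello" wrapped in a language tag (U+E0001, tag 'e', tag 'n') and a
# cancel tag (U+E007F); the filter recovers the bare text.
sample = "\U000E0001\U000E0065\U000E006E" + "Hello" + "\U000E007F"
print(strip_p14(sample))  # Hello
```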

Variation selectors can also be ignored based on their code
point values, but unlike P14 tags, they don't become invalid
when text is cut and pasted from the middle of a string.

If 'unaware' applications treat them like unknown combining
marks and keep them with the base character as they would
any other combining mark during editing, then variation
selectors have a good chance of surviving in plain text.

P14 tags do not.

Unicode 4.0 will be quite specific: "P14 tags are reserved for
use with particular protocols requiring their use" is what the
text will say, more or less.

A./






Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Peter_Constable

On 02/05/2003 12:24:39 PM jameskass wrote:

The advantage of using P14 tags (the equivalent of lang-ID mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain text.

Sure, but why do we want to place so much demand on plain text when the
vast majority of content we interchange is in some form of marked-up or
rich text? Let's let plain text be that -- plain -- and look to the markup
conventions that we've invested so much in and that are working for us to
provide the kinds of thing that we designed markup for in the first place.
Besides, a plain-text file that begins and ends with p14 tags is a
marked-up file, whether someone calls it plain text or not. We have
little or no infrastructure for handling that form of markup, and a large
and increasing amount of infrastructure for handling the more typical forms
of markup.

I repeat, plain text remains legible without anything indicating which eng
(or whatever) may be preferred by the author, and (since the requirement
for plain text is legibility) therefore this is not really an argument for
using p14 language tags. IMO.




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485






Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Michael Everson
At 16:47 -0500 2003-02-05, Jim Allan wrote:


There are often conflicting orthographic usages within a language. 
Language tagging alone does not indicate whether German text is to 
be rendered in Roman or Fraktur, whether Gaelic text is to be 
rendered in Roman or Uncial, and if Uncial, a modern Uncial or more 
traditional Uncial, whether English text is in Roman or Morse Code 
or Braille.

We have script codes (very nearly a published standard) for that.

By the way, "modern uncial" and "more traditional uncial" isn't 
really sufficient, I think, for describing Gaelic letterforms. See 
http://www.evertype.com/celtscript/fonthist.html for a sketch of a 
more robust taxonomy.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Asmus Freytag wrote,

 Variation selectors can also be ignored based on their code
 point values, but unlike p14 tags, they don't become invalid
 when text is cut and pasted from the middle of a string.

Excellent point.

 Unicode 4.0 will be quite specific: "P14 tags are reserved for
 use with particular protocols requiring their use" is what the
 text will say, more or less.

This seems to be an eminently practical solution to the P14
situation.

If I were using an application which invoked a protocol requiring
P14 tags to read a file which included P14 tags and wanted to cut
and paste text into another application, in a perfect world the
application would be savvy enough to recognize any applicable P14
tags for the selected text and insert the proper Variation Selectors
into the text stream to be pasted.

The application which received the pasted text, if it was an application
which used a protocol requiring P14 tags, would be savvy enough to
strip the variation selectors and enclose the pasted string in
the appropriate P14 tags.  If the pasted material was being inserted
into a run of text in which the same P14 tag applied, then the tags
wouldn't be inserted.  If the pasted material was being inserted
into a run of text in which a different P14 tag applied, then the
application would insert begin and end P14 tags as needed.

In a perfect world, in the best of both worlds, both P14 tags and
variation selectors could be used for this purpose.

Is it likely to happen?  Perhaps not.

But, by not formally deprecating P14 tags and using (more or less)
the language you mentioned, the possibilities remain open-ended.

Best regards,

James Kass
.




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Peter Constable wrote,

 Sure, but why do we want to place so much demand on plain text when the
 vast majority of content we interchange is in some form of marked-up or
 rich text? Let's let plain text be that -- plain -- and look to the markup
 conventions that we've invested so much in and that are working for us to
 provide the kinds of thing that we designed markup for in the first place.
 Besides, a plain-text file that begins and ends with p14 tags is a
 marked-up file, whether someone calls it plain text or not. We have
 little or no infrastructure for handling that form of markup, and a large
 and increasing amount of infrastructure for handling the more typical forms
 of markup.

We place so much demand on plain text because we use plain text.

We continue to advance from the days when “plain text” meant ASCII only
rendered in bitmapped monospaced monochrome.

We don’t rely on mark-up or higher protocols to distinguish between different
European styles of quotation marks.  We no longer need proprietary rich-text
formats and font switching abilities to be able to display Greek and Latin
text from the same file.

 I repeat, plain text remains legible without anything indicating which eng
 (or whatever) may be preferred by the author, and (since the requirement
 for plain text is legibility) therefore this is not really an argument for
 using p14 language tags. IMO.

Is legibility the only requirement of plain text?  Might additional
requirements include appropriate, correct encoding and correct display?

To illustrate a legible plain text run which displays as intended (all
things being equal) yet is not appropriately encoded (this e-mail is being
sent as plain text UTF-8):

푰풇 풚풐풖 풄풂풏 풓풆풂풅 풕풉풊풔 
풎풆풔풔풂품풆...
풚풐풖 풎풂풚 풘풊풔풉 풕풐 풋풐풊풏 푴푨푨푨* 
풂풕
퓫퓵퓪퓱퓫퓵퓪퓱퓫퓵퓪퓱퓭퓸퓽퓬퓸퓶

(*헠햺헍헁 헔헅헉헁햺햻햾헍헌 헔햻헎헌햾헋헌 
헔헇허헇헒헆허헎헌)

Clearly, correct and appropriate encoding (as well as legibility) should be
a requirement of plain text.  Is correct display also a valid requirement
for plain text?

It is for some...

Respectfully,

James Kass
.




Re: Indic Devanagari Query

2003-02-04 Thread Peter_Constable

On 01/30/2003 03:03:24 PM Anto'nio Martins-Tuva'lkin wrote:

Not very different from the Serbian vs. Russian rendition of Cyrillic
lower case i in italics. There are more examples, though (almost?)
none in the Latin script.

There are indeed some examples in Latin script. For instance, there are
three different typeforms form 014A used by different language communities.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







Re: Indic Devanagari Query

2003-02-04 Thread jameskass
.
Peter Constable wrote,

 There are indeed some examples in Latin script. For instance, there are
 three different typeforms for U+014A used by different language communities.

It's also been reported that there's a strong local preference
for a variant of U+0257 in certain African language communities.

(It would be nice to have confirmation about U+0257...)

If these alternate forms were needed to be displayed in a single
multi-lingual plain-text file, wouldn't we need some method of 
tagging the runs of Latin text for their specific languages?

Best regards,

James Kass
.




Re: Indic Devanagari Query

2003-02-04 Thread Jim Allan
Peter Constable wrote,


There are indeed some examples in Latin script. For instance, there are
three different typeforms for U+014A used by different language communities.


It's also been reported that there's a strong local preference
for a variant of U+0257 in certain African language communities.

(It would be nice to have confirmation about U+0257...)

If these alternate forms were needed to be displayed in a single
multi-lingual plain-text file, wouldn't we need some method of
tagging the runs of Latin text for their specific languages?

Best regards,

James Kass 

One could start with indications as to whether the text was traditional 
Chinese, simplified Chinese, Japanese, Korean, etc. :-(

But I don't see that there is anything particularly wrong with citing or 
using a language in a different typographical tradition. A linguistic 
study of the distribution of the Eng sound might cite written forms with 
capital letters from Sami and some from African languages, but need not 
and probably should not be concerned about matching exactly the 
typographical norms in those tongues, for _eng_ or for any other letter.

Jim Allan









Re: Indic Devanagari Query

2003-01-30 Thread Anto'nio Martins-Tuva'lkin
On 2003.01.29, 05:52, Aditya Gokhale [EMAIL PROTECTED] wrote:

 1. In Marathi and Sanskrit language two characters glyphs of 'la' and
 'sha' are represented differently as shown in the image below -

 (First glyph is 'la' and second one is 'sha')

 as compared to Hindi where these character glyphs are represented as
 shown in the image below -

 (First glyph is 'la' and second one is 'sha')

Not very different from the Serbian vs. Russian rendition of Cyrillic
lower case i in italics. There are more examples, though (almost?)
none in the Latin script.

--   .
António MARTINS-Tuválkin|  ()|
[EMAIL PROTECTED]   ||
R. Laureano de Oliveira, 64 r/c esq. |
PT-1885-050 MOSCAVIDE (LRS)  Não me invejo de quem tem   |
+351 917 511 459 carros, parelhas e montes   |
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe   |
http://pagina.de/bandeiras/  a água em todas as fontes   |





Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi Aditya,

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 I had a few queries regarding the representation of the Devanagari script
 in Unicode (code points 0x0900 - 0x097F). Devanagari is a writing script
 used in the Hindi, Marathi and Sanskrit languages. I have the following
 questions -

 In the same script code page, how do I use these two different glyphs to
 represent the same character? Is there any way by which I can do it in an
 OpenType font and FreeType font implementation?

Yes, it is certainly possible with an OpenType font. Please note that FreeType
is not a font format; it is a rendering library used to rasterize different
kinds of fonts, including TrueType and OpenType fonts.

In an OpenType font, you can include glyphs for all the alternate shapes and
then select one of them depending upon the script and language. The
application should specify the script and language tags when sending
character codes to the OpenType rendering library/engine. All substitutions
then take place according to the language and/or script selection. There
should be a default script in the font, and similarly a default language for
that script, which is used as the fallback when the application does not
specify which language to use.

From the list of alternate glyphs, you may want to use the glyph for the
default language as the entry in the cmap table. This default glyph can then
be substituted by an alternate glyph depending upon the language
specification. You have to use the GSUB table and write a language-dependent
lookup for the substitution.
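As a sketch of what such a language-dependent GSUB lookup looks like in OpenType feature-file (AFDKO .fea) syntax -- the glyph names here are hypothetical, while 'deva' and 'MAR' are the registered OpenType Devanagari script and Marathi language-system tags:

```fea
languagesystem DFLT dflt;
languagesystem deva dflt;   # Devanagari, default language (e.g. Hindi)
languagesystem deva MAR;    # Marathi

feature locl {
    script deva;
    language MAR;
    # Substitute the default cmap glyphs with Marathi-specific forms.
    sub dvLA by dvLA.marathi;
    sub dvSHA by dvSHA.marathi;
} locl;
```

When the application requests script 'deva' and language 'MAR', the shaping engine applies these substitutions; otherwise the default glyphs are used.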

 
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

Unicode is not divided into code pages. Unlike some older encodings, there
is only one code space for the entire Unicode Standard. However, for better
readability and quick reference the chart has been divided into different
blocks, which you might have interpreted as code pages.

 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.


Unicode gives code points to scripts only, not to languages. In fact it is
not desirable to give code points to individual languages falling under the
same script. Also, Unicode encodes characters, which have abstract meaning
and properties; Unicode does not encode glyphs. The shapes of the glyphs
shown in the Unicode charts are given just for convenience and do not
actually prescribe the shapes to be used in a font. The shape of the glyph
for a Unicode character may vary from one font to another. Since it is
already possible to select the proper glyph(s) depending upon language
selection, this scheme is suitable for all Indian languages.


 
 
 3. Character codes for jna, shra, ksh - 
 
 In Sanskrit and Marathi jna, shra and ksh are considered as separate
 characters and not ligatures. How do we take care of this ? Can I get
 over all views on the matter from the group ? In my opinion they should
 be given different code points in the specific language code page.
 Please find below the character glyphs - 
 
 jna
 shra
 ksh

All of the above can be composed from the following consonant clusters:
  jna - ja halant nya
  shra - sha halant ra
  ksh - ka halant ssha
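In code-point terms this composition can be sketched as follows (using the standard Devanagari block values):

```python
# Composing the three clusters from base consonants plus the virama (halant).
VIRAMA = "\u094D"                      # DEVANAGARI SIGN VIRAMA
jna  = "\u091C" + VIRAMA + "\u091E"    # JA + virama + NYA
shra = "\u0936" + VIRAMA + "\u0930"    # SHA + virama + RA
ksha = "\u0915" + VIRAMA + "\u0937"    # KA + virama + SSA
print(jna, shra, ksha)                 # ज्ञ श्र क्ष
```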

The point that the above sequences are considered characters in some of the
Indian languages has merit. If there is demand from native speakers, then a
proposal can be submitted to Unicode; there is a predefined procedure for
proposal submission. Once this is discussed with the people concerned and
agreed upon, these ligatures could be added to the Devanagari script itself,
because the Devanagari script represents all three languages you mentioned,
namely Sanskrit, Marathi, and Hindi. Meanwhile you can write rules for
composing them from the consonant clusters.

Regards,
Keyur


__
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com




Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi,

Forgot to reply to the implementation query. The reply is inline.

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?
 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.

Instead of changing (or recommending a change in) an encoding standard, your
problem can best be solved in your application. You can use tags in your
text to specify the language. Unicode also facilitates tagging text, but the
use of its tag characters is highly discouraged. So you can use a markup
convention similar to XML or HTML to specify language boundaries. Then parse
your text, identify the language boundaries, and do further processing
depending upon the language.
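A minimal sketch of that approach, assuming a hypothetical <lang> tagging convention (the tag name and attribute are my invention for illustration):

```python
import re

# Hypothetical XML-like convention for marking language boundaries.
RUN = re.compile(r'<lang code="(\w+)">(.*?)</lang>', re.DOTALL)

def language_runs(text):
    """Return (language, text) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in RUN.finditer(text)]

doc = '<lang code="hi">नमस्ते</lang><lang code="sa">नमः</lang>'
for lang, run in language_runs(doc):
    print(lang, run)
```

Each run can then be routed to the translation engine for that language.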

If you don't want to use tags in your text, then you can predict the
language by using some heuristic. Such a heuristic can be based on language
properties which differ among the three languages. In this case your
processing is divided into two phases: the first applies heuristic rules to
identify language boundaries in the plain text, and the second actually
processes the text for translation. But beware that the result will not be
accurate all the time with such heuristic processing. Hence the use of tags
is recommended.

Regards,
Keyur






Re: Indic Devanagari Query

2003-01-29 Thread Aditya Gokhale

Hello,
Thanks for the reply. I will check the points you mention as far as the
font issues are concerned. We all know how jna, shra and ksh are formed in
Unicode and ISCII, but the point I wanted to make was that if we have to
sort / search / process data in the Devanagari script, then we have to keep
track of at least three characters and not one. This becomes tedious, though
not impossible; if a single code point were present it would be very easy to
process.
With regard to predicting the language by using some heuristic, in my
opinion it is a very risky solution, at least when I don't have much
information at stage one of my application. I am running an OCR engine on a
Devanagari page and then, based on the formatting, tagging the language. So
I think tagging, as I am doing right now, is the better solution. I also
agree with the views expressed by Asmus Freytag that if we go on including
all 6,000 languages, it will be practically impossible to cross-correlate
these 'code pages'.

-Aditya






RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Aditya Gokhale wrote:
 Hello Everybody,
 I had a few queries regarding the representation of the Devanagari
 script in Unicode

All your questions are FAQs, so I'll just reference the entries which
answer them.

 (Code page - 0x0900 - 0x097F). Devanagari is a writing
 script, used in the Hindi, Marathi and Sanskrit languages. I
 have the following questions -

Unicode has no code pages:
http://www.unicode.org/faq/basic_q.html#18

 1. In the Marathi and Sanskrit languages the character glyphs of
 'la' and 'sha' are represented differently, as shown in the
 image below -
  (First glyph is 'la' and second one is 'sha')
 as compared to Hindi where these character glyphs are 
 represented as shown in the image below - 
 (First glyph is 'la' and second one is 'sha')

Unicode encodes (abstract) characters, not glyphs:
http://www.unicode.org/faq/han_cjk.html#3

(This FAQ is in the Chinese/Japanese/Korean section because it is more often
raised for Chinese ideograms.)

 In the same script code page, how do I use these two
 different glyphs to represent the same character? Is there
 any way by which I can do it in an OpenType font and
 FreeType font implementation?

Unicode's requirements for fonts:
http://www.unicode.org/faq/font_keyboard.html#1

A few links to OpenType stuff:
http://www.unicode.org/faq/font_keyboard.html#4

 2. Implementation Query - 
 In an implementation where I need to send / process 
 Hindi, Marathi and Sanskrit data, how do I differentiate 
 between languages (Hindi, Marathi and Sanskrit). Say for 
 example, I am writing a translation engine, and I want to 
 translate a document having Hindi, Marathi and Sanskrit Text 
 in it, how do I know from the code points between 0x0900 and 
 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

What you need here is some sort of language tagging:
http://www.unicode.org/faq/languagetagging.html

 I would suggest that we should give different code pages
 for Marathi, Hindi and Sanskrit. Maybe the current code page of
 Devanagari can be treated as Hindi and two new code pages for
 Marathi and Sanskrit added. This could solve these issues.
 If there is any better way of solving this, can anyone suggest it?

Characters are encoded per script, not per language:
http://www.unicode.org/faq/basic_q.html#17

 3. Character codes for jna, shra, ksh - 
 
 In Sanskrit and Marathi, jna, shra and ksh are considered
 separate characters and not ligatures. How do we take care of
 this? Can I get overall views on the matter from the group?
 In my opinion they should be given different code points in
 the specific language code page.
 Please find below the character glyphs - 

Unicode encodes Indic analytically:
http://www.unicode.org/faq/indic.html#17

 thanks,

For more details about Devanagari in Unicode, see Chapter 9 of the Standard:
http://www.unicode.org/uni2book/ch09.pdf

_ Marco




Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff

--- Asmus Freytag [EMAIL PROTECTED] wrote:

 
 All of the above can be composed through following consonant clusters:
jna - ja halant nya
shra - sha halant ra
ksh - ka halant ssha
 
 The point that the above sequences are considered as characters in some
 of the Indian languages has merit. If there is demand from native speakers
 then a proposal can be submitted to Unicode. There is a predefined
 procedure for proposal submission. Once this is discussed with the concerned
 people and agreed upon, these ligatures can be added to the Devanagari
 script itself, because the Devanagari script represents all three languages
 you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
 rules for composing them from the consonant clusters.
 
 I wouldn't go so far. The fact that clusters belong together is something
 that can be handled by the software. Collation and other data processing
 needs to deal with such issues already for many other languages. See
 http://www.unicode.org/reports/tr10 on the collation algorithm.

I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as a separate code point. India is a big country with millions
of people geographically divided and speaking a variety of languages.
Sentiments are attached to cultures, which may vary from one geographical
area to another. So when one of the many languages falling under the same
script dominates the entire encoding for the script, other groups of
people may feel that their language has not been represented properly in
the encoding. While Unicode encodes scripts only, the aim was to provide
sufficient representation to as many languages as possible.

In Unicode many characters have been given code points regardless of the
fact that the same character could have been rendered through some composition
mechanism. This includes Indic scripts as well as other scripts. For
example, in the Devanagari script some code points are allocated to characters
(consonant + nukta) even though the same characters could be produced with a
combination of the consonant and the nukta. Similarly, in the Latin-1 range
[U+0080-U+00FF] there are a few characters which can be produced otherwise.
That is why text should be normalized to either the pre-composed or the
de-composed character sequence before further processing in
operations like searching and sorting.
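A concrete illustration of this point with Python's stdlib `unicodedata` (note that the precomposed Devanagari consonant-plus-nukta letters such as U+0958 QA are composition exclusions, so both normal forms yield the decomposed sequence):

```python
import unicodedata

qa = "\u0958"              # DEVANAGARI LETTER QA (precomposed consonant+nukta)
ka_nukta = "\u0915\u093C"  # KA followed by combining NUKTA

assert qa != ka_nukta  # a raw code-point comparison fails to match them...

# ...but after normalization both spellings compare equal
# (U+0958 is a composition exclusion, so even NFC decomposes it):
assert unicodedata.normalize("NFD", qa) == ka_nukta
assert unicodedata.normalize("NFC", qa) == ka_nukta
```

Searching and sorting code that normalizes both the pattern and the text before comparing sidesteps the dual-spelling problem entirely.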

Also, processing of text often depends on the smallest addressable
unit of the language. Again, as discussed in earlier e-mails, this may vary
from one language to another within the same script. Consider a case where a
language processor/application wants to count the number of characters in
some text in order to find the number of keystrokes required to input it.
Further assume that the API functions used for this purpose are based on either
WChar (wide characters) or UTF-8. In this case it is very necessary
that you assign the character, say Kssha, to the class 'consonant'. Since
assignment to this class applies to a single code point (the
smallest addressable unit) and not to a sequence of codes, it is very
necessary to have a single code point for the character Kssha.
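For concreteness, Kssha as currently encoded is the three-code-point sequence KA + VIRAMA + SSA, so any per-code-point API does indeed see three units rather than one (a quick Python check):

```python
import unicodedata

kssa = "\u0915\u094D\u0937"  # KA + VIRAMA + SSA, rendered as one akshara

# A per-code-point API reports three units with three general categories,
# so classifying the whole cluster requires sequence-aware logic:
assert len(kssa) == 3
assert [unicodedata.category(c) for c in kssa] == ["Lo", "Mn", "Lo"]
```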

This is my understanding. Please enlighten me if I am wrong.

Regards,
Keyur






Re: Indic Devanagari Query

2003-01-29 Thread John Cowan
Keyur Shroff scripsit:

 Sentiments are attached to cultures, which may vary from one geographical
 area to another. So when one of the many languages falling under the same
 script dominates the entire encoding for the script, other groups of
 people may feel that their language has not been represented properly in
 the encoding.

Indeed, they may have such beliefs, but those beliefs are based on two
incorrect notions: that what the charts show is normative, and that the
codepoint is the proper unit of processing.

 In Unicode many characters have been given code points regardless of the
 fact that the same character could have been rendered through some composition
 mechanism.

In every case this was done for backward compatibility with existing
encodings.  No new codepoints of this type will be added in future.

 That is why text should be normalized to either the pre-composed or the
 de-composed character sequence before further processing in
 operations like searching and sorting.

The collation algorithm makes allowance for these points.
It will be quite typical to tailor the algorithm to take language-specific
rules into account.
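As a toy illustration of such language-specific tailoring (a deliberately simplified sketch, not the real UCA), a sort key can treat a multi-character cluster as one collation element:

```python
def hu_sort_key(word):
    """Toy tailored key: treat the Hungarian digraph 'gy' as a single
    collation element that sorts after every plain 'g' sequence."""
    key, i = [], 0
    while i < len(word):
        if word[i:i + 2] == "gy":
            key.append("g\uffff")  # the contraction maps to one element
            i += 2
        else:
            key.append(word[i])
            i += 1
    return key

# 'gy' words sort after all other 'g' words, as Hungarian requires:
words = sorted(["gy\u00e1r", "gaz", "gond"], key=hu_sort_key)
```

The same contraction technique applies to Devanagari clusters like ksh: the collation tailoring, not the encoding, decides what counts as one unit.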

 Also, many times processing of text depends on the smallest addressable
 unit of that language. Again as discussed in earlier e-mails this may vary
 from one language to another in the same script. Consider a case when a
 language processor/application wants to count the number of characters in
 some text in order to find number of keystrokes required to input the text.

This will not work without knowledge of the keyboard layout in any case.
To enter Latin-1 characters on the Windows U.S. keyboard requires 5 keystrokes,
but they are represented by one or two Unicode characters.

-- 
Henry S. Thompson said, / Syntactic, structural,   John Cowan
Value constraints we / Express on the fly. [EMAIL PROTECTED]
Simon St. Laurent: Your / Incomprehensible http://www.reutershealth.com
Abracadabralike / schemas must die!http://www.ccil.org/~cowan




Re: Indic Devanagari Query

2003-01-29 Thread Michael Everson
At 02:13 -0800 2003-01-29, Keyur Shroff wrote:

I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as a separate code point.


Yes, it does.


India is a big country with millions of people geographically
divided and speaking a variety of languages. Sentiments are attached
to cultures, which may vary from one geographical area to another.
So when one of the many languages falling under the same script
dominates the entire encoding for the script, other groups of
people may feel that their language has not been represented
properly in the encoding.

A lot of these feelings are simply WRONG, and that has to be faced. 
The syllable KSSA may be treated as a single letter, but this does 
not change the fact that it is a ligature of KA and SSA and that it 
can be represented in Unicode by a string of three characters.

In Unicode many characters have been given code points regardless of the
fact that the same character could have been rendered through some composition
mechanism. This includes Indic scripts as well as other scripts. For
example, in the Devanagari script some code points are allocated to characters
(consonant + nukta) even though the same characters could be produced with a
combination of the consonant and the nukta.


There are historical and compatibility reasons why most of this
stuff, as well as the similar stuff in the Latin range, was encoded.
At one point some years ago the line was drawn, normalization was
enacted, and that was that.

Also, processing of text often depends on the smallest addressable
unit of the language. Again, as discussed in earlier e-mails, this may vary
from one language to another within the same script. Consider a case where a
language processor/application wants to count the number of characters in
some text in order to find the number of keystrokes required to input it.


I can't think of any reason why this would be useful. And what if you 
were not typing, but speaking to your computer? Then there would be 
no keystrokes at all!

Further assume that the API functions used for this purpose are based on either
WChar (wide characters) or UTF-8. In this case it is very necessary
that you assign the character, say Kssha, to the class 'consonant'. Since
assignment to this class applies to a single code point (the
smallest addressable unit) and not to a sequence of codes, it is very
necessary to have a single code point for the character Kssha.


We are not going to encode KSSA as a single character. It is a 
ligature of KA and SSA, and can already be represented in Unicode. 
You need to handle this consonant issue with some other protocol.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: Indic Devanagari Query

2003-01-29 Thread Kent Karlsson


  I wouldn't go so far. The fact that clusters belong together is something
  that can be handled by the software. Collation and other data processing 
  needs to deal with such issues already for many other languages. See 
  http://www.unicode.org/reports/tr10 on the collation algorithm.
 
 I beg to differ with you on this point. Merely having some provision for
 composing a character doesn't mean that the character is not a candidate
 for inclusion as a separate code point.

At this point, having some provision for composing a particular letter
is very much preventing it from being encoded at a separate code position.
This is due mostly to the fixation of normal forms (except for very rare
error corrections).

 In Unicode many characters have been given code points regardless of the
 fact that the same character could have been rendered through some composition
 mechanism. This includes Indic scripts as well as other scripts. For

For legacy reasons, yes.  These reasons no longer apply for
not-yet-encoded compositions.

 Also, processing of text often depends on the smallest addressable
 unit of the language. Again, as discussed in earlier e-mails, this may vary
 from one language to another within the same script. Consider a case where a
 language processor/application wants to count the number of characters in
 some text in order to find the number of keystrokes required to input it.

You cannot find the number of keystrokes that way.  Not even 
if you know which keyboard (and disregarding backspace).  E.g.
ä can be produced by one or two (or more, if you count hex input)
keystrokes on (most) Swedish keyboards.

 Further assume that the API functions used for this purpose are based on either
 WChar (wide characters) or UTF-8. In this case it is very necessary
 that you assign the character, say Kssha, to the class 'consonant'. Since
 assignment to this class applies to a single code point (the
 smallest addressable unit) and not to a sequence of codes, it is very
 necessary to have a single code point for the character Kssha.

No, that is not the case.  E.g. Hungarian (Magyar) has gy, ny, ly
(and more) as letters (look in a Hungarian dictionary, and its headings).
Similarly, Albanian has dh, rr, th (and more) as letters. None of
these combinations are candidates for single code point allocation.  For 
compatibility reasons the Dutch ij got a single code point, but it
is better to just use i followed by j (though that has some
difficulties; e.g. the titlecase of ijs is IJs, not Ijs).
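The ij difficulty is easy to demonstrate: Python's language-blind str.title() gives the un-Dutch result, while the compatibility ligature U+0133 cases as a single unit:

```python
# Language-blind titlecasing treats i and j as two separate letters:
assert "ijs".title() == "Ijs"        # Dutch orthography wants "IJs"

# The compatibility ligature U+0133 titlecases to U+0132 as one unit:
assert "\u0133s".title() == "\u0132s"
```

Getting "IJs" from the two-character spelling requires Dutch-aware tailoring, which is exactly the kind of language-specific rule that belongs in software rather than in the encoding.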

/Kent K





Re: Indic Devanagari Query

2003-01-29 Thread Christopher John Fynn
 Michael Everson wrote:

 At 02:13 -0800 2003-01-29, Keyur Shroff wrote:
 I beg to differ with you on this point. Merely having some provision for
 composing a character doesn't mean that the character is not a candidate
 for inclusion as a separate code point.
 
 Yes, it does.
 
 India is a big country with millions of people geographically
 divided and speaking a variety of languages. Sentiments are attached
 to cultures, which may vary from one geographical area to another.
 So when one of the many languages falling under the same script
 dominates the entire encoding for the script, other groups of
 people may feel that their language has not been represented
 properly in the encoding.

 A lot of these feelings are simply WRONG, and that has to be faced. 
 The syllable KSSA may be treated as a single letter, but this does 
 not change the fact that it is a ligature of KA and SSA and that it 
 can be represented in Unicode by a string of three characters.

Of course an anomaly is that KSSA *is* encoded in the Tibetan
block at U+0F69. In normal Tibetan or Dzongkha words KSSA
U+0F69 (or the combination U+0F40 U+0FB5) does not occur -
AFAIK it is *only* used when writing Sanskrit words containing
KSSA in Tibetan script.

I had thought that the argument for including KSSA as a separate
character in the Tibetan block (rather than only having U+0F40 and
U+0FB5) was originally compatibility / cross-mapping with
Devanagari and other Indic scripts.

- Chris






Re: Indic Devanagari Query

2003-01-29 Thread Rick McGowan
Aditya Gokhale wrote:

 1. In the Marathi and Sanskrit languages the character glyphs of
 'la' and 'sha' are represented differently as shown in the
 image below -

Actually, for everyone's information: these allographs for Marathi were  
recently brought to our attention, and Unicode 4.0 will have a mention of  
the allographs, including pictures of the variant glyphs.

Rick





RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Christopher John Fynn wrote:
 I had thought that the argument for including KSSA as a separate
 character in the Tibetan block (rather than only having U+0F40 and 
 U+0FB5) was originally for compatibility / cross mapping with 
 Devanagari and other Indic scripts.  

Which is not a valid reason either, considering that U+0F69 and the
combination U+0F40 U+0FB5 are *canonically* equivalent. This means that
normalizing applications are not allowed to treat U+0F69 differently from
U+0F40 U+0FB5, including displaying them differently or mapping them
differently to something else.
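This equivalence can be checked directly with Python's `unicodedata` (U+0F69 is also a composition exclusion, so even NFC yields the two-character sequence):

```python
import unicodedata

kssa = "\u0F69"          # TIBETAN LETTER KSSA
ka_ssa = "\u0F40\u0FB5"  # TIBETAN LETTER KA + SUBJOINED LETTER SSA

# Canonically equivalent: both normal forms give the same sequence
assert unicodedata.normalize("NFD", kssa) == ka_ssa
assert unicodedata.normalize("NFC", kssa) == ka_ssa
```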

_ Marco




Indic Devanagari Query

2003-01-28 Thread Aditya Gokhale



Hello everybody, I had a few queries
regarding the representation of the Devanagari script in Unicode (code page - 0x0900
- 0x097F). Devanagari is a writing script, used in the Hindi, Marathi and
Sanskrit languages. I have the following questions -

1. In the Marathi and Sanskrit languages the character glyphs
of 'la' and 'sha' are represented differently, as shown in the image below -

(First glyph is
'la' and second one is 'sha')
as compared to Hindi, where these character glyphs are
represented as shown in the image below -

(First glyph is 'la' and
second one is 'sha')

In the same script code page, how do I use these two different
glyphs to represent the same character? Is there any way by which I can do it
in an OpenType font and FreeType font implementation?

2. Implementation Query -
 In an implementation where I need to send / process
Hindi, Marathi and Sanskrit data, how do I differentiate between
languages (Hindi, Marathi and Sanskrit)? Say for example, I am writing a
translation engine, and I want to translate a document having Hindi, Marathi and
Sanskrit text in it; how do I know, from the code points between 0x0900 and
0x097F, that the data under perusal is Hindi / Marathi / Sanskrit?
 I would suggest that we should give
different code pages for Marathi, Hindi and Sanskrit. Maybe the current code page
of Devanagari can be treated as Hindi and two new code pages for Marathi and
Sanskrit added. This could solve these issues. If there is any better way of
solving this, can anyone suggest it?


3. Character codes for jna, shra, ksh -

In Sanskrit and Marathi, jna, shra and ksh are considered
separate characters and not ligatures. How do we take care of this? Can I get
the group's overall views on the matter? In my opinion they should be given
different code points in the specific language code page.
Please find below the character glyphs - 

jna

shra

ksh


thanks,
Aditya Gokhale.
GIST Research and Development Lab,
C-DAC Pune,
Maharashtra, India.

http://www.cdacindia.com/html/gist/gistidx.asp