Re: Counting Devanagari Aksharas

2017-04-26 Thread Eli Zaretskii via Unicode
> Date: Wed, 26 Apr 2017 07:45:07 +0100
> From: Richard Wordingham via Unicode 
> 
> On Wed, 26 Apr 2017 08:48:13 +0300
> Eli Zaretskii via Unicode  wrote:
> 
> > > Date: Sun, 23 Apr 2017 22:59:49 +0100
> > > From: Richard Wordingham 
> > > Cc: Eli Zaretskii 
> > > 
> > > If I search for CGJ, highlighting it is frequently supremely
> > > useless. I want to know where it is; highlighting is merely a tool
> > > to find it on the screen.  
> > 
> > So I guess this means highlighting is useful after all ;-)
> 
> ᩺Not if the area highlit is zero pixels wide.

If you elide too much of the context, the discussion could lose all of
its meaning.  Let me restore some of the relevant context:

> > > > > On 2017-04-22, Eli Zaretskii via Unicode  wrote:
> > > > 
> > > > > > I could imagine Emacs decomposing characters temporarily when only
> > > > > > part of a cluster matches the search string.  Assuming this would
> > > > > > make sense to users of some complex scripts, that is.  You are
> > > > > > welcome to suggest such a feature by using report-emacs-bug.  
> > > > 
> > > > The cursor moves to the cluster boundary, so there is much less of a
> > > > problem with Emacs.
> > > 
> > > But you wanted to highlight only part of the cluster, AFAIU.
> > 
> > If I search for CGJ, highlighting it is frequently supremely useless.
> > I want to know where it is; highlighting is merely a tool to find it on
> > the screen.
> 
> So I guess this means highlighting is useful after all ;-)

IOW, the context was a suggestion to temporarily disable character
composition, in which case CGJ _will_ be displayed as a non-zero-width
glyph, at least in the default Emacs display configuration, and CGJ
_will_ be visible with its highlight.


Re: Counting Devanagari Aksharas

2017-04-26 Thread Richard Wordingham via Unicode
On Wed, 26 Apr 2017 08:48:13 +0300
Eli Zaretskii via Unicode  wrote:

> > Date: Sun, 23 Apr 2017 22:59:49 +0100
> > From: Richard Wordingham 
> > Cc: Eli Zaretskii 
> > 
> > If I search for CGJ, highlighting it is frequently supremely
> > useless. I want to know where it is; highlighting is merely a tool
> > to find it on the screen.  
> 
> So I guess this means highlighting is useful after all ;-)

᩺Not if the area highlit is zero pixels wide.

Richard.



Re: Counting Devanagari Aksharas

2017-04-25 Thread Eli Zaretskii via Unicode
> Date: Sun, 23 Apr 2017 22:59:49 +0100
> From: Richard Wordingham 
> Cc: Eli Zaretskii 
> 
> If I search for CGJ, highlighting it is frequently supremely useless.
> I want to know where it is; highlighting is merely a tool to find it on
> the screen.

So I guess this means highlighting is useful after all ;-)


Re: Go romanize! Re: Counting Devanagari Aksharas

2017-04-25 Thread Naena Guru via Unicode

Quote from below:

The word indeed means 'danger' (Pali/Sanskrit _antarāya_).  The
pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai
Tham script no longer have /r/.  The older sequence /tr/ normally
became /tʰ/ (except in Lao), but the spelling has not been updated - at
least, not amongst the more literate.  The script has a special symbol
for the short vowel /o/, which it shares with the Lao script.  This
symbol is used in writing that word.  Two ways I have seen it spelt,
each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second
syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy.  I have also seen a
form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy.
However, I have seen nothing that shows that I won't encounter
ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even
ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy. How does your scheme distinguish such alternatives?

Response:
Perhaps this word is derived from Sanskrit 'anþaraða'
(Search: antarada at 
http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche)

Sinhala: anþaraaðaayakayi, anþaraava, anþaraavayi, anþraava, anþraavayi
Use this font to read the above Sinhala words:
http://smartfonts.net/ttf/aruna.ttf


-=- svasþi siððham! -=-


On 4/25/2017 2:07 AM, Richard Wordingham via Unicode wrote:

> On Mon, 24 Apr 2017 20:53:12 +0530
> Naena Guru via Unicode <unicode@unicode.org> wrote:
>
> > Quote by Richard:
> > Unless this implies a spelling reform for many languages, I'd like to
> > see how this works for the Tai Tham script.  I'm not happy with the
> > Romanisation I use to work round hostile rendering engines.  (My
> > scheme is only documented in variable hack_ss02 in the last script
> > blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For
> > example, there are several different ways of writing what one might
> > naively record as "ontarAy".
> >
> > MY RESPONSE:
> > Richard, I stuck to the two specifications (Unicode and Font) and
> > Sanskrit grammar. The akSara has two aspects, its sound (zabða,
> > phoneme) and its shape. (letter, ruupa). Reduce the writing system to
> > its consonants, vowels etc. (zabða) and assign SBCS letters/codes to
> > them (ruupa). SBCS provides the best technical facilities for any
> > language. (This is why now more than 130 languages romanize despite
> > Unicode). Use English letters for similar sounds in the native
> > speech. Now, treat all combinations as ligatures. For example, 'po'
> > sound in Indic has the p consonant with a sign ahead plus a sign
> > after.
>
> In many Indic scripts, yes.  In Devanagari, the vowel sign is normally
> a single element classified as following the consonant.  In Thai, the
> vowel sign precedes the consonant.  Tai Tham uses both a two-part sign
> and a preceding sign.  The preceding sign is for Tai words and the
> two-part sign for Pali words, but loanwords from Pali into the Tai
> languages may retain the two-part sign.
>
> > For the font, there is no difference between the way it makes
> > the combination 'ä', which has a sign above and the Indic having two
> > on either side.
>
> For OpenType, there is.  The first can be made by providing a
> simple table of where the diaeresis goes relative to the base
> characters, in this case the 'a'.  The second is painfully
> complicated, for the 'p' may have other marks attached to it, so doing
> it by relative positioning is painfully complicated and error-prone.
> This job is given to the rendering engine, which may introduce its own
> problems.
>
> AAT and Graphite offer the font maker the ability to move the 'sign
> ahead' from after the 'p' to before it.
>
> > Recall that long ago, Unicode stopped defining fixed
> > ligatures and asked the font makers to define them in the PUA.
>
> While the first is true enough, I believe the second is false.  Not
> every glyph has to be mapped to by a single character.  I don't do that
> for contextual forms or ligatures in my font.
>
> > Spelling and speech:
> > There is indeed a confusion about writing and reading in Hindi, as I
> > have observed. Like in English and Tamil, Hindi tends to end words
> > with a consonant. So, there is this habit among the Hindi speakers to
> > drop the ending vowel, mostly 'a' from words that actually end with
> > it. For example, the famous name Jayantha (miserable mine too, haha!
> > = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It
> > is a Sanskrit word. Sanskrit and languages like Sinhala have vowel
> > ending and are traditionally spoken as such.
>
> This loss is also to be found in Further India.  Thai, Lao and Khmer
> now require that such a word-final vowel be written explicitly if it is
> still pronounced.
>
> > Looking at the word you gave, ontarAy, it looks to me like an
> > Anglicized form. If I am to make a guess, its ending is like in
> > ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I
> > am right, this is a good example of decline of a writing system owing
> > to bad, uncaring application of technology. We are in the Digital
> > Age, and we need not compromise any more. In fact, we can fix errors
> > and decadence introduced by past technologies.

Re: Go romanize! Re: Counting Devanagari Aksharas

2017-04-24 Thread Richard Wordingham via Unicode
On Mon, 24 Apr 2017 20:53:12 +0530
Naena Guru via Unicode <unicode@unicode.org> wrote:

> Quote by Richard:
> Unless this implies a spelling reform for many languages, I'd like to
> see how this works for the Tai Tham script.  I'm not happy with the
> Romanisation I use to work round hostile rendering engines.  (My
> scheme is only documented in variable hack_ss02 in the last script
> blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For
> example, there are several different ways of writing what one might
> naively record as "ontarAy".
> 
> MY RESPONSE:
> Richard, I stuck to the two specifications (Unicode and Font) and
> Sanskrit grammar. The akSara has two aspects, its sound (zabða,
> phoneme) and its shape. (letter, ruupa). Reduce the writing system to
> its consonants, vowels etc. (zabða) and assign SBCS letters/codes to
> them (ruupa). SBCS provides the best technical facilities for any
> language. (This is why now more than 130 languages romanize despite
> Unicode). Use English letters for similar sounds in the native
> speech. Now, treat all combinations as ligatures. For example, 'po'
> sound in Indic has the p consonant with a sign ahead plus a sign
> after.

In many Indic scripts, yes.  In Devanagari, the vowel sign is normally
a single element classified as following the consonant.  In Thai, the
vowel sign precedes the consonant.  Tai Tham uses both a two-part sign
and a preceding sign.  The preceding sign is for Tai words and the
two-part sign for Pali words, but loanwords from Pali into the Tai
languages may retain the two-part sign.

> For the font, there is no difference between the way it makes
> the combination 'ä', which has a sign above and the Indic having two
> on either side.

For OpenType, there is.  The first can be made by providing a
simple table of where the diaeresis goes relative to the base
characters, in this case the 'a'.  The second is painfully
complicated, for the 'p' may have other marks attached to it, so doing
it by relative positioning is painfully complicated and error-prone.
This job is given to the rendering engine, which may introduce its own
problems.

AAT and Graphite offer the font maker the ability to move the 'sign
ahead' from after the 'p' to before it.
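
[Editorial illustration] The reordering mentioned above can be sketched
in a few lines. This is only a toy model in Python, not how OpenType
engines, AAT or Graphite actually implement it: the function name
to_visual_order and the choice of DEVANAGARI VOWEL SIGN I as the only
"sign ahead" are assumptions made for the example. It moves the sign
from its stored position after the consonant cluster to the front of
that cluster, stepping back over virama-joined consonants.

```python
VIRAMA = "\u094D"        # DEVANAGARI SIGN VIRAMA
PREBASE_SIGNS = {"\u093F"}  # DEVANAGARI VOWEL SIGN I, drawn left of its cluster


def to_visual_order(text):
    """Toy reordering: move each pre-base vowel sign in front of the
    consonant cluster it follows in stored (logical) order."""
    out = []
    for ch in text:
        if ch not in PREBASE_SIGNS:
            out.append(ch)
            continue
        pos = len(out)
        while pos > 0:
            pos -= 1  # step back over one consonant
            if pos >= 2 and out[pos - 1] == VIRAMA:
                pos -= 1  # a virama joins it to an earlier consonant
            else:
                break
        out.insert(pos, ch)
    return "".join(out)


# SHA, VIRAMA, RA, VOWEL SIGN I: the sign moves ahead of the whole SH.RA cluster.
print([f"U+{ord(c):04X}" for c in to_visual_order("\u0936\u094D\u0930\u093F")])
```

A real engine also has to cope with nuktas, ZWJ/ZWNJ and other marks
attached to the consonants, which is where the complication Richard
describes comes in.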

> Recall that long ago, Unicode stopped defining fixed
> ligatures and asked the font makers to define them in the PUA.

While the first is true enough, I believe the second is false.  Not
every glyph has to be mapped to by a single character.  I don't do that
for contextual forms or ligatures in my font.

> Spelling and speech:
> There is indeed a confusion about writing and reading in Hindi, as I
> have observed. Like in English and Tamil, Hindi tends to end words
> with a consonant. So, there is this habit among the Hindi speakers to
> drop the ending vowel, mostly 'a' from words that actually end with
> it. For example, the famous name Jayantha (miserable mine too, haha!
> = jayanþa as Romanized), is pronounced Jayanth by Hindi speakers. It
> is a Sanskrit word. Sanskrit and languages like Sinhala have vowel
> ending and are traditionally spoken as such.

This loss is also to be found in Further India.  Thai, Lao and Khmer
now require that such a word-final vowel be written explicitly if it is
still pronounced.

> Looking at the word you gave, ontarAy, it looks to me like an
> Anglicized form. If I am to make a guess, its ending is like in
> ontarAyi. Is it said something like, own-the-raa-yi? (danger?) If I
> am right, this is a good example of decline of a writing system owing
> to bad, uncaring application of technology. We are in the Digital
> Age, and we need not compromise any more. In fact, we can fix errors
> and decadence introduced by past technologies.

The word indeed means 'danger' (Pali/Sanskrit _antarāya_).  The
pronunciation is /ʔontʰalaːi/; the Tai languages that use(d) the Tai
Tham script no longer have /r/.  The older sequence /tr/ normally
became /tʰ/ (except in Lao), but the spelling has not been updated - at
least, not amongst the more literate.  The script has a special symbol
for the short vowel /o/, which it shares with the Lao script.  This
symbol is used in writing that word.  Two ways I have seen it spelt,
each with two orthographic syllables, are ᩋᩫ᩠ᨶᨲᩕᩣ᩠ᨿ on-trAy (the second
syllable has two stacks) and ᩋᩫᨶ᩠ᨲᩕᩣ᩠ᨿ o-ntrAy.  I have also seen a
form closer to Pali, namely _antarAy_, written ᩋᨶ᩠ᨲᩁᩂ᩠ᨿ a-nta-rAy.
However, I have seen nothing that shows that I won't encounter
ᩋᩢᨶ᩠ᨲᩁᩣ᩠ᨿ a-nta-rAy with the first vowel written explicitly, or even
ᩋᩢ᩠ᨶᨲᩁᩣ᩠ᨿ an-ta-rAy. How does your scheme distinguish such alternatives?

Richard.



Go romanize! Re: Counting Devanagari Aksharas

2017-04-24 Thread Naena Guru via Unicode

Quote by Richard:
Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
there are several different ways of writing what one might naively
record as "ontarAy".

MY RESPONSE:
Richard, I stuck to the two specifications (Unicode and Font) and Sanskrit 
grammar. The akSara has two aspects, its sound (zabða, phoneme) and its shape. 
(letter, ruupa). Reduce the writing system to its consonants, vowels etc. 
(zabða) and assign SBCS letters/codes to them (ruupa). SBCS provides the best 
technical facilities for any language. (This is why now more than 130 languages 
romanize despite Unicode). Use English letters for similar sounds in the native 
speech. Now, treat all combinations as ligatures. For example, 'po' sound in 
Indic has the p consonant with a sign ahead plus a sign after. For the font, 
there is no difference between the way it makes the combination 'ä', which has 
a sign above and the Indic having two on either side. Recall that long ago, 
Unicode stopped defining fixed ligatures and asked the font makers to define 
them in the PUA.

Spelling and speech:
There is indeed a confusion about writing and reading in Hindi, as I have 
observed. Like in English and Tamil, Hindi tends to end words with a consonant. 
So, there is this habit among the Hindi speakers to drop the ending vowel, 
mostly 'a' from words that actually end with it. For example, the famous name 
Jayantha (miserable mine too, haha! = jayanþa as Romanized), is pronounced 
Jayanth by Hindi speakers. It is a Sanskrit word. Sanskrit and languages like 
Sinhala have vowel ending and are traditionally spoken as such.

Dictionary is a commercial invention. When Caxton brought lead types to 
England, French-speaking Latin-flaunting elites did not care about the poor 
natives. Earlier, invading Romans forced them to drop Fuþark and adopt the 
22-letter Latin alphabet. So, they improvised. Struck a line across d and made 
ð, Eth; added a sign to 'a' and made æ (Asc) and continued using Thorn (þ) by 
rounding the loop. Lead type printing hit English for the second time, ruining 
it as the spell standardizing began. Dictionaries sold. THE POWERFUL CAN RUIN 
PEOPLE'S PROPERTY BECAUSE THEY CAN IN ORDER TO MAKE MONEY. Unicode enthusiasts, 
take heed!

Looking at the word you gave, ontarAy, it looks to me like an Anglicized form. 
If I am to make a guess, its ending is like in ontarAyi. Is it said something 
like, own-the-raa-yi? (danger?) If I am right, this is a good example of 
decline of a writing system owing to bad, uncaring application of technology. 
We are in the Digital Age, and we need not compromise any more. In fact, we can 
fix errors and decadence introduced by past technologies.


RICHARD:
That sounds like a letter-assembly system.

MY RESPONSE:
Nothing assembled there, my friend.



On 4/24/2017 12:38 PM, Richard Wordingham via Unicode wrote:

> On Mon, 24 Apr 2017 00:36:26 +0530
> Naena Guru via Unicode wrote:
>
> > The Unicode approach to Sanskrit and all Indic is flawed. Indic
> > should not be letter-assembly systems.
> >
> > Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of
> > the speech. Each writing system then assigns a shape to the
> > phonetically precise phoneme.
> >
> > The most technically and grammatically proper solution for Indic is
> > first to ROMANIZE the group of writing systems at the level of
> > phonemes. That is, assign romanized shapes to vowels, consonants,
> > prenasals, post-vowel phonemes (anusvara and visarjaniiya with its
> > allophones) etc. This approach is similar to how European languages
> > picked up Latin, improvised the script and even uses Simples and
> > Capitals repertoire. Romanizing immediately makes typing easier and
> > eliminates sometimes embarrassing ambiguity in Anglicizing -- you
> > type phonetically on key layouts close to QWERTY. (Only four
> > positions are different in Romanized Sinhala layout).
> >
> > If we drop the capitalizing rules and utilize caps to indicate the
> > 'other' forms of a common letter, we get an intuitively typed system
> > for each language, and readable too. When this is done carefully,
> > comparing phoneme sets of the languages, we can reach a common set of
> > Latin-derived SINGLE-BYTE letters completely covering all phonemes of
> > all Indic.
>
> Unless this implies a spelling reform for many languages, I'd like to
> see how this works for the Tai Tham script.  I'm not happy with the
> Romanisation I use to work round hostile rendering engines.  (My
> scheme is only documented in variable hack_ss02 in the last script
> blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
> there are several different ways of writing what one might naively
> record as "ontarAy".
>
> > Next, each native script can be obtained by making orthographic smart
> > fonts that display the SBCS codes in the respective shapes of the
> > native scripts.

Re: Counting Devanagari Aksharas

2017-04-24 Thread Richard Wordingham via Unicode
On Mon, 24 Apr 2017 00:36:26 +0530
Naena Guru via Unicode  wrote:

> The Unicode approach to Sanskrit and all Indic is flawed. Indic
> should not be letter-assembly systems.
> 
> Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of
> the speech. Each writing system then assigns a shape to the
> phonetically precise phoneme.
> 
> The most technically and grammatically proper solution for Indic is 
> first to ROMANIZE the group of writing systems at the level of
> phonemes. That is, assign romanized shapes to vowels, consonants,
> prenasals, post-vowel phonemes (anusvara and visarjaniiya with its
> allophones) etc. This approach is similar to how European languages
> picked up Latin, improvised the script and even uses Simples and
> Capitals repertoire. Romanizing immediately makes typing easier and
> eliminates sometimes embarrassing ambiguity in Anglicizing -- you
> type phonetically on key layouts close to QWERTY. (Only four
> positions are different in Romanized Sinhala layout).
> 
> If we drop the capitalizing rules and utilize caps to indicate the 
> 'other' forms of a common letter, we get an intuitively typed system
> for each language, and readable too. When this is done carefully,
> comparing phoneme sets of the languages, we can reach a common set of 
> Latin-derived SINGLE-BYTE letters completely covering all phonemes of 
> all Indic.

Unless this implies a spelling reform for many languages, I'd like to
see how this works for the Tai Tham script.  I'm not happy with the
Romanisation I use to work round hostile rendering engines.  (My
scheme is only documented in variable hack_ss02 in the last script
blocks of http://wrdingam.co.uk/lanna/denderer_test.htm.)  For example,
there are several different ways of writing what one might naively
record as "ontarAy".

> Next, each native script can be obtained by making orthographic smart 
> fonts that display the SBCS codes in the respective shapes of the
> native scripts.

That sounds like a letter-assembly system.

So how does your scheme help one split words into orthographic
syllables?

> I have successfully romanized Sinhala and revived the full repertoire
> of Sinhala + Sanskrit orthography losing nothing. Sinhala script is
> perhaps the most complex of all Indic because it is used to write
> both Sanskrit and Pali.

What complication does Pali impose on top of Sanskrit?  As far as I'm
aware, it just needs one extra letter, usually called LLA, which you
will already have if 'Sanskrit' includes Vedic Sanskrit.
 
> See this: http://ahangama.com/ (It's all SBCS underneath).
> Test here: http://ahangama.com/edit.htm

All I get for these are blank pages.  Perhaps there's an unreported
communication failure in the network.

Richard.


Re: Counting Devanagari Aksharas

2017-04-23 Thread Richard Wordingham via Unicode
On Sun, 23 Apr 2017 05:40:29 +0300
Eli Zaretskii via Unicode  wrote:

> > The cursor moves to the cluster boundary, so there is much less of a
> > problem with Emacs.  
> 
> But you wanted to highlight only part of the cluster, AFAIU.

If I search for CGJ, highlighting it is frequently supremely useless.
I want to know where it is; highlighting is merely a tool to find it on
the screen.

Richard.
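
[Editorial illustration] One way to "know where it is" without relying
on a highlight is to report positions and neighbours as text. The
following is a minimal Python sketch; nobody in the thread uses Python,
and the function name locate_all and the sample string are invented for
the example. It lists each offset of U+034F COMBINING GRAPHEME JOINER
together with the name of the character just before it.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, normally drawn with zero width


def locate_all(text, needle=CGJ):
    """Return (offset, name of preceding character) for every match."""
    hits = []
    start = text.find(needle)
    while start != -1:
        if start > 0:
            before = unicodedata.name(text[start - 1], "<unnamed>")
        else:
            before = "<start of text>"
        hits.append((start, before))
        start = text.find(needle, start + 1)
    return hits


sample = "a\u034Fb\u034F"
for offset, before in locate_all(sample):
    print(offset, before)
```

A textual report like this sidesteps the zero-pixel highlight problem,
at the cost of making the user map offsets back to the display.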


Re: Counting Devanagari Aksharas

2017-04-23 Thread Naena Guru via Unicode
The Unicode approach to Sanskrit and all Indic is flawed. Indic should 
not be letter-assembly systems.


Sanskrit vyaakaraNa (grammar) explains the phonemes as the atoms of the 
speech. Each writing system then assigns a shape to the phonetically 
precise phoneme.


The most technically and grammatically proper solution for Indic is 
first to ROMANIZE the group of writing systems at the level of phonemes. 
That is, assign romanized shapes to vowels, consonants, prenasals, 
post-vowel phonemes (anusvara and visarjaniiya with its allophones) etc. 
This approach is similar to how European languages picked up Latin, 
improvised the script and even uses Simples and Capitals repertoire. 
Romanizing immediately makes typing easier and eliminates sometimes 
embarrassing ambiguity in Anglicizing -- you type phonetically on key 
layouts close to QWERTY. (Only four positions are different in Romanized 
Sinhala layout).


If we drop the capitalizing rules and utilize caps to indicate the 
'other' forms of a common letter, we get an intuitively typed system for 
each language, and readable too. When this is done carefully, comparing 
phoneme sets of the languages, we can reach a common set of 
Latin-derived SINGLE-BYTE letters completely covering all phonemes of 
all Indic.


Next, each native script can be obtained by making orthographic smart 
fonts that display the SBCS codes in the respective shapes of the native 
scripts.


I have successfully romanized Sinhala and revived the full repertoire of 
Sinhala + Sanskrit orthography losing nothing. Sinhala script is perhaps 
the most complex of all Indic because it is used to write both Sanskrit 
and Pali.


See this: http://ahangama.com/ (It's all SBCS underneath).
Test here: http://ahangama.com/edit.htm


On 4/20/2017 5:05 AM, Richard Wordingham via Unicode wrote:

Is there consensus on how to count aksharas in the Devanagari script?
The doubts I have relate to a visible halant in orthographic syllables
other than the first.

For example, according to 'Devanagari VIP Team Issues Report'
http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a
derived form from Nepali श्रीमान्  should be written श्रीमान्‌को
<U+0936 DEVANAGARI LETTER SHA, U+094D DEVANAGARI SIGN VIRAMA, U+0930
DEVANAGARI LETTER RA, U+0940 DEVANAGARI VOWEL SIGN II, U+092E
DEVANAGARI LETTER MA, U+093E DEVANAGARI VOWEL SIGN AA, U+0928
DEVANAGARI LETTER NA, U+094D, U+200C ZERO WIDTH NON-JOINER, U+0915
DEVANAGARI LETTER KA, U+094B DEVANAGARI VOWEL SIGN O> and not
श्रीमान्को  <U+0936, U+094D, U+0930, U+0940, U+092E, U+093E, U+0928,
U+094D, U+0915, U+094B>.  Now, if the font used has a conjunct for
SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO
and the latter as having 3 aksharas SH.RII, MAA, N.KO.

If the font leads to the use of a visible halant instead of the vattu
conjunct SH.RA, as happens when I view this email, would there then be
5 and 4 aksharas respectively?  A further complication is that the font
chosen treats what looks like SH, RA as a conjunct; the vowel I appears
to the left of SH when added after RA (श्रि).

Richard.
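
[Editorial illustration] The two counts in the question above can be
reproduced mechanically if one fixes the assumption Richard states
first, namely that every virama followed directly by a letter forms a
conjunct. The Python sketch below makes exactly that assumption (the
function name split_aksharas is invented here); it knows nothing about
the font, so it cannot answer the second half of the question, where a
visible halant replaces the conjunct.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def split_aksharas(text):
    """Split into orthographic syllables, assuming virama + letter
    always joins and that anything in between (e.g. ZWNJ) blocks it."""
    aksharas = []
    current = ""
    for i, ch in enumerate(text):
        is_letter = unicodedata.category(ch) == "Lo"
        joins_previous = i > 0 and text[i - 1] == VIRAMA
        if is_letter and not joins_previous and current:
            aksharas.append(current)
            current = ""
        current += ch  # marks, virama and ZWNJ stay with what precedes
    if current:
        aksharas.append(current)
    return aksharas


# The two spellings from the message, written as code points.
with_zwnj = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"
without_zwnj = with_zwnj.replace("\u200C", "")

print(len(split_aksharas(with_zwnj)))     # SH.RII, MAA, N, KO
print(len(split_aksharas(without_zwnj)))  # SH.RII, MAA, N.KO
```

Under this assumption the counts are 4 and 3, matching the first
paragraph of the question. Whether they should become 5 and 4 when the
font shows a visible halant is the open point, and no code-point-only
procedure can settle it.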





Re: Counting Devanagari Aksharas

2017-04-23 Thread Asmus Freytag via Unicode

  
  
On 4/22/2017 9:25 PM, Manish Goregaokar via Unicode wrote:

> Backspace in browsers (chrome and firefox) deletes within EGCs too.
> They delete matras in devanagari, and jamos in hangul. They don't
> *exactly* work off of code points (e.g. flag emoji gets deleted as a
> whole in many backspace implementations)

Flag emoji and many other "invisible" sequences are different from
ligatures and conjuncts in one important way: their elements are not
usually key strokes, but the full sequence would be inserted from a
pick list or other type of input method. If you didn't "type" each of
the elements of the sequence, then deleting individual ones is
something you would only need for debugging or other specialized
purposes, not for undoing a physical action (keystroke) in reverse
order.

Speaking of undoing: not all editors always support full key-stroke by
key-stroke undo, some will coalesce longer runs of text. This saves on
space for the undo buffer, but also makes undoing more extensive edits
less painful. It's clearly a personal preference whether such
"streamlining" would feel "right" or "bothersome".

Beyond the last line typed, or two, I may really not care if undo went
word by word, say.

A./

  



Re: Counting Devanagari Aksharas

2017-04-22 Thread Manish Goregaokar via Unicode
> You cannot even
> meaningfully move by single characters in most clusters, because
> composing characters generally completely changes how the original
> characters looked, so there's nowhere you can display the cursor.

Yes, and this is one of the reasons it feels broken in devanagari, you
get cursors in the midst of aksharas, in weird places.


Backspace in browsers (chrome and firefox) deletes within EGCs too.
They delete matras in devanagari, and jamos in hangul. They don't
*exactly* work off of code points (e.g. flag emoji gets deleted as a
whole in many backspace implementations)
-Manish
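
[Editorial illustration] The hybrid behaviour described above (code
point deletion in general, but a flag removed as a unit) might look
roughly like the Python sketch below. It is not taken from Chrome or
Firefox; the function names are invented, and the only special case
modelled is a pair of regional indicator symbols, which is how flag
emoji are encoded.

```python
def is_regional_indicator(ch):
    # Flags are encoded as two characters from U+1F1E6..U+1F1FF.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def backspace(text):
    """Delete one code point, except that a complete flag goes at once."""
    run = 0
    while run < len(text) and is_regional_indicator(text[-1 - run]):
        run += 1
    if run and run % 2 == 0:
        return text[:-2]  # the last two indicators pair up as one flag
    return text[:-1]      # everything else: a single code point


print(len(backspace("\u0915\u093F")))          # KA + VOWEL SIGN I: only the matra goes
print(len(backspace("a\U0001F1EB\U0001F1F7")))  # 'a' + one flag: the whole flag goes
```

Counting the run of indicators from the end matters because several
flags can be adjacent; an odd run means the last indicator is unpaired
and is deleted on its own.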


On Sat, Apr 22, 2017 at 12:22 PM, Eli Zaretskii via Unicode
<unicode@unicode.org> wrote:
>> Date: Sat, 22 Apr 2017 17:13:36 +0100
>> From: Richard Wordingham via Unicode <unicode@unicode.org>
>>
>> > Movement by grapheme
>> > cluster is AFAIK the most natural way of moving in complex scripts.
>>
>> Evidence?
>
> Personal experience?
>
>> It's easiest for displaying the cursor.
>
> It's the _only_ way of displaying the cursor.  You cannot even
> meaningfully move by single characters in most clusters, because
> composing characters generally completely changes how the original
> characters looked, so there's nowhere you can display the cursor.  And
> without being able to position the cursor, visual feedback to the
> user becomes troublesome at best.
>
>> I've encountered the problem that, while at least I can search for
>> text smaller than a cluster, there's no indication in the window of
>> where in the window the text is.
>
> I could imagine Emacs decomposing characters temporarily when only
> part of a cluster matches the search string.  Assuming this would make
> sense to users of some complex scripts, that is.  You are welcome to
> suggest such a feature by using report-emacs-bug.
>
>> SIL's Graphite supports the idea of a split cursor, which
>> shows the glyphs corresponding to the characters before and after the
>> cursor position.
>
> I find split-cursor to be a nuisance, FWIW.  IME, it confuses the
> users without making anything much clearer.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Eli Zaretskii via Unicode
> Date: Sun, 23 Apr 2017 00:51:59 +0100
> Cc: Julian Bradfield 
> From: Richard Wordingham via Unicode 
> 
> On Sat, 22 Apr 2017 21:39:42 +0100 (BST)
> Julian Bradfield via Unicode  wrote:
> 
> > On 2017-04-22, Eli Zaretskii via Unicode  wrote:
> 
> > > I could imagine Emacs decomposing characters temporarily when only
> > > part of a cluster matches the search string.  Assuming this would
> > > make sense to users of some complex scripts, that is.  You are
> > > welcome to suggest such a feature by using report-emacs-bug.  
> 
> The cursor moves to the cluster boundary, so there is much less of a
> problem with Emacs.

But you wanted to highlight only part of the cluster, AFAIU.

> > That's what I do in my emacs with combining characters, and if I had
> > complex script support, I'd expect the same to happen there.
> > emacs is a programmer's editor, after all :)
> 
> Emacs probably has a way of toggling complex script support somewhere.
> I'm torn between seeing the text properly set out and seeing exactly
> what it is that I've typed.  'Reveal codes' doesn't seem widely
> supported.

"M-x auto-composition-mode RET" should do what you want.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Richard Wordingham via Unicode
On Sat, 22 Apr 2017 21:39:42 +0100 (BST)
Julian Bradfield via Unicode  wrote:

> On 2017-04-22, Eli Zaretskii via Unicode  wrote:

> > I could imagine Emacs decomposing characters temporarily when only
> > part of a cluster matches the search string.  Assuming this would
> > make sense to users of some complex scripts, that is.  You are
> > welcome to suggest such a feature by using report-emacs-bug.  

The cursor moves to the cluster boundary, so there is much less of a
problem with Emacs.

> That's what I do in my emacs with combining characters, and if I had
> complex script support, I'd expect the same to happen there.
> emacs is a programmer's editor, after all :)

Emacs probably has a way of toggling complex script support somewhere.
I'm torn between seeing the text properly set out and seeing exactly
what it is that I've typed.  'Reveal codes' doesn't seem widely
supported.

Richard.
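
[Editorial illustration] Outside an editor, a crude "reveal codes" view
is easy to produce. The Python sketch below (the function name
reveal_codes is made up here) lists every code point of a string with
its Unicode name, which shows exactly what was typed regardless of how
the text is shaped on screen.

```python
import unicodedata


def reveal_codes(text):
    """One 'U+XXXX NAME' entry per code point, in stored order."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# NA, VIRAMA, ZWNJ: the invisible joiner control shows up explicitly.
for line in reveal_codes("\u0928\u094D\u200C"):
    print(line)
```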


Re: Counting Devanagari Aksharas

2017-04-22 Thread Julian Bradfield via Unicode
On 2017-04-22, Eli Zaretskii via Unicode  wrote:
>> From: Richard Wordingham via Unicode 
[...]
>> I've encountered the problem that, while at least I can search for
>> text smaller than a cluster, there's no indication in the window of
>> where in the window the text is.
>
> I could imagine Emacs decomposing characters temporarily when only
> part of a cluster matches the search string.  Assuming this would make
> sense to users of some complex scripts, that is.  You are welcome to
> suggest such a feature by using report-emacs-bug.

That's what I do in my emacs with combining characters, and if I had
complex script support, I'd expect the same to happen there.
emacs is a programmer's editor, after all :)




Re: Counting Devanagari Aksharas

2017-04-22 Thread Eli Zaretskii via Unicode
> Date: Sat, 22 Apr 2017 17:13:36 +0100
> From: Richard Wordingham via Unicode 
> 
> > Movement by grapheme
> > cluster is AFAIK the most natural way of moving in complex scripts.
> 
> Evidence?

Personal experience?

> It's easiest for displaying the cursor.

It's the _only_ way of displaying the cursor.  You cannot even
meaningfully move by single characters in most clusters, because
composing characters generally completely changes how the original
characters looked, so there's nowhere you can display the cursor.  And
without being able to position the cursor, visual feedback to the
user becomes troublesome at best.

> I've encountered the problem that, while at least I can search for
> text smaller than a cluster, there's no indication in the window of
> where in the window the text is.

I could imagine Emacs decomposing characters temporarily when only
part of a cluster matches the search string.  Assuming this would make
sense to users of some complex scripts, that is.  You are welcome to
suggest such a feature by using report-emacs-bug.

> SIL's Graphite supports the idea of a split cursor, which
> shows the glyphs corresponding to the characters before and after the
> cursor position.

I find split-cursor to be a nuisance, FWIW.  IME, it confuses the
users without making anything much clearer.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Richard Wordingham via Unicode
On Sat, 22 Apr 2017 13:34:32 +0300
Eli Zaretskii via Unicode  wrote:

> AFAIR, Emacs allows one to _delete_ individual characters,
> i.e. Backspace and C-d delete character-by-character, so the problem
> shouldn't be so grave for imperfect typists.

Deleting forwards by one _character_ certainly makes life less harsh.
It's pleasanter than the UAX#29 suggestion, "For example, on a given
system the backspace key might delete by code point, while the delete
key may delete an entire cluster".
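
To make the two granularities in that UAX#29 suggestion concrete, here is a minimal Python sketch. The function names are hypothetical, and `is_extender` is only a crude stand-in for the real extended grapheme cluster rules (it treats any mark, ZWJ or ZWNJ as continuing a cluster); it is not a description of what Emacs or any other editor actually does.

```python
import unicodedata

def is_extender(ch):
    # Assumption: marks (Mn, Mc, Me) plus ZWNJ/ZWJ continue a cluster.
    # This only approximates the UAX#29 rules.
    return unicodedata.category(ch).startswith("M") or ch in "\u200C\u200D"

def backspace_code_point(text, pos):
    """Delete the single code point before pos; return (new_text, new_pos)."""
    if pos == 0:
        return text, 0
    return text[:pos - 1] + text[pos:], pos - 1

def delete_cluster(text, pos):
    """Delete the whole (approximate) cluster starting at pos."""
    if pos >= len(text):
        return text, pos
    end = pos + 1
    while end < len(text) and is_extender(text[end]):
        end += 1
    return text[:pos] + text[end:], pos

# <MA, VOWEL SIGN AA, KA, VOWEL SIGN O>
text = "\u092E\u093E\u0915\u094B"
print(backspace_code_point(text, 2))  # removes only VOWEL SIGN AA
print(delete_cluster(text, 0))        # removes MA together with VOWEL SIGN AA
```

With backspace working by code point, a mistyped vowel sign can be fixed without retyping the consonant, which is the behaviour being praised above.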

> Movement by grapheme
> cluster is AFAIK the most natural way of moving in complex scripts.

Evidence?  It's easiest for displaying the cursor.  I've encountered the
problem that, while at least I can search for text smaller than a
cluster, there's no indication in the window of where in the window the
text is.  SIL's Graphite supports the idea of a split cursor, which
shows the glyphs corresponding to the characters before and after the
cursor position.

Richard.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Eli Zaretskii via Unicode
> Date: Sat, 22 Apr 2017 11:13:16 +0100
> From: Richard Wordingham via Unicode 
> 
> At present these are split into two and three grapheme clusters
> respectively, and LibreOffice cursor movement responds accordingly.
> (SIGN AA starts a grapheme cluster in several scripts of further
> India.)  However, if one teaches the Emacs editor what a Tai Tham
> syllable is, so that it can use the M17n rendering library, the cursor
> then advances syllable by syllable, which is unpleasant for imperfect
> typists.

AFAIR, Emacs allows one to _delete_ individual characters,
i.e. Backspace and C-d delete character-by-character, so the problem
shouldn't be so grave for imperfect typists.  Movement by grapheme
cluster is AFAIK the most natural way of moving in complex scripts.


Re: Counting Devanagari Aksharas

2017-04-22 Thread Richard Wordingham via Unicode
On Fri, 21 Apr 2017 16:27:43 -0700
Manish Goregaokar via Unicode <unicode@unicode.org> wrote:

> > Do Hindi speakers really think of orthographic syllables as
> > characters?  
> 
> When rendered as a cluster, yes? I've asked around, and folks seem to
> insist on coupling it to the rendering.

That argues that it's a unit, which I don't think is in dispute.  Words
are also units, and nowadays we don't normally insist that one retype a
word just to change one bit of it.

> Given most fonts render
> *normal* (common, etc) clusters, I think making them EGCs and looking
> at nonrendered clusters the same way we do family emoji is fine
> (family emojis of length 5 are a single EGC, but that's not what's
> actually perceived by the user, but it's a use case that's very rare
> in the wild, so it doesn't matter).

That depends on the language.  In the Tai Tham script, even without
consonant clusters one can get 5 graphic characters in a syllable,
e.g. ᨧᩮᩢ᩶ᩣ _cao_  'lord;
you (polite)', and when one adds consonant clusters one easily gets
monosyllables like ᨠᩖ᩠᩶ᩅ᩠ᨿ _kluai_  'banana' with 5 graphic characters and
additionally 2 coengs.  (One can distinguish Pali from the Tai
languages simply by the density of the ink!)

At present these are split into two and three grapheme clusters
respectively, and LibreOffice cursor movement responds accordingly.
(SIGN AA starts a grapheme cluster in several scripts of further
India.)  However, if one teaches the Emacs editor what a Tai Tham
syllable is, so that it can use the M17n rendering library, the cursor
then advances syllable by syllable, which is unpleasant for imperfect
typists.  Fortunately, it's possible to add functions to Emacs to allow
it to advance character-by-character; I forget if one has to also add a
few code changes.  (The downside is that text either side of the cursor
is rendered independently, which can be a nuisance when editing very
long lines.)

> The way I see it, the current
> system is wrong, and so would the proposed system of not breaking at
> viramas (or not breaking at viramas followed by a consonant if we want
> to be more precise), but the proposed system would be wrong much less
> often.
 
> I am only talking about Devanagari, though scripts like
> Bangla/Gujarati/Gurmukhi may have similar needs. Breaking on ZWNJ seems
> sensible.

Indeed, viramas (InSC=Virama) will have to be handled case-by-case.  One
should continue to break after pulli (U+0BCD TAMIL SIGN VIRAMA) except
for the cases of the ligatures/conjuncts.  I don't know if there are
obscure cases, or whether it's only _shri_ and <KA, SSA> for which one
should not break just because of the virama.  Continuation after coengs
(InSC=Invisible_Stacker) should be automatic.

Malayalam will need customisation.  Definitions by codepoints are only
a fallback, for when a font cannot be used to guide the process. 

Formally, normalisation is a problem, as these characters can be
separated from letters by other marks.  This is a problem in practice
for normalised text in Tai Tham.
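
The reordering can be seen with Python's standard `unicodedata` module. This sketch uses a Devanagari sequence as a stand-in (an assumption made for illustration; the Tai Tham cases mentioned above involve different characters), and shows canonical ordering moving a nukta between the letter and the virama.

```python
import unicodedata

KA, NUKTA, VIRAMA = "\u0915", "\u093C", "\u094D"

# Illustrative input with the virama directly after the letter.
typed = KA + VIRAMA + NUKTA
normalised = unicodedata.normalize("NFC", typed)

# Canonical ordering sorts adjacent marks by combining class, so the
# nukta (lower class) ends up before the virama (higher class).
for ch in normalised:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)} {unicodedata.name(ch)}")
```

A rule that only looks for a virama immediately after a letter would therefore miss the normalised form.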

Pure killers (InSC=Pure_Killer) should probably be given no special
treatment, as at present, by default, though I wonder if we should
define orthographic syllables for Pali in Thai script.  The two
orthographies will need different rules, and renderers won't help.
Defining orthographic syllables for languages in the Latin script is
probably excessive.

Richard.



Re: Counting Devanagari Aksharas

2017-04-21 Thread Manish Goregaokar via Unicode
> Do Hindi speakers really think of orthographic syllables as characters?

When rendered as a cluster, yes? I've asked around, and folks seem to
insist on coupling it to the rendering. Given most fonts render
*normal* (common, etc) clusters, I think making them EGCs and looking
at nonrendered clusters the same way we do family emoji is fine
(family emojis of length 5 are a single EGC, but that's not what's
actually perceived by the user, but it's a use case that's very rare
in the wild, so it doesn't matter). The way I see it, the current
system is wrong, and so would the proposed system of not breaking at
viramas (or not breaking at viramas followed by a consonant if we want
to be more precise), but the proposed system would be wrong much less
often.

I am only talking about Devanagari, though scripts like
Bangla/Gujarati/Gurmukhi may have similar needs. Breaking on ZWNJ seems
sensible.
-Manish


On Fri, Apr 21, 2017 at 4:04 PM, Richard Wordingham via Unicode
<unicode@unicode.org> wrote:
> On Thu, 20 Apr 2017 11:17:05 -0700
> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
>
>> On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode
>> <unicode@unicode.org> wrote:
>
>> > Is there consensus on how to count aksharas in the Devanagari
>> > script? The doubts I have relate to a visible halant in
>> > orthographic syllables other than the first.
>
>> I don't think there's consensus.
>
> I've found related discussion at
> https://lists.w3.org/Archives/Public/public-i18n-indic/.  The question
> of how to count was raised and not answered there.
>
>> On Wed, Apr 19, 2017 at 4:35 PM,
>> Richard Wordingham via Unicode <unicode@unicode.org> wrote:
>> > Is there consensus on how to count aksharas in the Devanagari
>> > script? The doubts I have relate to a visible halant in
>> > orthographic syllables other than the first.
>
>> I'm of the opinion that Unicode should start considering devanagari
>> (and possibly other indic) consonant clusters as single extended
>> grapheme clusters.
>
> Do Hindi speakers really think of orthographic syllables as characters?
>
> What may be useful is the concept of a definition of an orthographic
> syllable.  It may be possible to get the information from a font -
> depending on the renderer - but a locale-dependent definition should be
> possible for use as a fall-back.  Devanagari rules won't work for
> Tamil, and I think rules for Hindi and Nepali will be slightly
> different - <VIRAMA, ZWNJ> looks like a problem.
>
> The concept is possibly not useful in some Indic scripts - the concept
> won't work well in Thai, but will work in Pali in the Thai script, for
> both Pali orthographies.
>
> Richard.


Re: Counting Devanagari Aksharas

2017-04-21 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 11:17:05 -0700
Manish Goregaokar via Unicode <unicode@unicode.org> wrote:

> On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode
> <unicode@unicode.org> wrote:

> > Is there consensus on how to count aksharas in the Devanagari
> > script? The doubts I have relate to a visible halant in
> > orthographic syllables other than the first.

> I don't think there's consensus.

I've found related discussion at
https://lists.w3.org/Archives/Public/public-i18n-indic/.  The question
of how to count was raised and not answered there.

> On Wed, Apr 19, 2017 at 4:35 PM,
> Richard Wordingham via Unicode <unicode@unicode.org> wrote:
> > Is there consensus on how to count aksharas in the Devanagari
> > script? The doubts I have relate to a visible halant in
> > orthographic syllables other than the first.

> I'm of the opinion that Unicode should start considering devanagari
> (and possibly other indic) consonant clusters as single extended
> grapheme clusters.

Do Hindi speakers really think of orthographic syllables as characters?

What may be useful is the concept of a definition of an orthographic
syllable.  It may be possible to get the information from a font -
depending on the renderer - but a locale-dependent definition should be
possible for use as a fall-back.  Devanagari rules won't work for
Tamil, and I think rules for Hindi and Nepali will be slightly
different - <VIRAMA, ZWNJ> looks like a problem.

The concept is possibly not useful in some Indic scripts - the concept
won't work well in Thai, but will work in Pali in the Thai script, for
both Pali orthographies.

Richard.


Re: Counting Devanagari Aksharas

2017-04-21 Thread Manish Goregaokar via Unicode
That seems like a relatively niche use case (especially with Vedic
Sanskrit) compared to having weird selection for everything else. I'm
not convinced. When I use a romanized Devanagari input method (I
typically do on my laptop), deleting the whole cluster is necessary
anyway for things to work well. Direct input methods do let you edit
in a more granular way but I've never seen the need for that.

I guess this boils down to a matter of opinion and anecdotal
experience, so there's not much I can do to convince this list
otherwise :)

-Manish


On Fri, Apr 21, 2017 at 12:23 AM, Richard Wordingham via Unicode
<unicode@unicode.org> wrote:
> On Fri, 21 Apr 2017 00:08:24 -0500
> Anshuman Pandey via Unicode <unicode@unicode.org> wrote:
>
>> > On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode
>> > <unicode@unicode.org> wrote:
>
>> > Now imagine you're
>> > typing Vedic Sanskrit, with its clusters and pitch indicators.
>
>> I tried typing Vedic Sanskrit, and it seems to work:
>
>> http://pandey.pythonanywhere.com/devsyll
>
> That should demonstrate nothing relevant if you type correctly first
> time.  The issue comes when you mistype and have to correct, to give
> the usual worst case, the first letter of a conjunct.  Now, I looked at
> your page in Firefox on Ubuntu, and I found the cursor seemed to move
> by extended grapheme cluster.  That means that to change a consonant
> you have to retype the following marks.
>
> I did find two issues with your analyser.
>
> Firstly, it broke श्रीमान्‌को into श्री·मा·न्को, which does not
> concatenate back to the original.
>
> Secondly, you have a problem with ANUDATTA.  You are not accepting
> <U+0924, U+0902, U+0952> as a syllable.  Perhaps you believed
> https://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm
> as to the structure of a Devanagari syllable.  I suspect ANUDATTA as a
> consonant modifier went out when U+097B DEVANAGARI LETTER GGA and the
> like came in.
>
> Richard.
>



Re: Counting Devanagari Aksharas

2017-04-21 Thread Richard Wordingham via Unicode
On Fri, 21 Apr 2017 00:08:24 -0500
Anshuman Pandey via Unicode <unicode@unicode.org> wrote:

> > On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode
> > <unicode@unicode.org> wrote:

> > Now imagine you're
> > typing Vedic Sanskrit, with its clusters and pitch indicators.  
 
> I tried typing Vedic Sanskrit, and it seems to work:
 
> http://pandey.pythonanywhere.com/devsyll

That should demonstrate nothing relevant if you type correctly first
time.  The issue comes when you mistype and have to correct, to give
the usual worst case, the first letter of a conjunct.  Now, I looked at
your page in Firefox on Ubuntu, and I found the cursor seemed to move
by extended grapheme cluster.  That means that to change a consonant
you have to retype the following marks.

I did find two issues with your analyser.

Firstly, it broke श्रीमान्‌को into श्री·मा·न्को, which does not
concatenate back to the original.
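
The round-trip property is easy to check mechanically. A small Python sketch follows; the names are hypothetical, and it assumes the reported segments are exactly as displayed above, i.e. with the ZWNJ missing from the last one.

```python
# Original word: <SHA, VIRAMA, RA, II, MA, AA, NA, VIRAMA, ZWNJ, KA, O>
original = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"

# Segments as reported (assumed to lack the ZWNJ).
reported = ["\u0936\u094D\u0930\u0940", "\u092E\u093E", "\u0928\u094D\u0915\u094B"]

def round_trips(text, segments):
    """A segmentation should concatenate back to the text it came from."""
    return "".join(segments) == text

def missing_code_points(text, segments):
    """List code points of text that occur in none of the segments."""
    joined = "".join(segments)
    return [f"U+{ord(ch):04X}" for ch in text if ch not in joined]

print(round_trips(original, reported))
print(missing_code_points(original, reported))
```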

Secondly, you have a problem with ANUDATTA.  You are not accepting
<U+0924, U+0902, U+0952> as a syllable.  Perhaps you believed
https://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm
as to the structure of a Devanagari syllable.  I suspect ANUDATTA as a
consonant modifier went out when U+097B DEVANAGARI LETTER GGA and the
like came in. 

Richard.



Re: Counting Devanagari Aksharas

2017-04-20 Thread Anshuman Pandey via Unicode

> On Apr 20, 2017, at 8:19 PM, Richard Wordingham via Unicode 
> <unicode@unicode.org> wrote:
> 
> On Thu, 20 Apr 2017 14:14:00 -0700
> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
> 
>> On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode
>> <unicode@unicode.org> wrote:
> 
>>> On Thu, 20 Apr 2017 11:17:05 -0700
>>> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
> 
>>>> I'm of the opinion that Unicode should start considering devanagari
>>>> (and possibly other indic) consonant clusters as single extended
>>>> grapheme clusters.
> 
>>> You won't like it if cursor movement granularity is reduced to one
>>> extended grapheme cluster.  I'm grateful that Emacs allows me to
> 
>> I mean, we do the same for Hangul.
> 
> Hangul is generally a maximum of three characters, which is about the
> border of tolerance. I find it irritating to have to completely retype
> Thai grapheme clusters of consonant, vowel and tone mark.  There were
> loud protests from the Thais when preposed vowels were added to the
> Thai grapheme cluster and implementations then responded, and Unicode
> quickly removed them. Now imagine you're typing Vedic Sanskrit, with its
> clusters and pitch indicators.

I tried typing Vedic Sanskrit, and it seems to work:

http://pandey.pythonanywhere.com/devsyll

Haven't tried the orthographic oddity of the Nepali case in question. Above my 
pay grade.

If you access the above link on an iOS device you'll see tofu and missing 
characters. Apple's Devanagari font needs to be fixed.

- AP



Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 14:14:00 -0700
Manish Goregaokar via Unicode <unicode@unicode.org> wrote:

> On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode
> <unicode@unicode.org> wrote:

> > On Thu, 20 Apr 2017 11:17:05 -0700
> > Manish Goregaokar via Unicode <unicode@unicode.org> wrote:

> >> I'm of the opinion that Unicode should start considering devanagari
> >> (and possibly other indic) consonant clusters as single extended
> >> grapheme clusters.

> > You won't like it if cursor movement granularity is reduced to one
> > extended grapheme cluster.  I'm grateful that Emacs allows me to

> I mean, we do the same for Hangul.

Hangul is generally a maximum of three characters, which is about the
border of tolerance. I find it irritating to have to completely retype
Thai grapheme clusters of consonant, vowel and tone mark.  There were
loud protests from the Thais when preposed vowels were added to the
Thai grapheme cluster and implementations then responded, and Unicode
quickly removed them. Now imagine you're typing Vedic Sanskrit, with its
clusters and pitch indicators.

> The main time you need intra-conjunct segmentation in Devanagari is
> when deleting something you just typed.

You'll typically be several words beyond by the time you notice, or by
the time a spell-checker spots a problem.

Richard.


Re: Counting Devanagari Aksharas

2017-04-20 Thread Manish Goregaokar via Unicode
I mean, we do the same for Hangul.

The main time you need intra-conjunct segmentation in Devanagari is
when deleting something you just typed. And backspace usually operates
on code points anyway (except for some weird cases like flag emoji,
though this isn't uniform across platforms). I don't see how
intra-conjunct selection would be useful otherwise.
-Manish


On Thu, Apr 20, 2017 at 12:14 PM, Richard Wordingham via Unicode
<unicode@unicode.org> wrote:
> On Thu, 20 Apr 2017 11:17:05 -0700
> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
>
>> When given a rendered representation people seem to uniformly count
>> conjuncts as multiple aksharas if rendered with visible halant, and as
>> a single akshara if they are rendered conjoined.
>
> Now, that's what I expected.
>
>> I'm of the opinion that Unicode should start considering devanagari
>> (and possibly other indic) consonant clusters as single extended
>> grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
>> but sometimes family emoji will not render as a single glyph either
>> (if you use skin tones or more than 4 family members) and we still
>> consider those EGCs.
>
> You won't like it if cursor movement granularity is reduced to one
> extended grapheme cluster.  I'm grateful that Emacs allows me to
> delete and replace the first NFC character of a grapheme cluster.
>
> Richard.


Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 11:17:05 -0700
Manish Goregaokar via Unicode <unicode@unicode.org> wrote:

> When given a rendered representation people seem to uniformly count
> conjuncts as multiple aksharas if rendered with visible halant, and as
> a single akshara if they are rendered conjoined.

Now, that's what I expected.

> I'm of the opinion that Unicode should start considering devanagari
> (and possibly other indic) consonant clusters as single extended
> grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
> but sometimes family emoji will not render as a single glyph either
> (if you use skin tones or more than 4 family members) and we still
> consider those EGCs.

You won't like it if cursor movement granularity is reduced to one
extended grapheme cluster.  I'm grateful that Emacs allows me to
delete and replace the first NFC character of a grapheme cluster.

Richard.


Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
On Thu, 20 Apr 2017 15:33:37 +0530
Shriramana Sharma via Unicode <unicode@unicode.org> wrote:
 
> All I can say is that Tamil script has eschewed most consonant cluster
> ligatures/conjoining forms. As for Devanagari, writing श्रीमान्‌को (I
> used ZWNJ) i.o. श्रीमान्को is quite possible with existing technology.
> The latter would be Sanskrit orthography and former perhaps Hindi,
> although I wouldn't know why anyone would want to run in the को with
> the preceding श्रीमान् even in Hindi.

According to p23 of
http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, it's
Nepali.  It's a compromise between श्रीमान्को and Hindi-style श्रीमान्
को.

> And IMO it would be better to
> clearly define at the outset what you meant by "akshara" in your
> question to avoid confusions by people replying having a different
> idea of the meaning of that term.

I didn't want to be any more precise than "orthographic syllable".
Swaran Lata is urging, in submission
http://www.unicode.org/L2/L2017/17094-indic-text-seg.pdf to the UTC,
that UAX#29 "Unicode Text Segmentation" adopt a rather naïve definition
of an Indian orthographic syllable.  The worst outcome in my opinion
would be if it were adopted for the extended grapheme cluster
definition - it would make editing orthographic clusters even more
difficult.  However, it would make sense for CLDR to carry localised
definitions.

For layout, the definition would be relevant for 'drop capital effects'
and for the analogue of inserting spaces between letters.  There are
recommendations in a maturing W3C specification for Indic layout,
though to be fair the specification fairly quickly restricts its scope
to Indian scripts.  Now, if the spacing were applied to the Nepali word 
श्रीमान्‌को I would expect to see something like श्री मा न् को, as the
base word itself would appear as श्री मा न्  when subjected to the
same treatment. However, before suggesting minor improvements that might
be in order, I thought I should check whether there was agreement that
<VIRAMA, ZWNJ> terminated an orthographic syllable.  It now seems that
any general agreement would in fact be that it did *not* terminate an
orthographic syllable!  I must say that stretching श्रीमान्‌को out as
श्री मा न्‌को  feels wrong.  If my feeling is right, then the definition
of orthographic syllable, if it can be done without reference to a
font, belongs in CLDR, as UAX#29 implies, and not in the Unicode
Character Database and Unicode standards.

Richard.



Re: Counting Devanagari Aksharas

2017-04-20 Thread Manish Goregaokar via Unicode
I don't think there's consensus.

When given a rendered representation people seem to uniformly count
conjuncts as multiple aksharas if rendered with visible halant, and as
a single akshara if they are rendered conjoined.

Most fonts for devanagari these days are pretty good at conjoining
consonants. They seem to do so for all common conjuncts, and usually
for most practical (i.e. not ridiculously long) conjuncts. I've never
seen a visible halant in text I've read.

I'm of the opinion that Unicode should start considering devanagari
(and possibly other indic) consonant clusters as single extended
grapheme clusters. Yes, sometimes it's not rendered as a single glyph,
but sometimes family emoji will not render as a single glyph either
(if you use skin tones or more than 4 family members) and we still
consider those EGCs.
-Manish


On Wed, Apr 19, 2017 at 4:35 PM, Richard Wordingham via Unicode
<unicode@unicode.org> wrote:
> Is there consensus on how to count aksharas in the Devanagari script?
> The doubts I have relate to a visible halant in orthographic syllables
> other than the first.
>
> For example, according to 'Devanagari VIP Team Issues Report'
> http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a
> derived form from Nepali श्रीमान्  should be written श्रीमान्‌को
> <U+0936 DEVANAGARI LETTER SHA, U+094D DEVANAGARI SIGN VIRAMA, U+0930
> DEVANAGARI LETTER RA, U+0940 DEVANAGARI VOWEL SIGN II, U+092E
> DEVANAGARI LETTER MA, U+093E DEVANAGARI VOWEL SIGN AA, U+0928
> DEVANAGARI LETTER NA, U+094D, U+200C ZERO WIDTH NON-JOINER, U+0915
> DEVANAGARI LETTER KA, U+094B DEVANAGARI VOWEL SIGN O> and not
> श्रीमान्को  <U+0936, U+094D, U+0930, U+0940, U+092E, U+093E, U+0928,
> U+094D, U+0915, U+094B>.  Now, if the font used has a conjunct for
> SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO
> and the latter as having 3 aksharas SH.RII, MAA, N.KO.
>
> If the font leads to the use of a visible halant instead of the vattu
> conjunct SH.RA, as happens when I view this email, would there then be
> 5 and 4 aksharas respectively?  A further complication is that the font
> chosen treats what looks like SH, RA as a conjunct; the vowel I appears
> to the left of SH when added after RA (श्रि).
>
> Richard.
>



Re: Counting Devanagari Aksharas

2017-04-20 Thread Shriramana Sharma via Unicode
Hello Richard. Yes my earlier reply wasn't intended to be offlist. I
have near-zero knowledge about non-Indic languages.

All I can say is that Tamil script has eschewed most consonant cluster
ligatures/conjoining forms. As for Devanagari, writing श्रीमान्‌को (I
used ZWNJ) i.o. श्रीमान्को is quite possible with existing technology.
The latter would be Sanskrit orthography and former perhaps Hindi,
although I wouldn't know why anyone would want to run in the को with
the preceding श्रीमान् even in Hindi. And IMO it would be better to
clearly define at the outset what you meant by "akshara" in your
question to avoid confusions by people replying having a different
idea of the meaning of that term.



-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा



Re: Counting Devanagari Aksharas

2017-04-20 Thread Richard Wordingham via Unicode
I was offered the following reply:

> To my knowledge except in Tamil script vowel less consonants in
> written form aren't considered as separate "akshara"s in native
> terminology.

Word-finally they seem to be being treated as such.  To be more
precise, a final cluster of one or more consonants marked as having no
vowel is - Sanskrit has a few word-final clusters.

> However for text shaping purposes they will surely have
> to be considered as separate orthographic syllables in Unicode
> terminology since in word end position they can sometimes carry svara
> markers.

The complication comes word internally.  My understanding is that
phonetically syllable-final consonants in non-Indic words in
non-Indic languages have a tendency not to be included in an akshara
along with the start of the next syllable.  However, that tendency is
more evident in scripts other than Devanagari; Devanagari has developed
in the context of Indic languages.

Renderers' syllable-recognition algorithms will naturally treat
word-final devowelled sequences as separate units, rather than
associate them with the previous implicit or explicit vowel.

Burmese is a good example of what can happen with a non-Indic language;
in native words, phonetic syllabic boundaries tend to be orthographic
syllable boundaries.

Text-shaping engines like Microsoft's Uniscribe are more complicated.
For scripts with a virama, they seem to assume that the virama may be
a combining operator, and wait for data from the font to decide how
many clusters to form.

One test is the insertion of white spaces in a word when it is stretched
out.  Of course, that test can only be applied where human decisions
are involved - otherwise we are just looking at what dominant
renderers are actually doing, rather than looking at what they ought
to be doing.

Richard.


Counting Devanagari Aksharas

2017-04-19 Thread Richard Wordingham via Unicode
Is there consensus on how to count aksharas in the Devanagari script?
The doubts I have relate to a visible halant in orthographic syllables
other than the first.

For example, according to 'Devanagari VIP Team Issues Report'
http://www.unicode.org/L2/L2011/11370-devanagari-vip-issues.pdf, a
derived form from Nepali श्रीमान्  should be written श्रीमान्‌को
<U+0936 DEVANAGARI LETTER SHA, U+094D DEVANAGARI SIGN VIRAMA, U+0930
DEVANAGARI LETTER RA, U+0940 DEVANAGARI VOWEL SIGN II, U+092E
DEVANAGARI LETTER MA, U+093E DEVANAGARI VOWEL SIGN AA, U+0928
DEVANAGARI LETTER NA, U+094D, U+200C ZERO WIDTH NON-JOINER, U+0915
DEVANAGARI LETTER KA, U+094B DEVANAGARI VOWEL SIGN O> and not
श्रीमान्को  <U+0936, U+094D, U+0930, U+0940, U+092E, U+093E, U+0928,
U+094D, U+0915, U+094B>.  Now, if the font used has a conjunct for
SHRA, I would count the former as having 4 aksharas SH.RII, MAA, N, KO
and the latter as having 3 aksharas SH.RII, MAA, N.KO.

If the font leads to the use of a visible halant instead of the vattu
conjunct SH.RA, as happens when I view this email, would there then be
5 and 4 aksharas respectively?  A further complication is that the font
chosen treats what looks like SH, RA as a conjunct; the vowel I appears
to the left of SH when added after RA (श्रि).
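
The font-independent part of the question can be sketched in Python. This is a toy segmenter, not an agreed definition: the names are hypothetical, it assumes every letter (General_Category Lo) starts a unit unless a virama joins it to the previous one, and the `zwnj_breaks` flag stands for the open question of whether <VIRAMA, ZWNJ> ends an orthographic syllable. It cannot model the font-dependent cases, where a visible halant replaces a conjunct.

```python
import unicodedata

VIRAMA = "\u094D"
ZWNJ = "\u200C"

def starts_unit(ch):
    # Assumption: consonants and independent vowels (category Lo) start a unit.
    return unicodedata.category(ch) == "Lo"

def split_aksharas(word, zwnj_breaks=True):
    units = []
    for i, ch in enumerate(word):
        after_virama = i >= 1 and word[i - 1] == VIRAMA
        after_virama_zwnj = i >= 2 and word[i - 2:i] == VIRAMA + ZWNJ
        joins = after_virama or (after_virama_zwnj and not zwnj_breaks)
        if units and (joins or not starts_unit(ch)):
            units[-1] += ch      # continue the current unit
        else:
            units.append(ch)     # begin a new unit
    return units

with_zwnj = "\u0936\u094D\u0930\u0940\u092E\u093E\u0928\u094D\u200C\u0915\u094B"
without_zwnj = with_zwnj.replace(ZWNJ, "")

print(len(split_aksharas(with_zwnj)))                     # SH.RII, MAA, N, KO
print(len(split_aksharas(without_zwnj)))                  # SH.RII, MAA, N.KO
print(len(split_aksharas(with_zwnj, zwnj_breaks=False)))  # ZWNJ kept inside N.KO
```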

Richard.



Re: Sanskrit -e/o a- Sandhi in Devanagari

2017-02-24 Thread Shriramana Sharma
This seems quite reasonable.

On 25 Feb 2017 04:06, "Richard Wordingham" <richard.wording...@ntlworld.com>
wrote:

> The usual form of this sandhi in modern Sanskrit is described as the a-
> dropping and being replaced by avagraha.  If word boundaries are
> represented by SPACE, am I correct in believing that the change in
> codepoints is:
>
> <U+0020 SPACE, U+0905 LETTER A> becomes <U+200B ZERO WIDTH SPACE, U+093D
> DEVANAGARI SIGN AVAGRAHA>
>
> I ask because I have seen lines starting with avagraha, though within a
> line there seems not to be a space before avagraha.  (I am ignoring
> didactic writing which shows sandhi effects but leaves a space between
> the original words.)
>
> Richard.
>


Sanskrit -e/o a- Sandhi in Devanagari

2017-02-24 Thread Richard Wordingham
The usual form of this sandhi in modern Sanskrit is described as the a-
dropping and being replaced by avagraha.  If word boundaries are
represented by SPACE, am I correct in believing that the change in
codepoints is:

<U+0020 SPACE, U+0905 LETTER A> becomes <U+200B ZERO WIDTH SPACE, U+093D
DEVANAGARI SIGN AVAGRAHA>

I ask because I have seen lines starting with avagraha, though within a
line there seems not to be a space before avagraha.  (I am ignoring
didactic writing which shows sandhi effects but leaves a space between
the original words.)
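
For concreteness, the substitution being asked about can be written as a short Python sketch. The function name is hypothetical, and the code only illustrates the proposed change in codepoints after a final -e or -o; it does not settle whether that encoding is the right one.

```python
import re

# Final -e / -o, written as a vowel sign or as an independent vowel.
FINAL_E_OR_O = "\u0947\u094B\u090F\u0913"

# <SPACE, LETTER A> immediately after a final -e / -o.
PATTERN = re.compile(f"(?<=[{FINAL_E_OR_O}])\u0020\u0905")

def write_avagraha(text):
    """Replace <SPACE, A> with <ZERO WIDTH SPACE, AVAGRAHA> after -e / -o."""
    return PATTERN.sub("\u200B\u093D", text)

before = "\u0924\u0947 \u0905\u092A\u093F"   # "te api"
after = write_avagraha(before)
print([f"U+{ord(ch):04X}" for ch in after])
```

Keeping a ZERO WIDTH SPACE at the word boundary would still allow a line break there, which would account for lines starting with avagraha.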

Richard.


Re: Devanagari and Subscript and Superscript

2015-12-16 Thread Doug Ewell
I missed this yesterday.

Plug Gulp wrote:

> General support for all characters, words and sentences could be
> achieved by just three new formatting characters, e.g. SCR, SUP and
> SUB, similar to the way other formatting characters such as ZWS, ZWJ,
> ZWNJ etc are defined. The new formatting characters could be defined
> as:
>
> SCR: In a character stream, all the characters following this
> formatting character shall be treated as [...]
>
> SUP: In a character stream, all the characters following this
> formatting character shall be treated as [...]
>
> SUB: In a character stream, all the characters following this
> formatting character shall be treated as [...]

This isn't similar to ZWSP or ZWJ or ZWNJ. Those formatting characters
are not stateful; they affect the rendering of, at most, the single
characters immediately preceding and following them.

The ones you suggest are stateful; they affect the rendering of
arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48
("ANSI") attribute switching, or ISO 2022 character-set switching.
Unicode tries hard to avoid encoding such things.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Devanagari and Subscript and Superscript

2015-12-16 Thread Philippe Verdy
2015-12-16 19:16 GMT+01:00 Doug Ewell :

> The ones you suggest are stateful; they affect the rendering of
> arbitrary amounts of subsequent data, in a way reminiscent of ECMA-48
> ("ANSI") attribute switching, or ISO 2022 character-set switching.
> Unicode tries hard to avoid encoding such things.


You can try as hard as you want, but there are cases where it is impossible
to avoid stateful encoding if we want to avoid disunifications, and there
are even some characters that cannot work at all without stateful analysis.

And this is not solved just by style markup when that "style" is in fact
completely semantic. The situation must be considered with more care:

- For example, the superscript Latin letter o, aka "ordinal masculine",
is not just a superscript but a notation adding the semantics of an
abbreviation for the final letters, linked to the other letters before it,
the whole being semantically a single word. The superscript style does not
create such an attachment; it creates a separate "word" inside it, so the
character was disunified from the letter o.

- But it is not good practice to encode in Unicode things that are just
styles without clear semantics (so encoding SUB/SUP is really a bad idea).

- Conversely, it is simply impossible to work with Egyptian hieroglyphs,
as the default clusters are clearly insufficient to create ANY kind of
plain text: you need extra markup to add the necessary semantics, not style,
and this markup should be encodable as plain text, without external markup
for the presentation, when this presentation is fully semantic and clear
(e.g. the Egyptian "cartouche" for names of kings).
- Similar issues occur with SignWriting and other scripts that ALWAYS
require a complex (non-linear) layout where basic clusters are clearly
insufficient in ALL texts, meaning that the characters that were encoded
are almost **useless** in all plain-text documents: you need extra "format"
characters to create some form of orthographic rule, independently of the
style or of an external markup language.

I'm in favor of adding **semantic** format characters in Unicode, not
stylistic-only format characters, as soon as there exists a well-known
orthographic convention which would work independently of styling. But for
now the encoded format characters only work on clusters that are too small;
clusters are only linear, and this is clearly not enough (even for instructing
other kinds of text analysis, such as breakers).

Then the renderers will be adapted and extended to work with more complex
clusters that have an internal structure built from simpler cluster parts. Other
renderers using the legacy rules will not be able to do that but will
attempt to render some basic fallback (possibly with special visible glyphs
for those controls).

One kind of semantic format character which is useful and encoded is the
"invisible parentheses" for mathematics, which can be encoded for example
after a radical sign: use them around a number to define the extension of
the radical to more than one digit (and make a clear visual and semantic
distinction between "sqrt(24)" and "sqrt(2)4" when you don't want to render
any parentheses, or making the distinction between "sqrt(2+sqrt(3))" and
"sqrt(2)+sqrt(3)").


Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Doug Ewell
Plug Gulp wrote:

> It will help if Unicode standard itself intrinsically supports
> generalised subscript/superscript text.

This falls outside the scope of "plain text" as defined by Unicode, in
much the same way as bold and italic styles and colors and font faces
and sizes.

There are several rich-text formats besides HTML that support arbitrary
subscript and superscript text. PDF and Word leap to mind.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Devanagari and Subscript and Superscript

2015-12-15 Thread srivas sinnathurai
Does the standard support the use of diacritics in plain text format, when used
with all and any complex scripts?

Regards

Sinnathurai

> 
> On 15 December 2015 at 17:46 Doug Ewell  wrote:
> 
> 
> Plug Gulp wrote:
> 
> > It will help if Unicode standard itself intrinsically supports
> > generalised subscript/superscript text.
> 
> This falls outside the scope of "plain text" as defined by Unicode, in
> much the same way as bold and italic styles and colors and font faces
> and sizes.
> 
> There are several rich-text formats besides HTML that support arbitrary
> subscript and superscript text. PDF and Word leap to mind.
> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 
> 
> 

>

RE: Devanagari and Subscript and Superscript

2015-12-15 Thread Doug Ewell
srivas sinnathurai wrote:

> Does the standard support the use of diacritics in plain text format,
> when used with all and any complex scripts?

It probably depends on what you mean by "support" and "diacritics." I
can type a Tamil letter followed by a combining acute accent or
diaeresis, and in Arial Unicode MS it actually looks halfway decent.
Many years ago, William Overington famously put a combining circumflex
on top of U+2604 COMET. You just type one character followed by another
and hope for the best, display-wise. You don't get any other special
behavior.
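
A small sketch of "one character followed by another", using Python's standard `unicodedata` module. TAMIL LETTER KA is chosen arbitrarily as the Tamil letter; the original message does not say which one was typed.

```python
import unicodedata

# TAMIL LETTER KA + COMBINING ACUTE ACCENT, and COMET + COMBINING CIRCUMFLEX ACCENT.
tamil_acute = "\u0B95\u0301"
comet_circumflex = "\u2604\u0302"

for seq in (tamil_acute, comet_circumflex):
    for ch in seq:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")
    # No precomposed character exists for either pair, so NFC leaves the
    # two-character sequence exactly as it was typed.
    print("unchanged by NFC:", unicodedata.normalize("NFC", seq) == seq)
```

The standard assigns properties to each character, but whether the pair looks "halfway decent" is entirely up to the font and rendering engine.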

I'm not sure if this was supposed to be a comment on my statement that
arbitrary subscript and superscript is similar to other attributes that
are not defined to be part of plain text.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Plug Gulp
 SUP character is
reached.

A general support within Unicode for subscripting and superscripting
text(characters and words) will tremendously help languages and
scripts that are not English/Latin.

Thanks and kind regards,

~Plug





>>
>> Hi,
>>
>> I am trying to understand if there is a way to use Devanagari
>> characters (and grapheme clusters) as subscript and/or superscript in
>> unicode text. It will help if someone could please direct me to any
>> document that explains how to achieve that. Is there a unicode marker
>> that will treat the next grapheme cluster in the unicode text as
>> super/subscript? For e.g. if one wants to represent "ब raise to क्ष"
>> how does one achieve that; is there a marker to represent it as
>> follows: ब + SUP + क + ् + ष
>> where SUP acts as a marker for superscripting the next grapheme
>> cluster. Similar for subscripting.
>>
>> Sorry if this is not the right place to ask this question; in that
>> case please could you direct me to the right forum?
>>
>> Thanks and kind regards
>>
>> ~Plug
>>
>> .
>>
>



Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Richard Wordingham
On Tue, 15 Dec 2015 18:00:16 + (GMT)
srivas sinnathurai  wrote:

> Does the standard support the use of diacritics in plain text format,
> when used with all and any complex scripts?

Relatively few scalar value sequences are prohibited - just possibly
sequences containing unassigned characters that are not
non-characters, but I can't think of any others.  (The
prohibition on unpaired surrogates applies to coded character
sequences, but surrogate characters aren't scalar values.) 
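
The scalar value / surrogate distinction can be checked directly. A minimal sketch: `is_scalar_value` is an illustrative helper, not a library function, and the second half relies on the fact that Python strings can hold a lone surrogate while the UTF-8 codec refuses to encode one.

```python
def is_scalar_value(cp):
    """True for code points U+0000..U+10FFFF outside the surrogate range."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


lone_surrogate = "\ud800"  # a surrogate code point, not a scalar value

try:
    lone_surrogate.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(is_scalar_value(0x0915), is_scalar_value(0xD800), encodable)
```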

It would appear by Conformance Requirement C5, 'A process shall not
assume that it is required to interpret any particular coded character
sequence', that a process is at liberty to decline to interpret a
sequence of scalar values, even if it has just interpreted it.

I am not aware of any requirements in the standard to interpret
specific character sequences.

In general, the interpretation of character sequences is undefined.
For example, a request for advice on the interpretation of
the combination of U+0331 COMBINING MACRON BELOW and U+0E39 THAI
CHARACTER SARA UU was answered with the instruction to consult the
non-existent typographical tradition.  It's been left to rendering
engine writers to define the interpretation.

Indeed, I am not sure that every sequence of defined scalar values
has an interpretation.  Most pairs of regional indicators don't have an
interpretation, and the interpretation of each variation sequence may
change at least twice, once when the base character becomes defined
(or is defined not to be a possible base character), and again when
the variation sequence is assigned an interpretation as an ill-defined
(or grossly ill-defined) family of glyphs.
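
To make the regional indicator case concrete, here is a sketch that builds a pair from two ASCII letters. The helper name `regional_indicators` is made up for this example; the code only constructs the sequence and says nothing about whether a given pair has any agreed interpretation.

```python
import unicodedata


def regional_indicators(code):
    """Map two ASCII letters A-Z to a pair of regional indicator symbols."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(c) - ord("A")) for c in code.upper())


pair = regional_indicators("fr")
names = [unicodedata.name(ch) for ch in pair]
print(names)

# All 26 * 26 = 676 pairs are valid sequences of scalar values,
# whether or not anything assigns them a meaning.
all_pairs = [regional_indicators(a + b)
             for a in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
             for b in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
```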

Do U+0337 COMBINING SHORT SOLIDUS OVERLAY and U+20E5 COMBINING REVERSE
SOLIDUS OVERLAY have a defined interpretation when their base character
is to be represented by a mirrored glyph?  Note that in general, the
Unicode standard does not define when a character is to be represented
by a mirrored glyph.  This may be defined by a lower level protocol
(the font file).

Richard.


Re: Devanagari and Subscript and Superscript

2015-12-15 Thread Khaled Hosny
On Tue, Dec 15, 2015 at 11:55:02AM +, Plug Gulp wrote:
> Please note that the teacher had to use a Circumflex Accent (Caret) to
> indicate superscript, which is an unwritten convention, in the absence
> of proper superscript support within Unicode.

If the teacher is explaining actual math to his students, then the
superscript is the least of his worries.

Math typesetting is two dimensional, and is so much more complex than
regular formatted text (let alone regular plain text) that it needs its own
typesetting engines.

There are various plain text markup languages to markup math, if one
really wants to represent complex mathematical notation in plain text.


Regards,
Khaled


Re: Devanagari and Subscript and Superscript

2015-12-11 Thread Richard Wordingham
On Wed, 9 Dec 2015 03:24:39 +
Plug Gulp <plug.g...@gmail.com> wrote:

> I am trying to understand if there is a way to use Devanagari
> characters (and grapheme clusters) as subscript and/or superscript in
> unicode text.

Why do you want to do this?  Are you asking about writing Devanagari
vertically rather than horizontally?  If that is what you want, you
should be looking at mark-up such as is found in cascading style sheets
(CSS).  It is an important issue for CJK and Mongolian, and there have
been questions as to what is needed for Indian scripts.  (There's also
an antiquarian interest for historical scripts, such as Phags-pa and
even Egyptian - moves are afoot to support the hieroglyphic script as
plain text.)

Richard.


Re: Devanagari and Subscript and Superscript

2015-12-08 Thread Richard Wordingham
On Wed, 9 Dec 2015 03:24:39 +
Plug Gulp <plug.g...@gmail.com> wrote:

> Hi,
> 
> I am trying to understand if there is a way to use Devanagari
> characters (and grapheme clusters) as subscript and/or superscript in
> unicode text.

The view is that such would not be 'plain text', and therefore need not
be catered for in Unicode.  On the other hand, the desire for
spacing raised and lowered characters is sufficient that markup to
produce them is widely available, as Martin Dürst pointed out.
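
As a sketch of the markup route, assuming HTML's `<sup>` element is the markup in question, the example from the original question ("ब raise to क्ष") could be generated like this. The function name `superscript_html` is made up for the illustration.

```python
import html


def superscript_html(base, exponent):
    """Wrap `exponent` in an HTML <sup> element following `base`."""
    return f"{html.escape(base)}<sup>{html.escape(exponent)}</sup>"


BA = "\u092C"                 # DEVANAGARI LETTER BA
KSSA = "\u0915\u094D\u0937"   # KA + VIRAMA + SSA, rendered as one conjunct

markup = superscript_html(BA, KSSA)
print(markup)
```

The plain text inside the markup is ordinary Devanagari; the raising is carried by the element, not by any character.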

Non-spacing stacked characters are not common enough for general
support to be available.  In many Indic scripts, stacking is the normal
arrangement, and is supplied via a script-specific special character
that is overloaded with a vowel cancellation symbol.  However,
font-specific deviations from vertical stacking are arranged, and
vowel marks are treated independently.  There is no provision for
vertical stacks to have horizontal offshoots.  (Scripts written
vertically are a different case.)

For characters stacked directly above and below not in the normal
modern fashion of writing words, there can be special characters for
special cases.  For example, there are U+A8EE COMBINING DEVANAGARI
LETTER PA in the Devanagari Extended block and U+0364 COMBINING LATIN
SMALL LETTER E.
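
The two characters named above can be inspected with Python's `unicodedata` module; a quick sketch:

```python
import unicodedata

for ch in ("\uA8EE", "\u0364"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"category={unicodedata.category(ch)} ccc={unicodedata.combining(ch)}")
```

Both are nonspacing marks, so each attaches to the preceding base character rather than taking up a position of its own.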

Other, clumsier scheme-specific techniques are available for other cases.
See for example the writing of nuclides with an explicit atomic number
in https://en.wikipedia.org/wiki/Nuclide.  The notation needs a mass
number at top left and an atomic number at bottom left.

A fairly general case is the annotation of kanji known as 'ruby'.
Sometimes an application or mark-up scheme will support this directly.

Richard.



Re: Devanagari and Subscript and Superscript

2015-12-08 Thread Martin J. Dürst

Hello Plug,

I suggest using HTML:

ब<sup>क्ष</sup>

Regards,   Martin.

On 2015/12/09 12:24, Plug Gulp wrote:

Hi,

I am trying to understand if there is a way to use Devanagari
characters (and grapheme clusters) as subscript and/or superscript in
unicode text. It will help if someone could please direct me to any
document that explains how to achieve that. Is there a unicode marker
that will treat the next grapheme cluster in the unicode text as
super/subscript? For e.g. if one wants to represent "ब raise to क्ष"
how does one achieve that; is there a marker to represent it as
follows: ब + SUP + क + ् + ष
where SUP acts as a marker for superscripting the next grapheme
cluster. Similar for subscripting.

Sorry if this is not the right place to ask this question; in that
case please could you direct me to the right forum?

Thanks and kind regards

~Plug

.



Devanagari and Subscript and Superscript

2015-12-08 Thread Plug Gulp
Hi,

I am trying to understand if there is a way to use Devanagari
characters (and grapheme clusters) as subscript and/or superscript in
unicode text. It will help if someone could please direct me to any
document that explains how to achieve that. Is there a unicode marker
that will treat the next grapheme cluster in the unicode text as
super/subscript? For e.g. if one wants to represent "ब raise to क्ष"
how does one achieve that; is there a marker to represent it as
follows: ब + SUP + क + ् + ष
where SUP acts as a marker for superscripting the next grapheme
cluster. Similar for subscripting.

Sorry if this is not the right place to ask this question; in that
case please could you direct me to the right forum?

Thanks and kind regards

~Plug



RE: Devanagari Letter Short A

2004-02-19 Thread Aparna A. Kulkarni
The character U+0904 (DEVANAGARI LETTER SHORT A) is not a part of ISCII 91.
Neither was it encoded in any of the earlier versions of ISCII. Hence
according to the ISCII standard this character simply cannot be formed.

Aparna A. Kulkarni

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
Behalf Of Ernest Cline
Sent: Monday, February 16, 2004 10:59 AM
To: Unicode List
Subject: Devanagari Letter Short A

I've been trying to make sense of the Indian scripts, but am
having one small difficulty.  I can't seem to find the ISCII 1991
equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

Is this a character that is part of the set accessed by the
extended code (xF0) or was this part of the ISCII 1988
standard that did not survive the changes to ISCII 1991?

Alternatively, does ISCII encode this as xA4 + xE0 as this
would seem to generate the proper glyph even tho it
violates the syllable grammar given in Section 8 of ISCII?

Or even more alternatively, am I just missing something
that should be obvious, but which  for some reason I can't see?
Even with the slight differences in the naming conventions
between ISCII and Unicode, I don't seem to be misplacing
any of the other vowels or consonants.

Ernest Cline
[EMAIL PROTECTED]






Re: Devanagari Letter Short A

2004-02-19 Thread Philippe Verdy
From: Aparna A. Kulkarni [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; 'Unicode List' [EMAIL PROTECTED]
Sent: Thursday, February 19, 2004 8:23 AM
Subject: RE: Devanagari Letter Short A


 The character U+0904 (DEVANAGARI LETTER SHORT A) is not a part of ISCII 91.
 Neither was it encoded in any of the earlier versions of ISCII. Hence
 according to the ISCII standard this character simply cannot be formed.

 Aparna A. Kulkarni

So could this character exist only for the purpose of supporting languages that
are not covered by ISCII but that share the same Devanagari script, and is then
needed for other countries than India?

(Here I think about Dravidian transcriptions).

If there's no ISCII standard related to its meaning or encoding, then what is
invalid when coding it with LETTER A then the LETTER SHORT E vowel modifier,
possibly with an intermediate INV or other ISCII-compatible control? How would
this break ISCII compatibility?

Aren't there existing practices to represent LETTER SHORT A in ISCII?




Re: Devanagari Letter Short A

2004-02-18 Thread Antoine Leca
Philippe Verdy va escriure:
 
 U+0904 DEVANAGARI LETTER SHORT A is used only for the case of an
 independent vowel. It can be viewed as a conjunct of the
 independent vowel U+0905 DEVANAGARI LETTER A and the dependent
 vowel sign U+0946 DEVANAGARI VOWEL SIGN SHORT E (noted for
 transcribing Dravidian vowels in the Unicode charts).

You may regard it this way, but that is not so.
U+0905 followed by U+0946 is really U+090E. Compare with the other
scripts to understand why.
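
The characters under discussion, and the absence of any canonical relationship between them, can be checked with `unicodedata`. A small sketch:

```python
import unicodedata

for cp in (0x0904, 0x0905, 0x090E, 0x0946):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

sequence = "\u0905\u0946"  # LETTER A + VOWEL SIGN SHORT E

# Neither U+0904 nor U+090E has a canonical decomposition, so normalization
# never turns the two-character sequence into either of them.
unchanged = unicodedata.normalize("NFC", sequence) == sequence
print(unchanged, repr(unicodedata.decomposition("\u0904")),
      repr(unicodedata.decomposition("\u090E")))
```

So whatever a particular font draws for U+0905,U+0946, software comparing strings treats it as distinct from both U+0904 and U+090E.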

 I  don't know why this is not documented, because I can find various
 sources that use U+0904 or U+0905,U+0946 which have exactly the
 same rendering and probably the same meaning and usage.

Wow! You have various sources that use a character added to Unicode
about two and a half years ago! Impressive!

About the rendering of U+0905,U+0946, since it violates the usual
rules, it is up to your system. Mine does not render it properly,
though (unless I cheat).

 I think that U+0946 was added in ISCII 1991 but was absent from ISCII
 1988

No. It was there even in ISCII 83.

 (I think it's too late to define it: ISCII 1988 has been used 
 consistently before,

H... I have really no evidence that ISCII 1988 was used at all...
Would be happy to find one, though...


Antoine




Re: Devanagari Letter Short A

2004-02-18 Thread Antoine Leca
Ernest Cline wrote:
 
 I've been trying to make sense of the Indian scripts, but am
 having one small difficulty.  I can't seem to find the ISCII 1991
 equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

I do not believe you'll find it there.
U+0904 had been added to Unicode for version 4.0. In 2001.
URL:http://www.unicode.org/consortium/utc-minutes/UTC-089-200111.html
Search for 89-C19.


 Is this a character that is part of the set accessed by the
 extended code (xF0) or was this part of the ISCII 1988
 standard that did not survive the changes to ISCII 1991?

No and no.

 
 Alternatively, does ISCII encode this as xA4 + xE0 as this
 would seem to generate the proper glyph even tho it
 violates the syllable grammar given in Section 8 of ISCII?

It does not. At the very least, if you want to generate this
character in ISCII this way, try A4 DB E0 (using INV).
This is an ugly hack, of course.

As an aside, in some version of ISCII (EA-ISCII, notably),
A4 E0 is supposed to be equivalent to AD. This is the way
the alphabet is sometimes taught to children in India.

 
Antoine



Re: Devanagari Letter Short A

2004-02-16 Thread Philippe Verdy
My understanding of the Indian scripts coded in Unicode, is that the mapping
from ISCII to Unicode is not straightforward one-to-one, because ISCII uses a
contextual encoding for characters (allowing shifts between several scripts) and
some rich-text features.

The ISCII character model is not exactly the same as the Unicode character
model, even though there was an attempt to make this mapping as simple as
possible by allocating the Unicode code points for each individual
ISCII-supported script in the same relative order, leaving gaps in the
Unicode-encoded scripts for ISCII characters that are not used in one specific
script.

The good reference for how Indian scripts are coded in Unicode is Chapter 9 of
the Unicode 4 reference:
http://www.unicode.org/versions/Unicode4.0.0/ch09.pdf
In summary, the Unicode model for Devanagari:
- uses consonant letters with an implied (default) vowel A, modified by the
next coded dependent vowel sign (matra), which creates a graphic conjunct with
the consonant, or
- uses half-forms of consonants to drop the implied vowel in initial
consonants, or
- uses a virama (halant), U+094D, to mark other omissions of the implied vowel on
dead consonant letters (most often on final consonants, but this occurs as
well on initial or medial consonants), by removing the final stem of the full
(live) consonant that is normally used to depict also a phonetic syllable
boundary with a necessary vowel. So the virama allows creating conjuncts with
other following dead or live consonants, and normally attaches both
consonant letters into the same syllable or conjunct.
- in some cases, the omission of the implied dependent vowel must not create a
ligated conjunct, so the virama still needs to represent the omission of the
vowel without creating a conjunct that would break the perceived phonetics, and a
ZWNJ is used between the dead consonant (consonant letter + virama) and the next
live consonant.
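
The virama and ZWNJ cases in the list above can be written out as code point sequences. This sketch uses KA and SSA as example consonants (an arbitrary choice); how each sequence is actually drawn depends on the font and rendering engine.

```python
import unicodedata

KA, SSA = "\u0915", "\u0937"
VIRAMA, ZWNJ = "\u094D", "\u200C"

conjunct = KA + VIRAMA + SSA                # dead KA + live SSA: may form a conjunct
explicit_virama = KA + VIRAMA + ZWNJ + SSA  # ZWNJ asks for a visible virama instead


def describe(text):
    """List each code point in `text` with its character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


for label, seq in (("conjunct", conjunct), ("explicit virama", explicit_virama)):
    print(label)
    for line in describe(seq):
        print("  ", line)
```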

There's a U+0905 pseudo-consonant /a/ which is used in the absence of a phonetic
consonant, but it follows the same encoding rule as other consonant letters
/*a/, i.e. coding another isolated vowel requires coding /a/ before the vowel
sign (matra). This encodes approximately the same thing as isolated vowels,
except that the intended rendering is different.

U+0904 DEVANAGARI LETTER SHORT A is used only for the case of an independent
vowel. It can be viewed as a conjunct of the independent vowel U+0905
DEVANAGARI LETTER A and the dependent vowel sign U+0946 DEVANAGARI VOWEL SIGN
SHORT E (noted for transcribing Dravidian vowels in the Unicode charts). I
don't know why this is not documented, because I can find various sources that
use U+0904 or U+0905,U+0946 which have exactly the same rendering and
probably the same meaning and usage. I think that U+0946 was added in ISCII 1991
but was absent from ISCII 1988 (verify, I don't have the ISCII 1988 reference
document), so U+0904 has survived just to allow a mostly one-to-one mapping with
ISCII 1988. But the addition of U+0946

Maybe I'm wrong here, and there are some reasons for this choice. There's no
canonical or compatibility equivalence defined between U+0904 and
U+0905,U+0946 (I think it's too late to define it: ISCII 1988 has been used
consistently before, and the Unicode stability policy now forbids defining
new equivalences between them).

- Original Message - 
From: Ernest Cline [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Monday, February 16, 2004 6:28 AM
Subject: Devanagari Letter Short A


 I've been trying to make sense of the Indian scripts, but am
 having one small difficulty.  I can't seem to find the ISCII 1991
 equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

 Is this a character that is part of the set accessed by the
 extended code (xF0) or was this part of the ISCII 1988
 standard that did not survive the changes to ISCII 1991?

 Alternatively, does ISCII encode this as xA4 + xE0 as this
 would seem to generate the proper glyph even tho it
 violates the syllable grammar given in Section 8 of ISCII?

 Or even more alternatively, am I just missing something
 that should be obvious, but which  for some reason I can't see?
 Even with the slight differences in the naming conventions
 between ISCII and Unicode, I don't seem to be misplacing
 any of the other vowels or consonants.

 Ernest Cline
 [EMAIL PROTECTED]




Devanagari Letter Short A

2004-02-15 Thread Ernest Cline
I've been trying to make sense of the Indian scripts, but am
having one small difficulty.  I can't seem to find the ISCII 1991
equivalent for U+0904 (DEVANAGARI LETTER SHORT A).

Is this a character that is part of the set accessed by the
extended code (xF0) or was this part of the ISCII 1988
standard that did not survive the changes to ISCII 1991?

Alternatively, does ISCII encode this as xA4 + xE0 as this
would seem to generate the proper glyph even tho it
violates the syllable grammar given in Section 8 of ISCII?

Or even more alternatively, am I just missing something
that should be obvious, but which  for some reason I can't see?
Even with the slight differences in the naming conventions
between ISCII and Unicode, I don't seem to be misplacing
any of the other vowels or consonants.

Ernest Cline
[EMAIL PROTECTED]






Re: Devanagari Glottal Stop

2003-04-06 Thread Michael Everson
I wrote:

  I would have to disagree with these Indian experts in this instance.
 The Devanagari glottal stop does not have a dot, and indeed, in the
 languages which use it, this character will certainly coexist with
 the question mark. They have different shapes, and different
 functions.

At 15:03 -0800 2003-04-05, Mark Davis wrote:

 Can you respond back to them with the information as to the
 languages involved?

I believe they read the Unicore list, don't they, Mark? N2543 and
02/394 show the character used for the Limbu language, and show the
glyph without a dot and with a horizontal headbar, which the question
mark never has. (They also show an example where, because the
typesetters didn't have the letter available, they substituted a
question mark, but that just goes to show that we need to encode
this, because it is a letter, not a punctuation mark.)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Devanagari Glottal Stop

2003-04-05 Thread Michael Everson
I would have to disagree with these Indian experts in this instance. 
The Devanagari glottal stop does not have a dot, and indeed, in the 
languages which use it, this character will certainly coexist with 
the question mark. They have different shapes, and different 
functions.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Devanagari Glottal Stop

2003-04-05 Thread Mark Davis
Can you respond back to them with the information as to the languages
involved?

Mark

[EMAIL PROTECTED]
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Saturday, April 05, 2003 01:45
Subject: Re: Devanagari Glottal Stop


 I would have to disagree with these Indian experts in this instance.
 The Devanagari glottal stop does not have a dot, and indeed, in the
 languages which use it, this character will certainly coexist with
 the question mark. They have different shapes, and different
 functions.
 --
 Michael Everson * * Everson Typography *  * http://www.evertype.com






Re: Plane 14 Tag Deprecation Issue (was Re: VS vs. P14 (was Re: Indic Devanagari Query))

2003-02-07 Thread Asmus Freytag
At 11:54 AM 2/6/03 -0800, Kenneth Whistler wrote:

My personal opinion? The whole debate about deprecation of
language tag characters is a frivolous distraction from
other technical matters of greater import, and things would
be just fine with the current state of the documentation.
But, if formal deprecation by the UTC is what it would take
to get people to stop advocating more use of the language
tags after the UTC has long determined that their use is
strongly discouraged, then so be it.


My personal opinion is that labelling them as restricted for
use with protocols requiring their use is sufficient and proper.
In the context of such protocols, the use of tag characters is
a fine mechanism. They certainly have some advantages over
ASCII-style markup (e.g. lang=...) in many situations.

Where they don't have a place is in regular 'plain' text streams.
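
For readers who have not met the Plane 14 tag characters, here is a rough sketch of how a language tag is spelled with them: U+E0001 LANGUAGE TAG followed by tag characters, each of which is the corresponding ASCII character offset by U+E0000. The helper names `language_tag` and `strip_tags` are made up for the example, and no particular protocol is implied.

```python
LANGUAGE_TAG = "\U000E0001"


def language_tag(tag):
    """Spell an ASCII language tag such as 'en' using Plane 14 tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in tag)


def strip_tags(text):
    """Remove every character in the tag range U+E0000..U+E007F."""
    return "".join(c for c in text if not 0xE0000 <= ord(c) <= 0xE007F)


tagged = language_tag("en") + "plain text"
print([f"U+{ord(c):04X}" for c in language_tag("en")])
print(strip_tags(tagged))
```

A process that does not take part in such a protocol can simply filter the tag range out, as `strip_tags` does, and be left with ordinary plain text.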

Formal deprecation would imply to me that ANY use is discouraged,
including the use with protocols that wish to make use of them.
THAT seems to be going too far in this case.

Where we have deprecated format characters in the past it has been
precisely in situations where we wanted to discourage the use of
particular 'protocols', for example for shaping and national digit
selection.

A./




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-07 Thread Andrew C. West
John H. Jenkins wrote:

 Ah, but decorative motifs are not plain text.

Ah, but it could be.




Re: Plane 14 Tag Deprecation Issue (was Re: VS vs. P14 (was Re: Indic Devanagari Query))

2003-02-07 Thread William Overington
I feel that as the matter was put forward for Public Review then it is
reasonable for someone reading of that review to respond to the review on
the basis of what is stated as the issue in the Public Review item itself.

Kenneth Whistler now states an opinion as to what the review is about and
mentions a file PropList.txt of which I was previously unaware.

Recent discussions in the later part of 2002 in this forum about the
possibilities of using language tags only started as a direct result of the
Unicode Consortium instituting the Public Review.

The recent statement by Asmus Freytag seems fine to me.  Certainly I might
be inclined to add in a little so as to produce "Plane 14 tags are reserved
for use with particular protocols requiring, or providing facilities for,
their use" so that the possibility of using them to add facilities rather
than simply using them when obligated to do so is included, but that is not
a great issue: what Asmus wrote is fine.

Public Review is, in my opinion, a valuable innovation.  Two issues have so
far been resolved using the Public Review process.  Those results do seem to
indicate the value of seeking opinions by Public Review.

As I have mentioned before I have a particular interest in the use of
Unicode in relation to the implementation of my telesoftware invention using
the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform) system.
I feel that language tags may potentially be very useful for broadcasts of
multimedia packages which include Unicode text files, by direct broadcast
satellites across whole continents.  Someone on this list, I forget who, but
I am grateful for the comment, mentioned that even if formal deprecation
goes ahead then that does not stop the language tags being used as once an
item is in Unicode it is always there.  So fine, though it would be nice if
the Unicode Specification did allow for such possibilities within its
wording.  The wording stated by Asmus Freytag pleases me, as it seems a
good, well-rounded balance between avoiding causing people who make many
widely used packages needing to include software to process language tags,
whilst still formally recognizing the opportunity for language tags to be
used to advantage in appropriate special circumstances.  I feel that that is
a magnificent compromise wording which will hopefully be widely applauded.

In using Unicode on the DVB-MHP platform I am thinking of using Unicode
characters in a file and the file being processed by a Java program which
has been broadcast.  The file PropList.txt just does not enter into it for
this usage, so it is not a problem for me as to what is in that file.  My
thinking is that many, maybe most, multimedia packages being broadcast will
not use language tags and will have no facilities for decoding them.
However, I feel that it is important to keep open the possibility that some
such packages can use language tags provided that the programs which handle
them are appropriately programmed.  There will need to be a protocol.
Hopefully a protocol already available in general internationalization and
globalization work can be used directly.  If not, hopefully a special
Panplanet protocol can be devised specifically for DVB-MHP broadcasting.

On the matter of using Unicode on the DVB-MHP platform, readers might like
to have a look at the following about the U+FFFC character.

http://www.users.globalnet.co.uk/~ngo/ast03200.htm

Readers who are interested in uses of the Private Use Area might like to
have a look at the following.  They are particularly oriented towards the
DVB-MHP platform but do have wider applications both on the web and in
computing generally.

http://www.users.globalnet.co.uk/~ngo/ast03000.htm

http://www.users.globalnet.co.uk/~ngo/ast03100.htm

http://www.users.globalnet.co.uk/~ngo/ast03300.htm

The main index page of the webspace is as follows.

http://www.users.globalnet.co.uk/~ngo

William Overington

7 February 2003


Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-07 Thread Asmus Freytag
At 01:52 AM 2/7/03 -0800, Andrew C. West wrote:

 Ah, but decorative motifs are not plain text.

Ah, but it could be.


Ah, but it wouldn't be Unicode.

A(h)./




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-06 Thread Doug Ewell
Asmus Freytag asmusf at ix dot netcom dot com wrote:

 Unicode 4.0 will be quite specific: "P14 tags are reserved for
 use with particular protocols requiring their use" is what the
 text will say more or less.

I didn't know the question of what to do about Plane 14 language tags
had already been resolved.

If that is the case, it might make sense to add an explanatory note to
the Public Review item on Plane 14 tags, or simply to remove the item.

-Doug Ewell
 Fullerton, California





VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-06 Thread Andrew C. West
James Kass wrote,

 (What happens if someone discovers a 257th variant? Do they
 get a prize? Or, would they be forever banished from polite
 society?)

I was thinking about that. 256 variants of a single character may seem a tad
excessive, but there is a common Chinese decorative motif (frequently seen on
trays and tea-pots and scarves and such like) comprising the ideograph shou4
(U+58FD, U+5900, U+5BFF) longevity written in 100 variant forms (called bai3
shou4 tu2 in Chinese). See
http://www.tydao.com/sxsu/shenhuo/minju/images/mj17.htm for an example.

A quick google on qian1 shou4 tu2 (the ideograph shou4 written in a thousand
different forms) came up with a piece of calligraphy by Wang Yunzhuang (b.1942)
which comprises the ideograph shou4 written in no less than 1,256 unique variant
forms !

Googling on wan4 shou4 tu2 (the ideograph shou4 written in 10,000 forms)
also had a number of hits, but these refer to a compilation of calligraphy by
forty artists that took 16 years to create (written on a scroll 160 metres in
length), so these may not all be unique variants.

There are also a number of other auspicious characters, such as fu2 (U+798F)
"good fortune", that may be found written in a hundred variant forms as a
decorative motif.

All in all the new variant selectors may be kept quite busy if applied to the
ideograph shou4 and its friends !

Andrew




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-06 Thread John H. Jenkins
On Thursday, February 6, 2003, at 08:47 AM, Andrew C. West wrote:


There are also a number of other auspicious characters, such as fu2
(U+798F) good fortune that may be found written in a hundred variant
forms as a decorative motif.

Ah, but decorative motifs are not plain text.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/





Re: Indic Devanagari Query

2003-02-05 Thread Andrew C. West
On Wed, 05 Feb 2003 02:00:30 -0800 (PST), [EMAIL PROTECTED] wrote:

 If these alternate forms were needed to be displayed in a single
 multi-lingual plain-text file, wouldn't we need some method of 
 tagging the runs of Latin text for their specific languages?

Is this not what the variation selectors are available for ?

And now that we are soon to have 256 of them, perhaps Unicode ought not to be
shy about using them for characters other than mathematical symbols.

Andrew




Re: Indic Devanagari Query

2003-02-05 Thread Peter_Constable

On 02/04/2003 02:52:25 PM jameskass wrote:

If these alternate forms were needed to be displayed in a single
multi-lingual plain-text file, wouldn't we need some method of
tagging the runs of Latin text for their specific languages?

The plain-text file would be legible without that -- I don't think this is
an argument in favour of plane 14 tag characters. Preserving
culturally-preferred appearance would certainly require markup of some
form, whether lang IDs or for font-face and perhaps font-feature
formatting.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







Re: Indic Devanagari Query

2003-02-05 Thread Peter_Constable

On 02/05/2003 04:05:44 AM Andrew C. West wrote:

 If these alternate forms were needed to be displayed in a single
 multi-lingual plain-text file, wouldn't we need some method of
 tagging the runs of Latin text for their specific languages?

Is this not what the variation selectors are available for ?

That is a possible technical solution to such variations, though specific
character+variant combinations would have to be approved and documented by
UTC. It's not the only solution, and might or might not be the best.




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Andrew C. West wrote,

 Is this not what the variation selectors are available for ?

 And now that we are soon to have 256 of them, perhaps Unicode ought not to be
 shy about using them for characters other than mathematical symbols.


Yes, there seem to be additional variation selectors coming in 
Unicode 4.0 as part of the 1207 (is that number right?) new
characters.
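
The figure of 256 can be checked with a little arithmetic. The sketch below assumes the two variation selector ranges as they stand in Unicode 4.0 and later (U+FE00..U+FE0F and U+E0100..U+E01EF); the function names are invented for this illustration.

```python
# Variation selector ranges, assumed from Unicode 4.0 and later.
VS_RANGES = [
    (0xFE00, 0xFE0F),    # VS1..VS16
    (0xE0100, 0xE01EF),  # VS17..VS256
]

def is_variation_selector(ch: str) -> bool:
    """True if ch is one of the variation selector code points."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)

def variation_selector_count() -> int:
    """Total number of code points across both ranges."""
    return sum(hi - lo + 1 for lo, hi in VS_RANGES)

total = variation_selector_count()
print(total)
```

The total is 16 + 240, so a 257th variant of one base character would indeed have nowhere to go.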

(What happens if someone discovers a 257th variant?  Do they
get a prize?  Or, would they be forever banished from polite
society?)

The variation selectors could be a practical and effective method 
of handling different glyph forms.

But, consider the burden of incorporating a large number of
variation selectors into a text file and contrast that with the
use of Plane Fourteen language tags.  With the P14 tags, it's
only necessary to insert two special characters, one at the
beginning of a text run, the other at the ending.
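
As a rough illustration of that cost, the sketch below wraps a run in a Plane 14 language tag. It assumes the usual layout of the Tags block (U+E0001 LANGUAGE TAG, tag characters that mirror ASCII at U+E0020..U+E007E, and U+E007F CANCEL TAG); the helper names and the sample language code are invented for the example. Strictly, each end of the run is a short sequence rather than a single character, but the overhead is still per run and not per letter.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
CANCEL_TAG = "\U000E007F"    # U+E007F CANCEL TAG

def to_tag_chars(ascii_text: str) -> str:
    """Map printable ASCII onto the tag characters U+E0020..U+E007E."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)

def tag_run(text: str, lang: str) -> str:
    """Wrap a run in a language tag and a 'cancel language tag' sequence."""
    opening = LANGUAGE_TAG + to_tag_chars(lang)
    closing = LANGUAGE_TAG + CANCEL_TAG
    return opening + text + closing

# Hypothetical example: capital eng tagged with the language code "se".
tagged = tag_run("\u014A", "se")
overhead = len(tagged) - 1
print(overhead)
```

With variation selectors the overhead would instead be one extra code point after every character whose shape matters.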

Jim Allan wrote,

 One could start with indications as to whether the text was traditional 
 Chinese, simplified Chinese, Japanese, Korean, etc. :-(
 
 But I don't see that there is anything particularly wrong with citing or 
 using a language in a different typographical tradition.
 ...

Neither do I.  I kind of like seeing variant glyphs in runs of text and
am perfectly happy to accept unusual combinations.

Perhaps those of us who deal closely with multilingual material
and are familiar with variant forms are simply more tolerant
and accepting.

 ... A linguistic 
 study of the distribution of the Eng sound might cite written forms with 
 capital letters from Sami and some from African languages, but need not 
 and probably should not be concerned about matching exactly the exact 
 typographical norms in those tongues, for _eng_ or for any other letter.

On the one hand, there's a feeling that insistence upon variant glyphs
for a particular language is provincial.  On the other hand, everyone
has the right to be provincial (or not).  IMO, it's the ability to
choose that is paramount.

If anyone wishes to distinguish different appearances of an acute
accent between, say, French and Spanish... or the difference of the
ogonek between Polish and Navajo... or the variant forms of
capital eng, then there should be a mechanism in place enabling 
them to do so.

Variation selectors would be an exact method with the V.S. characters
manually inserted where desired.  P14 tags would also work for this;
entire runs of text could be tagged and those runs could be properly
rendered once the technology catches up to the Standard.

Neither V.S. nor P14 tags should interfere with text processing
or break any existing applications.  There are pros and cons for
either approach.

Best regards,

James Kass
.




VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Peter Constable wrote,

 The plain-text file would be legible without that -- I don't think this is
 an argument in favour of plane 14 tag characters. Preserving
 culturally-preferred appearance would certainly require markup of some
 form, whether lang IDs or for font-face and perhaps font-feature
 formatting.

Any Unicode formatting character can be considered as mark-up,
even P14 tags or VSs.

The advantage of using P14 tags (...equals lang IDs mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain-text.

Best regards,

James Kass
.




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Asmus Freytag
At 06:24 PM 2/5/03 +, [EMAIL PROTECTED] wrote:

The advantages of using P14 tags (...equals lang IDs mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain-text.


The minute you have scoped tagging, you are no longer using
plain text.

The P14 tags are no different than HTML markup in that regard,
however, unlike HTML markup they can be filtered out by a
process that does not implement them. (In order to filter
out HTML, you need to know the HTML syntax rules. In order
to filter out P14 tags you only need to know their code point
range.)
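
A minimal sketch of such a filter, assuming only that the tag characters occupy the block U+E0000..U+E007F; the function name and the sample string are made up for the illustration.

```python
def strip_plane14_tags(text: str) -> str:
    """Drop every code point in the Tags block, U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

# LANGUAGE TAG + tag letters "fr", a run of text, then the cancel sequence.
tagged = "\U000E0001\U000E0066\U000E0072bonjour\U000E0001\U000E007F"
plain = strip_plane14_tags(tagged)
print(plain)
```

No knowledge of tag syntax is needed: a single range test per code point is enough.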

Variation selectors also can be ignored based on their code
point values, but unlike p14 tags, they don't become invalid
when text is cut and pasted from the middle of a string.

If 'unaware' applications treat them like unknown combining
marks and keep them with the base character like they would
any other combining mark during editing, then variation
selectors have a good chance surviving in plain text.

P14 tags do not.
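
The difference can be sketched as follows. The example assumes an editor that keeps a variation selector attached to its base character, as described above; the helper names are invented, and the sample uses U+2268 followed by U+FE00 purely as a convenient base plus selector pair.

```python
def is_vs(ch: str) -> bool:
    """True for code points in the two variation selector ranges."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def units(text: str) -> list:
    """Split text into editing units, keeping each selector with its base."""
    out = []
    for ch in text:
        if out and is_vs(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

# Scoped tags: cutting "def" out of the middle leaves the tags behind.
tagged = "\U000E0001\U000E0066\U000E0072" + "abc def ghi" + "\U000E0001\U000E007F"
excerpt = tagged[7:10]

# Variation selector: the unit cut from the middle still carries its selector.
with_vs = "a\u2268\uFE00b"
middle = units(with_vs)[1]
```

The excerpt taken from the tagged string has lost its language information, while the middle unit of the second string is still a complete base plus selector.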

Unicode 4.0 will be quite specific: "P14 tags are reserved for
use with particular protocols requiring their use" is what the
text will say, more or less.

A./






Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Peter_Constable

On 02/05/2003 12:24:39 PM jameskass wrote:

The advantages of using P14 tags (...equals lang IDs mark-up) is
that runs of text could be tagged *in a standard fashion* and
preserved in plain-text.

Sure, but why do we want to place so much demand on plain text when the
vast majority of content we interchange is in some form of marked-up or
rich text? Let's let plain text be that -- plain -- and look to the markup
conventions that we've invested so much in and that are working for us to
provide the kinds of thing that we designed markup for in the first place.
Besides, a plain-text file that begins and ends with p14 tags is a
marked-up file, whether someone calls it plain text or not. We have
little or no infrastructure for handling that form of markup, and a large
and increasing amount of infrastructure for handling the more typical forms
of markup.

I repeat, plain text remains legible without anything indicating which eng
(or whatever) may be preferred by the author, and (since the requirement
for plain text is legibility) therefore this is not really an argument for
using p14 language tags. IMO.




- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485











Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread Michael Everson
At 16:47 -0500 2003-02-05, Jim Allan wrote:


There are often conflicting orthographic usages within a language. 
Language tagging alone does not indicate whether German text is to 
be rendered in Roman or Fraktur, whether Gaelic text is to be 
rendered in Roman or Uncial, and if Uncial, a modern Uncial or more 
traditional Uncial, whether English text is in Roman or Morse Code 
or Braille.

We have script codes (very nearly a published standard) for that.

By the way, "modern uncial" and "more traditional uncial" isn't
really sufficient, I think, for describing Gaelic letterforms. See
http://www.evertype.com/celtscript/fonthist.html for a sketch of a
more robust taxonomy.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Asmus Freytag wrote,

 Variation selectors also can be ignored based on their code
 point values, but unlike p14 tags, they don't become invalid
 when text is cut and pasted from the middle of a string.

Excellent point.

 Unicode 4.0 will be quite specific: "P14 tags are reserved for
 use with particular protocols requiring their use" is what the
 text will say, more or less.

This seems to be an eminently practical solution to the P14
situation.

If I were using an application which invoked a protocol requiring
P14 tags to read a file which included P14 tags and wanted to cut
and paste text into another application, in a perfect world the
application would be savvy enough to recognize any applicable P14
tags for the selected text and insert the proper Variation Selectors
into the text stream to be pasted.

The application which received the pasted text, if it was an application
which used a protocol requiring P14 tags, would be savvy enough to
strip the variation selectors and enclose the pasted string in
the appropriate P14 tags.  If the pasted material was being inserted
into a run of text in which the same P14 tag applied, then the tags
wouldn't be inserted.  If the pasted material was being inserted
into a run of text in which a different P14 tag applied, then the
application would insert begin and end P14 tags as needed.

In a perfect world, in the best of both worlds, both P14 tags and
variation selectors could be used for this purpose.

Is it likely to happen?  Perhaps not.

But, by not formally deprecating P14 tags and using (more or less)
the language you mentioned, the possibilities remain open-ended.

Best regards,

James Kass
.




Re: VS vs. P14 (was Re: Indic Devanagari Query)

2003-02-05 Thread jameskass
.
Peter Constable wrote,

 Sure, but why do we want to place so much demand on plain text when the
 vast majority of content we interchange is in some form of marked-up or
 rich text? Let's let plain text be that -- plain -- and look to the markup
 conventions that we've invested so much in and that are working for us to
 provide the kinds of thing that we designed markup for in the first place.
 Besides, a plain-text file that begins and ends with p14 tags is a
 marked-up file, whether someone calls it plain text or not. We have
 little or no infrastructure for handling that form of markup, and a large
 and increasing amount of infrastructure for handling the more typical forms
 of markup.

We place so much demand on plain text because we use plain text.

We continue to advance from the days when “plain text” meant ASCII only
rendered in bitmapped monospaced monochrome.

We don’t rely on mark-up or higher protocols to distinguish between different
European styles of quotation marks.  We no longer need proprietary rich-text
formats and font switching abilities to be able to display Greek and Latin
text from the same file.

 I repeat, plain text remains legible without anything indicating which eng
 (or whatever) may be preferred by the author, and (since the requirement
 for plain text is legibility) therefore this is not really an argument for
 using p14 language tags. IMO.

Is legibility the only requirement of plain text?  Might additional
requirements include appropriate, correct encoding and correct display?

To illustrate a legible plain text run which displays as intended (all
things being equal) yet is not appropriately encoded (this e-mail is
being sent as plain text UTF-8):

𝑰𝒇 𝒚𝒐𝒖 𝒄𝒂𝒏 𝒓𝒆𝒂𝒅 𝒕𝒉𝒊𝒔 𝒎𝒆𝒔𝒔𝒂𝒈𝒆...
𝒚𝒐𝒖 𝒎𝒂𝒚 𝒘𝒊𝒔𝒉 𝒕𝒐 𝒋𝒐𝒊𝒏 𝑴𝑨𝑨𝑨* 𝒂𝒕
𝓫𝓵𝓪𝓱𝓫𝓵𝓪𝓱𝓫𝓵𝓪𝓱𝓭𝓸𝓽𝓬𝓸𝓶

(*𝗠𝖺𝗍𝗁 𝗔𝗅𝗉𝗁𝖺𝖻𝖾𝗍𝗌 𝗔𝖻𝗎𝗌𝖾𝗋𝗌 𝗔𝗇𝗈𝗇𝗒𝗆𝗈𝗎𝗌)

Clearly, correct and appropriate encoding (as well as legibility) should be
a requirement of plain text.  Is correct display also a valid requirement
for plain text?
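
One practical consequence of such encoding can be sketched in a few lines. The example assumes the styled letters above were Mathematical Alphanumeric Symbols (which is what the joke implies) and uses two of them, written as escapes, to show that they do not match ordinary letters until a compatibility normalization is applied.

```python
import unicodedata

# MATHEMATICAL BOLD ITALIC CAPITAL I and SMALL F, written as escapes.
styled = "\U0001D470\U0001D487"

# A naive search for the ordinary word does not find it...
found_raw = "If" in styled

# ...but compatibility normalization (NFKC) folds the styled letters to ASCII.
folded = unicodedata.normalize("NFKC", styled)
found_folded = "If" in folded
```

The run is legible to a human reader, yet to a search or sort routine it is not the word it appears to be.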

It is for some...

Respectfully,

James Kass
.




Re: Indic Devanagari Query

2003-02-04 Thread Peter_Constable

On 01/30/2003 03:03:24 PM Anto'nio Martins-Tuva'lkin wrote:

Not very different from the serbian vs. russian rendition of cyrillic
lower case i in italics. There are more examples, though (almost?)
none in the latin script.

There are indeed some examples in Latin script. For instance, there are
three different typeforms for U+014A used by different language communities.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







Re: Indic Devanagari Query

2003-02-04 Thread jameskass
.
Peter Constable wrote,

 There are indeed some examples in Latin script. For instance, there are
 three different typeforms for U+014A used by different language communities.

It's also been reported that there's a strong local preference
for a variant of U+0257 in certain African language communities.

(It would be nice to have confirmation about U+0257...)

If these alternate forms were needed to be displayed in a single
multi-lingual plain-text file, wouldn't we need some method of 
tagging the runs of Latin text for their specific languages?

Best regards,

James Kass
.




Re: Indic Devanagari Query

2003-02-04 Thread Jim Allan
Peter Constable wrote,


There are indeed some examples in Latin script. For instance, there are
three different typeforms for U+014A used by different language communities.


It's also been reported that there's a strong local preference
for a variant of U+0257 in certain African language communities.

(It would be nice to have confirmation about U+0257...)

If these alternate forms were needed to be displayed in a single
multi-lingual plain-text file, wouldn't we need some method of
tagging the runs of Latin text for their specific languages?

Best regards,

James Kass 

One could start with indications as to whether the text was traditional 
Chinese, simplified Chinese, Japanese, Korean, etc. :-(

But I don't see that there is anything particularly wrong with citing or 
using a language in a different typographical tradition. A linguistic 
study of the distribution of the Eng sound might cite written forms with 
capital letters from Sami and some from African languages, but need not 
and probably should not be concerned about matching exactly the exact 
typographical norms in those tongues, for _eng_ or for any other letter.

Jim Allan









Re: Indic Devanagari Query

2003-01-30 Thread Anto'nio Martins-Tuva'lkin
On 2003.01.29, 05:52, Aditya Gokhale [EMAIL PROTECTED] wrote:

 1. In Marathi and Sanskrit language two characters glyphs of 'la' and
 'sha' are represented differently as shown in the image below -

 (First glyph is 'la' and second one is 'sha')

 as compared to Hindi where these character glyphs are represented as
 shown in the image below -

 (First glyph is 'la' and second one is 'sha')

Not very different from the serbian vs. russian rendition of cyrillic
lower case i in italics. There are more examples, though (almost?)
none in the latin script.

--   .
António MARTINS-Tuválkin|  ()|
[EMAIL PROTECTED]   ||
R. Laureano de Oliveira, 64 r/c esq. |
PT-1885-050 MOSCAVIDE (LRS)  Não me invejo de quem tem   |
+351 917 511 459 carros, parelhas e montes   |
http://www.tuvalkin.web.pt/bandeira/ só me invejo de quem bebe   |
http://pagina.de/bandeiras/  a água em todas as fontes   |





Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi Aditya,

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 I had few query regarding representation of Devanagari script in
 Unicode
 (Code page - 0x0900 - 0x097F). Devanagari is a writing script, is used in
 Hindi, Marathi and Sanskrit languages. I have following questions - 
 
 
 In the same script code page, how do I use these two different Glyphs, to
 represent the same character ? Is there any way by which I can do it in
 an Open type font and Free type font implementation ?

Yes, it is certainly possible with an OpenType font. Please note that
FreeType is not a font format but a rendering library used to rasterize
different kinds of fonts, including TrueType and OpenType fonts.

In an OpenType font, you can include all glyphs with alternate shapes and
then select one of them depending upon the script and language. The
application should specify the script and language tags while sending
character codes to the OpenType rendering library/engine. All substitutions
will take place depending on the language and/or script selection. There
should be a default script in the font. Similarly, there will be a default
language for that script, which will be used as the fallback language if
the application does not specify which language is to be used for
processing.

From the list of alternate glyphs you may want to use the glyph for the
default language as the entry in the cmap table. This default glyph can be
substituted by an alternate glyph depending upon the language
specification. You have to use the GSUB table and write language-dependent
lookups for the substitution.

 
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

Unicode is not divided into code pages. Unlike a few older encodings, there
is only one code page for the entire Unicode standard. However, for better
readability and quick user reference the entire chart has been divided into
different sections (blocks), which you might interpret as code pages.

 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.


Unicode assigns code points to scripts only, not to languages. In fact it is
not desirable to give code points to individual languages falling under the
same script. Also, Unicode encodes characters, which have abstract meaning
and properties. Unicode does not encode glyphs. The shapes of glyphs shown
in the Unicode charts are given just for convenience and do not necessarily
represent the shapes to be used in a font. The shape of the glyph for a
Unicode character may vary from one font to another. Since it is already
possible to select the proper glyph(s) depending upon language selection,
this scheme is suitable for all Indian languages.
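
To make the script-versus-language point concrete, here is a small sketch. It only assumes the Devanagari block range quoted in the question (U+0900..U+097F); the function name and the two sample words are chosen for illustration and are not tied to any particular language.

```python
def is_devanagari(ch: str) -> bool:
    """True for code points in the Devanagari block, U+0900..U+097F."""
    return 0x0900 <= ord(ch) <= 0x097F

# Two sample Devanagari words, written as escapes.
word_a = "\u0928\u092E\u0938\u094D\u0924\u0947"
word_b = "\u0928\u092E\u0938\u094D\u0915\u093E\u0930"

a_is_deva = all(is_devanagari(c) for c in word_a)
b_is_deva = all(is_devanagari(c) for c in word_b)
```

Both tests succeed whatever language the words are taken from, which is exactly why the code points alone cannot tell Hindi, Marathi and Sanskrit apart.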


 
 
 3. Character codes for jna, shra, ksh - 
 
 In Sanskrit and Marathi jna, shra and ksh are considered as separate
 characters and not ligatures. How do we take care of this ? Can I get
 over all views on the matter from the group ? In my opinion they should
 be given different code points in the specific language code page.
 Please find below the character glyphs - 
 
 jna
 shra
 ksh

All of the above can be composed through following consonant clusters:
  jna - ja halant nya
  shra - sha halant ra
  ksh - ka halant ssha
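
Written out as code points, the three sequences above look like this. The sketch uses Python's standard unicodedata module to print the character names; the dictionary keys are simply the labels used in this thread.

```python
import unicodedata

HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA

CLUSTERS = {
    "jna": "\u091C" + HALANT + "\u091E",   # ja + halant + nya
    "shra": "\u0936" + HALANT + "\u0930",  # sha + halant + ra
    "ksh": "\u0915" + HALANT + "\u0937",   # ka + halant + ssha
}

for label, seq in CLUSTERS.items():
    codes = [f"U+{ord(ch):04X}" for ch in seq]
    names = [unicodedata.name(ch) for ch in seq]
    print(label, codes, names)
```

Each conjunct is stored as three code points; rendering it as a single shape is the job of the font and the shaping engine.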

The point that the above sequences are considered as characters in some of
the Indian languages has merit. If there is demand from native speakers
then a proposal can be submitted to Unicode. There is a predefined
procedure for proposal submission. Once this is discussed with the concerned
people and agreed upon, these ligatures can be added to the Devanagari
script itself, because the Devanagari script represents all three languages
you mentioned, namely Sanskrit, Marathi, and Hindi. Meanwhile you can write
rules for composing them from the consonant clusters.

Regards,
Keyur


__
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com




Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff
Hi,

I forgot to reply to the implementation query. The reply is inline.

--- Aditya Gokhale [EMAIL PROTECTED] wrote:
 2. Implementation Query - 
 In an implementation where I need to send / process Hindi, Marathi
 and Sanskrit data, how do I differentiate between languages (Hindi,
 Marathi and Sanskrit). Say for example, I am writing a translation
 engine, and I want to translate a document having Hindi, Marathi and
 Sanskrit Text in it, how do I know from the code points between 0x0900
 and 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?
 I would suggest that we should give different code pages for Marathi,
 Hindi and Sanskrit. May be current code page of Devanagari can be traded
 as Hindi and two new code pages for Marathi and Sanskrit be added. This
 could solve these issues. If there is any better way of solving this, any
 one suggest.

Instead of changing/recommending change in an encoding standard, your
problem can best be solved in your application. You can use tags in your
text to specify language. Unicode also provides characters for tagging
text, but their use is highly discouraged. So you can use some markup
language such as XML or HTML to specify language boundaries. Then parse
your text, identify the language boundaries, and do further processing
depending upon the language.
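
A minimal sketch of that approach, assuming the text is wrapped in XML with xml:lang attributes. The element names, the sample document and the function name are invented for the example; only the standard library parser is used.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: each paragraph declares its own language.
SAMPLE = (
    "<doc>"
    '<p xml:lang="hi">\u0928\u092E\u0938\u094D\u0924\u0947</p>'
    '<p xml:lang="mr">\u0928\u092E\u0938\u094D\u0915\u093E\u0930</p>'
    "</doc>"
)

def language_runs(xml_text: str) -> list:
    """Return (language code, text) pairs for every <p> element."""
    root = ET.fromstring(xml_text)
    return [(p.get(XML_LANG), p.text) for p in root.iter("p")]

runs = language_runs(SAMPLE)
for lang, text in runs:
    print(lang, len(text))
```

Each run can then be handed to the processing that suits its language, with no guessing involved.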

If you don't want to use tags in your text then you can predict language by
using some heuristic. This heuristic can be based on some language
properties which differ among the three languages. In this case
your processing will be divided into two phases. The first phase involves
applying some heuristic rule to identify language boundaries from plain
text and the second is actually processing the text for translation. But beware
that the result will not be accurate all the time with such heuristic
processing. Hence use of tags is recommended.

Regards,
Keyur






Re: Indic Devanagari Query

2003-01-29 Thread Aditya Gokhale

Hello,
Thanks for the reply. I will check the points as you said, as far as the
font issues are concerned. We all know how jna, shra and ksh are formed in
UNICODE and ISCII, but the point I wanted to make was, if we have to sort /
search / process the data in Devanagari script, then we have to keep track
of at least three characters and not one. This becomes tedious, though not
impossible. If a single code point is present it will be very easy to
process.
With regard to predicting the language by using some heuristic, in my
opinion it is a very risky solution, at least when I don't have much
information at stage one of my application. I am running an OCR engine on a
Devanagari page, then tagging the language based on the formatting. So I
think tagging, as I am doing right now, is a better solution. I also agree
with the views expressed by Asmus Freytag, that if we go on including all
the 6000 languages, it will be practically impossible to cross-correlate
these 'code pages'.

-Aditya






RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Aditya Gokhale wrote:
 Hello Everybody,
 I had few query regarding representation of Devanagari 
 script in Unicode

All your questions are FAQs, so I'll just reference the entries which
answer them.

 (Code page - 0x0900 - 0x097F). Devanagari is a writing 
 script, is used in Hindi, Marathi and Sanskrit languages. I 
 have following questions - 

Unicode has no code pages:
http://www.unicode.org/faq/basic_q.html#18

 1. In Marathi and Sanskrit language two characters glyphs of 
 'la' and 'sha' are represented differently as shown in the 
 image below - 
  (First glyph is 'la' and second one is 'sha')
 as compared to Hindi where these character glyphs are 
 represented as shown in the image below - 
 (First glyph is 'la' and second one is 'sha')

Unicode encodes (abstract) characters, not glyphs:
http://www.unicode.org/faq/han_cjk.html#3

(This FAQ is in the Chinese/Japanese/Korean section because it is more often
raised for Chinese ideograms.)

 In the same script code page, how do I use these two 
 different Glyphs, to represent the same character ? Is there 
 any way by which I can do it in an Open type font and Free 
 type font implementation ?

Unicode's requirements for fonts:
http://www.unicode.org/faq/font_keyboard.html#1

A few links to OpenType stuff:
http://www.unicode.org/faq/font_keyboard.html#4

 2. Implementation Query - 
 In an implementation where I need to send / process 
 Hindi, Marathi and Sanskrit data, how do I differentiate 
 between languages (Hindi, Marathi and Sanskrit). Say for 
 example, I am writing a translation engine, and I want to 
 translate a document having Hindi, Marathi and Sanskrit Text 
 in it, how do I know from the code points between 0x0900 and 
 0x097F, that the data under perusal is Hindi / Marathi / Sanskrit ?

What you need here is some sort of language tagging:
http://www.unicode.org/faq/languagetagging.html

 I would suggest that we should give different code pages 
 for Marathi, Hindi and Sanskrit. May be current code page of 
 Devanagari can be traded as Hindi and two new code pages for 
 Marathi and Sanskrit be added. This could solve these issues. 
 If there is any better way of solving this, any one suggest.

Characters are encoded per script, not per language:
http://www.unicode.org/faq/basic_q.html#17

 3. Character codes for jna, shra, ksh - 
 
 In Sanskrit and Marathi jna, shra and ksh are considered as 
 separate characters and not ligatures. How do we take care of 
 this ? Can I get over all views on the matter from the group 
 ? In my opinion they should be given different code points in 
 the specific language code page.
 Please find below the character glyphs - 

Unicode encodes Indic analytically:
http://www.unicode.org/faq/indic.html#17

 thanks,

For more details about Devanagari in Unicode, see Chapter 9 of the Standard:
http://www.unicode.org/uni2book/ch09.pdf

_ Marco




Re: Indic Devanagari Query

2003-01-29 Thread Keyur Shroff

--- Asmus Freytag [EMAIL PROTECTED] wrote:

 
 All of the above can be composed through following consonant clusters:
jna - ja halant nya
shra - sha halant ra
ksh - ka halant ssha
 
 The point that the above sequences are considered as characters in
 some of the Indian languages has merit. If there is demand from native
 speakers then a proposal can be submitted to Unicode. There is a
 predefined procedure for proposal submission. Once this is discussed
 with concerned people and agreed upon then these ligatures can be added
 in Devanagari script itself because Devanagari script represent all
 three languages you mentioned namely Sanskrit, Marathi, and Hindi.
 Meanwhile you can write rules for composing them from the consonant
 clusters.
 
 I wouldn't go so far. The fact that clusters belong together is
 something that can be handled by the software. Collation and other data
 processing needs to deal with such issues already for many other
 languages. See http://www.unicode.org/reports/tr10 on the collation
 algorithm.

I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as a separate code point. India is a big country with millions
of people geographically divided and speaking a variety of languages.
Sentiments are attached to cultures, which may vary from one geographical
area to another. So when one of the many languages falling under the same
script dominates the entire encoding for the script, other groups of
people may feel that their language has not been represented properly in
the encoding. While Unicode encodes scripts only, the aim was to provide
sufficient representation to as many languages as possible.

In Unicode many characters have been given code points regardless of the
fact that the same character could have been rendered through some compose
mechanism. This includes Indic scripts as well as other scripts. For
example, in the Devanagari script some code points are allocated to
characters (consonant + nukta) even though the same characters could be
produced with a combination of the consonant and the nukta. Similarly, in
the Latin-1 range [U+0080-U+00FF] there are a few characters which can be
produced otherwise. That is why the text should be normalized to either
the pre-composed or the de-composed character sequence before going on to
further processing in operations like searching and sorting.
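
The normalization step can be sketched with Python's standard unicodedata module. The sample characters are chosen for illustration: U+0929 is a precomposed consonant with nukta, and U+00E9 is a precomposed Latin-1 letter. The helper name is made up.

```python
import unicodedata

nnna = "\u0929"            # DEVANAGARI LETTER NNNA (precomposed)
na_nukta = "\u0928\u093C"  # NA followed by DEVANAGARI SIGN NUKTA
e_acute = "\u00E9"         # LATIN SMALL LETTER E WITH ACUTE
e_combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the decomposed form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

raw_equal = nnna == na_nukta
normalized_equal = same_text(nnna, na_nukta)
latin_equal = same_text(e_acute, e_combining)
```

A raw comparison treats the two spellings as different strings; after normalization they compare equal, so searching and sorting see one character however it was typed.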

Also, many times processing of text depends on the smallest addressable
unit of that language. Again as discussed in earlier e-mails this may vary
from one language to another in the same script. Consider a case when a
language processor/application wants to count the number of characters in
some text in order to find number of keystrokes required to input the text.
Further assume that API functions used for this purpose are based on either
WChar (wide characters) or UTF-8. In this case it is very much necessary
that you assign the character, say Kssha, to the class consonant. Since
assignment to this class consonant applies to single code point (the
smallest addressable unit) and not to the sequence of codes, it is very
much necessary to have single code point for the character Kssha.

This is my understanding. Please enlighten me if I am wrong.

Regards,
Keyur






Re: Indic Devanagari Query

2003-01-29 Thread John Cowan
Keyur Shroff scripsit:

 Sentiments are attached with cultures which may vary from one geographical
 area to another. So when one of the many languages falling under the same
 script dominate the entire encoding for the script, then other group of
 people may feel that their language has not been represented properly in
 the encoding. 

Indeed, they may have such beliefs, but those beliefs are based on two
incorrect notions: that what the charts show is normative, and that the
codepoint is the proper unit of processing.

 In Unicode many characters have been given codepoints regardless of the
 fact that the same character could have been rendered through some compose
 mechanism. 

In every case this was done for backward compatibility with existing
encodings.  No new codepoints of this type will be added in future.

 That is why the text should be normalized to either pre-composed or
 de-composed character sequence before going for further processing in
 operations like searching and sorting.

The collation algorithm makes allowance for these points.
It will be quite typical to tailor the algorithm to take language-specific
rules into account.

 Also, many times processing of text depends on the smallest addressable
 unit of that language. Again as discussed in earlier e-mails this may vary
 from one language to another in the same script. Consider a case when a
 language processor/application wants to count the number of characters in
 some text in order to find number of keystrokes required to input the text.

This will not work without knowledge of the keyboard layout in any case.
To enter Latin-1 characters on the Windows U.S. keyboard requires 5 keystrokes,
but they are represented by one or two Unicode characters.

-- 
Henry S. Thompson said, / Syntactic, structural,     John Cowan
Value constraints we / Express on the fly.           [EMAIL PROTECTED]
Simon St. Laurent: Your / Incomprehensible           http://www.reutershealth.com
Abracadabralike / schemas must die!                  http://www.ccil.org/~cowan




Re: Indic Devanagari Query

2003-01-29 Thread Michael Everson
At 02:13 -0800 2003-01-29, Keyur Shroff wrote:

I beg to differ with you on this point. Merely having some provision for
composing a character doesn't mean that the character is not a candidate
for inclusion as separate code point.


Yes, it does.


India is a big country with millions of people geographically 
divided and speaking variety of languages. Sentiments are attached 
with cultures which may vary from one geographical area to another. 
So when one of the many languages falling under the same script 
dominate the entire encoding for the script, then other group of 
people may feel that their language has not been represented 
properly in the encoding.

A lot of these feelings are simply WRONG, and that has to be faced. 
The syllable KSSA may be treated as a single letter, but this does 
not change the fact that it is a ligature of KA and SSA and that it 
can be represented in Unicode by a string of three characters.
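
As an illustration of the three-character string referred to here, the following minimal sketch lists the code points of the Devanagari conjunct KSSA. It assumes Python 3 and its standard unicodedata module, which are not part of the original discussion; the helper name describe is hypothetical.

```python
import unicodedata

# The conjunct KSSA as it is stored in Unicode text: KA + VIRAMA + SSA.
kssa = "\u0915\u094D\u0937"

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

if __name__ == "__main__":
    for entry in describe(kssa):
        print(entry)
```

Whether the font shows this sequence as a single ligature glyph is a rendering matter; the stored text stays three code points.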

In Unicode many characters have been given codepoints regardless of the
fact that the same character could have been rendered through some compose
mechanism. This includes Indic scripts as well as other scripts. For
example, in Devanagari script some code points are allocated to characters
(ConsonantNukta) even though the same characters could be produced with
combination of the consonant and Nukta.


There are historical and compatibility reasons that most of this 
stuff, as well as the similar stuff in the Latin range, were encoded. 
At one point some years ago the line was drawn, normalization was 
enacted, and that was that.

Also, many times processing of text depends on the smallest addressable
unit of that language. Again as discussed in earlier e-mails this may vary
from one language to another in the same script. Consider a case when a
language processor/application wants to count the number of characters in
some text in order to find number of keystrokes required to input the text.


I can't think of any reason why this would be useful. And what if you 
were not typing, but speaking to your computer? Then there would be 
no keystrokes at all!

Further assume that API functions used for this purpose are based on either
WChar (wide characters) or UTF-8. In this case it is very much necessary
that you assign the character, say Kssha, to the class consonant. Since
assignment to this class consonant applies to single code point (the
smallest addressable unit) and not to the sequence of codes, it is very
much necessary to have single code point for the character Kssha.


We are not going to encode KSSA as a single character. It is a 
ligature of KA and SSA, and can already be represented in Unicode. 
You need to handle this consonant issue with some other protocol.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: Indic Devanagari Query

2003-01-29 Thread Kent Karlsson


  I wouldn't go so far. The fact that clusters belong together is something
  that can be handled by the software. Collation and other data processing 
  needs to deal with such issues already for many other languages. See 
  http://www.unicode.org/reports/tr10 on the collation algorithm.
 
 I beg to differ with you on this point. Merely having some provision for
 composing a character doesn't mean that the character is not a candidate
 for inclusion as separate code point. 

At this point, having some provision for composing a particular letter
is very much preventing it from being encoded at a separate code position.
This is due mostly to the fixation of normal forms (except for very rare
error corrections).

 In Unicode many characters have been given codepoints regardless of the
 fact that the same character could have been rendered through some compose
 mechanism. This includes Indic scripts as well as other scripts. For

For legacy reasons, yes.  These reasons no longer apply for
not-yet-encoded compositions.

 Also, many times processing of text depends on the smallest addressable
 unit of that language. Again as discussed in earlier e-mails this may vary
 from one language to another in the same script. Consider a case when a
 language processor/application wants to count the number of characters in
 some text in order to find number of keystrokes required to input the text.

You cannot find the number of keystrokes that way.  Not even 
if you know which keyboard (and disregarding backspace).  E.g.
ä can be produced by one or two (or more, if you count hex input)
keystrokes on (most) Swedish keyboards.
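
To make the point concrete, here is a minimal sketch, assuming Python 3 and its standard unicodedata module (neither appears in the original message). It shows that the same letter ä may be stored as one or two code points, so a code point count says nothing reliable about keystrokes. The helper name code_point_count is hypothetical.

```python
import unicodedata

precomposed = "\u00E4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

def code_point_count(text: str) -> int:
    """Number of code points in the string (hypothetical helper)."""
    return len(text)

if __name__ == "__main__":
    print(code_point_count(precomposed))
    print(code_point_count(decomposed))
    # Both spellings are canonically equivalent: NFC maps them to one form.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)
```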

 Further assume that API functions used for this purpose are based on either
 WChar (wide characters) or UTF-8. In this case it is very much necessary
 that you assign the character, say Kssha, to the class consonant. Since
 assignment to this class consonant applies to single code point (the
 smallest addressable unit) and not to the sequence of codes, it is very
 much necessary to have single code point for the character Kssha.

No, that is not the case.  E.g. Hungarian (Magyar) has gy, ny, ly
(and more) as letters (look in a Hungarian dictionary, and its headings).
Similarly, Albanian has dh, rr, th (and more) as letters. None of
these combinations are candidates for single code point allocation.  For 
compatibility reasons the Dutch ij got a single code point, but it
is better to just use i followed by j (though that has some
difficulties; e.g. the titlecase of ijs is IJs, not Ijs).
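
The titlecase difficulty can be sketched as below. This is only an illustration, assuming Python 3; the function dutch_titlecase is a hypothetical helper, not an existing library call, and it handles just the word-initial "ij" case mentioned above.

```python
def dutch_titlecase(word: str) -> str:
    """Titlecase one word, treating a leading 'ij' as a single letter.

    Hypothetical helper for illustration only; real Dutch casing would be
    handled by locale-aware library code.
    """
    if word[:2].lower() == "ij":
        return "IJ" + word[2:].lower()
    return word[:1].upper() + word[1:].lower()

if __name__ == "__main__":
    # Language-neutral capitalization only raises the first code point.
    print("ijs".capitalize())
    # The tailored helper raises both letters of the digraph.
    print(dutch_titlecase("ijs"))
```

The rule lives in the casing logic for the language, which is the kind of "other protocol" that avoids needing a separate code point.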

/Kent K





Re: Indic Devanagari Query

2003-01-29 Thread Christopher John Fynn
 Michael Everson wrote:

 At 02:13 -0800 2003-01-29, Keyur Shroff wrote:
 I beg to differ with you on this point. Merely having some provision for
 composing a character doesn't mean that the character is not a candidate
 for inclusion as separate code point.
 
 Yes, it does.
 
 India is a big country with millions of people geographically 
 divided and speaking variety of languages. Sentiments are attached 
 with cultures which may vary from one geographical area to another. 
 So when one of the many languages falling under the same script 
 dominate the entire encoding for the script, then other group of 
 people may feel that their language has not been represented 
 properly in the encoding.

 A lot of these feelings are simply WRONG, and that has to be faced. 
 The syllable KSSA may be treated as a single letter, but this does 
 not change the fact that it is a ligature of KA and SSA and that it 
 can be represented in Unicode by a string of three characters.

Of course an anomaly is that KSSA *is* encoded in the Tibetan 
block at U+0F69. In normal Tibetan or Dzongkha words KSSA 
U+0F69 (or the combination U+0F40 U+0FB5) does not occur 
- AFAIK it is *only* used when writing Sanskrit words containing 
KSSA in Tibetan script.

I had thought that the argument for including KSSA as a separate
character in the Tibetan block (rather than only having U+0F40 and 
U+0FB5) was originally for compatibility / cross mapping with 
Devanagari and other Indic scripts.

- Chris






Re: Indic Devanagari Query

2003-01-29 Thread Rick McGowan
Aditya Gokhale wrote:

 1. In Marathi and Sanskrit language two characters glyphs of
 'la' and 'sha' are represented differently as shown in the
 image below -

Actually, for everyone's information: these allographs for Marathi were  
recently brought to our attention, and Unicode 4.0 will have a mention of  
the allographs, including pictures of the variant glyphs.

Rick





RE: Indic Devanagari Query

2003-01-29 Thread Marco Cimarosti
Christopher John Fynn wrote:
 I had thought that the argument for including KSSA as a separate
 character in the Tibetan block (rather than only having U+0F40 and 
 U+0FB5) was originally for compatibility / cross mapping with 
 Devanagari and other Indic scripts.  

Which is not a valid reason either, considering that U+0F69 and the
combination U+0F40 U+0FB5 are *canonically* equivalent. This means that
normalizing applications are not allowed to treat U+0F69 differently from
U+0F40 U+0FB5, including displaying them differently or mapping them
differently to something else.
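
The equivalence can be checked with a short sketch, assuming Python 3 and its standard unicodedata module (an assumption; no tool is named in the message). The helper name normalized_forms is hypothetical. After normalization the two spellings are the same string, so nothing downstream can tell them apart.

```python
import unicodedata

single = "\u0F69"        # TIBETAN LETTER KSSA
pair = "\u0F40\u0FB5"    # TIBETAN LETTER KA + TIBETAN SUBJOINED LETTER SSA

def normalized_forms(text):
    """Return the NFC and NFD forms of the text (hypothetical helper)."""
    return {form: unicodedata.normalize(form, text) for form in ("NFC", "NFD")}

if __name__ == "__main__":
    # Both spellings normalize to identical strings in either form.
    print(normalized_forms(single) == normalized_forms(pair))
```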

_ Marco




Indic Devanagari Query

2003-01-28 Thread Aditya Gokhale



Hello Everybody,

I have a few queries regarding the representation of the Devanagari script
in Unicode (code page 0x0900 - 0x097F). Devanagari is a writing script used
for the Hindi, Marathi and Sanskrit languages. I have the following
questions -

1. In Marathi and Sanskrit, the glyphs of the two characters 'la' and 'sha'
are represented differently, as shown in the image below -

[image not preserved in the archive]
(First glyph is 'la' and second one is 'sha')

as compared to Hindi, where these character glyphs are represented as shown
in the image below -

[image not preserved in the archive]
(First glyph is 'la' and second one is 'sha')

In the same script code page, how do I use these two different glyphs to
represent the same character? Is there any way by which I can do it in an
OpenType font and FreeType font implementation?

2. Implementation Query -
In an implementation where I need to send / process Hindi, Marathi and
Sanskrit data, how do I differentiate between the languages (Hindi, Marathi
and Sanskrit)? Say, for example, I am writing a translation engine, and I
want to translate a document having Hindi, Marathi and Sanskrit text in it:
how do I know from the code points between 0x0900 and 0x097F whether the
data under perusal is Hindi / Marathi / Sanskrit?
I would suggest that we should give different code pages to Marathi, Hindi
and Sanskrit. Maybe the current Devanagari code page can be treated as
Hindi, and two new code pages for Marathi and Sanskrit be added. This could
solve these issues. If there is any better way of solving this, please
suggest it.


3. Character codes for jna, shra, ksh - 

In Sanskrit and Marathi, jna, shra and ksh are considered separate
characters and not ligatures. How do we take care of this? Can I get the
overall views on the matter from the group? In my opinion they should be
given different code points in the specific language code page.
Please find below the character glyphs - 

jna

shra

ksh


thanks,
Aditya Gokhale.
GIST Research and Development Lab,
C-DAC Pune,
Maharashtra, India.

http://www.cdacindia.com/html/gist/gistidx.asp









RE: converting devanagari to mangal unicode

2002-12-17 Thread Marco Cimarosti
John Hudson wrote:
 At 03:09 PM 12/16/2002, Eric Muller wrote:
 
 In order to convert any Devanagari font to be rendered in 
 the same way,
 
 May be Sunil is just asking for a conversion of data, 
 presumably from 
 ISCII to Unicode.
 
 Ah, yes, this is possible. I'm so used to people asking the 
 other question 
 that I assumed from the slightly mixed up references in the 
 question that this was what Sunil intended.

OK, this is my interpretation of Sunil's question: He has text data encoded
in a so-called font encoding (e.g. Shusha), and he needs to convert it
to Unicode.

The Linux Technology Development for Indian Languages
(http://www.cse.iitk.ac.in/users/isciig/) has two ongoing projects for
similar conversions:

- iconverter
(http://www.cse.iitk.ac.in/users/isciig/iconverter/main.html)
- ISCIIlib
(http://www.cse.iitk.ac.in/users/isciig/isciilib/main.html)

_ Marco




Re: converting devanagari to mangal unicode

2002-12-17 Thread Bob_Hallissy

On 16/12/2002 22:02:36 Magda Danish (Unicode) wrote:

 I have a data in devanagri true type font i want to convert
 this data into mangal unicode.

Sunil,

For Windows or Mac use: If you want to convert data from one encoding to
Unicode, one option is to look at the free TECkit package.  There are many
non-Unicode encodings of Devanagari, so I'm unable to guess how your data
is currently encoded. TECkit is table-driven, i.e., you find or prepare a
description of the mapping between your encoding and Unicode, and then
TECkit uses that description to convert data. You may even be able to find
a mapping description already prepared as TECkit can use the XML mapping
definitions from ICU (see
http://oss.software.ibm.com/cvs/icu/charset/data/xml/).  For more
information about TECkit or to download it, see
http://www.sil.org/nrsi/teckit/

Depending on the characteristics of your encoding and your desire to do a
bit of programming, you may also be able to incorporate the ICU
(International Components for Unicode) library into your own program to do
the conversion you need. See
http://oss.software.ibm.com/developerworks/opensource/icu/project/ for more
information.

NB: One of the complexities you may run into, and which will limit your
options, is that your encoding may store text in a different order than
Unicode requires. If this is the case, TECkit can do the rearrangement for
you but I'm not sure ICU will easily do that. Certainly the current
standard for XML-based descriptions of encoding mappings as given in
Unicode Technical Report 22 (see
http://www.unicode.org/unicode/reports/tr22/ ) cannot express such
mappings.

Bob








RE: converting devanagari to mangal unicode

2002-12-17 Thread Marco Cimarosti
Bob Hallissy wrote:
 NB: One of the complexities you may run into, and which will limit your
 options, is that your encoding may store text in a different order than
 Unicode requires. If this is the case, TECkit can do the rearrangement for
 you but I'm not sure ICU will easily do that. Certainly the current
 standard for XML-based descriptions of encoding mappings as given in
 Unicode Technical Report 22 (see
 http://www.unicode.org/unicode/reports/tr22/ ) cannot express such
 mappings.

Someone pointed out to me recently that UTR#22 can indeed implement Indic
visual-to-logical mappings, provided that one chooses the whole Indic
syllable as the mapping unit. E.g.:

<a b="69 73 6B 27" u="0930 094D 0938 094D 0915 093F" c="र्स्कि" />
<!-- matraI+halfSa+Ka+Repha = Ra+Virama+Sa+Virama+Ka+matraI -->

Of course, this requires very big tables, which could be avoided by using a
smarter mechanism. Moreover, it only works with well-formed sequences in an
anticipated set of languages, but fails with misspellings or new
orthographies.
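
The syllable-as-mapping-unit idea can be sketched as a table lookup. This is only a minimal illustration in Python 3 (an assumption; the original discussion is about UTR#22 XML tables, not code). The names SYLLABLE_TABLE and convert are hypothetical, and the table holds just the single example entry given above.

```python
# Hypothetical one-entry table: legacy font bytes for a whole syllable, in
# visual order, mapped to the Unicode sequence in logical order.
SYLLABLE_TABLE = {
    bytes.fromhex("69736B27"): "\u0930\u094D\u0938\u094D\u0915\u093F",
}
MAX_KEY_LEN = max(len(key) for key in SYLLABLE_TABLE)

def convert(data: bytes) -> str:
    """Convert legacy bytes to Unicode by longest-match syllable lookup."""
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_KEY_LEN, len(data) - i), 0, -1):
            chunk = data[i:i + n]
            if chunk in SYLLABLE_TABLE:
                out.append(SYLLABLE_TABLE[chunk])
                i += n
                break
        else:
            # No table entry starts here: the unanticipated-sequence failure
            # mode described above.
            raise ValueError(f"no mapping for byte 0x{data[i]:02X} at offset {i}")
    return "".join(out)

if __name__ == "__main__":
    print(convert(bytes.fromhex("69736B27")))
```

The ValueError branch is where a misspelling or a new orthography would land, since only whole anticipated syllables have entries.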

_ Marco




Re: converting devanagari to mangal unicode

2002-12-17 Thread Peter_Constable

On 12/16/2002 05:09:04 PM Eric Muller wrote:

May be Sunil is just asking for a conversion of data, presumably from
ISCII to Unicode.

Or perhaps from one of a variety of non-standard Devanagari encodings.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







converting devanagari to mangal unicode

2002-12-16 Thread Magda Danish (Unicode)


 -Original Message-
 Date/Time:Mon Dec 16 11:28:22 EST 2002
 Contact:  [EMAIL PROTECTED]
 Report Type:  Submission (FAQ, Tech Note)
 
 HI
 
 I am Gis/Website developer my query is 
 
 I have a data in devanagri true type font i want to convert 
 this data into mangal unicode. 
 
 I want to know whether any converter is available for 
 converting devanagari to mangal unicode.
 
 Please reply ASAP
 
 Sunil
 
 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
 (End of Report)
 
 




Re: converting devanagari to mangal unicode

2002-12-16 Thread John Hudson


 I am Gis/Website developer my query is

 I have a data in devanagri true type font i want to convert
 this data into mangal unicode.

 I want to know whether any converter is available for
 converting devanagari to mangal unicode.


This is, excuse the pun, a bit of a mangled question. Mangal is Microsoft's 
Hindi UI font; it is an OpenType font that uses glyph substitution and 
positioning to correctly display the Devanagari script on top of a standard 
Unicode text string. In order to convert any Devanagari font to be rendered 
in the same way, two steps are necessary:

1. Make sure that the font has a Unicode cmap table and that the base forms 
of Devanagari characters are encoded in it in accordance with the Unicode 
standard.

2. Use Microsoft's free VOLT tool to add OpenType Layout tables for glyph 
substitution and positioning.

There is no automated way to do such a conversion, although various 
sub-stages could be automated within particular tools (e.g. defining 
Unicode cmap mappings from glyph names in FontLab). The nature of the 
OpenType Layout lookups required will depend on the glyph repertoire of the 
individual font.

See http://www.microsoft.com/typography/specs/default.htm for more 
information about making OpenType fonts for Devanagari and other scripts.

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

A book is a visitor whose visits may be rare,
or frequent, or so continual that it haunts you
like your shadow and becomes a part of you.
   - al-Jahiz, The Book of Animals




Re: converting devanagari to mangal unicode

2002-12-16 Thread Eric Muller
In order to convert any Devanagari font to be rendered in the same way, 


May be Sunil is just asking for a conversion of data, presumably from 
ISCII to Unicode.

Eric.





Re: converting devanagari to mangal unicode

2002-12-16 Thread John Hudson
At 03:09 PM 12/16/2002, Eric Muller wrote:


In order to convert any Devanagari font to be rendered in the same way,


May be Sunil is just asking for a conversion of data, presumably from 
ISCII to Unicode.

Ah, yes, this is possible. I'm so used to people asking the other question 
that I assumed from the slightly mixed up references in the question that 
this was what Sunil intended.

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

A book is a visitor whose visits may be rare,
or frequent, or so continual that it haunts you
like your shadow and becomes a part of you.
   - al-Jahiz, The Book of Animals




Devanagari

2002-12-03 Thread Vipul Garg








I have downloaded your font chart for Devanagari,
which is in the range from 0900 to 097F. I have also installed the Arial Unicode font
supplied by Microsoft office XP suite. I found that not all characters are
available for Devanagari. For example letters such as
Aadha KA, Aadha KHA, Aadha GA etc. are not available. 



These letters are required in the devanagari words such as KANYA, NANHA,
PARMATMA etc.



If you could provide the above letters then our
requirement for formation of Devanagari words would
be possible. This requirement is very crucial as we have a large volume project
on Devanagari language involving data storage in Oracle
database.



Would appreciate an
early reply.



Best Regards,

Vipul Garg

Phone: (022) 55994861

















RE: Devanagari

2002-12-03 Thread Alan Wood
Vipul Garg wrote:

 I have downloaded your font chart for Devanagari, which is in the range
 from 0900 to 097F. I have also installed the Arial Unicode font supplied
 by Microsoft office XP suite. I found that not all characters are
 available for Devanagari. For example letters such as Aadha KA, Aadha KHA,
 Aadha GA etc. are not available. 
  
 These letters are required in the devanagari words such as KANYA, NANHA,
 PARMATMA etc.
  
 If you could provide the above letters then our requirement for formation
 of Devanagari words would be possible. This requirement is very crucial as
 we have a large volume project on Devanagari language involving data
 storage in Oracle database.
 
You could try using a different font, for example one of the specialist
Devanagari fonts listed at:

http://www.alanwood.net/unicode/fonts.html#devanagari

Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)





RE: Devanagari

2002-12-03 Thread Andy White

Vipul Garg was asking why half characters were not included in the Unicode
code charts and in his copy of the Arial Unicode font.
 
More recent versions of Arial Unicode do contain half characters etc.
for Devanagari.
As for the code charts, you need to explore the Unicode web site a bit
more to find the answer.  Please see the following for
detailed information regarding the half characters etc.:
http://www.unicode.org/unicode/standard/where/
http://www.unicode.org/unicode/faq/indic.html
http://www.unicode.org/unicode/uni2book/ch09.pdf

Best Regards
Andy

You Wrote:
I have downloaded your font chart for Devanagari, which is in the range
from 0900 to 097F. I have also installed the Arial Unicode font supplied
by Microsoft office XP suite. I found that not all characters are
available for Devanagari. For example letters such as Aadha KA, Aadha
KHA, Aadha GA etc. are not available. 
 
These letters are required in the devanagari words such as KANYA, NANHA,
PARMATMA etc.





RE: Devanagari

2002-12-03 Thread Marco Cimarosti
Vipul Garg wrote:
 I have downloaded your font chart for Devanagari, which is in 
 the range from 0900 to 097F. I have also installed the Arial 
 Unicode font supplied by Microsoft office XP suite. I found 
 that not all characters are available for Devanagari. For 
 example letters such as Aadha KA, Aadha KHA, Aadha GA etc. 
 are not available. 
  
 These letters are required in the devanagari words such as 
 KANYA, NANHA, PARMATMA etc.
  
 If you could provide the above letters then our requirement 
 for formation of Devanagari words would be possible. This 
 requirement is very crucial as we have a large volume project 
 on Devanagari language involving data storage in Oracle database.
  
 Would appreciate an early reply.

Please see the document "Where is my character":

http://www.unicode.org/unicode/standard/where/

Also have a look at question 17 in the Indic FAQ:

http://www.unicode.org/unicode/faq/indic.html#17

All is explained in more detail in Section 9.1 Devanagari of the Unicode
manual:

http://www.unicode.org/unicode/uni2book/ch09.pdf

Regards.
M.C.




Re: Devanagari

2002-12-03 Thread John Cowan
[EMAIL PROTECTED] scripsit:

 Au contraire! You might find the attached gif of interest. (This is version
 1.0 of the font. Some people might have earlier versions.)

Ah, excellent.  It has not always been so.

 If you're not getting Indic shaping with Arial Unicode MS, it's very likely
 the fault of your software, not the font (and, of course, not Unicode).

Indeed, but the original poster specified the use of XP (Windows or Office,
I forget which), so I discounted that.

-- 
They do not preach                                  John Cowan
  that their God will rouse them                    [EMAIL PROTECTED]
A little before the nuts work loose.                http://www.ccil.org/~cowan
They do not teach                                   http://www.reutershealth.com
  that His Pity allows them                         --Rudyard Kipling,
to drop their job when they damn-well choose.       The Sons of Martha



