Precomposed Ethiopic (Was: Precomposed Tibetan)

2002-12-17 Thread John Hudson
At 01:25 PM 12/17/2002, Carl W. Brown wrote:


> Michael,
>
> > >I was disappointed that Unicode used precomposed encoding for Ethiopic.
> >
> > Heavens, why?
>
> I assume that you are being tongue-in-cheek.  If not:
>
> Since you key in syllables as consonant+vowel combinations you can keep the
> encoding under 256 characters like most other languages with syllabic glyphs
> and keep the processing consistent with other languages.


With which other languages? Not Yi, or the languages that use the Canadian 
Aboriginal Syllabics.

The processing model for scripts in Unicode tends to follow fairly closely 
the nature of the scripts as traditionally understood. The Tibetan script, 
like Korean Hangul and also like the Indic scripts from which it derives, 
is generative: syllables are built by the manipulation, substitution and 
positioning of sub-syllabic units. This is inherent in the design of these 
scripts.

The Ethiopic script is *not* made up of sub-syllabic units: the syllable is 
the minimum unit of writing. The same is true of Yi and the Canadian 
Aboriginal Syllabics. The fact that Ethiopic has recently been input 
phonetically should not lead to confusion about the inherent nature of the 
script, which is not generative.
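
The point that each Ethiopic syllable is an atomic unit can be illustrated with a minimal Python sketch. It assumes the standard `unicodedata` module and picks the "L" row (U+1208 to U+120E) purely as an example; the helper name `syllable_info` is hypothetical.

```python
import unicodedata

# The seven basic orders of the Ethiopic "L" series, each a separate code point.
L_SERIES = [chr(cp) for cp in range(0x1208, 0x120F)]

def syllable_info(ch):
    """Return (code point, character name, canonical decomposition) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.decomposition(ch))

for ch in L_SERIES:
    # The decomposition field is empty: no syllable is defined in terms of
    # a consonant part plus a vowel part.
    print(syllable_info(ch))
```

Each entry has its own name and an empty decomposition, so normalization leaves Ethiopic syllables untouched.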

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

A book is a visitor whose visits may be rare,
or frequent, or so continual that it haunts you
like your shadow and becomes a part of you.
   - al-Jahiz, The Book of Animals




RE: Precomposed Tibetan

2002-12-17 Thread David Starner
At 01:25 PM 12/17/2002 -0800, Carl W. Brown wrote:

> Michael,
>
> > >I was disappointed that Unicode used precomposed encoding for Ethiopic.
> >
> > Heavens, why?
>
> I assume that you are being tongue-in-cheek.  If not:


One of the issues with using a precomposed encoding instead of a decomposed 
encoding is that a poorly designed precomposed encoding will leave you 
constantly having to encode new characters as people need this and that. 
The Latin script has that problem even though the decomposed encoding is 
offered as well as a precomposed encoding. But having watched the proposals 
fairly closely for a while, I've only seen one request for more Ethiopic 
characters, a good sign that the right choice was made.  




Re: converting devanagari to mangal unicode

2002-12-17 Thread Peter_Constable

On 12/16/2002 05:09:04 PM Eric Muller wrote:

>Maybe Sunil is just asking for a conversion of data, presumably from
>ISCII to Unicode.

Or perhaps from one of a variety of non-standard Devanagari encodings.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







Re: Precomposed Tibetan

2002-12-17 Thread Peter_Constable

On 12/17/2002 09:52:18 AM Jungshik Shin wrote:

> Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
>ATSUI, and Graphite support them if there are opentype Tibetan fonts?

I know that Chris Fynn has been working on a Tibetan font, but can't
comment on progress. OpenType tables for complex rendering would not be
supported by AAT or Graphite, but Tibetan fonts could be implemented using
either of those font technologies. I don't know of any Tibetan fonts using
either of these formats, though.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485







RE: Precomposed Tibetan

2002-12-17 Thread John Hudson
At 01:32 PM 12/17/2002, Kenneth Whistler wrote:


> Peter Lofting asked:
>
> > Presumedly the present proposal of 900+ stacks is a maturation of the
> > same system. And the claim for universality is based on it being able
> > to typeset everything they have published to-date.
>
> It is based on the Founders system software, as Michael mentioned.


Steve Hartwell -- who has worked with Peter in the past on Tibetan 
implementation for Apple -- and I met two people from Founders at the 
Microsoft OpenType seminar in August. Their understanding of Tibetan seemed 
to be heavily influenced by preconceptions based on Chinese scripts 
(including setting the glyphs on fixed ideographic widths) and ignored the 
nature of the Indic scripts that influenced the development of Tibetan. 
Ironically, considering this Chinese proposal based on the Founders system, 
the representatives we met with very quickly understood the value of the 
OpenType glyph substitution and positioning that Steve explained to them, 
and immediately saw the value of this technology in handling Tibetan stack 
formation. So it is possible that while this new proposal might find 
support for reasons of backwards compatibility and handling of existing 
data, Founders themselves might begin to treat Tibetan as the complex 
script it is.

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

A book is a visitor whose visits may be rare,
or frequent, or so continual that it haunts you
like your shadow and becomes a part of you.
   - al-Jahiz, The Book of Animals




Re: Precomposed Tibetan

2002-12-17 Thread Michael Everson
At 16:12 -0800 2002-12-17, Michael (michka) Kaplan wrote:


> Everyone here KNOWS this. What Ken was pointing out is that not only will it
> create such problems, but it will not solve the problem that they claim it
> will. It was an additional reason to say no, and one they might be forced to
> acknowledge since it refutes their claims.


Well, duh, I guess, MichKa. You know I was one of those who actively 
encoded Tibetan in the first place and I did talk to the Tibetans 
during the meetings. (It was rather pleasant doing so.)

What Ken said, maybe rhetorically, was to ask if Unicoders thought 
that the "introduction of significant normalization problems into 
Tibetan (for everyone) is a worthwhile tradeoff", and, since many 
Unicoders were not at the meeting, I thought it best to comment.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Precomposed Tibetan

2002-12-17 Thread Michael (michka) Kaplan
From: "Michael Everson" <[EMAIL PROTECTED]>

> At 13:53 -0800 2002-12-17, Kenneth Whistler wrote:
>
> >The question for Unicoders is whether introduction of significant
> >normalization problems into Tibetan (for everyone) is a worthwhile tradeoff
> >for this claimed legacy ease of transition for one system, when it is
> >clear that all existing legacy data using these precomposed stacks is
> >going to have to either be reencoded anyway (or surrounded by migration
> >filters for new systems).
>
> Is it a question? To do so would be a disaster for the encoding of Tibetan.

Michael,

Everyone here KNOWS this. What Ken was pointing out is that not only will it
create such problems, but it will not solve the problem that they claim it
will. It was an additional reason to say no, and one they might be forced to
acknowledge since it refutes their claims.

MichKa [MS]





RE: Precomposed Tibetan

2002-12-17 Thread Michael Everson
At 13:53 -0800 2002-12-17, Kenneth Whistler wrote:


> The question for Unicoders is whether introduction of significant
> normalization problems into Tibetan (for everyone) is a worthwhile tradeoff
> for this claimed legacy ease of transition for one system, when it is
> clear that all existing legacy data using these precomposed stacks is
> going to have to either be reencoded anyway (or surrounded by migration
> filters for new systems).


Is it a question? To do so would be a disaster for the encoding of Tibetan.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




RE: Precomposed Tibetan

2002-12-17 Thread Michael Everson
At 13:25 -0800 2002-12-17, Carl W. Brown wrote:


> Since you key in syllables as consonant+vowel combinations


Inputting is unrelated to the encoding, and it is conceivable that a 
non-alphabetic input method could exist for Ethiopic.

> you can keep the encoding under 256 characters


There is nothing magic about this number.


> like most other languages with syllabic glyphs


What? Canadian Syllabics takes 40 columns and Yi takes 73.


> and keep the processing consistent with other languages.


I don't know what you are talking about. What processing?
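
The column counts quoted above can be checked with simple arithmetic, since a code chart column holds 16 code points. A small Python sketch, assuming the block ranges published in the Unicode code charts (the ranges themselves are not given in this message, and the helper names are hypothetical):

```python
# Block ranges (first, last) as assumed from the Unicode code charts.
BLOCKS = {
    "Ethiopic": (0x1200, 0x137F),
    "Unified Canadian Aboriginal Syllabics": (0x1400, 0x167F),
    "Yi Syllables": (0xA000, 0xA48F),
}

def block_size(first, last):
    """Number of code points in the block, assigned or not."""
    return last - first + 1

def columns(first, last):
    """Number of 16-code-point chart columns the block occupies."""
    return block_size(first, last) // 16

for name, (first, last) in BLOCKS.items():
    print(f"{name}: {block_size(first, last)} code points, "
          f"{columns(first, last)} columns")
```

All three blocks exceed 256 code points, which is the sense in which an 8-bit limit has no special status here.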
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




RE: Precomposed Tibetan

2002-12-17 Thread Kenneth Whistler
Marco commented:

> Another key point, IMHO, is verifying the following claim contained in the
> proposal document:
> 
>   "Tibetan BrdaRten characters are structure-stable characters widely
> used in education, publication, classics documentation including Tibetan
> medicine. The electronic data containing BrdaRten characters are
> estimated beyond billions. Once the Tibetan BrdaRten characters are encoded
> in BMP, many current systems supporting ISO/IEC10646 will enable Tibetan
> processing without major modification. Therefore, the international standard
> Tibetan BrdaRten characters will speed up the standardization and
> digitalization of Tibetan information, keep the consistency of
> implementation level of Tibetan and other scripts, develop the Tibetan
> culture and make the Tibetan culture resources shared by the world."  [BTW,
> billions of what!?]

The Chinese delegation at the WG2 meeting agreed with a restatement of
this as "gigabytes of data". Exactly what kind of data, they did not say,
but in principle that could consist of a few medium-size databases. It
almost certainly does not consist of billions of *documents*.

> I'd propose the following:
> 
>   1. Find all the available technical details about this BrdaRten
> encoding.

One additional detail for people. The BrdaRten stacks are currently
implemented, in the Founders System software in Tibet, as an extension
to GB 2312.

>   2. Come up with a precise machine-readable mapping file between
> BrdaRten encoding to *decomposed* Unicode Tibetan, possibly accompanied by a
> sample conversion application.
>   Reasons: (a) to make it easy to migrate BrdaRten legacy data to
> Unicode; (b) to easily update existing BrdaRten applications to export
> Unicode text; (c) to easily retrofit new Unicode applications to import
> BrdaRten text.

See the key words "without major modification" above. If the BrdaRten
stacks were encoded in Unicode, they would automatically become part
of GB 18030 (because of the UTF-like nature of that strange standard).
However, the catch is that the actual code points for Unicode/10646 are
not predictable or controllable by the Chinese NB. That means that the
final code points in GB 18030 are also not predictable -- and almost
certainly are not the same as those used by the current GB 2312 extension
in Tibet. And *that* means that the current "characters ... estimated
beyond billions" will have to be migrated to a new encoding, anyway,
once the systems are updated to GB 18030. That is the reason for the
quibble word "major" in the phrase above. All the data will be reencoded,
but the transition GB 2312 + Tibetan extension ==> GB 18030 containing
Tibetan extension is viewed as "just a mapping" and not a major system
modification.

The alternative (and even scarier) prospect is that the existing GB 2312
Tibetan extension code points would be forced as is into a new version
of GB 18030, invalidating the mapping for the existing code points,
and creating a completely new version of GB 18030 that would have to
be supported as a different "code page" from the existing GB 18030. This
would start us down the road to an indefinite number of distinct GB 18030
mappings, probably not properly labeled in interchange, with large numbers
of interoperability problems predictable (and likely to dwarf the JIS
yen sign/backslash kinds of problems). The reason this prospect is even
thinkable is that any existing implementation of the BrdaRten stacks
in a GB 2312 extension would surely be using 2-byte character encodings,
and a transition to 4-byte GB 18030 character encodings would likely
disrupt the existing implementations significantly.
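
The 2-byte versus 4-byte difference can be seen with Python's built-in `gb2312` and `gb18030` codecs. This is only a sketch of the byte lengths involved; it uses an ordinary Tibetan letter and a common Han character as stand-ins, and it says nothing about the actual byte values of the Founders extension, which are not given in this thread.

```python
han = "\u4e2d"       # a Han character that is part of GB 2312
tibetan = "\u0f66"   # TIBETAN LETTER SA, not part of GB 2312

han_bytes = han.encode("gb18030")
tibetan_bytes = tibetan.encode("gb18030")

# Characters inherited from GB 2312 keep their 2-byte form in GB 18030.
print(len(han_bytes), han_bytes.hex())

# Tibetan falls in the 4-byte range of GB 18030.
print(len(tibetan_bytes), tibetan_bytes.hex())

def fits_gb2312(ch):
    """True if the character can be encoded in plain GB 2312."""
    try:
        ch.encode("gb2312")
        return True
    except UnicodeEncodeError:
        return False
```

Software that assumes every character occupies at most two bytes would need changes to handle the second case.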

The question for Unicoders is whether introduction of significant
normalization problems into Tibetan (for everyone) is a worthwhile tradeoff
for this claimed legacy ease of transition for one system, when it is
clear that all existing legacy data using these precomposed stacks is
going to have to either be reencoded anyway (or surrounded by migration
filters for new systems).

--Ken






RE: Precomposed Tibetan

2002-12-17 Thread Carl W. Brown
Michael,

> >I was disappointed that Unicode used precomposed encoding for Ethiopic.
>
> Heavens, why?

I assume that you are being tongue-in-cheek.  If not:

Since you key in syllables as consonant+vowel combinations you can keep the
encoding under 256 characters like most other languages with syllabic glyphs
and keep the processing consistent with other languages.

Carl






RE: Precomposed Tibetan

2002-12-17 Thread Kenneth Whistler
Peter Lofting asked:

> Presumedly the present proposal of 900+ stacks is a maturation of the 
> same system. And the claim for universality is based on it being able 
> to typeset everything they have published to-date. 

It is based on the Founders system software, as Michael mentioned.

> The question is 
> whether that list of texts is representative of the full literary and 
> linguistic corpus 

It is not.

> or is only a sub-set?

It is. The Chinese delegation admitted that the collection of stacks
was aimed at modern Tibetan use and would not cover literary Tibetan.

This means that in practice systems based on the current Founders
system technology would be restricted in their coverage, and that
Unicode-based systems would have to deal with *both* the precomposed
stacks and with the rest of Tibetan, leading to Hangul-like
normalization nightmares.
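
For readers unfamiliar with the Hangul case referred to above, here is a minimal Python sketch using the standard `unicodedata` module. It shows one syllable in precomposed and decomposed form; the helper name `same_text` is hypothetical. Whether the proposed Tibetan stacks would receive comparable canonical decompositions is not stated in this thread.

```python
import unicodedata

precomposed = "\ud55c"             # HANGUL SYLLABLE HAN
decomposed = "\u1112\u1161\u11ab"  # choseong hieuh + jungseong a + jongseong nieun

def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# A naive comparison treats the two spellings as different strings.
print(precomposed == decomposed)

# After normalization they compare equal.
print(same_text(precomposed, decomposed))
```

Every search, sort, and comparison routine has to account for both spellings once a script has precomposed and decomposed forms side by side.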

> Could the Chinese be asked to provide detailed information on this 
> system and the texts that it has published so we can get an idea of 
> the domain that their stack set covers?

They were asked some questions during the meeting. The correct
way to proceed now is to provide national body feedback on their
proposal. Such feedback can, of course, contain such questions
regarding the intended scope of coverage of the repertoire in
the proposal.

--Ken

> 
> Peter Lofting
> 





RE: Precomposed Tibetan

2002-12-17 Thread Carl W. Brown
Marco,

I was disappointed that Unicode used precomposed encoding for Ethiopic.  

Carl





RE: Precomposed Tibetan

2002-12-17 Thread Marco Cimarosti
Michael Everson wrote:
> What the encoding of a set of brDa rTen precomposed syllables would 
> do would be to restrict the Tibetans to this set, to which they have 
> been restricted by the proprietary Founder software used in China. 
> These 950 syllables are insufficient to express anything but 
> newspaper and bureaucratic Tibetan.

I totally agree. My point was a different one: if it is true that there is a
large existing corpus encoded in that encoding, a prerequisite for rejecting
the proposal is demonstrating that the path to Unicode is smooth, with no
risk for the data and no unsustainable costs.

_ Marco




Re: Precomposed Tibetan

2002-12-17 Thread Peter Lofting

No strangeness: I was just taking it for granted that this resource 
is well known and in this case off topic as the question was about 
OpenType/AAT fonts for Tibetan, whereas the Tibetan Language Kit is a 
Worldscript 8-bit implementation with no smarts in the fonts it uses 
(The itl5 resource contains the state tables rather than the fonts).

Peter Lofting


At 3:08 PM -0500 12/17/02, Martin Heijdra wrote:
Strangely Peter Lofting didn't say this, since he was one of its original
developers, but there is also a (now) free Tibetan Language Kit for the Mac
at

http://www.otani.ac.jp/cri/twrp/TLK/index.html

which forms stacking characters based upon single characters.

Martin Heijdra

- Original Message -
From: "Alan Wood" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Tuesday, December 17, 2002 12:06 PM
Subject: RE: Precomposed Tibetan



 Jungshik Shin wrote:

 >  Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
 > ATSUI, and Graphite support them if there are opentype Tibetan fonts?
 > In addition to the principle of character encoding, the best practical
 > counterargument would come from a demonstration that Unicode encoding
 > model for Tibetan script does work in practice.
 >
 I don't know if it includes OpenType or AAT features, but XenoType has just
 announced a Tibetan Unicode Language Kit for Mac OS X 10.2:

 http://www.xenotypetech.com/

 This page also announces kits for Burmese, Cherokee, Inuktitut,  Kannada,
 Lao, Malayalam and Thai.

 >
 > Alan Wood
 > http://www.alanwood.net (Unicode, special characters, pesticide names)
 >
 >
 >






RE: Precomposed Tibetan

2002-12-17 Thread Marco Cimarosti
Carl W. Brown wrote:
> Marco,
> 
> I was disappointed that Unicode used precomposed encoding for 
> Ethiopic.  

Was that my fault? I'm not even a member of Unicode!

_ Marco :-)




Re: Precomposed Tibetan

2002-12-17 Thread Martin Heijdra
Strangely Peter Lofting didn't say this, since he was one of its original
developers, but there is also a (now) free Tibetan Language Kit for the Mac
at

http://www.otani.ac.jp/cri/twrp/TLK/index.html

which forms stacking characters based upon single characters.

Martin Heijdra

- Original Message -
From: "Alan Wood" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Tuesday, December 17, 2002 12:06 PM
Subject: RE: Precomposed Tibetan


> Jungshik Shin wrote:
>
> >  Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
> > ATSUI, and Graphite support them if there are opentype Tibetan fonts?
> > In addition to the principle of character encoding, the best practical
> > counterargument would come from a demonstration that Unicode encoding
> > model for Tibetan script does work in practice.
> >
> I don't know if it includes OpenType or AAT features, but XenoType has just
> announced a Tibetan Unicode Language Kit for Mac OS X 10.2:
>
> http://www.xenotypetech.com/
>
> This page also announces kits for Burmese, Cherokee, Inuktitut,  Kannada,
> Lao, Malayalam and Thai.
>
> Alan Wood
> http://www.alanwood.net (Unicode, special characters, pesticide names)
>
>
>





RE: Precomposed Tibetan

2002-12-17 Thread Michael Everson
At 11:37 -0800 2002-12-17, Carl W. Brown wrote:

> Marco,
>
> I was disappointed that Unicode used precomposed encoding for Ethiopic.


Heavens, why?
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




RE: Precomposed Tibetan

2002-12-17 Thread Michael Everson
At 19:32 +0100 2002-12-17, Marco Cimarosti wrote:


> "Tibetan BrdaRten characters are structure-stable characters widely
> used in education, publication, classics documentation including Tibetan
> medicine. The electronic data containing BrdaRten characters are
> estimated beyond billions. Once the Tibetan BrdaRten characters are encoded
> in BMP, many current systems supporting ISO/IEC10646 will enable Tibetan
> processing without major modification. Therefore, the international standard
> Tibetan BrdaRten characters will speed up the standardization and
> digitalization of Tibetan information, keep the consistency of
> implementation level of Tibetan and other scripts, develop the Tibetan
> culture and make the Tibetan culture resources shared by the world."


What the encoding of a set of brDa rTen precomposed syllables would 
do would be to restrict the Tibetans to this set, to which they have 
been restricted by the proprietary Founder software used in China. 
These 950 syllables are insufficient to express anything but 
newspaper and bureaucratic Tibetan.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: Precomposed Tibetan

2002-12-17 Thread Peter Lofting
At 7:32 PM +0100 12/17/02, Marco Cimarosti wrote:

> Once the Tibetan BrdaRten characters are encoded
> in BMP, many current systems supporting ISO/IEC10646 will enable Tibetan
> processing without major modification.


There was an earlier proposal by the Chinese for a pre-composed 
Tibetan set (ISO10646-WG2-N964) that I analyzed in Jan 1994. It had 
708 character stacks.

It was believed to be from a PC hardware card-based implementation 
that the Chinese posts & telegraph department had early on and was 
for supporting colloquial Tibetan plus a bit extra (transliterations 
of foreign place names, etc.). The 1994 proposal document was dot 
matrix printed and contained some hand-drawn glyphs, indicating that 
the PC implementation of that time could not support some chars.

Presumedly the present proposal of 900+ stacks is a maturation of the 
same system. And the claim for universality is based on it being able 
to typeset everything they have published to-date. The question is 
whether that list of texts is representative of the full literary and 
linguistic corpus or is only a sub-set?

Could the Chinese be asked to provide detailed information on this 
system and the texts that it has published so we can get an idea of 
the domain that their stack set covers?

Peter Lofting



RE: Precomposed Tibetan

2002-12-17 Thread Marco Cimarosti
Jungshik Shin wrote:
> [...]
> > http://std.dkuug.dk/jtc1/sc2/WG2/docs/n2558.pdf
> [...]
> 
>  Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
> ATSUI, and Graphite support them if there are opentype Tibetan fonts?
> In addition to the principle of character encoding, the best practical
> counterargument would come from a demonstration that Unicode encoding
> model for Tibetan script does work in practice.

Another key point, IMHO, is verifying the following claim contained in the
proposal document:

"Tibetan BrdaRten characters are structure-stable characters widely
used in education, publication, classics documentation including Tibetan
medicine. The electronic data containing BrdaRten characters are
estimated beyond billions. Once the Tibetan BrdaRten characters are encoded
in BMP, many current systems supporting ISO/IEC10646 will enable Tibetan
processing without major modification. Therefore, the international standard
Tibetan BrdaRten characters will speed up the standardization and
digitalization of Tibetan information, keep the consistency of
implementation level of Tibetan and other scripts, develop the Tibetan
culture and make the Tibetan culture resources shared by the world."  [BTW,
billions of what!?]

If the claim proves to be false, well... But if it is true (or even if it is
not but someone insists it is), I think that it is necessary to
*demonstrate* the possibility and convenience of alternative solutions.

I'd propose the following:

1. Find all the available technical details about this BrdaRten
encoding.

2. Come up with a precise machine-readable mapping file between
BrdaRten encoding to *decomposed* Unicode Tibetan, possibly accompanied by a
sample conversion application.
Reasons: (a) to make it easy to migrate BrdaRten legacy data to
Unicode; (b) to easily update existing BrdaRten applications to export
Unicode text; (c) to easily retrofit new Unicode applications to import
BrdaRten text.

3. (The opposite of point 2) come up with a precise machine-readable
mapping file between *decomposed* Unicode Tibetan and BrdaRten encoding,
possibly accompanied by a sample conversion application.
Reasons: (a) to make it easy to recycle precomposed glyphs from
existing BrdaRten fonts into modern "smart fonts"; (b) to easily update
existing BrdaRten applications to import Unicode text; (c) to easily
retrofit new Unicode applications to export BrdaRten text.
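
Points 2 and 3 could be prototyped along the following lines. This is only a sketch under stated assumptions: the mapping-file syntax is invented for illustration, the function names are hypothetical, and the single entry uses the proposed code point U+A54C for "skyi" cited elsewhere in this thread, not any real BrdaRten byte value.

```python
# Hypothetical mapping-file format: "<precomposed hex>; <decomposed hex sequence>"
MAPPING_TEXT = """\
# precomposed stack ; decomposed Unicode Tibetan
A54C; 0F66 0F90 0FB1 0F72
"""

def load_mapping(text):
    """Parse the mapping text into {precomposed char: decomposed string}."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        src, dst = line.split(";")
        table[chr(int(src.strip(), 16))] = "".join(
            chr(int(cp, 16)) for cp in dst.split())
    return table

def to_decomposed(text, table):
    """Point 2: replace each precomposed stack with its decomposed sequence."""
    return "".join(table.get(ch, ch) for ch in text)

def to_precomposed(text, table):
    """Point 3: greedy longest-match replacement of known sequences."""
    reverse = sorted(((seq, pre) for pre, seq in table.items()),
                     key=lambda pair: len(pair[0]), reverse=True)
    out, i = [], 0
    while i < len(text):
        for seq, pre in reverse:
            if text.startswith(seq, i):
                out.append(pre)
                i += len(seq)
                break
        else:
            out.append(text[i])  # no stack matches here; copy unchanged
            i += 1
    return "".join(out)
```

The reverse direction is the harder one: a real converter would need to decide what to do with stacks that have no precomposed counterpart, which this sketch simply passes through.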

_ Marco




Re: Documenting in Tamil Computing

2002-12-17 Thread Barry Caplan
At 10:34 AM 12/17/2002 +0100, Stephane Bortzmeyer wrote:
>> There are various extensions and kluges described in various RFCs
>> (ESMTP, MIME, etc. )
>
>All these extensions are referenced in the same RFC, 2821, which is
>the authoritative one about SMTP. I do not know any mainstream SMTP
>server which does not implement them.
>
>The most important for us is 8BITMIME:
>
>   Eight-bit message content transmission MAY be requested of the server
>   by a client using extended SMTP facilities, notably the "8BITMIME"
>   extension [20].  8BITMIME SHOULD be supported by SMTP servers.


There is another RFC, whose number I forget, that defines "should". Essentially it 
says you must not rely on anyone else actually implementing this feature.


>> but they are not universally implemented at the server transport
>> layer,
>
>This is absolutely wrong. sendmail, Postfix and qmail allow 8-bits
>transport for a *very* long time.

Well, aside from the fact that those are not the only pieces of mail transport sw by 
a long shot, this feature is a configurable option, and may not always be turned 
on.


>> But for arbitrary email from one address to another, you can't rely on it.
>
>I send Latin-1 (ISO 8859-1) emails for more than ten years (and
>without using quoted-printable or other similar hacks) to
>French-speaking people in various parts of the world and I'm still
>waiting for an actual problem.

>You're playing with words. 

Not really - this is very clearly dealt with in an RFC that defines "SHOULD" and 
"MUST".


>In real life, all SMTP servers support 
>8-bits mail because all SMTP servers authors are aware of the issue 
>(true, it was long and difficult to convince them all but it 
>worked). Any counter-example?

Jungshik Shin wrote: 
>Besides, some email servers still don't 
>abide by ESMTP standard and don't include '8BITMIME' in their response 
>when queried with 'EHLO' although they support 8bit clean transport 
>(as you wrote).

I did a quick survey of mail servers in the .com top level domain about 18 months ago 
to see which servers implemented 8bitmime and which didn't.  IIRC, about 20% or more 
did not. As I said earlier, that does not mean 8 bits wouldn't go through anyway if 
they are modern servers, but you can't rely on that.
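
A survey like the one described above comes down to reading each server's EHLO reply and checking for the 8BITMIME keyword. The following Python sketch shows only that parsing step. The sample reply, the host name in it, and the function names are hypothetical, and no network connection is made.

```python
SAMPLE_REPLY = """\
250-mail.example.com Hello client.example.org
250-SIZE 10240000
250-8BITMIME
250 HELP
"""

def ehlo_keywords(reply):
    """Return the set of extension keywords advertised in an EHLO reply."""
    lines = [ln for ln in reply.splitlines() if ln.strip()]
    keywords = set()
    for ln in lines[1:]:             # the first line is the greeting, not a keyword
        code, rest = ln[:3], ln[4:]  # "250-" or "250 " precedes each keyword line
        if code != "250":
            continue
        parts = rest.split()
        if parts:
            keywords.add(parts[0].upper())
    return keywords

def advertises_8bitmime(reply):
    """True if the reply lists the 8BITMIME extension."""
    return "8BITMIME" in ehlo_keywords(reply)

print(advertises_8bitmime(SAMPLE_REPLY))
```

On a live connection, Python's standard `smtplib.SMTP` object offers `ehlo()` and `has_extn("8bitmime")` for the same check. As noted in this thread, a missing keyword does not prove that the server strips the eighth bit.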

I would like to do a wider survey if someone could donate some bandwidth or maybe 
someone at W3 who was going to look into this at the time can bring this back to top 
of the things to do list (no names, but I am pretty sure he is on this list...:)

Barry Caplan
www.i18n.com





Re: Precomposed Tibetan

2002-12-17 Thread Andrew C. West
On Tue, 17 Dec 2002 08:45:05 -0800 (PST), Jungshik Shin wrote:

>  Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
> ATSUI, and Graphite support them if there are opentype Tibetan fonts?
> In addition to the principle of character encoding, the best practical
> counterargument would come from a demonstration that Unicode encoding
> model for Tibetan script does work in practice.

I have a Tibetan test page
(http://uk.geocities.com/BabelStone1357/Test/Tibetan.html) that encodes a number
of Tibetan texts in a variety of different styles using Unicode. At present the
only freely available OpenType Tibetan font is the "Tibetan Uchen" font being
developed by Tashi Tsering - a working version is available for download from
http://www.cs.virginia.edu/~tt3e/files/Research.html. This displays "native"
Tibetan (i.e. no unusual or complex Sanskrit stacks) passably well, but is still
quite primitive. However, expect some professional quality Tibetan OpenType
fonts that cover all naturally occurring Tibetan stack combinations to be
released soon.

The Unicode model for encoding Tibetan does work in practice, and providing
pre-composed forms adds no value to Tibetan users whatsoever - it takes the same
effort to type in or otherwise select the syllable "skyi", for example, whether
it's encoded as a single character (Chinese proposal U+A54C) or four characters
(U+0F66, U+0F90, U+0FB1, U+0F72). The Chinese proposal seems to suggest that the
main reason for encoding the precomposed forms is that existing Tibetan fonts
already cover this set of glyphs, and it would be easier for font designers to
have a one-to-one mapping between glyph and character than to have to map these
presentation forms to sequences of Unicode characters. Well, compared with
Mongolian, Tibetan's a doddle! And if Tibetan gets precomposed forms, then
Mongolian can't be far behind. To accept the Chinese proposal would be more than
the thin edge of the wedge, it would tear Unicode apart. But I'm sure there's no
realistic chance of this happening.
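
The four-character sequence for "skyi" given above can be inspected with a few lines of Python. This is a minimal sketch using the standard `unicodedata` module; the names `SKYI` and `code_point_names` are chosen here for illustration.

```python
import unicodedata

# "skyi" in the existing Unicode model: base letter, two subjoined
# consonants, and a vowel sign.
SKYI = "\u0f66\u0f90\u0fb1\u0f72"

def code_point_names(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

for ch, name in zip(SKYI, code_point_names(SKYI)):
    print(f"U+{ord(ch):04X}  {name}")
```

The stack is assembled by the font and rendering engine from these four code points, so no precomposed character is needed to store or exchange the syllable.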

Andrew




Re: Precomposed Tibetan

2002-12-17 Thread Michael Everson
At 10:52 -0500 2002-12-17, Jungshik Shin wrote:


> I sincerely hope the proposed character set won't become a second case
> of Hangul precomposed syllables albeit in a scale about 10 times smaller.
> It'd be interesting to see how South Korea will vote on this. It may
> not be easy to vote against it because of its past 'sin'.


One could hardly argue for an industrial reason for such a change, as 
was argued by Korea for Hangul at the Geneva meeting of WG2.

> Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
> ATSUI, and Graphite support them if there are opentype Tibetan fonts?
> In addition to the principle of character encoding, the best practical
> counterargument would come from a demonstration that Unicode encoding
> model for Tibetan script does work in practice.


I guess the Tibetans will be shown one at the October meeting of WG2 
in California. I did speak to them at the meeting and informed them 
about appropriate font technologies.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: Precomposed Tibetan

2002-12-17 Thread Alan Wood
Jungshik Shin wrote:

>  Is there any opentype/AAT font for Tibetan? Do Uniscribe, Pango,
> ATSUI, and Graphite support them if there are opentype Tibetan fonts?
> In addition to the principle of character encoding, the best practical
> counterargument would come from a demonstration that Unicode encoding
> model for Tibetan script does work in practice.
> 
I don't know if it includes OpenType or AAT features, but XenoType has just
announced a Tibetan Unicode Language Kit for Mac OS X 10.2:

http://www.xenotypetech.com/

This page also announces kits for Burmese, Cherokee, Inuktitut,  Kannada,
Lao, Malayalam and Thai.

Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)





Re: 8-bit MIME (was: Documenting in Tamil Computing)

2002-12-17 Thread Jungshik Shin

On Tue, 17 Dec 2002, Stephane Bortzmeyer wrote:

> On Tue, Dec 17, 2002 at 01:28:00PM +0100,
>  Otto Stolz <[EMAIL PROTECTED]> wrote

> > I have seen many messages, originally in ISO-8859-1-encoded French,
> > that got the high-bit of every accented character chopped off, thus
> > replacing "é" with "i", "î" with "n", and so forth.

  When was the last time you saw this?
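
The corruption Otto describes is what a 7-bit-only hop does to ISO-8859-1 bytes: the top bit of every byte is cleared. A minimal Python sketch of that effect (the function name is hypothetical):

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the most significant bit of every byte, as a 7-bit path would."""
    return bytes(b & 0x7F for b in data)

original = "\u00e9 \u00ee"                # "é î"
wire = original.encode("iso-8859-1")      # bytes 0xE9 0x20 0xEE
mangled = strip_high_bit(wire).decode("ascii")

# 0xE9 & 0x7F == 0x69 ("i") and 0xEE & 0x7F == 0x6E ("n")
print(mangled)
```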

> Last time I saw such problems was something like ten years ago. It was
> almost never the fault of the SMTP server, but of some programs on the
> destination machine (or sometimes the faults of funny gateways like
> X400 servers, something you cannot blame on the Internet).

  Although I agree that 8BITMIME is implemented and deployed
very widely these days (it's been more than two years since I received
garbled emails due to a 7bit-only path, and I receive tens of emails in 8bit
encodings every day), I'm afraid it's your unique experience that the
last time you received emails with the MSB stripped off was 10 years ago.
While trying to counter the exaggeration made against the ability of the
internet email to transport UTF-8 emails, you may have gone to the other
extreme.  In 1992, sendmail 4.x/5.x transported more than half (if not
more) of the Internet email and was not 8bit clean. That's why RFC
1468 and RFC 1557 were written circa 1992 for Japanese and Korean email
exchanges in 7bit ISO-2022-JP and ISO-2022-KR, respectively. (In the case
of ISO-2022-JP, there's another important reason: there are two major
encodings used for Japanese, Shift_JIS on DOS/Windows/Mac and EUC-JP on
Unix.) As recently as 1999, I did receive MSB-stripped emails which didn't
go through a non-SMTP gateway (e.g. X400).  Back then, some mail servers
still used 7bit-only sendmail 4.x, 5.x (on old Sun OS 4.x, AIX 3.x, 4.x,
HP/UX 8.x, IRIX, etc. machines), old versions of PMDF (old VMS machines)
and smail (on some Unix machines), while 8bit clean sendmail 8.6.x or
later had been around since the mid-1990s.

Besides, some email servers still don't
abide by ESMTP standard and don't include '8BITMIME' in their response
when queried with 'EHLO' although they support 8bit clean transport
(as you wrote).
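
To make the EHLO point concrete, here is a minimal Python sketch of how a
client might decide whether a server advertises 8BITMIME. The helper name
advertises_8bitmime and both sample replies are made up for illustration;
a real reply would come from a live SMTP session.

```python
def advertises_8bitmime(ehlo_reply: str) -> bool:
    """Return True if a multi-line EHLO reply lists the 8BITMIME keyword."""
    for line in ehlo_reply.splitlines():
        # Reply lines look like "250-KEYWORD params" or "250 KEYWORD params".
        if len(line) < 4 or not line.startswith("250"):
            continue
        words = line[4:].split()
        if words and words[0].upper() == "8BITMIME":
            return True
    return False


# Hypothetical replies: one server announces the extension, one does not.
SAMPLE_REPLY = "250-mail.example.org\n250-SIZE 10240000\n250-8BITMIME\n250 HELP"
LEGACY_REPLY = "250-old.example.org\n250 HELP"

print(advertises_8bitmime(SAMPLE_REPLY))
print(advertises_8bitmime(LEGACY_REPLY))
```

Against a live server, Python's smtplib offers SMTP.ehlo() and
SMTP.has_extn("8bitmime") for the same check. A negative answer only
means the keyword was not advertised, which is exactly the case described
above: the server may still pass 8bit data unharmed.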

Nonetheless, I agree that these days most mail transport paths are 8bit
clean. Even where they are not, Base64 and QP (which I don't regard as
hacks, as you do) are well supported by most modern MUAs, so end-users
have little problem exchanging emails in UTF-8 (or in legacy 8bit
encodings). Most of them don't have to care whether 8BITMIME is used in
transit or which C-T-E is used: 8bit, QP, or Base64.
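
As a small illustration of why the choice of C-T-E is invisible to the
end-user, the following Python sketch encodes the same UTF-8 text with QP
and with Base64 using the standard quopri and base64 modules. The sample
text and the helper is_7bit are chosen only for this example.

```python
import base64
import quopri

text = "é and î"
raw = text.encode("utf-8")          # 8bit form: contains bytes >= 0x80

qp = quopri.encodestring(raw)       # quoted-printable form
b64 = base64.b64encode(raw)         # Base64 form


def is_7bit(data: bytes) -> bool:
    """True if every byte has its high bit clear."""
    return all(b < 0x80 for b in data)


print(is_7bit(raw), is_7bit(qp), is_7bit(b64))
print(quopri.decodestring(qp) == raw, base64.b64decode(b64) == raw)
```

Both encoded forms survive a 7bit-only path, and both decode back to the
identical UTF-8 bytes, so a MIME-aware MUA shows the same text either way.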


> > take the pains to transform 8-bit MIME to some transfer-encoding
> > supported by the receiving server.
>
> Very bad idea, BTW, since it mangles the mail, which can be a problem
> with applications like cryptographic signatures. I always turn it off
> and it was never a problem. In practice (do note I refer to the real
> world), all SMTP servers accept 8-bits EVEN IF THEY DO NOT ADVERTISE
> IT PROPERLY with the 8BITMIME option.

  Doing this type of C-T-E change (from 8bit to QP/Base64)
automatically at the MTA level may be a bad idea, but doing it in the
MUA should not be a problem (that's what end-users choose). With most
modern MUAs supporting the MIME standard very well (notable exceptions
being Eudora and some popular web mail services), the 8bit-cleanness
of the transport path doesn't matter much for UTF-8 email exchange,
as I wrote above.

 IMHO, the biggest obstacle to email exchange in UTF-8 is not
7bit-only SMTP but the fact that people don't feel a strong need to
switch, because they think legacy encodings work just fine for them
(not many people need to exchange emails in languages other than their
native ones, let alone multilingual emails that cross the boundaries of
legacy encodings). Another obstacle is that popular web mail services
don't support UTF-8 well, incorrectly assuming that there is 'the'
invariant mapping between languages and MIME charsets/encodings (e.g. for
French, use ISO-8859-15/1 or Windows-1252; for Japanese, ISO-2022-JP).
Therefore, even though major MUAs have no problem with UTF-8 emails,
some people are reluctant to send all their outgoing emails in UTF-8 for
fear that their correspondents with web mail accounts won't be able to
read them without some 'user intervention'.

 Jungshik Shin





Re: Precomposed Tibetan

2002-12-17 Thread Jungshik Shin

On Fri, 13 Dec 2002, Andrew C. West wrote:

> I have just noticed that the Chinese government have presented a proposal to
> encode 956 "BrdaRten" characters in the BMP. See
> http://std.dkuug.dk/jtc1/sc2/WG2/docs/n2558.pdf

> Would I be correct in believing that there is no chance of these precomposed
> forms being accepted, even when pushed by a country with the clout of China ?

  I sincerely hope the proposed character set won't become a second case
of Hangul precomposed syllables, albeit on a scale about 10 times smaller.
It'd be interesting to see how South Korea will vote on this. It may
not be easy for it to vote against, because of its past 'sin'.

 Is there any OpenType/AAT font for Tibetan? If there are OpenType
Tibetan fonts, do Uniscribe, Pango, ATSUI, and Graphite support them?
In addition to the principle of character encoding, the best practical
counterargument would come from a demonstration that the Unicode encoding
model for the Tibetan script does work in practice.


  Jungshik





RE: converting devanagari to mangal unicode

2002-12-17 Thread Marco Cimarosti
Bob Hallissy wrote:
> NB: One of the complexities you may run into, and which will limit your
> options, is that your encoding may store text in a different order than
> Unicode requires. If this is the case, TECkit can do the rearrangement for
> you but I'm not sure ICU will easily do that. Certainly the current
> standard for XML-based descriptions of encoding mappings as given in
> Unicode Technical Report 22 (see
> http://www.unicode.org/unicode/reports/tr22/ ) cannot express such
> mappings.

Someone made me notice recently that UTR#22 can indeed implement Indic
visual-to-logical mappings, provided that one chooses the whole Indic
"syllable" as a mapping unit. E.g.:




Of course, this requires very big tables, which could be avoided by using
smarter mechanisms. Moreover, it only works with well-formed sequences in an
anticipated set of languages, but fails with misspellings or new
orthographies.
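
The syllable-as-mapping-unit idea can be sketched in a few lines of
Python. Everything about the source side is an assumption made up for
this sketch: the letters "k", "m", "i" and "A" stand in for the glyph
codes of a hypothetical visual-order font encoding, and the table is
deliberately tiny. It is not the UTR#22 XML syntax, only the same logic.

```python
# Hypothetical visual-order codes -> Unicode, one entry per whole syllable.
# "i" is the pre-base vowel sign I glyph, stored BEFORE its consonant.
SYLLABLE_TABLE = {
    "k": "\u0915",           # KA
    "ik": "\u0915\u093F",    # visual I + KA  -> KA + VOWEL SIGN I
    "kA": "\u0915\u093E",    # KA + VOWEL SIGN AA
    "m": "\u092E",           # MA
    "im": "\u092E\u093F",    # visual I + MA  -> MA + VOWEL SIGN I
    "mA": "\u092E\u093E",    # MA + VOWEL SIGN AA
}
MAX_KEY = max(len(key) for key in SYLLABLE_TABLE)


def convert(visual: str) -> str:
    """Greedy longest-match conversion using whole-syllable table entries."""
    out = []
    pos = 0
    while pos < len(visual):
        for size in range(min(MAX_KEY, len(visual) - pos), 0, -1):
            chunk = visual[pos:pos + size]
            if chunk in SYLLABLE_TABLE:
                out.append(SYLLABLE_TABLE[chunk])
                pos += size
                break
        else:
            # No entry: the input is not an anticipated syllable.
            raise ValueError(f"no table entry at offset {pos}: {visual[pos:]!r}")
    return "".join(out)


print(convert("ikmA"))
```

Both drawbacks show up directly: the table needs one entry per syllable
rather than per character, and an unanticipated sequence such as a stray
"i" with no following consonant is rejected outright.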

_ Marco




Re: CJK fonts

2002-12-17 Thread Andrew C. West
On Tue, 17 Dec 2002 02:25:13 -0800 (PST), Thomas Chan wrote:

> What edition of the _Kangxi Zidian_ are you using that gives explicit
> Mandarin readings like "yi4", or are you interpreting the fanqie notation
> yourself?  I use the 1958 edition, 1997 2nd printing published by
> Zhonghua, ISBN 7-101-00518-7.

I've got two Zhonghua Shuju editions, one published in Hong Kong, and one
published in Beijing - the pagination is different but they are both facsimile
reprints of the same original edition. I'm interpreting the fanqie notation. In
the case of YI4 (U+3CBC), Kangxi quotes Guang Yun as having a fanqie notation of
U+9B5A [YU2] / U+80BA [FEI4], whilst it quotes Ji Yun as having a fanqie
notation of U+9B5A [YU2] / U+5208 [YI4], with the additional note "pronounced
the same as U+4E42 [YI4]", which is fairly unambiguous.

> I find self-interpretation of fanqie to be fraught with peril, partially
> as fanqie was never a completely perfect transcription system, not to
> mention that fanqie from old dictionaries does not necessarily tell one
> anything about contemporary pronunciation.

Agreed, I would not use fanqie readings as evidence for contemporary
pronunciations, but the fanqie readings for obscure and obsolete ideographs
given in Ji Yun, Guang Yun, Yu Pian etc. and quoted in dictionaries like the
Kangxi Zidian are our main evidence for their pronunciation. Where do modern
dictionaries like Hanyu Da Zidian, Hanyu Da Cidian, Ci Yuan and Ci Hai get their
pinyin readings of obscure and obsolete ideographs ? Presumably from the fanqie
readings (that may date back to the Tang dynasty) in pre-modern dictionaries.
For example, what about the reading of HAN2 for U+5481 when meaning "milk" that
is given in Hanyu Da Cidian and Hanyu Da Zidian. The only reference Hanyu Da
Cidian gives for this reading is to Yu Pian, and all that my edition of "Songben
Yu Pian" says of the character is "U+4E73, X,Y qie" (can't remember the actual
fanqie notation given, but I'm sure it correlates to a reading of something like
HAM2 which would be Mandarinised to HAN2). Given that probably nobody's used
U+5481 to mean "milk" for a thousand years, Yu Pian's fanqie reading is all we
have to go on.

> e.g., U+5B7B, is a Yue (Cantonese), Hakka, and Min character, meaning
> 'last (child)' (derived from 'last child of an old man', hence the
> character's appearance as 'child' + 'to use up'), pronounced laai1 or lai1
> in Cantonese.[1]  However, the old dictionaries including Kangxi give a
> fanqie of U+6CE5 U+53F0 U+5207, which would yield an artificial nai2 in
> Mandarin, which is exactly what the _Hanyu Da Zidian_ says explicitly.
> Either the pronunciation has changed from [n-] to [l-] and reading old
> dictionaries fails to account for modern developments, or whoever chose
> U+6CE5 to indicate the onset was pronouncing U+6CE5 as *l-.
> 
> [1] While there is a long-standing ongoing sound change in Cantonese from
> [n-] to [l-], this is probably no longer one of them, and *naai1/nai1
> would now be regarded as hypercorrection.
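
The mechanical reading of a fanqie spelling discussed here (onset from
the first character, rime and tone from the second) can be sketched in
Python. This is a naive approximation on modern pinyin only, which is
precisely the "artificial" reading at issue: it ignores historical sound
change and the tone adjustments a careful reading would apply. The
function names and the onset list are assumptions of this sketch.

```python
# Pinyin onsets, two-letter ones first so "zh" is not split as "z" + "h".
ONSETS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
          "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable: str) -> tuple[str, str, str]:
    """Split a tone-numbered pinyin syllable into (onset, rime, tone)."""
    body, tone = syllable[:-1], syllable[-1]
    for onset in ONSETS:
        if body.startswith(onset):
            return onset, body[len(onset):], tone
    return "", body, tone


def naive_fanqie(upper: str, lower: str) -> str:
    """Onset of the upper speller + rime and tone of the lower speller."""
    onset, _, _ = split_syllable(upper)
    _, rime, tone = split_syllable(lower)
    return onset + rime + tone


# U+6CE5 (ni2) / U+53F0 (tai2), the spelling quoted above for U+5B7B.
print(naive_fanqie("ni2", "tai2"))
```

With the spellers read as ni2 and tai2, the sketch produces the nai2
mentioned above, with no knowledge of how the word is actually said.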

I suspect that this is a whole new can of worms, and I don't feel qualified to
make any comment without the safety net of Wang Li or Karlgren ... I'll think
about this at home, and get back to you off-list if I have anything sensible to
say.

> But what if the character is obscure, and the reading thusly also obscure?
> I think there are diminishing benefits to overly-proofing the
> unihan database for such characters--if they are so rare, then no one will
> find the character by searching on an obscure/artificial reading, and if
> it is so rare, then those interested should be consulting actual
> comprehensive dictionaries (like the Kangxi or _Hanyu Da Zidian_) instead
> of relying on a text file.  In a way, we currently have this 
> situation--the Plane 2 characters are, on average, more obscure than the
> BMP characters, and the lack of information is kind of saying "look it up
> yourself if you really, really need to know".

Agreed. Reiterating my comment below, maybe the Unihan Mandarin readings should
be completely rewritten based on Hanyu Da Zidian.

> I agree with your sentiment that "gem4" is an aberration, despite my
> support of the _Cihai_ (PRC 1979) in that it did not get included in the
> unihan database from out of nowhere.

Yes, you're probably right that the Unihan reading of GEM4 is not a mistake as
such, but a reading derived from Ci Hai - the readings for the basic CJK range
probably pre-date the publication of the more reliable Hanyu Da Cidian and
Hanyu Da Zidian. If anyone has nothing better to do with their time they might
consider completely rewriting the Unihan Mandarin readings using the Hanyu Da
Zidian as the primary (or even sole) source.

>  When U+5481 was reinvented by the
> Cantonese, it was patterned both graphically and phonologically on U+7518,
> which is gan1 'sweet' in Mandarin (gam1 in Cantonese).  U+5481 is in
> Cantonese gam3 'so (quantity)' (3 = yinqu tone); hence "gan

Re: converting devanagari to mangal unicode

2002-12-17 Thread Bob_Hallissy

On 16/12/2002 22:02:36 "Magda Danish (Unicode)" wrote:

>> I have a data in devanagri true type font i want to convert
>> this data into mangal unicode.

Sunil,

For Windows or Mac use: If you want to convert data from one encoding to
Unicode, one option is to look at the free TECkit package.  There are many
non-Unicode encodings of Devanagari, so I'm unable to guess how your data
is currently encoded. TECkit is table-driven, i.e., you find or prepare a
description of the mapping between your encoding and Unicode, and then
TECkit uses that description to convert data. You may even be able to find
a mapping description already prepared, as TECkit can use the XML mapping
definitions from ICU (see
http://oss.software.ibm.com/cvs/icu/charset/data/xml/). For more
information about TECkit or to download it, see
http://www.sil.org/nrsi/teckit/

Depending on the characteristics of your encoding and your desire to do a
bit of programming, you may also be able to incorporate the ICU
(International Components for Unicode) library into your own program to do
the conversion you need. See
http://oss.software.ibm.com/developerworks/opensource/icu/project/ for more
information.

NB: One of the complexities you may run into, and which will limit your
options, is that your encoding may store text in a different order than
Unicode requires. If this is the case, TECkit can do the rearrangement for
you but I'm not sure ICU will easily do that. Certainly the current
standard for XML-based descriptions of encoding mappings as given in
Unicode Technical Report 22 (see
http://www.unicode.org/unicode/reports/tr22/ ) cannot express such
mappings.
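
To show what "rearrangement" means here, a minimal Python sketch follows.
It assumes an earlier table step has already mapped each glyph code to a
Unicode character but left the text in visual order, so the Devanagari
vowel sign I (U+093F) still precedes the consonant cluster it belongs to.
This is not TECkit's or ICU's actual mechanism, and real font encodings
have further cases (reph, half-form glyphs) that it does not cover.

```python
import re

I_MATRA = "\u093F"                 # DEVANAGARI VOWEL SIGN I, drawn before the base
CONSONANT = "[\u0915-\u0939]"      # KA..HA, the core consonant range
VIRAMA = "\u094D"

# A visually leading I-matra followed by a cluster: (consonant virama)* consonant
PRE_BASE = re.compile(I_MATRA + "((?:" + CONSONANT + VIRAMA + ")*" + CONSONANT + ")")


def to_logical_order(visual: str) -> str:
    """Move each pre-base vowel sign I to after its consonant cluster."""
    return PRE_BASE.sub(lambda m: m.group(1) + I_MATRA, visual)


visual = "\u093F\u0915"            # I-matra, KA (as typed in a visual font)
print(to_logical_order(visual) == "\u0915\u093F")
```

A one-to-one mapping table cannot express this step, because the output
position of U+093F depends on how long the following cluster is.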

Bob








Re: 8-bit MIME (was: Documenting in Tamil Computing)

2002-12-17 Thread Stephane Bortzmeyer
On Tue, Dec 17, 2002 at 01:28:00PM +0100,
 Otto Stolz <[EMAIL PROTECTED]> wrote 
 a message of 65 lines which said:

> As of November 2002, RFC 2821 is still a Proposed Standard, and RFC 821
> is the Standard Protocol (cf. ).

For those on the mailing list not versed in IETF language, let us add
that most Internet protocols are just Proposed Standards: it takes a
lot of time to move to a higher level. (RFC 2821 is more than 18
months old.) Anyway, 8-bit MIME was already possible with RFC 821; the
difference is just editorial (RFC 2821 is easier to read since you do
not need to patch it with the many RFCs that followed).

> "SHOULD" does definitely not mean the same thing as "MUST".
> An SMTP server does not have to support 8-bit MIME mail.

You're playing with words. In real life, all SMTP servers support
8-bit mail because all SMTP server authors are aware of the issue
(true, it took a long time and much effort to convince them all, but it
worked). Any counter-example?
 
> I have seen many messages, originally in ISO-8859-1-encoded French,
> that got the high-bit of every accented character chopped off, thus
> replacing "é" with "i", "î" with "n", and so forth. 

Last time I saw such problems was something like ten years ago. It was
almost never the fault of the SMTP server, but of some programs on the
destination machine (or sometimes the faults of funny gateways like
X400 servers, something you cannot blame on the Internet).

> Of course, more and more SMTP servers support 8-bit MIME, 

All implementations already support 8-bit MIME. Some servers have
not been upgraded yet, but that is uncommon. (Remember we are talking
about a move which occurred many years ago: even if many system
administrators do not upgrade their software, in the long term,
machines are replaced and new software catches on.)

> take the pains to transform 8-bit MIME to some transfer-encoding
> supported by the receiving server. 

Very bad idea, BTW, since it mangles the mail, which can be a problem
with applications like cryptographic signatures. I always turn it off
and it was never a problem. In practice (do note I refer to the real
world), all SMTP servers accept 8-bits EVEN IF THEY DO NOT ADVERTISE
IT PROPERLY with the 8BITMIME option.

Back to Unicode: why does nobody use UTF-7? Precisely because it is no
longer necessary.
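
For readers who have not met UTF-7, a short Python sketch of what it
bought on a 7bit-only path. The sample text is arbitrary; "utf-7" and
"utf-8" are standard Python codec names.

```python
text = "déjà vu"

utf7 = text.encode("utf-7")   # non-ASCII runs become "+<modified base64>-"
utf8 = text.encode("utf-8")   # non-ASCII characters become bytes >= 0x80

print(utf7)
print(all(b < 0x80 for b in utf7))   # safe on a 7bit-only path
print(any(b >= 0x80 for b in utf8))  # needs 8bit transport or a C-T-E
print(utf7.decode("utf-7") == text)
```

Once the transport (or a MIME content-transfer-encoding) carries the
UTF-8 bytes intact, the 7bit-safe form has no remaining advantage.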






8-bit MIME (was: Documenting in Tamil Computing)

2002-12-17 Thread Otto Stolz
Dear all,

Barry Caplan had written:

> SMTP [...] is not 8 bit clean. It is very
> clear in the RFCs that only 7bit data is allowed "over the wire".


Stephane Bortzmeyer wrote:

> All these extensions are referenced in the same RFC, 2821, which is
> the authoritative one about SMTP.



As of November 2002, RFC 2821 is still a Proposed Standard, and RFC 821
is the Standard Protocol (cf. ).


> The most important for us is 8BITMIME:



Section 2.3.1 of RFC 2821, the proposed standard, says:
| The content is textual in nature, expressed using the US-ASCII
| repertoire [1]. Although SMTP extensions (such as "8BITMIME" [20])
| may relax this restriction for the content body,

Stephane Bortzmeyer quoted section 2.4 of RFC 2821:
> Eight-bit message content transmission MAY be requested of the server
> by a client using extended SMTP facilities, notably the "8BITMIME"
> extension [20].  8BITMIME SHOULD be supported by SMTP servers.

"SHOULD" does definitely not mean the same thing as "MUST".
An SMTP server does not have to support 8-bit MIME mail.

And the remainder of the quoted paragraph requests proper MIME
headers for 8-bit text:
| However, it MUST not be construed as authorization to transmit
| unrestricted eight bit material.  8BITMIME MUST NOT be requested
| by senders for material with the high bit on that is not in MIME
| format with an appropriate content-transfer encoding; servers
| MAY reject such messages.
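
What "MIME format with an appropriate content-transfer encoding" looks
like in practice can be sketched with Python's standard email package.
The addresses, subject and body are placeholders invented for the
example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.org"          # placeholder address
msg["To"] = "recipient@example.org"         # placeholder address
msg["Subject"] = "8bit test"

# Declare the charset and ask for the body to be left as raw 8bit data.
msg.set_content("Un été à Nîmes\n", charset="utf-8", cte="8bit")

wire = msg.as_bytes()
print(msg["MIME-Version"])
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print("été".encode("utf-8") in wire)
```

The headers are what make the high-bit bytes legitimate under the quoted
paragraph: without MIME-Version, a declared charset and
Content-Transfer-Encoding: 8bit, the same bytes would be the
"unrestricted eight bit material" that servers MAY reject.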

Barry Caplan had written:

> But for arbitrary email from one address to another, you can't rely on it.


Stephane Bortzmeyer wrote:

> I have been sending Latin-1 (ISO 8859-1) emails for more than ten years
> (and without using quoted-printable or other similar hacks) to
> French-speaking people in various parts of the world and I'm still
> waiting for an actual problem.


Mere luck, I'd say, but no proof at all.

I have seen many messages, originally in ISO-8859-1-encoded French,
that got the high-bit of every accented character chopped off, thus
replacing "é" with "i", "î" with "n", and so forth. And even more mail
in German, distorted in a similar way. This has provoked an entry in
my E-Mail FAQ: .
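
The damage described above is easy to reproduce. A minimal Python sketch,
with a helper name (strip_high_bit) and sample text chosen only for
illustration:

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7bit-only hop: clear the most significant bit of each byte."""
    return bytes(b & 0x7F for b in data)


original = "été, île"
sent = original.encode("iso-8859-1")      # é = 0xE9, î = 0xEE
received = strip_high_bit(sent).decode("ascii")

print(received)
```

0xE9 with the high bit cleared is 0x69 ("i"), and 0xEE becomes 0x6E
("n"), so the text arrives as "iti, nle". The loss is irreversible, since
a genuine "i" and a stripped "é" are the same byte afterwards.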

Of course, more and more SMTP servers support 8-bit MIME, and many
take the pains to transform 8-bit MIME to some transfer-encoding
supported by the receiving server. If you are located behind a server
that recodes your 8-bit mail, you cannot claim that 8-bit mail is
supported everywhere; you can only claim that your server compensates
for the incompatibility of your MUA and the world at large.

Best wishes,
  Otto Stolz





Re: Precomposed Tibetan

2002-12-17 Thread Michael Everson
At 07:47 -0800 2002-12-13, Andrew C. West wrote:

> I have just noticed that the Chinese government have presented a proposal to
> encode 956 "BrdaRten" characters in the BMP. See
> http://std.dkuug.dk/jtc1/sc2/WG2/docs/n2558.pdf
>
> Would I be correct in believing that there is no chance of these precomposed
> forms being accepted, even when pushed by a country with the clout of China ?


We pray not. It would introduce for Tibetan the kind of chaos we have
for Korean. I know a number of national bodies will vote vigorously
against damaging Tibetan in this way.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: h in Greek epigraphy

2002-12-17 Thread Michael Everson
At 18:43 -0800 2002-12-15, Doug Ewell wrote:


> One classic case of letters being unified across scripts is Kurdish,
> which uses Latin Q and W in an otherwise all-Cyrillic alphabet.


Which is not so smart, as has been pointed out by many. Consider that
even CYRILLIC SOFT SIGN has a Latin clone: U+0184 and U+0185.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: converting devanagari to mangal unicode

2002-12-17 Thread Marco Cimarosti
John Hudson wrote:
> At 03:09 PM 12/16/2002, Eric Muller wrote:
>
> >>In order to convert any Devanagari font to be rendered in the same way,
> >
> >Maybe Sunil is just asking for a conversion of data, presumably from
> >ISCII to Unicode.
>
> Ah, yes, this is possible. I'm so used to people asking the other question
> that I assumed from the slightly mixed up references in the question that
> this was what Sunil intended.

OK, this is my interpretation of Sunil's question: He has text data encoded
in a so-called "font encoding" (e.g. "Shusha"), and he needs to convert it
to Unicode.

The Linux Technology Development for Indian Languages
(http://www.cse.iitk.ac.in/users/isciig/) has two ongoing projects for
similar conversions:

- iconverter
(http://www.cse.iitk.ac.in/users/isciig/iconverter/main.html)
- ISCIIlib
(http://www.cse.iitk.ac.in/users/isciig/isciilib/main.html)

_ Marco




Re: Documenting in Tamil Computing

2002-12-17 Thread Eric Muller


> I don't understand what you meant by Unicode not being
> mature enough to support multilingual emails.

Maybe the argument is simply that there are not enough email agents that 
can render Tamil properly from Unicode-encoded text, and that email 
rarely has a useful life that justifies pain today.

Eric.





Re: Documenting in Tamil Computing

2002-12-17 Thread Stephane Bortzmeyer
On Mon, Dec 16, 2002 at 10:29:14AM -0800,
 Barry Caplan <[EMAIL PROTECTED]> wrote 
 a message of 23 lines which said:

> Actually, it is not Unicode which is not mature enough. It is SMTP,
> the core mail transport protocol. It is not 8 bit clean. It is very
> clear in the RFCs that only 7bit data is allowed "over the wire".

I have to correct this because it may seriously cast doubts about the
ability of Internet email to send Unicode files.
 
> There are various extensions and kluges described in various RFCs
> (ESMTP, MIME, etc. )

All these extensions are referenced in the same RFC, 2821, which is
the authoritative one about SMTP. I do not know any mainstream SMTP
server which does not implement them.

The most important for us is 8BITMIME:

   Eight-bit message content transmission MAY be requested of the server
   by a client using extended SMTP facilities, notably the "8BITMIME"
   extension [20].  8BITMIME SHOULD be supported by SMTP servers.

> but they are not universally implemented at the server transport
> layer,

This is absolutely wrong. sendmail, Postfix and qmail have allowed 8-bit
transport for a *very* long time.

> But for arbitrary email from one address to another, you can't rely on it.

I have been sending Latin-1 (ISO 8859-1) emails for more than ten years
(and without using quoted-printable or other similar hacks) to
French-speaking people in various parts of the world and I'm still
waiting for an actual problem.