Re: Controls, gliphs, flies, lemonade

2011-09-20 Thread QSJN 4 UKR
Yes, I had written 'Egyptian hieroglyphs', but how about banal CJK? We
still have no way to insert a nonstandard ideogram into text. Isn't it
a simple task? There are just 20 basic strokes :)  OK, 500 basic
symbols. Or 20? Either way, we can't combine them :( !
Unicode is too complex a standard. I don't even know how many properties
one character has (did you know about Unicode-coloured characters? -
I had a thread about that somewhere on this list), so how can I know how
my application has to render 'plain' text with bidi, non-canonically-ordered
diacritics, and Korean script? Right, I don't know that. So my
application renders it its own way and someone else's renders it another
way (a_a / aa_ - double combining characters, surely you have seen that),
so we have no standard at all.
Of course, I can learn this complex standard, but what for? Most of
it I will never use.
There must be a simpler system that does not need so much a priori data
to work.
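
On the point about non-canonically-ordered diacritics: the standard does define a canonical order for combining marks, so two applications that normalize before comparing should treat differently ordered sequences as the same text. A minimal sketch in Python, assuming only the standard-library unicodedata module; the helper name same_text is made up for this illustration:

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# 'a' followed by COMBINING ACUTE ACCENT (class 230) and
# COMBINING DOT BELOW (class 220), stored in two different orders.
above_first = "a\u0301\u0323"
below_first = "a\u0323\u0301"

print(same_text(above_first, below_first))

# After NFD the marks are sorted by canonical combining class.
nfd = unicodedata.normalize("NFD", above_first)
print([unicodedata.combining(ch) for ch in nfd])
```

This covers only the equivalence of stored sequences. It says nothing about how a renderer should place a double combining mark, which is the a_a / aa_ case.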

2011/9/13, John H. Jenkins jenk...@apple.com:

 QSJN 4 UKR wrote on 2011-09-12 at 9:06 PM:

 I know it is a sacred cow, but let me just ask what you people think.
 Is it good or bad that the codepoint says everything about a character:
 what, where, how... (see subject)? Maybe if we had separate graphic and
 control codes we wouldn't have so many problems, from banal ltr (( rtl
 instead ltr (rtl) to placing one tilde above 3, 4, or more letters, or
 Egyptian hieroglyphs in rows'n'cols. Conceptually, I mean! Each letter in
 the text would be at least two codepoints (what and where) in the file.
 Is it stupid? To render the text we must generate this data anyway.



 It's not really a sacred cow per se, but it is a fundamental architectural
 decision which would be pretty much impossible to revisit now.

 Almost all writing is done using a small set of script-specific rules which
 are pretty straightforward.  English, for example, is laid out in horizontal
 lines running left-to-right and arranged top-to-bottom of the writing
 surface.  East Asian languages were traditionally laid out in vertical lines
 running from top-to-bottom and arranged right-to-left on the writing
 surface.

 Because some scripts are right-to-left and ltr and rtl text can be freely
 intermingled on a single line, Unicode provides plain-text directionality
 controls.  The preference, however, is to use higher-level protocols where
 possible.
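
The plain-text directionality controls mentioned here can be inspected from Python. The following is only a sketch, assuming the standard-library unicodedata module; embed_rtl and bidi_classes are hypothetical helper names, and the sample uses the explicit embedding controls RLE (U+202B) and PDF (U+202C):

```python
import unicodedata

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def embed_rtl(text):
    """Wrap text in an explicit right-to-left embedding."""
    return RLE + text + PDF

def bidi_classes(text):
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin letters, a space, then two Hebrew letters inside an RTL embedding.
sample = "abc " + embed_rtl("\u05D0\u05D1")
print(bidi_classes(sample))
```

The classes are only the input to the bidirectional algorithm. Actual reordering for display is left to a rendering library or a higher-level protocol, as the paragraph above recommends.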

 As for the scripts which are inherently two-dimensional (using
 hieroglyphics, mathematics, and music), it's almost impossible to provide
 plain text support for them.  There is too much dependence on additional
 information such as the specifics of font and point size.  Because of this,
 the UTC decided long ago that layout for such scripts absolutely must be
 done using a higher-level protocol to handle all the details.

 There are occasionally suggestions that positioning controls be added to
 plain text in Unicode, but so far the UTC has felt that the benefits are too
 marginal to overcome its reasons for having left them out in the first
 place.

 =
 Hoani H. Tinikini
 John H. Jenkins
 jenk...@apple.com









Attn: Unicode Inc worker Kent Karlsson

2011-09-20 Thread Tulasi
Attn: Unicode Inc worker Kent Karlsson
C/o   Magda Danish
  Sr Administrative Director
  Unicode Inc
  kent.karlsso...@telia.com,
  v-mag...@microsoft.com,
  arch...@mail-archive.com,


Neither Assam Government nor Assam Literary Society has asked Unicode
Inc to encode Assamese stuff.

Why did it encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode
Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,

Tulasi
PS: Your email thread appended herewith as reference


From: Kent Karlsson kent.karlsso...@telia.com
Date: Fri, Sep 9, 2011 at 5:44 PM
Subject: Re: Continue:Glaring mistake in the code list for South Asian
Script
To: delex r del...@indiatimes.com, unicode@unicode.org


Den 2011-09-10 00:53, skrev delex r del...@indiatimes.com:

 I figure out that Unicode has not addressed the sovereignty issues of a
 language

Which, I daresay, is irrelevant from a *character* encoding
perspective.

 while trying to devise an ASCII like encoding system for almost all
 the characters and symbols used on earth. I am continuing with my observation
 of the glaring mistake done by Unicode by naming a South Asian Script as
 “Bengali”. Here I would like to give certain information that I think will be
 of some help for Unicode in its endeavour to faithfully represent a Universal
 Character encoding standard truer to even micro-facts.

 India is believed to have at least 1652 mother tongues out of which only 22

One list of languages in India is given in
http://www.ethnologue.com/show_country.asp?name=IN
(I did not count the number of entries)

 are recognized by the Indian Constitution as official languages for
 administrative communication among local governments and to the citizens. And
 the constitution has not explicitly recognized any official script. As Unicode
 has listed the languages and scripts, the Indian Constitution has also listed

Unicode does not list any languages at all. Ok, the CLDR subproject copies a
list of language codes from the IANA language subtag registry, which (in a
complex manner) takes its language codes from (among others) the ISO 639-3
registry, which largely is in sync with Ethnologue (as in the list above);
but I guess that is not what you referred to.

 the official languages ( In its 8th schedule). The first entry in that list is
 the Assamese language.  Assamese is a sovereign language with its own grammar

Which I don't think is in dispute at all.

 and “script” that contains some unique characters that you will not find in
 any of the scripts so far discovered by Unicode. At least 30 million people

Unicode (at this stage) does not do any discovery. Unicode and ISO/IEC
10646 are driven by applications (proposals) to encode characters (and
define properties of characters).

 call it the “Assamese Script” and if provided with computers and internet

If you want to disunify the Bengali script (and characters) from Assamese,
you need to show, in a proposal document, that they really are different
scripts, and should not be unified as just different uses of the same
script.

 connection can bomb the Unicode e-mail address with confirmations. These

Hmm, an email bombing threat... I'm sure Sarasvati can find a way to block
those (or we may all simply file them away as spam).

 characters are, I repeat, the one that is given a Hexcode 09F0  and the other
 with 09F1 by this universal character encoding system but unfortunately has
 described both as “Bengali” Ra etc. etc. I don’t know who has advised
 Unicode to use the tag “Bengali” to name the block that includes these two
 characters.

 If you are not an Indian then just google an image of an Indian Currency note.
 There on one side of the note you will find a box inside which the value of
 the currency note is written in words in at least 15 scripts of official
 Indian languages.( I don’t know why it is not 22). At the top , the script is
 Assamese as Assamese is the first officially recognized language (script?) .
 Next below it you will find almost similar shapes. That is in Bengali. India
 officially recognises the distinction between these two scripts which although
 shaped similar but sounds very different at many points. And the standard

Minor font differences are not a reason for disunification. Different
pronunciations of the same letters are not a reason for disunification
either. Just think of how many different ways Latin letters (and letter
combinations) are pronounced in different languages (x, j, h, v, w, f, ...;
even a gets different pronunciation in British English vs. US English,
and that is within the same language...; and most orthographies aren't
very accurately phonetic anyway, with quite a bit of varying (contextual
and dialectal) pronunciation for the letters).

 assamese alphabet set has extra characters which are never bengali just like
 London is never in Germany.

There are 8 London in the USA, two in Canada, 

Re: Controls, gliphs, flies, lemonade

2011-09-20 Thread John H. Jenkins
In re CJK, that's already a FAQ: http://www.unicode.org/faq/han_cjk.html#16.  
The short version is: if all you want to do is to draw something, then yes, 
making up new hanzi on the fly is a solvable problem.  If you want to do 
anything that deals with the *content* (lexical analysis, sorting, 
text-to-speech), it's an incredibly difficult problem.  

And, actually, there's already a way to insert nonstandard hanzi into text 
(well, two, if you count the Ideographic Variation Indicator), namely 
Ideographic Description Sequences.  They're clumsy and awkward, but they do 
make it possible to exchange text with unencoded hanzi in a vaguely standard 
fashion.  
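
To make the idea concrete: an Ideographic Description Sequence is written in prefix notation, with an operator character followed by the components it arranges. Below is a rough well-formedness check in Python. It is a sketch only, under these assumptions: the operators are the twelve Ideographic Description Characters U+2FF0..U+2FFB (later versions of the standard may define more), every other character is accepted as a component without further checks, and is_well_formed_ids is a made-up name.

```python
# Ideographic Description Characters U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three components; the rest take two.
TERNARY = {"\u2FF2", "\u2FF3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY

def is_well_formed_ids(seq: str) -> bool:
    """Check that every operator in a prefix-notation IDS gets its operands."""
    pending = 1  # number of components still expected
    for ch in seq:
        if pending == 0:
            return False  # extra characters after a complete description
        if ch in TERNARY:
            pending += 2  # fills one slot, opens three
        elif ch in BINARY:
            pending += 1  # fills one slot, opens two
        else:
            pending -= 1  # a plain component fills one slot
    return pending == 0

# Water radical (U+6C35) to the left of U+9996.
print(is_well_formed_ids("\u2FF0\u6C35\u9996"))
# Nested: grass radical (U+8279) above the previous description.
print(is_well_formed_ids("\u2FF1\u8279\u2FF0\u6C35\u9996"))
# Truncated: the operator is missing its second component.
print(is_well_formed_ids("\u2FF0\u6C35"))
```

A check like this only tells you the sequence can be parsed. Drawing the described character, or doing anything with its content, is the harder part discussed above.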

And yes, Unicode is very complicated, but that's because of the problem it's 
intended to solve.  If all you're interested in is drawing text in a couple of 
common scripts, such as Latin and Japanese, then you really don't need Unicode 
with all of its complexity.  Unicode is trying to provide a basis for handling 
all aspects of plain text processing for all the languages of the world in a 
single application.  

Just go to Wikipedia and look down the long list of different languages that a 
popular subject has articles in.  *That* is what Unicode is trying to provide.  
It's very tough to implement, but fortunately on all the major platforms, there 
are libraries that make it unnecessary for you to do all the work yourself.


=
John H. Jenkins
jenk...@apple.com






Attn: Unicode Inc worker Peter Zilahy Ingerman, PhD

2011-09-20 Thread Tulasi
Attn: Unicode Inc worker Peter Zilahy Ingerman, PhD
C/o   Magda Danish
  Sr Administrative Director
  Unicode Inc
  pzi @ ingerman.org,
  v-magdad @ microsoft.com,


Neither Assam Government nor Assam Literary Society has asked Unicode
Inc to encode Assamese stuff.

Why did Unicode Inc encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode
Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,

Tulasi
PS: Your email thread appended herewith as reference


From: Peter Zilahy Ingerman, PhD p...@ingerman.org
Date: Mon, Sep 12, 2011 at 5:27 AM
Subject: Re: Continue: Glaring Mistake in the Code List of South Asian
Script, Reply to Daug Ewell and Others
To: Mark E. Shoulson m...@kli.org
Cc: unicode@unicode.org


Truly, a fanatic redoubles his efforts when he loses sight of his
goal.

Peter Ingerman


On 2011-09-12 07:21, Mark E. Shoulson wrote:

On 09/12/2011 06:01 AM, delex r wrote:

Anyone who is not aware of the fact and wants to find out in
Unicode about Assamese Ra (09F0) or Assamese Wa (09F1) will find it
absurd and difficult, as if he is being asked to find London on the
map of Germany.


See above.  You're absolutely, 100% right, and you obviously have
seen something we've all missed.  (Actually, I don't know whether you
are or not, but let's assume you are).  Thank you for pointing out
this glaring mistake in Unicode's naming.  This glaring mistake will
remain a glaring mistake, just like the spelling of BRAKCET instead
of bracket will remain in U+FE18.

You're totally right in everything you have said (we'll assume).
No need to try to convince us anymore, we believe you.  No names will
be changed, anyway.
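
The names in question can be looked up directly. A small sketch, assuming Python's standard-library unicodedata module (char_name is a hypothetical helper), that prints the formal names of the two disputed letters and of U+FE18:

```python
import unicodedata

def char_name(codepoint: int) -> str:
    """Return the formal Unicode character name for a code point."""
    return unicodedata.name(chr(codepoint))

for cp in (0x09F0, 0x09F1, 0xFE18):
    print(f"U+{cp:04X}  {char_name(cp)}")
```

The name of U+FE18 still carries the BRAKCET misspelling, since character names are not changed once published.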

~mark



Attn: Unicode Inc worker Christoph Päper

2011-09-20 Thread Tulasi
Attn: Unicode Inc worker Christoph Päper
C/o   Magda Danish
  Sr Administrative Director
  Unicode Inc
  christoph.paeper @ crissov.de,
  v-mag...@microsoft.com,


Neither Assam Government nor Assam Literary Society has asked Unicode
Inc to encode Assamese stuff.

Why did Unicode Inc encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode
Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,

Tulasi
PS: Your email thread appended herewith as reference


From: Christoph Päper christoph.pae...@crissov.de
Date: Mon, Sep 12, 2011 at 5:52 AM
Subject: Re: Continue: Glaring Mistake in the Code List of South Asian
Script, Reply to Daug Ewell and Others
To: Unicode Discussion unicode@unicode.org


Delex,

you are obviously confusing character sets, scripts, writing systems,
orthographies, languages, peoples and names thereof (which may vary
across languages and applications).

NB: Some might argue that Unicode already distinguishes Indic scripts
on a finer level than necessary, since elsewhere many would be seen as
hands or typefaces of a single script, hence they would unify encoding
and leave the looks to fonts completely.

 difficult as if he is being asked to find out London in the map of Germany.

There’s a London in my (German) home county. I think it has like 20
citizens. Proves nothing.


Attn: Unicode Inc worker Ken Whistler

2011-09-20 Thread Tulasi
Attn: Unicode Inc worker Ken Whistler
C/o   Magda Danish
  Sr Administrative Director
  Unicode Inc
  kenw @ sybase.com,
  v-mag...@microsoft.com,


Neither Assam Government nor Assam Literary Society has asked Unicode
Inc to encode Assamese stuff.

Why did Unicode Inc encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode
Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,

Tulasi
PS: Your email thread appended herewith as reference


From: Ken Whistler k...@sybase.com
Date: Mon, Sep 12, 2011 at 1:53 PM
Subject: Re: Continue: Glaring Mistake in the Code List of South Asian
Script, Reply to Daug Ewell and Others
To: verd...@wanadoo.fr
Cc: unicode@unicode.org


On 9/12/2011 9:13 AM, Philippe Verdy wrote:

Well, wasn't the ISCII standard naming the script Bengali? It
also gave the name Assamese, but was it a synonym or did it require
a separate codepage switching code ?

They were separate. Annex A of ISCII 1991 shows Bengali (BNG) and
Assamese (ASM) in separate columns. *Every* character in those two
columns is completely identical, except the entries (no surprise) in
the r row and the v row. And in Annex D, the listing of Inscript
keyboards, there is one keyboard overlay for Bengali and one for
Assamese. These again are completely identical, except for the B key
(where the v goes) and the J key (where the r goes).

Why? Well, I presume the Bureau of Indian Standards ran into the same
linguistic political buzzsaw that you have seen rehearsed on this
thread.


It may be interesting to reread the ISCII standard from which the
UCS encoding of the Indian scripts came...

Yes, it is interesting reading. I recommend it sometime.

Ultimately, however, it is not pertinent to the question here. The
distinction between Bengali and Assamese is a matter of linguistic
politics. It is not a matter of script or character encoding.

--Ken