Re: Controls, gliphs, flies, lemonade
Yes, I had written 'Egyptian hieroglyphs', but how about banal CJK? We still have no way to insert a nonstandard ideogram into text. Isn't it a simple task? There are just 20 basic strokes :) OK, 500 basic symbols. Or 20? However, we can't combine them together :( !

Unicode is too complex a standard. I don't even know how many properties one character has (did you know about Unicode-coloured characters? That was a thread of mine somewhere on this list), so how can I know how my application has to render 'plain' text with bidi, non-canonically ordered diacritics, and Korean script? Right, I don't know that. So my application renders it its own way and someone else's renders it another way (a_a / aa_ for a double combining character; surely you have seen that), and we have no standard at all. Of course, I can learn this complex standard, but what for? Most of it I will never use. There must be a simpler system that does not need so much a priori data to work.

2011/9/13, John H. Jenkins jenk...@apple.com:

> QSJN 4 UKR wrote on 12 September 2011 at 9:06 PM:
>
>> I know it is a sacred cow, but let me just ask what you people think. Is it good or bad that the codepoint means everything about a character: what, where, how... (see the subject)? Maybe if we had separate graphic control codes, we wouldn't have many problems, from the banal ltr (( rtl instead of ltr (rtl) to placing one tilde above 3, 4, or more letters, or Egyptian hieroglyphs in rows and columns. Conceptually, I mean! Each letter in the text would be at least two codepoints (what and where) in the file. Is it stupid? When trying to render the text we must generate this data anyway.
>
> It's not really a sacred cow per se, but it is a fundamental architectural decision which would be pretty much impossible to revisit now.
>
> Almost all writing is done using a small set of script-specific rules which are pretty straightforward. English, for example, is laid out in horizontal lines running left-to-right and arranged top-to-bottom of the writing surface. East Asian languages were traditionally laid out in vertical lines running from top-to-bottom and arranged right-to-left on the writing surface.
>
> Because some scripts are right-to-left and ltr and rtl text can be freely intermingled on a single line, Unicode provides plain-text directionality controls. The preference, however, is to use higher-level protocols where possible.
>
> As for the scripts which are inherently two-dimensional (using hieroglyphics, mathematics, and music), it's almost impossible to provide plain text support for them. There is too much dependence on additional information such as the specifics of font and point size. Because of this, the UTC decided long ago that layout for such scripts absolutely must be done using a higher-level protocol to handle all the details.
>
> There are occasionally suggestions that positioning controls be added to plain text in Unicode, but so far the UTC has felt that the benefits are too marginal to overcome its reasons for having left them out in the first place.
>
> Hoani H. Tinikini
> John H. Jenkins
> jenk...@apple.com
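For readers wondering what "canonical ordering" of diacritics and "directionality" look like in practice, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` and the sample strings are only illustrative assumptions; the sketch shows how a library reports per-character properties and reorders combining marks, not how any particular renderer works.

```python
import unicodedata

def describe(text):
    """Return (code point, name, bidi class, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.name(ch, "<unnamed>"),
         unicodedata.bidirectional(ch),
         unicodedata.combining(ch))
        for ch in text
    ]

# 'a' + COMBINING ACUTE ACCENT (class 230) + COMBINING DOT BELOW (class 220):
# the two marks are typed in non-canonical order.
unordered = "a\u0301\u0323"

# NFD normalization applies the canonical ordering: lower classes come first.
ordered = unicodedata.normalize("NFD", unordered)

for row in describe(ordered):
    print(row)
# ('U+0061', 'LATIN SMALL LETTER A', 'L', 0)
# ('U+0323', 'COMBINING DOT BELOW', 'NSM', 220)
# ('U+0301', 'COMBINING ACUTE ACCENT', 'NSM', 230)

# Bidi classes of a Hebrew letter and of an explicit directional control.
print(unicodedata.bidirectional("\u05d0"))  # R   (HEBREW LETTER ALEF)
print(unicodedata.bidirectional("\u202e"))  # RLO (RIGHT-TO-LEFT OVERRIDE)
```

Two applications that both normalize before rendering start from the same mark order, which is the kind of shared behaviour the property data is meant to provide.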
Attn: Unicode Inc worker Kent Karlsson
Attn: Unicode Inc worker Kent Karlsson
C/o Magda Danish
Sr Administrative Director
Unicode Inc

kent.karlsso...@telia.com, v-mag...@microsoft.com, arch...@mail-archive.com,

Neither Assam Government nor Assam Literary Society has asked Unicode Inc to encode Assamese stuff. Why did it encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,
Tulasi

PS: Your email thread appended herewith as reference

From: Kent Karlsson kent.karlsso...@telia.com
Date: Fri, Sep 9, 2011 at 5:44 PM
Subject: Re: Continue:Glaring mistake in the code list for South Asian Script
To: delex r del...@indiatimes.com, unicode@unicode.org

On 2011-09-10 00:53, delex r del...@indiatimes.com wrote:

> I figure out that Unicode has not addressed the sovereignty issues of a language

Which, I daresay, is irrelevant from a *character* encoding perspective.

> while trying to devise an ASCII like encoding system for almost all the characters and symbols used on earth. I am continuing with my observation of the glaring mistake done by Unicode by naming a South Asian Script as “Bengali”. Here I would like to give certain information that I think will be of some help for Unicode in its endeavour to faithfully represent a Universal Character encoding standard truer to even micro-facts. India is believed to have at least 1652 mother tongues out of which only 22

One list of languages in India is given in http://www.ethnologue.com/show_country.asp?name=IN (I did not count the number of entries).

> are recognized by the Indian Constitution as official languages for administrative communication among local governments and to the citizens. And the constitution has not explicitly recognized any official script. As Unicode has listed the languages and scripts, the Indian Constitution has also listed

Unicode does not list any languages at all. OK, the CLDR subproject copies a list of language codes from the IANA language subtag registry, which (in a complex manner) takes its language codes from (among others) the ISO 639-3 registry, which is largely in sync with Ethnologue (as in the list above); but I guess that is not what you referred to.

> the official languages (in its 8th schedule). The first entry in that list is the Assamese language. Assamese is a sovereign language with its own grammar

Which I don't think is in dispute at all.

> and “script” that contains some unique characters that you will not find in any of the scripts so far discovered by Unicode. At least 30 million people

Unicode (at this stage) does not do any discovery. Unicode and ISO/IEC 10646 are driven by applications (proposals) to encode characters (and define properties of characters).

> call it the “Assamese Script” and if provided with computers and internet

If you want to disunify the Bengali script (and characters) from Assamese, you need to show, in a proposal document, that they really are different scripts, and should not be unified as just different uses of the same script.

> connection can bomb the Unicode e-mail address with confirmations. These

Hmm, an email bombing threat... I'm sure Sarasvati can find a way to block those (or we may all simply file them away as spam).

> characters are, I repeat, the one that is given a Hexcode 09F0 and the other with 09F1 by this universal character encoding system but unfortunately has described both as “Bengali” Ra etc. etc. I don't know who has advised Unicode to use the tag “Bengali” to name the block that includes these two characters. If you are not an Indian then just google an image of an Indian Currency note. There on one side of the note you will find a box inside which the value of the currency note is written in words in at least 15 scripts of official Indian languages. (I don't know why it is not 22.) At the top, the script is Assamese as Assamese is the first officially recognized language (script?). Next below it you will find almost similar shapes. That is in Bengali. India officially recognises the distinction between these two scripts which although shaped similar but sounds very different at many points. And the standard

Minor font differences are not a reason for disunification. Different pronunciations of the same letters are not a reason for disunification either. Just think of how many different ways Latin letters (and letter combinations) are pronounced in different languages (x, j, h, v, w, f, ...; even a gets different pronunciation in British English vs. US English, and that is within the same language...; and most orthographies aren't very accurately phonetic anyway, with quite a bit of varying (contextual and dialectal) pronunciation for the letters).

> assamese alphabet set has extra characters which are never bengali just like London is never in Germany.

There are 8 London in the USA, two in Canada,
Re: Controls, gliphs, flies, lemonade
In re CJK, that's already a FAQ: http://www.unicode.org/faq/han_cjk.html#16. The short version is: if all you want to do is to draw something, then yes, making up new hanzi on the fly is a solvable problem. If you want to do anything that deals with the *content* (lexical analysis, sorting, text-to-speech), it's an incredibly difficult problem.

And, actually, there's already a way to insert nonstandard hanzi into text (well, two, if you count the Ideographic Variation Indicator), namely Ideographic Description Sequences. They're clumsy and awkward, but they do make it possible to exchange text with unencoded hanzi in a vaguely standard fashion.

And yes, Unicode is very complicated, but that's because of the problem it's intended to solve. If all you're interested in is drawing text in a couple of common scripts, such as Latin and Japanese, then you really don't need Unicode with all of its complexity. Unicode is trying to provide a basis for handling all aspects of plain text processing for all the languages of the world in a single application. Just go to Wikipedia and look down the long list of different languages that a popular subject has articles in. *That* is what Unicode is trying to provide.

It's very tough to implement, but fortunately on all the major platforms, there are libraries that make it unnecessary for you to do all the work yourself.

John H. Jenkins
jenk...@apple.com
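To make the Ideographic Description Sequences mentioned in this reply a little more concrete, here is a rough sketch of a well-formedness check. It assumes only the twelve description characters U+2FF0..U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two; later Unicode versions add more operators, and the real rules also restrict what may appear as a component. The names `parse_ids` and `is_well_formed_ids` are invented for this illustration.

```python
# Ideographic Description Characters handled by this sketch (U+2FF0..U+2FFB).
TERNARY = {0x2FF2, 0x2FF3}                      # three operands
BINARY = set(range(0x2FF0, 0x2FFC)) - TERNARY   # two operands

def parse_ids(seq, pos=0):
    """Parse one description starting at seq[pos]; return (tree, next_pos)."""
    if pos >= len(seq):
        raise ValueError("description ends before all operands are supplied")
    ch = seq[pos]
    cp = ord(ch)
    if cp in BINARY or cp in TERNARY:
        arity = 3 if cp in TERNARY else 2
        pos += 1
        operands = []
        for _ in range(arity):
            node, pos = parse_ids(seq, pos)
            operands.append(node)
        return (ch, *operands), pos
    # Anything else is treated as a single component (a simplification).
    return ch, pos + 1

def is_well_formed_ids(seq):
    """True if seq is exactly one complete description with nothing left over."""
    try:
        _, end = parse_ids(seq)
    except ValueError:
        return False
    return end == len(seq)

# U+2FF0 (left-to-right) with two U+6728 "tree" components.
print(is_well_formed_ids("\u2ff0\u6728\u6728"))        # True
# U+2FF1 (above-to-below): one tree above a left-to-right pair of trees.
print(is_well_formed_ids("\u2ff1\u6728\u2ff0\u6728\u6728"))  # True
# Operator with a missing operand.
print(is_well_formed_ids("\u2ff0\u6728"))               # False
```

Note that the check says nothing about meaning: it only confirms the sequence describes a layout, which fits the point above that drawing is the easy part and content processing is the hard part.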
Attn: Unicode Inc worker Peter Zilahy Ingerman, PhD
Attn: Unicode Inc worker Peter Zilahy Ingerman, PhD
C/o Magda Danish
Sr Administrative Director
Unicode Inc

pzi @ ingerman.org, v-magdad @ microsoft.com,

Neither Assam Government nor Assam Literary Society has asked Unicode Inc to encode Assamese stuff. Why did Unicode Inc encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,
Tulasi

PS: Your email thread appended herewith as reference

From: Peter Zilahy Ingerman, PhD p...@ingerman.org
Date: Mon, Sep 12, 2011 at 5:27 AM
Subject: Re: Continue: Glaring Mistake in the Code List of South Asian Script, Reply to Daug Ewell and Others
To: Mark E. Shoulson m...@kli.org
Cc: unicode@unicode.org

Truly, a fanatic redoubles his efforts when he loses sight of his goal.

Peter Ingerman

On 2011-09-12 07:21, Mark E. Shoulson wrote:

> On 09/12/2011 06:01 AM, delex r wrote:
>
>> Anyone who is not aware of fact and want to find out in unicode about Assamese Raw (09F0) or Assamese Wa (09F1) will find it absurd and difficult as if he is being asked to find out London in the map of Germany.
>
> See above. You're absolutely, 100% right, and you obviously have seen something we've all missed. (Actually, I don't know whether you are or not, but let's assume you are.) Thank you for pointing out this glaring mistake in Unicode's naming. This glaring mistake will remain a glaring mistake, just like the spelling of BRAKCET instead of bracket will remain in U+FE18.
>
> You're totally right in everything you have said (we'll assume). No need to try to convince us anymore, we believe you. No names will be changed, anyway.
>
> ~mark
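For anyone who wants to check the character names being argued about in this thread, a small sketch with Python's standard `unicodedata` module prints them. The helper name `character_names` is just an illustration; the names themselves come from the Unicode data bundled with the Python interpreter.

```python
import unicodedata

def character_names(code_points):
    """Map each code point to its formal Unicode character name."""
    return {cp: unicodedata.name(chr(cp)) for cp in code_points}

names = character_names([0x09B0, 0x09F0, 0x09F1, 0xFE18])
for cp, name in names.items():
    print(f"U+{cp:04X}  {name}")
# U+09B0  BENGALI LETTER RA
# U+09F0  BENGALI LETTER RA WITH MIDDLE DIAGONAL
# U+09F1  BENGALI LETTER RA WITH LOWER DIAGONAL
# U+FE18  PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
```

The misspelled BRAKCET in the last name is what the reply above refers to: formal character names are never changed once published, typos included.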
Attn: Unicode Inc worker Christoph Päper
Attn: Unicode Inc worker Christoph Päper
C/o Magda Danish
Sr Administrative Director
Unicode Inc

christoph.paeper @ crissov.de, v-mag...@microsoft.com,

Neither Assam Government nor Assam Literary Society has asked Unicode Inc to encode Assamese stuff. Why did Unicode Inc encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,
Tulasi

PS: Your email thread appended herewith as reference

From: Christoph Päper christoph.pae...@crissov.de
Date: Mon, Sep 12, 2011 at 5:52 AM
Subject: Re: Continue: Glaring Mistake in the Code List of South Asian Script, Reply to Daug Ewell and Others
To: Unicode Discussion unicode@unicode.org

Delex, you are obviously confusing character sets, scripts, writing systems, orthographies, languages, peoples and names thereof (which may vary across languages and applications).

NB: Some might argue that Unicode already distinguishes Indic scripts on a finer level than necessary, since elsewhere many would be seen as hands or typefaces of a single script, hence they would unify encoding and leave the looks to fonts completely.

> difficult as if he is being asked to find out London in the map of Germany.

There’s a London in my (German) home county. I think it has like 20 citizens. Proves nothing.
Attn: Unicode Inc worker Ken Whistler
Attn: Unicode Inc worker Ken Whistler
C/o Magda Danish
Sr Administrative Director
Unicode Inc

kenw @ sybase.com, v-mag...@microsoft.com,

Neither Assam Government nor Assam Literary Society has asked Unicode Inc to encode Assamese stuff. Why did Unicode Inc encode Assamese stuff?

Can you reply back with detailed information on what prompted Unicode Inc to encode Assamese stuff as Bengali?

Thank you in advance for providing this information,
Tulasi

PS: Your email thread appended herewith as reference

From: Ken Whistler k...@sybase.com
Date: Mon, Sep 12, 2011 at 1:53 PM
Subject: Re: Continue: Glaring Mistake in the Code List of South Asian Script, Reply to Daug Ewell and Others
To: verd...@wanadoo.fr
Cc: unicode@unicode.org

On 9/12/2011 9:13 AM, Philippe Verdy wrote:

> Well, wasn't the ISCII standard naming the script Bengali? It also gave the name Assamese, but was it a synonym or did it require a separate codepage switching code?

They were separate.

Annex A of ISCII 1991 shows Bengali (BNG) and Assamese (ASM) in separate columns. *Every* character in those two columns is completely identical, except the entries (no surprise) in the r row and the v row.

And in Annex D, the listing of Inscript keyboards, there is one keyboard overlay for Bengali and one for Assamese. These again are completely identical, except for the B key (where the v goes) and the J key (where the r goes).

Why? Well, I presume the Bureau of Indian Standards ran into the same linguistic political buzzsaw that you have seen rehearsed on this thread.

> It may be interesting to reread the ISCII standard from which the UCS encoding of the Indian scripts came from...

Yes, it is interesting reading. I recommend it sometime.

Ultimately, however, it is not pertinent to the question here. The distinction between Bengali and Assamese is a matter of linguistic politics. It is not a matter of script or character encoding.

--Ken