Re: Abstract character?
Mark Davis wrote:

> The UTC has decided to make scalar value mean unambiguously the
> code points 0..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points. While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

They are not legal in UTF-16 unless you believe that the two code points (0xD800, 0xDC00) are fundamentally equivalent to the single code point 0x10000 -- that is, unless you believe Unicode *is* UTF-16.

UTF-16 does not allow the representation of an unpaired surrogate 0xD800 followed by another, coincidental unpaired surrogate 0xDC00. (It maps the two to U+10000.) Among the standard UTFs, only UTF-32 allows the two to be treated as unpaired surrogates. In fact, before UTF-8 was "tightened up" in 3.2, the only UTF that DID NOT permit these two coincidental unpaired surrogates was UTF-16.

  UTF-8:   D800 DC00 <==> ED A0 80 ED B0 80   (no longer legal)
  UTF-32:  D800 DC00 <==> 0000D800 0000DC00

- but -

  UTF-16:  D800 DC00 ==> D800 DC00 ==> 10000

> Ken is pushing for this change; I believe it would be a very bad idea.
> (I think the reasons have already appeared on this list, so I am not
> trying to reopen the discussion; just state the current situation.)

I don't recall seeing the reasons conclusively discussed on this list; I'd be happy to hear them again. I've been complaining about the paragraph after D29 for two years now.

-Doug Ewell
Fullerton, California
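The mappings in the message above can be checked mechanically. The following is only a sketch in Python, not anything taken from the standard's text: the helper names are made up for illustration, and the codec behaviour shown is that of Python 3's built-in `utf-16-be` and `utf-8` codecs (the `surrogatepass` error handler reproduces the old, pre-3.2 style of UTF-8 output).

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the scalar value encoded by a UTF-16 high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_decode_units(units):
    """Decode a list of 16-bit code units with Python's UTF-16 (big-endian) codec."""
    raw = b"".join(unit.to_bytes(2, "big") for unit in units)
    return [ord(ch) for ch in raw.decode("utf-16-be")]


def utf8_bytes_for_code_point(cp: int, allow_surrogates: bool = False) -> bytes:
    """Encode one code point as UTF-8.

    The strict codec rejects surrogate code points; "surrogatepass"
    emits the three-byte form that is no longer legal UTF-8.
    """
    errors = "surrogatepass" if allow_surrogates else "strict"
    return chr(cp).encode("utf-8", errors)


if __name__ == "__main__":
    print(hex(combine_surrogates(0xD800, 0xDC00)))
    print([hex(cp) for cp in utf16_decode_units([0xD800, 0xDC00])])
    print(utf8_bytes_for_code_point(0xD800, allow_surrogates=True).hex(" "))
```

In this sketch the UTF-16 codec never hands back D800 and DC00 as two separate code points; the pair always comes out as the single value 10000, which is the asymmetry the message describes.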
Dublin Conference: Re: ISO/IEC 10646 versus Unicode
Dear Marion,

After checking the mail lists upon returning from vacation/holiday, I found the following comment on the most recent Unicode conference in Dublin rather surprising:

> When, after all the years of receiving Irish support, I saw Unicode's
> 2002 conference in Dublin being advertised as more of a showcase for
> German than native interests, I decided not to attend, but that does
> not mean any withdrawal of EGT's initial and longstanding support of
> Unicode, in principle (although it seems to have produced only one
> thing to date, viz., a book called 'The Unicode Standard' (where I
> expected to read 'Implementation').

As a matter of fact, we specifically designed the Dublin Unicode Conference to tie in with the substantial Dublin localization industry. I am quite sorry if this purpose was misunderstood. Our keynote speaker you refer to was from the Localization Research Institute of the University of Limerick.

It is too bad that you were not able to attend, particularly since you have a great interest in Unicode implementations (as do I). We were able to showcase implementations ranging from top US IT businesses, to many interesting worldwide case studies, localization, etc. I think you would have enjoyed it (in addition to the local pub :-).

Implementation is truly where "the rubber meets the road", to use an American idiom. In this regard, one goal of the conferences is to champion leading-edge Unicode implementations. I particularly enjoyed hearing from a British mobile phone company at the Dublin conference - Unicode is popping up everywhere, it seems.

Best regards,
Lisa Moore
Co-Chair, IUC
Re: Abstract character?
A small correction to Ken's message:

>The Unicode scalar value
>definitionally excludes D800..DFFF, which are only code unit
>values used in UTF-16, and which are not code points associated
>with any well-formed UTF code unit sequences.

The UTC has decided to make scalar value mean unambiguously the code points 0..D7FF, E000..10FFFF, i.e., everything but surrogate code points. While surrogate code points cannot be represented in UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate code points are illegal in all UTFs; notably, they are legal in UTF-16.

Ken is pushing for this change; I believe it would be a very bad idea. (I think the reasons have already appeared on this list, so I am not trying to reopen the discussion; just state the current situation.)

Mark
__
http://www.macchiato.com
◄ “Eppur si muove” ►

- Original Message -
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, July 22, 2002 13:38
Subject: Re: Abstract character?

> Unicode scalar value
>
>    A number from 0..D7FF, E000..10FFFF; the domain of the
>    functions which define UTF's. The Unicode scalar value
>    definitionally excludes D800..DFFF, which are only code unit
>    values used in UTF-16, and which are not code points associated
>    with any well-formed UTF code unit sequences.
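The distinction between "code point" and "scalar value" in this exchange can be written down as a pair of predicates. This is a minimal sketch in Python, assuming the ranges as given in the thread (codespace 0..10FFFF, surrogates D800..DFFF); the function and constant names are illustrative only and do not come from the standard.

```python
MAX_CODE_POINT = 0x10FFFF          # top of the codespace 0..10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_code_point(n: int) -> bool:
    """True for any number in the codespace 0..10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """True for the surrogate code points D800..DFFF."""
    return SURROGATE_FIRST <= n <= SURROGATE_LAST


def is_scalar_value(n: int) -> bool:
    """True for 0..D7FF and E000..10FFFF: every code point except surrogates."""
    return is_code_point(n) and not is_surrogate(n)


# Size of the scalar-value domain: whole codespace minus the surrogate block.
SCALAR_VALUE_COUNT = (MAX_CODE_POINT + 1) - (SURROGATE_LAST - SURROGATE_FIRST + 1)

if __name__ == "__main__":
    for n in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000):
        print(f"{n:06X} code point={is_code_point(n)} scalar value={is_scalar_value(n)}")
    print(SCALAR_VALUE_COUNT)
```

Under these definitions D800 is a code point but not a scalar value, which is exactly the case the two messages disagree about for UTF-16.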
Re: Tamil Text Messaging in Mobile Phones
Sinnathurai Srivas wrote: > http://www.gbizg.com/tamil/Unicode/Tamil_Text_Messaging.htm > > see the above for a sample of typical modern Tamil designed for mobile > texting without rendering support. "Rendering" is the process of mapping character codes to displayable glyphs on a screen or printer. Rendering support is always required, even for English. You probably mean "without complex rendering support," e.g. contextual glyph forms and glyph reordering. > Text messaging in Tamil on Mobile phones. Would they only work with my > proposed reformed Tamil characters? Unicode is not the place to propose reforms in scripts or orthography. Your proposed characters must first achieve popular usage before they will be encoded. -Doug Ewell Fullerton, California
Re: Tamil Text Messaging in Mobile Phones
For those who are interested in what is behind this message, a little background...

Sinnathurai Srivas is a member of INFITT's WG02 (Working Group 02, Unicode Tamil) who has long been advocating changes to Unicode Tamil that would make it "linear", removing the requirement for complex rendering. It would of course require many changes to rendering rules and character properties.

At this point you might wonder how it would be possible to do this without breaking compatibility -- well, no need to wonder, it would not be possible. Compatibility would have to be sacrificed. Several members of the committee pointed out that these reforms would not be possible without invalidating existing implementations. After some discussion, the chairman noted that he saw no way that such a proposal could actually be accomplished.

The committee let the matter drop after this. I know I am not the only one who thought that was the end of it -- until this very post was sent out today to at least a half dozen lists.

MichKa

Michael Kaplan
Trigeminal Software, Inc. -- http://www.trigeminal.com/

- Original Message -
From: "Sinnathurai Srivas" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, July 22, 2002 6:22 PM
Subject: Tamil Text Messaging in Mobile Phones

> http://www.gbizg.com/tamil/Unicode/Tamil_Text_Messaging.htm
>
> see the above for a sample of typical modern Tamil designed for mobile
> texting without rendering support.
>
> A typical Product;
>
> http://sms.gt.com.ua/
>
> Text messaging in Tamil on Mobile phones. Would they only work with my
> proposed reformed Tamil characters?
>
> see http://www.geocities.com/avarangal
> for using ancient Tamil writing logic and reforming current alphabets
Tamil Text Messaging in Mobile Phones
http://www.gbizg.com/tamil/Unicode/Tamil_Text_Messaging.htm see the above for a sample of typical modern Tamil designed for mobile texting without rendering support. A typical Product; http://sms.gt.com.ua/ Text messaging in Tamil on Mobile phones. Would they only work with my proposed reformed Tamil characters? see http://www.geocities.com/avarangal for using ancient Tamil writing logic and reforming current alphabets
Re: Abstract character?
I usually define an abstract character in talks I give as "an element of a writing system that you care about, independent of glyphs, and certainly independent of encodings or specific code points".

If it could be described more precisely than that, it wouldn't be "abstract", would it? :)

This is usually brought up in a series of definitions leading from "character" (what we are referring to here as "abstract character") to:

- "character list" - a list of "characters" one is interested in
- "character set" - a list of "character lists", which may or may not be ordered, but still has no code points
- "encoding scheme" - an algorithm for assigning code points to a "character set"
- "code point" - the representation of an "abstract character" in an "encoding scheme"
- "font" - a series of glyphs that are used to display the characters represented by code points, in their immediate context

All of this is filled with examples, building to an explanation of Unicode.

For example, with regard to "abstract character", I ask the audience to ponder whether "upper case A" and "lower case a" are the same "abstract character". Also, I ask them to ponder whether "lower case a" displayed in "Helvetica" is the same "character" as "lower case a" in "Times Roman". Finally, how about "lower case a in 9 point Helvetica" and "lower case a in 18 point Helvetica"?

And apropos a thread from last week, Unicode introduces new concepts such as "character properties", which means that the anticipation and intrigue I spend time building in the audience (that there is a neat solution to the historical morass I just spent 40 minutes describing) gets thoroughly dashed! Joy!

Implicit in this set of definitions is of course that a "character" may or may not be of interest to all "character lists", and therefore may or may not end up represented in more than one encoding. Also note that even when it does end up in more than one, this model in no way implies a round trip capability. This leads nicely into a discussion about some very important aspects of internationalizing code and working with 3rd party components.

Barry Caplan
www.i18n.com

At 01:38 PM 7/22/2002 -0700, Kenneth Whistler wrote:

>Abstract character
>
>   that which is encoded; an element of the repertoire (existing
>   independent of the character encoding standard, and often
>   identifiable in other character encoding standards, as well
>   as the Unicode Standard); the implicit basis of transcodings.
Re: Abstract character?
Lars Marius Garshol asked:

> I'm trying to find out what an abstract character is. I've been
> looking at chapter 3 of Unicode 3.0, without really achieving
> enlightenment.
>
> The term Unicode scalar value (apparently synonymous with code point)
> seems clear. It is the identifying number assigned to assigned
> Unicode characters.

Here is one of my attempts at a more rigorous term rectification:

Abstract character

   that which is encoded; an element of the repertoire (existing
   independent of the character encoding standard, and often
   identifiable in other character encoding standards, as well
   as the Unicode Standard); the implicit basis of transcodings.

   Note that while in some sense abstract characters exist a
   priori by virtue of the nature of the units of various writing
   systems, their exact nature is only pinned down at the point
   that an actual encoding is done. They are not always obvious,
   and many new abstract characters may arise as the result of
   particular textual processing needs that can be addressed by
   characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
   etc., etc.)

Code point

   A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.

Encoded character

   An *association* of an abstract character with a code point.

Unicode scalar value

   A number from 0..D7FF, E000..10FFFF; the domain of the
   functions which define UTF's. The Unicode scalar value
   definitionally excludes D800..DFFF, which are only code unit
   values used in UTF-16, and which are not code points associated
   with any well-formed UTF code unit sequences.

Assignment (of code points)

   Refers to the process of associating abstract characters with
   code points. Mathematically a code point is
   "assigned to" an abstract character and an abstract
   character is "mapped to" a code point.

   This is distinguished from the vaguer sense of "assigned"
   in general parlance as meaning "a code point given some
   designated function by the standard", which would include
   noncharacters and surrogates.

> So far, so good. Some questions:
>
> - are all assigned Unicode characters also abstract characters?

Yes. Or rather: all encoded characters are assigned to abstract characters.

(See above for my distinction between "assigned" and "designated", which would apply to noncharacters and surrogate code points -- neither of which classes of code points get assigned to abstract characters.)

> - it seems that not all abstract characters have code points (since
>   abstract characters can be formed using combining characters). Is
>   that correct?

Yes. (Note above -- abstract characters are also a concept which applies to other character encodings besides the Unicode Standard, and not all encoded characters in other character encodings automatically make it into the Unicode Standard, for various architectural reasons.)

> - do (Å) and (A followed by combining ring
>   above) represent the same abstract character?

Yes. That is the implicit claim behind a specification of canonical equivalence.

--Ken

> Would be good if someone could clear this up.
>
> --
> Lars Marius Garshol, Ontopian          http://www.ontopia.net
> ISO SC34/WG3, OASIS GeoLang TC         http://www.garshol.priv.no
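The last answer can be illustrated with a normalization library. What follows is a small sketch using Python's standard `unicodedata` module; the helper name `canonically_equivalent` is invented for this example, and comparing NFD forms is one common way of testing canonical equivalence rather than wording taken from the standard.

```python
import unicodedata

PRECOMPOSED = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
DECOMPOSED = "A\u030A"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def code_points(text: str):
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    print(code_points(PRECOMPOSED), code_points(DECOMPOSED))
    print(PRECOMPOSED == DECOMPOSED)                        # different code point sequences
    print(canonically_equivalent(PRECOMPOSED, DECOMPOSED))  # same abstract character
```

The two strings differ as code point sequences but compare equal once both are decomposed, which is what "the same abstract character" amounts to in practice.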
Re: Abstract character?
Lars Marius Garshol wrote: > I'm trying to find out what an abstract character is. http://www.unicode.org/reports/tr17/ http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/ markus
Abstract character?
I'm trying to find out what an abstract character is. I've been looking at chapter 3 of Unicode 3.0, without really achieving enlightenment.

The term Unicode scalar value (apparently synonymous with code point) seems clear. It is the identifying number assigned to assigned Unicode characters.

So far, so good. Some questions:

- are all assigned Unicode characters also abstract characters?

- it seems that not all abstract characters have code points (since abstract characters can be formed using combining characters). Is that correct?

- do (Å) and (A followed by combining ring above) represent the same abstract character?

Would be good if someone could clear this up.

--
Lars Marius Garshol, Ontopian          http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC         http://www.garshol.priv.no
Re: ISO/IEC 10646 versus Unicode
At 16:15 +0100 2002-07-22, Marion Gunn wrote: >Kenneth Whistler wrote: > > > > Marion Gunn wrote: > > > > > > How many years does it take to get ISO/IEC work item accepted, then > > > develop the corresponding Standard to publication stage, Ken? >> > > In the case of 10646, approximately 10 years, Marion. > >10 years? And Unicode, after eleven long years, has yet to produce the >promised Universal Character Set/Implementations of 10646. It is absolutely astonishing to me that after all these years that you don't know what it is that Unicode is meant to produce. Unicode produces a character set standard, equivalent to the character set of ISO/IEC 10646. Unicode also produces some other standards which guide implementation. The implementation is done by Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many other companies. If you wish to learn more, start at http://www.unicode.org/unicode/standard/WhatIsUnicode.html. You may also be interested to see how many products are Unicode-enabled. See http://www.unicode.org/unicode/onlinedat/products.html. >Any fool can chuck missiles from the discarded rockheaps of history, Truer words were never spoken. >but I do know what my company understood itself to be investing in >through many expensive years of supporting Unicode. I don't know what you were understanding during the period at which "EGT" was "investing" in travel to ISO and CEN meetings, but I understood it perfectly well. I would prefer it very much if you would not speak for me with regard to that time period on this or other lists. >It was in the Universal Character Set and 10646 Implemenations, >which I still hope to see Unicode produce, or at least a reasonable >timetable offered. Does Unicode have a reasonable timetable to offer? I wonder how many characters I actually helped encode so far? It must be approaching four and a half thousand. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: ISO/IEC 10646 versus Unicode
On 07/22/2002 10:15:37 AM Marion Gunn wrote:

>I do know
>what my company understood itself to be investing in through many
>expensive years of supporting Unicode. It was in the Universal Character
>Set and 10646 Implementations, which I still hope to see Unicode produce,
>or at least a reasonable timetable offered. Does Unicode have a
>reasonable timetable to offer?

I'm not sure I get this: 10646 implementations to be produced by [The] Unicode [Consortium]? My understanding is that TUC does not produce implementations; they only produce a standard known as The Unicode Standard. As for producing the Universal Character Set, they are in the process of doing that in concert with WG2, and are working with exactly the same timetable as WG2.

The point of this thread escapes me.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
Re: ISO/IEC 10646 versus Unicode
Kenneth Whistler wrote:

> Marion Gunn wrote:
> >
> > How many years does it take to get ISO/IEC work item accepted, then
> > develop the corresponding Standard to publication stage, Ken?
>
> In the case of 10646, approximately 10 years, Marion.
> ...

10 years? And Unicode, after eleven long years, has yet to produce the promised Universal Character Set/Implementations of 10646. Any fool can chuck missiles from the discarded rockheaps of history, but I do know what my company understood itself to be investing in through many expensive years of supporting Unicode. It was in the Universal Character Set and 10646 Implementations, which I still hope to see Unicode produce, or at least a reasonable timetable offered. Does Unicode have a reasonable timetable to offer?

mg

--
Marion Gunn * E G T (Estab.1991)
vox: +353-1-2839396 * [EMAIL PROTECTED]
27 Páirc an Fhéithlinn; Baile an Bhóthair; Contae Átha Cliath; Éire
Re: Corporate influence on Unicode development (long)
On 07/21/2002 07:30:33 PM "Doug Ewell" wrote:

>First of all, the figure that William (or any other individual) really
>should be looking at is not $12,000 for a full membership, but $600 for
>a "specialist" membership or $120 for an "individual" membership. (BTW,
>I would be interested in hearing -- perhaps off-line -- from individuals
>who hold or have held such memberships, to find out how they felt their
>memberships benefited them and Unicode.)

Only the $12,000 membership makes you a candidate to vote on UTC. Associate and Specialist memberships do, however, give you access to the "insider's" mailing list (where real proposals get discussed) and to documents, which has been very useful.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
Re: ISO/IEC 10646 versus Unicode
Dear colleagues, I was biting my tongue there for a bit, but as this list is both public and archived, I am afraid that I have little choice but to respond to Marion Gunn's revisionist history, as it reflects on my own activities working for the Universal Character Set. I will begin by reminding readers of this list that I have had no interest in "EGT" since September 2001, when I ceased to be an owner-director of that limited company. However, as Marion refers to the period of time when I *was* involved, it seems to me proper that I set the record straight. At 11:36 +0100 2002-07-18, Marion Gunn wrote: >EGT was one of the first companies to give (almost) unqualified >support to the setting up of Unicode. This could not possibly be considered to be true. As told in http://www.unicode.org/unicode/history/ the UTC meetings are counted from February 1989. I didn't come to Ireland until September 1989. The Unicode Consortium was officially incorporated in January 1991. "EGT" wasn't incorporated until February 1991. Further, although "EGT" did become aware of the 10646 ballot in time to influence Ireland's vote on the DIS in June 1991, it was afterward that "EGT" made contact with the Unicode Consortium, when I wrote a number of responses to UTR #1 and UTR #2: Burmese (April 1993), Ethiopic (May 1993), Sinhala and Tibetan (September 1993). The formal involvement of "EGT" in standards development began with my attendance of a CEN/TC304 meeting in Paris in 1994. In October 1994 I attended my first meeting of ISO/IEC JTC1/SC2/WG2 in San Francisco, and it was there that I first met members of the Unicode Technical Committee. (Asmus Freytag and I hit it off rather badly, in the spirit of cautious distrust which was, it has to be admitted, present in the 10646-vs-Unicode spirit of those times. Now, of course, we work closely together as co-editors and are fast friends; I have the honour of being godfather to his daughter Brianna.) 
>When it became clear that 10646 was getting unwieldy, EGT took a >2-pronged approach, consisting of establishing new Irish National >Standards and adding to the 8859-series, which proved a lot more >productive than trusting to 10646 alone (both of which aims EGT >successfully achieved). "EGT" did not propose the development of I.S. 434 (8-bit code for Ogham) and ISO/IEC 8859-14 (Latin 8, Celtic) because "10646 was getting unwieldy". I had developed Ogham and Gaelic fonts for use on 8-bit operating systems, and it seemed that support for Celtic text written with those character sets would be likelier if there were formal standards available. That is the reason those standards were developed. I was the editor of both of those standards on behalf of NSAI/AGITS/WG6 (now NSAI/ICTSCC/SC4) and ISO/IEC JTC1/SC2/WG3. >I, for one, am still a believer in the vision of Unicode, and still >monitor/support its mailing list/other activities, and hope to live >long enough to see it succeed, although I have to admit to getting >so very many things wrong about Unicode in the past: [...] I >thought, for example, that involvement in it would cost EGT very >little, in terms of working hours (wrong) and in terms of money >(wrong) [] Marion writes about "EGT" as though it were more than the sum of its parts. From February 1991 to September 2001, in any case, it was certainly not so; during that period, "EGT" consisted of two people, myself and Marion, and no more. What money was spent on standardization was chiefly for JTC1/SC2/WG2 and CEN/TC304 activities, in point of fact, and it was spent with the agreement of the two co-directors who both signed the cheques. It ought not to be made to look otherwise. For my part, I regret not one penny of the money we chose to spend on standardization travel, nor one minute of the time I invested in drawing up script and character proposals. 
Consider, for instance, the living scripts which have been encoded to date with at least some of my input (Buhid, Cherokee, Canadian Syllabics, Ethiopic, Hanunóo, Khmer, Limbu, Myanmar, Sinhala, Tagbanwa, Tai Le, Thaana, Tibetan, and Yi). These are used to write languages spoken by some 63 million people on our planet. The investment has, to be sure, enabled me to come into the fullness of my ability to do what has become my own life's work. If I may be so bold to say so, the Unicode Standard and ISO/IEC 10646 -- and computer users worldwide -- are better off for the investment which "EGT" made between 1994 and 2001 than they would have been otherwise. >When, after all the years of receiving Irish support, I saw >Unicode's 2002 conference in Dublin being advertised as more of a >showcase for German than native interests, I decided not to attend, >but that does not mean any withdrawal of EGT's initial and >longstanding support of Unicode, in principal (although it seems to >have produced only one thing to date, viz., a book called "The >Unicode Standard"
Normalization
Hi,

Before getting to the question, let me explain the scenarios first:

Scenario 1: I need to compare strings containing Japanese/French characters entered from the command line against strings stored in a SQL Server database (stored through an ASP application) as the nvarchar datatype. The application accepting the command line arguments and doing the comparison is a C++ console application.

Scenario 2: A C++ application is posting data (strings containing French/Japanese characters) to a Java Servlet. The C++ application exists on both Windows and Mac.

My question is the same for both scenarios: do I need to do anything special with respect to normalization while comparing the strings, or are the C++ (first scenario) and Java (second scenario) string comparison functions capable enough to work properly on their own?

Thanks in advance.

Regards,
Debmalya Biswas
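To make the concern concrete: a plain code-point-by-code-point comparison treats a precomposed character and its combining sequence as different strings, even though they are canonically equivalent. The sketch below is in Python rather than the C++ or Java of the question, purely to show the idea; `same_text` is a hypothetical helper, and the byte strings stand in for input arriving from a command line and from a database. Whether the two sources in a real deployment actually differ in normalization form is an assumption that would need checking.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# French: "café" with e + COMBINING ACUTE ACCENT, as UTF-8 bytes from one source...
from_command_line = b"cafe\xcc\x81".decode("utf-8")
# ...and with precomposed U+00E9, as UTF-16 bytes from another source.
from_database = "caf\u00e9".encode("utf-16-le").decode("utf-16-le")

# Japanese: KA + COMBINING VOICED SOUND MARK versus precomposed GA.
ga_decomposed = "\u304B\u3099"
ga_precomposed = "\u304C"

if __name__ == "__main__":
    print(from_command_line == from_database)            # raw comparison
    print(same_text(from_command_line, from_database))   # normalized comparison
    print(same_text(ga_decomposed, ga_precomposed))
```

If both sides are known always to produce the same form (for example, both always precomposed), the raw comparison would already agree; the normalization step only matters when the forms can differ.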
Re: UTS #10: Unicode Collation Algorithm (UCA)
At 06:15 -0700 2002-07-21, Michael \(michka\) Kaplan wrote: >The UCA provides a very nice framework. But if you already have a house, who >needs a new frame? Because your already nice house isn't very friendly. It isn't tailorable by anyone but you, which means, in effect, unless you're an invited guest, you won't be able to enjoy the house. Of course it's understandable that you did something else while things were under ballot and so on. Perhaps you should migrate your current ordering support to a UCA-based one. That would leave you more flexible in future, particularly for the support of smaller communities. -- Michael Everson *** Everson Typography *** http://www.evertype.com
Re: TIM - A Table-base Input Method Module
> From: Arthit Suriyawongkul <[EMAIL PROTECTED]>
> anybody here interested in this Table-based Input Method ?
> http://sourceforge.net/projects/wenju/
> i've got this site from gtk-i18n-list.

I have not looked at this one yet, but you may also want to take a look at IIIMF (http://www.li18nux.org/subgroup/im/IIIMF), which has something similar: a table-based IM called ude (user defined engine). Also, an XML-based IM language, EIMIL (Extensible IM interface Language), was recently introduced to IIIMF; with it you can combine the table-based IM, portable XML-based logic, and a backend dictionary lookup server.

You can retrieve the source as follows:

cvs -d :pserver:[EMAIL PROTECTED]:/cvsroot co -r exp-EIMIL-1 im-sdk

The following is a sample XML-based IM definition, which you can find in im-sdk/server/programs/language_engines/canna. This sample shows how you can combine the table, the logic, and a backend dictionary lookup server (in this case, the Japanese Canna dictionary lookup server).

---
"a" "あ" "i" "い" "u" "う" "e" "え" "o" "お"
"xa" "ぁ" "xi" "ぃ" "xu" "ぅ" "xe" "ぇ" "xo" "ぉ"
"ka" "か" "ki" "き" "ku" "く" "ke" "け" "ko" "こ"
"kya" "きゃ" "kyi" "きぃ" "kyu" "きゅ" "kye" "きぇ" "kyo" "きょ"
"ga" "が" "gi" "ぎ" "gu" "ぐ" "ge" "げ" "go" "ご"
"gya" "ぎゃ" "gyi" "ぎぃ" "gyu" "ぎゅ" "gye" "ぎぇ" "gyo" "ぎょ"
"sa" "さ" "si" "し" "su" "す" "se" "せ" "so" "そ"
"sha" "しゃ" "shi" "し" "shu" "しゅ" "she" "しぇ" "sho" "しょ"
"sya" "しゃ" "syi" "しぃ" "syu" "しゅ" "sye" "しぇ" "syo" "しょ"
"za" "ざ" "zi" "じ" "zu" "ず" "ze" "ぜ" "zo" "ぞ"
"ja" "じゃ" "ji" "じ" "ju" "じゅ" "je" "じぇ" "jo" "じょ"
"zya" "じゃ" "zyi" "じぃ" "zyu" "じゅ" "zye" "じぇ" "zyo" "じょ"
"ta" "た" "ti" "ち" "tu" "つ" "te" "て" "to" "と"
"cha" "ちゃ" "chi" "ち" "chu" "ちゅ" "che" "ちぇ" "cho" "ちょ"
"tya" "ちゃ" "tyi" "ちぃ" "tyu" "ちゅ" "tye" "ちぇ" "tyo" "ちょ"
"da" "だ" "di" "ぢ" "du" "づ" "de" "で" "do" "ど"
"dha" "でゃ" "dhi" "でぃ" "dhu" "でゅ" "dhe" "でぇ" "dho" "でょ"
"dya" "ぢゃ" "dyi" "ぢぃ" "dyu" "ぢゅ" "dye" "ぢぇ" "dyo" "ぢょ"
"na" "な" "ni" "に" "nu" "ぬ" "ne" "ね" "no" "の"
"nya" "にゃ" "nyi" "にぃ" "nyu" "にゅ" "nye" "にぇ" "nyo" "にょ"
"ha" "は" "hi" "ひ" "hu" "ふ" "he" "へ" "ho" "ほ"
"fa" "ふぁ" "fi" "ふぃ" "fu" "ふ" "fe" "ふぇ" "fo" "ふぉ"
"hya" "ひゃ" "hyi" "ひぃ" "hyu" "ひゅ" "hye" "ひぇ" "hyo" "ひょ"
"ba" "ば" "bi" "び" "bu" "ぶ" "be" "べ" "bo" "ぼ"
"bya" "びゃ" "byi" "びぃ" "byu" "びゅ" "by
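As a rough illustration of how a table-based input method might consume a table like this one, here is a minimal Python sketch. It is not the IIIMF, ude, or EIMIL implementation. The function name `convert`, the dictionary `ROMAJI_TO_KANA`, and the greedy longest-match strategy are all assumptions made for the example. The dictionary holds only a handful of romaji-to-hiragana pairs that correspond to entries in the sample table.

```python
# Hypothetical excerpt of a romaji-to-hiragana table (illustration only).
ROMAJI_TO_KANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "kya": "きゃ", "kyu": "きゅ", "kyo": "きょ",
    "su": "す", "sha": "しゃ", "shi": "し",
    "chi": "ち", "tu": "つ", "to": "と",
    "na": "な", "ni": "に", "ha": "は", "fu": "ふ",
}


def convert(text, table=ROMAJI_TO_KANA):
    """Convert text with a greedy longest-match lookup in the table.

    At each position the longest key that matches is replaced by its
    value; characters that start no key are copied through unchanged.
    """
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in table:
                out.append(table[chunk])
                i += length
                break
        else:
            # No key matched at this position: pass the character through.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert("kyoto"))
    print(convert("sushi"))
```

Longest match matters because keys overlap: "kyo" must win over "k" followed by "yo", and "shi" over "s" followed by "hi". A real engine would also have to handle incremental input, where the user has typed "k" and the method cannot yet tell which entry is intended; this sketch converts only complete strings.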
TIM - A Table-base Input Method Module
anybody here interested in this Table-based Input Method ?
http://sourceforge.net/projects/wenju/
i've got this site from gtk-i18n-list. :)

regards,
Art

Original Message
Subject: Re: TIM - A Table-base Input Method Module
Date: Sun, 21 Jul 2002 09:03:06 -0400
From: Daniel Yacob <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED], [EMAIL PROTECTED]

many months later...

> Now I just finished such an IM module which you can find at
> http://sourceforge.net/projects/wenju/
> I call it TIM (Table-based Input Method). I haven't released a package
> yet, but it is in the CVS.

I do like this idea. If I were to give a wish list of features I'd like to see in an IM description file, I'd no doubt end up describing what Keyman uses. Perhaps that is because it is what I'm most familiar with, but it also has some nice expressive syntax. It has occurred to me before that it would be nice to be able to import Keyman .kmn files directly.

Has an XML definition for IMs ever been developed? It would *really* be nice to have some kind of universal, vendor-independent IM definition, like Unicode is to charsets. Could TIM be taken in this direction? Towards a XIM? I'd be happy to participate in defining an XML schema for it. Anyone interested?

cheers,
/Daniel
Re: Unicode mention (Urdu)
At 02:07 22/07/02 +0100, Alistair Vining wrote:
>The cross-platform message somewhat dulled by the font [Urdu Naskh Asiatype] download
>being a Windows .exe file with (judging by a message that popped up) a copy of the
>uniscribe .dll...

If this is true, then are the BBC pirating Microsoft's software? And if that is the case, could Microsoft please either sue or not sue?

I'm not saying this out of perversity. The non-redistributability of Uniscribe is an enormous inconvenience to software developers like us, because either we have to do without Uniscribe or we have to force our users to install a large and irrelevant software package (such as a web browser) in order to make sure that they have it on their systems. So if the BBC has found a way to redistribute Uniscribe legally, we want to hear about it; or if Microsoft have decided to take no notice if people do distribute Uniscribe, we want to hear about that too!
Re: How to type sporadic Unicode (was: User interface for keyboard input)
Martin Kochanski wrote:

> Microsoft's Alt+X method: unfortunately, there is no such thing. I
> have seen at least two different Alt+X methods in Microsoft software:

I should have said "one of Microsoft's Alt+X methods."

> Methods specified by ISO 14755: unfortunately, there are no such
> things. As you and others have said, the ISO 14755 specification
> merely specifies properties that conforming methods should have, it
> does not specify the methods themselves [you can imagine a predecessor
> standard specifying that characters should be typed by hitting keys
> but not specifying the keyboard layout itself].

I should have said "methods conforming to ISO 14755."

My language may not have been precise, but my point should have been clear: until the real world settles on a standard for entry of arbitrary Unicode characters, as universal as Ctrl+C for copy and Ctrl+V for paste, there is no need for an application to support only one method.

> Is there any sign of an emerging consensus as to what the beginning
> and ending sequences might be? Addison Phillips mentioned \uX, but
> that was in a programming context and he says himself it wouldn't be
> suitable for running text. It would be nice if an innocent user faced
> with a new software package did not have to look up manuals or
> experiment to see what the beginning sequence was.

I wouldn't mind seeing the ISO 14755 suggestions, "press and hold Ctrl+Shift" and "release Ctrl+Shift," take hold. But Ctrl+Shift sequences could already be assigned by users, as you note. I use Ctrl+Shift+C myself to launch Character Map.

-Doug Ewell
Fullerton, California
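To make the "beginning and ending sequence" problem concrete, here is a small Python sketch of one possible style of hex entry, in which the user types hex digits and then presses a conversion key. It does not reproduce any specific Microsoft method, nor anything ISO 14755 prescribes. The function name `convert_trailing_hex` and the rule of accepting 4 to 6 trailing hex digits are assumptions made for the example.

```python
import re

# A run of 4 to 6 hex digits at the very end of the text (assumed rule).
_TRAILING_HEX = re.compile(r"[0-9A-Fa-f]{4,6}$")


def convert_trailing_hex(text):
    """Replace a trailing run of hex digits with the character it names.

    The text is returned unchanged if there is no such run, or if the
    value is a surrogate code point or lies beyond U+10FFFF.
    """
    match = _TRAILING_HEX.search(text)
    if match is None:
        return text
    code_point = int(match.group(), 16)
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        return text
    return text[:match.start()] + chr(code_point)


if __name__ == "__main__":
    print(convert_trailing_hex("Price: 20AC"))
    print(convert_trailing_hex("x00E9"))
```

Without an explicit beginning sequence the method has to guess where the hex digits start. Under the rule assumed here, "abc00E9" is read as the six digits "bc00E9", which is out of range, so nothing is converted even though the user presumably meant U+00E9. That ambiguity is what a standard begin/end convention such as "press and hold Ctrl+Shift" would remove.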