RE: Summary: xml:lang validity and RFC 1766 refs to outdated
Mike Brown <[EMAIL PROTECTED]> wrote: > Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO > 639 and ISO 3166, at least not by a strict interpretation of its > formal language. And to date, there still is no successor to RFC > 1766. The successors to ISO 639 and ISO 3166 are newer versions of 639 and 3166. ISO standards don't get new numbers when they are revised, as RFCs do. I don't see anything in RFC 1766 that hardcodes it to the 1988 versions of either 639 or 3166. The 1988 versions are cited in the "References" section, but that is just to provide a bibliographically complete citation based on the versions available at the time (March 1995). It would make no sense in any event for an application using RFC 1766 (including, but not limited to, XML) to be artificially limited to the language or country codes set at a fixed point in the past. -Doug Ewell Fullerton, California
Why not to move characters (was: is there any way to change already defined character codes?)
You don't want to move characters because then you could change the meaning of a sentence that way. I don't want to price something at 1000 cows when I mean 1000 yen. Or worse, 100 yen. ___ Get your own FREE Bolt Onebox - FREE voicemail, email, and fax, all in one place - sign up at http://www.bolt.com
Re: is there any way to change already defined character codes?
From: <[EMAIL PROTECTED]> > [EMAIL PROTECTED] wrote: > > E.g., if you look at the Latin part, you see that > > the 26 letters used in > > modern English are all contiguously ordered in > > two areas: U0041 to U005A > > (uppercase) and U0061 to U007A (lowercase). > > Yeah, but so what? All you gotta do is turn the 6th > bit off and there you go! > > > > But that's the end of the story! All the other > > 100's Latin letters are > > scattered all over, using no consistent order. > > > Too bad unicode values can't be fractions!! Lets take this one offline, Robert. michka
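The "turn the 6th bit off" remark quoted above can be sketched in C. This is only an illustration, not code from the thread: both function names are hypothetical. Clearing bit 0x20 maps U+0061..U+007A onto U+0041..U+005A, but that relation is guaranteed only for those 26 letters, which is why the first helper checks the range before touching the bit.

```c
#include <stdint.h>

/* Hypothetical helper: uppercase a code point by clearing bit 0x20.
 * Only valid for the basic Latin letters a-z (U+0061..U+007A);
 * every other code point is returned unchanged. */
uint32_t ascii_upper_by_bit(uint32_t cp)
{
    if (cp >= 0x61 && cp <= 0x7A) {
        return cp & ~(uint32_t)0x20;
    }
    return cp;
}

/* Hypothetical counter-example: the same bit cleared with no range check. */
uint32_t clear_bit_blindly(uint32_t cp)
{
    return cp & ~(uint32_t)0x20;
}
```

Applied blindly, the trick turns U+00FF (y with diaeresis) into U+00DF (sharp s), although the uppercase form of U+00FF is U+0178. Outside the 26 basic letters, case mapping needs a table, not arithmetic on code points.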
Re: GEORGIAN DIGITs
Well, if the language does not have them, you will not find them. Funny how that works, huh? michka Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/ - Original Message - From: <[EMAIL PROTECTED]> To: "Unicode List" <[EMAIL PROTECTED]> Sent: Tuesday, August 08, 2000 4:24 PM Subject: GEORGIAN DIGITs > Where are the Georgian digits? I want a set of Georgian > digits so I can use them as counter digits. > > -- > Robert Lozyniak > Accusplit pedometer manufactures can go suck eggs > My page: http://walk.to/11 > [EMAIL PROTECTED] - email > (917) 421-3909 x1133 - voicemail/fax
RE: is there any way to change already defined character codes?
-- Robert Lozyniak Accusplit pedometer manufactures can go suck eggs My page: http://walk.to/11 [EMAIL PROTECTED] - email (917) 421-3909 x1133 - voicemail/fax [EMAIL PROTECTED] wrote: > Sandro Karumidze wrote: > > The issue is that in Unicode there is a sequence > of Georgian > > caracters different > > from what this people think should be. > > [...] In beginning of this century 5 characters > were dropped > > [...] > > In Unicode this 5 characters follow 33. There > is a different > > point of view that those 5 should be included > among the > > ohters. > > (You definitely need an official reply, but let's > go on with some more > informal chatting.) > > I foresee that this would not be considered a good > reason to change > anything. > > The order of characters in Unicode (or in any other > character encoding) is > not important. The scope of a character set is > to assign a unique number to > each character, not to define an "alphabetical > order". > Yeah. Just look at the kanji digits! > If you notice, the situation that you describe > is true for *all* the > alphabets in Unicode. > > E.g., if you look at the Latin part, you see that > the 26 letters used in > modern English are all contiguously ordered in > two areas: U0041 to U005A > (uppercase) and U0061 to U007A (lowercase). Yeah, but so what? All you gotta do is turn the 6th bit off and there you go! > > But that's the end of the story! All the other > 100's Latin letters are > scattered all over, using no consistent order. > Too bad unicode values can't be fractions!! > The same is true for Cyrillic, Greek, Hebrew, Arabic, > and so on. Have a look > at those blocks: the basic letters for post-czar > Russian, modern Greek, > Israeli Hebrew, modern Arabic etc. are consistently > ordered, but the letters > for other languages that use the same alphabets > (or ancient letters for the > same languages) are scattered all over with no > specific order. 
>
> The reason why no one cares about the order of characters is that it is *impossible* to determine a "correct" order.
>
> In alphabet used by more than one language (e.g. Latin, Cyrillic, Arabic, Devanagari, etc.), the alphabetic order is normally different for each language.
>
> Moreover, many languages have more than one alphabetic order, all equally valid and in current usage.
>
> For this reason the problem of "alphabetic order" has been pulled apart from character sets, and addressed separately.
>
> In Unicode, the issue of "collation" is handled by ad-hoc optional algorithm, that is part of the standard but is separated from the encoding issue itself.
>
> The algorithm is titled "Unicode Technical Report #10: Unicode Collation Algorithm", and you can find it here: http://www.unicode.org/unicode/reports/tr10/ .
>
> *That* is the place to check whether Georgian Letters are in the correct order or not. And if they are not, you have two options:
>
> 1) Ask Unicode to change it: here you *do* have some chances to be listened, if you have valid arguments.
>
> 2) Change it yourself: unlike the character values, the collation algorithm is designed to be flexible and customizable.
>
> Regards,
> _ Marco
GEORGIAN DIGITs
Where are the Georgian digits? I want a set of Georgian digits so I can use them as counter digits. -- Robert Lozyniak Accusplit pedometer manufactures can go suck eggs My page: http://walk.to/11 [EMAIL PROTECTED] - email (917) 421-3909 x1133 - voicemail/fax
Re: is there any way to change already defined character codes?
The question is: > Is there any way to change already defined character codes? And the definitive answer is "No". Marco Cimarosti wrote: > (You definitely need an official reply, but let's go on with some more > informal chatting.) OK, here is another semi-official reply from me, as a UTC member, since everyone else seems to be at the UTC meeting this week... As far as I know, neither WG2 nor UTC would vote to re-order the Georgian alphabet because that would invalidate all existing data. Neither WG2 nor UTC would remove or move any existing characters for the same reason. Use a tailored sorting table if you need a different ordering. Jianping Yang wrote: > Not really for Unicode in which we have relocated some codepoints > for Hangul between Unicode 1.1 and 2.0 :) The fact that there was a re-ordering in Hangul some years ago was a travesty and an embarrassment that nobody wants to repeat. Rick
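A "tailored sorting table" of the kind suggested above might look like the following C sketch. It is an assumption-laden illustration, not the Unicode Collation Algorithm and not any vendor's implementation: the function names are hypothetical, only the Mkhedruli letters U+10D0..U+10F4 are handled, and the placement of the four archaic letters follows the "traditional" order described elsewhere in this thread (He after Zen, Hie after Nar, We after Un, Har after Xan). The code points themselves never change; only the comparison does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sort key for Georgian Mkhedruli letters.
 * Letters U+10D0..U+10F0 keep their code point order and get even keys.
 * Each of the four archaic letters gets the odd key that falls just
 * after the letter it follows in the traditional order. */
uint32_t traditional_key(uint32_t cp)
{
    switch (cp) {
    case 0x10F1: return (0x10D6 - 0x10D0) * 2 + 1; /* HE after ZEN */
    case 0x10F2: return (0x10DC - 0x10D0) * 2 + 1; /* HIE after NAR */
    case 0x10F3: return (0x10E3 - 0x10D0) * 2 + 1; /* WE after UN */
    case 0x10F4: return (0x10EE - 0x10D0) * 2 + 1; /* HAR after XAN */
    default: break;
    }
    if (cp >= 0x10D0 && cp <= 0x10F0) {
        return (cp - 0x10D0) * 2;
    }
    /* Anything else sorts after these letters, in code point order. */
    return 0x1000 + cp;
}

/* Compare two arrays of code points under the tailored order.
 * Returns a negative, zero or positive value, like strcmp. */
int compare_traditional(const uint32_t *a, size_t na,
                        const uint32_t *b, size_t nb)
{
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        uint32_t ka = traditional_key(a[i]);
        uint32_t kb = traditional_key(b[i]);
        if (ka != kb) {
            return ka < kb ? -1 : 1;
        }
    }
    if (na == nb) {
        return 0;
    }
    return na < nb ? -1 : 1; /* a shorter prefix sorts first */
}
```

With this comparison, U+10F1 (HE) sorts before U+10D7 (TAN) even though its code point is higher. A real application would rather rely on a collation library than on a hand-written table like this one.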
RE: Unicode String literals on various
Hi, Antoine. > I can continue to dissert on this subject (all of this should > finally be > cooked in a FAQ anyway), but I do not want to flood the list > with a marginaly interesting subject. Merci beaucoup. It was very informative! Ciao. Marco P.S. You should not be so shy: up to date information about how Unicode may be used in the world's most important programming language does not sound so "off topic" or "marginally interesting" to me. Ciao++ M.
Re: Summary: xml:lang validity and RFC 1766 refs to outdated codes
Mike Brown wrote: > Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and > ISO 3166, at least not by a strict interpretation of its formal language. > And to date, there still is no successor to RFC 1766. Right. So Yn nediwn seint yn llinghedig, yn nediwn seint yn cor is not proper XML, although it is well-formed, because the language tag "roa" (Romance, Other) is not legal by RFC 1766. But when RFC 1766 is officially revised to include such language tags, it *will* be good XML. > The use of "are" in that statement sounds as definitive as "must" to me. No, because a violation of a "must" rule is a violation of well-formedness, requiring the report of a fatal error and draconian error recovery. > As > an XML document author, or the programmer of an XML document authoring tool, > tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang > values? You do. > It seems that XML says I must use them, but it would not a violation > of validity if I didn't use them. It is a violation of the intent of the xml:lang attribute not to use them. > ...so the removal of productions 33-38 from XML really just seem to be > intended to allow RFC 1766 and its successors determine the proper > construction of a language tag, which makes more sense than trying to > reiterate the RFC's technical contents in XML's specification. Just so. > It doesn't > necessarily follow that xml:lang values can avoid conforming to RFC 1766. They cannot avoid it. -- Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]> Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)
Re: Unicode String literals on various
[EMAIL PROTECTED] wrote:
>
> Antoine Leca wrote:
> > char C_thai[] =
> > "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
>
> Would the Unicode values be converted to the local SBCS/MBCS character set?

In this case, yes (assuming a normal C compiler). With wchar_t / L"...", they are converted to the local "wide character set", which happens to be Unicode on most boxes, with the following main exceptions:

- some (cheap) C compilers do not have any special support for wchar_t, so it defaults to the same type as char and is usually 8 bits wide;
- with East Asian C compilers, wchar_t is either Unicode or a flat character coding, in which every character, whether coded as SBCS or DBCS, is stored with its nominal, legacy code in a 16-bit or 32-bit cell (this differs from MBCS in that the ASCII characters are stored in cells of the same width as the DBCS characters);
- EBCDIC implementations have their own rules (for obvious reasons), which I do not know exactly (I am not sure they are consistent).

C99 also specifies that if __STDC_ISO_10646__ is defined, then the wchar_t values are the Unicode codepoints (to learn whether that means UTF-16 or UTF-32, one should then look at WCHAR_MAX to see whether wchar_t is 16-bit or 32-bit).

> If yes:
>
> Is the definition of this locale info part of the C99 standard itself, or is
> it operating system's locale?

It is "implementation-defined". Which means:

- it is not required in any way by the C99 Standard itself (except if __STDC_ISO_10646__ is defined);
- it is required to be stated in full words in the documentation for the compiler;
- it can vary with compilation options; often the OS's current locale is the default value, which can be overridden.

> And what happens to Unicode values that cannot be converted in that
> character set?
The compiler is required to fall back to something (it cannot refuse to compile, nor can it simply drop the character); it is allowed to "fall back" to a different character depending on the typed character, though; so for example,

    #include <locale.h>
    #include <stdio.h>

    int main()
    {
        /* %ls converts through the current locale, so select the user's locale first */
        setlocale(LC_ALL, "");
        printf("%ls\n", L"\u00C0 table!");
        return 0;
    }

can produce (among others, this is UTF-8 encoded):

    À table!
    A table!
    à table!
     table!

I can continue to dissert on this subject (all of this should finally be cooked in a FAQ anyway), but I do not want to flood the list with a marginally interesting subject.

Antoine
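The __STDC_ISO_10646__ and WCHAR_MAX checks mentioned in this thread can be sketched as follows. This is a rough probe under stated assumptions, not a portable guarantee: the two function names are hypothetical, and a compiler that does not define the macro may still use Unicode values for wchar_t without promising so.

```c
#include <stdint.h>
#include <wchar.h>

/* Returns 1 if the implementation promises that wchar_t values are
 * ISO/IEC 10646 (Unicode) code points, 0 if it makes no such promise. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if a single wchar_t is wide enough for every code point up
 * to U+10FFFF (a 32-bit wchar_t, typically), 0 if it is narrower
 * (a 16-bit or 8-bit wchar_t). */
int wchar_can_hold_any_code_point(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

A program could call both at start-up and fall back to its own conversion routines when either returns 0.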
Re: is there any way to change already defined character codes?
At 11:01 PM -0800 8/7/00, Jianping Yang wrote: >Not really for Unicode in which we have relocated some codepoints for Hangul >between Unicode 1.1 and 2.0 :) > And have regretted it ever since. Moving the Hangul and renaming æ have caused no end of problems. It was the fact that it was so disastrous when done once that makes everyone determined not to do it again. -- = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://www.blueneptune.com/~tseng
FW:Unicode Font with Special Effects
-Original Message- From: Greg Olsen [mailto:[EMAIL PROTECTED]] Sent: Friday, August 04, 2000 3:14 AM To: [EMAIL PROTECTED] Subject: Font question Dear Sirs, My name is Greg Olsen. I am an Industrial Designer in Irvine California and I need information. I am developing a user Interface that I will hand off to be programmed in C. The interfaces design has the Arial font, the catch is that there is a beveled effect on each letter. I was wondering if there is a UNICode font that is capable of these type of effects. The interface is for a medical product that is to be released in numerous countries and languages. Any information would be helpful. Thank you for your time and response, Greg Olsen Patton Design 8 Pasteur #170 Irvine, CA 92618 [EMAIL PROTECTED]
RE: Unicode String literals on various
Antoine Leca wrote: > char C_thai[] = > "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33"; Would the Unicode values be converted to the local SBCS/MBCS character set? If yes: Is the definition of this locale info part of the C99 standard itself, or is it operating system's locale? And what happens to Unicode values that cannot be converted in that character set? Thanks. _ Marco
Re: Zero-width ligator
Peter Constable <[EMAIL PROTECTED]> wrote: > I inquired about that recently on the unicoRe list, and was told that > the semantics of ZWJ/ZWNJ will be extended in 3.0.1 (or maybe it was > 3.1). Well, that's a good thing. It sounds like the benefits described by Everson will be made available in Unicode after all. > You mentioned that this decision was made at the meeting in February. > Interestingly, I was at that meeting, and my recollection was that > extending the semantics of ZWJ/ZWNJ was going to be given further > consideration, after some people investigated the implications of > extending the semantics of ZWJ, particularly for Indic scripts. But I > left before the meeting was over, and the minutes reflect that a > decision was in fact made (although the weasle word "provisionally" > is used). Thanks for the insight on this process. Somehow I needed more information than the word "rejected" in the Pipeline table could offer. \u263a -Doug Ewell Fullerton, California
RE: Summary: xml:lang validity and RFC 1766 refs to outdated code
> > XML 1.0 says that xml:lang attributes must match production 33
>
> In fact, not so. Productions 33-38 have no normative value whatsoever, as there is neither a production nor normative language connecting them with the rest of XML 1.0.
> [...]
> In recognition of this fact, official erratum E73 (at http://www.w3.org/XML/xml-19980210-errata#E73) removes these productions from XML 1.0 altogether. It also allows for a successor to RFC 1766 when and if such a thing exists.

Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and ISO 3166, at least not by a strict interpretation of its formal language. And to date, there still is no successor to RFC 1766.

E73 says in its rationale "The XML processor does not deal with the value of xml:lang", but it also says, more formally, "The values of the attribute are language identifiers as defined by [IETF RFC 1766]". The use of "are" in that statement sounds as definitive as "must" to me. As an XML document author, or the programmer of an XML document authoring tool, tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang values? It seems that XML says I must use them, but it would not be a violation of validity if I didn't use them.

I also don't see how one could read RFC 1766 in such a way as to ignore its prescription of a finite range of possible values for what it calls a language tag:

    Language-Tag = Primary-tag *( "-" Subtag )
    Primary-tag = 1*8ALPHA
    Subtag = 1*8ALPHA

    In the primary language tag:
    - All 2-letter tags are interpreted according to ISO standard 639, "Code for the representation of names of languages" [ISO 639].
    [...mention of "i-" and "x-"...]
    - Other values cannot be assigned except by updating this standard.
...so the removal of productions 33-38 from XML really just seems to be intended to allow RFC 1766 and its successors to determine the proper construction of a language tag, which makes more sense than trying to reiterate the RFC's technical contents in XML's specification. It doesn't necessarily follow that xml:lang values can avoid conforming to RFC 1766. [We're on the same side, here. I'm just playing devil's advocate, because after I heard about this issue and reviewed the specs myself, I found that there were indeed points of contention.] -Mike
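The RFC 1766 grammar quoted in this message (a primary tag of 1 to 8 letters, followed by any number of "-"-separated subtags of 1 to 8 letters each) can be checked with a few lines of C. This is a syntax-only sketch: the function names are hypothetical, an ASCII-compatible execution character set is assumed, and no ISO 639, ISO 3166 or IANA registry lookup is attempted, so a tag that passes may still not be a legal tag under the RFC's assignment rules.

```c
#include <stddef.h>

/* ASCII-only letter test (assumes an ASCII-compatible character set;
 * avoids the locale-dependent isalpha()). */
static int is_ascii_alpha(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Returns 1 if tag matches
 *     Language-Tag = Primary-tag *( "-" Subtag )
 *     Primary-tag  = 1*8ALPHA
 *     Subtag       = 1*8ALPHA
 * and 0 otherwise. Syntax only: no registry lookup. */
int is_rfc1766_syntax(const char *tag)
{
    size_t run = 0; /* letters seen in the current tag or subtag */

    if (tag == NULL) {
        return 0;
    }
    for (const char *p = tag; ; p++) {
        if (is_ascii_alpha(*p)) {
            if (++run > 8) {
                return 0; /* more than 8 letters in one part */
            }
        } else if (*p == '-' || *p == '\0') {
            if (run == 0) {
                return 0; /* empty primary tag or empty subtag */
            }
            if (*p == '\0') {
                return 1;
            }
            run = 0;
        } else {
            return 0; /* digits and other characters are not ALPHA */
        }
    }
}
```

Note that a three-letter primary tag passes this check because it fits 1*8ALPHA; whether such a tag is actually assigned is a separate question that the grammar alone does not answer.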
RE: Summary: xml:lang validity and RFC 1766 refs to outdated code
Jonathan Borden wrote: > > the 2-letter language code portion of xml:lang values must > > not only be 2 ASCII characters, but... > > Actually production [34] states that the LangCode is one of: > > ISO639Code | IanaCode | UserCode I know that. I also knew that productions 33 through 38 had been made obsolete by an erratum and that the only normative reference for xml:lang values was RFC 1766. That's why I said the 2-letter language code *portion* of xml:lang values. This is in reference to those RFC 1766 conforming language identifiers that include 2-letter language codes. The text of RFC 1766 describes exactly when and where those codes are to be used.
RE: is there any way to change already defined character codes?
On 08/08/2000 06:40:17 AM Marco.Cimarosti wrote: >(You definitely need an official reply, but let's go on with some more >informal chatting.) All the "officials" are busy meeting this week, but the statement, "Can't be done" is just as true whether it comes from the lips (or... fingertips) of a Ken Whistler or Mark Davis as from a Marco Cimarosti or a Chris Fynn. There are enough of us on this list that have a solid understanding of the standard and its development that a question like this can be answered without waiting for an "official" answer (though this question really ought to be answered somewhere on the Unicode web site); if somebody were to give wrong information, there would be several that wouldn't hesitate to correct. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: <[EMAIL PROTECTED]>
Re: Zero-width ligator
I inquired about that recently on the unicoRe list, and was told that the semantics of ZWJ/ZWNJ will be extended in 3.0.1 (or maybe it was 3.1). You mentioned that this decision was made at the meeting in February. Interestingly, I was at that meeting, and my recollection was that extending the semantics of ZWJ/ZWNJ was going to be given further consideration, after some people investigated the implications of extending the semantics of ZWJ, particularly for Indic scripts. But I left before the meeting was over, and the minutes reflect that a decision was in fact made (although the weasel word "provisionally" is used). - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: <[EMAIL PROTECTED]>
Re: thanks is there any way to change already defined character codes?
Does the Traditional sort order I mentioned meet the needs of typical usage? Or are there sorting rules that are missing? I am slowly learning Georgian and have a localizer who I work with as well, but I have much to learn and he makes many allowances for my ignorance (so he may not be as quick to correct me when I am missing something!). michka - Original Message - From: "Sandro Karumidze" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Cc: "Unicode List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Tuesday, August 08, 2000 7:10 AM Subject: thanks is there any way to change already defined character codes? > thank you for information. > > I completely agree with you that codes should not be changes. I just wanted to > know more about rules. > > best regards, > > Sandro Karumidze > > > > > "Michael (michka) Kaplan" wrote: > > > Sandro, > > > > Are you basically wanting the ordering to be different? > > > > Unicode does not have any expressed or implied warranty that the ordering of > > characters will be anything like what a user would expect (how can it, when > > even so many languages that use the same scripts have entirely different, > > occasionally conflicting, collation rules? > > > > It is up to the software to make the necessary collation rules happen. > > > > For example, in Windows 2000 there are two different sorts supported for > > Georgian: "modern" and "traditional." The difference is that modern has four > > letters (He, Hie, We, and Har, both Capital and Small) sort at the end of > > the alphabet (which I presume corresponds to the sort that you do not > > like?), while the traditional sort has: > > > > * He appearing between Zen and Tan > > * Hie appearing between Nar and On > > * We appearing between Un and Phar > > * Har appearing between Xan and Jhan > > > > I presume the above "exceptions" more closely match the sort you would > > expect? 
And if there are more, this would be very valuable information (as > > the rules behind all new "sorts" like this are that a valid need to sort > > text differently was identified. > > > > As a rule, Unicode order is not intended to be nor does it explicitly decide > > to follow any kind of collation rules for code point order. > > > > FWIW, the LCIDs behind these two sorts under Windows 2000 (used in the C > > CompareString and the VB StrComp) are: > > > > Traditional: 1079 (0x0437) > > Modern: 66615 (0x10437) > > > > michka > > > > - Original Message - > > From: "Sandro Karumidze" <[EMAIL PROTECTED]> > > To: "Unicode List" <[EMAIL PROTECTED]> > > Cc: "Unicode List" <[EMAIL PROTECTED]> > > Sent: Tuesday, August 08, 2000 3:26 AM > > Subject: Re: is there any way to change already defined character codes? > > > > > Dear Chris, > > > > > > Thank you for your answer. > > > > > > > May I ask what is the reason these people from the government of Georgia > > want > > > > to change the codepoints of some Georgian characters? There is probably > > another > > > > good solution (or solutions) for whatever problem they think would be > > solved by > > > > changing encoding points. > > > > > > The issue is that in Unicode there is a sequence of Georgian caracters > > different > > > from what this people think should be. > > > > > > In modern Georgian there are 33 widely used characters. However before > > there were > > > 38 characters. In beginning of this century 5 characters were dropped, > > though still > > > used in old texts and by language specialists. > > > > > > In Unicode this 5 characters follow 33. There is a different point of view > > that > > > those 5 should be included among the ohters. > > > > > > This is all the issue - there are no specific implementation difficulties > > or > > > problems. The only point is that 5 among the rest 33 is more "correct". 
thanks is there any way to change already defined character codes?
Thank you for the information.

I completely agree with you that codes should not be changed. I just wanted to know more about the rules.

best regards,

Sandro Karumidze

"Michael (michka) Kaplan" wrote:

> Sandro,
>
> Are you basically wanting the ordering to be different?
>
> Unicode does not have any expressed or implied warranty that the ordering of characters will be anything like what a user would expect (how can it, when even so many languages that use the same scripts have entirely different, occasionally conflicting, collation rules?).
>
> It is up to the software to make the necessary collation rules happen.
>
> For example, in Windows 2000 there are two different sorts supported for Georgian: "modern" and "traditional." The difference is that modern has four letters (He, Hie, We, and Har, both Capital and Small) sort at the end of the alphabet (which I presume corresponds to the sort that you do not like?), while the traditional sort has:
>
> * He appearing between Zen and Tan
> * Hie appearing between Nar and On
> * We appearing between Un and Phar
> * Har appearing between Xan and Jhan
>
> I presume the above "exceptions" more closely match the sort you would expect? And if there are more, this would be very valuable information (as the rules behind all new "sorts" like this are that a valid need to sort text differently was identified).
>
> As a rule, Unicode order is not intended to be nor does it explicitly decide to follow any kind of collation rules for code point order.
>
> FWIW, the LCIDs behind these two sorts under Windows 2000 (used in the C CompareString and the VB StrComp) are:
>
> Traditional: 1079 (0x0437)
> Modern: 66615 (0x10437)
>
> michka
>
> - Original Message -
> From: "Sandro Karumidze" <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Cc: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Tuesday, August 08, 2000 3:26 AM
> Subject: Re: is there any way to change already defined character codes?
> > > Dear Chris, > > > > Thank you for your answer. > > > > > May I ask what is the reason these people from the government of Georgia > want > > > to change the codepoints of some Georgian characters? There is probably > another > > > good solution (or solutions) for whatever problem they think would be > solved by > > > changing encoding points. > > > > The issue is that in Unicode there is a sequence of Georgian caracters > different > > from what this people think should be. > > > > In modern Georgian there are 33 widely used characters. However before > there were > > 38 characters. In beginning of this century 5 characters were dropped, > though still > > used in old texts and by language specialists. > > > > In Unicode this 5 characters follow 33. There is a different point of view > that > > those 5 should be included among the ohters. > > > > This is all the issue - there are no specific implementation difficulties > or > > problems. The only point is that 5 among the rest 33 is more "correct". > > > > Best regards, > > > > Sandro Karumidze > > > > > > > > > > > > > > > > Regards > > > > > > - Chris > > > > > > "Sandro Karumidze" <[EMAIL PROTECTED]> wrote: > > > > > > > There are people from the government of Georgia interested in > possibility in > > > > altering Unicode standard it terms of changing codes for some of > Georgian > > > > characters. > > > > > > > Does this type of things happen in Consortium and if yes under what > > > circumstances. > > > > > > > If not can you specify in which rules is it defined that this types of > > > changes are > > > > not allowed.. > > > > > > > Thanks in advance for your support, > > > > > > > Best regards, > > > > > > > Sandro Karumidze > > > >
Re: Unicode String literals on various platforms
Bob Jones wrote:
>
> In a C program, how do you code Unicode string literals on the following
> platforms:
> NT
> Unix (Sun, AIX, HP-UX)
> AS/400

We devised a solution for this problem in the C99 Standard. The "solution" is named "UCN", for Universal Character Name, and is essentially to use the (borrowed from Java) \u notation, like (with Ken's example)

    char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

And similarly

    wchar_t C_thai[] = L"\u0E40...

or

    TCHAR C_thai[] = _T("\u0E40...

depending on your storage option. See below for more.

The benefit is that now, your C program is portable to any platform where the C compiler complies with C99. The drawback is that, nowadays, there are very few such compilers.

> Everything I have read says not to use wchar_t for cross platform apps
> because the size is not uniform, i.e. NT it is an unsigned short (2 bytes)
> while on Unix it is an unsigned int (4 bytes). If you create your own TCHAR
> or whatever, how do you handle string literals?

A similar problem exists with numbers, doesn't it? And the usual solution is to *not* exchange data in internal format, but rather to use textual representations. Agreed?

For a C _program_, where the textual representations are string literals (rather than arrays of integers), C99 UCN is the way to go.

Now, since you are talking of wchar_t vs. other forms of storing characters, I wonder if you are not asking about the problem of the manipulated _data_, as opposed to the C program. Then, I believe the solution is exactly the same as with numbers: internally use whatever is the most appropriate to the current platform (the TCHAR/_T() solution of Microsoft is nice because it conveniently switches to either char or wchar_t depending on compilation options), but when exchanging data, change to a common, textual representation. Look at the %lc and %ls conversions of [w]printf/[w]scanf to learn how to output and input wide characters to and from text files.
Another solution is to use "Unicode" files, using some dedicated conversions, pretty much the same as using the htons(), ntohl(), etc. functions when dealing with low-level Internet protocols. I agree that the C Standard currently lacks a way to indicate that one wants to open a text file using a specific encoding protocol (e.g. UTF-16LE/BE, or UTF-8). And the discussion on this matter has been endless so far.

> On NT L"foobar" gives each character 2 bytes,

Yes

> but on Unix L"foobar" uses 4 bytes per character.

Depends on the compiler. Some are 4 bytes, some are 8 (64-bit boxes), some are even only 8-bit (and are not Unicode compliant).

> Even worse I suspect is the AS/400 where the string literal is probably in
> EBCDIC.

Perhaps (and even probably, as L'a' is required to be equal to 'a' in C), but what is the problem? You are not going to memcpy() L"foobar", or to fwrite() it, are you? And I am sure your AS/400 implementation has some way to specify on open() that a text file is really an "ASCII", rather than EBCDIC, file. Or if it does not, it should...

Regards,

Antoine
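The "dedicated conversion" idea above, in the spirit of htons(), can be sketched for UTF-16BE. This is a minimal illustration with a hypothetical function name: it writes the bytes of one code point in a fixed, big-endian order, so the result does not depend on the width or byte order of wchar_t on the platform that produced it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16BE.
 * Writes 2 or 4 bytes to out and returns the number written, or 0 if
 * cp is a surrogate code point or lies above U+10FFFF. */
size_t put_utf16be(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        /* One 16-bit code unit, most significant byte first. */
        out[0] = (unsigned char)(cp >> 8);
        out[1] = (unsigned char)(cp & 0xFF);
        return 2;
    }
    /* Supplementary code point: split into a surrogate pair. */
    cp -= 0x10000;
    uint32_t hi = 0xD800 + (cp >> 10);
    uint32_t lo = 0xDC00 + (cp & 0x3FF);
    out[0] = (unsigned char)(hi >> 8);
    out[1] = (unsigned char)(hi & 0xFF);
    out[2] = (unsigned char)(lo >> 8);
    out[3] = (unsigned char)(lo & 0xFF);
    return 4;
}
```

For U+0E40, the first character of the Thai example in this thread, the helper produces the bytes 0x0E 0x40 on every platform; the caller would then fwrite() those bytes to a file opened in binary mode.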
Re: is there any way to change already defined character codes?
On Tue, 8 Aug 2000, Sandro Karumidze wrote: > The issue is that in Unicode there is a sequence of Georgian caracters different > from what this people think should be. > > In modern Georgian there are 33 widely used characters. However before there were > 38 characters. In beginning of this century 5 characters were dropped, though still > used in old texts and by language specialists. > > In Unicode this 5 characters follow 33. There is a different point of view that > those 5 should be included among the ohters. > > This is all the issue - there are no specific implementation difficulties or > problems. The only point is that 5 among the rest 33 is more "correct". Ah, OK. The order of characters in the Unicode Standard is *not* meant to be the proper sort order for any language (even English) or relied on for that purpose. If any changes are needed, it is to the Unicode default collating sequence (which I have not checked) and not to the codes for the characters themselves. -- John Cowan [EMAIL PROTECTED] C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux, de rapport nyait pas. -- Jacques Lacan, "L'Etourdit"
Re: is there any way to change already defined character codes?
On Mon, 7 Aug 2000, Jianping Yang wrote: > Not really for Unicode in which we have relocated some codepoints for Hangul > between Unicode 1.1 and 2.0 :) Yes, but NEVER AGAIN. -- John Cowan [EMAIL PROTECTED] C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux, de rapport nyait pas. -- Jacques Lacan, "L'Etourdit"
RE: is there any way to change already defined character codes?
Sandro Karumidze wrote: > The issue is that in Unicode there is a sequence of Georgian > caracters different > from what this people think should be. > [...] In beginning of this century 5 characters were dropped > [...] > In Unicode this 5 characters follow 33. There is a different > point of view that those 5 should be included among the > ohters. (You definitely need an official reply, but let's go on with some more informal chatting.) I foresee that this would not be considered a good reason to change anything. The order of characters in Unicode (or in any other character encoding) is not important. The purpose of a character set is to assign a unique number to each character, not to define an "alphabetical order". If you look, the situation that you describe is true for *all* the alphabets in Unicode. E.g., if you look at the Latin part, you see that the 26 letters used in modern English are all contiguously ordered in two areas: U+0041 to U+005A (uppercase) and U+0061 to U+007A (lowercase). But that's the end of the story! All the other hundreds of Latin letters are scattered all over, in no consistent order. The same is true for Cyrillic, Greek, Hebrew, Arabic, and so on. Have a look at those blocks: the basic letters for post-czar Russian, modern Greek, Israeli Hebrew, modern Arabic etc. are consistently ordered, but the letters for other languages that use the same alphabets (or ancient letters for the same languages) are scattered all over with no specific order. The reason why no one cares about the order of characters is that it is *impossible* to determine a "correct" order. In an alphabet used by more than one language (e.g. Latin, Cyrillic, Arabic, Devanagari, etc.), the alphabetic order is normally different for each language. Moreover, many languages have more than one alphabetic order, all equally valid and in current usage. For this reason the problem of "alphabetic order" has been separated from character sets and addressed on its own.
In Unicode, the issue of "collation" is handled by a dedicated, optional algorithm that is part of the standard but is kept separate from the encoding itself. The algorithm is titled "Unicode Technical Report #10: Unicode Collation Algorithm", and you can find it here: http://www.unicode.org/unicode/reports/tr10/ . *That* is the place to check whether the Georgian letters are in the correct order or not. And if they are not, you have two options: 1) Ask Unicode to change it: here you *do* have some chance of being heard, if you have valid arguments. 2) Change it yourself: unlike the character values, the collation algorithm is designed to be flexible and customizable. Regards, _ Marco
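A rough Python sketch of what "flexible and customizable" can mean in practice: the same three words sorted under two different orderings without touching any character code. The words, the helper names alphabet_key() and strip_marks_key(), and the two orderings are assumptions for illustration (a Swedish-style alphabet in which å, ä, ö follow z, and an ordering that ignores diacritics so that ä files with a). This is not the Unicode Collation Algorithm itself.

```python
import unicodedata

def alphabet_key(alphabet: str):
    """Build a sort key from an explicit alphabet (hypothetical helper)."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word: str):
        # Letters missing from the alphabet sort after all known letters.
        return [rank.get(ch, len(alphabet)) for ch in word.lower()]

    return key

def strip_marks_key(word: str) -> str:
    """Sort key that ignores diacritics, so U+00E4 is compared as 'a'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# U+00E5, U+00E4, U+00F6 (a-ring, a-diaeresis, o-diaeresis) placed after z.
swedish_like_alphabet = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

words = ["zon", "\u00e4rlig", "bok"]
swedish_like = sorted(words, key=alphabet_key(swedish_like_alphabet))
marks_ignored = sorted(words, key=strip_marks_key)

print(swedish_like)    # ['bok', 'zon', 'ärlig']
print(marks_ignored)   # ['ärlig', 'bok', 'zon']
```

Both results are "correct" for some reader, which is why the ordering lives in a tailorable table rather than in the code points.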
Re: is there any way to change already defined character codes?
Sandro, Are you basically wanting the ordering to be different? Unicode does not give any expressed or implied warranty that the ordering of characters will be anything like what a user would expect (how can it, when so many languages that use the same scripts have entirely different, occasionally conflicting, collation rules?). It is up to the software to make the necessary collation rules happen. For example, in Windows 2000 there are two different sorts supported for Georgian: "modern" and "traditional." The difference is that the modern sort has four letters (He, Hie, We, and Har, both Capital and Small) sorting at the end of the alphabet (which I presume corresponds to the sort that you do not like?), while the traditional sort has: * He appearing between Zen and Tan * Hie appearing between Nar and On * We appearing between Un and Phar * Har appearing between Xan and Jhan I presume the above "exceptions" more closely match the sort you would expect? And if there are more, this would be very valuable information (as the rule behind all new "sorts" like this is that a valid need to sort text differently was identified). As a rule, Unicode code point order is not intended to follow any particular collation rules. FWIW, the LCIDs behind these two sorts under Windows 2000 (used in the C CompareString and the VB StrComp) are: Traditional: 1079 (0x0437) Modern: 66615 (0x10437) michka - Original Message - From: "Sandro Karumidze" <[EMAIL PROTECTED]> To: "Unicode List" <[EMAIL PROTECTED]> Cc: "Unicode List" <[EMAIL PROTECTED]> Sent: Tuesday, August 08, 2000 3:26 AM Subject: Re: is there any way to change already defined character codes? > Dear Chris, > > Thank you for your answer. > > > May I ask what is the reason these people from the government of Georgia want > > to change the codepoints of some Georgian characters? 
There is probably another > > good solution (or solutions) for whatever problem they think would be solved by > > changing encoding points. > > The issue is that in Unicode there is a sequence of Georgian caracters different > from what this people think should be. > > In modern Georgian there are 33 widely used characters. However before there were > 38 characters. In beginning of this century 5 characters were dropped, though still > used in old texts and by language specialists. > > In Unicode this 5 characters follow 33. There is a different point of view that > those 5 should be included among the ohters. > > This is all the issue - there are no specific implementation difficulties or > problems. The only point is that 5 among the rest 33 is more "correct". > > Best regards, > > Sandro Karumidze > > > > > > > > > Regards > > > > - Chris > > > > "Sandro Karumidze" <[EMAIL PROTECTED]> wrote: > > > > > There are people from the government of Georgia interested in possibility in > > > altering Unicode standard it terms of changing codes for some of Georgian > > > characters. > > > > > Does this type of things happen in Consortium and if yes under what > > circumstances. > > > > > If not can you specify in which rules is it defined that this types of > > changes are > > > not allowed.. > > > > > Thanks in advance for your support, > > > > > Best regards, > > > > > Sandro Karumidze > >
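The two Georgian orderings described in the reply above can be sketched in Python as two rank tables over the same, unchanged code points. This is an illustration built only from the four placements listed in that message, and it covers just the letters U+10D0..U+10F4; it is not the Windows 2000 implementation, and the helper names letter() and make_key() are hypothetical.

```python
import unicodedata

def letter(name: str) -> str:
    """Look up a Georgian letter by its Unicode character name."""
    return unicodedata.lookup("GEORGIAN LETTER " + name)

# Code point order AN..HAR; He, Hie, We and Har come last, as in the
# "modern" sort described in the message.
modern = [chr(cp) for cp in range(0x10D0, 0x10F5)]

# (letter to move, letter it should follow), as listed in the message.
placements = [("HE", "ZEN"), ("HIE", "NAR"), ("WE", "UN"), ("HAR", "XAN")]

moved = {letter(name) for name, _ in placements}
traditional = [ch for ch in modern if ch not in moved]
for name, previous in placements:
    traditional.insert(traditional.index(letter(previous)) + 1, letter(name))

def make_key(order):
    """Turn an ordered list of letters into a sort key function."""
    rank = {ch: i for i, ch in enumerate(order)}
    return lambda text: [rank.get(ch, len(order)) for ch in text]

sample = [letter("HE"), letter("TAN"), letter("ZEN")]
modern_sorted = sorted(sample, key=make_key(modern))
traditional_sorted = sorted(sample, key=make_key(traditional))

print([unicodedata.name(ch) for ch in modern_sorted])
print([unicodedata.name(ch) for ch in traditional_sorted])
```

The character codes are identical in both runs; only the rank table differs, which is the kind of change that does not require altering the standard.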
Re: is there any way to change already defined character codes?
Dear Chris, Thank you for your answer. > May I ask what is the reason these people from the government of Georgia want > to change the codepoints of some Georgian characters? There is probably another > good solution (or solutions) for whatever problem they think would be solved by > changing encoding points. The issue is that in Unicode the sequence of Georgian characters is different from what these people think it should be. In modern Georgian there are 33 widely used characters. However, there used to be 38 characters. At the beginning of this century 5 characters were dropped, though they are still used in old texts and by language specialists. In Unicode these 5 characters follow the 33. There is a different point of view that those 5 should be placed among the others. That is the whole issue - there are no specific implementation difficulties or problems. The only point is that having the 5 among the other 33 is more "correct". Best regards, Sandro Karumidze > > Regards > > - Chris > > "Sandro Karumidze" <[EMAIL PROTECTED]> wrote: > > > There are people from the government of Georgia interested in possibility in > > altering Unicode standard it terms of changing codes for some of Georgian > > characters. > > > Does this type of things happen in Consortium and if yes under what > circumstances. > > > If not can you specify in which rules is it defined that this types of > changes are > > not allowed.. > > > Thanks in advance for your support, > > > Best regards, > > > Sandro Karumidze
(no subject)
I have an application that doesn't include Unicode support at all. Given this, can I use the Uniscribe APIs in my application? The system on which I want to run my application is Windows 98. Specifically, is there any relationship between the Uniscribe APIs and Unicode, and if yes, what exactly is it? Thanks C.Janardhana Guptha Quark, Chandigarh
Re: FW: Unicode - Exponent and indication sign
Yes. Try the middle of the "20__" range of characters. -- Robert Lozyniak Accusplit pedometer manufactures can go suck eggs My page: http://walk.to/11 [EMAIL PROTECTED] - email (917) 421-3909 x1133 - voicemail/fax "Magda Danish (Unicode)" <[EMAIL PROTECTED]> wrote: > > -Original Message- > From: Marchand, Gilles [mailto:[EMAIL PROTECTED]] > Sent: Monday, August 07, 2000 6:33 AM > To: '[EMAIL PROTECTED]' > Subject: Unicode - Exponent and indication sign > > > > > > > Hello, > > > we plan to use the ISO LATIN 8859-1 > as our default caracter > set. A question from a user was: does it support >exponentiation N2, or > the indication sign O4 ? If so where can I find > the how to use method? > > > > thank you for listeningn to me. > > > > Gilles Marchand > UQAM - Library system > [EMAIL PROTECTED]
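A small Python check of which of these characters ISO 8859-1 can carry, reading "N2" as N with a superscript two and "O4" as O with a subscript four (that reading is an assumption). The helper name fits_latin1() is hypothetical. SUPERSCRIPT TWO is U+00B2, inside Latin-1; SUBSCRIPT FOUR is U+2084, in the U+2070..U+209F superscripts and subscripts block, which is presumably the "20__" range meant in the reply.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of `text` exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

exponent = "N\u00b2"   # N + SUPERSCRIPT TWO (U+00B2)
index = "O\u2084"      # O + SUBSCRIPT FOUR (U+2084)

print(fits_latin1(exponent))   # True
print(fits_latin1(index))      # False
```

So a Latin-1-only system can store the exponent form directly, while the subscript form needs Unicode or markup.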
Re: is there any way to change already defined character codes?
Not really for Unicode in which we have relocated some codepoints for Hangul between Unicode 1.1 and 2.0 :) Regards, Jianping. "Christopher J. Fynn" wrote: > Sandro > > I'm sure someone official will give you an official answer, but I know the only > answer you are going to get to your question is NO - there is no way to change > the encoding point of a character (or to change a character name) once it is in > the Unicode or ISO 10646 standards. Allowing changes like this would break > existing implementations of these standards - and of course these standards > would be useless as standards if they were subject to that kind of change. > > Proposals to encode new characters in the Unicode and ISO 10646 standards have > to go through a lengthy process of consideration and there is ample opportunity > to submit comments on any proposal during that process. However once characters > are finally assigned code points in the Unicode and ISO 10646 standards that's > it. > > May I ask what is the reason these people from the government of Georgia want > to change the codepoints of some Georgian characters? There is probably another > good solution (or solutions) for whatever problem they think would be solved by > changing encoding points. > > Regards > > - Chris > > "Sandro Karumidze" <[EMAIL PROTECTED]> wrote: > > > There are people from the government of Georgia interested in possibility in > > altering Unicode standard it terms of changing codes for some of Georgian > > characters. > > > Does this type of things happen in Consortium and if yes under what > circumstances. > > > If not can you specify in which rules is it defined that this types of > changes are > > not allowed.. > > > Thanks in advance for your support, > > > Best regards, > > > Sandro Karumidze