Re: is there any way to change already defined character codes?
Not really for Unicode, in which we have relocated some codepoints for Hangul between Unicode 1.1 and 2.0 :)

Regards,
Jianping.

"Christopher J. Fynn" wrote:

> Sandro
>
> I'm sure someone official will give you an official answer, but I know the only answer you are going to get to your question is NO - there is no way to change the encoding point of a character (or to change a character name) once it is in the Unicode or ISO 10646 standards.
>
> Allowing changes like this would break existing implementations of these standards - and of course these standards would be useless as standards if they were subject to that kind of change.
>
> Proposals to encode new characters in the Unicode and ISO 10646 standards have to go through a lengthy process of consideration and there is ample opportunity to submit comments on any proposal during that process. However, once characters are finally assigned code points in the Unicode and ISO 10646 standards, that's it.
>
> May I ask what is the reason these people from the government of Georgia want to change the codepoints of some Georgian characters? There is probably another good solution (or solutions) for whatever problem they think would be solved by changing encoding points.
>
> Regards
> - Chris
>
> "Sandro Karumidze" [EMAIL PROTECTED] wrote:
>
> > There are people from the government of Georgia interested in the possibility of altering the Unicode standard in terms of changing codes for some of the Georgian characters. Does this type of thing happen in the Consortium and, if yes, under what circumstances? If not, can you specify in which rules it is defined that these types of changes are not allowed?
> >
> > Thanks in advance for your support,
> > Best regards,
> > Sandro Karumidze
(no subject)
I have an application that doesn't include Unicode support at all. Considering this, can I use the Uniscribe APIs in my application? The system on which I want to run my application is Windows 98. Specifically, is there any relationship between the Uniscribe APIs and Unicode, and if yes, what exactly is it? Thanks, C. Janardhana Guptha, Quark, Chandigarh
Re: is there any way to change already defined character codes?
Sandro,

Are you basically wanting the ordering to be different? Unicode does not have any expressed or implied warranty that the ordering of characters will be anything like what a user would expect. (How can it, when even so many languages that use the same scripts have entirely different, occasionally conflicting, collation rules?) It is up to the software to make the necessary collation rules happen.

For example, in Windows 2000 there are two different sorts supported for Georgian: "modern" and "traditional." The difference is that modern has four letters (He, Hie, We, and Har, both Capital and Small) sort at the end of the alphabet (which I presume corresponds to the sort that you do not like?), while the traditional sort has:

* He appearing between Zen and Tan
* Hie appearing between Nar and On
* We appearing between Un and Phar
* Har appearing between Xan and Jhan

I presume the above "exceptions" more closely match the sort you would expect? And if there are more, this would be very valuable information (as the rule behind all new "sorts" like this is that a valid need to sort text differently was identified).

As a rule, Unicode code point order is not intended to follow any kind of collation rules, nor does it explicitly try to.

FWIW, the LCIDs behind these two sorts under Windows 2000 (used in the C CompareString and the VB StrComp) are:

Traditional: 1079 (0x0437)
Modern: 66615 (0x10437)

michka

----- Original Message -----
From: "Sandro Karumidze" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Cc: "Unicode List" [EMAIL PROTECTED]
Sent: Tuesday, August 08, 2000 3:26 AM
Subject: Re: is there any way to change already defined character codes?

> Dear Chris,
>
> Thank you for your answer.
>
> > May I ask what is the reason these people from the government of Georgia want to change the codepoints of some Georgian characters? There is probably another good solution (or solutions) for whatever problem they think would be solved by changing encoding points.
>
> The issue is that in Unicode the sequence of Georgian characters is different from what these people think it should be. In modern Georgian there are 33 widely used characters. However, before there were 38 characters. At the beginning of this century 5 characters were dropped, though they are still used in old texts and by language specialists. In Unicode these 5 characters follow the 33. There is a different point of view that those 5 should be included among the others. This is the whole issue - there are no specific implementation difficulties or problems. The only point is that having the 5 among the other 33 is more "correct".
>
> Best regards,
> Sandro Karumidze
>
> > Regards
> > - Chris
> >
> > "Sandro Karumidze" [EMAIL PROTECTED] wrote:
> >
> > > There are people from the government of Georgia interested in the possibility of altering the Unicode standard in terms of changing codes for some of the Georgian characters. Does this type of thing happen in the Consortium and, if yes, under what circumstances? If not, can you specify in which rules it is defined that these types of changes are not allowed?
> > >
> > > Thanks in advance for your support,
> > > Best regards,
> > > Sandro Karumidze
Re: is there any way to change already defined character codes?
On Mon, 7 Aug 2000, Jianping Yang wrote: Not really for Unicode in which we have relocated some codepoints for Hangul between Unicode 1.1 and 2.0 :) Yes, but NEVER AGAIN. -- John Cowan [EMAIL PROTECTED] C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux, de rapport nyait pas. -- Jacques Lacan, "L'Etourdit"
Re: is there any way to change already defined character codes?
On Tue, 8 Aug 2000, Sandro Karumidze wrote:

> The issue is that in Unicode the sequence of Georgian characters is different from what these people think it should be. In modern Georgian there are 33 widely used characters. However, before there were 38 characters. At the beginning of this century 5 characters were dropped, though they are still used in old texts and by language specialists. In Unicode these 5 characters follow the 33. There is a different point of view that those 5 should be included among the others. This is the whole issue - there are no specific implementation difficulties or problems. The only point is that having the 5 among the other 33 is more "correct".

Ah, OK. The order of characters in the Unicode Standard is *not* meant to be the proper sort order for any language (even English), nor should it be relied on for that purpose. If any changes are needed, they are to the Unicode default collating sequence (which I have not checked) and not to the codes for the characters themselves.

-- John Cowan [EMAIL PROTECTED] C'est la` pourtant que se livre le sens du dire, de ce que, s'y conjuguant le nyania qui bruit des sexes en compagnie, il supplee a ce qu'entre eux, de rapport nyait pas. -- Jacques Lacan, "L'Etourdit"
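The point that sort order lives in software rather than in the code points can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names are hypothetical, only the Mkhedruli letters U+10D0..U+10F5 are handled, and the placement of HE, HIE, WE and HAR is the "traditional" one described earlier in this thread, not a statement of what any real collation library does. Doubling the code point offset leaves a free slot after every letter, so an archaic letter can be slotted in right after its predecessor without moving any code point.

```c
/* Sort weight for a Georgian Mkhedruli letter (U+10D0..U+10F5 only).
   Hypothetical helper: ordinary letters get an even weight derived from
   their code point, and the four archaic letters get the odd weight that
   follows the letter they traditionally come after. */
unsigned traditional_weight(unsigned long cp)
{
    switch (cp) {
    case 0x10F1UL: return (unsigned)((0x10D6UL - 0x10D0UL) * 2 + 1); /* HE after ZEN  */
    case 0x10F2UL: return (unsigned)((0x10DCUL - 0x10D0UL) * 2 + 1); /* HIE after NAR */
    case 0x10F3UL: return (unsigned)((0x10E3UL - 0x10D0UL) * 2 + 1); /* WE after UN   */
    case 0x10F4UL: return (unsigned)((0x10EEUL - 0x10D0UL) * 2 + 1); /* HAR after XAN */
    default:       return (unsigned)((cp - 0x10D0UL) * 2);
    }
}

/* Three-way comparison by traditional weight: negative, zero or positive. */
int compare_traditional(unsigned long a, unsigned long b)
{
    unsigned wa = traditional_weight(a);
    unsigned wb = traditional_weight(b);
    return (wa > wb) - (wa < wb);
}
```

With this table HE (U+10F1) compares before TAN (U+10D7) even though its code point is larger, which is the whole point: the character codes stay where they are and only the comparison changes.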
Re: Unicode String literals on various platforms
Bob Jones wrote:

> In a C program, how do you code Unicode string literals on the following platforms: NT, Unix (Sun, AIX, HP-UX), AS/400

We devised a solution for this problem in the C99 Standard. The "solution" is named "UCN", for Universal Character Name, and is essentially to use the (borrowed from Java) \u notation, like (with Ken's example)

    char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";

And similarly

    wchar_t C_thai[] = L"\u0E40...

or

    TCHAR C_thai[] = _T("\u0E40...

depending on your storage option. See below for more.

The benefit is that now your C program is portable to any platform where the C compiler conforms to C99. The drawback is that, nowadays, there are very few such compilers.

> Everything I have read says not to use wchar_t for cross platform apps because the size is not uniform, i.e. NT it is an unsigned short (2 bytes) while on Unix it is an unsigned int (4 bytes). If you create your own TCHAR or whatever, how do you handle string literals?

A similar problem exists with numbers, doesn't it? And the usual solution is to *not* exchange data in internal format, but rather to use textual representations. Agreed?

For a C _program_, where the textual representations are string literals (rather than arrays of integers), C99 UCN is the way to go.

Now, since you are talking of wchar_t vs. other forms of storing characters, I wonder if you are not asking about the problem of the manipulated _data_, as opposed to the C program. Then, I believe the solution is exactly the same as with numbers: internally use whatever is the most appropriate to the current platform (the TCHAR/_T() solution of Microsoft is nice because it conveniently switches to either char or wchar_t depending on compilation options), but when exchanging data, change to a common, textual representation. Look at the %lc and %ls conversions of [w]printf/[w]scanf to learn how to output/input wide characters to/from text files.

Another solution is to use "Unicode" files, using some dedicated conversions, pretty much the same as using the htons(), ntohl(), etc. functions when dealing with low-level Internet protocols. I agree that the C Standard currently lacks a way to indicate that one wants to open a text file using a specific encoding scheme (e.g. UTF-16LE/BE, or UTF-8). And the discussions on this matter have been endless so far.

> On NT L"foobar" gives each character 2 bytes,

Yes

> but on Unix L"foobar" uses 4 bytes per character.

Depends on the compiler. Some are 4 bytes, some are 8 (64-bit boxes), some are even only 8-bit (and are not Unicode compliant).

> Even worse I suspect is the AS/400 where the string literal is probably in EBCDIC.

Perhaps (and even probably, as L'a' is required to be equal to 'a' in C), but what is the problem? You are not going to memcpy() L"foobar", or to fwrite() it, are you? And I am sure your AS/400 implementation has some way to specify on open() that a text file is really an "ASCII", rather than EBCDIC, file. Or if it does not, it should...

Regards,
Antoine
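The comparison with htons()/ntohl() can be made concrete with a small sketch. It is not from the original post: put_utf16be is a hypothetical helper name, and the idea is simply to serialize each code point byte by byte into UTF-16BE, so that the result is the same whether wchar_t is 2 bytes, 4 bytes, or something else on the machine that writes the file.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-16BE into out[0..3].
   Returns the number of bytes written (2 or 4), or 0 if cp is not a
   valid scalar value (a surrogate, or above U+10FFFF).
   Hypothetical helper for illustration only. */
size_t put_utf16be(unsigned long cp, unsigned char out[4])
{
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;

    if (cp < 0x10000UL) {
        /* BMP character: one 16-bit unit, high byte first. */
        out[0] = (unsigned char)(cp >> 8);
        out[1] = (unsigned char)(cp & 0xFFUL);
        return 2;
    } else {
        /* Supplementary character: split into a surrogate pair. */
        unsigned long v  = cp - 0x10000UL;
        unsigned long hi = 0xD800UL + (v >> 10);
        unsigned long lo = 0xDC00UL + (v & 0x3FFUL);
        out[0] = (unsigned char)(hi >> 8);
        out[1] = (unsigned char)(hi & 0xFFUL);
        out[2] = (unsigned char)(lo >> 8);
        out[3] = (unsigned char)(lo & 0xFFUL);
        return 4;
    }
}
```

A caller would loop over its internal characters, whatever their type, and fwrite() the bytes produced here instead of the in-memory array. This assumes the internal values really are Unicode code points, which, as discussed in this thread, is not guaranteed for wchar_t on every platform.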
RE: is there any way to change already defined character codes?
On 08/08/2000 06:40:17 AM Marco.Cimarosti wrote: (You definitely need an official reply, but let's go on with some more informal chatting.) All the "officials" are busy meeting this week, but the statement, "Can't be done" is just as true whether it comes from the lips (or... fingertips) of a Ken Whistler or Mark Davis as from a Marco Cimarosti or a Chris Fynn. There are enough of us on this list that have a solid understanding of the standard and its development that a question like this can be answered without waiting for an "official" answer (though this question really ought to be answered somewhere on the Unicode web site); if somebody were to give wrong information, there would be several that wouldn't hesitate to correct. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: [EMAIL PROTECTED]
RE: Summary: xml:lang validity and RFC 1766 refs to outdated code
> > XML 1.0 says that xml:lang attributes must match production 33
>
> In fact, not so. Productions 33-38 have no normative value whatsoever, as there is neither a production nor normative language connecting them with the rest of XML 1.0. [...] In recognition of this fact, official erratum E73 (at http://www.w3.org/XML/xml-19980210-errata#E73) removes these productions from XML 1.0 altogether. It also allows for a successor to RFC 1766 when and if such a thing exists.

Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and ISO 3166, at least not by a strict interpretation of its formal language. And to date, there still is no successor to RFC 1766.

E73 says in its rationale "The XML processor does not deal with the value of xml:lang", but it also says, more formally, "The values of the attribute are language identifiers as defined by [IETF RFC 1766]". The use of "are" in that statement sounds as definitive as "must" to me.

As an XML document author, or the programmer of an XML document authoring tool, tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang values? It seems that XML says I must use them, but it would not be a violation of validity if I didn't use them.

I also don't see how one could read RFC 1766 in such a way as to ignore its prescription of a finite range of possible values for what it calls a language tag:

    Language-Tag = Primary-tag *( "-" Subtag )
    Primary-tag = 1*8ALPHA
    Subtag = 1*8ALPHA

    In the primary language tag:
    - All 2-letter tags are interpreted according to ISO standard 639, "Code for the representation of names of languages" [ISO 639].
    [...mention of "i-" and "x-"...]
    - Other values cannot be assigned except by updating this standard.

...so the removal of productions 33-38 from XML really just seems to be intended to allow RFC 1766 and its successors to determine the proper construction of a language tag, which makes more sense than trying to reiterate the RFC's technical contents in XML's specification.

It doesn't necessarily follow that xml:lang values can avoid conforming to RFC 1766.

[We're on the same side, here. I'm just playing devil's advocate, because after I heard about this issue and reviewed the specs myself, I found that there were indeed points of contention.]

-Mike
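The grammar quoted above from RFC 1766 is small enough to check mechanically. The following is a minimal sketch in C, with a hypothetical function name, that tests only the syntax (one to eight letters per component, components separated by "-"). It deliberately does not check whether a primary tag is a registered ISO 639 code, an "i-" registration or an "x-" private tag, and that second question is exactly what this thread is arguing about. It also assumes an ASCII-compatible execution character set, since it compares against 'A'..'Z' and 'a'..'z' directly.

```c
#include <stddef.h>

/* ALPHA in the RFC 1766 grammar: an ASCII letter (assumes ASCII ordering). */
static int is_ascii_alpha(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Returns 1 if tag matches
       Language-Tag = Primary-tag *( "-" Subtag )
   with Primary-tag and Subtag each being 1*8ALPHA, otherwise 0.
   Syntax only: no lookup against ISO 639, ISO 3166 or any registry. */
int rfc1766_syntax_ok(const char *tag)
{
    size_t run = 0; /* letters seen in the current component */

    if (tag == NULL)
        return 0;

    for (; *tag != '\0'; ++tag) {
        if (*tag == '-') {
            if (run == 0)
                return 0;      /* empty primary tag or empty subtag */
            run = 0;
        } else if (is_ascii_alpha(*tag)) {
            if (++run > 8)
                return 0;      /* component longer than 8 letters */
        } else {
            return 0;          /* digits, '_' and so on are not ALPHA */
        }
    }
    return run > 0;            /* rejects "" and a trailing '-' */
}
```

Note that a tag such as "roa" passes this check while still not being a legal RFC 1766 tag, because three-letter primary tags cannot be assigned without updating the RFC; a syntax check alone cannot settle that.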
Re: Zero-width ligator
Peter Constable [EMAIL PROTECTED] wrote: I inquired about that recently on the unicoRe list, and was told that the semantics of ZWJ/ZWNJ will be extended in 3.0.1 (or maybe it was 3.1). Well, that's a good thing. It sounds like the benefits described by Everson will be made available in Unicode after all. You mentioned that this decision was made at the meeting in February. Interestingly, I was at that meeting, and my recollection was that extending the semantics of ZWJ/ZWNJ was going to be given further consideration, after some people investigated the implications of extending the semantics of ZWJ, particularly for Indic scripts. But I left before the meeting was over, and the minutes reflect that a decision was in fact made (although the weasel word "provisionally" is used). Thanks for the insight on this process. Somehow I needed more information than the word "rejected" in the Pipeline table could offer. \u263a -Doug Ewell Fullerton, California
RE: Unicode String literals on various
Antoine Leca wrote: char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33"; Would the Unicode values be converted to the local SBCS/MBCS character set? If yes: Is the definition of this locale info part of the C99 standard itself, or is it operating system's locale? And what happens to Unicode values that cannot be converted in that character set? Thanks. _ Marco
FW:Unicode Font with Special Effects
-----Original Message----- From: Greg Olsen [mailto:[EMAIL PROTECTED]] Sent: Friday, August 04, 2000 3:14 AM To: [EMAIL PROTECTED] Subject: Font question Dear Sirs, My name is Greg Olsen. I am an Industrial Designer in Irvine, California and I need information. I am developing a user interface that I will hand off to be programmed in C. The interface's design uses the Arial font; the catch is that there is a beveled effect on each letter. I was wondering if there is a Unicode font that is capable of this type of effect. The interface is for a medical product that is to be released in numerous countries and languages. Any information would be helpful. Thank you for your time and response, Greg Olsen Patton Design 8 Pasteur #170 Irvine, CA 92618 [EMAIL PROTECTED]
Re: is there any way to change already defined character codes?
At 11:01 PM -0800 8/7/00, Jianping Yang wrote: Not really for Unicode in which we have relocated some codepoints for Hangul between Unicode 1.1 and 2.0 :) And have regretted it ever since. Moving the Hangul and renaming æ have caused no end of problems. It was the fact that it was so disastrous when done once that makes everyone determined not to do it again. -- = John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://www.blueneptune.com/~tseng
Re: Unicode String literals on various
[EMAIL PROTECTED] wrote:

> Antoine Leca wrote:
> > char C_thai[] = "\u0E40\u0E02\u0E17\u0E32\u0E49\u0E1B\u0E07\u0E1C\u0E33";
>
> Would the Unicode values be converted to the local SBCS/MBCS character set?

In this case, yes (assuming a normal C compiler).

With wchar_t / L"...", they are converted to the local "wide character set", which happens to be Unicode on most boxes, with the following main exceptions:

- some (cheap) C compilers do not have any special support for wchar_t, so it defaults to the same type as char, and is usually 8 bits wide;
- with East Asian C compilers, wchar_t is either Unicode or a flat character coding, in which every character, whether coded as SBCS or DBCS, is stored with its nominal legacy code in a 16-bit or 32-bit cell (this differs from MBCS in that ASCII characters are stored in cells the same width as DBCS characters);
- EBCDIC implementations have their own rules (for obvious reasons), which I do not know exactly (I am not sure they are consistent).

C99 also specifies that if __STDC_ISO_10646__ is defined, then the wchar_t values are the Unicode codepoints (then, to learn if it is UTF-16 or UTF-32, one should look at WCHAR_MAX to learn if wchar_t is 16-bit or 32-bit).

> If yes: Is the definition of this locale info part of the C99 standard itself, or is it operating system's locale?

It is "implementation-defined". Which means:

- it is not required in any way by the C99 Standard itself (except if __STDC_ISO_10646__ is defined);
- it is required to be stated in full words in the documentation for the compiler;
- it can vary as per compilation options; often the OS's current locale is the default value, which can be overridden.

> And what happens to Unicode values that cannot be converted in that character set?

The compiler is required to fall back to something (it cannot refuse to compile, nor can it simply drop the character); it is allowed to "fall back" to a different character depending on the typed character, though. So, for example,

    #include <stdio.h>

    int main(void)
    {
        printf("%ls\n", L"\u00C0 table!");
        return 0;
    }

can produce (among others; this is UTF-8 encoded):

    À table!
    A table!
    à table!
    table!

I can continue to expand on this subject (all of this should finally be cooked into a FAQ anyway), but I do not want to flood the list with a marginally interesting subject.

Antoine
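The __STDC_ISO_10646__ and WCHAR_MAX test described above can be written down directly. This is a minimal sketch with a hypothetical function name; the strings it returns are only labels chosen for this illustration, and which branch is taken depends entirely on the compiler and its options, so no particular result should be expected.

```c
#include <stdint.h> /* WCHAR_MAX (C99) */
#include <wchar.h>

/* Describe, at compile time, what this implementation promises about
   wchar_t values. Hypothetical helper for illustration only. */
const char *wchar_model(void)
{
#ifdef __STDC_ISO_10646__
#  if WCHAR_MAX >= 0x10FFFF
    /* Wide enough to hold any code point in one wchar_t. */
    return "ISO 10646 values, one wchar_t per code point";
#  else
    /* Typically a 16-bit wchar_t. */
    return "ISO 10646 values, wchar_t too narrow for code points above its range";
#  endif
#else
    /* The wide character set is whatever the implementation documents. */
    return "implementation-defined wide character set";
#endif
}
```

Printing the result of wchar_model() at program start is a cheap way to find out which of the cases listed in the message applies to a given build.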
Re: Summary: xml:lang validity and RFC 1766 refs to outdated codes
Mike Brown wrote:

> Correct, but RFC 1766 doesn't, in turn, allow for successors to ISO 639 and ISO 3166, at least not by a strict interpretation of its formal language. And to date, there still is no successor to RFC 1766.

Right. So

    <span xml:lang="roa">Yn nediwn seint yn llinghedig, yn nediwn seint yn cor</span>

is not proper XML, although it is well-formed, because the language tag "roa" (Romance, Other) is not legal by RFC 1766. But when RFC 1766 is officially revised to include such language tags, it *will* be good XML.

> The use of "are" in that statement sounds as definitive as "must" to me.

No, because a violation of a "must" rule is a violation of well-formedness, requiring the report of a fatal error and draconian error recovery.

> As an XML document author, or the programmer of an XML document authoring tool, tell me, do I or do I not use RFC 1766 language tags/identifiers as xml:lang values?

You do.

> It seems that XML says I must use them, but it would not be a violation of validity if I didn't use them.

It is a violation of the intent of the xml:lang attribute not to use them.

> ...so the removal of productions 33-38 from XML really just seems to be intended to allow RFC 1766 and its successors to determine the proper construction of a language tag, which makes more sense than trying to reiterate the RFC's technical contents in XML's specification.

Just so.

> It doesn't necessarily follow that xml:lang values can avoid conforming to RFC 1766.

They cannot avoid it.

--
Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.      -- Coleridge (tr. Politzer)
RE: Unicode String literals on various
Hi, Antoine. I can continue to dissert on this subject (all of this should finally be cooked in a FAQ anyway), but I do not want to flood the list with a marginally interesting subject. Merci beaucoup. It was very informative! Ciao. Marco P.S. You should not be so shy: up-to-date information about how Unicode may be used in the world's most important programming language does not sound so "off topic" or "marginally interesting" to me. Ciao++ M.
GEORGIAN DIGITs
Where are the Georgian digits? I want a set of Georgian digits so I can use them as counter digits. -- Robert Lozyniak Accusplit pedometer manufactures can go suck eggs My page: http://walk.to/11 [EMAIL PROTECTED] - email (917) 421-3909 x1133 - voicemail/fax
Re: GEORGIAN DIGITs
Well, if the language does not have them, you will not find them. Funny how that works, huh? michka Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/ ----- Original Message ----- From: [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Tuesday, August 08, 2000 4:24 PM Subject: GEORGIAN DIGITs Where are the Georgian digits? I want a set of Georgian digits so I can use them as counter digits. -- Robert Lozyniak Accusplit pedometer manufactures can go suck eggs My page: http://walk.to/11 [EMAIL PROTECTED] - email (917) 421-3909 x1133 - voicemail/fax
Re: is there any way to change already defined character codes?
From: [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: E.g., if you look at the Latin part, you see that the 26 letters used in modern English are all contiguously ordered in two areas: U0041 to U005A (uppercase) and U0061 to U007A (lowercase). Yeah, but so what? All you gotta do is turn the 6th bit off and there you go! But that's the end of the story! All the other 100's Latin letters are scattered all over, using no consistent order. Too bad unicode values can't be fractions!! Lets take this one offline, Robert. michka
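For what it is worth, the "turn the 6th bit off" remark refers to the bit with value 0x20, which is the only difference between an ASCII lowercase letter and its uppercase form. A minimal sketch in C (hypothetical function name) shows the trick and why it has to be restricted to U+0061..U+007A: applied blindly to U+00FF (y with diaeresis) it would yield U+00DF (sharp s), whereas the real uppercase is U+0178, far away in another block.

```c
/* Uppercase an ASCII lowercase letter by clearing bit 0x20.
   Any other code point is returned unchanged, because the bit trick
   only holds for the 26 basic Latin letters. Hypothetical helper. */
unsigned long ascii_upper(unsigned long cp)
{
    if (cp >= 0x61UL && cp <= 0x7AUL)
        return cp & ~0x20UL;
    return cp;
}
```

So the contiguous layout of the 26 basic letters is a convenience inherited from ASCII, not a property that the rest of the Latin letters in Unicode share.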
Why not to move characters (was: is there any way to change already defined character codes?)
You don't want to move characters because then you could change the meaning of a sentence that way. I don't want to price something at 1000 cows when I mean 1000 yen. Or worse, 100 yen.