RE: UNICODE BOMBER STRIKES AGAIN
You can determine that that particular text is not legal UTF-32*, since there are illegal code points in any of the three forms. If you exclude null code points, again heuristically, that also excludes UTF-8, and almost all non-Unicode encodings. That leaves UTF-16, UTF-16BE, and UTF-16LE as the only remaining possibilities. So look at those: 1. In UTF-16LE, the text is the perfectly legal Ken. 2. In UTF-16BE or UTF-16, the text is the perfectly legal 䬀攀渀. Thus there are two legal interpretations of the text, if the only thing you know is that it is untagged. If you have some additional information, such as that it could not be UTF-16LE, then you can limit it further. Actually, I also think that without any external information about the encoding except that it is some form of UTF-16, it *has to* be interpreted as being most significant byte first. I agree that it could be either UTF-16LE or UTF-16BE/UTF-16, but in the absence of any other information, at this point in time, it is ruled by the text of 3.1 C3 of TUS 3.0 and the reader has no choice but to declare it UTF-16. Now what about auto-detection in relation to this conformance clause? Readers that first try to be smart by auto-detecting encodings could of course pick any of these as the 'auto-detected' one. Does that violate 3.1 C3's interpretation of bytes? I would say that as long as the auto-detector is seen as a separate process/step, one can get away with it, since by the time you look at the bytes to process the data, their encoding has been set by the auto-detector. YA
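The separate auto-detection step mentioned at the end could be sketched as a toy heuristic: check for a BOM, then count which byte offsets carry the NULs (which works for mostly-Latin text like the Ken example), and fall back to big-endian per C3. This is only an illustration, not something the thread or the standard prescribes.

```python
def guess_utf16_order(data: bytes) -> str:
    """Toy UTF-16 byte-order detector for mostly-Latin text."""
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    # For Latin-range characters the high (zero) byte of each code unit
    # lands on even offsets in big-endian data and odd offsets otherwise.
    even = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
    odd = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
    if odd > even:
        return 'utf-16-le'
    return 'utf-16-be'   # ties and no-NUL data: big-endian, per C3
```

Exactly as discussed, once this step has run, the bytes are processed under the detected encoding.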
RE: browsers and unicode surrogates
| I am surprised by the must only be used. It seems I am not | conforming by including a meta statement in the utf-16 HTML page. I | should either remove the statement or encode the HTML up to and | including that statement as ascii. I'll check on this. It doesn't make much sense to have the meta statement there, as I would expect most browsers to assume ASCII compatibility, but I agree that must only be used sounds too harsh. [...] it struck us: if we can see that the page claims to be UTF-16, it can't be, because our meta declaration scanning assumes ASCII compatibility. I think you just answered why the spec says must only be used :) it is so that the parsing of the meta tag can happen predictably. YA
RE: SCSU compression (WAS: RE: Thai word list)
This looks like a nice endorsement of SCSU: :D It saves 59% just as a charset, and it saves almost 20% in a system with a real compression. I am all for SCSU as a charset (after my tools can view it properly), but that was not the use there. OTOH there is gzip encoding in HTTP 1.1 :) Seriously, SCSU is fine for some uses, but in this example, was definitely not the best way to appreciate a reduction in file size. By the 20% you mean an additional 20% by doing SCSU+gzip versus just gzip, right? YA
RE: Japanese and Chinese and ... word lists (WAS RE:Thai word list)
Since we're on this topic, what about sources for other languages where a dictionary is needed to do word breaking? I'd be interested in Chinese and Japanese myself for instance, YA
RE: Thai word list
If you can process SCSU, and would appreciate a 59% reduction in file size, try: http://home.adelphia.net/~dewell/th18057-scsu.txt (135,731 bytes) Not to knock SCSU, but if it had been gzipped instead, the resulting file would be about half that size: 70,912 bytes. (The gzipped SCSU-encoded file is 57,987 bytes itself.) YA
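The gzip half of such a comparison is easy to reproduce with any standard library; SCSU is not in one, so this only shows the deflate side, and on stand-in repetitive Thai text rather than Doug's actual word list:

```python
import gzip

# Stand-in repetitive Thai text (an assumption; not the actual word list).
text = 'สวัสดีชาวโลก ' * 5000
utf8 = text.encode('utf-8')
gz = gzip.compress(utf8)
# Word lists share many substrings, so the deflate side alone
# shrinks the UTF-8 form dramatically.
```

Running gzip over the SCSU form instead of the UTF-8 form is what produced the 57,987-byte figure above: the two techniques compose.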
RE: Default endianness of Unicode, or not
The last time I read the Unicode standard, UTF-16 was big endian unless a BOM was present, and that's what I expected from a UTF-16 converter. Conformance requirement C2 (TUS 3.0, p. 37) says: [And many other good references where TUS does *not* say that :)] OK, maybe in 2.0, or I made an assumption about network byte order. Or maybe I read this too: I do remember reading once, somewhere, that big-endian was a preferred default in the absence of *any* other information (including platform of origin). But I can't find anything in the Unicode Standard to back this up, so I'll assume for now that both byte orientations are considered equally legitimate. Thanks for getting the references and checking, Doug. YA
RE: MS/Unix BOM FAQ again (small fix)
The reason for ICU's UTF-16 converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for "use the UTF-16 byte serialization that is normally used on this machine". I agree that the application may know better. It is just unfortunate that the name is not UTF-16PE, to remind people that it is about platform endianness. Also, when used in a script using say uconv, the script does not have access to ucnv_detectUnicodeSignature(), so you end up in a situation where you get a file identified as being in UTF-16, but when you use the UTF-16 converter it may not be readable. If instead you had UTF-16PE as the convenience name for the platform-endian UTF-16, and had UTF-16 handle the BOM and default byte order expectation (conformance clause C3 of TUS), then it'd be much easier on newcomers. YA
RE: Default endianness of Unicode, or not
D43 *UTF-16 character encoding scheme:* the Unicode CES that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format. * In UTF-16 (the CES), the UTF-16 code unit sequence 004D 0430 4E8C D800 DF02 is serialized as FE FF 00 4D 04 30 4E 8C D8 00 DF 02 or FF FE 4D 00 30 04 8C 4E 00 D8 02 DF or 00 4D 04 30 4E 8C D8 00 DF 02. etc., etc. So same semantics as before. In the absence of any indication of what byte order is used, assume big endian. YA
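D43's serializations can be checked mechanically; a sketch using the example code unit sequence from the definition (the surrogate pair D800 DF02 is U+10302):

```python
# The code unit sequence 004D 0430 4E8C D800 DF02 from D43.
s = '\u004d\u0430\u4e8c\U00010302'

be = s.encode('utf-16-be')
le = s.encode('utf-16-le')
assert be == bytes.fromhex('004d04304e8cd800df02')
assert le == bytes.fromhex('4d0030048c4e00d802df')

# The UTF-16 CES may prepend a BOM to either form:
with_bom_be = b'\xfe\xff' + be
with_bom_le = b'\xff\xfe' + le
```

The third serialization in D43 (no BOM, big-endian) is just `be` itself, which is the "assume big endian" case.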
RE: Default endianness of Unicode, or not
And of course, I have been complaining about ICU's UTF-16 converter behavior, but glibc's makes the same assumption that UTF-16 is in the local endianness: gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii iconv: illegal input sequence at position 0 gabier% So fixing one but not the other may introduce different compatibility problems, this time on the local platform. Ugh. YA
RE: Default endianness of Unicode, or not
So same semantics as before. Yep. The editorial committee wouldn't be doing its job right if it were changing the semantics of the standard. Agreed! Is there any mention that the non-BOM byte sequence is most significant byte first anywhere else? You know, for the newbies? Joshua 1.8 This book of the law shall not depart out of thy mouth; but thou shalt meditate therein day and night, that thou mayest observe to do according to all that is written therein: for then thou shalt make thy way prosperous, and then thou shalt have good success. (King James) -- Keep this book of the law on your lips. Recite it by day and by night, that you may observe carefully all that is written in it; then you will successfully attain your goal. (New American Bible) I think in this case, the semantics change from meditate (which implies reflection and intelligence) to recite (as I've done blindly as a student) is either unfortunate or telling. Pick one. (Not that you can't meditate on something you know by heart; I just think meditate is better.)
YA (From Merriam-Webster, http://www.m-w.com/:) Main Entry: med*i*tate Pronunciation: 'me-d-tAt Function: verb Inflected Form(s): -tat*ed; -tat*ing Etymology: Latin meditatus, past participle of meditari, frequentative of medEri to remedy -- more at MEDICAL Date: 1560 intransitive senses : to engage in contemplation or reflection transitive senses 1 : to focus one's thoughts on : reflect on or ponder over 2 : to plan or project in the mind : INTEND, PURPOSE synonym see PONDER - med*i*ta*tor /-tA-tr/ noun Main Entry: re*cite Pronunciation: ri-'sIt Function: verb Inflected Form(s): re*cit*ed; re*cit*ing Etymology: Middle English, to state formally, from Middle French or Latin; Middle French reciter to recite, from Latin recitare, from re- + citare to summon -- more at CITE Date: 15th century transitive senses 1 : to repeat from memory or read aloud publicly 2 a : to relate in full recites dull anecdotes b : to give a recital of : DETAIL recited a catalog of offenses 3 : to repeat or answer questions about (a lesson) intransitive senses 1 : to repeat or read aloud something memorized or prepared 2 : to reply to a teacher's question on a lesson - re*cit*er noun
RE: MS/Unix BOM FAQ again (small fix)
This is incorrect. Here is a summary of the meaning of those bytes at the start of text files with different Unicode encoding forms. Beginning with bytes FE FF: UTF-16, big endian, BOM omitted from contents. Beginning with bytes FF FE: UTF-16, little endian, BOM omitted from contents. Unfortunately this breaks with popular Unicode libraries like ICU (I am Cc:ing them here, since I have the opportunity to raise this again), where UTF-16 is mapped to the platform-endian form: (From ICU's convrtrs.txt file:) # The ICU UTF-16 converter uses the current platform's endianness. # It does not autodetect endianness from a BOM. UTF-16 { MIME } UTF16_PlatformEndian ISO-10646-UCS-2 { IANA } csUnicode ibm-17584 ibm-13488 ibm-1200 cp1200 ucs-2 (End of excerpt.) This is typically *very* confusing to new users of Unicode. I wish such libraries used only a UTF-16PE denomination for such a converter, and handled UTF-16 as a converter per the expectations that Mark described well in his explanation of how to interpret a FF FE / FE FF sequence of bytes. Otherwise you end up having people properly label as UTF-16 some UTF-16 data with a BOM, and naive code using the library's UTF-16 converter (sounds appropriate, right?) fails to decode the data properly. In the context of ICU, it's one of my favorite pet peeves, especially since ICU is usually so a*al about being very strict as far as the interpretation of a given charset name goes. The last time I read the Unicode standard, UTF-16 was big endian unless a BOM was present, and that's what I expected from a UTF-16 converter. YA
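That summary maps directly to code; a minimal converter with those semantics, plus the big-endian default for untagged data, might look like this (an illustration, not any library's actual implementation):

```python
def decode_utf16(data: bytes) -> str:
    """Interpret FE FF / FF FE per the summary above; BOM is consumed,
    and BOM-less data defaults to big endian."""
    if data.startswith(b'\xfe\xff'):
        return data[2:].decode('utf-16-be')
    if data.startswith(b'\xff\xfe'):
        return data[2:].decode('utf-16-le')
    return data.decode('utf-16-be')
```

A platform-endian converter would be the same function with the final line swapped for the local byte order, which is exactly the behavioral difference being complained about.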
RE: Collation - last character?
TUS does not prevent anyone from putting noncharacter code points in Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads U+ is reserved for private program use as a sentinel or other signal. I would expect this to hold true for the noncharacters that were introduced later too. It may not fit your needs if you're looking for a character, but it is available for use by applications. But it is *not* available to *users* to put into lists to make certain elements sort at the end. When dealing with user-specified lists, I would if possible introduce some markup so that my application can deal with those two special cases (lowest/highest) as it wishes internally, without burdening the user with the need to enter an improbable (in her everyday context) code point. YA
RE: Collation - last character?
Markus Scherer wrote: How about U+10? It is a non-character, which gives it a high (unassigned character) weight in the UCA. It is the highest code point = the last character. That is definitely not what I was looking for. It is an illegal code point, while I was looking for a legal code point, and one that would not 'happen to be' the last, but would be 'defined as' last. TUS does not prevent anyone from putting noncharacter code points in Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads U+ is reserved for private program use as a sentinel or other signal. I would expect this to hold true for the noncharacters that were introduced later too. It may not fit your needs if you're looking for a character, but it is available for use by applications. YA
RE: Standard Conventions and euro
The old currencies on the continent (German Mark, Dutch guilder, French franc) however use a period to divide the groups and a comma as a decimal sign. Some use a full stop as the thousands separator and some use a numeric (nonbreaking) space. Switzerland uses an apostrophe for the thousands separator, I believe. Yes, Switzerland uses an apostrophe. France does use a comma for the decimal separator, but uses a non-breaking, non-expansible (constant width) space to group digits 3 by 3, and not a dot: 1 799 237,59. Check your Palm if you have one. Last time I looked, their number formats were okay. YA
RE: Standard Conventions and euro
listing the way I wanted it. *nix systems that start with fr_FR and then allow you to define fr_FR-EURO or something really aren't much better; what if I want to deviate from the pre-defined locale in four or five ways instead of just one? They do not let you deviate from a pre-defined locale in one way. They have two pre-defined locales, whose names are fr_FR and fr_FR-EURO (fr_FR@EURO), and you can simply select one or the other. Anybody's free to write an fr_FR@MYTASTE locale that customizes fr_FR, and use that. YA
RE: Standard Conventions and euro
On Fri, 1 Mar 2002 11:26:42 +0100, Marco Cimarosti wrote: French francs amounts were often written with a single decimal (because the smallest coin was 10 cents). No, the 5 centime coin remained in use (until the recent demise of the Franc, of course) and in any case it was very rare to see amounts written (or displayed) with anything other than 2 decimals. And we even had some 3,99 FF prices, even though we couldn't pay them in cash. What happened is that the store would sum up everything you buy and then round down to whatever could be paid in hard currency. A good deal: two 3,99 FF items would set you back 7,95 FF, versus 7,90 FF if they had been priced at 3,95 FF to start with. Multiply that by millions of sales monthly. YA
RE: Unicode page Web ring?
My page is in Unicode, but does not mention Unicode except in the headers, and the headers are invisible unless you choose view source in your browser. My company's service has been in UTF-8 since I joined in 1998. See http://www.realnames.com/. Another good example, but it's much more recent: http://www.msn.com/. YA
RE: ISO 3166 (country codes) Maintenance Agency Web pages move
I'm confused. Do you mean meaningless identifiers? They look meaningless to me. House numbers in North America (and in France also, it seems) have a few bits of meaning: the least-significant (numeric) bit tells you which side of the street the house is on, and it's often the case that you can deduce the cross street from the house number. Similarly with the others. Until, that is, some smart beep decides to renumber everybody by counting the distance in meters from the start of the street to the house. We've got a house that went from an odd to an even number this way. Not to mention people wondering why the neighbor of 650 SomeStreet was at 615... YA
RE: Standard Conventions and euro
Perhaps not as physical currency, but they sure do still exist in data, and will continue to exist in data until the Apocalypse. When is that scheduled to occur? [Alain] Very simple: « la semaine des quatre jeudis » (the week of the 4 Thursdays, as we say in French). And the exact day would be that of St Glinglin. (Still a French reference.) YA
RE: Unicode and end users
If foo is a US-ASCII string, grep foo file will work fine with any US-ASCII-superset charset for which non-ASCII characters do not use bytes below 0x80, including the hypothetical one I described, with no possibility of a false match. However grep fóó file will work only if the current shell charset (i.e. of argv[1]) matches the encoding of file. Not necessarily. It will work as long as the sequence of 3 bytes fóó is the representation of the string you are looking for in the file, in that file's encoding. grep does not validate anything, nor should it IMHO. If you want to guarantee the encoding, use a converter like ICU's uconv(1) or iconv(1). YA
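The point that grep compares bytes, not characters, can be made concrete (the strings here are purely illustrative):

```python
# grep-style matching is a raw byte comparison: the pattern bytes from
# the shell (argv[1]) are searched in the file's bytes, with no
# validation of either side's encoding.
line_latin1 = 'on cherche fóó ici'.encode('latin-1')
line_utf8 = 'on cherche fóó ici'.encode('utf-8')

pattern = 'fóó'.encode('latin-1')   # the 3 bytes a Latin-1 shell passes
assert pattern in line_latin1        # encodings agree: match
assert pattern not in line_utf8      # encodings differ: no match

# An ASCII-only pattern matches either way:
assert b'cherche' in line_latin1 and b'cherche' in line_utf8
```

This is exactly the "works as long as the byte sequences agree" condition: same logical string, same result only when the two encodings produce the same bytes.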
RE: This spoofing and security thread
The very fact that most of them can be reduced to ASCII and people still find the resulting text useful and accurate to the original is a sign that the important characters in English are in ASCII. And all the standard transliterations (em-dash -> --, c-cedilla -> c, e-acute and e-grave -> e, o-umlaut -> o, shaped quotes -> " and ') are from characters in Windows-1252. Well, wouldn't you expect an American standard to properly encode the important characters for English? I would. Only ISO has the luxury of encoding Western European languages without catering properly to French and some Nordic languages (sorry, forgot which; as for French, I am referring to the lack of the oe ligature in iso-8859-1). YA
RE: Unicode and end users
UTF-8 should *never* contain the BOM. But as has been pointed out, it is common practice for Microsoft, and also for ICU's genrb tool, for example, which uses the BOM to autodetect the encoding. The more examples of that you see, the more people will use the BOM (now, can't we all use -*- coding: utf-8 -*- ;-)?). YA
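A signature-based autodetector of the kind genrb presumably uses can be sketched generically (an illustration, not ICU's actual code; note the longer UTF-32-LE signature must be tested before the UTF-16-LE one it starts with):

```python
# (signature bytes, encoding name), longest-prefix first.
BOMS = [
    (b'\xef\xbb\xbf', 'utf-8'),
    (b'\xff\xfe\x00\x00', 'utf-32-le'),
    (b'\x00\x00\xfe\xff', 'utf-32-be'),
    (b'\xff\xfe', 'utf-16-le'),
    (b'\xfe\xff', 'utf-16-be'),
]

def sniff_signature(data: bytes):
    """Return (detected encoding or None, payload without the signature)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

This is why tools that rely on the signature keep encouraging its presence: without it they have nothing to sniff.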
RE: This spoofing and security thread
What do you mean? I've done work for Project Gutenberg, and looked at a number of books with thoughts of reducing them to ASCII. In my opinion, Windows-1252 has every character that most English books will need. Especially those books that you want to reduce to ASCII :-) YA
RE: UTF-16 is not Unicode
An ideal interface should probably automatically and silently select Unicode (and its default UTF) whenever one or more of the characters in a document are not representable in the local encoding. I beg to differ. Silently doing such an unexpected change is guaranteed to confuse the user, especially as she starts exchanging the files or loading them in other programs. The interface should warn the user and offer a couple of sensible choices, one of them (and maybe the default) being to save using one of the UTFs. YA
RE: Unicode and Security: Domain Names
Moreover, the IDN WG documents are in final call, so if you have comments to make on them, now is the time. Visit http://www.i-d-n.net/ and sub-scribe (with a hyphen here so that listar does not interpret my post as a command!) to their mailing list (and read their archives) before doing so. The documents in last call are: 1. Internationalizing Domain Names in Applications (IDNA) http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt 2. Stringprep Profile for Internationalized Host Names http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-07.txt 3. Punycode version 0.3.3 http://www.ietf.org/internet-drafts/draft-ietf-idn-punycode-00.txt 4. Preparation of Internationalized Strings (stringprep) http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-00.txt and the last call will end on Feb 11th 2002, 23h59m GMT-5. There is little time left. YA
RE: Unicode and Security: Domain Names
Are the actual domain names as stored in the DB going to be canonical normalized Unicode strings? It seems this would go a long way towards preventing spoofing ... Names will be stored according to a normalization called Nameprep. Read the Stringprep (general framework) and Nameprep (the IDN application, a Stringprep profile) documents for details. This normalization includes a step of normalizing using NFKC, but it does more than that. no one would be allowed to register a non-canonical normalized domain name. Then, a resolver would be required to normalize any request string before the actual resolve. To keep resolvers' load the same as today, client applications will do the normalization of their requests. If they don't normalize properly, the lookup will just fail. Read the IDNA document for more info on this. All normalized strings are encoded in a so-called ASCII Compatible Encoding which uses the restricted set of characters used in the DNS today (letters, digits, hyphen except at the extremities) for host names (which are different from STD13 names, cf. SRV RRs for example). Read IDNA, again, and Punycode, the chosen encoding. YA
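As a footnote from later on: these drafts became the IDNA 2003 RFCs, and current Python versions ship a built-in idna codec that implements the whole nameprep-plus-Punycode pipeline with the xn-- ACE prefix, so the round trip described above can be exercised from the standard library:

```python
# ToASCII: nameprep normalization, then Punycode, then the "xn--" prefix.
ace = 'bücher'.encode('idna')
assert ace == b'xn--bcher-kva'

# ToUnicode reverses it:
assert b'xn--bcher-kva'.decode('idna') == 'bücher'

# The bare Punycode codec shows the ACE body without the prefix:
assert 'bücher'.encode('punycode') == b'bcher-kva'
```

A client that fails to run this normalization before lookup simply queries a name that was never registered, which is the "lookup will just fail" behavior described above.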
RE: Unicode and Security
Well, nothing wrong with Unicode of course. Just means that there will need to be an option in your browser to reject any site without a digital certificate, and perhaps it will need to be turned on by default. So? Nothing prevents sites running frauds from getting a certificate matching their name. If the price of certificates drops, or if the fraud has good enough margins, it will not even be a big inconvenience. YA
RE: ICU's uconv vs Linux iconv and UTF-8
As part of the mystery of CJK encodings, I notice that IBM ICU's uconv and SuSE 6.4 Linux iconv differ as to the UTF-8 representation of table.euc. Both converters will round-trip with themselves and give a byte-exact copy of table.euc. Weirdly, they differ in how they map '\' and '~' in ASCII space as well as some spots in higher characters. That is understandable if they use different tables. The question is which one is the right EUC-JP, and which one do users want? ICU, as well as iconv, could have two tables with the different mappings. The question then is how to label them, and whether the labeling should be compatible between the two. Linux iconv will not take ICU's UTF-8. ICU's uconv will read the iconv output but does produce the same as the original table.euc. I find that statement confusing. Are you saying that uconv's UTF-8 is ill-formed? Nick, would you mind emailing me (and just me, not the list) your table.euc sample file? Thanks, YA
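A concrete illustration of the same-bytes-different-tables phenomenon, using a pair of variant tables that do ship in Python's stdlib (the Shift-JIS family rather than EUC-JP, so analogous to, not identical to, the '\' and '~' case above):

```python
# The same two bytes decode to different characters depending on which
# vendor table a converter implements.
raw = b'\x81\x60'
assert raw.decode('shift_jis') == '\u301c'   # WAVE DASH (JIS-style table)
assert raw.decode('cp932') == '\uff5e'       # FULLWIDTH TILDE (Microsoft table)

# Both decoders round-trip with themselves, just like the two EUC-JP
# converters discussed, yet their UTF-8 outputs differ.
assert '\u301c'.encode('shift_jis') == raw
assert '\uff5e'.encode('cp932') == raw
```

Distinct codec names (shift_jis vs. cp932) are precisely the labeling solution being suggested for the two EUC-JP tables.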
RE: ICU's uconv vs Linux iconv and UTF-8
It is definitely a problem to try to interpret what any given label is supposed to be. The problem is that MIME labels and others are ambiguous, and are interpreted different ways on different systems. Still, in the meantime it does make sense to have EUC-JP associated with the most common interpretation of it, doesn't it? Just for the sake of user satisfaction? I am curious: is there a better name for the EUC-JP that ICU is using, that would make everybody understand which one it is? If so, we could have EUC-JP for the one that the rest of the world wants. YA
RE: Introducing the idea of a ROMAN VARIANT SELECTOR (was: Re: Proposing Fraktur)
quite a lot of space. However, Fraktur is already encoded in the Mathematical whatever-it's-called block. This variant selector would mean that lots of characters can be displayed in two *different* ways. I'd prefer that Fraktur diacritics were added instead, and that the mathematical letters were to be used for Fraktur texts. I hope not. These were encoded there because they convey a specific meaning when used for mathematics. If you use them to spell out names, then you're abusing them and potentially confusing software that would rely on their mathematical semantics. I think it's time to have another proposal for French, FRENCH VARIANT SELECTOR, where we do not use Fraktur but some other font variation. And we may need a QUEBEC VARIANT SELECTOR if they have different rules... Or should it be a QUEBEC FRENCH VARIANT SELECTOR to show the relationship? YA
RE: POSITIVELY MUST READ! Bytext is here!
Well, I've seen cases where chat engines have converted ASCII into emoticon pictures at the wrong places... And sometimes you can't turn them off. Grumble. I couldn't give out sample code in MSIM using foo(c) for a function call w/o getting a cup of coffee after foo! YA
RE: [Very-OT] Re: ü
Obviously (I advocate in French changing the spelling of common foreign words so that there would be more consistency). Le ouiquende? That would be pronounced wikãd... To respect the English pronunciation you would have to write it ouiquennde, which would still be a very odd spelling in French... The end sound is really not French in itself... France's Académie française is good at that: they recently invented cédérom (CD-ROM; gets used because it's quite okay), and mèl (mail, for e-mail; nobody uses it except to make fun of it). YA
RE: RE: [Very-OT] Re: ü
http://www.culture.fr/culture/dglf/dispositif-enrichissement.htm Thanks for the pointer. Though I can't find the exact sentence re: the substantive use, I found mél referred to as a symbol for messagerie électronique. I like courriel a lot. Nice. YA
RE: Funky characters, Japanese, and Unicode
1. I have a Geocities page now. I do not know what encoding Geocities uses, but I think it's unicode. What I did for the Japanese text on it was not think about encodings and just type it in with Microsoft's IME (and do some swearing at the IME in the process). And it comes out fine, for the most part. Why does this work? What encoding does it use? Your browser (which one?) just does a good job of detecting the encoding used for your page http://www.geocities.com/elevendigitboy/. For instance, if I view it with IE after unselecting the Autoselect item of the View > Encoding menu, I get garbage as expected. Otherwise, IE does recognize Shift_JIS. YA -- Sailing is harder than flying. It's amazing that man learned how to sail first. -- Burt Rutan.
RE: Off topic: Whut in tarnation is Unicode?
Re: elite-speak generator, I meant the one Edward Cherlin posted: L33t-5p34k, d00d! 1t'5 3v3rywh3r3. Try the L33t-5p34K Generator!!!### at http://www.geocities.com/mnstr_2000/translate.html but the link to the trusty mail archives was enough :) Thanks. YA
RE: Off topic: Whut in tarnation is Unicode?
Now if someone could resend this elite-speak converter link, it was great. Please... Thanks! YA
RE: C with bar for with
It may even be a glyph variant of the w with forward slash... YA -Original Message- From: Stefan Persson Sent: Sunday, December 02, 2001 3:19 AM Subject: Re: C with bar for with - Original Message - From: [EMAIL PROTECTED] Sent: den 2 december 2001 02:16 Subject: C with bar for with Someone said that in English, c-with-underbar means with. My mom writes this as c-with-overline. Well, then I suppose this is a glyph variant of the c with underbar... Stefan
RE: Character encoding at the prompt
But: setenv LC_ALL en_US.UTF-8, then env LC_ALL=it date prints giovedì, 25 ottobre 2001, 11:45:24 EDT. I could not understand why I get the display of the letter ì in the en_US.UTF-8 locale. My understanding was that the date command was generating the message in the Italian locale (default encoding iso-8859-1) and as a result ì would be encoded as 0xEC. The display should be done in the en_US.UTF-8 locale and be an invalid byte sequence. I think you're making an improper assumption, namely that your *terminal* is in UTF-8 and would then complain. Unless your terminal has explicit support for UTF-8, I do not think it will validate things. And it apparently has not been started from a process that was already using UTF-8, since you're issuing your setenv LC_ALL en_US.UTF-8 at the prompt. This only affects subsequent commands (unless overridden, of course, as in your next call), not the running process. YA PS: not to mention zsh(1) would be a better shell ;-) just teasing
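The only-subsequent-commands behavior is easy to demonstrate outside csh; a Python sketch of the same two steps (sh and the variable values are illustrative assumptions):

```python
import os
import subprocess

# Like `setenv LC_ALL en_US.UTF-8`: changes the exported environment,
# so only child processes started afterwards see it.
os.environ['LC_ALL'] = 'en_US.UTF-8'

# Like `env LC_ALL=it ...`: a per-command override for one child only.
child = subprocess.run(['sh', '-c', 'echo "$LC_ALL"'],
                       env={**os.environ, 'LC_ALL': 'it'},
                       capture_output=True, text=True)
assert child.stdout.strip() == 'it'           # override reached the child
assert os.environ['LC_ALL'] == 'en_US.UTF-8'  # parent process unchanged
```

The terminal emulator is yet another process, started earlier still, which is why it renders bytes under whatever locale it was launched with.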
RE: normalize before map?
[People were discussing whether one should do some case mappings before doing normalization, or the other way around, and whether the case mapping can be naive or must account for what normalization will do/has done in order not to break assumptions that the resulting string is both case-folded and normalized. The normalization form used can be anything, I believe, though in the IETF context NFKC and NFC are the common ones.] My guess is that case folding by no means guarantees that the output is still normalized. Right, if you fold and then normalize, your string might not be properly folded anymore (which is why nameprep had to adjust the mapping table). Similarly, if you normalize and then fold, your string might not be properly normalized anymore. Either way, if you want a string to be both normalized and folded, you cannot naively apply normalization and case-folding (in either order); you need to tweak the mapping table to compensate for the interactions. The sentence quoted from UTR#21 above glosses over this problem. The problem exists (and has a solution) no matter which order you use. Does Mark Davis (the author of UTR#21) subscribe to this list? It would probably be helpful to get his thoughts on the matter. You can always Cc: [EMAIL PROTECTED] for such questions. Which I am doing now. Of course, we don't want to Cc: them all the time... YA
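The compensating-mapping point was later codified: Unicode's canonical caseless match wraps an extra normalization around the fold precisely to absorb these interactions. A stdlib sketch, using Python's NFD and str.casefold as stand-ins for the exact UTR#21-era operations:

```python
import unicodedata

def canonical_caseless(s: str) -> str:
    """NFD(casefold(NFD(s))): the inner NFD compensates for fold/
    normalize interactions so the result is both folded and normalized."""
    return unicodedata.normalize(
        'NFD', unicodedata.normalize('NFD', s).casefold())

# Strings that differ only by case or by composed/decomposed form compare equal:
assert canonical_caseless('\u00c5') == canonical_caseless('A\u030a')  # Å
assert canonical_caseless('STRASSE') == canonical_caseless('stra\u00dfe')
```

A single naive pass in either order is what this recipe replaces, which is exactly the point being argued above.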
RE: Currency symbols (was RE: Shape of the US Dollar Sign)
About ₤ (L with two bars = Italian lira or Egypt/Cyprus pound) and £ (L with one bar = Pound Sterling or Irish punt), I think that the Unicode distinction is not valid because: [...] For these reasons, I suggest that font designers ignore the distinction between U+00A3 (POUND SIGN) and U+20A4 (LIRA SIGN) and use the same glyph for both. The glyphs should have one or two bars depending on the font style and on the choice made for other currency symbols. Interesting comment. Isn't the Unicode distinction simply one of characters, and the difference in glyphs shown in the standard simply a reflection of the preferences of the designer of the fonts used to print the character tables? I'd think so. YA
RE: DerivedAge.txt
At the request of someone working with ICU, I regenerated a derived file that shows the age of Unicode characters -- when they came into Unicode. Does anyone think this might be useful to have in the UCD? It is definitely useful information that could go into UNIDATA. Here is a good use for it (and my reason for asking Mark to regenerate it for me): when one uses a library such as ICU that manipulates 3.1 data but wants to store some data in a database that won't like anything after 2.x. Using this, one can validate data before sending it to the database as needed. It doesn't necessarily have to get into the UCD, except if it helps me make a smaller change to ICU to support the version as a character property ;-) YA
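For the database-validation use case, consuming the regenerated file is straightforward; a sketch that assumes the `first..last ; version` line format of the published DerivedAge.txt:

```python
def parse_derived_age(lines):
    """Parse DerivedAge.txt-style lines such as
    '0041..005A ; 1.1 # ...' or '20AC ; 2.1' into (first, last, version)."""
    ranges = []
    for line in lines:
        line = line.split('#', 1)[0].strip()   # drop comments/blanks
        if not line:
            continue
        codepoints, version = (f.strip() for f in line.split(';'))
        first, _, last = codepoints.partition('..')
        ranges.append((int(first, 16), int(last or first, 16), version))
    return ranges

def age(ranges, cp):
    """Version in which code point cp was assigned, or None."""
    return next((v for lo, hi, v in ranges if lo <= cp <= hi), None)
```

Validation for a 2.x-only database is then just rejecting characters whose age sorts after "2.1".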
RE: 3rd-party cross-platform UTF-8 support
UTF-16 <-> wchar_t*. Wait, be careful here. wchar_t is not an encoding. So, in theory, you cannot convert between UTF-16 and wchar_t. You can, however, convert between UTF-16 and wchar_t* on Win32 since Microsoft declares UTF-16 as the encoding for wchar_t. And he can also do so between UTF-16 and UTF-32 for glibc-based programs since UTF-32 is the encoding for wchar_t on such platforms. The way I read that was UTF-16 <-> UTF-(8*sizeof(wchar_t)). (Please don't ask what happens when sizeof(wchar_t) is 3 or larger than 4, you know what I mean :)). I guess the responsibility of this being a meaningful conversion would be with the caller. YA PS: I don't know a way of knowing the encoding of wchar_t programmatically. Is there one? That'd offer some interesting possibilities..
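Re the PS: in C itself the only standard hint is the compile-time __STDC_ISO_10646__ macro (which promises wchar_t values are ISO 10646 code positions); a crude runtime clue is simply sizeof(wchar_t), here probed through Python's ctypes (a heuristic guess, not a guarantee):

```python
import ctypes

# 2 bytes suggests UTF-16 (as on Win32), 4 suggests UTF-32 (as with
# glibc) -- exactly the UTF-(8*sizeof(wchar_t)) reading above.
width = ctypes.sizeof(ctypes.c_wchar)
assert width in (2, 4)
guess = 'UTF-16' if width == 2 else 'UTF-32'
```

The width alone cannot distinguish, say, UTF-32 from a legacy 32-bit wide encoding, which is why the caller still carries the responsibility for the conversion being meaningful.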
RE: UTF-8 on NT
I'm also thinking of 3rd-party UTF-8 support such as libutf8, IBM ICU. They seem no good supports on NT, what do you think? We are using ICU for all our Unicode needs, on NT, Windows 2000, and Unix, and it works perfectly well on all of these. YA
How are the UNIDATA derived files generated?
Hi, I would like to know how the derived files that one can find in the UNIDATA folder are generated. I am trying to have IBM's ICU library support older versions of Unicode than the one it currently supports (3.0.something), specifically Unicode 2.1.x. ICU needs the following files: UnicodeData.txt SpecialCasing.txt DerivedNormalizationProperties.txt NormalizationTest.txt UCARules.txt FractionalUCA.txt CaseFolding.txt Mirror.txt If I look in Public/2.1-Update4 I can find the first two files for Unicode 2.1.9. A number of the other files either say they have been algorithmically generated (e.g. DerivedNormalizationProperties.txt) or look like they have. I am interested in knowing what tools have been used to generate these and whether I could get these tools and use them to generate the same files for another version of Unicode. I am sure I could write some tools myself (following the instructions in DerivedProperties.html for DerivedNormalizationProperties.txt for example) but I am looking for a quicker way to generate these. Thanks for any help on this, YA PS: Also I hope that all the derived files will be stored in the non-UNIDATA folders as Unicode is revised. They'll be helpful for people that need to build a Unicode library for a very specific version of Unicode. -- My opinions do not necessarily reflect my company's. The opposite is also true..
RE: Locale codes (WAS: RE: RTF language codes)
On Thu, Jul 26, 2001 at 01:04:29AM -0700, Yves Arrouye wrote: If you have a cross platform system you should use RFC 1766 style locales between systems and convert them to LCIDs on Windows. RFC 3066 was published in January. Check it out. http://www.ietf.org/rfc/rfc3066.txt Note that neither RFC 1766 nor RFC 3066 refer to locales; they just define language identification tags. Yes, I should have made that correction in my reply. These tags, or some variations of these tags (e.g. replacing the hyphen by an underscore) can be found as locale identifiers in many systems, I think that's what Carl was referring to (e.g. use en_US). I am not sure, and don't think, that the use of en_US on Unix/POSIX is actually related to these RFCs. Does anybody know for sure? YA
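The variation mentioned above, replacing the hyphen with an underscore, is trivial to do mechanically. A hypothetical helper (the function name is mine, not from any real API) that turns an RFC 3066 tag like "en-US" into the POSIX-style spelling "en_US":

```c
#include <string.h>

/* hypothetical helper: rewrite an RFC 3066 language tag in place
   into the underscore-separated form seen in POSIX locale names */
void tag_to_posix(char *tag) {
    for (char *p = tag; *p; p++) {
        if (*p == '-') {
            *p = '_';
        }
    }
}
```

Note that this only fixes the separator; it says nothing about whether the pieces themselves (territory codes, variants) mean the same thing in both worlds, which is the real question about whether POSIX usage is related to these RFCs.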
Locale codes (WAS: RE: RTF language codes)
If you have a cross platform system you should use RFC 1766 style locales between systems and convert them to LCIDs on Windows. RFC 3066 was published in January. Check it out. http://www.ietf.org/rfc/rfc3066.txt YA
RE: Ethnologue 14 online
After considerable and unfortunate delay, the new Ethnologue site, including the online version of the 14th Edition, is at last available to the public: http://www.ethnologue.com/home.asp. There are still refinements being made, but all the basics are there and working. Very nice! Something to get lost into for hours... YA
RE: More about SCSU (was: Re: A UTF-8 based News Service)
SCSU doesn't look very nice for me. The idea is OK but it's just too complicated. Various proposals of encodings differences or xors between consecutive characters are IMHO technically better: much simpler to implement and work as well. These differential schemes seem to be the way IDN (internationalized domain names) are headed. They are intended for the limited scope of domain names that have already passed through nameprep, which performs normalization and further limits the range of allowable characters. I'm not sure how well the ACEs would perform with arbitrary Unicode text. I suppose only testing would answer that question. Also don't forget they're likely to add some code point reordering. Do we want that too in an alternate scheme? Then is it really that much simpler than SCSU? (Probably; tables for code point reordering are not complex to build. But they may take some effort to optimize, so my guess is the implementation effort may be roughly the same.) YA
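For readers unfamiliar with the differential idea being compared to SCSU here, a toy sketch (this is an illustration of the general principle, not the actual IDN ACE algorithm): each code point is replaced by its signed difference from the previous one, so runs of characters from the same block become small numbers that a later stage can encode compactly.

```c
#include <stddef.h>
#include <stdint.h>

/* toy sketch: differential coding of a code point sequence.
   Consecutive characters from one block yield small deltas. */
void delta_encode(const int32_t *in, int32_t *out, size_t n) {
    int32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] - prev; /* small when in[i] is near prev */
        prev = in[i];
    }
}
```

Running this over Hiragana text, for example, turns {U+3042, U+3044, U+3046} into {0x3042, 2, 2}, which shows why such schemes compress well without SCSU's windows and modes.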
RE: More about SCSU (was: Re: A UTF-8 based News Service)
SCSU is also registered as an IANA charset, although you are unlikely to find raw SCSU text on the Internet, due to its use of control characters (bytes below 0x20). And what browser supports SCSU, and what is that browser's reach in terms of population? Because that's usually what matters to people that publish on the Internet. YA
RE: Playing with Unicode (was: Re: UTF-17)
A proposal needs a definition, though: UTF would mean Unicode Transformation Format utf would mean Unicode Terrible Farce untenable total figment? unable to focus? utf twisted form? YA
RE: UTF-17
From: [EMAIL PROTECTED] Oh yeah, well, I can be more tongue-in-cheek than all of you. I've already implemented it. Quick, quick. Patent it and then open-source it. It will be unstoppable. YA
RE: UTF-17
Isn't UTF-17 just a sarcastic comment on all of this UTF- discussion? YA
RE: converting ISO 8859-1 character set text to ASCII (128) character set
We have a specific requirement of converting Latin-1 character set (ISO 8859-1) text to the ASCII character set (a set of only 128 characters). Is there any special set of utilities available, or service providers who can do that type of job? [I am assuming that your ASCII table is the ASCII everybody uses, not some variation of it.] If you do not care about the loss of information at all, just truncate the data to 7 bits. You can write a trivially simple program for that, or use your platform's conversion tools or routines (cf. iconv(1) and iconv(3) on UNIX 98 platforms, uconv from ICU's contributed applications at http://oss.software.ibm.com/icu/, or the Win32 conversion APIs whose name I forgot). If you want to minimize the loss, you may want to use fallbacks so that for example you will lose diacritics on letters but will retain the base letter. Giving you things like mon bebe a tete tout l'ete for French. I am sure the Win32 APIs will let you do that, iconv doesn't support it, and I am not sure whether the ICU ASCII converter has fallbacks (some of their converters do, some don't; but this may be outdated info). Hope this helps, YA
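If you end up writing the trivial program yourself, the fallback idea looks like this. A minimal sketch covering only a handful of French-relevant letters; a real table would cover all of 0xA0-0xFF (and the function name is mine, for illustration):

```c
/* sketch: lossy Latin-1 -> ASCII conversion with base-letter
   fallbacks for a few accented letters; everything else without
   a fallback degrades to '?' */
char latin1_to_ascii(unsigned char c) {
    if (c < 0x80) return (char)c;            /* already ASCII */
    if (c == 0xE7) return 'c';               /* c-cedilla */
    if (c >= 0xE0 && c <= 0xE5) return 'a';  /* a grave..a ring */
    if (c >= 0xE8 && c <= 0xEB) return 'e';  /* e grave..e diaeresis */
    if (c >= 0xEC && c <= 0xEF) return 'i';  /* i grave..i diaeresis */
    if (c >= 0xF2 && c <= 0xF6) return 'o';  /* o grave..o diaeresis */
    if (c >= 0xF9 && c <= 0xFC) return 'u';  /* u grave..u diaeresis */
    return '?';                              /* no fallback known */
}
```

Feeding "bébé" through this is exactly how you get "bebe" rather than garbage.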
RE: UTFs, ACEs, and English horns
Also check out the sites of the IETF IDN WG (http://www.ietf.org/html.charters/idn- charter.html, and http://www.i-d-n.net/) for more information that you may have wished for. Oops. Sorry, I only saw James's answer. You obviously read these. Well, I hope my English horns pages were new reading at least... YA
RE: UTFs, ACEs, and English horns
Also check out the sites of the IETF IDN WG (http://www.ietf.org/html.charters/idn-charter.html, and http://www.i-d-n.net/) for more information that you may have wished for. Except on English horns, that is; but then you may want to visit http://www.users.globalnet.co.uk/~gbrowne/geoff9.htm and http://www.mathcs.duq.edu/~iben/oboeng.htm :). Good luck, YA
RE: Missing characters for Italian
So my question is: is the superscript attribute essential in French to understand these abbreviations (as it is in Italian), or is it desirable but optional (as it is in English)? Not to understand them. While understanding is subjective, it is usually evident from the context that these are abbreviations, and which ones they are. I wouldn't wish them to be encoded as characters myself. Displaying them properly is what typography is for. YA
RE: Term Asian is not used properly on Computers and NET
There are also terms like the West or Western (world, languages, civilization, etc) which have referents that are not completely west of the Greenwich Meridian, whose usage cannot be simply explained or justified by it. Every point can be found west (or east) of the Greenwich Meridian. Not all of them have west or east longitudes, though. YA
RE: Metafont [was Re: Single Unicode Font]
BTW, it seems that Metafont is a trademark of Addison Wesley publishing company ... Interesting. Maybe because they published the Metafont book (and its friend Metafont: the program) along with the rest of Knuth's Computers and Typesetting books? This is the bell that Metafont (as you capitalized it) rings for me. See http://www.math.utah.edu/~beebe/fonts/metafont.html and http://cgm.cs.mcgill.ca/~luc/metafont.html. YA
RE: search ignoring diacritics
Peter - normalise both data and search string - delete / ignore all Peter characters with general category Mn It worked well for us too. Someone mentioned to me once though that U+3099 and U+309A should be preserved in order not to change the meaning of words, and we do so. But maybe this is not necessary? YA
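The approach described above can be sketched roughly as follows. This toy version works on an already-decomposed (NFD) UTF-32 buffer and only recognizes the Combining Diacritical Marks block; real code should consult the general category (gc=Mn) from the UCD, and, as noted, keep U+3099/U+309A:

```c
#include <stddef.h>
#include <stdint.h>

/* sketch: fold a decomposed (NFD) UTF-32 string for
   diacritic-insensitive search by dropping combining marks.
   Only U+0300-U+036F is handled here; a real implementation
   tests gc=Mn and preserves U+3099/U+309A. Returns the new length. */
size_t strip_marks(const uint32_t *in, size_t n, uint32_t *out) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = in[i];
        if (c >= 0x0300 && c <= 0x036F) {
            continue; /* drop combining diacritical marks */
        }
        out[m++] = c;
    }
    return m;
}
```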
RE: About Kana folding
Kenneth, Thanks for the explanations. So I'd suggest you be very careful when trying to do this kind of a folding. If it is just for surface text matching, the number of false positive matches would likely swamp the number of false negatives you'd be correcting. On the other hand, if you are doing a phonetic matching, then of course you have to fold the Hiragana and Katakana forms together. I am trying to work around a situation where people cannot register a database key in Katakana and the same one in Hiragana (because the DB's collation does some Kana folding), yet they need to be able to find it using either of these (after this key has been migrated to some other system that doesn't do Kana folding). I don't know if that's what you call surface text matching. The matching will be done on the whole key, not using N-grams. The more serious problem of equivalencing for matching in Japanese would be kanji versus Hiragana, in particular. [...] Getting this kind of thing right is far more important for matching in Japanese than just brute matching of Hiragana to Katakana. And if one wanted to do that automatically (which is not my intent, Kanji work fine), one would need a dictionary to go from words in Kanji to one Kana, is that true? YA
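For the mechanical part of the fold discussed above, the Hiragana and Katakana blocks line up: U+30A1-U+30F6 sit exactly 0x60 above U+3041-U+3096, so a fixed offset covers the common letters. A sketch (Katakana beyond U+30F6, such as small KA/KE and the VA-VO series, has no Hiragana counterpart in that range and is left alone here, which is the asymmetry raised in the question below):

```c
#include <stdint.h>

/* sketch: fold Katakana into Hiragana by a fixed code point offset.
   U+30A1-U+30F6 maps onto U+3041-U+3096; other characters pass through. */
uint32_t katakana_to_hiragana(uint32_t c) {
    if (c >= 0x30A1 && c <= 0x30F6) {
        return c - 0x60;
    }
    return c;
}
```

For example, KA (U+30AB) folds to ka (U+304B). Whether folding in this direction is semantically safe is of course the harder question discussed in this thread.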
About Kana folding
Hi, If one were to need to pick Katakana versus Hiragana and fold one into the other (say to let people match a word or sentence in any of them), is there one that is preferable to the other? I think that some Katakana have no Hiragana equivalents, does that mean that it's always easier to go from Hiragana to Katakana? Also, what are the caveats of doing such foldings (and is it possible to change meanings?) Thanks! YA -- My opinions do not necessarily reflect my company's. The opposite is also true..
RE: Help in a HURRY !!!!!!!!!!!!!!!!!!!!!!!
To go with Lukas's Perl code, I'll provide a C version, not really tested either, with ICU, to give him a choice. No error checking etc., just to give the idea. If you want UTF-16 you'll need to use the macros in unicode/utf16.h to generate surrogate pairs properly.

#include <stdio.h>
#include <string.h>
#include <unicode/utf8.h>

#define LINE_MAX 80 /* Whatever. */

int main() {
    char buf[LINE_MAX];
    while (fgets(buf, sizeof(buf), stdin)) {
        int i;
        size_t len = strlen(buf);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[--len] = 0; /* We don't want that one in the output. */
        }
        for (i = 0; i < (int)len;) {
            int32_t c;
            UTF8_NEXT_CHAR_UNSAFE(buf, i, c);
            if (c < 0x80) {
                putchar((int)c); /* Like Lukas's code, use entities only above ASCII. */
            } else {
                printf("&#%ld;", (long)c);
            }
        }
        putchar('\n'); /* Separate lines; will produce white space in HTML. */
    }
    return 0;
}

Hope this helps, YA
RE: UCD in XML
I then tried my usual remedy: Bow in precisely the correct direction (359° 16' 32 N*) Adjust the bearing for declination (15° 26' E according to my chart of the bay), and try again compass in hand, maybe? ;-) YA
RE: Using hex numbers considered a geek attitude
BTW, does anybody know how to input characters on Windows using the hex code point? I know it's good for my brain to do the exercise of going from hexadecimal to decimal, but it is still a pain to have to type ALT-DECIMAL when all I have in my book is hex. That would be a reason for providing the decimal value (not in the tables, but in the properties pages that follow maybe), actually. But I am sure there must be a way to input the hex directly. Please? YA
RE: Byte Order Marks
Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian? ICU does not do Unicode-signature or other encoding detection as part of a converter. When you get text from some protocol, you need to instantiate a converter according to what you know about the encoding. So I can't pass it some text with a BOM and say "utf-16" and let it run through that. I guess that explains why I also didn't find converters that write a BOM at the start of the conversion. Is that something that would be added to ICU in the future? It would be very nice to have a converter that would pick up the BOM (and write it back). And yes, most of the time, when you stay on a given platform, it is very convenient to use the platform's endianness. My question was rather "why isn't UTF-16 the one that detects the BOM and defaults to an externalized form, BE, and then people on a given platform would just use UTF-16PE (of which UTF-16 is an alias in ICU)?". That would facilitate interchange of information. YA
RE: Byte Order Marks
On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote: On the other hand, if you get a file from your platform and it is in 16-bit Unicode, then you would appreciate the convenience of the auto-endian alias. But nothing should be spitting out platform-endian UTF-16! In the case that there's a lot of unmarked big-endian UTF-16 around (as I understand the ISO-10646 standard recommends), then that assumption that everything emits unmarked platform-dependent UTF-16 will be wrong. And for reference, on Windows, Unicode files are recognized because they have a BOM. Write plain UTF-16LE w/o a BOM, and your file won't be recognized properly. Manipulation of these files w/ ICU today is a bit painful, since one needs to strip the BOM on input (if I understand Markus correctly) and write a BOM at output. So these cannot be manipulated using applications like uconv which blindly use the raw converters. YA
RE: Byte Order Marks
If you don't have any clue about the byte order, but you know it is UTF-16, then assume BE. Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not UTF16_BigEndian? I know that was a difference between ICU and my library, and when I asked this question a while ago I was told that despite what some literature suggests, w/o any clue, platform endianness should be used. That's contradictory. YA
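The rule being argued for in this thread (look for a BOM; with no BOM and no other clue, interpret as big-endian) is simple enough to sketch. The type and function names here are mine, not ICU's:

```c
#include <stddef.h>

typedef enum { UTF16_BE, UTF16_LE } ByteOrder;

/* sketch: determine UTF-16 byte order from a leading BOM,
   defaulting to big-endian when none is present. *skip is set
   to the number of signature bytes the caller should consume. */
ByteOrder detect_utf16(const unsigned char *buf, size_t len, size_t *skip) {
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) { *skip = 2; return UTF16_BE; }
        if (buf[0] == 0xFF && buf[1] == 0xFE) { *skip = 2; return UTF16_LE; }
    }
    return UTF16_BE; /* no BOM, no external clue: assume big-endian */
}
```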
RE: How will software source code represent 21 bit unicode charac ters?
Has this matter already been addressed anywhere? I think the C standard is in the process of making a decision about this. If memory serves, we will have escapes like '\u' and '\U'. I think they made the decision already. It is in the latest editions of the standards. The only ambiguity (for me) is whether one can write: uint32_t codepoint = '\U001'; and have it work, or if there's some implicit assumption that '\U001' is of type wchar_t, in which case the construction is not portable because the size of wchar_t is implementation-specific, and can be as small as 8 bits. I am sure we have a C/C++ expert (or many!) here that can clear that up though. YA
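A small illustration of the C99 universal character names under discussion. The comparison in the test below is only guaranteed on platforms that define __STDC_ISO_10646__ (e.g. glibc), where wchar_t values are ISO 10646 code points; elsewhere the mapping of \u escapes into the wide execution character set is implementation-defined, which is exactly the portability worry raised above:

```c
#include <wchar.h>

/* sketch: a C99 universal character name in a wide character constant.
   On __STDC_ISO_10646__ platforms this is the code point U+00E9. */
wchar_t e_acute(void) {
    return L'\u00e9'; /* LATIN SMALL LETTER E WITH ACUTE */
}
```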
RE: Identifiers
On Sun, Apr 15, 2001 at 08:10:55PM +0200, Florian Weimer wrote: Is it sufficient to mandate that all such identifiers MUST be KC- or KD-normalized? Does this guarantee print-and-enter round-trip compatibility? In general, the problem is unsolvable. There are several look-alikes among the Cyrillic, Greek, Latin and Cherokee blocks, among others. And those are not equivalent under normalization? That's a pity. But that is not the goal of Unicode normalization! (Read UAX #15, http://www.unicode.org/unicode/reports/tr15/). Which is to be expected, from a standard about characters, and not glyphs. The normalization you are talking about seems to me to be one that is glyph-centric: you're looking at shapes and are wanting to avoid confusions by making similar-looking things the same. We have normalization similar to the one you're talking about in our Internet Keywords system. It is built on top of NFKC. It is good for users, but then it is also very specific. For example, we didn't consider the look-alikes among Cyrillic, Greek, and Latin to be a problem for our users, but your comment about that being a pity seems to imply that you would. I think such normalizations depend a lot on who is going to need the names and in what context. It'll be very hard to make a general recommendation that isn't too restrictive for many. YA
RE: Identifiers
(I don't know if email addresses will be internationalized anytime soon. This is just an example. ;-) http://www.i-d-n.net/ They have a normalization process that may be used for e-mail someday. It explicitly does not do anything about similar-looking glyphs. Read their list archive, I'm sure the reason why has been discussed there. That may give you ideas for what you're trying to achieve. YA
RE: Identifiers
There should be a method to overcome the source separation rule which might have saved certain identical characters from unification. - U+0048 LATIN CAPITAL LETTER H - U+0397 GREEK CAPITAL LETTER ETA - U+041D CYRILLIC CAPITAL LETTER EN - U+13BB CHEROKEE LETTER MI If these were Han glyphs, they would have been unified, wouldn't they? ;-) Florian, I respectfully suggest that you look up the various technical reports that accompany the Unicode standard. It looks like there may be certain confusion about characters and glyphs with respect to the Unicode standard (which tackles characters, not glyphs; Han *characters* were unified, and they were in a single *script*). UTR #17 (http://www.unicode.org/unicode/reports/tr17/) should definitely be useful. See section 2.1 for instance. Hope this helps, YA
RE: Identifiers
Florian, I respectfully suggest that you look up the various technical reports that accompany the Unicode standard. It looks like there may be certain confusion about characters and glyphs Oops, got tripped by my native French language. I didn't mean "certain" but "some". Do not conclude that I jump to conclusions that easily :). YA
RE: Identifiers
We have normalization similar to the one you're talking about in our Internet Keywords system. It is built on top of NFKC. It is good for users, but then it is also very specific. Details, details! (Or do you consider that stuff a proprietary advantage?) I don't really. That would be too fragile of an advantage to build on. But as my signature shows, I may be mistaken :) For a year-old explanation of the use of Unicode in our system, from the 16th IUC, see http://www.internetkeywords.org/iuc/realnames-iuc16-paper.htm. Basically, we have two normalization forms. The first one is only for presentation, and that is a very lightweight cleanup (remove invisible characters, compress whitespace runs, map half-width characters to full-width ones...). The second one is used to define uniqueness and that is more restrictive; it builds on the cleaned-up form. We do the following: - Put the string in NFKC. - Put the string in the lowercase of its uppercase. - Map some characters to take into account alternate spellings (German, for example; when there is a conflict between languages, oops). - Undo some ligatures that KC didn't undo (as in French "qui vole un oeuf vole un boeuf"). - Map some characters that are visually very similar to their lowest common denominator (ASCII) counterpart. For example, the prime and fancy apostrophes (sorry, don't feel like fetching my Unicode book to get their proper names) are considered the same as a vanilla apostrophe. That's about it. We're considering doing new things regularly, and are/will be also doing specific things to overcome limitations of our distribution channels (for example, Kana mapping). As I've said, it's specific to the user experience we want to present to users of Keywords (fancy display, simpler input). 
There are obvious limitations, and each time we start getting a fair number of names in a given language, I look at these again, and try to do the "right thing" (fortunately, this is a subjective and very adaptable notion ;-)). Any pointers to problems that we may encounter, smart things to do, etc... are of great interest to me, please send them! YA -- My opinions do not necessarily reflect my company's. The opposite is also true.
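A toy sketch of the last step of the uniqueness form described above, mapping a few look-alike punctuation characters down to their ASCII counterpart and lowercasing ASCII letters (the real pipeline runs after NFKC and the other steps, of course; the function name and the exact character set are mine, for illustration):

```c
#include <stdint.h>

/* toy sketch: collapse visually similar apostrophe-like characters
   onto U+0027 and lowercase ASCII letters */
uint32_t fold_lookalike(uint32_t c) {
    switch (c) {
    case 0x2018: /* left single quotation mark */
    case 0x2019: /* right single quotation mark */
    case 0x2032: /* prime */
        return 0x27; /* -> plain apostrophe */
    default:
        if (c >= 'A' && c <= 'Z') {
            return c + 32; /* ASCII lowercase */
        }
        return c;
    }
}
```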
RE: Identifiers
Is it sufficient to mandate that all such identifiers MUST be KC- or KD-normalized? Does this guarantee print-and-enter round-trip compatibility? It depends on the accuracy of both the printer and the reader. So I'd say no. People won't necessarily tell the difference between a middle dot and some bullets, for example. But do we always want every identifier to resist the "napkin test"? Not necessarily. And IDN is an example where this was not chosen, so internationalized e-mail addresses, as per today's IDN I-Ds, won't have this guarantee. And remember that for most people or organizations, the problem will be much simpler: they won't understand the identifier, let alone make such fine distinctions. For example, even if you print a high-resolution version of a Japanese e-mail address, chances are that I won't be able to type it in anyway in any software (though I may be able to recognize the glyphs and copy/paste them from a Japanese site. Ugh)... YA
RE: Sun's Java encodings vs IANA's character set registry
I should not be surprised by your statement, but I am. It is distressing to think that something that by definition should not be rocket science -- repertoires of abstract characters mapped directly to specific bit patterns -- would be subject to such haphazard definition and even more haphazard implementation. Backwards compatibility strikes again. As vendors changed the mappings, they kept the same names so that they would not have to update software to use the new names. Typically the changes are thought to enhance the encoding, and people want everybody to benefit (isn't that ironic?). Shift_JIS is my favorite incompatible charset. And just think of things like putting the Euro sign in a bunch of encodings w/o changing their names, or of when Windows-1252 was advertised as iso-8859-1 for interoperability purposes... It's a dangerous world ;) YA
RE: Digits in Unicode Names
What would really be nice is for glibc-2.2 or any other Unicode-enabled library to display Unicode characters, etc. by just using the "escape" sequence \uXXXX, where X represents a hexadecimal digit.. Make that up to 6 Xs. One of the problems of such escapes when used in code, a la ISO C++ (like the \ooo for octal digits), is that they're variable-length, and stop as soon as an invalid char for the radix is encountered. That makes them error-prone (but fun). Does anybody know if the C++ standard specifies how many hex digits max this escape can have? And doesn't the standard say something like \u is for wchar_t, which may not be Unicode (I hope I'm wrong here)? YA
RE: locale files....
sorry. Intel platform running Redhat Linux 7.0.. Oops, and regarding your questions about locale files on Linux. They follow the POSIX format and can easily be modified once you get them in source form, along with the localedef utility. YA
Re: UTF8 vs. Unicode (UTF16) in code
Since the U in UTF stands for Unicode, UTF-32 cannot represent more than what Unicode encodes, which is 1+ million code points. Otherwise, you're talking about UCS-4. But I thought that one of the latest revs of ISO 10646 explicitly specified that UCS-4 will never encode more than what Unicode can encode, and thus definitely not these 4 billion characters you're alluding to. As far as I know the U in UTF stands for Universal - not Unicode. ISO 10646 can encode characters beyond UTF-16, and should retain this capability. There is a proposal to restrict UTF-8 to only encompass the same values as UTF-16, but UCS-4 still encodes the 31-bit code space. Page 12 of the Unicode Standard 3.0 says: "UTF-8 (Unicode Transformation Format-8) [...]" which is what I used to build my knowledge of what the U stands for. But I may be wrong. Thanks for clarifying my confusion between the proposal for restricting UTF-8, not UCS-4. So if the ISO never said that they will not encode things beyond what Unicode can encode, and if UTF-8 is restricted, they may someday need a UCSTF-8 (or whatever) to encode UCS-4, right? And the only difference between UTF-8 and this UCSTF-8 may be the semantics of what can be encoded and what is legal after decoding. YA
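To make the restriction concrete: in the original, unrestricted UTF-8 a 31-bit UCS-4 value can take up to 6 bytes, while anything reachable through UTF-16 (at most U+10FFFF) fits in 4. A sketch of the length computation:

```c
#include <stdint.h>

/* sketch: byte length of a code point in the original, unrestricted
   UTF-8 (up to 31 bits). Restricting UTF-8 to the UTF-16-accessible
   range (max U+10FFFF) caps this at 4. */
int utf8_len(uint32_t c) {
    if (c < 0x80)       return 1;
    if (c < 0x800)      return 2;
    if (c < 0x10000)    return 3;
    if (c < 0x200000)   return 4;
    if (c < 0x4000000)  return 5;
    return 6; /* up to 0x7FFFFFFF */
}
```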
RE: New Name Registry Using Unicode
The people doing this are www.xns.org and www.onename.com. One needs to visit their sites and read their "white papers" to get a full picture of what the purpose is and how they are using the standards. Note that there are other naming initiatives, including the one driven by my company, RealNames, which was presented at the 16th Unicode Conference. See http://www.internetkeywords.org/iuc/realnames-iuc16-paper.htm which contains both a paper and my presentation slides, linked from the Unicode Web site (http://www.unicode.org/iuc/iuc16/papers.html). Lastly, people interested in naming may want to check out CNRP, an IETF protocol for the resolution of common names, at http://www.ietf.org/html.charters/cnrp-charter.html. YA
RE: Unicode in VFAT file system
Recently I've had the dubious pleasure of delving into the details of the VFAT file system. For long file names, I thought it used UCS-2, but in looking at the data with a disk editor, it appears to be byte-swapping (little endian). I thought that UCS-2 was by definition big endian, thus I've got the following questions: 1. Could it be using UTF-16LE? I tried creating an entry with a surrogate pair, but the name was displayed with two black boxes on a Windows 2000-based computer, so I assumed that surrogates were not supported. It is UTF-16 (LE, because of the Intel architecture), and AFAIK there is no surrogate support yet. Not that there would be anything to display, except one box instead of two :) YA
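For reference, the surrogate-pair construction that tripped up the display above works like this. A sketch of encoding a supplementary code point (U+10000 and above) as UTF-16:

```c
#include <stdint.h>

/* sketch: encode a supplementary code point (>= U+10000) as a
   UTF-16 surrogate pair: subtract 0x10000, then split the
   resulting 20 bits across a high and a low surrogate */
void to_surrogates(uint32_t c, uint16_t *hi, uint16_t *lo) {
    c -= 0x10000;
    *hi = (uint16_t)(0xD800 + (c >> 10));   /* high surrogate */
    *lo = (uint16_t)(0xDC00 + (c & 0x3FF)); /* low surrogate */
}
```

A consumer without surrogate support sees the two code units as two unknown characters, hence two boxes instead of one.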