Quiz for Unicode guru
OK, just for fun: here is a quiz for the Unicode guru. It is not a hard quiz; everyone will get it right eventually. So use a stopwatch to measure how long it takes you to figure out the right answer. Note: you can find information about Unicode and UTF-8 at www.unicode.org. For the two pictures in the link below: 1. How many bytes do you need to encode the text in the picture in UTF-8? 2. What is the script name for the text in the picture? 3. Can you guess where (province, state, country, etc.) I took these two images? [Hint: somewhere very close to where you can find famous mice.] The two pictures and the quiz can be found at http://journals.aol.com/ytang0648/FrankTangsDiary/entries/753 Do NOT send your answer to the mailing list and spoil the fun once you find out the right one, OK?
problems in Public Review 33 UTF Conversion Code Update
Looking at http://www.unicode.org/review/ :

  33 UTF Conversion Code Update  2004.06.08
  The C language source code example for UTF conversions (ConvertUTF.c) has
  been updated to version 1.2 and is being released for public review and
  comment. This update includes fixes for several minor bugs. The code can
  be found at the above link.

and looking at the code under http://www.unicode.org/Public/BETA/CVTUTF-1-2/ :

In http://www.unicode.org/Public/BETA/CVTUTF-1-2/ConvertUTF.c

  /*
   * Index into the table below with the first byte of a UTF-8 sequence to
   * get the number of trailing bytes that are supposed to follow it.
   */
  static const char trailingBytesForUTF8[256] = {
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
  };

Although there is code that prevents 5- and 6-byte UTF-8 sequences, the array above misleads people into thinking that 5- and 6-byte UTF-8 sequences exist. Also, F5-F7 should not map to 3, and C0 and C1 should not map to 1. It should be changed to:

  static const char trailingBytesForUTF8[256] = {
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
      0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
      2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,0,0,0,0,0,0,0,0,0,0,0
  };

  /*
   * Once the bits are split out into bytes of UTF-8, this is a mask OR-ed
   * into the first byte, depending on how many bytes follow. There are
   * as many entries in this table as there are UTF-8 sequence types.
   * (I.e., one byte sequence, two byte... six byte sequence.)
   */
  static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

This comment is also misleading: "six byte sequence", and the "0xF8, 0xFC" entries.

  /* Figure out how many bytes the result will require */
  if (ch < (UTF32)0x80) {            bytesToWrite = 1;
  } else if (ch < (UTF32)0x800) {    bytesToWrite = 2;
  } else if (ch < (UTF32)0x10000) {  bytesToWrite = 3;
  } else if (ch < (UTF32)0x200000) { bytesToWrite = 4;

Shouldn't the last line be

  } else if (ch < (UTF32)0x110000) { bytesToWrite = 4;

? Where does the 0x200000 come from?

  switch (extraBytesToRead) {
      case 5: ch += *source++; ch <<= 6;
      case 4: ch += *source++; ch <<= 6;

This code also misleads people into thinking there are 5- and 6-byte UTF-8 sequences.

Also, the following routine:

  static Boolean isLegalUTF8(const UTF8 *source, int length) {
      UTF8 a;
      const UTF8 *srcptr = source+length;
      switch (length) {
      default: return false;
      /* Everything else falls through when "true"... */
      case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
      case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
      case 2: if ((a = (*--srcptr)) > 0xBF) return false;
          switch (*source) {
              /* no fall-through in this inner switch */
              case 0xE0: if (a < 0xA0) return false; break;
              case 0xF0: if (a < 0x90) return false; break;
              case 0xF4: if (a > 0x8F) return false; break;
              default:   if (a < 0x80) return false;
          }
      case 1: if (*source >= 0x80 && *source < 0xC2) return false;
              if (*source > 0xF4) return false;
      }
      return true;
  }

does NOT match Table 3.1B as defined in Unicode 3.2 (see http://www.unicode.org/reports/tr28/#3_1_conformance) or Table 3-6, Well-Formed UTF-8 Byte Sequences, on page 78 of Unicode 4.0. In particular, the function treats the following range as legal while it should NOT:

  U+D800..U+DFFF    ED    A0-BF    80-BF

Also, in http://www.unicode.org/Public/BETA/CVTUTF-1-2/harness.c the following comment is misleading:

  /*
   * test01 - Spot check a few legal & illegal UTF-8 values only.
   * This is not an exhaustive test, just a brief one that was used
   * to develop the "isLegalUTF8" routine.
   *
   * Legal UTF-8 sequences are:
   *     1st     2nd     3rd     4th     Codepoints
   *     00-7F                             0000-  007F
   *     C2-DF   80-BF                     0080-  07FF
   *     E0      A0-BF   80-BF             0800-  0FFF
   *     E1-EF   80-BF   80-BF             1000-  FFFF
   *     F0      90-BF   80-BF   80-BF    10000- 3FFFF
   *     F1-F3   80-BF   80-BF   80-BF    40000- FFFFF
   *     F4      80-8F   80-BF   80-BF   100000-10FFFF
   */

It should be: legal UTF-8 ...
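Since the thread is about getting this right, here is a minimal well-formedness checker written directly from Table 3-6 of Unicode 4.0. This is my own sketch for comparison, not the ConvertUTF.c code; note how it rejects lead bytes C0-C1 and F5-FF and the surrogate range ED A0-BF 80-BF:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of a UTF-8 well-formedness check following Table 3-6,
 * "Well-Formed UTF-8 Byte Sequences", of Unicode 4.0.
 * Illustrative only; not the ConvertUTF.c routine. */
static bool is_well_formed_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n;                            /* number of trailing bytes */
        unsigned char lo = 0x80, hi = 0xBF;  /* bounds for first trailer */

        if (b <= 0x7F) { i++; continue; }            /* U+0000..U+007F   */
        else if (b >= 0xC2 && b <= 0xDF) n = 1;      /* U+0080..U+07FF   */
        else if (b == 0xE0) { n = 2; lo = 0xA0; }    /* no overlongs     */
        else if (b >= 0xE1 && b <= 0xEC) n = 2;
        else if (b == 0xED) { n = 2; hi = 0x9F; }    /* no surrogates    */
        else if (b >= 0xEE && b <= 0xEF) n = 2;
        else if (b == 0xF0) { n = 3; lo = 0x90; }    /* no overlongs     */
        else if (b >= 0xF1 && b <= 0xF3) n = 3;
        else if (b == 0xF4) { n = 3; hi = 0x8F; }    /* <= U+10FFFF      */
        else return false;          /* 80..C1 and F5..FF are never legal */

        if (len - i < n + 1) return false;           /* truncated tail   */
        if (s[i + 1] < lo || s[i + 1] > hi) return false;
        for (size_t k = 2; k <= n; k++)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return false;
        i += n + 1;
    }
    return true;
}
```

For example, it accepts E4 B8 AD (U+4E2D) and F4 8F BF BF (U+10FFFF) but rejects ED A0 80 (the surrogate U+D800) and the overlong C0 AF, which is exactly where the isLegalUTF8 above differs.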
Yet another reason some software treat your UTF-8 xml as US-ASCII
For sure, no one on this mailing list wants to see their XML treated as US-ASCII when the data is really in UTF-8. Say I have an XML file that starts with

  <?xml version="1.0"?>

and send it over the HTTP protocol with the following Content-Type header (without charset=UTF-8):

  Content-Type: text/xml

Guess which charset the receiver should use for the XML: UTF-8? ISO-8859-1? Or US-ASCII? If you only read the XML 1.0 specification, I guess you will conclude it should be treated as "UTF-8". However, if you also read RFC 3023, then... the answer is "US-ASCII". See http://www.faqs.org/rfcs/rfc3023.html :

  3.1 Text/xml Registration
  [...]
  Conformant with [RFC2046], if a text/xml entity is received with the
  charset parameter omitted, MIME processors and XML processors MUST use
  the default charset value of "us-ascii" [ASCII]. In cases where the XML
  MIME entity is transmitted via HTTP, the default charset value is still
  "us-ascii".
  [...]

:( Notice that if the type is application/xml, the rule changes!!!

  3.2 Application/xml Registration
  [...]
  If an application/xml entity is received where the charset parameter is
  omitted, no information is being provided about the charset by the MIME
  Content-Type header. Conforming XML processors MUST follow the
  requirements in section 4.3.3 of [XML] that directly address this
  contingency. However, MIME processors that are not XML processors SHOULD
  NOT assume a default charset if the charset parameter is omitted from an
  application/xml entity.
  [...]

:( :( :(
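The defaulting rules reduce to a tiny decision table. A sketch (the function name and the returned tokens are mine, purely illustrative):

```c
#include <string.h>

/* Sketch of RFC 3023's charset defaulting for XML entities received
 * over MIME/HTTP. The function and the returned tokens are
 * illustrative only, not any real API. */
static const char *default_xml_charset(const char *media_type,
                                       int charset_param_present)
{
    if (charset_param_present)
        return "use-declared-charset";   /* an explicit charset always wins */
    if (strcmp(media_type, "text/xml") == 0)
        return "us-ascii";               /* RFC 3023 sec. 3.1, even on HTTP */
    if (strcmp(media_type, "application/xml") == 0)
        return "use-xml-encoding-decl";  /* defer to XML 1.0 sec. 4.3.3     */
    return "unknown";
}
```

So the same UTF-8 bytes served as text/xml without a charset parameter are, per the letter of the RFC, US-ASCII, while served as application/xml they honor the document's own encoding declaration.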
OT: Standardize TimeZone ID
Is there any standards effort trying to standardize time zone IDs? I am not talking about a time zone as a particular offset (which could be expressed as a GMT offset or addressed by ISO 8601), but rather about an ID referring to a particular time zone / daylight saving time rule. I know the de facto standard around is the one in ftp://elsie.nci.nih.gov/pub/tz. People probably also use the time zone values returned by Java a lot. I think a standard for time zone IDs (maybe just adopting the one at ftp://elsie.nci.nih.gov/pub/tz and clearly specifying it in an RFC) is important for the future common locale data repository as well as for web services i18n. I know this is a little bit off-topic for Unicode, just like the one about locales. Maybe I should move this to the W3C i18n mailing list.
unicode site problem
Anyone know who can fix http://www.unicode.org/reports/index.html ? All the links are broken.
Re: GB18030 and super font
Raymond Mercier wrote on 4/22/2004, 7:35 AM: I enquired about the 'super font' created by a Beijing foundry, http://font.founder.com.cn/english/web/index.htm, and am fairly astonished at the prices, as you see from the attached.

The cost of producing these fonts is much higher than that of producing a font which only has the glyphs from WGL4.
Unicode 4.0 and ISO10646-2003
I saw the announcement of the publication of "ISO/IEC 10646:2003, Information technology -- Universal Multiple-Octet Coded Character Set (UCS)" from http://anubis.dkuug.dk/jtc1/sc2/open/02n3729.htm I expect there are no differences from Unicode 4.0, am I right?
Re: GB18030 and super font
In case you want to test your GB18030 font, you can use Netscape 7 (or the latest Mozilla) and then visit my GB18030 test pages at http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=10 They should be page-for-page compatible with the paper copy of the GB18030-2000 standard. I also created "pseudo pages" after page 284 for the surrogate mappings; pages after 284 do not exist in the original GB18030 standard. Have fun with http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=597 :)

Raymond Mercier wrote on 4/22/2004, 1:04 PM: Eric, Amazin' Amazon!! Now why didn't I think of that? In fact Amazon.co.uk says it is discontinued, so I would have to get it from Amazon in the US. It is not the first time that the two Amazons fail to connect. Many thanks for the tip, Raymond

----- Original Message ----- From: Eric Muller To: [EMAIL PROTECTED] Sent: Thursday, April 22, 2004 5:40 PM Subject: Re: GB18030 and super font

Raymond Mercier wrote: But that link to proofing tools leads nowhere. Maybe it's not so easy to get the CHS version. http://www.amazon.com/exec/obidos/tg/detail/-/BBZ54P/qid=1082651762/sr=8-1/ref=pd_ka_1/103-8333725-5907026?v=glance&s=software&n=507846 Includes ~140 fonts, mostly for CJK, Arabic, Hebrew, but other scripts as well. Includes "Simsun (Founder Extended)", with 65,531 glyphs! Eric.
Re: Unicode 4.0 and ISO10646-2003
Kenneth Whistler wrote on 4/22/2004, 3:26 PM: Frank asked: I expect there are no differences from Unicode 4.0, am I right? / Correct. Please see Appendix C of Unicode 4.0, p. 1348 and p. 1350, which already explicitly makes this statement. --Ken

I don't see ISO 10646:2003 on the pages you mentioned. Is that equal to the so-called third version? There is no easy way to tell whether ISO 10646:2003 is equal to the so-called third version :) Although I guess that is the case.
Re: help finding radical/stroke index at unicode.org
are you talking about http://www.unicode.org/charts/unihangridindex.html and http://www.unicode.org/charts/unihanrsindex.html ? Gary P. Grosso wrote on 4/14/2004, 1:18 PM: Hi, I am looking for an up-to-date, online version of the sort of thing I see in the back of the printed Unicode 2.0 book. All I can find is a search engine thing, and that's real cool (I suppose) but I need a tableau, a complete picture, of the whole shebang. Can someone please help me find my way? Thanks, Gary --- Gary Grosso Arbortext, Inc. Ann Arbor, MI, USA
Re: Novice question
Be careful here: for Unicode support in the browser (at least Netscape/Mozilla) there is some code forking between 2000/XP and Win98/ME.

Philippe Verdy wrote on 3/23/2004, 5:39 AM: From: Edward H. Trager [EMAIL PROTECTED] Also, I would not bother testing Windows OSes prior to Windows 2000/XP.
Re: in the NEW YORK TIMES today, report of a USA patent for a met hod to make the Arabic language easier to read/write/typeset
Chris Jacobs wrote on 3/15/2004, 10:08 PM: ----- Original Message ----- From: Kenneth Whistler [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Tuesday, March 16, 2004 2:28 AM Subject: Re: in the NEW YORK TIMES today, report of a USA patent for a method to make the Arabic language easier to read/write/typeset / Mark Shoulson said: (Me, I think it's a cool idea, but I'm notorious for being fascinated by shiny new things.) / a gnieb rof suoiroton m'I tub ,aedi bmud a s'ti kniht I ,eM .tehctorc evitcaer neK -- / If you really typed that in backwards, someone should teach you about Unicode.

Not really; CSS2 is good enough. You don't need Unicode to do that. Try the following HTML (and CSS2) source in your browser:

  <div style="direction: rtl; unicode-bidi: bidi-override">Try the following HTML (and CSS2) source in your browser:</div>
Re: in the NEW YORK TIMES today, report of a USA patent for a met hod to make the Arabic language easier to read/write/typeset
Maybe I should file a US patent application to write Arabic from left to right to make it more "simplified" :) I guess that would have a higher adoption rate compared to this font design patent, since most software which does not support bidi already implements it. :)

Mark E. Shoulson wrote on 3/15/2004, 7:54 PM: And see http://www.arabetics.com/ for the official site. (Me, I think it's a cool idea, but I'm notorious for being fascinated by shiny new things.) ~mark
Re: in the NEW YORK TIMES today, report of a USA patent for a method to make the Arabic language easier to read/write/typeset
Wow. It seems not a very new idea. A similar idea was used in Chinese 40 years ago and created the differences between Simplified Chinese and Traditional Chinese.

Michael Everson wrote on 3/15/2004, 12:40 PM: In the NEW YORK TIMES today comes a report of a USA patent for a new version of written Arabic letters, designed to make them easier to read/write/typeset without making them too different from traditional Arabic script: http://www.nytimes.com/2004/03/15/technology/15patent.html
Re: multibyte char display
There are many different reasons you will see ? there. Read my paper http://people.netscape.com/ftang/paper/unicode25/a302.htm to see a list.

Manga wrote on 3/15/2004, 10:07 AM: I use UTF-8 encoding in Java code to store multi-byte characters in the db. When I retrieve the multi-byte characters from the db, I see ? in place of the actual multi-byte characters. I use Solaris OS. Is there any environment variable which I can set to see the actual characters in my terminal window? Thanks
RE: in the NEW YORK TIMES today, report of a USA patent for a met hod to make the Arabic language easier to read/write/typeset
Mike Ayers wrote on 3/15/2004, 2:50 PM: From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Frank Yung-Fong Tang, Sent: Monday, March 15, 2004 11:16 AM / It seems not a very new idea. A similar idea was used in Chinese 40 years ago and created the differences between Simplified Chinese and Traditional Chinese. / Really? That conflicts with my understanding, which is: when writing Chinese, there are certain stroke elements which, when written in the more flowing script of everyday usage (grass script et al.), closely resemble other stroke elements which use fewer strokes to write. These stroke-reduced elements are substituted for the original elements. Also, there are certain "paired" character elements, such that one may be substituted for the other, and the quicker-to-write stroke-reduced element gets substituted. I do not really understand these substitutions, but it is my understanding that they are intuitive to literate Chinese. These two "simplification" methods were formalized and standardized to become Simplified Chinese. Am I getting this wrong? I don't see the connection between organic change in a script and singular revolutionary change.

Oh... believe me, as a Chinese person educated in the Traditional Chinese world, Simplified Chinese looks like "revolutionary change" :) Don't get me wrong. I mention Chinese not to prove it "could be done". I only want to show that if it does happen, you will have one more alphabet to deal with (now Chinese in the USA need to know BOTH Traditional Chinese AND Simplified Chinese instead of JUST the "hard-to-learn" Traditional Chinese).

/|/|ike
Re: Version(s) of Unicode supported by various versions of Microsoft Windows
Not sure where to find that information on paper. But one way to check the degree of support is to call GetStringTypeEx against some characters defined in 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0, and see whether the returned results reflect what they should be.

Antoine Leca wrote on 3/5/2004, 8:35 AM: Hi folks, I discovered, much to my surprise (but after reflection it does make much sense, taking into account the dates when they were developed), that Windows 2000 only supports The Unicode Standard, version 2.0: URL:http://support.microsoft.com/default.aspx?scid=kb;EN-US;227483 The question: I was unable to find similar information referring to Windows NT versions 5.1 and 5.2. Certainly people here may direct me to the correct place to find it. Thanks in advance. (Please, do not tell me it supports 4.0 since you can view 4.0 provided you use the correct browser and the correct fonts; that is NOT what I want to know. I am interested, for example, in sorting strings with surrogates; seeing that in a typical WinXP distribution %SYSTEM32%/SORTKEYS.NLS is still 256k, like it was with NT 3.x, shows me that this one would not support Unicode 3.1, for instance.) A similar query has been directed to Dr. International: URL:http://www.microsoft.com/globaldev/drintl/askdrintl.aspx Antoine
Re: commandline converter for gb18030 - utf8 in *nix
You can also use 'nsconv', which comes with the Mozilla source code and handles GB18030. See http://www.mozilla.org/projects/l10n/mlp_tools.html for details.

Zhang Weiwu wrote on 3/5/2004, 6:43 AM: Hello. I believe this must be a frequent question, but I googled around and didn't find a satisfying tool. It seems most converters do GB2312 but not GB18030. I have 100+ files to convert; normal graphical/web-based converters won't do the work well. On my FreeBSD there is a ported tool, i18ntools (http://www.whizkidtech.redprince.net/i18n/), but it seems to lack the GB18030 codepage (and the GB_1988-80 page produced a messed-up file). Last month I reported W3C Amaya's lack of GB18030 support; they said on the mailing list that they cannot implement the charset unless they can get a code conversion page file. Is it so hard to get one? And what command-line charset converter do you often use? Many thanks.
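If your libc has iconv(3) with GB18030 support (glibc 2.2 and later does), you do not even need an external tool. A minimal conversion wrapper might look like this; the sketch is mine, and the encoding-name spellings "GB18030" and "UTF-8" are glibc's, which other iconv implementations may spell differently:

```c
#include <iconv.h>   /* POSIX iconv(3) */
#include <stddef.h>

/* Sketch: convert a GB18030 buffer to UTF-8. Returns the number of
 * UTF-8 bytes written, or (size_t)-1 on error. For real use you would
 * loop to handle E2BIG and keep the iconv_t descriptor around. */
static size_t gb18030_to_utf8(const char *in, size_t inlen,
                              char *out, size_t outlen)
{
    iconv_t cd = iconv_open("UTF-8", "GB18030");
    if (cd == (iconv_t)-1)
        return (size_t)-1;            /* conversion pair unsupported */

    char *ip = (char *)in;            /* iconv() wants non-const pointers */
    char *op = out;
    size_t il = inlen, ol = outlen;
    size_t rc = iconv(cd, &ip, &il, &op, &ol);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : (size_t)(op - out);
}
```

With such a wrapper (or just the command-line `iconv -f GB18030 -t UTF-8` that ships with glibc), batch-converting 100+ files is a short shell loop.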
Re: Font Technology Standards
BDF is also widely used, although its quality and features are not that powerful these days.

Also, there are other "standards" about fonts:

1. Glyph set "standards": how to make sure one font contains all the glyphs for a particular group of users. For example, WGL4 is a glyph set standard from MS for pan-European users.

2. Glyph naming standards: how to name a particular glyph. I remember Adobe has a "standard" glyph naming scheme for at least Cyrillic glyphs. This is needed to put the common glyph names into a PostScript or TTF font.

And I am sure the following DOES NOT exist, although I hope we can have one some day: a glyph encoding standard, mapping a glyph to a fixed glyph ID. (The Arabic Presentation Forms blocks A and B are sort of like this.) For example, it would be much easier for people to understand Indic fonts if there were an INFOS glyph mapping standard for all their Indic fonts.

[EMAIL PROTECTED] wrote on 3/3/2004, 3:52 AM: Not sure exactly what you are looking for, because "font technology" covers a broad spectrum, but a *simplified* picture might be something like the following. First, we should distinguish bitmap font technologies from scalable font technologies ... I assume you are more interested in the latter. For scalable fonts, there are a number of fundamentally different ways to describe the curves: PostScript outlines are based on cubic Bezier curves, TrueType outlines on quadratic curves, and I can't remember what Metafont uses. The next level is how you package the individual glyphs into a font: PostScript Type 1 fonts bundle up PostScript outlines; TrueType fonts bundle up TrueType outlines; OpenType fonts bundle up either TrueType or PostScript outlines (and bitmaps); and there are others. The next level is how you encode into the font the smarts for complex rendering. At least three technologies utilize extensions of the TrueType format: OpenType, from Microsoft and Adobe; GX and AAT, from Apple; and Graphite, from SIL. (Note that the TrueType file structure is inherently extensible, and OpenType, GX/AAT and Graphite fonts are TrueType fonts with extra tables. Because of this, people often interchange and blur the terms "TrueType" and "OpenType".) As is common in this world, at each level the various options each have pros and cons. Bob
Re: What's in a wchar_t string on unix?
Oh, this is the first time I have heard about this. Thanks for the information. Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is defined? Or does it only mean wchar_t holds characters from ISO 10646 (which means it could be 2 bytes, 4 bytes, or more than that)?

Noah Levitt wrote on 3/2/2004, 1:33 PM: As specified in C99 (and maybe earlier), if the macro __STDC_ISO_10646__ is defined, then wchar_t values are UCS-4. Otherwise, wchar_t is an opaque type and you can't be sure what it is. Noah
Re: What's in a wchar_t string on unix?
Clark Cox wrote on 3/3/2004, 1:28 PM: From the C standard:

  __STDC_ISO_10646__  An integer constant of the form yyyymmL (for
  example, 199712L), intended to indicate that values of type wchar_t are
  the coded representations of the characters defined by ISO/IEC 10646,
  along with all amendments and technical corrigenda, as of the specified
  year and month.

This, to me, suggests that wchar_t would indeed be a 32-bit type (well, at least a 20-bit type) when this macro is defined. However, to be sure, I'd suggest posting to news:comp.std.c

The language in the standard does not prevent someone from making it 16 bits or 64 bits when that macro is defined, right? And what do the year and month mean?

On Mar 03, 2004, at 12:38, Frank Yung-Fong Tang wrote: Oh, this is the first time I have heard about this. Thanks for the information. Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__ is defined? Or does it only mean wchar_t holds characters from ISO 10646 (which means it could be 2 bytes, 4 bytes, or more than that)?

Noah Levitt wrote on 3/2/2004, 1:33 PM: As specified in C99 (and maybe earlier), if the macro __STDC_ISO_10646__ is defined, then wchar_t values are UCS-4. Otherwise, wchar_t is an opaque type and you can't be sure what it is. Noah

-- Clark S. Cox III [EMAIL PROTECTED] http://homepage.mac.com/clarkcox3/ http://homepage.mac.com/clarkcox3/blog/B1196589870/index.html
Re: What's in a wchar_t string on unix?
Clark Cox wrote on 3/3/2004, 4:33 PM: [I swap the reply order to make my new question clearer]

  And what do the year and month mean?

  It indicates which version of ISO 10646 is used by the implementation.
  In the above example, it indicates whatever version was in effect in
  December of 1997.

"It indicates which version of ISO 10646 is used by the implementation": hmm... what text in the standard makes you believe that is the case? (I am not against it; I just have not seen any standard text that clearly shows it yet.)

  The language in the standard does not prevent someone from making it 16
  bits or 64 bits when that macro is defined, right?

  Not explicitly, but as I read it, when that macro is defined, wchar_t
  would have to be at least 20 bits, or else it couldn't be true that
  "values of type wchar_t are the coded representations of the characters
  defined by ISO/IEC 10646". That is, I would think that wchar_t would
  have to be able to represent values in the range [0, 0x10FFFF]. But my
  interpretation could be off, which is why I recommended asking on
  comp.std.c.

Hmm... if it is defined as 199712, then what does that mean? Unicode 2.0 (first printed July 1996)? Unicode 2.1 (1998, per UTR #8)? Unicode 3.0 (2000)? None of these define any coded representations of characters >= U+10000, right? Therefore, there is no reason an implementation which defines it as 199712 has to make wchar_t larger than 16 bits, right?
Re: What's in a wchar_t string on unix?
Rick Cameron wrote on 3/1/2004, 2:13 PM: Hi, all. This may be an FAQ, but I couldn't find the answer on unicode.org.

The reason is that there is NO answer to the question you ask.

  It seems that most flavours of unix define wchar_t to be 4 bytes.

It depends on which UNIX and which version, and on how you define "most flavours".

  If the locale is set to be Unicode, what's in a wchar_t string?

No answer for that, because 1) the ANSI C standard does not define it (neither its size nor its content), and 2) several organizations have tried to establish standards for Unix; one of them is The Open Group's Base Specifications, IEEE Std 1003.1, 2003, but neither does that define what wchar_t should hold.

  Is it UTF-32, or UTF-16 with the code units zero-extended to 4 bytes?
  Cheers - rick cameron

The more interesting question is why you need to know the answer to your question. The ANSI C wchar_t model basically suggests that if you have to ask, you are moving in the wrong direction.
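One can at least probe what a given implementation promises. A small sketch of mine: it reports whether the __STDC_ISO_10646__ promise, if made at all, is wide enough for every code point through U+10FFFF. On glibc, which defines the macro and uses a 32-bit wchar_t, it returns 1; a system with a 16-bit wchar_t would typically not define the macro at all and would get -1.

```c
#include <wchar.h>   /* WCHAR_MAX (C99) */

/* Sketch: 1  = __STDC_ISO_10646__ is defined and wchar_t can hold every
 *              code point through U+10FFFF;
 *         0  = the macro is defined but wchar_t is narrower than that
 *              (conceivable for an old yyyymm value, as the thread
 *              discusses);
 *        -1  = the macro is not defined, so wchar_t is opaque and you
 *              cannot portably assume anything about its contents. */
static int wchar_covers_all_of_10646(void)
{
#ifdef __STDC_ISO_10646__
    return WCHAR_MAX >= 0x10FFFF ? 1 : 0;
#else
    return -1;
#endif
}
```

This is a compile-time property of the implementation, not of the locale, which is exactly why "what's in a wchar_t string" has no single answer across Unix flavours.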
Re: unicode format
John Cowan wrote: steve scripsit: Could someone please clarify the difference between UTF-8 and UTF-16? If it is possible to encode everything in UTF-8, and it is more efficient, what is the need for UTF-16? / It is more efficient to PROCESS in UTF-16.
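To make the trade-off concrete, here is a small sketch of mine comparing storage cost per code point in the two forms. (Processing cost is a separate matter: UTF-16 wins there mainly because every BMP character is a single code unit.)

```c
/* Bytes needed to store one code point in each encoding form.
 * Illustrative helpers, not from any library. */
static int utf8_bytes(unsigned long cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

static int utf16_bytes(unsigned long cp)
{
    return cp < 0x10000 ? 2 : 4;   /* one or two 16-bit code units */
}
```

For ASCII text UTF-8 is smaller (1 byte vs. 2 per character), but for CJK text it is larger (3 bytes vs. 2), so "UTF-8 is more efficient" is only true for some scripts.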
RE: Mother Language Day
joe wrote: (Hmm, in Russian "mother language" (maternij jazik) means something *very* different. Watch your language! ;-) Joe

He wrote this in English, not Russian, right? How can I watch Chinese (my language)?
Re: Codes for Individual Chinese Brushstrokes
As a native Chinese person, I believe:

1. The so-called eight basic strokes are very standard in concept. But that is only 8.

2. They list 8 different variants for each of the 8 basic strokes. But if you read that page carefully, it does not mean that there are only 8 variants for each stroke, nor that people can distinguish those variants from each other. For example, most Chinese will think the first Dot from the left is the same as the fourth Dot from the left, and the differences between them are really a matter of style. Therefore, it is not a good idea to encode those variants.

3. There are more composite strokes if you really want to encode strokes. For example:
http://people.netscape.com/ftang/chineselearning/strokes/refglyph_003.gif
http://people.netscape.com/ftang/chineselearning/strokes/refglyph_004.gif

Andrew C. West wrote: On Thu, 19 Feb 2004 18:27:09 -0800 (PST), Kenneth Whistler wrote:

  Of the 64 entities listed on the page
  http://www.chinavoc.com/arts/calligraphy/eightstroke.asp *none* of them
  are encoded, and *none* of them are standard enough to merit
  consideration -- if by consideration you mean separate encoding as
  characters.

I'm not sure that *none* of them are encoded. As far as I can tell, pretty much most of the basic ideographic stroke forms are either already encoded in CJK and CJK-B or are proposed in CJK-C (where "encoded" here means encoded in their own right or representable by same-shaped ideographs). See for example the IRG document http://www.cse.cuhk.edu.hk/~irg/irg/irg19/N927_Add%202%20Strokes%20to%20C1.doc which states:

  "Although most ideographic strokes have been encoded in CJK (including
  Ext. A and B) or submitted to CJK_C1 by IRG members, there are two
  ideographic strokes found to be missing. Ideographic strokes are
  important for ideograph decomposition and analysis, and for making
  ideographic stroke subsets. Chinese linguists suggest adding these two
  ideographic strokes to CJK_C1."

I also remember reading one WG2 document that explicitly raised the question of how to deal with all the ideographic strokes proposed in CJK-C that are not distinct ideographs in their own right, although I can't seem to locate that document any more.

All except one of the eight basic strokes mentioned at http://www.chinavoc.com/arts/calligraphy/eightstroke.asp are *representable* using existing characters in the CJK and/or Kangxi Radicals blocks:

  dot = U+4E36 or U+2F02 [KANGXI RADICAL DOT]
  dash = U+4E00 or U+2F00 [KANGXI RADICAL ONE]
  perpendicular downstroke = U+4E28 or U+2F01 [KANGXI RADICAL LINE]
  downstroke to the left (left-falling stroke) = U+4E3F or U+2F03 [KANGXI RADICAL SLASH]
  wavelike stroke (right-falling stroke) = U+4E40
  hook = U+4E85 or U+2F05 [KANGXI RADICAL HOOK], as well as U+4E5A and U+2010C
  upstroke to the right = bend or twist = U+4E5B and U+200CC

I concur with Ken that the 8x8 stroke categorization given at this web site is largely artificial. Whilst it may be useful to encode general ideographic stroke forms to help in the analysis and decomposition of ideographs, in my opinion the minute distinctions in the way that dots and dashes are written in various individual ideographs are beyond the scope of a character encoding system, as the exact shape of a dot or the length of a dash is irrelevant to any analysis of the compositional structure of an ideograph. Andrew
Re: UTF-8 to UTF-16 conversion
Yes: TEC. Look at developer.apple.com for the Text Encoding Converter.

Paramdeep Ahuja wrote: Hi, can anyone tell me if there is any API available on the Mac to convert from UTF-8 to UTF-16? thnx -P
Re: Detecting encoding in Plain text
Consider CR and LF too.

Mark Davis wrote on 1/14/2004, 9:25 AM: I'm not sure which one suggested heuristic method you are referring to, but you are jumping to conclusions. For example, one of the heuristics is to judge which characters are more common when bytes are interpreted as if they were in different encoding schemes. When picking between UTF-16BE and LE, U+0020 is *still* much more common than U+2000, even in Thai. Mark __ http://www.macchiato.com ----- Original Message ----- From: Peter Kirk [EMAIL PROTECTED] To: John Burger [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wed, 2004 Jan 14 08:12 Subject: Re: Detecting encoding in Plain text / On 14/01/2004 07:16, John Burger wrote: ... By the way, I still don't quite understand what's special about Thai. Could someone elaborate? / I mentioned Thai because it is the only language I know of which does not use SPACE, U+0020. It also has at least some of its own punctuation. So a Thai text need not include any characters U+00xx, which rules out one suggested heuristic method. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Detecting encoding in Plain text
Does Thai use CR and LF?

Peter Kirk wrote on 1/14/2004, 8:12 AM: On 14/01/2004 07:16, John Burger wrote: ... By the way, I still don't quite understand what's special about Thai. Could someone elaborate? / I mentioned Thai because it is the only language I know of which does not use SPACE, U+0020. It also has at least some of its own punctuation. So a Thai text need not include any characters U+00xx, which rules out one suggested heuristic method. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: Detecting encoding in Plain text
John Burger wrote on 1/14/2004, 7:16 AM: Mark E. Shoulson wrote: If it's a heuristic we're after, then why split hairs and try to make all the rules ourselves? Get a big ol' mess of training data in as many languages as you can and hand it over to a class full of CS graduate students studying Machine Learning. Absolutely my reaction. All of these suggested heuristics are great, but would almost certainly simply fall out of a more rigorous approach using a generative probabilistic model, or some other classification technique. Useful features would include n-graphs frequencies, as Mark suggests, as well as lots of other things. For particular applications, you could use a cache model, e.g., using statistics from other documents from the same web site, or other messages from the same email address, or even generalizing across country-of-origin. Additionally, I'm pretty sure that you could get some mileage out of unsupervised data, that is, all of the documents in the training set needn't be labeled with language/encoding. And one thing we have a lot of on the web is unsupervised data. I would be extremely surprised if such an approach couldn't achieve 99% accuracy - and I really do mean 99%, or better. By the way, I still don't quite understand what's special about Thai. Could someone elaborate? For language other than Thai, Chinese and Japanese, you usually will see space between words. Therefore, you should see a high count of SPACE in your document. The SPACE for text in language other than Thai, Chinese and Japanese should occupy probably 10%-15% of the code point (just a guess, if the average lenght of word is 9 characters, you will get 10% SPACE, if it shorter, if the average is shoter, than the percentage of SPACE increase). But for Thai, Chinese and Japanese, space is not put in between words, and therefore the percentage of SPACE code point will be quite different. For Korean, it is hard to say, depend they are using IDEOGRAPH SPACE or SINGLE BYTE SPACE. 
Also, for Korean, it will depend on which normalization form they are using. The % of SPACE will be different too, because in one normalization form you count each Korean character as one Unicode code point, but in the decomposed form it may count as 3. Shanjian Lee and Kat Momoi implemented a charset detector based on my early work and direction. They summarized it in a paper presented on Sept 11, 2001; see http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html for details. It talks about a different set of issues and problems. - John Burger MITRE
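The space-frequency heuristic described above can be sketched in a few lines of Python (a hypothetical illustration only; the Mozilla detector linked above uses a different, more elaborate approach, and the threshold here is invented):

```python
def space_ratio(text):
    """Fraction of code points that are ASCII SPACE (U+0020)."""
    return text.count(" ") / len(text) if text else 0.0

def looks_unsegmented(text, threshold=0.05):
    """Heuristic from the discussion above: space-delimited languages
    show roughly 10%-15% SPACE (about one space per ~9-character word),
    while Thai/Chinese/Japanese text, which puts no spaces between
    words, falls far below that."""
    return space_ratio(text) < threshold
```

A real detector would combine this with n-graph frequencies and other features rather than rely on a single threshold.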
Re: Programmatic description of ideographic characters
Looks like an old idea that people in Taiwan gave up a long time ago, because the quality of the synthesized glyphs will never be good enough. Tom Emerson wrote on 1/2/2004, 6:06 PM: The following paper, Chinese Character Synthesis using METAPOST, was recently mentioned in a thread on the teTeX mailing list. It's an interesting read. http://www.tug.org/tug2003/preprints/Yiu/yiu.pdf
Re: MS Windows and Unicode 4.0 ?
Come on, take my joke. But that is a perfect example of a language-specific variant glyph, right? Michael Everson wrote: At 17:13 -0800 2003-12-02, Frank Yung-Fong Tang wrote: come on, use language-specific glyph substitution in the last resort font to show an Irish last resort glyph if the language is Irish. I know OpenType has it. Does AAT support language-specific features? You are welcome to lobby Apple to commission such an enormous font. You have, I think, no idea how much work that would be. -- Michael Everson * * Everson Typography * * http://www.evertype.com -- -- Frank Yung-Fong Tang tm rhtt, Itrntinl Dvlpmet, AOL Intrtv Srvies AIM:yungfongta mailto:[EMAIL PROTECTED] Tel:650-937-2913 Yahoo! Msg: frankyungfongtan
Re: MS Windows and Unicode 4.0 ?
Peter Kirk wrote: On 02/12/2003 16:25, Frank Yung-Fong Tang wrote: ... a barrier to proper internationalisation? My opinion is the reverse: I think it is a strategy for proper internationalization. Remember, people can always choose to stay with ISO-8859-1 only, or go to UTF-8 with MES-1 support for the European market. UTF-8 with MES-1 support does not mean other characters won't work in their product; instead, it means other characters are not quality-assured in their product. Well, Frank, I am surprised that you favour encouraging developers to design their systems with only the European market in mind. Surely it would help with internationalisation for Thailand if the system is designed with support for Thai and other scripts in mind, even if not fully implemented and quality assured in the first release. No, that is not what I said. See, you are still thinking about developers and design and systems. I am talking about QA, product, service, and marketing PLUS the development. I am encouraging QA to test MES-1 with UTF-8 instead of only ISO-8859-1. I am encouraging products to ship with MES-1 support out of the box instead of ISO-8859-1. And if QA wrote their test plan using UTF-8 and MES-1, and the product claims to support MES-1, how far could that be from "even if not fully implemented and quality assured in the first release"? You are talking about a developer-driven mindset; I am talking about a product-driven, marketing-driven, quality-driven mindset. ... You only look at the issue from the developer's point of view. But how about QA? How are you going to QA the whole of Unicode? You also need to look at the issue from an end-user point of view, or the working-out-of-the-box point of view. How could the end user know what kind of functionality they are going to get WITHOUT extra effort? True, I hadn't looked at the QA issue.
I suppose there are two ways to go here: one would be to aim at support for the whole of Unicode but only assure support for certain ranges; in my book, a feature without QA is not a supported feature at all. See, you still have this developer-oriented mindset. No product should claim to support something without QA. the other is for the QA people to work with third party fonts. QA-ing the whole of Unicode shouldn't be a big problem anyway as most work needs to be done on new features rather than new characters e.g. For a QA engineer to test a software product with a particular script, they have to have at least some minimum knowledge of that script. And I won't say that is easy. For example, ask yourself: how many scripts would you feel comfortable quality-assuring by yourself? (Not just testing, but ASSURING.) if one script using special feature X is assured to work, a rather quick test should be sufficient to show that every script using feature X works. Hmm, that sounds below the standard normal QA engineers target. A good test plan needs to: make sure right input causes right output; make sure wrong input causes an error and not seemingly right output; make sure all the possible code paths get executed; and more. If you are a QA engineer working on a product that must work out of the box, how are you going to prepare your test cases? If you are a product marketing person writing a product specification for a cell phone which does not allow users to download fonts, how are you going to spec it out? Well, I was thinking of computers rather than brain dead mobile phones. Mobile phones have long allowed downloading of ring tones, so why not downloading of fonts?
And there is probably already a significant demand for mobile phones using every script which is in current everyday use, and so mobile phone manufacturers who restrict users to more restrictive subsets are being shortsighted - although I would expect that full BMP support would be adequate for a basic product in this scenario. Name me a cell phone which can download and accept CJK Han Extension B (Unicode Plane 2) today. If you are building a theory, you can support any Unicode code point. If you are building a technology, you may support any Unicode code point. If you are building a product, you won't be able to support every Unicode code point, with limited time and cost, at good enough quality. In that case, I would rather cut features (how many scripts of Unicode) in exchange for quality. You are assuming a product which does not need to work out of the box. If that is the case, you could ALSO claim Windows 2000 works with surrogates, since you can install or tweak the registry to make it work with surrogates. You could ALSO claim Windows 95 supports complex scripts, since you can INSTALL Uniscribe on it, right? Right. My Windows 2000 supports surrogates, probably because either one of the service packs or Office XP installed this support for me. When I was using Windows 95 I
Re: MS Windows and Unicode 4.0 ?
As long as a product supports UTF-8 and passes the test with MES-1, I can be pretty sure that no code in between strips off any non-ISO-8859-1 characters, regardless of whether it supports MES-2 or MES-3. Of course, that does not guarantee surrogate characters won't get damaged, but just as someone believes, it will be 1% of the effort for me to fix it later, right? :) Michael Everson wrote: At 15:38 -0800 2003-12-03, Frank Yung-Fong Tang wrote: I am encouraging QA to test MES-1 with UTF-8 instead of only ISO-8859-1. I am encouraging products to ship with MES-1 support out of the box instead of ISO-8859-1. And if QA wrote their test plan using UTF-8 and MES-1, and the product claims to support MES-1, how far could that be from "even if not fully implemented and quality assured in the first release"? MES-1 is hopelessly archaic. It's ISO 6937. MES-2 would be the only minimum I could recommend for Europe. And it's not good enough either, which is why MES-3 is block based. -- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: MS Windows and Unicode 4.0 ?
Michael Everson wrote: It's better than not knowing what range the thing is in. It helps the user know he has received, say, Telugu data or whatever. Only if the user knows what Telugu may look like. How many users, other than those signed up to the Unicode mailing list, know the shape of more than 10 scripts? I think the value is that it shows people something that is not the ASCII question mark '?' itself.
RE: MS Windows and Unicode 4.0 ?
A better approach than asking "Does product X support Unicode 4.0?" - to which you can somehow always get a NO answer - is to: 1. Define a smaller set of functionality (such as MES-1, MES-2, MES-3A). 2. Ask "Does product X support MES-1? Does product X support MES-2?" I think that kind of question is more meaningful. Unicode is "big, powerful but complex", compared to US-ASCII or ISO-8859-1, which are "small, weak but simple". While the answer to "Does product X support Y?" is meaningful when Y is a "small but simple" thing, the answer has less meaning when Y is a big and complex beast like Unicode. Surrogates by themselves could be a small enough subset for the question, assuming you don't consider the Plane 14 behavior part of it. Please do not push too hard on commercial companies to implement Unicode. Because there are TWO approaches, not ONE, that some commercial companies have historically taken to implement the Unicode Standard: 1. Implement the next version of the software according to today's Unicode Standard. 2. Change the next version of the Unicode Standard according to today's implementation. The famous Korean mess and the introduction of 15.10 Tag Characters should teach all of us a lesson: if you push a company too hard to implement Unicode, it may push them to take the 2nd approach... Arcane Jill wrote: Damn right. I would like to know this too. In particular, I want all the math characters working, and all the musical symbols working. Note that many of these are not in the BMP. I want to be able to put these characters on web pages, and know that they will be displayed correctly on my own choice of browser (which is not MSIE). And since I already have a Windows OS (XP Pro), I don't see why I should have to buy another one just to get these extras. I'm hoping it would suffice to make just the FONTS available to the world.
Jill -Original Message- From: Patrick Andries [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 02, 2003 3:54 AM To: Michael (michka) Kaplan; Unicode List Subject: Re: MS Windows and Unicode 4.0 ? I'm interested in knowing whether the following features will soon be found in Windows: fonts for the scripts covered by Unicode 4.0, and a corresponding rendering engine to display all Unicode 4.0 scripts.
Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)
Mark Davis wrote:
UTF-16: 6,634,430 bytes
UTF-8: 7,637,601 bytes
SCSU: 6,414,319 bytes
BOCU-1: 5,897,258 bytes
Legacy encoding (*): 5,477,432 bytes
(*) KS C 5601, KS X 1001, or EUC-KR
What are the sizes after gzipping these? Just wondering: gzip of UTF-16, gzip of UTF-8, gzip of SCSU, gzip of BOCU-1, gzip of the legacy encoding.
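Frank's gzip question can be tried directly for the encodings Python ships (SCSU and BOCU-1 would need third-party converters; the sample text here is invented for illustration):

```python
import gzip

def gzipped_size(text, encoding):
    """Bytes occupied by the gzip-compressed form of text in encoding."""
    return len(gzip.compress(text.encode(encoding)))

# Sketch: compare raw vs. gzipped sizes for a sample string.
sample = "Unicode compression comparison " * 100
sizes = {enc: (len(sample.encode(enc)), gzipped_size(sample, enc))
         for enc in ("utf-8", "utf-16")}
```

The interesting point of the exercise is whether general-purpose compression narrows the gap between encodings, which is why the gzipped numbers would be worth seeing alongside Mark's raw counts.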
Re: How can I have OTF for MacOS
John Jenkins wrote: On Dec 1, 2003, at 4:24 PM, Frank Yung-Fong Tang wrote: John, what 'cmap' format does Apple use in the Mac OS X Devanagari and Bangla fonts? The formats are irrelevant; the Mac supports all the 'cmap' subtable formats for all subtables. For rendering complex scripts, however, the font can only be rendered through ATSUI (or Cocoa), because the old way to support complex scripts via an 'itl5' resource in the suitcase with the 'FOND' and 'sfnt' resources is not supported on X. It may or may not be relevant. The reason I ask this question is that an earlier version of the Mozilla code (I forget when we changed it) used to work around a freezing issue in ATSUI on earlier versions of Mac OS X. What happened is that in old versions of Mac OS X, if a page of Unicode characters was not supported by any installed font, the performance in Mozilla was extremely unacceptable (freezing for 3 minutes while opening and closing font files), because either the old Mac OS X did not cache the information about which characters have no glyph on the system at all, or such caching was per layout instead of global. To work around that, Mozilla read the cmap table (only some formats) to decide which characters could be rendered by ATSUI before passing text to ATSUI. I know the performance is much better now (well... if you are still not sure, try visiting http://people.netscape.com/ftang/testscript/gb18030/gbtext.cgi?page=1220 - a page of GB18030 characters which encode Unicode Plane 4 (which has no characters assigned yet) - from a browser which supports GB18030, or http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=1220 for the HTML which includes NCR, GB18030 and GIF (which does not exist)). I think we removed that checking code last year, after the newer versions of Mac OS X improved from this almost-freeze-with-wrong-data situation.
Apple really, really wants everybody to move to using Unicode in their applications for all their text, and Apple really, really, *really* wants people to do it for complex scripts. And my question has no conflict with Apple's recommendation at all. Think about this: while Microsoft supports the Unicode cmap and really encourages people to use Unicode, they ALSO publish WGL4 and the OpenType font specs for different scripts. They also say which format a font SHOULD support in the TTF cmap. That does not conflict with the goal of using Unicode at all. There are times people have to know those details. And it is better that those details are captured in an Apple public tech note, rather than people digging into the code by reverse engineering and guessing. For example, if I want to customize my last-resort behavior in Mac OS X with ATSUI in my application (say, by drawing a Frank Tang picture with a Unicode decimal value below it - a rendering nobody wants to implement; everyone loves Unicode hex, right? Just an extreme case that John surely won't be crazy enough to add to ATSUI), how can I do it now?
RE: MS Windows and Unicode 4.0 ?
Michael Everson wrote: At 14:23 -0800 2003-12-02, Frank Yung-Fong Tang wrote: It's better than not knowing what range the thing is in. It helps the user know he has received, say, Telugu data or whatever. Only if the user knows what Telugu may look like. How many users, other than those signed up to the Unicode mailing list, know the shape of more than 10 scripts? Actually, if you look at the Last Resort glyphs (at a large enough size) you will see that the block name and range numbers are part of the image. See http://developer.apple.com/fonts/LastResortFont/ OK, you are right; I should have said "only if you have good vision" instead.
Re: UTF-16 inside UTF-8
Doug Ewell wrote: Frank Yung-Fong Tang ytang0648 at aol dot com wrote: Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be fixed. Plain and simple. If a system like Tcl only supports the BMP, that is its choice, but it *must not* accept non-shortest UTF-8 forms or output CESU-8 disguised as UTF-8. I agree with you. I just want to make the point that the implementation is not 1% of the work. If you still think adding 4-byte UTF-8 support is 1% of the task, then please join the Tcl project and help me fix that. I appreciate your efforts there and I believe a lot of people will thank you for your contribution. I'll be happy to supply UTF-8 code that handles 4-byte sequences. That is not the same thing as converting an entire system from 16-bit to 32-bit integers, or adding proper UTF-16 surrogate support to a UCS-2-only system. Of course that is more work. Your view is based on the assumption that the internal code is UCS-4 instead of UTF-16. Remember, AGAIN, that this thread was originally about taking an application like MySQL that did not support Unicode at all, and adding Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.** That is what I can't imagine -- making BMP-only assumptions *today*, in 2003, knowing that you'll have to go back and fix them some day. That is certainly more work than adding support for the full Unicode range at once. I think you thought I said the opposite, that such retrofitting is easy, and are now trying hard to disprove it. Is there anything wrong if people choose to use UTF-16 instead of UCS-4 in the API, even in 2003? Do you agree? If people do use UTF-16 in the API, it is natural for those who care about the BMP but not about Planes 1-16 to make only the BMP work, right? I am not saying they do the right thing. I am saying they do the natural thing. Remember, the text describing surrogates in the Unicode 4.0 standard is probably only 5-10 pages total in that 1462-page standard.
For developers who aren't going to implement the other 1000 pages right, it is natural to think: why do I need to get these 10 pages right? UCS-4 doubles your memory cost and size compared to UTF-16, and is 4x the size of UTF-8 for ASCII data. Changing the implementation of an ASCII-compatible application to UTF-16 is already hard, since people who only care about ASCII will be upset that the data size doubles for all their data. It is already a hard battle most of the time for someone like me. If we tell them to change to UCS-4, that means they need not 2x but 4x the memory. I can't fight this battle with people who would rather stay with ASCII and 7/8 bits per character; they are not living in a Unicode world. But how about the UTF-16 vs UCS-4 battle? A 1024x768 screen resolution takes about 150% more display memory than 640x480, too. For web services or applications which spend multiple millions on memory and databases, it means adding millions of dollars to their cost. They may already have to add millions of cost to support international customers by using UTF-16; they probably are not willing to add multiple millions of dollars more to change to UCS-4. In fact, there are people who proposed storing UTF-8 in a hacky way in the database, instead of using UTF-16 or UCS-4, to save cost. They have to add restrictions on using the API and build upper-level APIs to do conversion and hacky operations. That means it introduces some fixed development cost to the project (independent of the size of the data), but it saves millions of dollars of memory cost (which depends on the size of the data). I don't like that approach, but usually my word, and what is right, are less important than multiple millions of dollars to a commercial company. I would truly be surprised if full 17-plane Unicode support in a single app could be demonstrated to be a matter of multiple millions of dollars. It is not the full 17-plane Unicode support which contributes to it.
It is:
(number of ASCII-only records x sizeof(record in UCS-4)) - (number of ASCII-only records x sizeof(record in ASCII))
which contributes to that, compared to:
(number of ASCII-only records x sizeof(record in UTF-8)) - (number of ASCII-only records x sizeof(record in ASCII))
or:
(number of ASCII-only records x sizeof(record in UTF-16)) - (number of ASCII-only records x sizeof(record in ASCII))
The other comparison is:
(number of BMP-only records x sizeof(record in UCS-4)) - (number of BMP-only records x sizeof(record in UTF-8))
(number of BMP-only records x sizeof(record in UCS-4)) - (number of BMP-only records x sizeof(record in UTF-16))
Of course, sizeof() here really means the average size of a record holding that data. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
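The storage formulas above are simple arithmetic; here is a small Python sketch with hypothetical record counts (the workload numbers are invented for illustration):

```python
def extra_bytes(num_records, avg_chars, new_bytes_per_char, old_bytes_per_char=1):
    """N x sizeof(record in new encoding) - N x sizeof(record in old),
    assuming a fixed average record length and ASCII-only content."""
    return num_records * avg_chars * (new_bytes_per_char - old_bytes_per_char)

# Hypothetical workload: 100 million ASCII-only records of 200 characters each.
N, AVG = 100_000_000, 200
utf8_cost  = extra_bytes(N, AVG, 1)  # ASCII bytes are unchanged in UTF-8
utf16_cost = extra_bytes(N, AVG, 2)  # 2x per character for ASCII-only data
ucs4_cost  = extra_bytes(N, AVG, 4)  # 4x per character for ASCII-only data
```

For ASCII-only data, moving to UCS-4 adds three times the extra bytes that moving to UTF-16 does, which is the cost gap the message is pointing at.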
Re: MS Windows and Unicode 4.0 ?
Peter Kirk wrote: On 02/12/2003 14:19, Frank Yung-Fong Tang wrote: A better approach than asking "Does product X support Unicode 4.0?" - to which you can somehow always get a NO answer - is to: 1. Define a smaller set of functionality (such as MES-1, MES-2, MES-3A). 2. Ask "Does product X support MES-1? Does product X support MES-2?"... I disagree - if we are talking about a system rather than a font. Supporting subsets is a dead end, and a barrier to proper internationalisation. A barrier to proper internationalisation? My opinion is the reverse: I think it is a strategy for proper internationalization. Remember, people can always choose to stay with ISO-8859-1 only, or go to UTF-8 with MES-1 support for the European market. UTF-8 with MES-1 support does not mean other characters won't work in their product; instead, it means other characters are not quality-assured in their product. This is not a new approach. For example, when MS added Unicode support, they ALSO defined WGL4. That basically tells people that all the characters in WGL4 will render on all Windows systems after Win98 (not sure about 95). It does not mean other characters will not render on Win98 or later. It only means those characters render out of the box. It would be much better for developers to realise that from the start they need to build in support for the whole Unicode character set. Once Arabic, one Indic script and Plane 1 are supported, the rest is relatively easy; all the data required are in the UCD, and the shaping details can be left to the font. The alternative of bolting on ad hoc support for extra scripts later, when they become necessary, just causes extra work. You only look at the issue from the developer's point of view. But how about QA? How are you going to QA the whole of Unicode? You also need to look at the issue from an end-user point of view, or the working-out-of-the-box point of view.
How could the end user know what kind of functionality they are going to get WITHOUT extra effort? If you are a QA engineer working on a product that must work out of the box, how are you going to prepare your test cases? If you are a product marketing person writing a product specification for a cell phone which does not allow users to download fonts, how are you going to spec it out? A product can thus claim to support Unicode 4.0 rather easily, if it makes the caveat that its font and perhaps keyboard support is limited to certain scripts. Users interested in more unusual scripts can then supply their own specialised font, or a general (but inexpensive) one like Code2000. You are assuming a product which does not need to work out of the box. If that is the case, you could ALSO claim Windows 2000 works with surrogates, since you can install or tweak the registry to make it work with surrogates. You could ALSO claim Windows 95 supports complex scripts, since you can INSTALL Uniscribe on it, right? And I would think that MS Windows 2000/XP is quite close to being able to make this claim, as long as you ignore the outdated Character Map (a prime example of needless subsetting!) and use an alternative like BabelMap. For one big advantage of the approach I suggest is that an OS can even anticipate future versions of the standard, as long as no major new properties are added. -- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/
Re: MS Windows and Unicode 4.0 ?
come on, use language-specific glyph substitution in the last resort font to show an Irish last resort glyph if the language is Irish. I know OpenType has it. Does AAT support language-specific features? John Jenkins wrote: On Dec 2, 2003, at 4:34 PM, Michael Everson wrote: At 15:14 -0800 2003-12-02, Patrick Andries wrote: Actually, if you look at the Last Resort glyphs (at a large enough size) you will see that the block name and range numbers are part of the image. See http://developer.apple.com/fonts/LastResortFont/ I believe the name is in English. That's correct. I tried to get Apple to put all the block names in Irish, of course ;-) Well, Irish was just silly. I was pushing internally to put them all in Deseret, but nobody went for it. :-( John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/
RE: UTF-16 inside UTF-8
Philippe Verdy wrote: Frank Yung-Fong Tang writes: But how about the UTF-16 vs UCS-4 battle? Forget it: nearly nobody uses UCS-4 except very internally for string processing at the character level. For whole strings, nearly everybody uses UTF-16 as it performs better with less memory cost, and because UCS-4 is not needed. I don't think that is a correct statement. I would like to use UTF-16. But it is clear that is not always the case: 1. Some people on this list prefer UCS-4. (Raise your hand if you do.) 2. wchar_t in Linux's glibc is UCS-4. (And that is nearly nobody?) 3. Because of 2, gconv on Linux uses UCS-4. 4. FontConfig uses UCS-4 in the API it provides for Xft (see FcFreeTypeCharIndex in fcfreetype.h). 5. Xft internally uses UCS-4 (look at xftdraw.c, xftrender.c). Some of Xft's APIs use UCS-4 (not all): XftTextExtents32, XftDrawString32, XftTextRender32, XftTextRender32BE, XftTextRender32LE, XftDrawCharSpec, XftCharSpecRender, XftDrawCharFontSpec, XftCharFontSpecRender. 6. gunichar in Linux is UCS-4. 7. Because of 6, Pango uses UCS-4 in its Unicode API. Handling surrogates found in strings is quite simple, and in fact even simpler to detect and manage than handling MBCS-encoded strings in Asian 8-bit applications; today MBCS 8-bit processing is performed by transforming it first into equivalent internal 16-bit code positions, or sometimes by transcoding it to Unicode with UTF-16. So I do think that applications that could handle East Asian DBCS 8-bit text (EUC-*, ISO 2022-*, JIS) can very easily be modified to work internally with UTF-16 (notably because interoperability of Unicode code points with these DBCS charsets is excellent, as the transcoding is unambiguous, bijective, needs no code reordering, and just consists of a simple mapping table implemented now in all OSes localized for Asian markets). East Asian developers have long since learned how to cope with DBCS-encoded strings.
Now with UTF-16, handling surrogates found in strings is even simpler, as UTF-16 allows bidirectional and random access to any position in a string, which means additional performance and less tricky algorithms for text processing... Agreed. It is simpler to handle surrogates than to handle multibyte encodings. Now the question is: if it is simple to handle surrogates, why don't we address that later, and put higher priority on other i18n issues which are harder to address and more critical if not implemented (such as handling non-shortest forms, which may lead to security problems)?
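The "surrogates are simple" claim is easy to demonstrate; here is a minimal sketch of pairing UTF-16 code units into code points (an illustration only, not any particular product's code):

```python
def utf16_units_to_code_points(units):
    """Combine high (0xD800-0xDBFF) and low (0xDC00-0xDFFF) surrogate
    pairs into supplementary code points; pass BMP units through."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) \
                and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # 0x10000 plus 10 bits from each half of the pair.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:  # BMP unit (an unpaired surrogate is passed through here)
            out.append(u)
            i += 1
    return out
```

For example, the pair 0xD840 0xDC00 yields U+20000, the first CJK Extension B character.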
Re: creating a test font w/ CJKV Extension B characters.
As I last remember, even when IE could render GB18030, it still treated multi-byte characters split across TCP blocks poorly. For example, if you have a 4-byte GB18030 sequence straddling a TCP block boundary (4k? 8k?), it will be trashed. Andrew C. West wrote: On Mon, 24 Nov 2003 10:12:52 +, [EMAIL PROTECTED] wrote: Even with the registry changes that allow Uniscribe to work with such characters? Oops, my mistake. I had forgotten that I had deliberately deleted the registry settings that control how IE deals with surrogate pairs some time ago, in order to prove a point (that IE won't display surrogate pairs without them). Anyway, having restored the registry to its original state, Frank's page displays OK without any tweaking whatsoever - both NCR and GB18030-encoded CJK-B characters render correctly with my preferred CJK-B font. To install the registry keys necessary for IE to display surrogate pairs, simply copy the code below to a file named something.reg and double-click on it. Replace Code2001 with the name of your preferred supra-BMP font if necessary. code Windows Registry Editor Version 5.00 [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack] SURROGATE=dword:0002 [HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\International\Scripts\42] IEFixedFontName=Code2001 IEPropFontName=Code2001 /code Andrew -- -- Frank Yung-Fong Tang tm rhtt, Itrntinl Dvlpmet, AOL Intrtv Srvies AIM:yungfongta mailto:[EMAIL PROTECTED] Tel:650-937-2913 Yahoo! Msg: frankyungfongtan John 3:16 For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life. Does your software display Thai language text correctly for Thailand users? - Basic Concept of Thai Language linked from Frank Tang's Itrntinliztin Secrets Want to translate your English text to something Thailand users can understand? - Try English-to-Thai machine translation at http://c3po.links.nectec.or.th/parsit/
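The TCP-boundary breakage Frank describes is exactly what a stateful incremental decoder avoids; a sketch using Python's codecs module (the chunk split points are arbitrary, and this illustrates the technique rather than IE's actual code):

```python
import codecs

def decode_chunks(chunks, encoding="gb18030"):
    """Feed arriving network chunks through one stateful decoder so a
    multi-byte sequence split across a chunk boundary is held back
    until its remaining bytes arrive, instead of being trashed."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)
```

A decoder that restarts on each buffer, by contrast, sees a truncated lead byte at the end of one block and garbage at the start of the next.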
Re: How can I have OTF for MacOS
John Jenkins wrote: On Nov 26, 2003, at 7:26 AM, [EMAIL PROTECTED] wrote: But what about Devanagari or Bangla? Devanagari and Bangla cannot be supported on Mac OS X through QuickDraw text rendering. Since Office on the Mac is currently restricted to QuickDraw text rendering, it cannot support them. John H. Jenkins John, what 'cmap' format does Apple use in the Mac OS X Devanagari and Bangla fonts?
RE: MS Windows and Unicode 4.0 ?
Carl W. Brown wrote: Jill, I know that Unicode does have some locale-sensitive case mappings (Turkish uppercase I to dotless lowercase ı, for example); I was under the impression that ss to ß was not one of them. You are correct that SS and ß are the same in case-insensitive compares regardless of locale. But whether the MS file system is case-insensitive in the Unicode way is a different question from whether the MS file system supports Unicode, right? So... "Necessary" will be the same as "Neceßary" in a case-insensitive comparison? I also think that İstanbul and Istanbul should compare the same for things like keyword searches and file systems, even though it is technically incorrect. Carl
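The SS/ß equivalence comes from Unicode full case folding, which Python's str.casefold() implements; a quick check (this is the locale-independent default folding only, not the Turkish tailoring being discussed):

```python
def caseless_equal(a, b):
    """Default caseless match: compare full case foldings, under which
    'ß' folds to 'ss' but dotted 'İ' stays distinct from plain 'I'."""
    return a.casefold() == b.casefold()
```

So "Neceßary" does match "NECESSARY" case-insensitively, while the Turkish İ/I pairing would need locale-tailored folding that this default comparison does not perform.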
Re: MS Windows and Unicode 4.0 ?
Michael (michka) Kaplan wrote: To answer the original question, support of Unicode in *any* version of Windows (or indeed any operating system) is between 1.1 and 4.0, depending on what feature you are looking at. To answer such a question, the specific feature about which the questioning party is thinking must be given as a part of said question. Oh... really? What kind of Unicode support was in Windows 2.0 (since you said *any*)? ... No, I don't really care. Don't try to answer me.
Re: Request
Markus Scherer wrote: Ritu Malhotra wrote: I would like you to know that I am currently working with Hindi software. In this scenario the complete software works on the basis of ISCII code. Now in my software I want to add support for a Unicode font for the Devanagari script (Mangal). How do I go about doing this? ... You may need to convert from ISCII to Unicode and then use the Unicode text for display. ICU has an ISCII converter: http://oss.software.ibm.com/icu/ markus Does the ICU ISCII converter take the ATTRIBUTE code in ISCII (as defined in Annex E of IS 13194:1991, page 20) to switch between scripts? ATR = 0xEF in ISCII: 0xEF 0x42 to switch to the Devanagari script, 0xEF 0x43 to switch to the Bengali script, etc... Not saying I like that part of ISCII or that such support is needed. I just want to know how complete the ICU converter is when it deals with this weird specification, ISCII. (If you don't think it is weird, look at the E-1 Display Attributes section in Annex E of ISCII, which is worse than the E-2 Font Attributes I mentioned here.)
Re: creating a test font w/ CJKV Extension B characters.
so.. in summary, what is your conclusion about the quality of GB18030 support in IE6/Win2K? If you run the same test on Mozilla / Netscape 7.0, what is your conclusion about that quality of support? Andrew C. West wrote: On Thu, 20 Nov 2003 01:32:16 +, [EMAIL PROTECTED] wrote: Frank Yung-Fong Tang wrote, If you visit http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596 and your machine has surrogate support installed correctly and a surrogate font installed correctly, then you should see surrogate characters show up matching the gif. It isn't working, but I have surrogate support and a font correctly installed. Using W2K and IE6, if you have a CJK-B font configured for User Defined scripts under the Options : Fonts settings, and manually select the encoding for the page as User Defined, then the second CJK-B character in each box (just above the gif image) displays just fine. The top character in each box appears to be encoded as GB-18030 (e.g. GB-18030 0x95328236 = U+2), and the second character is encoded as hex NCR values (e.g. #x2; for U+2). If GB-18030 is selected as the encoding for the page (as explicitly given in the file), then IE won't display the CJK-B characters correctly (even if you configure a CJK-B font as your default font for displaying Chinese), but you can copy and paste them to a Unicode editor, where both the GB-18030 and NCR encoded forms of CJK-B characters will display correctly with an appropriate CJK-B font. If User Defined is selected as the encoding for the page (either manually or by changing the meta tag in the file to charset=x-user-defined), then the GB-18030 encoded characters turn to gunk, but the NCR representations are displayed using whatever font you have configured for user defined scripts, and if that is a CJK-B font then hey presto! Andrew
Re: creating a test font w/ CJKV Extension B characters.
James: I think the first thing you need to do is make sure you properly did all of the following: a. you are running on W2K or WinXP b. you installed the surrogate support from Microsoft if you are running Win2K c. you configured your font in the IE font pref Try the following: 1. open Notepad 2. select the surrogate font 3. open Netscape 7 or Mozilla 4. view the URL I gave you in Mozilla. If you did a and b (even without c) you should see the Chinese text there. 5. click the [text link] 6. copy and paste the text into your Word XP 7. Do you see it correctly in Word XP? 8. Do the same thing and paste into Notepad. If that doesn't show it, then the problem is really that you didn't install things right. [EMAIL PROTECTED] wrote: . Andrew C. West wrote, Using W2K and IE6, if you have a CJK-B font configured for User Defined scripts under the Options : Fonts settings, and manually select the encoding for the page as User Defined, then the second CJK-B character in each box (just above the gif image) displays just fine. Yes. The page was downloaded and heavily tweaked off line. First I substituted a decimal numeric character reference for one of the hexadecimal entries. No dice. I did a couple of other tricks to no avail. I removed the GB character set declaration and tried to manually set the [View] to user defined. The page loaded again, but didn't display; checking the [View] showed that the page was still being loaded as UTF-8! Tried it again and again. At this point, *I* was heavily tweaked, so I didn't even try to insert the 'x-user-defined' character set into the HTML header. I just went back on line and opened the page successfully with a different browser. Best regards, James Kass .
Re: creating a test font w/ CJKV Extension B characters.
Michael (michka) Kaplan wrote: From: Frank Yung-Fong Tang [EMAIL PROTECTED] so.. in summary, what is your conclusion about the quality of GB18030 support in IE6/Win2K? If you run the same test on Mozilla / Netscape 7.0, what is your conclusion about that quality of support? In Summary? Well, in summary, I fail to see how testing for NCRs has anything to do with support of *any* encoding in a browser. It seems like an inadequate test of functionality of gb18030 support. If you want to test gb18030 support, then please encode a web page in gb18030 and test *that* in the browser of your choice. Have you ever looked at that page before you said this? Or at the HTML source of that page? The page displays five pieces of information for each character [for BMP characters, less information is there]: 1. the GB18030 encoded value in hex. The hex value of the first two bytes is displayed at the top of the page for the plane 2 characters. The hex value of the third byte is displayed on the left of each row. The 4th byte is displayed on top of each column. 2. The 4 bytes of the character encoded in GB18030. 3. The same character encoded by using a hex escape in HTML as &#xhhh;. 4. An IMG pointing to the image on www.unicode.org. 5. The equivalent Unicode hex value displayed as U+ at the bottom. Ideally, if the browser does things right and the font is installed, the one who wants to test can compare 2, 3, and 4 to see what happens. Therefore, it can be used to test BOTH the NCR and GB18030 support. If 2 displays differently from 4 (assuming the server is up and running and you do see the glyph in the gif), then it means the converter has a problem. If 3 displays differently from 4 (assuming the gif can be viewed), then it means your HTML parser has a problem. If both 2 and 3 display differently from 4, then it could be that both have problems, that the rendering engine itself has a problem, or that all of them have problems. Of course, you don't really need the img part; you can compare with the Unicode 4.0 standard by yourself. But my tool was written a year before I got my hardcopy of the Unicode 4.0 standard, so that image helps us to QA. If you SAVE the page locally and then look at the result, notice the save operation could already damage your page. And YES, I DO encode that page in GB18030 and use raw bytes to encode it. I did put ADDITIONAL information encoded in NCRs and img there to help the verification. You may have missed the real GB18030-encoded characters there if you did not pay close attention. Now if you want to discuss NCR support, then that may also be interesting. But it would be nice to have tests that actually cover what they claim to cover. I do have an actual claim about what it covers. And more than that. The problem is you looked at the additional part, which is beyond what I claimed in the last email. MichKa [MS] NLS Collation/Locale/Keyboard Development Globalization Infrastructure and Font Technologies
Re: UTF-16 inside UTF-8
Dear Doug: Thank you for your reply. What you said about how to do it is exactly what should be done. The point of asking those questions was not to seek an answer. Instead, I just wanted to show from the answer that adding surrogate support is not trivial. You wrote earlier: For UTF-8 in particular, I can't imagine why one would choose to implement the 1-, 2-, and 3-byte forms in one stage and add the 4-byte forms in a later stage. Can you imagine now? The tasks you listed below are additional tasks that people need to perform before they add 4-byte UTF-8. They don't need that part if they only support 2-byte or 3-byte UTF-8. It does not imply they should not add 4-byte support. It only means that people who want to add the support need to plan for extra tasks and time. All the following tasks cause it to come later. The "later" could be 1 day late, it could be 1 week late. It could be one milestone (from alpha 1 to alpha 2) late. But the fact that the developers do need to spend effort on those tasks causes it to be late. One real example I found recently is Tcl. Tcl has had so-called UTF-8 support since 8.1. But if you look at the implementation of Tcl 8.4.4 (from http://www.tcl.tk ) you will find the UTF-8 implementation: a. does not align with the Unicode 3.2/4.0 or RFC 3629 definition and accepts non-shortest forms b. by default does not accept 4-byte UTF-8 c. accepts 4-, 5-, 6-byte UTF-8 only if a certain compile flag is turned on: TCL_UTF_MAX (default 3, can be set to 4, 5, 6) d. has no documentation mentioning surrogates e. uses unsigned int for Tcl_UniChar if TCL_UTF_MAX is 4 to 6, and unsigned short if TCL_UTF_MAX is 3 (looks like a very, very bad decision) f. offers no way to use UTF-16 internally and accept 4-byte UTF-8; you can either use up to 3 bytes in UTF-8 with UTF-16 internally, or support up to 6 bytes (which is wrong; it should stop at 4) with UTF-32 (not really) internally g. really outputs CESU-8, not UTF-8, if the UTF-16 data (TCL_UTF_MAX = 3 or undefined as default) contains surrogate pairs. If you still think adding 4-byte UTF-8 support is 1% of the task, then please join the Tcl project and help me fix that. I appreciate your efforts there and I believe a lot of people will thank you for your contribution. Doug Ewell wrote: Frank Yung-Fong Tang YTang0648 at aol dot com wrote: What you do is, you go through the exact same process that API vendors have had to go through since the beginning of multibyte character sets. That is, you decide whether your API returns code units or characters, you publicize that decision, and you stick to it. If the decision means you have a function that isn't terribly useful, you have to define a new function that does the right thing, and leave the old function on the mountain to die. To cite a non-Unicode example, in ECMAScript (née JavaScript) there is a function Date.getYear() that was intended to return the last two digits of the year but actually returned the year minus 1900. Of course, starting in 2000 the function returned a value which was useful to practically nobody. Did Sun or ECMA change the definition of Date.getYear()? No, they introduced a new function, Date.getFullYear(), which does what users really want. Same thing here: you can't change the 16-bit UniChar, so you'll have to declare that your old functions that return a UniChar are defined as returning UTF-16 code units, and you'll probably want to define a new UniChar32 type and functions like: UniChar32 ToLower(UniChar32 aChar) that do the obvious right thing. And I'm sorry, I know some people will cringe when I say this, but if you're like me and get to define your own UniChar data type, you've been making it 32 bits wide since about 1997. That doubles your memory cost and size compared to UTF-16, and is 4x the size of your ASCII data in UTF-8. Changing the implementation of an ASCII-compatible application to UTF-16 is already hard, since people who only care about ASCII will be upset that the data size doubles for all their data. It is already a hard battle most of the time for someone like me. If we tell them to change to UCS-4, that means they need not 2x the memory but 4x the memory. For web services or applications which spend multiple millions on memory and databases, it means adding millions of dollars to their cost. They may be willing to add some millions of cost to support international customers by using UTF-16. They probably are not willing to add multiple millions of dollars of cost to change to UCS-4. In fact, there are people who proposed to store UTF-8 in a hacky way in the database, instead of using UTF-16 or UCS-4, to save cost. They have to add restrictions on using the API and build upper-level APIs to do conversion and hacky operations. That means it introduces some fixed (not depending on the size of the data) development cost to the project, but it saves millions of dollars.
Re: Problems encoding the Spanish ó
One thing that may help you think about this kind of issue is my under-construction paper - Frank Tang's List of Common Bugs that Break Text Integrity http://people.netscape.com/ftang/paper/textintegrity.html I am going to present a newer revision at the coming IUC25 if they accept my proposal. It looks like the 4 bytes 'ón M' got changed to the two code units U+DB7A and U+DC0D, which form a surrogate pair in UTF-16. Here is what I think happened. 1. The text ...ización Map.. is output from process A and passed to a process B, with the bytes encoded in ISO-8859-1, so the 4 bytes 'ón M' are encoded as 0xf3, 0x6e, 0x20, 0x4d. 2. Somehow process B thinks the incoming data is in UTF-8 instead of ISO-8859-1. You can find some possible causes as hints in my paper (url above). 3. Process B tries to convert the data stream to UTF-16 by using the UTF-8 to UTF-16 conversion rules. However, the UTF-8 scanner in the converter is not well written. It implements the conversion in the following way: 3.a. It hits the byte 0xf3, looks at a lookup table, and notices that 0xf3 in a legal UTF-8 sequence is the first byte of a 4-byte UTF-8 sequence. 3.b. It decodes that 4-byte UTF-8 sequence without checking the values of the next 3 bytes 0x6e, 0x20, 0x4d. It blindly assumes these bytes are the 2nd, 3rd and 4th bytes of this UTF-8 sequence. Of course, it first needs to get the UCS-4 value; what it does is m1 = byte1 & 0x07, m2 = byte2 & 0x3F, m3 = byte3 & 0x3F, m4 = byte4 & 0x3F. In your case, what it got is m1 = 0xf3 & 0x07 = 0x03, m2 = 0x6e & 0x3F = 0x2e, m3 = 0x20 & 0x3f = 0x20, m4 = 0x4d & 0x3f = 0x0d. [Notice the problem is that such an algorithm does not check that byte2, byte3 and byte4 are in the range 0x80 - 0xBF at all. One possibility is that it does not check in the code. The other possibility is that the code does check the values but is messed up by comparing a (char) value with an (unsigned char) value using < and >. What I mean is the following: main() { char a = 0x23; printf("a is %x ", a); if (a > (char)0x80) printf("and a is greater than 0x80\n"); else printf("and a is less than or equal to 0x80\n"); } sh% ./b a is 23 and a is greater than 0x80 ] Then it calculates the UCS-4 value using ucs4 = (m1 << 18) | (m2 << 12) | (m3 << 6) | (m4 << 0); in your case, what it got is ucs4 = (0x03 << 18) | (0x2e << 12) | (0x20 << 6) | (0x0d << 0) = 0xc0000 | 0x2e000 | 0x800 | 0x0d = U+EE80D. 3.c. Now it turns that UCS-4 value into UTF-16 surrogates: high = ((ucs4 - 0x10000) >> 10) | 0xd800 = ((0xee80d - 0x10000) >> 10) | 0xd800 = (0xde80d >> 10) | 0xd800 = 0x037a | 0xd800 = 0xdb7a; low = ((ucs4 - 0x10000) & 0x03FF) | 0xdc00 = ((0xee80d - 0x10000) & 0x03FF) | 0xdc00 = (0xde80d & 0x3FF) | 0xdc00 = 0x0d | 0xdc00 = 0xdc0d. So now you have the UTF-16 sequence DB7A DC0D. 4. Now process B (or some other code) tries to convert the UTF-16 into HTML NCRs; unfortunately, that process does not handle the UTF-16 to NCR conversion correctly. So... instead of doing it the right way as below: 4.a. take DB7A DC0D and convert it to UCS-4 as 0xEE80D, 4.b. convert 0xEE80D to decimal as 976909 and generate it as &#976909;, it converts DB7A as decimal 56186 and generates &#56186;, and then converts DC0D as decimal 56333 and generates &#56333;. So... in summary, there are 3, not only 1, problems in your system. Problem 1: Process A converts data to ISO-8859-1 while process B is expecting UTF-8. You should either fix process A to let it generate UTF-8 or fix process B to treat the input as ISO-8859-1. The preferred approach is the former. Problem 2: The UTF-8 converter in process B does not strictly implement the requirement in RFC 3629, which says it MUST protect against decoding invalid sequences. If you put non-ASCII at the end of a line, a converter of this quality will probably cause your software to fold the line; if you put it at the end of a record, it may even crash your software. You need to fix the converter's scanning part.
Problem 3: The UTF-16 to NCR conversion is incorrect according to HTML. Hope the above analysis helps. pepe pepe wrote: Hello: We have the following sequence of characters ...ización Map.. which is the same as ...izaci&#243;n Map... and which, after suffering some transformations, becomes ...izaci&#56186;&#56333;ap As you can see, the two characters 56186 and 56333 seem to represent the sequence 'ón M'. Any idea? Regards, Mario. _ Chat with your friends online via MSN Messenger. http://messenger.microsoft.com/es
Re: What does i18n mean?
Of course, I was just joking. The answer is: read http://www.i18nguy.com/origini18n.html and notice the spelling of "internationalization" (US spelling) and "internationalisation" (UK spelling) in that doc. The reason to abbreviate is not only because it is too long for Asian engineers like me to memorize, but also because UK and US people often spell it differently, which confuses Asian engineers like me. [EMAIL PROTECTED] wrote: In a message dated 11/14/2003 2:34:26 PM Pacific Standard Time, [EMAIL PROTECTED] writes: what does i18n mean? I see it bandied about a lot. It is a shorthand for "Irn " because it is too hard for most people to type the "r" part. :) [and if your software can save that string and retrieve it correctly later, 50% of the i18n problem is addressed]
Re: creating a test font w/ CJKV Extension B characters.
Why don't you find a font which already supports it? You can find some info here - http://www.microsoft.com/globaldev/DrIntl/columns/015/default.mspx It is not that easy to go from "don't know beans about fonts" to creating a test font that contains ... \u20050. If you are lucky, it will take you several months if not years. There are commercial font tools, but I am not sure whether they support 32-bit cmaps or not (probably not). You can start from http://www.microsoft.com/typography/users.htm , but I think it will take you a while. You need the 32-bit cmap support in OpenType to add U+20050, and I don't know which commercial tool currently supports that. Ostermueller, Erik wrote: Hello all, I'd like to create a test font that contains a standard US Latin alphabet and the following characters: \u5000 \u20050 We need this for testing a software app that supports GB18030. If you want GB18030 test data, one thing you can do is visit my GB18030 test page at http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=10 That is what I designed to test Mozilla's GB18030 support. The page numbers and the layout match exactly the paper copy of GB18030, so you can do a screen-to-paper comparison. I did add additional pages on the web (from page 284 on) which are not in the hardcopy of GB18030. If you visit http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596 and your machine has surrogate support installed correctly and a surrogate font installed correctly, then you should see surrogate characters that match the gif. If you click the [Text] in the upper left corner, it will open a new window and put that GB18030 text in plain text format. Good luck. My main problem is that I don't know beans about fonts. Could someone recommend a good tutorial or 'font creator' application that addresses surrogate pairs? Thanks, Erik Ostermueller
Re: creating a test font w/ CJKV Extension B characters.
Are you using Netscape 7 / Mozilla or IE? If you use IE, then IE may have a bug about that. I think Mozilla should not have the problem, since I developed and tested it myself. [EMAIL PROTECTED] wrote: . Frank Yung-Fong Tang wrote, If you visit http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596 and your machine has surrogate support installed correctly and a surrogate font installed correctly, then you should see surrogate characters that match the gif. It isn't working, but I have surrogate support and a font correctly installed. Are you running on XP or 2K? Did you install all the necessary surrogate support? Did you tweak your font pref to use the surrogate font for Chinese pages? The page looks like it is calling for Unicode characters to display, example #x2;, but the HTML header says GB-18030 for the character set. Could this be the problem, or are Unicode and GB18030 matched for plane two and for HTML numeric character references? It should not matter. But again, it could be a bug in IE. Each single Plane Two character is displaying as two missing glyphs, if that is an extra clue. Best regards, James Kass .
Re: creating a test font w/ CJKV Extension B characters.
Philippe Verdy wrote: From: Frank Yung-Fong Tang [EMAIL PROTECTED] It is not that easy to go from "don't know beans about fonts" to creating a test font that contains ... \u20050. If you are lucky, it will take you several months if not years. There are commercial font tools, but I am not sure whether they support 32-bit cmaps or not (probably not). According to: http://www.microsoft.com/typography/otspec/cmap.htm the so-called Microsoft Unicode cmap format 4 (platform id=3, encoding id=1) is the one recommended for all fonts, except those that need to encode supplementary planes. Format 0 is deprecated (it was used to map 8-bit encodings to glyph ids), as is now Format 2 (which was used to map DBCS encodings with lead/trail bytes in East Asia, as a mix of 8- and 16-bit codes). For supplementary planes, as in a font built to support GB18030, cmap format 12 must be used instead, with the same platform id but encoding id 10 (UCS-4). Format 8 is used to create a mix of 16-bit and 32-bit maps (with the assumption that no 16-bit Unicode character has the same value as the high 16 bits of a character outside the BMP, meaning it works as long as there is no glyph assigned for a code point X between U+0000 and U+FFFF simultaneously with code points between X*0x10000 and (X+1)*0x10000 - 1). This compresses the size of the cmap a bit. Format 10 is not portable, unlike format 12, which must be provided in addition to the recommended format 4 for characters present in the BMP. In practice, this format is used mostly for GB18030 support, and is supported by Windows 2000 and later. So you won't have to wait years to create a GB18030 font, using UCS-4 mappings... Which font tools currently support generating TTF with format 12? While it is true that the font format and application software (such as the Mozilla I wrote, WinXP, Office XP, etc.) are ready to deal with it, not many font tools that I know of can create a TTF with format 12, let alone ones designed for someone who claims he "doesn't know beans about fonts" to create a test font that contains ... \u20050 now.
Re: creating a test font w/ CJKV Extension B characters.
John Jenkins wrote: Nov 19, 2003 10:30 PM Ostermueller, Erik Could someone recommend a good tutorial or 'font creator' application that addresses surrogate pairs? FontLab is probably the best cross-platform font creation software out there, although it's not cheap. Cheaper solutions are to be found IIRC on Windows, and there's . Does FontLab support generating TTF in format 12 (32-bit)? Which cheaper solutions can generate TTF in format 12 (32-bit)? If you're on a Mac, Apple's font tool suite (http://developer.apple.com/fonts/) is free and lets you add non-BMP support to fonts. Can you point out which document and chapter in those docs talks about what we need to do to add non-BMP characters? Which of the following MacOSX font tools should be used for that purpose? # ftxanalyzer # ftxdiff # ftxdumperfuser # ftxenhancer # ftxinstalledfonts # ftxruler # ftxvalidator John H. Jenkins [EMAIL PROTECTED] [EMAIL PROTECTED] http://homepage.mac.com/jhjenkins/
Re: How can I input any Unicode character if I know its hexadecimal code?
Hmm, a very stupid (but working) way: 1. use vi 2. type &#x + the Unicode hex value + ; for each character 3. save it as .html 4. open the file in a browser 5. copy the text 6. paste it into your software.
Re: newbie 18030 font question
We added GB18030 support to Mozilla, and also added 32-bit cmap support on Windows to Mozilla, about a year ago. The Linux and Mac 32-bit cmap support is a little bit behind. I think we first had GB18030 encoding support in Netscape 6.2. You should be able to see the characters in Netscape 7 if your system has a font which contains the glyphs. Try the following test page http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=10 It is coded according to the hard copy of the GB18030 spec. (I also added more pages beyond the GB18030 spec to test the BMP part.) [EMAIL PROTECTED] wrote: Hello, all. I'm new to 18030 and was hoping that someone could verify this. We're implementing a browser-delivered database application and would like to support 18030. One fairly straightforward way of implementing this seems to be to accept 18030 at the browser and then transcode to Unicode when the data first reaches the server. When sending data back to the browser, we'd transcode back to 18030. OK so far, right? Unicode fonts don't support all characters in 18030, correct? Let's assume our client makes use of 18030 characters not in Unicode fonts. What font could we use for a 3rd-party reporting tool that reads data straight from the Unicode db, bypassing our transcoding layer? Thank you for your time; I've learned a lot reading through the archives of this mailing list. --Erik Ostermueller [EMAIL PROTECTED]
Re: Copy/paste in xterm/XEmacs
I think that depends on whether the application supports the newly defined UTF8_STRING selection target. The Linux version of Mozilla implements it, so it can copy/paste with recent versions of xterm without problems. Notice that UTF8_STRING was defined AFTER the X11R6 ICCCM. See the spec at http://www.pps.jussieu.fr/~jch/software/UTF8_STRING/ For details of Mozilla's implementation, see http://lxr.mozilla.org/seamonkey/source/widget/src/gtk/nsClipboard.cpp Phillip Farber wrote: After searching far and wide and reading all the HOWTOs etc., I'm still at a loss as to how to make a simple copy/paste work within xterm and between xterm and XEmacs. If I cat a UTF-8 encoded XML file containing Russian, it displays just fine. If I select a single word by left double-clicking and paste with a middle click, I see a mixture of '@', '^' and a few Cyrillic characters. If I paste into XEmacs 21.4 in a buffer with the buffer file coding system set to utf-8, I see a string of '?'. Interestingly, I can paste the selection into Windows Notepad and Word and it displays just fine too. Am I missing something very basic in my configuration/setup, or is this a known problem? I'm wondering whether my X server is not up to the task, or perhaps there are some X resources I should be setting? I am running the xterm from XFree86 4.2.0(165) that comes with Red Hat Linux 8, with display support from Hummingbird eXceed X Server 7.1 on Windows NT 4.0, and XEmacs 21.4 (patch 8) "Honest Recruiter" [Lucid] (i386-redhat-linux, Mule) of Mon Aug 26 2002 on astest. I'm invoking xterm as: xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1' % locale LANG=en_US.UTF-8 LC_CTYPE=en_US LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" LC_ALL= Phil.
--- Phillip Farber, Information Retrieval Specialist Email: [EMAIL PROTECTED] Digital Library Production Service (http://www.umdl.umich.edu/) Hatcher Graduate Library, University of Michigan 308 Hatcher North, Ann Arbor, MI 48104-1205
Re: Problem in unix server with the encoding of pound(#163)
Jain, Pankaj (MED, TCS) wrote: Hi, I am generating a pound sign in an HTML preview using XML XSLT transformation, and it's working fine on Windows using &#163; in the XML, but the same thing is not working on the Unix server. What do you mean by "in the Unix server"? Displaying the text in a Unix xterm? Or are you talking about some Unix browser? Which browser? Which version? I am using UTF-8 encoding for this. And the strange thing is that it works fine for PDF on both Windows and Unix, which I am generating using FOP XSL-FO. So I am not able to figure out where exactly the problem is. Please help me in the above area if there is any dependency on encoding in Unix. Thanks -Pankaj
Re: Unicode Public Review Issues update
URL, please? Rick McGowan wrote: The Unicode Public Review Issues page has been updated today. Highlights: Closed issue #1 (Language tag deprecation) without any change. Updated some deadlines on other issues to June 1, 2003. Added a document for issue #7 (tailored normalizations). Added an issue #8 regarding properties of math digits. Regards, Rick McGowan Unicode, Inc.
Re: Characters that rotate in vertical text
I think that is a hard problem. First of all, take a look at http://www.unicode.org/Public/4.0-Update/UCD-4.0.0d5b.html and find the vertical one. Second, anything which needs Symmetric Swapping in bidi probably needs to change in the vertical form. (If they need to change in the horizontal direction, they probably will need to change in the vertical position.) However, this is not that easy. First, there are some characters whose rotation is optional. For example, if you have the English string "Book" in your vertical text, should software rotate it or not? It could rotate the whole text 90 degrees as "Book", or it could display it as B o o k. Both are "right". It depends on the application domain to decide how to display it, which means it needs "a higher level protocol". Look at the example in section 3.3 of http://www.w3.org/TR/2003/WD-css3-text-20030226/ Second, it also depends on the people who design the glyphs. For example, U+FF0C in a Traditional Chinese font has the comma in the central position, which means it doesn't need to change in the vertical layout. However, Japanese users think that position is funny for horizontal text and won't accept it, so the U+FF0C glyph in a Japanese font will be put in the lower left corner. In that case, it needs a different glyph (note: not a different Unicode character, but a different glyph id) to represent it in the vertical layout. That is why on Windows systems most fonts have an "@ variant" version there. That font is used for vertical layout, and the same Unicode character maps to a different glyph id (so the comma shows up in the upper left position, center position, or upper right position [I am not a typographer, so I am not sure which one they choose, but one of them]). More info about vertical text can be found at the following places: 1. pages 342-365, Chapter 7, Typography, CJKV Information Processing, Ken Lunde, O'Reilly, ISBN 1-56592-224-7, http://www.oreilly.com/catalog/cjkvinfo/ 2. pages 192-193, Developing International Software, 2nd Edition, Dr. International, Microsoft Press, ISBN 0-7356-1583-7, http://www.microsoft.com/mspress/books/5717.asp They may have an online copy on MSDN. Rick Cameron wrote: Characters that rotate in vertical text. Hi, all. When Japanese (and, I imagine, other East Asian languages) is written vertically, certain characters are rotated by 90 degrees. Examples: the parenthesis-like characters in the block at U+3000, and U+30FC. U+3000 is a SPACE character; I don't think it will need to be rotated, it should show blank anyway. Does the Unicode character database include information on which characters are rotated in vertical text? If not, does anyone know of a definitive list? Thanks - rick cameron
Re: New document.
Otto Stolz wrote: The two scans under http://www.rz.uni-konstanz.de/Antivirus/tests/li.png http://www.rz.uni-konstanz.de/Antivirus/tests/re.png are from the authoritative (until July 1996) book on German orthography: Duden Rechtschreibung der deutschen Sprache und der Fremdwörter / hrsg. von d. Dudenred. auf d. Grundlage d. amtl. Rechtschreibregeln. [Red. Bearb.: Werner Scholze-Stubenrecht unter Mitw. von Dieter Berger ...]. - 19., neu bearb. u. erw. Aufl. ISBN: 3-411-20900-3. Best wishes, Otto Stolz Could you point out which symbol in those two images needs to be proposed? Either by drawing a red circle on the image, or by telling us the surrounding text. Thanks
Re: pinyin syllable `rua'
Which pinyin system is `rua' in? I use Simplified Chinese Win XP, and if I switch to the Full Spell (??) Simplified Chinese IME and type rua, then I get a character (read this email in UTF-8) which is U+633C. I am not sure that is correct. At least, as a native Mandarin speaker, that sound is not natural to me at all. It could be a table mistake in the software. It sounds like Japanese :) Werner LEMBERG wrote: Some lists of pinyin syllables contain `rua', but I actually can't find any Chinese character with this name. Does it exist at all? Or is it just there for completeness of pinyin? Werner
Re: sorting order between win98/xp
Dominikus Scherkl wrote: Anyone know why the sort order is different under those two systems? As I mentioned: a new feature, keeping numbers ordered numerically. I won't mind if they ALSO give me a flag to control that behavior. Numbers can be used for many different things in a string. It does not make sense to sort differently on Win98 and WinXP if I have the following subjects in my IMAP mailbox: "7870789 is my phone number" "1947 is the year Mary graduated from high school" "95129 is my zip code" "23.95 only to get a DSL line" "234458 - bugzilla bug - System crash when using large size font"
Re: sorting order between win98/xp
Anyone know is there a way to make them sort in the same order? Why should anybody want that? Because users expect a cross-platform (or I should say cross-Windows-version) product to display the same sorting order on Win98 and on WinXP. For example, the Netscape 7 mailer can run on both Win98 and WinXP and be used to access an IMAP mailbox. When users sort the mail by subject, they expect to see the same sequence for the same mailbox content on those two systems through the same mail client. I am not saying that is an IMPORTANT issue, but it is A REASONABLE issue. Why should any OS user want to see a different sorting order for the same locale? I looked at the Win32 API, but I cannot find any flag to control it.
Re: sorting order between win98/xp
Michael (michka) Kaplan wrote: From: "Yung-Fong Tang" [EMAIL PROTECTED] One of my colleagues asked me this question. Not much to do with Unicode, though. Is it? It would be a Unicode issue if the cause is that the new software tries to implement http://unicode.org/reports/tr10/ Unicode Technical Standard #10, Unicode Collation Algorithm, right?
Re: sorting order between win98/xp
We cannot use that. The function you mention compares two Unicode strings. We need a function to "generate a sort key" from Unicode strings instead of comparing two strings. Michael (michka) Kaplan wrote: From: "Yung-Fong Tang" [EMAIL PROTECTED] One of my colleagues asked me this question. In the interests of completeness: The function that does the type of sorting your colleague noted is StrCmpLogicalW in shlwapi.dll, version 5.5 and later. See the following link for more information (all on one line in the browser): http://msdn.microsoft.com/library/en-us/shellcc/platform/shell/reference/shlwapi/string/strcmplogicalw.asp MichKa
Re: sorting order between win98/xp
Doug got my point. What I care about is the "difference" rather than which one is better. Doug Ewell wrote: Dominikus Scherkl Dominikus dot Scherkl at glueckkanja dot com wrote: It is not deterministic string ordering?!? What's non-deterministic in numeric ordering? OK, a mix of (letter-)strings and numbers may not be as straightforward to sort as simply sorting digits by their encoding value (which is why it was not implemented before), but I always much prefer it. The question really isn't whether one sort order is "better" than another. It's easy to come up with examples where each method has an advantage. The real question is whether the same sort option should generate different results on different versions of Windows, all other things being equal. It would be nice to have an explicit option to sort strings by numeric value instead of character-set collating order, but not so good if the developer has no control over which method is used (and worse if Microsoft did not publicize this change; I don't know if they did or not). Note that I'm speaking in terms of programmable sorting. I really don't care how filenames in Windows Explorer are sorted. Me neither. That is just used to show the problem is at the OS level rather than a programming error in our code. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: Unicode character transformation through XSLT
I have not touched Java for years (probably 5 years)... so I could be wrong. Jain, Pankaj (MED, TCS) wrote: Hi ftang/james, thanks for the detailed explanation, and now I know the root problem of my error. I have the following string in the database as Long, in which the special character (?) is equivalent to ndash(-): E8C ? 6 to 10 And I am using the following code to write the string from the database to a property file, and in the property file I am getting the following string: value= E8C \uFFE2\uFF80\uFF93 6 to 10 And as \uFFE2\uFF80\uFF93 is not equivalent to ndash, I am not able to figure out why it is coming out in the property file. Do we need to specify in my Java program any type of encoding like UTF-8? Please let me know where the problem is. Here is my code: while(rsResult.next()) { /*Get the file contents from the value column*/ ipStream = rsResult.getBinaryStream("VALUE"); What is rsResult? A Blob? You probably need to use BufferedInputStream and DataInputStream to pipe the InputStream, and use readChar or readUTF from the DataInput interface instead. See http://www.webdeveloper.com/java/java_jj_read_write.html and http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF() for more info. strBuf = new StringBuffer(); while((chunk = ipStream.read())!=-1) { byte byChunk = new Integer(chunk).byteValue(); strBuf.append((char) byChunk); } Here is your problem: you read it in byte by byte. Each byte of the UTF-8 will be read in as a byte instead of a char in Java.
prop.setProperty(rsResult.getString("KEY"), strBuf.toString()); } /*Write to o/p stream*/ //opFile = new FileOutputStream(strFileName+".properties"); opFile = new FileOutputStream(strFileName); /*Store the Properties files*/ prop.store(opFile, "Resource Bundle created from Database View "+vctView.get(i)); Thanks -Pankaj -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Tuesday, March 11, 2003 6:09 PM To: Jain, Pankaj (MED, TCS) Cc: '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]' Subject: Re: Unicode character transformation through XSLT Because the following code got applied to your Unicode data: 1. convert \u escapes to Unicode - \uFFE2\uFF80\uFF93 becomes three Unicode characters: U+FFE2, U+FF80, U+FF93. This is OK. 2. a "throw away the high 8 bits" step got applied to your code, so it became 3 bytes: E2 80 93. 3. and some code treats it as UTF-8 and tries to convert it to UCS-2 again, so E2 = 1110 0010 and the rightmost 4 bits, 0010, will be used for UCS-2; 80 = 1000 0000 and the rightmost 6 bits, 000000, will be used for UCS-2; 93 = 1001 0011 and the rightmost 6 bits, 010011, will be used for UCS-2. [0010] [000000] [010011] = 0010 0000 0001 0011 = 2013. U+2013 is EN DASH. So... in your code there is something very, very bad which will corrupt your data. Steps 2 and 3 are very bad. You probably need to find out where they are and remove that code. Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html Probably your Java code has one or two of the bugs listed in my paper. Jain, Pankaj (MED, TCS) wrote: James, thanks, it's working for me now. But I still have a doubt: why is \uFFE2\uFF80\uFF93 giving ndash in HTML? If you have any information on this, please let me know. Thanks -Pankaj -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Monday, March 10, 2003 7:59 PM To: Jain, Pankaj (MED, TCS) Cc: '[EMAIL PROTECTED]' Subject: Re: Unicode character transformation through XSLT .
Pankaj Jain wrote: My problem is that I am getting the Unicode characters (\uFFE2\uFF80\uFF93) from a resource bundle property file, which is supposed to be equivalent to ndash(-). U+2013 is the ndash (–). It is represented in UTF-8 by three hex bytes: E2 80 93. But \uFFE2 is FULLWIDTH POUND SIGN, \uFF80 is HALFWIDTH KATAKANA LETTER TA, and \uFF93 is HALFWIDTH KATAKANA LETTER MO. Perhaps the reason you see three question marks is that the font you are using doesn't support fullwidth and halfwidth characters. What happens if you replace your string \uFFE2\uFF80\uFF93 with \u2013? Best regards, James Kass.
Re: farsi calendar components
check http://emr.cs.iit.edu/home/reingold/calendar-book/second-edition/ Paul Hastings wrote: does anybody know of any java farsi calendar components? thanks. Paul Hastings [EMAIL PROTECTED] CTO Sustainable Development Research Institute Member Team Macromedia (ColdFusion)
Re: sorting order between win98/xp
Do you use LCMapStringW on WinXP and LCMapStringA on Win98 WITH LCMAP_SORTKEY to generate the SORT KEY? Have you tried on both platforms (Win98 and WinXP)? Michael (michka) Kaplan wrote: LCMapString does not do the reported behavior either. CompareString and LCMapString are based on the same data and return the same results. Your colleague is mistaken. MichKa - Original Message - From: "Yung-Fong Tang" [EMAIL PROTECTED] To: "Michael (michka) Kaplan" [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Thursday, March 13, 2003 4:31 PM Subject: Re: sorting order between win98/xp We cannot use that. The function you mention compares two Unicode strings. We need a function to "generate a sort key" from Unicode strings instead of comparing two strings. Michael (michka) Kaplan wrote: From: "Yung-Fong Tang" [EMAIL PROTECTED] One of my colleagues asked me this question. In the interests of completeness: The function that does the type of sorting your colleague noted is StrCmpLogicalW in shlwapi.dll, version 5.5 and later. See the following link for more information (all on one line in the browser): http://msdn.microsoft.com/library/en-us/shellcc/platform/shell/reference/shlwapi/string/strcmplogicalw.asp MichKa
Re: wap and utf-8
Mary McCarter wrote: Hi Friends, My phone (Motorola i550, i30sx, i85, i60c) doesn't show the &#243; correctly, and it shows Ã³ instead of ó. Is that a LATIN CAPITAL LETTER A WITH TILDE and a SUPERSCRIPT THREE? ISO-8859-1 uses 0xC3 to encode LATIN CAPITAL LETTER A WITH TILDE. ISO-8859-1 uses 0xB3 to encode SUPERSCRIPT THREE. UTF-8 uses 0xC3 0xB3 to encode LATIN SMALL LETTER O WITH ACUTE. So... it looks like some code treats your UTF-8 as ISO-8859-1 - case #4 in my paper http://people.netscape.com/ftang/paper/textintegrity.html Why? <?xml version="1.0" encoding="ISO-8859-1"?> says "ISO-8859-1", and wml_binary has the \xc3\xb3. What is wml_binary? What encoding do you use to store the WML, UTF-8 or ISO-8859-1? If you do an od -x on that WML file, do you see \xf3 for that character, or \xc3\xb3? One possibility is that you created the file in UTF-8 but labeled it as ISO-8859-1. Changing the first line from <?xml version="1.0" encoding="ISO-8859-1"?> to <?xml version="1.0" encoding="UTF-8"?> will fix that. If you did store your information in ISO-8859-1, then it could be caused by the following: 1. some code reads your XML file and converts it to UTF-8 correctly; however, the encoding="ISO-8859-1" label is stored with it unchanged. 2. that code passes the converted XML to the next module, but it does not remove the encoding declaration or change it from 'encoding="ISO-8859-1"' to 'encoding="UTF-8"', so the next module thinks the data is still stored in ISO-8859-1. How to fix it? 1. fix the data: again, change it to encoding="UTF-8" and use UTF-8 to store the data in your WML file; or 2. fix the code: make the code which performs the ISO-8859-1 to UTF-8 conversion remove or change the encoding declaration. This is a typical "double conversion" issue, mentioned in my paper http://people.netscape.com/ftang/paper/textintegrity.html as point 6. However, I don't believe that is the case, because if that WERE the case, then all your environments should display garbage, not just your Motorola phone and the 4.1 simulator.
The real problem could be in two places: 1. some code didn't remove/change the XML encoding declaration even though it performed a charset conversion, AND 2. your Nokia phone and your 3.1 simulator (not your Motorola i550/i30sx/i85/i60c or up.sdk 4.1 simulator) may always ASSUME the data is UTF-8 and may always ignore the mislabeled encoding="ISO-8859-1" declaration. The DOUBLE fault could cause you to see it well on the Nokia phone and 3.1 simulator. The SINGLE fault in 1 and the CORRECT behavior of your Motorola phone probably let you see the wrong thing :) The same happens with my up.sdk 4.1 simulator (connected through my WAP gateway). But a Nokia phone shows it correctly! And my Nokia toolkit 3.1 simulator shows it well, too. I checked the WAP gateway code, and I can see that the wml_binary has the \xc3\xb3 and it is right, I think... because it is the UTF-8 code, but why can't my phone show it correctly? Any idea? I will be grateful for any contribution. Thanks a lot and Regards, Mary
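The mislabeling failure described above is easy to reproduce mechanically. A small Python sketch: the UTF-8 bytes for ó, interpreted under the wrong ISO-8859-1 label, come out as exactly the two characters Mary reports:

```python
text = "\u00f3"                      # LATIN SMALL LETTER O WITH ACUTE
utf8 = text.encode("utf-8")
assert utf8 == b"\xc3\xb3"

# A client that honors the (wrong) encoding="ISO-8859-1" label sees two
# characters instead of one: A-with-tilde followed by superscript three.
garbled = utf8.decode("iso-8859-1")
assert garbled == "\u00c3\u00b3"

# A client that ignores the label and assumes UTF-8 sees the right thing,
# which would explain the Nokia behavior.
assert utf8.decode("utf-8") == text
```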
Re: Encoding: Unicode Quarterly Newsletter
I hope they can reduce the weight next time by changing the type of paper. My Bible has about 500 more pages (about 1500+ pages) than the Unicode 3.0 standard, but is only 50% as thick. Same for my Chinese/English dictionary. Otto Stolz wrote: Kenneth Whistler wrote: we can calculate the weight as being *approximately* 9.05 pounds (avoirdupois) [or 10.99 troy pounds]. Apparently a weighty publication, that forthcoming Unicode standard... Cheers, Otto Stolz
Re: Encoding: Unicode Quarterly Newsletter
John H. Jenkins wrote: I certainly think it would be good published with a leather cover, onion-skin paper, and gilt edges, yes. First we have to have Ken divide it into verses, though. I thought we already have verses divided in Chapter 3 - all that C1-C13/D1-D2 stuff.
sorting order between win98/xp
One of my colleagues asked me this question. We use LCMapStringW on WinXP and LCMapStringA on Win98 (with LCMAP_SORTKEY), and we got different sorting orders. Example of message list ordering in Win98: TESTING #1, TESTING #10, TESTING #100, TESTING #11. While the message list ordering in WinXP is: TESTING #1, TESTING #10, TESTING #11, TESTING #100. Anyone know if there is a way to make them sort in the same order? Anyone know why the sort order is different under those two systems? They are running under the same locale.
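One portable way to get identical ordering on both systems is to compute your own sort key rather than relying on the OS collator. A hedged Python sketch of the WinXP-style "digit runs compare numerically" behavior (illustrative only; real LCMapString sort keys carry locale-specific collation weights this does not reproduce):

```python
import re

def natural_key(s: str):
    # Split into alternating text/digit runs; digit runs compare as numbers.
    return [int(tok) if tok.isdigit() else tok.casefold()
            for tok in re.split(r"(\d+)", s)]

subjects = ["TESTING #1", "TESTING #100", "TESTING #10", "TESTING #11"]
print(sorted(subjects))                    # Win98-style: #1, #10, #100, #11
print(sorted(subjects, key=natural_key))   # WinXP-style: #1, #10, #11, #100
```

Since the key is computed in application code, the order is the same wherever the application runs.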
Re: Unicode character transformation through XSLT
Because the following code got applied to your Unicode data: 1. convert \u escapes to Unicode - \uFFE2\uFF80\uFF93 becomes three Unicode characters: U+FFE2, U+FF80, U+FF93. This is OK. 2. a "throw away the high 8 bits" step got applied to your code, so it became 3 bytes: E2 80 93. 3. and some code treats those bytes as UTF-8 and tries to convert them to UCS-2 again, so E2 = 1110 0010 and the rightmost 4 bits, 0010, will be used for UCS-2; 80 = 1000 0000 and the rightmost 6 bits, 000000, will be used for UCS-2; 93 = 1001 0011 and the rightmost 6 bits, 010011, will be used for UCS-2. [0010] [000000] [010011] = 0010 0000 0001 0011 = 2013. U+2013 is EN DASH. So... in your code there is something very, very bad which will corrupt your data. Steps 2 and 3 are very bad. You probably need to find out where they are and remove that code. Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html Probably your Java code has one or two of the bugs listed in my paper. Jain, Pankaj (MED, TCS) wrote: James, thanks, it's working for me now. But I still have a doubt: why is \uFFE2\uFF80\uFF93 giving ndash in HTML? If you have any information on this, please let me know. Thanks -Pankaj -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Monday, March 10, 2003 7:59 PM To: Jain, Pankaj (MED, TCS) Cc: '[EMAIL PROTECTED]' Subject: Re: Unicode character transformation through XSLT . Pankaj Jain wrote: My problem is that I am getting the Unicode characters (\uFFE2\uFF80\uFF93) from a resource bundle property file, which is supposed to be equivalent to ndash(-). U+2013 is the ndash (–). It is represented in UTF-8 by three hex bytes: E2 80 93. But \uFFE2 is FULLWIDTH POUND SIGN, \uFF80 is HALFWIDTH KATAKANA LETTER TA, and \uFF93 is HALFWIDTH KATAKANA LETTER MO. Perhaps the reason you see three question marks is that the font you are using doesn't support fullwidth and halfwidth characters. What happens if you replace your string \uFFE2\uFF80\uFF93 with \u2013? Best regards, James Kass.
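The three corruption steps can be reproduced with a few lines of Python; this is a sketch of the same arithmetic, not the actual Java code involved:

```python
# Step 1: the escapes decode to three code points.
chars = "\uFFE2\uFF80\uFF93"

# Step 2: somewhere, the high 8 bits of each UTF-16 code unit are discarded.
truncated = bytes(ord(c) & 0xFF for c in chars)
assert truncated == b"\xe2\x80\x93"

# Step 3: another layer treats those bytes as UTF-8 and decodes them.
assert truncated.decode("utf-8") == "\u2013"   # EN DASH
```

Note the chain also runs backwards: E2 80 93 is the UTF-8 encoding of U+2013, and prepending 0xFF to each byte yields exactly U+FFE2 U+FF80 U+FF93, which is why those three particular characters appear in the property file.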
personal comments about http://www.w3.org/TR/xml11/
purports not to modify the interpretation of that coded character sequence. If a noncharacter which does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point." in http://www.unicode.org/reports/tr27/ Therefore, should the following section be changed from [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */ to [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#x1FFFD] | [#x20000-#x2FFFD] | [#x30000-#x3FFFD] | [#x40000-#x4FFFD] | [#x50000-#x5FFFD] | [#x60000-#x6FFFD] | [#x70000-#x7FFFD] | [#x80000-#x8FFFD] | [#x90000-#x9FFFD] | [#xA0000-#xAFFFD] | [#xB0000-#xBFFFD] | [#xC0000-#xCFFFD] | [#xD0000-#xDFFFD] | [#xE0000-#xEFFFD] | [#xF0000-#xFFFFD] | [#x100000-#x10FFFD] /* any Unicode character, excluding the surrogate blocks, FDD0 to FDEF, nFFFE, and nFFFF. */ ? 2. A similar thing should apply to [4] NameStartChar: #xFDD0-#xFDEF should not be allowed in NameStartChar, and neither nFFFE nor nFFFF should be allowed in NameStartChar. It looks like NameStartChar does not allow the private use area [#xE000-#xF8FF]. If we follow that principle, then the supplementary private use planes should not be in NameStartChar either, since http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt defines them as: F0000..FFFFF; Supplementary Private Use Area-A 100000..10FFFF; Supplementary Private Use Area-B Also, I doubt we should allow E0000..E007F; Tags to be used in NameStartChar. Frank Yung-Fong Tang
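The proposed [2] Char production above can be expressed as a predicate. A Python sketch (the function name is my own) covering the usual XML character range minus surrogates and all 66 Unicode noncharacters, i.e. FDD0..FDEF plus the last two code points of every plane:

```python
def is_proposed_xml_char(cp: int) -> bool:
    # Mirrors the amended [2] Char production discussed above.
    if cp in (0x9, 0xA, 0xD):
        return True
    if not 0x20 <= cp <= 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:       # surrogate blocks
        return False
    if 0xFDD0 <= cp <= 0xFDEF:       # BMP noncharacter range
        return False
    if cp & 0xFFFE == 0xFFFE:        # nFFFE / nFFFF in every plane
        return False
    return True

assert is_proposed_xml_char(0x41)          # 'A'
assert not is_proposed_xml_char(0xFDD0)    # noncharacter
assert not is_proposed_xml_char(0x1FFFE)   # plane-1 noncharacter
assert is_proposed_xml_char(0x10000)       # supplementary character
```

The `cp & 0xFFFE == 0xFFFE` test is a compact way to catch both nFFFE and nFFFF in all 17 planes at once.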
Re: length of text by different languages
Ram Viswanadha wrote: There is also some information at http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results Not sure if this is what you are looking for. Thanks, but not really. I am not looking at the ratio caused by encoding, but rather the ratio caused by the language itself. For example, in order to communicate the idea "I want to eat chicken for dinner tonight", French and German speakers using the same encoding may use different numbers of characters to communicate the same IDEA. Misha's paper helps a lot, but unfortunately it lacks Japanese and German data.
Re: Need program to convert UTF-8 - Hex sequences
1. open your file with N7 and change the encoding to UTF-8 2. select and copy all the text 3. paste it into the first textarea of the attached HTML file. David Oftedal wrote: Hello! Sorry to make this a mass spam, but I need a program to convert UTF-8 to hex sequences. This is useful for embedding text in non-UTF web pages, but also for creating a Yudit keymap file, which I'm doing at the moment. For example, a file with the content æøå would yield the output 0x00E6 0x00F8 0x00E5, and the Japanese expression あの人 would yield 0x3042 0x306E 0x4EBA. Can anyone tell me how to do it without making a program for it myself? It would be VERY helpful, and I've already made 2 programs for assembling this file and I'm not starting on another just yet. Best regards David J. Oftedal
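A quick programmatic alternative to the copy/paste route; a Python sketch that reads a UTF-8 file and emits the hex format David describes (the file name below is a placeholder):

```python
def hexdump_codepoints(text: str) -> str:
    # Emit each character as 0xHHHH (wider than four digits beyond the BMP).
    return " ".join("0x{:04X}".format(ord(ch)) for ch in text)

# with open("input.txt", encoding="utf-8") as f:   # placeholder file name
#     print(hexdump_codepoints(f.read()))

print(hexdump_codepoints("\u00e6\u00f8\u00e5"))   # 0x00E6 0x00F8 0x00E5
```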
Re: length of text by different languages
Francois Yergeau wrote: [EMAIL PROTECTED] wrote: I remember there were some studies showing that although UTF-8 encodes each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually use FEWER characters in writing to communicate information than alphabet-based languages. Can anyone point me to such research? I don't know of exactly what you want, but I vaguely remember a paper given at a Unicode conference long ago that compared various translations of the charter (or some such) of the Voice of America in a couple or three encodings. Hmm, let's see, could be this: http://www.unicode.org/iuc/iuc9/Friday2.html#b3 Reuters Compression Scheme for Unicode (RCSU) Misha Wolf Yeah, that could be it. I got a hard copy, and it looks like Fig 2 is the one I am looking for. No paper online, alas. I remember that Chinese was a clear winner in terms of number of characters. In fact, I seem to remember that Chinese was so much denser that it still won after RCSU (now SCSU) compression, which would mean that a Han character contains more than twice as much info on average as a Latin letter as used in (say) English. This is all on pretty shaky ground, distant memories. Perhaps Misha still has the figures (if that's in fact the right paper).
Re: length of text by different languages
Francois Yergeau wrote: http://www.unicode.org/iuc/iuc9/Friday2.html#b3 Reuters Compression Scheme for Unicode (RCSU) Misha Wolf Unfortunately, there is no information about German or Japanese. :( It only has Chinese, Farsi, Urdu, Russian, Arabic, Hindi, Korean, Creole, Thai, French, Czech, Turkish, Polish, Armenian, Greek, English, Vietnamese, Albanian, and Spanish. Anyone have data about those two languages (German and Japanese)?
Re: length of text by different languages
Thanks, everyone. But I want to point out that punctuation and the space itself should also be considered in your future calculations. Japanese, Chinese, and Thai do not use spaces between words, while Latin-based scripts (or Greek, Korean, Cyrillic, Arabic, Armenian, Georgian, etc.) do, and when estimating size, those should also be counted.
length of text by different languages
I remember there were some studies showing that although UTF-8 encodes each Japanese/Chinese character in 3 bytes, Japanese/Chinese usually use FEWER characters in writing to communicate information than alphabet-based languages. Can anyone point me to such research? Martin, do you have some paper about that? I would like to find out the average ratio between English, German, French, Japanese, Chinese, and Korean in terms of the number of characters, and in terms of the bytes needed to encode them in UTF-8. If such research has not been done, maybe one way to figure out the result is to take translated Bibles for these languages from the Sword project, strip out the XML tags to leave the pure text, and measure the size. Since all the Bible translations communicate the same information and the volume is large enough, that could be a good way to find the result. Of course, the markup needs to be taken out to reduce the noise.
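The measurement proposed above takes only a few lines in any language. A Python sketch comparing character count and UTF-8 byte count; the sample strings are short illustrative stand-ins, not a real corpus, and a real study would use parallel translations such as the Bible texts mentioned:

```python
samples = {
    "English": "In the beginning God created the heavens and the earth.",
    "Chinese": "\u8d77\u521d\uff0c\u795e\u5275\u9020\u5929\u5730\u3002",
}

for lang, text in samples.items():
    chars = len(text)                          # number of characters
    utf8_bytes = len(text.encode("utf-8"))     # bytes needed in UTF-8
    print(f"{lang}: {chars} characters, {utf8_bytes} UTF-8 bytes")
```

Even this toy pair shows the tradeoff: the Chinese sentence uses far fewer characters, but each one costs 3 bytes in UTF-8 where an ASCII letter costs 1.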
Re: Unicode Arabic Rendering Problem
Guess I am not the right person to answer that. Putting it back on the unicode.org mailing list. Let me ask you this way: is this a rendering style issue, or is it a different way to combine characters? How do you pronounce the following 3? Is there a different pronunciation between 1 and 3? Is there a different pronunciation between 2 and 3? The answers to the two questions above may tell us whether it is an encoding issue or a presentation (glyph variant) issue. This is a unique spelling that is commonly found in the Quran. Is that spelling also found in text OTHER than the Quran? Mete Kural wrote: Hello Yung-Fong, Thank you very much for all the information. It was very helpful. I'm still not clear about something though. As far as I understand, the block of characters U+0644-U+0654-U+0627 would be rendered as such: c \ / \/ /\ \/ U+0644-U+0627-U+0654 would be rendered: c \ / \/ /\ \/ So how would you encode this rendering? c \ / \/ /\ \/ in which the hamza is neither directly above the alef, nor directly above the lam, but in between the alef and lam. This is a unique spelling that is commonly found in the Quran. Thank you very much for the help. Mete --- Yung-Fong Tang [EMAIL PROTECTED] wrote:
Re: Unicode 4.0 BETA available for review
Thanks for letting me know. I guess I haven't spent enough time on www.unicode.org these days :) When did you add those PDFs there? It used to have only partial sections available... but that is probably a story from several years ago. Roozbeh Pournader wrote: On Thu, 27 Feb 2003, Mark Davis wrote: The Unicode Standard *is* free of charge; the entire text is posted on www.unicode.org. Well, free of charge to *read personally on the screen*, of course. You can't print the major versions yourself; Addison-Wesley must be asked for that ;) And you can't copy and paste portions of its text into emails for reference or discussion, as you can do with RFCs. You have to retype it, which I find very annoying. roozbeh
Re: Unicode 4.0 BETA available for review
Doug Ewell wrote: Yung-Fong Tang ftang at netscape dot com wrote: So... in the future, in order to ensure we have a good software environment, we not only need to make Unicode 4.0 clear, but also need to speed up the revision of those RFCs. But the Unicode Consortium and UTC have no control over that. And as you can see, François is doing his best to move the new RFC along. Sure, we all appreciate François' efforts on those. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)
Kenneth Whistler wrote: Think of it this way. Does anyone expect the ASCII standard to tell, in detail, what a process should or should not do if it receives data which purports to be ASCII, but which contains an 0x80 byte in it? All the ASCII standard can really do is tell you that 0x80 is not defined in ASCII, and a conformant process shall not interpret 0x80 as an ASCII character. Beyond that, it is up to the software engineers to figure out who goofed up in mislabelling or corrupting the data, and what the process receiving the bad data should do about it.

That is not a good comparison. ASCII is a single-byte character code standard. When I get a 0x80 in an ASCII string, I know where the boundary is: the whole 8 bits of that 0x80 are bad. The scope is not the first 3 bits nor 9 bits, but exactly those 8 bits of data. I cannot tell whether the rest of the data is good or bad, but I know ASCII is 8 bits and 8 bits only. Same thing for JIS X0208 (a TWO and only TWO byte character set, not a variable-length character set). If I am processing an ISO-2022-JP message in JIS X0208 mode and I get a 0x24 0xa8, I know the boundary of that problem is 16 bits, not 8 bits nor 32 bits. When you deal with encodings which need state (ISO-2022, ISO-2022-JP, etc.) or variable-length encodings (Shift_JIS, Big5, UTF-8), the situation is different.
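The boundary question for a variable-length encoding can be made concrete. Below is a minimal sketch (not the ConvertUTF.c code under discussion; the helper names `utf8_sequence_length` and `error_span` are my own) showing how a decoder can judge, from the lead byte alone, how many bytes an ill-formed UTF-8 sequence spans. It uses a simplified rule that absorbs any trailing continuation bytes, rather than the standard's full "maximal subpart" definition:

```python
def utf8_sequence_length(lead: int) -> int:
    """Number of bytes a UTF-8 sequence claims, judged from its lead byte.
    Returns 0 for bytes that can never start a sequence (continuation
    bytes 0x80-0xBF, plus the always-illegal 0xC0, 0xC1, and 0xF5-0xFF)."""
    if lead < 0x80:
        return 1                    # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0                        # cannot start a sequence

def error_span(data: bytes, pos: int) -> int:
    """How many bytes the ill-formed sequence at `pos` occupies:
    the lead byte plus however many continuation bytes actually follow,
    capped at the claimed length.  (Simplified: the standard's 'maximal
    subpart' rule also checks range restrictions on the second byte.)"""
    claimed = utf8_sequence_length(data[pos])
    if claimed == 0:
        return 1                    # a lone stray byte is its own error
    n = 1
    while n < claimed and pos + n < len(data) and 0x80 <= data[pos + n] <= 0xBF:
        n += 1
    return n
```

The point of the sketch is exactly Frank's: for ASCII or JIS X0208 the error width is fixed by the encoding, while for UTF-8 the decoder must compute it per sequence.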
Re: Unicode Arabic Rendering Problem
My test data generator at http://people.netscape.com/ftang/testscript/arabic/arabic.html can probably also help people look at the Arabic behavior. Unfortunately, it is currently coded against Windows-1256 instead of Unicode.
Re: Unicode Arabic Rendering Problem
I think you have both problems 1 and 2.

1. I think you use the wrong way to encode; you probably should encode figure 2 by using U+0644-U+0654-U+0627 and figure 3 by using U+0644-U+0627-U+0654.

2. I think there is also a font problem. From my test, all the fonts shipped with MS Windows do not work either way (the way you encode or the way I encode) on IE or Mozilla. But I do see one font, which I got from some Arabic font developer, that shows me U+0644-U+0654-U+0627 as figure 2 and U+0644-U+0627-U+0654 as figure 3. I will send you a screenshot in private email. Don't want to send a big jpg or png to the mailing list. I need to find out who designed that font I have on my hard drive... and will probably let you know more details later.

Mete Kural wrote: Hello Folks, I wanted to ask a question to those of you who have Unicode Arabic knowledge. We have this website http://www.quranreader.org where we are trying to display the text of the Quran with accurately encoded Unicode text rather than the traditional images. Some of the characters in the Quran aren't rendered correctly. We are letting the browser use its default Unicode font on the website, which is Times New Roman Unicode for the newer versions of Internet Explorer, I think. If we used a high-quality Unicode font for Arabic, would this solve the problem? Or is this a bigger problem that has to do with the rendering engine provided by the operating system? I would like to give you an example.
In Arabic when you have a Lam and Alef together, it is rendered in a unique way instead of the regular rendering for these letters, which kind of looks like this:

\ /
 \/
 /\
 \/
Figure 1

In the Quran, there is sometimes this combination of characters: Lam-Hamza-Alif. In such a case, the Lam and Alif are still rendered the way they would be had there not been a hamza in between, and the hamza is simply put above the alef and lam in the middle, which looks kind of like this:

 c
\ /
 \/
 /\
 \/
Figure 2

Note that this is different than the case as illustrated in Figure 3, where the hamza is directly above the alef and not "in between" lam and alef:

 c
\ /
 \/
 /\
 \/
Figure 3

So there is a subtle difference in that the hamza is not directly above the alef but rather in between the alef and the lam. I am attaching a small gif file named "Sample.gif" that will demonstrate the subtle difference in the positioning of the hamza. Attached are two words from the Quran. Look for the second word, where the hamza is in between the alef and the lam instead of directly above the alef. When we encode this case with this combination of Unicode characters: 0644-0627-0621, in Internet Explorer, instead of showing it like Figure 2, it totally separates all letters and shows it like this:

| | | | | C \__/

which is totally wrong. Which one do you think is the problem here?
1) We are not encoding this combination of characters in the correct way.
2) This is a font-related problem.
3) This is a bigger problem for which the rendering engine on the operating system has to be modified.
Thank you very very much, Mete Kural
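The two candidate encodings from this thread differ only in where the combining hamza sits in the code point sequence. A small illustration (Python, using the standard `unicodedata` module; `spell_out` is just a helper name I made up) that names each code point, which makes the difference easy to see without any font at all:

```python
import unicodedata

# Candidate encodings from the thread (code points, not glyphs):
fig2 = "\u0644\u0654\u0627"   # LAM + combining HAMZA ABOVE + ALEF
fig3 = "\u0644\u0627\u0654"   # LAM + ALEF + combining HAMZA ABOVE

def spell_out(s: str) -> list:
    """Name each code point so the two sequences can be compared."""
    return [unicodedata.name(c) for c in s]

# U+0654 ARABIC HAMZA ABOVE is a combining mark (canonical combining
# class 230), so it attaches to whatever base letter precedes it --
# the lam in fig2, the alef in fig3.  That placement difference is
# exactly what the thread is debating.
```

Note that the questioner's own sequence used U+0621 (the spacing, non-combining hamza), which would explain the letters separating in the browser: a spacing hamza breaks the lam-alef join.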
Re: Unicode 4.0 BETA available for review
Stefan Persson wrote: Kenneth Whistler wrote: Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value sequences. There were two types:

a. 0xC0 0x80 for U+0000 (instead of 0x00)
b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80)

Ah, but encoding NULL as a surrogate character and then encoding those two surrogates as three bytes each, making a total of 6 bytes per character, would also be technically possible (though not legal), right?

How? Surrogate pairs can only be used to represent U+10000 - U+10FFFF. It is IMPOSSIBLE to use a surrogate pair to represent any character in the range U+0000 - U+FFFF, including U+0000, which is NULL.

Stefan
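Frank's point can be checked with arithmetic: the UTF-16 surrogate-pair formula bottoms out at U+10000, so no BMP character (and hence not NULL) is reachable through a pair. A minimal sketch of the standard formula (the function name `to_surrogate_pair` is my own):

```python
def to_surrogate_pair(cp: int):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16 high/low surrogate pair, per the standard UTF-16 formula."""
    assert 0x10000 <= cp <= 0x10FFFF, "only supplementary code points pair up"
    v = cp - 0x10000                  # 20 bits to distribute
    high = 0xD800 + (v >> 10)         # top 10 bits
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits
    return high, low
```

The smallest possible pair, (0xD800, 0xDC00), already maps back to U+10000; there is simply no pair that decodes to anything in U+0000..U+FFFF.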
Re: Unicode 4.0 BETA available for review
This discussion has been centered around UTF-8. But I hope the corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0: for UTF-32, occurrences of 'surrogates' are ill-formed.

How about a UTF-32 sequence in which the 4 bytes represent a value greater than U+10FFFF? Is it considered ill-formed? Should it be?
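The well-formedness condition for a UTF-32 code unit is short enough to state as a one-line predicate. A sketch (the function name is mine) reflecting the rule that a unit must be a Unicode scalar value, i.e. in range and not a surrogate code point:

```python
def utf32_unit_is_valid(u: int) -> bool:
    """A UTF-32 code unit is well-formed only if it is a Unicode
    scalar value: 0..0x10FFFF, excluding surrogates D800..DFFF."""
    return 0 <= u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF)
```

So yes, under this rule a 4-byte unit above U+10FFFF is ill-formed, just as a surrogate value is.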
Re: Unicode 4.0 BETA available for review
Kent Karlsson wrote: The Unicode 4.0 text further strengthens Conformance Clause C12, to make this crystal clear:

C12 When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences.

C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition, and shall not interpret such sequences as characters.

And just in case anyone still has any trouble reading the painfully detailed specification of the UTF-8 encoding form, an explicit note is included there:

* Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.

So I don't think there is any hole here. If anyone still thinks that they can use these 3-octet/3-octet encodings of supplementary characters and call it UTF-8, then they are either engaging in wishful thinking or are not reading the standard carefully enough.

The problem I need to deal with is not GENERATING that UTF-8, but how to handle such DATA when my code receives it. For example, when I receive a 10K UTF-8 file which has 1000 lines of text, and one UTF-8 sequence on line 990 is ill-formed, should I fire the error for:

1. the whole file (10K, 1000 lines),
2. all the lines after line 989,
3. line 990 itself,
4. the text between the leading byte of that ill-formed UTF-8 sequence and the end of the file,
5. the text between the leading byte of that ill-formed UTF-8 sequence and the end of line 990,
6. the text between the leading byte of that ill-formed UTF-8 sequence and the next leading byte in line 990?

If there are other ways you can scope the ERROR, I could probably go on and tell you 10-20 other ways to scope it if I spent 20 more minutes. I do believe the error handling should be application specific.
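As one concrete data point (not a claim that this is the right answer for every application): CPython's UTF-8 decoder picks option 6's spirit. The exception object reports the byte span of the offending sequence, and with a `replace` error handler everything after the bad sequence still decodes:

```python
# Hypothetical data: good text with one ill-formed sequence embedded
# (0xED 0xA0 0x80 would encode a lone high surrogate, so it is rejected).
data = b"good \xed\xa0\x80 still good"

try:
    data.decode("utf-8")
    span = None
except UnicodeDecodeError as e:
    # The error is scoped to byte indices of the offending sequence,
    # not to the whole input.
    span = (e.start, e.end)

# With errors="replace", decoding resumes after the bad bytes:
recovered = data.decode("utf-8", errors="replace")
```

The point stands either way: the Unicode conformance clauses forbid interpreting the ill-formed bytes as characters, but the *scope* of recovery is left to the implementation.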
Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available forreview)
Likewise, the Unicode Standard tells you what a well-formed UTF-8 byte sequence is. But it is the software designer who has to be smart about determining what his/her software will do when it encounters an error condition and finds itself dealing with a sequence which is ill-formed according to the specification of UTF-8 in the Unicode Standard.

Or a higher-level specification, such as the XML specification, SOAP specification, CSS2 specification, etc. There are many, many layers between the Unicode standard and a software application. Not just the code itself.

--Ken
Re: Unicode 4.0 BETA available for review
I can keep answering these questions, but I can also assure everyone that the UTC worked *very* hard this time around to make the character encoding model much clearer in the Unicode 4.0 text, and to anticipate all these edge cases. --Ken

The problem in the past came from two (or more) places: 1. the definition in Unicode itself (3.0, 3.1), and 2. the RFC which summarizes it. I am sure you can control point 1. But we have to understand that point 2 is also important. The reason people refer to point 2 is usually that the RFC is much shorter and more focused than the Unicode standard itself. And also the RFC is FREE of charge, but the Unicode standard itself is not. So... in the future, in order to ensure we have a good software environment, we not only need to make Unicode 4.0 clear, but also need to speed up the revision of those RFCs.
quoted-string in for MIME Content-Type charset parameter
Not sure this is the right forum to discuss this issue. I found this "problem" when I was debugging a UTF-8 email message. When I looked into some email that we had a problem with, I saw a Content-Type header like the following:

Content-Type: text/html; charset="UTF-8"

As I remember, the MIME specification does not allow quotes around the charset parameter, and it should only accept

Content-Type: text/html; charset=UTF-8

but not charset="UTF-8". So... I checked the MIME spec to try to figure out whether it is allowed or not. What shocked me is that the original MIME specification, RFC 1521, disallowed it: http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1521.html#sec-7.1.1 and http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1521.html#sec-7.1.2

The formal grammar for the content-type header field for text is as follows:

text-type := "text" "/" text-subtype [";" "charset" "=" charset]
text-subtype := "plain" / extension-token
charset := "us-ascii" / "iso-8859-1" / "iso-8859-2" / "iso-8859-3" / "iso-8859-4" / "iso-8859-5" / "iso-8859-6" / "iso-8859-7" / "iso-8859-8" / "iso-8859-9" / extension-token

but RFC 2045, which obsoleted RFC 1521, allows the quoted charset name; see http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2045.html#sec-5.1

parameter := attribute "=" value
attribute := token   ; Matching of attributes
                     ; is ALWAYS case-insensitive.
value := token / quoted-string

Note that the value of a quoted string parameter does not include the quotes. That is, the quotation marks in a quoted-string are not a part of the value of the parameter, but are merely used to delimit that parameter value. In addition, comments are allowed in accordance with RFC 822 rules for structured header fields. Thus the following two forms

Content-type: text/plain; charset=us-ascii (Plain text)
Content-type: text/plain; charset="us-ascii"

are completely equivalent.

I was never aware of this difference between RFC 1521 and RFC 2045. Not sure whether you folks were aware of it or not. I also checked HTTP 1.1 (RFC 2068) and HTTP 1.0 (RFC 1945).
It looks like both specifications have conflicting language within the same specification about this issue: http://www.w3.org/Protocols/rfc1945/rfc1945 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2068.html

One place says:

charset = "US-ASCII" | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3" | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6" | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9" | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR" | "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8" | token

and

token = 1*<any CHAR except CTLs or tspecials>
tspecials = "(" | ")" | "<" | ">" | "@" | "," | ";" | ":" | "\" | <"> | "/" | "[" | "]" | "?" | "=" | "{" | "}" | SP | HT

which rules out the use of quoted-string. The other place says:

3.6 Media Types
HTTP uses Internet Media Types [13] in the Content-Type header field (Section 10.5) in order to provide open and extensible data typing.

media-type = type "/" subtype *( ";" parameter )
parameter = attribute "=" value
value = token | quoted-string

:( :( :( :(

Therefore we need to make sure:

1. All mailers which receive email can deal not only with charset=value but also with charset="value". I am not sure whether Mozilla can deal with it or not. How about your email program?

2. The browser can deal with Content-Type: text/html; charset="value" in addition to Content-Type: text/html; charset=value

3. Because we also use the META tag in HTML to reflect the HTTP header, the browser not only has to deal with the following kinds of meta tag

<meta http-equiv="content-type" content="text/html; charset=value">
<meta http-equiv='content-type' content='text/html; charset=value'>

but also

<meta http-equiv="content-type" content='text/html; charset="value"'>

:( :( :( :(

Not sure whether Mozilla handles 2 or 3. How about IE?
However, for email, since RFC 1521 does NOT allow it, to make sure it works with most email programs, when we try to send out internet email we should use

Content-Type: text/html; charset=UTF-8

instead of

Content-Type: text/html; charset="UTF-8"

Can you check this issue with the product that you are working on?
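For what it's worth, a modern RFC 2045-conformant parser treats the two forms identically; the quotes are delimiters, not part of the value. A small check using Python's standard `email` package (the sample header values are hypothetical):

```python
from email.parser import Parser

# Two messages whose only difference is quoting of the charset parameter.
quoted = Parser().parsestr(
    'Content-Type: text/html; charset="UTF-8"\n\nbody')
unquoted = Parser().parsestr(
    'Content-Type: text/html; charset=UTF-8\n\nbody')

# get_param() strips the RFC 2045 quoted-string delimiters, so both
# messages yield the same parameter value and the same charset.
assert quoted.get_param("charset") == "UTF-8"
assert quoted.get_content_charset() == unquoted.get_content_charset()
```

Of course this only shows what one RFC 2045 implementation does; the mail clients of that era that implemented RFC 1521 literally are exactly the compatibility concern raised above.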
Re: Unicode 4.0 BETA available for review
Kenneth Whistler wrote: If you read through those definitions from Unicode 4.0 carefully, you will see that UTF-8 representing a noncharacter is perfectly valid, but UTF-8 representing an unpaired surrogate code point is ill-formed (and therefore disallowed).

I see a hole here. How about UTF-8 representing a pair of surrogate code points as two 3-octet sequences instead of one 4-octet UTF-8 sequence? It should be ill-formed, since it is non-shortest form also, right? But we really need to watch the language used there so we won't create new problems. I DO NOT want people to think that a 3-octet UTF-8 encoding of a lone low or high surrogate is ill-formed but a 3-octet UTF-8 high surrogate followed by a 3-octet UTF-8 low surrogate is legal.
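Frank's worry can be demonstrated with actual byte sequences. Both forms below nominally represent U+10000: the 4-octet form is the only well-formed UTF-8 for it, while the 3-octet + 3-octet surrogate-pair form (the CESU-8-style encoding) must be rejected by a conformant UTF-8 decoder. Python's codec behaves that way, which makes a handy illustration:

```python
# The one well-formed UTF-8 encoding of U+10000:
well_formed = b"\xf0\x90\x80\x80"

# High surrogate D800 and low surrogate DC00, each encoded as a
# 3-octet sequence -- the form Frank fears people might think legal:
paired_surrogates = b"\xed\xa0\x80\xed\xb0\x80"

assert well_formed.decode("utf-8") == "\U00010000"

try:
    paired_surrogates.decode("utf-8")
    accepted = True       # a conformant decoder must never get here
except UnicodeDecodeError:
    accepted = False

# Even a *paired* pair of 3-octet surrogates is ill-formed UTF-8:
assert accepted is False
```

Note that strictly speaking this form is ill-formed because it maps to surrogate code points, not merely because it is "non-shortest"; either way a decoder has to reject it.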
Re: please review the paper for me
I think that is a very common mistake people WILL make. Doug Ewell wrote: Thanks to all who pointed out that noncharacters, unlike surrogate code points, are NOT illegal or invalid in UTF-8 or any other CES. I don't know why I said they were. (Bad brain! Bad, bad brain!) -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/