Quiz for Unicode guru

2004-08-19 Thread Frank Yung-Fong Tang
OK, just for fun

Quiz for Unicode Guru

Here is a quiz for the Unicoder. It is not a hard quiz; everyone will
get it right eventually. So use a stopwatch to measure how long it
takes you to figure out the right answer.
Note: You can find information about Unicode and UTF-8 at www.unicode.org

About the two pictures in the link below:
1. How many bytes do you need to encode the text in the pictures into
UTF-8?
2. What is the name of the script the text is written in?
3. Can you guess where (province, state, country, etc.) I took these
two images? [Hint: somewhere very close to where you can find famous
mice.]


The two pictures and the quiz can be found at
http://journals.aol.com/ytang0648/FrankTangsDiary/entries/753

Do NOT post your answer to the mailing list and spoil the fun once
you figure out the right one, OK?
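Question 1 above is just arithmetic over UTF-8 sequence lengths. The general rule (not the quiz answer) can be sketched in C as follows; the function name is mine, not from any code discussed on this list:

```c
#include <assert.h>

/* Bytes needed to store one Unicode scalar value in UTF-8, per the
   encoding ranges in Unicode 4.0.  Returns 0 for values UTF-8 may not
   encode (surrogate code points and anything above U+10FFFF). */
int utf8Length(unsigned long cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates excluded */
    if (cp <= 0x7F)     return 1;    /* ASCII                    */
    if (cp <= 0x7FF)    return 2;    /* e.g. Latin-1 supplement  */
    if (cp <= 0xFFFF)   return 3;    /* rest of the BMP          */
    if (cp <= 0x10FFFF) return 4;    /* supplementary planes     */
    return 0;
}
```

Summing this over the code points of the text in the pictures gives the byte count the quiz asks for.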





problems in Public Review 33 UTF Conversion Code Update

2004-05-19 Thread Frank Yung-Fong Tang




Looking at
http://www.unicode.org/review/


  

  33. UTF Conversion Code Update (2004.06.08)

  The C language source code example for UTF conversions (ConvertUTF.c)
  has been updated to version 1.2 and is being released for public
  review and comment. This update includes fixes for several minor
  bugs. The code can be found at the above link.

  


and look at the code
under http://www.unicode.org/Public/BETA/CVTUTF-1-2/

In http://www.unicode.org/Public/BETA/CVTUTF-1-2/ConvertUTF.c

/*
 * Index into the table below with the first byte of a UTF-8 sequence to
 * get the number of trailing bytes that are supposed to follow it.
 */
static const char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

Although there is code elsewhere that rejects 5- and 6-byte UTF-8
sequences, the array above misleads people into thinking that 5- and
6-byte UTF-8 exists. Also, F5-F7 should not map to 3, and C0 and C1
should not map to 1. It should be changed to:

static const char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,0,0,0,0,0,0,0,0,0,0,0
};

/*
 * Once the bits are split out into bytes of UTF-8, this is a mask OR-ed
 * into the first byte, depending on how many bytes follow.  There are
 * as many entries in this table as there are UTF-8 sequence types.
 * (I.e., one byte sequence, two byte... six byte sequence.)
 */
static const UTF8 firstByteMark[7] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

This comment is also misleading: "six byte sequence" and "0xF8, 0xFC".

/* Figure out how many bytes the result will require */
if (ch < (UTF32)0x80) {             bytesToWrite = 1;
} else if (ch < (UTF32)0x800) {     bytesToWrite = 2;
} else if (ch < (UTF32)0x10000) {   bytesToWrite = 3;
} else if (ch < (UTF32)0x200000) {  bytesToWrite = 4;

Shouldn't the last line be

} else if (ch < (UTF32)0x110000) {  bytesToWrite = 4;

? Where does the 0x200000 come from?

switch (extraBytesToRead) {
    case 5: ch += *source++; ch <<= 6;
    case 4: ch += *source++; ch <<= 6;

This code also misleads people into thinking there are 5- and 6-byte
UTF-8 sequences.

Also, the following routine:

static Boolean isLegalUTF8(const UTF8 *source, int length) {
    UTF8 a;
    const UTF8 *srcptr = source+length;
    switch (length) {
    default: return false;
        /* Everything else falls through when "true"... */
    case 4: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
    case 3: if ((a = (*--srcptr)) < 0x80 || a > 0xBF) return false;
    case 2: if ((a = (*--srcptr)) > 0xBF) return false;
        switch (*source) {
            /* no fall-through in this inner switch */
            case 0xE0: if (a < 0xA0) return false; break;
            case 0xF0: if (a < 0x90) return false; break;
            case 0xF4: if (a > 0x8F) return false; break;
            default:   if (a < 0x80) return false;
        }
    case 1: if (*source >= 0x80 && *source < 0xC2) return false;
        if (*source > 0xF4) return false;
    }
    return true;
}

does NOT match Table 3.1B as defined in Unicode 3.2 (see
http://www.unicode.org/reports/tr28/#3_1_conformance) or Table 3-6,
Well-Formed UTF-8 Byte Sequences, on page 78 of Unicode 4.0. In
particular, the function treats the following range as legal while it
should NOT:

U+D800..U+DFFF    ED    A0-BF    80-BF

Also, in http://www.unicode.org/Public/BETA/CVTUTF-1-2/harness.c the
following comment is misleading:

/* --- test01 - Spot check a few legal & illegal UTF-8 values only.
    This is not an exhaustive test, just a brief one that was
    used to develop the "isLegalUTF8" routine.

    Legal UTF-8 sequences are:
    1st      2nd      3rd      4th      Codepoints
    00-7F                               0000-007F
    C2-DF    80-BF                      0080-07FF
    E0       A0-BF    80-BF             0800-0FFF
    E1-EF    80-BF    80-BF             1000-FFFF
    F0       90-BF    80-BF    80-BF    10000-3FFFF
    F1-F3    80-BF    80-BF    80-BF    40000-FFFFF
    F4       80-8F    80-BF    80-BF    100000-10FFFF
   --- */

It should be: Legal UTF-8 ...
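The surrogate hole the message points out is easy to check against: a routine that follows Table 3-6 of Unicode 4.0 byte-for-byte can be sketched as below. This is my own sketch with a made-up name, not a patch to ConvertUTF.c:

```c
#include <stdbool.h>
#include <stddef.h>

/* Check one complete sequence of 'len' bytes against the well-formed
   UTF-8 ranges of Unicode 4.0, Table 3-6 (no overlongs, no surrogates,
   nothing above U+10FFFF).  Sketch only; 'isWellFormedUTF8' is a
   hypothetical name, not part of ConvertUTF.c. */
static bool isWellFormedUTF8(const unsigned char *s, size_t len) {
    if (len == 1)
        return s[0] <= 0x7F;
    if (len == 2)
        return s[0] >= 0xC2 && s[0] <= 0xDF &&
               s[1] >= 0x80 && s[1] <= 0xBF;
    if (len == 3) {
        unsigned char lo = 0x80, hi = 0xBF;
        if (s[0] == 0xE0) lo = 0xA0;            /* reject overlongs   */
        else if (s[0] == 0xED) hi = 0x9F;       /* reject surrogates  */
        else if (s[0] < 0xE1 || s[0] > 0xEF) return false;
        return s[1] >= lo && s[1] <= hi &&
               s[2] >= 0x80 && s[2] <= 0xBF;
    }
    if (len == 4) {
        unsigned char lo = 0x80, hi = 0xBF;
        if (s[0] == 0xF0) lo = 0x90;            /* reject overlongs    */
        else if (s[0] == 0xF4) hi = 0x8F;       /* stay <= U+10FFFF    */
        else if (s[0] < 0xF1 || s[0] > 0xF3) return false;
        return s[1] >= lo && s[1] <= hi &&
               s[2] >= 0x80 && s[2] <= 0xBF &&
               s[3] >= 0x80 && s[3] <= 0xBF;
    }
    return false;   /* no 5- or 6-byte sequences exist */
}
```

The constraint on the second byte depends on the first byte, which is exactly what the flat `case 1:` range checks in isLegalUTF8 cannot express for the ED lead byte.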

Yet another reason some software treat your UTF-8 xml as US-ASCII

2004-05-06 Thread Frank Yung-Fong Tang




For sure, no one on this mailing list wants to see their xml treated
as US-ASCII when the data is really in UTF-8.

If I have an xml file like the following

<?xml version="1.0"?>



and send it over the HTTP protocol with the following Content-Type header:

Content-Type: text/xml;

(without the charset=UTF-8)

Guess which charset the receiver should use as the charset of the xml?
UTF-8? ISO-8859-1? Or US-ASCII?

If you only read the XML 1.0 specification, I guess you will conclude
it should be treated as "UTF-8". However, if you also read RFC 3023,
then ... the answer is "US-ASCII".

see http://www.faqs.org/rfcs/rfc3023.html

[...]

3.1 Text/xml Registration
[...]
   Conformant with [RFC2046], if a text/xml entity is received with
   the charset parameter omitted, MIME processors and XML processors
   MUST use the default charset value of "us-ascii" [ASCII].  In cases
   where the XML MIME entity is transmitted via HTTP, the default
   charset value is still "us-ascii".
[...]

:( Notice that if the type is application/xml, the rule changes!!!

3.2 Application/xml Registration
[...]
   If an application/xml entity is received where the charset
   parameter is omitted, no information is being provided about the
   charset by the MIME Content-Type header.  Conforming XML
   processors MUST follow the requirements in section 4.3.3 of [XML]
   that directly address this contingency.  However, MIME processors
   that are not XML processors SHOULD NOT assume a default charset if
   the charset parameter is omitted from an application/xml entity.
[...]

:( :( :(
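The practical workaround, then, is to never let the charset parameter be omitted when serving as text/xml. For a UTF-8 document the header should read:

```
Content-Type: text/xml; charset=UTF-8
```

(Or serve it as application/xml, where the XML processor's own rules, and thus the in-document encoding declaration, apply.)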









OT: Standardize TimeZone ID

2004-04-23 Thread Frank Yung-Fong Tang

Is there any standards effort trying to standardize Time Zone IDs? I
am not talking about a time zone as a particular offset (that could be
done by GMT offset or addressed by ISO 8601), but rather about an ID
referring to a particular time zone / daylight saving time rule.

I know the de facto standard around is the one at
ftp://elsie.nci.nih.gov/pub/tz. People probably also use the time zone
values returned by Java a lot.

I think a standard for Time Zone IDs (maybe just adopting the one at
ftp://elsie.nci.nih.gov/pub/tz and clearly specifying it in an RFC) is
important for the future common locale data repository as well as for
web services i18n.

I know this is a little bit off-topic for Unicode, just like the one
about locale. Maybe I should move this to the w3c i18n mailing list.




unicode site problem

2004-04-22 Thread Frank Yung-Fong Tang




Does anyone know who can fix
http://www.unicode.org/reports/index.html ?
All the links are broken.







Re: GB18030 and super font

2004-04-22 Thread Frank Yung-Fong Tang


Raymond Mercier wrote on 4/22/2004, 7:35 AM:

  I enquired about the 'super font' created by a Beijing foundry,
  http://font.founder.com.cn/english/web/index.htm, and am fairly
  astonished
  at the prices, as you see from the attached.

The cost of producing these fonts is much higher than that of
producing a font which only has the glyphs from WGL4.






Unicode 4.0 and ISO10646-2003

2004-04-22 Thread Frank Yung-Fong Tang




I saw the announcement of the publication of
"ISO/IEC 10646:2003, Information technology --
Universal Multiple-Octet Coded Character Set (UCS)"
From http://anubis.dkuug.dk/jtc1/sc2/open/02n3729.htm
I expect there are no differences from Unicode 4.0, am I right?







Re: GB18030 and super font

2004-04-22 Thread Frank Yung-Fong Tang




In case you want to test your GB18030 font, you can use Netscape 7
(or the latest Mozilla) and then visit my GB18030 test pages at
http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=10

They should be page-for-page compatible with the paper copy of the
GB18030-2000 standard. I also created "pseudo pages" after page 284
for the surrogate mapping; pages after 284 do not exist in the
original GB18030 standard.

Have fun with
http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=597
:)

Raymond Mercier wrote on 4/22/2004, 1:04 PM:



  Eric,
  
  Amazin' Amazon!! Now why didn't I think of that?

  In fact the UK Amazon.co.uk says it is discontinued, so I would have
  to get it from Amazon in the US. It is not the first time that the
  two Amazons fail to connect.
  
  Many thanks for the tip,
  
  Raymond
  
   
- Original Message -
From: Eric Muller
To: [EMAIL PROTECTED]
Sent: Thursday, April 22, 2004 5:40 PM
Subject: Re: GB18030 and super font
 




Raymond Mercier wrote:


  
   
  
  
But that link to proofing tools leads nowhere. Maybe it's not so easy
to get the CHS version.
  
  

http://www.amazon.com/exec/obidos/tg/detail/-/BBZ54P/qid=1082651762/sr=8-1/ref=pd_ka_1/103-8333725-5907026?v=glances=softwaren=507846

Includes ~140 fonts, mostly for CJK, Arabic, Hebrew but other scripts
as well. Includes "Simsun (Founder Extended)" aka "-", with
65,531 glyphs!

Eric.









Re: Unicode 4.0 and ISO10646-2003

2004-04-22 Thread Frank Yung-Fong Tang


Kenneth Whistler wrote on 4/22/2004, 3:26 PM:

  Frank asked:
 
   I expect there are no difference from Unicode 4.0, am I right?
 
  Correct. Please see Appendix C of Unicode 4.0, p. 1348 and p. 1350,
  which already explicitly makes this statement.
 
  --Ken

I don't see ISO10646-2003 on the pages you mentioned. Is that equal
to the so-called "third version"? There is no easy way to tell that
ISO10646-2003 is equal to the so-called third version :)
Although I guess that is the case.




Re: help finding radical/stroke index at unicode.org

2004-04-14 Thread Frank Yung-Fong Tang
are you talking about
http://www.unicode.org/charts/unihangridindex.html
  and
http://www.unicode.org/charts/unihanrsindex.html
?


Gary P. Grosso wrote on 4/14/2004, 1:18 PM:

  Hi,
 
  I am looking for an up-to-date, online version of the sort of thing
  I see in the back of the printed Unicode 2.0 book.  All I can find
  is a search engine thing, and that's real cool (I suppose) but I
  need a tableau, a complete picture, of the whole shebang.  Can
  someone please help me find my way?
 
  Thanks,
  Gary
 
  ---
  Gary Grosso
  Arbortext, Inc.
  Ann Arbor, MI, USA
 
 





Re: Novice question

2004-03-23 Thread Frank Yung-Fong Tang
Be careful here: for Unicode support in the browser (at least
Netscape/Mozilla) there is some code forking between 2000/XP and
Win98/ME.

Philippe Verdy wrote on 3/23/2004, 5:39 AM:

  From: Edward H. Trager [EMAIL PROTECTED]
   Also, I would not bother testing Windows OSes prior to Windows
  2000/XP.




Re: in the NEW YORK TIMES today, report of a USA patent for a method to make the Arabic language easier to read/write/typeset

2004-03-16 Thread Frank Yung-Fong Tang


Chris Jacobs wrote on 3/15/2004, 10:08 PM:

  - Original Message -
  From: Kenneth Whistler [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Cc: [EMAIL PROTECTED]
  Sent: Tuesday, March 16, 2004 2:28 AM
  Subject: Re: in the NEW YORK TIMES today, report of a USA patent for a
  method to make the Arabic language easier to read/write/typeset
 
 
   Mark Shoulson said:
  
(Me, I think
it's a cool idea, but I'm notorious for being fascinated by shiny new
things.)
  
   a gnieb rof suoiroton m'I tub ,aedi bmud a s'ti kniht I ,eM
.tehctorc evitcaer
neK --
 
  If you really typed that in backwards someone should teach you about
  unicode.

Not really; CSS2 is good enough. You don't need Unicode to do that.

Try the following HTML (and CSS2) source in your browser:

<div style="direction:rtl; unicode-bidi:bidi-override">
Try the following HTML (and CSS2) source in your browser:
</div>







Re: in the NEW YORK TIMES today, report of a USA patent for a method to make the Arabic language easier to read/write/typeset

2004-03-16 Thread Frank Yung-Fong Tang
Maybe I should file a US patent application for writing Arabic from
left to right, to make it "more simplified" :) I guess that would get
a higher adoption rate than this font design patent, since most
software which does not support bidi already implements it. :)

Mark E. Shoulson wrote on 3/15/2004, 7:54 PM:

  And see http://www.arabetics.com/ for the official site.  (Me, I think
  it's a cool idea, but I'm notorious for being fascinated by shiny new
  things.)
 
  ~mark




Re: in the NEW YORK TIMES today, report of a USA patent for a method to make the Arabic language easier to read/write/typeset

2004-03-15 Thread Frank Yung-Fong Tang
Wow.
It seems not a very new idea. A similar idea was used in Chinese 40
years ago and created the differences between Simplified Chinese and
Traditional Chinese.

Michael Everson wrote on 3/15/2004, 12:40 PM:

  In the NEW YORK TIMES today
  comes a report of a USA patent for a new version of written Arabic
  letters, designed to make them easier to read/write/typeset without
  making them too different from traditional Arabic script:
  http://www.nytimes.com/2004/03/15/technology/15patent.html -





Re: multibyte char display

2004-03-15 Thread Frank Yung-Fong Tang
There are many different reasons why you may see ? there.
Read my paper http://people.netscape.com/ftang/paper/unicode25/a302.htm
for a list.

Manga wrote on 3/15/2004, 10:07 AM:

I use UTF-8 encoding in Java code to store multi-byte characters in
the db. When I retrieve the multi-byte characters from the db, I see
? in place of the actual multi-byte characters. I use Solaris OS.
  Is there any environment variable which I can set to see the actual
characters on my terminal window?
 
  Thanks
 





RE: in the NEW YORK TIMES today, report of a USA patent for a method to make the Arabic language easier to read/write/typeset

2004-03-15 Thread Frank Yung-Fong Tang






Mike Ayers wrote on 3/15/2004, 2:50 PM:



  
   From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
   Frank Yung-Fong Tang
   Sent: Monday, March 15, 2004 11:16 AM

   It seems not a very new idea. A similar idea was used in Chinese 40
   years ago and created the differences between Simplified Chinese
   and Traditional Chinese.

   Really? That conflicts with my understanding, which is:
  
  
   When writing Chinese, there are certain stroke elements which, when
   written in the more flowing script of everyday usage (grass script
   et al.), closely resemble other stroke elements which use fewer
   strokes to write. These stroke-reduced elements are substituted for
   the original elements. Also, there are certain "paired" character
   elements, such that one may be substituted for the other, and the
   quicker-to-write stroke-reduced element gets substituted. I do not
   really understand these substitutions, but it is my understanding
   that they are intuitive to literate Chinese. These two
   "simplification" methods were formalized and standardized to become
   Simplified Chinese.

   Am I getting this wrong? I don't see the connection between organic
   change in a script and singular revolutionary change.

Oh... believe me, as a Chinese person educated in the Traditional
Chinese world, Simplified Chinese looks like a "revolutionary change"
:)

Don't get me wrong. I mentioned Chinese not to prove it "could be
done". I only wanted to show that if it does happen, you will have one
more alphabet to deal with (now Chinese in the USA need to know BOTH
Traditional Chinese AND Simplified Chinese instead of JUST the
"hard-to-learn" Traditional Chinese).


  
  /|/|ike
  
  






Re: Version(s) of Unicode supported by various versions of Microsoft Windows

2004-03-05 Thread Frank Yung-Fong Tang
Not sure where to find that information in a paper. But one way to
check the degree of support is to run GetStringTypeEx against some
characters defined in 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0 to see whether
the returned results reflect what they should be.

Antoine Leca wrote on 3/5/2004, 8:35 AM:

  Hi folks,
 
  I discovered, much to my surprise (though after reflection it makes
  sense, taking into account the dates when it was developed), that
  Windows 2000 only supports The Unicode Standard, version 2.0
  URL:http://support.microsoft.com/default.aspx?scid=kb;EN-US;227483
  URL:http://support.microsoft.com/default.aspx?scid=kb;EN-US;227483
 
  The question: I was unable to find similar information referring to
  Windows NT versions 5.1 and 5.2.
 
  Certainly people here may direct me to the correct place to find it.
  Thanks
  in advance.
 
 
  (Please, do not tell me it supports 4.0 since you can view 4.0
  provided you
  use the correct browser and the correct fonts; that is NOT what I
  want to
  know. I am interested for example in sorting strings with surrogates;
  seeing
  that in a typical WinXP distribution, %SYSTEM32%/SORTKEYS.NLS is still
  256k
  like it was with NT3.x, shows me that this one would not support Unicode
  3.1, for instance).
 
  A similar query has been directed to Dr. International
  URL:http://www.microsoft.com/globaldev/drintl/askdrintl.aspx
 
 
  Antoine
 
 





Re: commandline converter for gb18030 - utf8 in *nix

2004-03-05 Thread Frank Yung-Fong Tang

You can also use 'nsconv', which comes with the Mozilla source code
and supports GB18030.
See http://www.mozilla.org/projects/l10n/mlp_tools.html for details

Zhang Weiwu wrote on 3/5/2004, 6:43 AM:

  Hello. I believe this must be a frequent question, but I googled around
  and I didn't find a satisfying tool. It seems most converters do GB2312
  but not GB18030.
 
  I have 100+ files to convert, normal graphical /web based converters
  won't do the work well.
 
  On my FreeBSD there is a ported tool, i18ntools
  (http://www.whizkidtech.redprince.net/i18n/), but it seems to lack
  the GB18030 codepage (and the GB_1988-80 page produced a messed-up
  file). Last month I reported w3c's Amaya's lack of GB18030 support;
  they said on the mailing list that they cannot implement the charset
  unless they can get a code conversion page file. Is it so hard to
  get one?
 
  And what command-line charset converter do you often use?
 
  Many thanks.





Re: Font Technology Standards

2004-03-03 Thread Frank Yung-Fong Tang




BDF is also widely used, although its quality and features are not
that powerful these days.

Also, there are other "standards" about fonts:
1. Glyph set "standards" - how to make sure one font contains all the
glyphs for a particular group of users. For example, WGL4 is a glyph
set standard from MS for pan-European users.
2. Glyph naming standards - how to name a particular glyph. I remember
Adobe has a "standard" glyph naming scheme for at least Cyrillic
glyphs. This is needed to put the common glyph names into a PostScript
font / TTF font.

And I am sure the following DOES NOT exist, although I hope we can
have one some day - a Glyph Encoding Standard: map a glyph to a fixed
glyph ID. (The Arabic presentation blocks A and B are sort of like
this.) For example, it would be much easier for people to understand
Indic fonts if there were an INFOS glyph mapping standard for all
their Indic fonts.


[EMAIL PROTECTED] wrote on 3/3/2004, 3:52 AM:



  Not sure exactly what you are
looking for because "Font Technology" covers a broad spectrum, but a
*simplified* picture might be something like the following:
  
  
  First, we should distinguish
bitmap font technologies from scalable font technologies ... I assume
you are more interested in the latter.
  
  
  For scalable fonts, there are
a number of fundamentally different ways to describe the curves:
Postscript outlines are based on bezier curves, TrueType outlines on
quadratic curves, and I can't remember what Metafont is.
  
  
  The next level is how you
package the individual glyphs into a font:
  
  Postscript type1
fonts -- bundle up Postscript outlines
  
  TrueType fonts --
bundle up TrueType outlines
  
  OpenType fonts --
bundle up either TrueType or Postscript outlines (and bitmaps)
  
  and there are others.
  
  
  Next level is how you encode
into the font the smarts for complex rendering. At least three
technologies utilize extensions of the TrueType font:
  
  OpenType from Microsoft & Adobe
  
  GX and AAT from Apple
  
  SIL Graphite from SIL
  
  
  (Note that the TrueType file
structure is inherently extensible, and OpenType, GX/AAT and Graphite
fonts are TrueType fonts with extra tables. Because of this people
often interchange and blur the terms "TrueType" and "OpenType".)
  
  
  As is common in this world,
at each level the various options each have pros and cons. 
  
  
  Bob
  






Re: What's in a wchar_t string on unix?

2004-03-03 Thread Frank Yung-Fong Tang
Oh, this is the first time I have heard about this. Thanks for the
information. Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__
is defined? Or does it only mean wchar_t holds characters from ISO
10646 (which means it could be 2 bytes, 4 bytes, or more than that)?

Noah Levitt wrote on 3/2/2004, 1:33 PM:

  As specified in C99 (and maybe earlier), if the macro
  __STDC_ISO_10646__ is defined, then wchar_t values are ucs4.
  Otherwise, wchar_t is an opaque type and you can't be sure
  what it is.
 
  Noah




Re: What's in a wchar_t string on unix?

2004-03-03 Thread Frank Yung-Fong Tang


Clark Cox wrote on 3/3/2004, 1:28 PM:

  From the C standard:

  __STDC_ISO_10646__  An integer constant of the form yyyymmL (for
  example, 199712L), intended to indicate that values of type wchar_t
  are the coded representations of the characters defined by ISO/IEC
  10646, along with all amendments and technical corrigenda, as of the
  specified year and month.
 
  This, to me suggests that wchar_t would indeed be a 32-bit type (well,
  at least a 20-bit type) when this macro is defined. However, to be
  sure, I'd suggest posting to news:comp.std.c

The language in the standard does not prevent someone from making it
16 bits or 64 bits when that macro is defined, right?

And what do the year and month mean?

 
 
  On Mar 03, 2004, at 12:38, Frank Yung-Fong Tang wrote:
 
   oh. This is the first time I hear about this. Thanks about your
   information. Does it also mean wchar_t is 4 bytes if __STDC_ISO_10646__
   is defined? or does it only mean wchar_t hold the character in
   ISO_10646
   (which mean it could be 2 bytes, 4 bytes or more than that?)
  
   Noah Levitt wrote on 3/2/2004, 1:33 PM:
  
   As specified in C99 (and maybe earlier), if the macro
   __STDC_ISO_10646__ is defined, then wchar_t values are ucs4.
   Otherwise, wchar_t is an opaque type and you can't be sure
   what it is.
  
   Noah
 
  --
  Clark S. Cox III
  [EMAIL PROTECTED]
  http://homepage.mac.com/clarkcox3/
  http://homepage.mac.com/clarkcox3/blog/B1196589870/index.html





Re: What's in a wchar_t string on unix?

2004-03-03 Thread Frank Yung-Fong Tang


Clark Cox wrote on 3/3/2004, 4:33 PM:
[I swap the reply order to make my new question clearer]
  
   And what does the year and month mean?
 
  It indicates which version of ISO10646 is used by the implementation.
  In the above example, it indicates whatever version was in effect in
  December of 1997.
It indicates which version of ISO10646 is used by the implementation.
hum... what text in the standard make you believe that is the case?
(I am not against it, just have not seen any standard text clearly show 
that yet.)

   The language in the standard  does not prevent someone to make it 16
   bits or 64 bits when that macro is defined, right?
 
  Not explicitly, but as I read it, when that macro is defined, wchar_t
  would have to be at least 20 bits, or else it couldn't be true that
  values of type wchar_t are the coded representations of the
  characters defined by ISO/IEC 10646. That is, I would think that
  wchar_t would have to be able to represent values in the range
  [0, 0x10FFFF]. But my interpretation could be off, which is why I
  recommended asking on comp.std.c.

hum... if it is defined as 199712, then what does it mean? Unicode 2.0
(first printed July 1996)? Unicode 2.1 (1998, by UTR #8)? Unicode 3.0
(2000)? None of these define any coded representations of characters
at or above U+10000, right? Therefore, there is no reason for an
implementation which defines it as 199712 to make the size of wchar_t
more than 16 bits, right?
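The cheapest way to see what a given compiler actually promises here is to ask it. A sketch; the output varies by platform, which is rather the point of the thread:

```c
#include <stdio.h>
#include <wchar.h>

/* Report what this C implementation promises about wchar_t: the
   __STDC_ISO_10646__ macro (when defined) and the storage width are
   about all the standard lets you inspect portably. */
void reportWchar(void) {
#ifdef __STDC_ISO_10646__
    printf("__STDC_ISO_10646__ = %ldL\n", (long)__STDC_ISO_10646__);
#else
    printf("__STDC_ISO_10646__ is not defined\n");
#endif
    printf("wchar_t is %u bits wide\n", (unsigned)(sizeof(wchar_t) * 8));
}
```

On glibc-based systems the macro is defined and wchar_t is 32 bits; on Windows compilers of this era the macro is typically absent and wchar_t is 16 bits, which is consistent with the ambiguity discussed above.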









Re: What's in a wchar_t string on unix?

2004-03-01 Thread Frank Yung-Fong Tang





Rick Cameron wrote on 3/1/2004, 2:13 PM:



  Hi, all
  
  
  This may be an FAQ,
but I couldn't find the answer on unicode.org.

The reason is that there is "NO answer" to the question you ask.

  
  
  It seems that most
flavours of
unix define wchar_t to be 4 bytes.

That depends on which UNIX and which version, and on how you define
"most flavours".

   If the locale is set
to be Unicode,
what's in a wchar_t string?

No answer for that, because:
1) The ANSI C standard does not define it (neither its size nor its
content).
2) Several organizations have tried to establish standards for Unix.
One of them is The Open Group's Base Specifications, IEEE Std 1003.1,
2003. But that does not define what wchar_t should hold, either.


   Is it UTF-32, or
UTF-16 with the code units
zero-extended to 4 bytes?
  
  Cheers
  
  
  - rick cameron

The more interesting
question is, why do you need to know the answer of your question. And
the ANSI/C wchar_t model basically suggest, if you ask that question,
you are moving to a wrong direction












Re: unicode format

2004-02-23 Thread Frank Yung-Fong Tang


John Cowan wrote:

  steve scripsit:
 
   Could someone please clarify the difference between UTF-8 and
   UTF-16? If it is possible to encode everything in UTF-8, and it
   is more efficient, what is the need for UTF-16?

It is more efficient to PROCESS text in UTF-16.




RE: Mother Language Day

2004-02-23 Thread Frank Yung-Fong Tang


joe wrote:

 
  (Hmm, in Russian "mother language" (maternij jazik) means something
  *very* different.

  Watch your language! ;-)

He wrote this in English, not Russian, right?
How can I watch Chinese (my language)?
 
  Joe




Re: Codes for Individual Chinese Brushstrokes

2004-02-20 Thread Frank Yung-Fong Tang

As a native Chinese person, I believe:
1. The so-called eight basic strokes are very standard in concept.
But that is only 8.
2. They list 8 different variants for each of the 8 basic strokes. But
if you read that page carefully, it does not mean that there are only
8 variants for each stroke, nor that people can distinguish those
variants from each other. For example, most Chinese will think the
first Dot from the left is the same as the fourth Dot from the left.
The differences between them are really a matter of style. Therefore,
it is not a good idea to encode those variants.
3. There are more composite strokes if you really want to encode
strokes. For example:
http://people.netscape.com/ftang/chineselearning/strokes/refglyph_003.gif
http://people.netscape.com/ftang/chineselearning/strokes/refglyph_004.gif


Andrew C. West wrote:

  On Thu, 19 Feb 2004 18:27:09 -0800 (PST), Kenneth Whistler wrote:
  
   Of the 64 entities listed on the page:
  
   http://www.chinavoc.com/arts/calligraphy/eightstroke.asp
  
   *none* of them are encoded, and *none* of them are standard
   enough to merit consideration -- if by consideration you mean
   separate encoding as characters.
  
 
  I'm not sure about *none* of them are encoded. As far as I can tell,
  pretty
  much most of the basic ideographic stroke forms are either already
  encoded in
  CJK and CJK-B or are proposed in CJK-C (where encoded here means
  encoded in
  their own right or can be represented by same-shaped ideographs).
 
  See for example the IRG document
  
http://www.cse.cuhk.edu.hk/~irg/irg/irg19/N927_Add%202%20Strokes%20to%20C1.doc 

 
  which states :
 
  quote
  Although most ideographic strokes have been encoded in CJK (including
  Ext.A and
  B) or submitted to CJK_C1 by IRG members, there are two ideographic
  strokes are
  found missing. Ideographic strokes are important for ideograph
  decomposition,
  analysis and for making ideographic strokes subset. Chinese linguists
  suggest to
  add these two ideographic strokes to CJK_C1.
  /quote
 
  I also remember reading one WG2 document that explicitly raised the
  question of
  how to deal with all the ideographic strokes proposed in CJK-C that
  are not
  distinct ideographs in their own right, although I can't seem to
  locate that
  document any more.
 
  All except one of the eight basic strokes mentioned at
  http://www.chinavoc.com/arts/calligraphy/eightstroke.asp are
  *representable*
  using existing characters in the CJK and/or Kangxi Radicals blocks :
 
  dot = U+4E36 or U+2F02 [KANGXI RADICAL DOT]
  dash = U+4E00 or U+2F00 [KANGXI RADICAL ONE]
  perpendicular downstroke = U+4E28 or U+2F01 [KANGXI RADICAL LINE]
  downstroke to the left or left-falling stroke = U+4E3F or U+2F03
  [KANGXI RADICAL
  SLASH]
  wavelike stroke or right-falling stroke = U+4E40
  hook = U+4E85 or U+2F05 [KANGXI RADICAL HOOK], as well as U+4E5A and
  U+2010C
  upstroke to the right =
  bend or twist = U+4E5B and U+200CC
 
  I concur with Ken that the 8x8 stroke categorization given at this web
  site is
  largely artificial. Whilst it may be useful to encode general
  ideographic stroke
  forms to help in the analysis and decomposition of ideographs, in my
  opinion the
  minute distinctions in the way that dots and dashes are written in
  various
  individual ideographs are beyond the scope of a character encoding
  system as the
  exact shape of a dot or length of a dash is irrelevant to any analysis
  of the
  compositional structure of an ideograph.
 
  Andrew
 





Re: UTF-8 to UTF-16 conversion

2004-02-06 Thread Frank Yung-Fong Tang
Yes, TEC. Look at developer.apple.com for the Text Encoding Converter.

Paramdeep Ahuja wrote:

 
  Hi
 
  Can anyone tell if there is any API available on MAC to convert from
  UTF-8
  to UTF-16
 
 
  thnx
  -P
 





Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang
Consider CR and LF too.

Mark Davis wrote on 1/14/2004, 9:25 AM:

  I'm not sure which suggested heuristic method you are referring to,
  but you are jumping to conclusions. For example, one of the
  heuristics is to judge which characters are more common when the
  bytes are interpreted as if they were in different encoding schemes.
  When picking between UTF-16BE and LE, U+0020 is *still* much more
  common than U+2000, even in Thai.
 
  Mark
  __
  http://www.macchiato.com

 
  - Original Message -
  From: Peter Kirk [EMAIL PROTECTED]
  To: John Burger [EMAIL PROTECTED]
  Cc: [EMAIL PROTECTED]
  Sent: Wed, 2004 Jan 14 08:12
  Subject: Re: Detecting encoding in Plain text
 
 
   On 14/01/2004 07:16, John Burger wrote:
  
...
By the way, I still don't quite understand what's special about Thai.
Could someone elaborate?
   
   I mentioned Thai because it is the only language I know of which does
   not use SPACE, U+0020. It also has at least some of its own
   punctuation. So a Thai text need not include any characters U+00xx -
   which rules out one suggested heuristic method.
  
   --
   Peter Kirk
   [EMAIL PROTECTED] (personal)
   [EMAIL PROTECTED] (work)
   http://www.qaya.org/
  
  
  
  
 
 





Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang
Does Thai use CR and LF?

Peter Kirk wrote on 1/14/2004, 8:12 AM:

  On 14/01/2004 07:16, John Burger wrote:
 
   ...
   By the way, I still don't quite understand what's special about Thai.
   Could someone elaborate?
  
  I mentioned Thai because it is the only language I know of which does
  not use SPACE, U+0020. It also has at least some of its own
  punctuation. So a Thai text need not include any characters U+00xx -
  which rules out one suggested heuristic method.
 
  --
  Peter Kirk
  [EMAIL PROTECTED] (personal)
  [EMAIL PROTECTED] (work)
  http://www.qaya.org/
 
 
 





Re: Detecting encoding in Plain text

2004-01-14 Thread Frank Yung-Fong Tang


John Burger wrote on 1/14/2004, 7:16 AM:

  Mark E. Shoulson wrote:
 
   If it's a heuristic we're after, then why split hairs and try to make
   all the rules ourselves?  Get a big ol' mess of training data in as
   many languages as you can and hand it over to a class full of CS
   graduate students studying Machine Learning.
 
  Absolutely my reaction.  All of these suggested heuristics are great,
  but would almost certainly simply fall out of a more rigorous approach
  using a generative probabilistic model, or some other classification
  technique.  Useful features would include n-graphs frequencies, as Mark
  suggests, as well as lots of other things.  For particular
  applications, you could use a cache model, e.g., using statistics from
  other documents from the same web site, or other messages from the same
  email address, or even generalizing across country-of-origin.
  Additionally, I'm pretty sure that you could get some mileage out of
  unsupervised data, that is, all of the documents in the training set
  needn't be labeled with language/encoding.  And one thing we have a lot
  of on the web is unsupervised data.
 
  I would be extremely surprised if such an approach couldn't achieve 99%
  accuracy - and I really do mean 99%, or better.
 
  By the way, I still don't quite understand what's special about Thai.
  Could someone elaborate?


For languages other than Thai, Chinese, and Japanese, you usually see
spaces between words. Therefore, you should see a high count of SPACE
in your document. For text in languages other than Thai, Chinese, and
Japanese, SPACE probably occupies 10%-15% of the code points (just a
guess: if the average word length is 9 characters, you get about 10%
SPACE; if the average is shorter, the percentage of SPACE increases).
But Thai, Chinese, and Japanese do not put spaces between words, so
the percentage of SPACE code points will be quite different. For
Korean, it is hard to say; it depends on whether the text uses
IDEOGRAPHIC SPACE or the single-byte SPACE. Also, for Korean, it will
depend on which normalization form is used. The percentage of spaces
will differ too, because in one normalization form one Korean
character counts as one Unicode code point, but in the decomposed
form it may count as 3.
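The Korean counting difference is easy to see with Python's unicodedata module (a sketch; the sample syllable is my own choice):

```python
import unicodedata

# One precomposed Hangul syllable vs. its canonically decomposed jamo.
syllable = "\uD55C"                                   # U+D55C
decomposed = unicodedata.normalize("NFD", syllable)

print(len(syllable))    # 1 code point in the precomposed (NFC) form
print(len(decomposed))  # 3 code points (lead, vowel, trail) in NFD
```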

Shanjian Lee and Kat Momoi implemented a charset detector based on my
early work and direction. They summarized it in a paper presented on
Sept 11, 2001; see
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html for
details. It talks about a different set of issues and problems.
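As a sketch of the SPACE-count heuristic for the UTF-16 byte-order case (the function and threshold here are illustrative, not taken from the Mozilla detector):

```python
def guess_utf16_endianness(data: bytes) -> str:
    """Count U+0020 (SPACE) under each byte-order interpretation;
    U+0020 is far more common than U+2000 in real text. CR/LF
    (U+000D / U+000A) could be counted the same way."""
    be = sum(1 for i in range(0, len(data) - 1, 2)
             if data[i] == 0x00 and data[i + 1] == 0x20)
    le = sum(1 for i in range(0, len(data) - 1, 2)
             if data[i] == 0x20 and data[i + 1] == 0x00)
    return "UTF-16BE" if be >= le else "UTF-16LE"

print(guess_utf16_endianness("some plain text".encode("utf-16-be")))
# UTF-16BE
```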


 
  - John Burger
 MITRE
 
 
 





Re: Programmatic description of ideographic characters

2004-01-03 Thread Frank Yung-Fong Tang
Looks like an old idea that people in Taiwan gave up a long time ago,
because the glyph quality would never be good enough.

Tom Emerson wrote on 1/2/2004, 6:06 PM:

  The following paper, Chinese Character Synthesis using METAPOST, was
  recently mentioned in a thread on the teTeX mailing list. It's an
  interesting read.
 
  http://www.tug.org/tug2003/preprints/Yiu/yiu.pdf




Re: MS Windows and Unicode 4.0 ?

2003-12-03 Thread Frank Yung-Fong Tang

Come on, take my joke. But that is a perfect example of a
language-specific variant glyph, right?

Michael Everson wrote:

  At 17:13 -0800 2003-12-02, Frank Yung-Fong Tang wrote:
  come on, use language specific glyph substution on the last resort
  font to show Irish last resort glyph if the language is Irish. I know
  OpenType have it. Does AAT support language specific features?
 
  You are welcome to lobby Apple to commission such an enormous font.
  You have, I think, no idea how much work that would be.
  --
  Michael Everson * * Everson Typography *  * http://www.evertype.com
 

-- 
--
Frank Yung-Fong Tang
Team Architect, International Development, AOL Interactive Services
AIM:yungfongta   mailto:[EMAIL PROTECTED] Tel:650-937-2913
Yahoo! Msg: frankyungfongtan





Re: MS Windows and Unicode 4.0 ?

2003-12-03 Thread Frank Yung-Fong Tang


Peter Kirk wrote:

  On 02/12/2003 16:25, Frank Yung-Fong Tang wrote:
 
  ...
  a barrier to proper internationalisation ?
  
  My opinion is reverse, I think it is a strategy to proper
  internationalization. Remember, people can always choose to stay with
  ISO-8859-1 only or go to UTF-8 with MES-1 support for European market.
  UTF-8 with MES-1 support does not mean other characters won't work in
  their product, but instead, it mean other charactrers are not Quality
  Assuranced in their products.
  
  
  Well, Frank, I am surprised that you favour encouraging developers to
  design their systems with only the European market in mind. Surely it
  would help with internationalisation for Thailand if the system is
  designed with support for Thai and other scripts in mind, even if not
  fully implemented and quality assured in the first release.

No, that is not what I said. See, you are still thinking about
developers and design and systems. I am talking about QA, product,
service, and marketing PLUS the development.

I am encouraging QA to test MES-1 with UTF-8 instead of only ISO-8859-1.
I am encouraging products to ship with MES-1 support out of the box
instead of ISO-8859-1.
And if QA writes its test plan using UTF-8 and MES-1, and the product
claims to support MES-1, how far could that be from "even if not
fully implemented and quality assured in the first release"?

You are talking about a developer-driven mindset; I am talking about a
product-driven, marketing-driven, quality-driven mindset.

 
  ...
  You only look at the issue from the developer point of view. But how
  about QA? How are you going to QA the whole Unicode? You also need to
  look at the issue from an end-user point of view, or the working out of
  box point of view. How could the end user know what kind of function
  they are going to get WITHOUT extra efforts.
  
  
  True, I hadn't looked at the QA issue. I suppose there are two ways to
  go here: one would be to aim at support for the whole of Unicode but
  only assure support for certain ranges;

In my book, a feature without QA is not a supported feature at all.
See, you still have this developer-oriented mindset. No product
should claim to support something without QA.

  the other is for the QA people
  to work with third party fonts.
  QA-ing the whole of Unicode shouldn't be
  a big problem anyway as most work needs to be done on new features
  rather than new characters e.g.

For a QA engineer to test a software product with a particular script,
they have to have at least some minimum knowledge of that script. And
I won't say that is easy. For example, ask yourself: how many scripts
would you feel comfortable Quality Assuring by yourself? (Not just
testing, but ASSURING.)

  if one script using special feature X is
  assured to work, a rather quick test should be sufficient to show that
  every script using feature X works.

Hmm... that sounds below the standard normal QA engineers target. A
good test plan needs to include:
Make sure right input causes right output.
Make sure wrong input causes an error, not right output.
Make sure all possible code paths get executed.
And more.

 
  If you are a QA engineer who is working on a working out of box product,
  how are you going to prepare your test cases? If you are a product
  marketing person who is going to write a product specification about a
  cell phone which do not allow user to download fonts, how are you going
  to spec it out?
  
  
  Well, I was thinking of computers rather than brain dead mobile phones.
  Mobile phones have long allowed downloading of ring tones, so why not
  downloading of fonts? And there is probably already a significant demand
  for mobile phones using every script which is in current everyday use,
  and so mobile phone manufacturers who restrict users to more restrictive
  subsets are being shortsighted - although I would expect that full BMP
  support would be adequate for a basic product in this scenario.
 

Name me a cell phone which can download and accept CJK Han Extension B
(Unicode Plane 2) today.

If you are building a theory, you can support any Unicode code point.
If you are building a technology, you may support any Unicode code point.
If you are building a product, you won't be able to support every
Unicode code point, with limited time and cost, in good enough
quality. In that case, I would rather cut features (how many scripts
of Unicode) in exchange for quality.


  You are assuming a product which does not need to work out of the box.
  If that is the case, you can ALSO consider that Windows 2000 works
  with surrogates, since you can install or tweak the registry to make
  it work with surrogates. You can ALSO consider that Windows 95
  supports complex scripts, since you can INSTALL Uniscribe on it, right?
  
  
  Right. My Windows 2000 supports surrogates, probably because either one
  of the service packs or Office XP installed this support for me. When I
  was using Windows 95 I

Re: MS Windows and Unicode 4.0 ?

2003-12-03 Thread Frank Yung-Fong Tang

As long as a product supports UTF-8 and passes the test with MES-1, I
can be pretty sure that no code in between strips off any
non-ISO-8859-1 characters, regardless of whether it supports MES-2 or
MES-3.

Of course, that does not guarantee surrogate characters won't get
damaged, but, just as someone believes, it will be 1% of the effort
for me to fix it later, right? :)
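The QA test I have in mind can be as small as a round-trip assertion. In this sketch, `pipeline` is a hypothetical stand-in for the product's processing path, and the sample characters are my own picks from outside ISO-8859-1 (assumed, not verified, to be in the MES-1 repertoire):

```python
def pipeline(raw: bytes) -> bytes:
    """Hypothetical stand-in for the product's processing path."""
    return raw.decode("utf-8").encode("utf-8")

# A pipeline that silently strips or mangles non-Latin-1 data fails this.
samples = "\u0152 \u0151 \u0142 \u014A"   # outside ISO-8859-1
assert pipeline(samples.encode("utf-8")).decode("utf-8") == samples
```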


Michael Everson wrote:

  At 15:38 -0800 2003-12-03, Frank Yung-Fong Tang wrote:
 
  I am encouraging QA to test MES-1 with UTF-8 instead of only ISO-8859-1.
  I am encouraging product ship with MES-1 support out of the box instead
  of ISO-8859-1.
  And if QA wrote their test plan using UTF-8 and MES-1, and the product
  claims to support MES-1, how far could it be from "even if not
  fully implemented and quality assured in the first release"?
 
  MES-1 is hopelessly archaic. It's ISO 6937. MES-2 would be the only
  minimum I could recommend for Europe. And it's not good enough
  either, which is why MES-3 is block based.
  --
  Michael Everson * * Everson Typography *  * http://www.evertype.com
 






RE: MS Windows and Unicode 4.0 ?

2003-12-02 Thread Frank Yung-Fong Tang


Michael Everson wrote:

 
  It's better than not knowing what range the thing is in. It helps the
  user know he has received, say, Telugu data or whatever.

Only if the user knows what Telugu looks like. How many users, other
than those signed up to the Unicode mailing list, know the shapes of
more than 10 scripts?

I think the value is that it shows people it is not just the ASCII
question mark ? itself.






RE: MS Windows and Unicode 4.0 ?

2003-12-02 Thread Frank Yung-Fong Tang





A better approach than asking "Does product X support Unicode 4.0?"
(to which you can always, in some way, get a NO answer) is to:
1. Define a smaller set of functionality (such as MES-1, MES-2, MES-3A).
2. Ask "Does Product X support MES-1? Does Product X support MES-2?"

I think that kind of question is more meaningful. Unicode is "Big,
Powerful but Complex", compared to US-ASCII or ISO-8859-1, which are
"Small, Weak but Simple". While the answer to "Does Product X support
Y" is meaningful when Y is a "Small but Simple" thing, the answer has
less meaning when Y is a Big and Complex beast like Unicode.

Surrogates by themselves could be a small enough subset for the
question, assuming you don't consider the Plane 14 behavior part of it.

Please do not push commercial companies too hard to implement Unicode.
Because there are TWO approaches, not ONE, that some commercial
companies have historically taken to implement the Unicode Standard:
1. Implement the next version of the software according to today's
Unicode Standard.
2. Change the next version of the Unicode Standard according to today's
implementation.

The famous Korean Mess and the introduction of the 15.10 Tag Characters
should teach all of us a lesson: if you push a company too hard to
implement Unicode, it may push them to take the 2nd approach...

Arcane Jill wrote:



Damn right. I would like to know this too. In particular, I want all
the math characters working, and all the musical symbols working. Note
that many of these are not in the BMP. I want to be able to put these
characters on web pages, and know that they will be displayed correctly
on my own choice of browser (which is not MSIE).
  
And since I already have a Windows OS (XP Pro), I don't see why I
should have to buy another one just to get these extras. I'm
hoping it would suffice to make just the FONTS available to the world.
  
Jill
  
  
 -Original Message-
 From: Patrick Andries [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, December 02, 2003 3:54 AM
 To: Michael (michka) Kaplan; Unicode List
 Subject: Re: MS Windows and Unicode 4.0 ?
 
 I'm interested in knowing whether the following features 
 would soon be found
 in Windows : fonts for scripts covered by Unicode 4.0,
corresponding
 rendering engine to display all Unicode 4.0 scripts
  
  








Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

2003-12-02 Thread Frank Yung-Fong Tang


Mark Davis wrote:

UTF-16: 6,634,430 bytes
UTF-8: 7,637,601 bytes
SCSU: 6,414,319 bytes
BOCU-1: 5,897,258 bytes
Legacy encoding (*): 5,477,432 bytes
(*) KS C 5601, KS X 1001, or EUC-KR

What would be the size of each of these after gzip? Just wondering:
gzip of UTF-16
gzip of UTF-8
gzip of SCSU
gzip of BOCU-1
gzip of Legacy encoding






Re: How can I have OTF for MacOS

2003-12-02 Thread Frank Yung-Fong Tang


John Jenkins wrote:

  On Dec 1, 2003, at 4:24 PM, Frank Yung-Fong Tang wrote:
 
   John What 'cmap' format Apple use in the MacOS X
   Devanagari and Bangla fonts?
  
 
  The formats are irrelevant; the Mac supports all the 'cmap' subtable
  formats for all subtables.  For rendering complex scripts, however, the
  font can only be rendered through ATSUI (or Cocoa), because the old way
  to support complex scripts (via an 'itl5' resource in the suitcase
  with the 'FOND' and 'sfnt' resources) is not supported on X.

It may or may not be relevant.

The reason I ask this question is that an earlier version of the Mozilla
code (I forget when we changed it) used to work around a freezing
issue in ATSUI on earlier versions of MacOS X. What happened is that in
the old version of MacOS X, if a page of Unicode characters was not
supported by any installed font, the performance in Mozilla was
extremely unacceptable (freezing for 3 minutes, opening and closing
font files), because either the old MacOS X did not cache the
information about which characters have no glyph on the system at all,
or such caching was per layout instead of global. To work around that,
Mozilla read the cmap table (only some formats) to decide which
characters could be rendered by ATSUI before passing them to ATSUI. I
know the performance is much better now (well... if you are still not
sure, try visiting
http://people.netscape.com/ftang/testscript/gb18030/gbtext.cgi?page=1220
a page of GB18030 characters which encode Unicode Plane 4 (which has no
characters assigned yet) from a browser which supports GB18030, or
http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=1220
for the HTML version, which includes NCRs, GB18030, and a GIF which
does not exist).

I think we removed that checking code last year, after the newer
version of MacOS X improved from this almost-freeze-on-wrong-data
situation.

 
  Apple really, really wants everybody to move to using Unicode in their
  applications for all their text, and Apple really, really, *really*
  wants people to do it for complex scripts.

And my question has no conflict with Apple's recommendation at all.

Think about this: while Microsoft supports a Unicode cmap and really
encourages people to use Unicode, they ALSO publish WGL4 and the
OpenType font specs for different scripts. They also say which formats
a font SHOULD support in the TTF cmap. That does not conflict with the
goal of using Unicode at all.

There are times people have to know those details. And it is better
that those details are captured in a public Apple tech note than that
people dig into the code by reverse engineering and guessing.

For example, if I want to customize my last-resort behavior in MacOS X
with ATSUI in my application (say, by drawing a Frank Tang picture
with a Unicode decimal value below it, a rendering nobody wants to
implement; everyone loves Unicode hex, right? Just an extreme case
that for sure John won't be crazy enough to add to ATSUI), how can I
do it now?






RE: MS Windows and Unicode 4.0 ?

2003-12-02 Thread Frank Yung-Fong Tang


Michael Everson wrote:

  At 14:23 -0800 2003-12-02, Frank Yung-Fong Tang wrote:
 
 It's better than not knowing what range the thing is in. It helps
  the
 user know he has received, say, Telugu data or whatever.
  
  Only if the user know what Telugu may look like. How many users other
  than those sign up the Unicode malling list know the shape of more
  than 10 scripts ?
 
  Actually, if you look at the Last Resort Glyphs (at a large enough
  size) you will see that the block name and range numbers are part of
  the image. See http://developer.apple.com/fonts/LastResortFont/

OK, you are right. I should say "only if you have good vision" instead.






Re: UTF-16 inside UTF-8

2003-12-02 Thread Frank Yung-Fong Tang


Doug Ewell wrote:

  Frank Yung-Fong Tang ytang0648 at aol dot com wrote:
 
  Then, Frank, the Tcl implementation is *not valid UTF-8* and needs to be
  fixed.  Plain and simple.  If a system like Tcl only supports the BMP,
  that is its choice, but it *must not* accept non-shortest UTF-8 forms or
  output CESU-8 disguised as UTF-8.

Agree with you. Just want to make a point that the implementation is not
< 1% of the work.

 
   If you still think adding 4-byte UTF-8 support is < 1% of the task,
   then please join the Tcl project and help me fix that. I appreciate
   your efforts there and I beleive a lot of people will thank for your
   contribution.
 
  I'll be happy to supply UTF-8 code that handles 4-byte sequences.  That
  is not the same thing as converting an entire system from 16-bit to
  32-bit integers, or adding proper UTF-16 surrogate support to a
  UCS-2-only system.  Of course that is more work.

Your view is based on the assumption that the internal code is UCS4
instead of UTF-16.

 
  Remember, AGAIN, that this thread was originally about taking an
  application like MySQL that did not support Unicode at all, and adding
  Unicode support to it, **BUT ONLY FOR THE 16-BIT BMP.**  That is what I
  can't imagine -- making BMP-only assumptions *today*, in 2003, knowing
  that you'll have to go back and fix them some day.  That is certainly
  more work than adding support for the full Unicode range at once.  I
  think you thought I said the opposite, that such retrofitting is easy,
  and are now trying hard to disprove it.

Nothing wrong if people choose to use UTF-16 instead of UCS4 in the
API, even in 2003. Do you agree?

If people do use UTF-16 in the API, it is natural for people who care
about the BMP but not about Planes 1-16 to only work on the BMP,
right? I am not saying they do the right thing. I am saying they do
the natural thing. Remember, the text describing surrogates in the
Unicode 4.0 standard is probably only 5-10 pages total in that
1462-page standard. For developers who are not going to implement the
other 1000+ pages right, it is natural to think: why do I need to get
these 10 pages right?
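For what it is worth, those few pages amount to very little code; a sketch of what getting surrogates right means in a UTF-16 API:

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000

hi, lo = to_surrogate_pair(0x20000)      # a CJK Extension B ideograph
print(hex(hi), hex(lo))                  # 0xd840 0xdc00
assert from_surrogate_pair(hi, lo) == 0x20000
```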



 
   double your memory cost and size from UTF-8. x4 of the size for your
   ASCII data. To change implementation of a ASCII compatable / support
   application to UTF-16 is already hard since people only care about
   ASCII will upset the data size x 2 for all their data. It is already
   a hard battle most of the time for someone like me. If we tell them to
   change to UCS-4 that mean they need not only x2 the memory but x4 of
   the memory.
 
  I can't fight this battle with people who would rather stay with ASCII
  and 7/8 bits per character.  They are not living in a Unicode world.

But how about the UTF-16 vs UCS4 battle?

 
  1024 x 768 screen resolution takes 150% more display memory than
  640 x 480, too.
 
   For web services or application which spend multi millions on those
   memory and database, it mean adding millions of dollars to their cost.
   They may have to adding some millions of cost to support international
   customer by using UTF-16. They probably are willing to add multi
   millions dollars of cost to change it to use UCS4. In fact, there are
   people proposed to stored UTF-8 in a hackky way into the database
   instead of using UTF-16 or UCS4 to save cost. They have to add
   restriction of using the api and build upper level api to do
   conversion and hacky operation. That mean it will introduce some fixed
   (not depend on the size of data) developement cost to the project but
   it will save millions of dollars of memory cost which depend on the
   size of the data. I don't like that approach but usually my word and
   what is right is less important than multiple million of dollars for
   a commercial company.
 
  I would truly be surprised if full 17-plane Unicode support in a single
  app could be demonstrated to be a matter of multiple millions of
  dollars.

It is not the full 17-plane Unicode support which will contribute to
it. It is

(number of ASCII-only records x sizeof(record in UCS4)) - (number of
ASCII-only records x sizeof(record in ASCII))

which contributes to that, compared to

(number of ASCII-only records x sizeof(record in UTF-8)) - (number of
ASCII-only records x sizeof(record in ASCII))

or

(number of ASCII-only records x sizeof(record in UTF-16)) - (number
of ASCII-only records x sizeof(record in ASCII))

The other comparison is

(number of BMP-only records x sizeof(record in UCS4)) - (number of
BMP-only records x sizeof(record in UTF-8))

(number of BMP-only records x sizeof(record in UCS4)) - (number of
BMP-only records x sizeof(record in UTF-16))

Of course, sizeof() is really the average size of a record with that
data.
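To put rough numbers on those comparisons (the sample records are mine; the -le encoding forms are used so no BOM is counted):

```python
def sizes(s: str) -> dict:
    """Byte cost of the same record in each encoding form."""
    return {enc: len(s.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# ASCII-only record: UCS4 is 4x UTF-8, and UTF-16 is 2x UTF-8.
print(sizes("hello, world"))    # 12 / 24 / 48 bytes

# BMP-only CJK record: UTF-16 is now the smallest of the three.
print(sizes("\u65E5\u672C\u8A9E\u30C6\u30AD\u30B9\u30C8"))  # 21 / 14 / 28 bytes
```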

 
  -Doug Ewell
  Fullerton, California
  http://users.adelphia.net/~dewell/
 

-- 
--
Frank Yung-Fong Tang
tm rhtt, Itrntinl Dvlpmet, AOL Intrtv 
Srvies
AIM:yungfongta   mailto:[EMAIL

Re: MS Windows and Unicode 4.0 ?

2003-12-02 Thread Frank Yung-Fong Tang


Peter Kirk wrote:

  On 02/12/2003 14:19, Frank Yung-Fong Tang wrote:
 
  
   A better approach than asking Does product X support Unicode 4.0
   which in some way you can always get a NO answer is to
   1. Define a smaller set of functionality (Such as MES-1, MES-2, MES-3A)
   2. Ask 'Does Product X Support MES-1? Does Product X Support MES-2?...
 
  I disagree - if we are talking about a system rather than a font.
  Supporting subsets is a dead end, and a barrier to proper
  internationalisation.

"a barrier to proper internationalisation"?

My opinion is the reverse: I think it is a strategy for proper
internationalization. Remember, people can always choose to stay with
ISO-8859-1 only, or go to UTF-8 with MES-1 support for the European
market. UTF-8 with MES-1 support does not mean other characters won't
work in their product; instead, it means other characters are not
Quality Assured in their product.

This is not a new approach. For example, when MS added Unicode
support, they ALSO defined WGL4. That basically tells people all the
characters in WGL4 will render on all Windows systems after Win98 (not
sure about 95). It does not mean other characters will not render on
Win98 or later. It only means those characters render out of the box.


  It would be much better for developers to realise
  that from the start they need to build in support for the whole Unicode
  character set. Once Arabic, one Indic script and Plane 1 are supported,
  the rest is relatively easy; all the data required are in the UCD, and
  the shaping details can be left to the font. The alternative of bolting
  on ad hoc support for extra scripts later, when they become necessary,
  just causes extra work.

You only look at the issue from the developer point of view. But how
about QA? How are you going to QA the whole of Unicode? You also need
to look at the issue from an end-user point of view, or the
working-out-of-the-box point of view. How could the end user know what
kind of functionality they are going to get WITHOUT extra effort?

If you are a QA engineer working on a working-out-of-the-box product,
how are you going to prepare your test cases? If you are a product
marketing person who is going to write a product specification for a
cell phone which does not allow the user to download fonts, how are
you going to spec it out?


 
  A product can thus claim to support Unicode 4.0 rather easily, if it
  makes the caveat that its font and perhaps keyboard support is limited
  to certain scripts. Users interested in more unusual scripts can then
  supply their own specialised font, or a general (but inexpensive) one
  like Code2000.

You are assuming a product which does not need to work out of the box.
If that is the case, you can ALSO consider that Windows 2000 works
with surrogates, since you can install or tweak the registry to make
it work with surrogates. You can ALSO consider that Windows 95
supports complex scripts, since you can INSTALL Uniscribe on it,
right?


 
  And I would think that MS Windows 2000/XP is quite close to being able
  to make this claim, as long as you ignore the outdated Character Map (a
  prime example of needless subsetting!) and use an alternative like
  BabelMap. For one big advantage of the approach I suggest is that an OS
  can even anticipate future versions of the standard, as long as no major
  new properties are added.
 
  --
  Peter Kirk
  [EMAIL PROTECTED] (personal)
  [EMAIL PROTECTED] (work)
  http://www.qaya.org/
 
 






Re: MS Windows and Unicode 4.0 ?

2003-12-02 Thread Frank Yung-Fong Tang


Come on, use language-specific glyph substitution in the Last Resort
font to show an Irish last-resort glyph if the language is Irish. I
know OpenType has it. Does AAT support language-specific features?


John Jenkins wrote:

 
  On Dec 2, 2003, at 4:34 PM, Michael Everson wrote:
 
   At 15:14 -0800 2003-12-02, Patrick Andries wrote:
  
 Actually, if you look at the Last Resort Glyphs (at a large enough
size) you will see that the block name and range numbers are part of
the image. See http://developer.apple.com/fonts/LastResortFont/
  
   I believe the name is in English.
  
   That's correct. I tried to get Apple to put all the block names in
   Irish, of course ;-)
  
 
  Well, Irish was just silly. I was pushing internally to put them all in
  Deseret, but nobody went for it.  :-(
 
  
  John H. Jenkins
  [EMAIL PROTECTED]
  [EMAIL PROTECTED]
  http://homepage.mac.com/jhjenkins/







RE: UTF-16 inside UTF-8

2003-12-02 Thread Frank Yung-Fong Tang


Philippe Verdy wrote:

  Frank Yung-Fong Tang writes:
   But how about the UTF-16 vs UCS4 battle?
 
  Forget it: nearly nobody uses UCS-4 except very internally for string
  processing at the character level. For whole strings, nearly everybody
  uses
  UTF-16 as it performs better with less memory costs, and because UCS-4 is
  not needed.

I don't think that is a correct statement. I would like to use UTF-16,
but it is clear that it is not used in all cases:

1. Some people on this list prefer UCS4. (Raise your hand if you do.)
2. wchar_t in Linux's glibc is UCS4 (and that is not "nearly nobody").
3. Because of 2, gconv on Linux uses UCS4.
4. FontConfig uses UCS4 in the API it provides for Xft (see
FcFreeTypeCharIndex in fcfreetype.h).
5. Xft internally uses UCS4 (look at xftdraw.c, xftrender.c), and some
of Xft's APIs use UCS4 (not all): XftTextExtents32, XftDrawString32,
XftTextRender32, XftTextRender32BE, XftTextRender32LE,
XftDrawCharSpec, XftCharSpecRender, XftDrawCharFontSpec,
XftCharFontSpecRender.
6. gunichar in Linux's GLib is UCS4.
7. Because of 6, Pango uses UCS4 in its Unicode API.


 
  Handling surrogates found in surrogates is quite simple and in fact it is
  even simpler to detect and manage than handling MBCS-encoded strings for
  Asian 8-bit applications, and today MBCS 8-bit processing is performed by
  transforming it first into equivalent internal 16-bit code positions, or
  sometimes by transcoding it to Unicode with UTF-16.
 
  So I do think that applications that could handle East-Asian DBCS
  8-bit text
  (EUC-*, ISO2022-*, JIS) can very easily be modified to work internally
  with
  UTF-16 (notably because interoperability of Unicode code points with
  these
  DBCS charsets is excellent as the transcoding is not ambiguous,
  bijective,
  does not need code reordering, and just consists in a simple mapping
  table
  implemented now in all OSes localized for Asian markets).
 
  East-Asian developers have learned since long how to cope with
  DBCS-encoded
  strings. Now with UTF-16, handling surrogates found in string is even
  simpler, as UTF-16 allows bidirectional and random access to any
  positions
  in strings, which means additional performance and less tricky algorithms
  for text processing...

Agreed. It is simpler to handle surrogates than to handle multibyte
encodings.

Now the question is: if it is so simple to handle surrogates, why don't
we address that later, and put higher priority on other i18n issues
which are harder to address and more critical if not implemented (such
as handling non-shortest-form UTF-8, which may lead to security
problems)?
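Since non-shortest-form handling is the security-critical part, here is a
minimal sketch of a checking UTF-8 decoder (illustration only; the
function name and structure are mine, not ConvertUTF.c's). It rejects
truncated input, overlong (non-shortest) forms such as 0xC0 0xAF,
surrogate code points, and values above U+10FFFF:

```python
def decode_utf8_checked(data: bytes):
    """Decode the first UTF-8 sequence in `data`; return the code point,
    or None if the sequence is invalid under RFC 3629."""
    if not data:
        return None
    b = data[0]
    if b < 0x80:                      # 1-byte (ASCII)
        return b
    elif 0xC2 <= b <= 0xDF:           # 2-byte lead; 0xC0/0xC1 are always overlong
        cp, need = b & 0x1F, 1
    elif 0xE0 <= b <= 0xEF:           # 3-byte lead
        cp, need = b & 0x0F, 2
    elif 0xF0 <= b <= 0xF4:           # 4-byte lead; 0xF5-0xFF never start a sequence
        cp, need = b & 0x07, 3
    else:
        return None
    if len(data) < need + 1:
        return None                   # truncated
    for t in data[1:need + 1]:
        if t & 0xC0 != 0x80:          # trail bytes must be 10xxxxxx
            return None
        cp = (cp << 6) | (t & 0x3F)
    if cp < (0, 0x80, 0x800, 0x10000)[need]:
        return None                   # overlong (non-shortest form)
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        return None                   # surrogate or out of range
    return cp
```

The classic attack is the overlong encoding of '/', 0xC0 0xAF, which a
lax decoder turns into U+002F after a security filter has already
scanned the raw bytes.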


-- 
--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM: yungfongta   mailto:[EMAIL PROTECTED]   Tel: 650-937-2913
Yahoo! Msg: frankyungfongtan





Re: creating a test font w/ CJKV Extension B characters.

2003-12-01 Thread Frank Yung-Fong Tang

As far as I remember, even when IE could render GB18030, it still handled
multibyte characters crossing TCP block boundaries poorly. For example, if
a 4-byte GB18030 sequence straddles a TCP block boundary (4K? 8K?), it
gets trashed.



Andrew C. West wrote:

  On Mon, 24 Nov 2003 10:12:52 +, [EMAIL PROTECTED] wrote:
  
   Even with the registry changes that allow Uniscribe to work with such
   characters?
 
  Oops, my mistake. I had forgotten that I had deliberately deleted the
  registry
  settings that control how IE deals with surrogate pairs sometime ago
  in order to
  prove a point (that IE won't display surrogate pairs without them ?).
  Anyway,
  restore the registry to its original state and Frank's page displays
  OK without
  any tweaking whatsoever - both NCR and GB18030 encoded CJK-B
  characters render
  correctly with my preferred CJK-B font.
 
  To install the registry keys necessary for IE to display surrogate
  pairs simply
  copy the code below to a file named something.reg and double-click
  on it.
  Replace Code2001 with the name of your preferred Supra-BMP font if
  necessary.
 
  Windows Registry Editor Version 5.00
 
  [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows
  NT\CurrentVersion\LanguagePack]
  "SURROGATE"=dword:00000002
 
  [HKEY_CURRENT_USER\Software\Microsoft\Internet
  Explorer\International\Scripts\42]
  "IEFixedFontName"="Code2001"
  "IEPropFontName"="Code2001"
 
  Andrew
 

-- 
--
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM: yungfongta   mailto:[EMAIL PROTECTED]   Tel: 650-937-2913
Yahoo! Msg: frankyungfongtan

John 3:16 For God so loved the world that he gave his one and only Son,
that whoever believes in him shall not perish but have eternal life.

Does your software display Thai language text correctly for Thailand users?
- Basic Concept of Thai Language, linked from Frank Tang's
Iñtërnâtiônàlizætiøn Secrets
Want to translate your English text to something Thailand users can
understand?
- Try English-to-Thai machine translation at
http://c3po.links.nectec.or.th/parsit/





Re: How can I have OTF for MacOS

2003-12-01 Thread Frank Yung-Fong Tang


John Jenkins wrote:

 
  On Nov 26, 2003, at 7:26 AM, [EMAIL PROTECTED] wrote:
 
  
   But what about devnagri or Bangla.
  
 
  Devanagari and Bangla cannot be supported on Mac OS X through QuickDraw
  text rendering.  Since Office on the Mac is currently restricted to
  QuickDraw text rendering, it cannot support them.
 
  
  John H. Jenkins

John, what 'cmap' format does Apple use in the Mac OS X
Devanagari and Bangla fonts?







RE: MS Windows and Unicode 4.0 ?

2003-12-01 Thread Frank Yung-Fong Tang


Carl W. Brown wrote:

  Jill,
 
   I know that Unicode does have some
   locale-sensitive case mappings (Turkish
   uppercase I to dotless lowercase
   ı, for example); I was under the impression
   that ss to ß was not one of them.
 
  You are correct that SS and ß are the same in case-insensitive
  compares
  regardless of locale.


But whether the MS file system is case-insensitive in the Unicode sense is
a different question from whether the MS file system supports Unicode,
right?

So... "Necessary" will be the same as "Neceßary" in a case-insensitive
comparison?
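For what it is worth, Unicode full case folding does treat ß and ss as
equal. A quick illustration using Python's str.casefold() (a sketch; this
is locale-independent full folding, not the NT file system's own tables),
which also shows why İstanbul/Istanbul do not fold together without
Turkish tailoring:

```python
import unicodedata

# Full case folding maps U+00DF (ß) to "ss", so these compare equal:
assert "Necessary".casefold() == "Neceßary".casefold() == "necessary"

# But U+0130 (İ) folds to "i" + U+0307 COMBINING DOT ABOVE, so without
# Turkish-specific tailoring these do NOT compare equal:
assert "İstanbul".casefold() != "Istanbul".casefold()
assert [unicodedata.name(c) for c in "İ".casefold()] == [
    "LATIN SMALL LETTER I", "COMBINING DOT ABOVE"]
```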


 
  I also think that İstanbul and Istanbul should also compare the
  same for
  things like keyword searches and file systems even though it is
  technically
  incorrect.
 
  Carl







Re: MS Windows and Unicode 4.0 ?

2003-12-01 Thread Frank Yung-Fong Tang


Michael (michka) Kaplan wrote:

  To answer the original question, support of Unicode in *any* version of
  Windows (or indeed any operating system) is between 1.1 and 4.0,
  depending
  on what feature you are looking at. To answer such a question, the
  specific
  feature about which the questioning party is thinking must be given as a
  part of said question.

Oh... really? What kind of Unicode support was in Windows 2.0 (since you
said *any*)? ... No, I don't really care. Don't try to answer me.






Re: Request

2003-11-21 Thread Frank Yung-Fong Tang


Markus Scherer wrote:

  Ritu Malhotra wrote:
   I would like to know that I am currently working with a hindi
  software. In
   this scenario the complete software is working on the basis of ISCII
  code.
   Now in my software itself I want to give support for a unicode font for
   devnagari Script(mangal). How do I go about doing this. ...
 
  You may need to convert from ISCII to Unicode and then use the Unicode
  text for display. ICU has an
  ISCII converter: http://oss.software.ibm.com/icu/
 
  markus

Does the ICU ISCII converter handle the ATTRIBUTE code in ISCII (as
defined in Annex E of IS 13194:1991, page 20) to switch between scripts?
ATR = 0xEF in ISCII
0xEF 0x42 switches to the Devanagari script
0xEF 0x43 switches to the Bengali script
etc...

Not saying I like that part of ISCII or that such support is needed. I
just want to know how complete the ICU converter is when dealing with this
weird specification, ISCII. (If you don't think it is weird, look at the
E-1 Display Attributes section in Annex E of ISCII, which is worse than
the E-2 Font Attributes I mentioned here.)






Re: creating a test font w/ CJKV Extension B characters.

2003-11-20 Thread Frank Yung-Fong Tang

So, in summary, what is your conclusion about the quality of GB18030
support in IE6 on Win2K? And if you run the same test on Mozilla /
Netscape 7.0, what is your conclusion about that quality of support?


Andrew C. West wrote:

  On Thu, 20 Nov 2003 01:32:16 +, [EMAIL PROTECTED] wrote:
  
   Frank Yung-Fong Tang wrote,
If you visit
   
  http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596
and your machine have surrogate support install correctly and
  surrogate
font install correctly then you should see surrogate characters
  show up
match the gif.
  
   It isn't working, but I have surrogate support and a font correctly
   installed.
  
 
  Using W2K and IE6, if you have a CJK-B font configured for User Defined
  scripts under the Options : Fonts settings, and manually select the
  encoding
  for the page as User Defined, then the second CJK-B character in
  each box
  (just above the gif image) displays just fine.
 
  The top character in each box appears to be encoded as GB-18030 (e.g.
  GB-18030 0x95328236 = U+20000), and the second character is encoded as
  hex NCR values (e.g. &#x20000; for U+20000).
 
  If GB-18030 is selected as the encoding for the page (as explicitly
  given in the
  file), then IE won't display the CJK-B characters correctly (even if you
  configure a CJK-B font as your default font for displaying Chinese),
  but you can
  copy and paste them to a Unicode editor, where both the GB-18030 and
  NCR encoded
  forms of CJK-B characters will display correctly with an appropriate
  CJK-B font.
 
  If User Defined is selected as the encoding for the page (either
  manually or by
  changing the meta tag in the file to charset=x-user-defined), then the
  GB-18030 encoded characters turn to gunk, but the NCR representations are
  displayed using whatever font you have configured for user defined
  scripts, and
  if that is a CJK-B font then hey presto !
 
  Andrew
 






Re: creating a test font w/ CJKV Extension B characters.

2003-11-20 Thread Frank Yung-Fong Tang
James:
I think the first thing you need to make sure is that you properly
installed all of the following:
a. you are running on W2K or WinXP
b. you installed the surrogate support from Microsoft if you are running
Win2K
c. you configured your font in the IE font preferences

Try the following:
1. Open Notepad.
2. Select the surrogate font.
3. Open Netscape 7 or Mozilla.
4. View the URL I gave you in Mozilla.
If you did a and b (even without c) you should see the Chinese text there.
5. Click the [text] link.
6. Copy and paste the text into Word XP.
7. Do you see it correctly in Word XP?
8. Do the same thing and paste it into Notepad.
If that doesn't show it, then the problem really is that you didn't
install things correctly.

[EMAIL PROTECTED] wrote:

  .
  Andrew C. West wrote,
 
   Using W2K and IE6, if you have a CJK-B font configured for User
  Defined
   scripts under the Options : Fonts settings, and manually select the
  encoding
   for the page as User Defined, then the second CJK-B character in
  each box
   (just above the gif image) displays just fine.
 
  Yes.  The page was downloaded and heavily tweaked off line.
 
  First I substituted a decimal numeric character reference for one
  of the hexadecimal entries.  No dice.
 
  I did a couple of other tricks to no avail.
 
  I removed the GB character set declaration and tried to manually
  set the [View] to user defined.  The page loaded again, but didn't
  display, checking the [View] showed that the page was still being
  loaded as UTF-8!  Tried it again and again.
 
  At this point, *I* was heavily tweaked, so I didn't even try to
  insert 'x-user-defined' character set into the HTML header.
  I just went back on line and opened the page successfully with
  a different browser.
 
  Best regards,
 
  James Kass
  .
 






Re: creating a test font w/ CJKV Extension B characters.

2003-11-20 Thread Frank Yung-Fong Tang


Michael (michka) Kaplan wrote:

  From: Frank Yung-Fong Tang [EMAIL PROTECTED]
 
   so.. in summary, how is your concusion about the quality of GB18030
   support on IE6/Win2K ? If you run the same test on Mozilla / Netscape
   7.0, what is your conclusion about that quality of support?
 
  In Summary?
 
  Well, in summary, I fail to see how testing for NCRs has anything to do
  with
  support of *any* encoding in a browser. It seems like an inadequate
  test of
  the functionality of gb18030 support.
 
  If you want to test gb18030 support, then please encode a web page in
  gb18030 and test *that* in the browser of your choice.

Have you ever looked at that page before you said this? Or the HTML source
of that page?
Those pages display 5 pieces of information for each character [for BMP
characters, less information is shown]:
1. The GB18030-encoded value in hex. The hex values of the first two bytes
are displayed at the top of the page for the plane-2 characters. The hex
value of the third byte is displayed to the left of each row. The 4th byte
is displayed at the top of each column.
2. The 4 bytes of the character, encoded in GB18030.
3. The same character encoded using a hex escape in HTML, as &#xhhhh;.
4. An IMG pointing to the image on www.unicode.org.
5. The equivalent Unicode hex value, displayed as U+ at the bottom.

Ideally, if the browser does things right and the font is installed, the
tester can compare 2, 3, and 4 to see what happens.
Therefore, the page can be used to test BOTH NCRs and GB18030.
If 2 displays differently from 4 (assuming the server is up and running
and you do see the glyph in the gif), it means the converter has a
problem.
If 3 displays differently from 4 (assuming the gif can be viewed), it
means your HTML parser has a problem.
If both 2 and 3 display differently from 4, then both could have problems,
the rendering engine itself could have a problem, or all of them could.

Of course, you don't really need the img part; you can compare with the
Unicode 4.0 standard yourself. But my tool was written a year before I got
my hardcopy of the Unicode 4.0 standard, so the image helps us QA.

If you SAVE the page locally and then look at the result, note that the
save operation may already have damaged the page.

And YES, I DO encode that page in GB18030, using raw bytes. I DID add
ADDITIONAL information, encoded as NCRs and imgs, to help with
verification. You may have missed the real GB18030-encoded characters if
you did not pay close attention.

 
  Now if you want to discuss NCR support then that may also be interesting.
  But it would be nice to have tests that actually cover what they claim to
  cover
I do have an actual claim about what it covers, and it covers more than
that. The problem is that you looked at the additional part, which is
beyond what I claimed in the last email.

 
  MichKa [MS]
  NLS Collation/Locale/Keyboard Development
  Globalization Infrastructure and Font Technologies







Re: UTF-16 inside UTF-8

2003-11-19 Thread Frank Yung-Fong Tang

Dear Doug:
Thank you for your reply. What you said about how to do it is exactly
what should be done. The point of asking those questions was not to seek
an answer; instead, I just wanted to show from the answers that adding
surrogate support is not free.

You wrote earlier:
  For UTF-8 in particular, I can't imagine why
  one would choose to implement the 1-, 2-, and 3-byte forms in one stage
  and add the 4-byte forms in a later stage.

Can you imagine now? The tasks listed below are additional tasks that
people need to perform before they add 4-byte UTF-8. They don't need that
part if they support only 2- or 3-byte UTF-8. That does not imply they
should not add 4-byte support. It only means that people who want to add
the support need to plan extra tasks and time for it. All of the
following tasks cause it to come later. The "later" could be 1 day, it
could be 1 week, it could be one milestone (from alpha 1 to alpha 2). But
the fact that the developers do need to spend effort on those tasks makes
it late.

One real example I found recently is Tcl. Tcl has had so-called UTF-8
support since 8.1. But if you look at the implementation of Tcl 8.4.4
(from http://www.tcl.tk) you will find that the UTF-8 implementation:
a. does not align with the Unicode 3.2/4.0 or RFC 3629 definition, and
accepts non-shortest forms;
b. by default does not accept 4-byte UTF-8;
c. accepts 4-, 5-, and 6-byte UTF-8 only if a certain compile-time flag
is turned on: TCL_UTF_MAX (default 3, can be set to 4, 5, or 6);
d. has no documentation mentioning surrogates;
e. uses unsigned int for Tcl_UniChar if TCL_UTF_MAX is 4 to 6, and
unsigned short if TCL_UTF_MAX is 3 (which looks like a very, very bad
decision);
f. offers no way to use UTF-16 internally while accepting 4-byte UTF-8:
you can either use up to 3-byte UTF-8 with UTF-16 internally, or support
up to 6 bytes (which is wrong; it should stop at 4) with UTF-32 (not
really) internally;
g. actually outputs CESU-8, not UTF-8, whenever the internal UTF-16
(TCL_UTF_MAX = 3 or undefined, the default) contains a surrogate pair.

If you still think adding 4-byte UTF-8 support is less than 1% of the
task, then please join the Tcl project and help me fix that. I would
appreciate your efforts there, and I believe a lot of people will thank
you for your contribution.
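To make point (g) concrete, here is a sketch (a hypothetical helper
written for illustration, not Tcl code) of what CESU-8 output looks like
for a supplementary character: each UTF-16 surrogate is encoded as its own
3-byte sequence, instead of one 4-byte UTF-8 sequence.

```python
def cesu8(s: str) -> bytes:
    """Encode each UTF-16 code unit (including surrogates) as if it were
    an independent BMP scalar -- this is CESU-8, not UTF-8."""
    out = bytearray()
    units = s.encode("utf-16-be")
    for i in range(0, len(units), 2):
        u = (units[i] << 8) | units[i + 1]
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)

# U+20050 is the surrogate pair D840 DC50 in UTF-16:
assert cesu8("\U00020050") == b"\xed\xa1\x80\xed\xb1\x90"   # 6 bytes (CESU-8)
assert "\U00020050".encode("utf-8") == b"\xf0\xa0\x81\x90"  # 4 bytes (UTF-8)
```

A strict RFC 3629 decoder must reject the 6-byte form, since 0xED 0xA1
0x80 decodes to a surrogate code point.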

Doug Ewell wrote:

  Frank Yung-Fong Tang YTang0648 at aol dot com wrote:
 
  What you do is, you go through the exact same process that API vendors
  have had to go through since the beginning of multibyte character sets.
  That is, you decide whether your API returns code units or characters,
  you publicize that decision, and you stick to it.  If the decision means
  you have a function that isn't terribly useful, you have to define a new
  function that does the right thing, and leave the old function on the
  mountain to die.
 
  To cite a non-Unicode example, in ECMAScript (née JavaScript) there is a
  function Date.GetYear() that was intended to return the last two digits
  of the year but actually returned the year minus 1900.  Of course,
  starting in 2000 the function returned a value which was useful to
  practically nobody.  Did Sun or ECMA change the definition of
  Date.GetYear()?  No, they introduced a new function, Date.GetFullYear(),
  which does what users really want.
 
  Same thing here: you can't change the 16-bit UniChar, so you'll have to
  declare that your old functions that return a UniChar are defined as
  returning UTF-16 code points, and you'll probably want to define a new
  UniChar32 type and functions like:
 
  UniChar32 ToLower(UniChar32 aChar)
 
  that do the obvious right thing.
 
  And I'm sorry, I know some people will cringe when I say this, but if
  you're like me and get to define your own UniChar data type, you've
  been making it 32 bits wide since about 1997.

That doubles your memory cost and size compared to UTF-8, and quadruples
it for ASCII data. Changing the implementation of an ASCII-compatible
application to UTF-16 is already hard, since people who only care about
ASCII are upset that all their data doubles in size. It is already a hard
battle most of the time for someone like me. If we tell them to change to
UCS-4, that means not x2 but x4 the memory. For web services or
applications which spend multiple millions on memory and databases, it
means adding millions of dollars to their cost. They may accept adding
some millions of cost to support international customers by using UTF-16;
they are probably not willing to add multiple millions of dollars of cost
to change to UCS-4. In fact, there are people who have proposed to store
UTF-8 in a hacky way in the database, instead of using UTF-16 or UCS-4,
to save cost. They have to add restrictions on using the API and build
upper-level APIs to do conversion and hacky operations. That introduces
some fixed development cost (not dependent on the size of the data) to
the project, but it saves millions of dollars
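The size argument can be made concrete (a sketch; byte counts for the same
ASCII string in each encoding form):

```python
text = "Hello, world"                      # pure-ASCII sample, 12 characters
utf8  = len(text.encode("utf-8"))          # 1 byte per ASCII character
utf16 = len(text.encode("utf-16-be"))      # 2 bytes per BMP character
utf32 = len(text.encode("utf-32-be"))      # 4 bytes per character
assert (utf8, utf16, utf32) == (12, 24, 48)
```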

Re: Problems encoding the spanish o

2003-11-19 Thread Frank Yung-Fong Tang

One thing may help you to think about this kind of issue is my 'under 
constrution paper - Frank Tang's List of Common Bugs that Break Text 
Integrity  http://people.netscape.com/ftang/paper/textintegrity.html
I am going to present a newer revsion in the coming IUC25 if they accept 
my proposal.

It looks like the 4 bytes "ón M" got changed to the two code units U+DB7A
and U+DC0D, which form a surrogate pair in UTF-16.

Here is what I think happened:
1. The text "...ización Map..." is output from process A and passed to a
process B; the bytes are encoded in ISO-8859-1, so the 4 bytes "ón M" are
encoded as 0xF3, 0x6E, 0x20, 0x4D.
2. Somehow process B thinks the incoming data is UTF-8 instead of
ISO-8859-1. You can find some possible causes as hints in my paper (URL
above).
3. Process B tries to convert the data stream to UTF-16 using the UTF-8
to UTF-16 conversion rules. However, the UTF-8 scanner in the converter
is not well written. It implements the conversion in the following way:
3.a. It hits the byte 0xF3, looks it up in a table, and notices that 0xF3
in a legal UTF-8 sequence is the first byte of a 4-byte UTF-8 sequence.
3.b. It decodes that 4-byte UTF-8 sequence without checking the values of
the next 3 bytes 0x6E, 0x20, 0x4D. It blindly assumes these bytes are the
2nd, 3rd and 4th bytes of the UTF-8 sequence. It first needs to compute
the UCS-4 value; what it does is

m1 = byte1 & 0x07
m2 = byte2 & 0x3F
m3 = byte3 & 0x3F
m4 = byte4 & 0x3F

in your case, what it gets is

m1 = 0xF3 & 0x07 = 0x03
m2 = 0x6E & 0x3F = 0x2E
m3 = 0x20 & 0x3F = 0x20
m4 = 0x4D & 0x3F = 0x0D

[Notice the problem: such an algorithm does not check that byte2, byte3
and byte4 are in the range 0x80-0xBF at all. One possibility is that it
does not check in the code. The other possibility is that the code does
value checking but messes it up by comparing a (char) value with an
(unsigned char) value using < and >.

What I mean is the following:

#include <stdio.h>

int main(void)
{
    char a = 0x23;
    printf("a is %x ", a);
    if (a > (char)0x80)   /* 0x80 as a signed char is -128, so this is true */
        printf("and a is greater than 0x80\n");
    else
        printf("and a is less than or equal to 0x80\n");
    return 0;
}

sh% ./b
a is 23 and a is greater than 0x80
]

Then it calculates the UCS-4 value using

ucs4 = (m1 << 18) | (m2 << 12) | (m3 << 6) | (m4 << 0);

in your case, what it gets is

ucs4 = (0x03 << 18) | (0x2E << 12) | (0x20 << 6) | (0x0D << 0)
     = 0xC0000 | 0x2E000 | 0x800 | 0x0D = U+EE80D

3.c. Now it turns that UCS-4 value into UTF-16:
surrogate high = ((ucs4 - 0x10000) >> 10) | 0xD800
  = ((0xEE80D - 0x10000) >> 10) | 0xD800
  = (0xDE80D >> 10) | 0xD800
  = 0x037A | 0xD800
  = 0xDB7A
surrogate low = ((ucs4 - 0x10000) & 0x03FF) | 0xDC00
  = ((0xEE80D - 0x10000) & 0x03FF) | 0xDC00
  = (0xDE80D & 0x3FF) | 0xDC00
  = 0x0D | 0xDC00
  = 0xDC0D
so now you have the UTF-16 sequence DB7A DC0D
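The whole mis-decoding can be replayed in a few lines (a sketch
reproducing the buggy decoder's arithmetic, not any particular product's
code):

```python
# The bytes of "ó", "n", " ", "M" in ISO-8859-1, misread as one
# 4-byte UTF-8 sequence led by 0xF3:
b1, b2, b3, b4 = 0xF3, 0x6E, 0x20, 0x4D
ucs4 = ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) | ((b3 & 0x3F) << 6) | (b4 & 0x3F)
assert ucs4 == 0xEE80D

# UTF-16 encoding of the bogus code point:
hi = ((ucs4 - 0x10000) >> 10) | 0xD800
lo = ((ucs4 - 0x10000) & 0x3FF) | 0xDC00
assert (hi, lo) == (0xDB7A, 0xDC0D)
assert (hi, lo) == (56186, 56333)   # the two values seen in the broken NCRs
```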

4. Now process B (or some other code) tries to convert the UTF-16 into
HTML NCRs. Unfortunately, that process does not handle the UTF-16 to NCR
conversion correctly. Instead of doing it the right way, as below:
4.a. take DB7A DC0D and convert it to the UCS-4 value 0xEE80D
4.b. convert 0xEE80D to decimal, 976909, and generate &#976909;

it converts DB7A to decimal 56186 and generates &#56186;, and then
converts DC0D to decimal 56333 and generates &#56333;.

So, in summary, there are 3 problems in your system, not just 1:
Problem 1: Process A converts data to ISO-8859-1 while process B expects
UTF-8. You should either fix process A to generate UTF-8 or fix process B
to treat the input as ISO-8859-1. The former is the preferred approach.
Problem 2: The UTF-8 converter in process B does not strictly implement
the requirement in RFC 3629, which says it MUST protect against decoding
invalid sequences. With a converter of this quality, a non-ASCII byte at
the end of a line will probably cause your software to fold the line, and
at the end of a record it may even crash your software. You need to fix
the converter's scanning part.
Problem 3: The UTF-16 to NCR conversion is incorrect according to HTML.
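A sketch of the correct step 4: recombine the surrogate pair into one code
point before generating the NCR:

```python
hi, lo = 0xDB7A, 0xDC0D
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert cp == 0xEE80D == 976909
assert "&#%d;" % cp == "&#976909;"   # one NCR, not two
```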

Hope the above analysis helps.


pepe pepe wrote:

  Hello:
 
  We have the following sequence of characters "...ización Map...", which
  is the same as "...izaci&#243;n Map..."; after suffering some
  transformations it becomes "...izaci&#56186;&#56333;ap...".
  As you can see, the two characters 56186 and 56333 seem to represent
  the sequence "ón M". Any idea?
 
  Regards,
  Mario.
 
 
 


Re: What does i18n mean?

2003-11-19 Thread Frank Yung-Fong Tang




Of course, I was just joking.

The answer is: read http://www.i18nguy.com/origini18n.html
and notice the spelling of "internationalization" (US spelling) and
"internationalisation" (UK spelling) in that doc. The reason to
abbreviate is not just that the word is too long for Asian engineers like
me to memorize, but also that UK and US people often spell it
differently, which confused Asian engineers like me.


[EMAIL PROTECTED] wrote:



  
  
  In a message dated
11/14/2003 2:34:26 PM Pacific Standard Time, [EMAIL PROTECTED]
writes:
  
   what does i18n mean? I
see it bandied about a lot.
  
  
  It is a short hand for "Iñtërnâtiônàlizætiøn" because it is too hard
for most of the people to type the "ñtërnâtiônàlizætiø" part. :) [and if
your software can save that string and retrieve it correctly later, 50%
of the i18n problem is addressed]
  
  
  









Re: creating a test font w/ CJKV Extension B characters.

2003-11-19 Thread Frank Yung-Fong Tang
Why don't you find a font which already supports it?
You can find some info here:
http://www.microsoft.com/globaldev/DrIntl/columns/015/default.mspx

It is not that easy to go from "don't know beans about fonts" to creating
a test font that contains \u20050. If you are lucky, it will take you
several months, if not a year. There are commercial font tools, but I am
not sure whether they support 32-bit cmaps (probably not). You can start
from http://www.microsoft.com/typography/users.htm, but I think it will
take you a while; you need the 32-bit cmap support in OpenType to add
U+20050, and I don't know which commercial tools currently support that.


Ostermueller, Erik wrote:

  Hello all,
 
  I'd like to create a test font that contains a
  a standard US Latin alphabet and the following characters:
 
  \u5000
  \u20050
 
  We need this for testing a software app that supports GB18030.

If you want GB18030 test data, one thing you can do is visit my GB18030
test page at
http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=10

That is what I designed to test Mozilla's GB18030 support. The page
number and the layout match exactly the paper copy of GB18030, so you
can do a screen-to-paper comparison. I added additional pages on the web
(page 284 and later) which are not in the hardcopy of GB18030. If you
visit
http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596
and your machine has surrogate support installed correctly and a
surrogate font installed correctly, then you should see surrogate
characters show up matching the gif. If you click the [Text] link in the
upper left corner, it will open a new window and put that GB18030 text in
plain-text format.
Good luck.
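For reference, the four-byte GB18030 range maps linearly onto the
supplementary planes (0x90308130 corresponds to U+10000, one code point
per step), which is the arithmetic behind pairings such as 0x95328236 =
U+20000 on the test page. A sketch:

```python
def gb18030_4byte_to_unicode(b1, b2, b3, b4):
    # Digit weights: b2 and b4 run over 10 values (0x30-0x39),
    # b3 over 126 values (0x81-0xFE).
    linear = (((b1 - 0x90) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)
    return 0x10000 + linear

assert gb18030_4byte_to_unicode(0x90, 0x30, 0x81, 0x30) == 0x10000
assert gb18030_4byte_to_unicode(0x95, 0x32, 0x82, 0x36) == 0x20000
```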

  My main problem is that I don't know beans about fonts.
  Could someone recommend a good tutorial or 'font creator' application
  that addresses surrogate pairs?
 
  Thanks,
 
  Erik Ostermueller
 






Re: creating a test font w/ CJKV Extension B characters.

2003-11-19 Thread Frank Yung-Fong Tang
Are you using Netscape 7 / Mozilla, or IE?
If you use IE, then IE may have a bug there.
I think Mozilla should not have the problem, since I developed and tested
it myself.

[EMAIL PROTECTED] wrote:

  .
  Frank Yung-Fong Tang wrote,
 
   If you visit
  
  http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596
   and your machine have surrogate support install correctly and surrogate
   font install correctly then you should see surrogate characters show up
   match the gif.
 
  It isn't working, but I have surrogate support and a font correctly
  installed.

Are you running on XP or 2K? Did you install all the necessary surrogate
support? Did you tweak your font preferences to use the surrogate font
for Chinese pages?


 
  The page looks like it is calling for Unicode characters to display,
  example #x2;, but the HTML header says GB-18030 for the characters
  set.  Could this be the problem, or are Unicode and GB18030 matched
  for plane two and for HTML numeric characters references?
It should not matter. But again, it could be a bug in IE.

 
  Each single Plane Two character is displaying as two missing glyphs,
  if that is an extra clue.
 
  Best regards,
 
  James Kass
  .







Re: creating a test font w/ CJKV Extension B characters.

2003-11-19 Thread Frank Yung-Fong Tang


Philippe Verdy wrote:

  From: Frank Yung-Fong Tang [EMAIL PROTECTED]
   It is not that easy for you from don't know beans about fonts to
   creat a test font that contains ... \u20050. If you are lucky, it
  will
   take you several month if not year. There are commercial base font
  tool.
   But I am not sure they support 32 bits cmap or not (probably not).
 
  According to:
  http://www.microsoft.com/typography/otspec/cmap.htm
 
  The so-called Microsoft Unicode cmap format 4 (platform id=3, encoding
  id=1) is the one recommended for all fonts, except those that need to
  encode supplementary planes.
 
  Format 0 is deprecated (it was used to map 8-bit encodings to glyph ids),
  as is now format 2 (used to map DBCS encodings with lead/trail bytes
  in East Asia, as a mix of 8- and 16-bit codes).
 
  For supplementary planes, as in a font built to support GB18030, cmap
  format 12 must be used instead, with the same platform id but the
  encoding id 10 (UCS-4).
 
  Format 8 is used to create a mix of 16-bit and 32-bit maps (with the
  assumption that no 16-bit Unicode character has the same code
  point as the high 16 bits of a character outside the BMP, meaning that it
  works as long as no glyph is assigned both to a Unicode code point X
  in the BMP and to code points between X*0x10000 and (X+1)*0x10000 - 1).
  This compresses the size of the cmap a bit.
 
  Format 10 is not portable, unlike format 12, which must be provided in
  addition to the recommended format 4 for characters present in the
  BMP. In practice, format 12 is used mostly for GB18030 support, and is
  supported by Windows 2000 and later. So you won't have to wait for years
  to create a GB18030 font, using UCS-4 mappings...

Which font tools currently support generating TTF with a format 12 cmap? While 
it is true that the font format and application software (such as the Mozilla 
code I wrote, WinXP, Office XP, etc.) are ready to deal with it, few font 
tools that I know of can create a TTF with format 12, and fewer still are 
designed for someone who claims he "doesn't know beans about fonts" to create 
"a test font that contains ... \u20050" today.







Re: creating a test font w/ CJKV Extension B characters.

2003-11-19 Thread Frank Yung-Fong Tang


John Jenkins wrote:

 
   Nov 19, 2003 10:30 PM Ostermueller, Erik 
 
   Could someone recommend a good tutorial or 'font creator' application
   that addresses surrogate pairs?
  
 
  FontLab is probably the best cross-platform font creation software out
  there, although it's not cheap.  Cheaper solutions are to be found IIRC
  on Windows, and there's .
Does FontLab support generating TTF with a format 12 (32-bit) cmap?
Which of the cheaper solutions can generate TTF with a format 12 (32-bit) cmap?

  If you're on a Mac, Apple's font tool suite
  (http://developer.apple.com/fonts/) is free and lets you add non-BMP
  support to fonts.
Can you point out which document and chapter among those docs talks 
about what we need to do to add non-BMP characters?

Which of the following Mac OS X font tools should be used for that purpose?
# ftxanalyzer
# ftxdiff
# ftxdumperfuser
# ftxenhancer
# ftxinstalledfonts
# ftxruler
# ftxvalidator

 
  
  John H. Jenkins
  [EMAIL PROTECTED]
  [EMAIL PROTECTED]
  http://homepage.mac.com/jhjenkins/
 
 






Re: How can I input any Unicode character if I know its hexadecimal code?

2003-11-17 Thread Frank Yung-Fong Tang

Hmm, a very stupid (but working) way:
1. Use vi.
2. Type &#x + the hex code point + ; for each character.
3. Save it as .html.
4. Open the file in a browser.
5. Copy the text.
6. Paste it into your software.
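The steps above can also be scripted. This is a minimal sketch of mine (not from the original mail): it writes an HTML file full of numeric character references for a list of sample code points, ready to open in a browser and copy from.

```python
# Sketch of the vi trick: write &#xNNNN; numeric character references
# to an HTML file, then open it in a browser and copy/paste the glyphs.
# The code points passed in below are just sample values.
def ncr_html(codepoints):
    body = "".join("&#x%X;" % cp for cp in codepoints)
    return ('<html><head><meta http-equiv="Content-Type" '
            'content="text/html; charset=utf-8"></head>'
            "<body>%s</body></html>" % body)

if __name__ == "__main__":
    # An Arabic letter, the euro sign, and a CJK Extension B character
    html = ncr_html([0x0633, 0x20AC, 0x20050])
    with open("chars.html", "w", encoding="ascii") as f:
        f.write(html)
```

Because only ASCII ever reaches the file, this works in any editor regardless of its Unicode support.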







Re: newbie 18030 font question

2003-04-03 Thread Yung-Fong Tang
We added GB18030 support to Mozilla, and also added 32-bit cmap support on 
Windows, about a year ago. The Linux and Mac 32-bit cmap 
support is a little bit behind.

I think we first shipped GB18030 encoding support in Netscape 6.2.
You should be able to see the characters in Netscape 7 if your 
system has a font which contains the glyphs.

Try the following test page

http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=10

It is coded according to the hard copy of the GB18030 spec. (And I also added 
more pages beyond the GB18030 spec to test the non-BMP part.)
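As a quick sanity check of the transcoding this thread discusses, here is an illustrative sketch of mine (not from the original mail; it relies on Python's built-in gb18030 codec) showing that a non-BMP character such as U+20050 round-trips losslessly between GB18030 and Unicode:

```python
# GB18030 is a full mapping of Unicode, so any code point -- including
# Plane 2 characters like U+20050 -- has a 4-byte GB18030 form and
# converts losslessly to and from Unicode/UTF-8.
def roundtrip_gb18030(text):
    gb = text.encode("gb18030")   # GB18030 byte sequence
    back = gb.decode("gb18030")   # back to Unicode
    return gb, back

if __name__ == "__main__":
    char = "\U00020050"           # CJK Extension B character
    gb, back = roundtrip_gb18030(char)
    print(len(gb), back == char)  # non-BMP chars use a 4-byte sequence
```

This is why a transcode-at-the-server design (GB18030 at the browser, Unicode in the database) loses nothing.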



[EMAIL PROTECTED] wrote:

Hello, all.

I'm new to 18030 and was hoping that someone could verify this.
We're implementing a browser-delivered database application and would
like to support 18030.
One fairly straightforward way of implementing this
seems to be to accept 18030 at the browser
and then transcode to Unicode when the
data first reaches the server.
When sending data back to the browser,
we'd transcode back to 18030.
OK so far, right?

Unicode fonts don't support all characters in 18030, correct?
Let's assume our client makes use of 18030 characters not in unicode fonts.
What font could we use for a 3rd party reporting tool 
that read data straight from the unicode db, bypassing our transcoding layer?

Thanks you for your time; I've learned a lot reading through 
the archives of this maillist.

--Erik Ostermueller
[EMAIL PROTECTED]
 






Re: Copy/paste in xterm/XEmacs

2003-04-03 Thread Yung-Fong Tang




I think that depends on whether the application supports the newly defined UTF8_STRING
selection target or not.
The Linux version of Mozilla implements it, so it can copy/paste with recent
versions of xterm without problems.

Notice that UTF8_STRING was defined AFTER the X11R6 ICCCM.
See the spec at http://www.pps.jussieu.fr/~jch/software/UTF8_STRING/ for
details.


See http://lxr.mozilla.org/seamonkey/source/widget/src/gtk/nsClipboard.cpp
for Mozilla's implementation.


Phillip Farber wrote:

  After searching far and wide and reading all the HOWTOs etc.
I'm still at a loss as to how to make a simple copy/paste
work within xterm and between xterm and XEmacs.

If I cat a utf-8 encoded XML file containing Russian and it
displays just fine.  If I select a single word by left
double clicking and paste with a middle click I see a
mixture of '@', '^' and a few Cyrillic characters.
If I paste into XEmacs 21.4 in a buffer with the buffer
file coding system set to utf-8 I see a string of '?'.

Interestingly I can paste the selection into Windows
Notepad and Word and it displays just fine too.

Am I missing something very basic in my configuration/setup
or is this a known problem?  I'm wondering whether my
X server is not up to the task or perhaps are there
some x resources I should be setting?

I am running the xterm XFree86 4.2.0(165) that comes
with Linux Redhat 8 with display support from Hummingbird
eXceed X Server 7.1 on Windows NT 4.0 and XEmacs 21.4 (patch 8)
"Honest Recruiter" [Lucid] (i386-redhat-linux, Mule)
of Mon Aug 26 2002 on astest.

I'm invoking xterm as:

xterm -u8 -fn
'-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'

% locale

LANG=en_US.UTF-8
LC_CTYPE=en_US
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Phil.
---
Phillip Farber, Information Retrieval Specialist
Email: [EMAIL PROTECTED]
Digital Library Production Service (http://www.umdl.umich.edu/)
Hatcher Graduate Library, University of Michigan
308 Hatcher North, Ann Arbor, MI 48104-1205




  






Re: Problem in unix server with the encoding of pound(#163)

2003-03-21 Thread Yung-Fong Tang


Jain, Pankaj (MED, TCS) wrote:

Hi,
I am generating pound sign in html preview using XML XSLT transformation
and its working fine in windows using #163; in XML but same thing is
not working in unix server. 

What do you mean by "in unix server"? Displaying the text in a Unix xterm?
Or are you talking about some Unix browser? Which browser? Which version?

I am using utf-8 encoding for this. And the
strange thing is that it works fine for PDF in both windows and unix
which I am generating using FOP XSLFO. So I am not able to figure out
where exactly the problem is.
Please help me in the above area if there is any dependency of encoding
in unix.
Thanks
-Pankaj
 






Re: Unicode Public Review Issues update

2003-03-18 Thread Yung-Fong Tang
URL, please?

Rick McGowan wrote:

The Unicode Public Review Issues page has been updated today.

Highlights:

   Closed issue #1 (Language tag deprecation) without any change.
   Updated some deadlines on other issues to June 1, 2003.
   Added a document for issue #7 (tailored normalizations).
   Added an issue #8 regarding properties of math digits.
Regards,
Rick McGowan
Unicode, Inc.
 






Re: Characters that rotate in vertical text

2003-03-14 Thread Yung-Fong Tang




I think that is a hard problem.

First of all, take a look at 
http://www.unicode.org/Public/4.0-Update/UCD-4.0.0d5b.html
and find the vertical one.

Second, anything which needs to be symmetric-swapped in bidi probably needs to
be changed in vertical form. (If they need to be changed in the horizontal
direction, they probably will need to be changed in the vertical position.)

However, this is not that easy. First, some characters can be
rotated optionally. For example, if you have the English string "Book" in your
vertical text, should software rotate it or not?
It could rotate the whole text 90 degrees as "Book", or it could display it as
B
o
o
k

Both are "right". It depends on the application domain how to display
it, which means it needs "a higher level protocol"; look at the example in
section 3.3 of http://www.w3.org/TR/2003/WD-css3-text-20030226/

Second, it also depends on the people who design the glyphs. For example, U+FF0C
in a Traditional Chinese font has the comma in the center position, which means
it does not need to change in vertical layout. However, Japanese users
think that position looks funny for horizontal text and won't accept it. So
the U+FF0C glyph in a Japanese font is placed in the lower left corner,
and in that case it needs a different glyph (note: not a different Unicode
code point, but a different glyph id) to represent it in vertical layout. That is
why on Windows most of the fonts have an "@ variant" version.
That font is used for vertical layout, and the same Unicode code point maps to a
different glyph id (so the comma shows up in the upper left, center,
or upper right position [I am not a typographer, so I am not sure which one
they choose, but one of them]).

More info about vertical text can be found in the following places:
1. Pages 342-365, Chapter 7, Typography, CJKV Information Processing, Ken
Lunde, O'Reilly, ISBN 1-56592-224-7, http://www.oreilly.com/catalog/cjkvinfo/
2. Pages 192-193, Developing International Software, 2nd Edition, Dr. International,
Microsoft Press, ISBN 0-7356-1583-7, http://www.microsoft.com/mspress/books/5717.asp
They may have an online copy on MSDN.




Rick Cameron wrote:
 
  
  
 
  
  Characters that rotate in vertical text

  Hi, all 
  
  When Japanese (and, I imagine, other
East Asian languages) is written vertically, certain characters are rotated
by 90 degrees. Examples: the parenthesis-like characters in the block at
U+3000, and U+30FC.

U+3000 is IDEOGRAPHIC SPACE; I don't think it needs to be rotated, since it
should show as blank anyway.

 
 
  Does the Unicode character database
include information on which characters are rotated in vertical text? If
not, does anyone know of a definitive list?
  
  Thanks 
  
  - rick cameron 
  





Re: New document.

2003-03-14 Thread Yung-Fong Tang


Otto Stolz wrote:

The two scans under
  http://www.rz.uni-konstanz.de/Antivirus/tests/li.png
  http://www.rz.uni-konstanz.de/Antivirus/tests/re.png
are from the authoritative (until July 1996) book on German
orthography: Duden Rechtschreibung der deutschen Sprache
und der Fremdwörter / hrsg. von d. Dudenred. auf d. Grundlage
d. amtl. Rechtschreibregeln. [Red.Bearb.: Werner Scholze-
Stubenrecht unter Mitw. von Dieter Berger ...]. - 19., neu bearb.
u. erw. Aufl. ISBN: 3-411-20900-3.
Best wishes,
  Otto Stolz 
Could you point out which symbol in those two images needs to be proposed?
Either with a red circle on the image, or by telling us the surrounding text.
Thanks




Re: pinyin syllable `rua'

2003-03-14 Thread Yung-Fong Tang
Which pinyin system is "rua" in?

I use Simplified Chinese Win XP, and if I switch to the Full Spell 
Simplified Chinese IME and type rua', then I get a character (read this 
email in UTF-8) which is U+633C.
I am not sure that is correct. At least, as a native Mandarin speaker, 
that sound is not natural for me at all. It could be a table mistake in 
the software. It sounds like Japanese :)

Werner LEMBERG wrote:

Some lists of pinyin syllables contain `rua', but I actually can't
find any Chinese character with this name.
Does it exist at all?  Or is it just there for completeness of pinyin?

   Werner

 






Re: sorting order between win98/xp

2003-03-13 Thread Yung-Fong Tang






Dominikus Scherkl wrote:

  
Anyone know why the sort order is different under that two systems?

  
  As I mentioned: a new feature, keeping numbers ordered numerical.
  

I wouldn't mind if they ALSO gave me a flag to control that behavior. 
Numbers can be used for many different things in a string. 

It does not make sense to sort differently on Win98 and WinXP if I have the
following subjects in my IMAP mailbox: 
"7870789 is my phone number"
"1947 is the year Mary graduated from high school"
"95129 is my zip code"
"23.95 only to get a DSL line"
"234458 - bugzilla bug - System crash when using large size font"





Re: sorting order between win98/xp

2003-03-13 Thread Yung-Fong Tang





  
  
  
Anyone know is there a way to make them sort in the same 
order?

  
  Why should anybody want that?
  

Because users expect a cross-platform (or I should say cross-Windows-version)
product to display the same sorting order on Win98 and on WinXP.
For example, the Netscape 7 mailer can run on both Win98 and WinXP and be used
to access an IMAP mailbox. When users sort mail by subject, they
expect to see the same sequence for the same mailbox content from those two
systems through the same mail client. I am not saying that is an IMPORTANT
issue, but it is A REASONABLE issue. 

Why would any OS user want to see a different sorting order for the same locale?
I looked at the Win32 API but I cannot find any flag to control it. 



Re: sorting order between win98/xp

2003-03-13 Thread Yung-Fong Tang






Michael (michka) Kaplan wrote:

  From: "Yung-Fong Tang" [EMAIL PROTECTED]

  
  
One of my colleague ask me this question.

  
  
Not much to do with Unicode, though. Is it?

It will be a Unicode issue if the cause is that the new software tries to implement
http://unicode.org/reports/tr10/, Unicode Technical Standard #10, the Unicode
Collation Algorithm, right?

  





Re: sorting order between win98/xp

2003-03-13 Thread Yung-Fong Tang




We cannot use that. The function you mention compares two Unicode strings.

We need a function to "generate a sort key" from Unicode strings instead
of comparing two strings. 

Michael (michka) Kaplan wrote:

  From: "Yung-Fong Tang" [EMAIL PROTECTED]

  
  
One of my colleague ask me this question.

  
  
In the interests of completeness

The function that does the type of sorting your colleague noted is
StrCmpLogicalW in shlwapi.dll, version 5.5 and later. See the
following link for more information (all on one line in the browser):

http://msdn.microsoft.com/library/en-us/shellcc/platform/shell/reference/shlwapi/string/strcmplogicalw.asp

MichKa

  






Re: sorting order between win98/xp

2003-03-13 Thread Yung-Fong Tang




Doug got my point. What I care about is the "difference", not which one is
better. 

Doug Ewell wrote:

  Dominikus Scherkl Dominikus dot Scherkl at glueckkanja dot com wrote:

  
  

  It is not deterministic string ordering
  

?!?
What's non-deterministic in numeric ordering?
Ok, mix of (letter-)strings and numbers maybe not so
straight-forward to sort than simply sorting digits
by their encoding-value (this is the cause it was
not implemented before), but I prefer it always very much.

  
  
The question really isn't whether one sort order is "better" than
another.  It's easy to come up with examples where each method has an
advantage.

The real question is whether the same sort option should generate
different results on different versions of Windows, all other things
being equal.  It would be nice to have an explicit option to sort
strings by numeric value instead of character-set collating order, but
not so good if the developer has no control over which method is used,
and worse if Microsoft did not publicize this change; I don't know if
they did or not).

Note that I'm speaking in terms of programmable sorting.  I really don't
care how filenames in Windows Explorer are sorted.

Me neither. That was just used to show the problem is at the OS level
rather than a programming error in our code. 

  

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


  






Re: Unicode character transformation through XSLT

2003-03-13 Thread Yung-Fong Tang




I have not touched Java for years (probably 5 years)... so I could be wrong.


Jain, Pankaj (MED, TCS) wrote:
 
 
  
  
   
  
 
   
  Hi ftang/james,
 
  Thanks for the detailed explanation; now I see the root problem of my error.
 
  I have the following string in the database as Long, in which the special
  character (?) is equivalent to ndash (-):
 
  E8C ? 6 to 10 
 
  And I am using the following code to write the string from the database to a
  property file, and in the property file I am getting the following string:
 
  value= E8C \uFFE2\uFF80\uFF93 6 to 10 
 
  And as \uFFE2\uFF80\uFF93 is not equivalent to ndash, I am not
  able to figure out why it is coming in the property file.
 
  Do we need to specify any type of encoding, like utf-8, in my Java program?
 
  Please let me know where the problem is.
 
  Here is my code:
 
  while(rsResult.next())
 
  {
 
  /*Get the file contents from the value column*/
 
  ipStream = rsResult.getBinaryStream("VALUE");
  

What is rsResult? A Blob?
You probably need to use a BufferedInputStream and a DataInputStream
to wrap the InputStream,
and use readChar or readUTF from the DataInput interface instead.
See http://www.webdeveloper.com/java/java_jj_read_write.html and 
http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF()
for more info.



   
  strBuf = new StringBuffer();
 
  while((chunk = ipStream.read())!=-1)
 
  {
 
  byte byChunk = new Integer(chunk).byteValue();
 
  strBuf.append((char) byChunk);
 
  }
  

Here is your problem: you read it byte by byte. Each byte of the UTF-8
will be read in as a byte instead of a char in Java.


   
  prop.setProperty(rsResult.getString("KEY"),  strBuf.toString());
 
  }
 
  /*Write to o/p stream*/
 
  //opFile = new  FileOutputStream(strFileName+".properties");
 
  opFile = new FileOutputStream(strFileName);
 
  /*Store the Properties files*/
 
  prop.store(opFile, "Resource Bundle created from Database
View  "+vctView.get(i));
 


   
  
  
  Thnaks
 
  -Pankaj
 
  
 
  
 
  
 
  
  
 
 
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 11, 2003 6:09 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT



Because the following code got applied to your Unicode data:

1. Convert \u escapes to Unicode.
\uFFE2\uFF80\uFF93
becomes three Unicode characters:
U+FFE2, U+FF80, U+FF93
This is OK.
2. A "throw away the high 8 bits" step got applied to your data, so
it became 3 bytes:
E2 80 93

3. And some code treated it as UTF-8 and tried to convert it to UCS-2 again,
so:

E2 = 1110 0010, and the rightmost 4 bits 0010 will be used for UCS-2
80 = 1000 0000, and the rightmost 6 bits 00 0000 will be used for UCS-2
93 = 1001 0011, and the rightmost 6 bits 01 0011 will be used for UCS-2

[0010] [000000] [010011] = 0010 0000 0001 0011 = 2013
U+2013 is EN DASH

So... in your code there is something very, very bad which will corrupt
your data.
Steps 2 and 3 are very bad. You probably need to find out where they are
and remove that code. 

Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
Probably your Java code has one or two of the bugs listed in my paper.


Jain, Pankaj (MED, TCS) wrote:
   

  James,
thanks, its working for me now.
But still I have a doubt that why \uFFE2\uFF80\uFF93 is giving ndash in
html.
if you have any information on this, than pls let me know.

Thanks
-Pankaj

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT


.
Pankaj Jain wrote,

  
 
  
My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
from resource bundle property file which is equivalent to ndash(-) and
its 

  
  
U+2013 is the ndash (–).  It is represented in UTF-8 by three
hex bytes: E2 80 93.

But, \uFFE2 is fullwidth pound sign
\uFF80 is half width katakana letter ta
and \uff93 is half width katakana letter mo.

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?

Best regards,

James Kass
.

  


  






Re: farsi calendar components

2003-03-13 Thread Yung-Fong Tang
check http://emr.cs.iit.edu/home/reingold/calendar-book/second-edition/

Paul Hastings wrote:

does anybody know of any java farsi calendar components? thanks.

Paul Hastings   [EMAIL PROTECTED]  
CTO   Sustainable Development Research Institute
Member  Team Macromedia (ColdFusion)



 






Re: sorting order between win98/xp

2003-03-13 Thread Yung-Fong Tang





Do you use
LCMapStringW on WinXP and LCMapStringA on Win98, WITH LCMAP_SORTKEY, to
generate the SORT KEY?

Have you tried on both platforms (Win98 and WinXP)?


Michael (michka) Kaplan wrote:

  LCMapString does not do the reported behavior either. CompareString
and LCMapString are based on the same data and return the same
results.

Your colleague is mistaken.

MichKa

- Original Message - 
From: "Yung-Fong Tang" [EMAIL PROTECTED]
To: "Michael (michka) Kaplan" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, March 13, 2003 4:31 PM
Subject: Re: sorting order between win98/xp


  
  
We cannot use that. The function you mention is to compare two

  
  Unicode
  
  
strings.
We need the function to "generate sort key" from unicode strings

  
  instead
  
  
of compare two string.

Michael (michka) Kaplan wrote:

    
    
  From: "Yung-Fong Tang" [EMAIL PROTECTED]



  
  
One of my colleague ask me this question.



  
  In the interests of completeness

The function that does the type of sorting your colleague noted is
StrCmpLogicalW in shlwapi.dll, version 5.5 and later. See the
following link for more information (all on one line in the
  

  
  browser):
  
  
http://msdn.microsoft.com/library/en-us/shellcc/platform/shell/refere

  
  nce/shlwapi/string/strcmplogicalw.asp
  
  

  MichKa



  



  
  
  






Re: wap and utf-8

2003-03-13 Thread Yung-Fong Tang






Mary McCarter wrote:
Hi Friends, 
 
  
 
My phone (Motorola i550,i30sx,i85,i60c) doesn't show correctly the  neither
 #243; and it shows the  instead of . 
Is that a LATIN CAPITAL LETTER A WITH TILDE and a SUPERSCRIPT THREE?
ISO-8859-1 uses 0xC3 to encode LATIN CAPITAL LETTER A WITH TILDE.
ISO-8859-1 uses 0xB3 to encode SUPERSCRIPT THREE.
UTF-8 uses 0xC3 0xB3 to encode LATIN SMALL LETTER O WITH ACUTE.

So... it looks like some code treated your UTF-8 as ISO-8859-1:
case #4 in my paper http://people.netscape.com/ftang/paper/textintegrity.html

Why?
<?xml version="1.0" encoding="ISO-8859-1"?>
says "ISO-8859-1".

The wml_binary has the \xc3\xb3.
What is wml_binary?
Which encoding did you use to store the WML? UTF-8 or ISO-8859-1?
If you do an od -x on that WML file, do you see \xf3 for that character, or
\xc3\xb3?

One possibility is that you created the file in UTF-8 but labeled it as
ISO-8859-1. Changing the first line from
<?xml version="1.0" encoding="ISO-8859-1"?>
to
<?xml version="1.0" encoding="UTF-8"?>
will fix that.

If you did store your information in ISO-8859-1, then it could be caused by
the following:
1. Some code reads your XML file and converts it to UTF-8 correctly; however,
the encoding="ISO-8859-1" declaration is kept with it.
2. That code passes the converted XML to the next module, but it does not
change the declaration from 'encoding="ISO-8859-1"' to 'encoding="UTF-8"',
so the next module thinks the data is still stored in ISO-8859-1.

How to fix it?
1. Fix the data: again, change the declaration to encoding="UTF-8" and use
UTF-8 to store the data in your WML file, or
2. Fix the code: make the code which performs the ISO-8859-1 to UTF-8
conversion also change (or remove) the encoding declaration.

This is a typical "double conversion" issue, mentioned as point 6 in my paper
http://people.netscape.com/ftang/paper/textintegrity.html

However, I don't believe that is the whole story. Because if that WERE the
case, then all your environments should display garbage, not just your
Motorola phone and the 4.1 simulator. 

The real problem could be in two places:
1. Some code didn't remove/change the XML encoding declaration even though it
performed charset conversion,
AND
2. Your Nokia phone and your 3.1 simulator (unlike your Motorola
i550/i30sx/i85/i60c or UP.SDK 4.1 simulator) may always ASSUME the data is
UTF-8 and always ignore the mislabeled encoding="ISO-8859-1" declaration.

The DOUBLE fault could cause you to see it well on the Nokia phone and the
3.1 simulator. The SINGLE fault in 1, plus the CORRECT behavior of your
Motorola phone, probably lets you see the wrong thing :)


The same happen with my up.sdk 4.1 simulator (connected trough my  wap-gateway) 
  
But a nokia phone shows the  correctly! and my nokia toolkit 3.1 simulator
 show it well, too. 
 
I check the wap-gateway code, and I can realize that the wml_binary has
the  \xc3\xb3 instead of  and it is right, I think so.. because it is the
UTF-8  code, but why my phone can't show it correctly. 
 
Any idea? 
 
I will be grateful with any contribution, 
Thanks a lot and Regards, 
Mary 
 
 
 
 
 
_ 
MSN 8 helps eliminate e-mail viruses. Get 2 months FREE*.  http://join.msn.com/?page=features/virus 
  
 
 






Re: Encoding: Unicode Quarterly Newsletter

2003-03-11 Thread Yung-Fong Tang
Hope they can reduce the weight next time by changing the type of 
paper. My Bible has about 500 more pages (1500+ total) than the 
Unicode 3.0 standard but is only 50% as thick. Same with my 
Chinese/English dictionary.

Otto Stolz wrote:

Kenneth Whistler wrote:

we can
calculate the weight as being *approximately* 9.05 pounds
(avoirdupois) [or 10.99 troy pounds].


Apparently a weighty publication, that forthcoming Unicode standard...

Cheers,
  Otto Stolz






Re: Encoding: Unicode Quarterly Newsletter

2003-03-11 Thread Yung-Fong Tang


John H. Jenkins wrote:

I certainly think it would be good published with a leather cover, 
onion-skin paper, and gilt edges, yes.  First we have to have Ken 
divide it into verses, though. 
I thought we already have the verses divided, in Chapter 3: that 
C1-C13/D1-D2 stuff.






sorting order between win98/xp

2003-03-11 Thread Yung-Fong Tang
One of my colleagues asked me this question. We use LCMapStringW on WinXP 
and LCMapStringA on Win98 (with LCMAP_SORTKEY), and we get 
different sorting orders for the following.

Example of message list ordering on Win98:
TESTING #1
TESTING #10
TESTING #100
TESTING #11
While the message list ordering on WinXP is:
TESTING #1
TESTING #10
TESTING #11
TESTING #100
Does anyone know a way to make them sort in the same order? Does anyone 
know why the sort order is different on those two systems?
They are running under the same locale.
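For illustration, a portable way to get the WinXP-style ordering on any platform is to build a sort key that compares digit runs numerically. This sketch is mine, not the Win32 sort key itself:

```python
import re

# Split a string into alternating text and number chunks so that digit
# runs compare by numeric value ("#2" sorts before "#10").
def natural_key(s):
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", s)]

if __name__ == "__main__":
    subjects = ["TESTING #1", "TESTING #10", "TESTING #100", "TESTING #11"]
    # Win98-style: plain lexicographic; WinXP-style: natural_key
    print(sorted(subjects))
    print(sorted(subjects, key=natural_key))
```

Generating the key once and sorting on it is exactly the "generate sort key" usage discussed in this thread, as opposed to calling a pairwise compare function.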




Re: Unicode character transformation through XSLT

2003-03-11 Thread Yung-Fong Tang





Because the following code got apply to your unicode data

1. convert \u to unicode - 
\uFFE2\uFF80\uFF93
become
three unicode characters- 
U+FFE2, U+FF80, U+FF93
This is ok
2. a "Throw away hihg 8 bits got apply to your code" so
it became 3 bytes
E2 80 93

3. and some code treat it as UTF-8 and try to convert it to UCS2 again, so


E2 = 1110 0010 and the right most 4 bits 0010 will be used for UCS2
80 = 1000  and the right most 6 bits 00  will be used for UCS2
93 = 1001 0011 and the right most 6 bits 01 0011 will be used for UCS2

[0010] [00 ] [01 0011] = 0010  0001 0011 = 2013
U+2013 is EN DASH

So... somewhere in your code there is something very, very bad that corrupts
your data. Steps 2 and 3 are the problem; you need to find out where they
happen and remove that code.

Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
Your Java code probably has one or two of the bugs listed in my paper.
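The corruption chain described above can be reproduced in a few lines. A Python sketch of the three steps (not the reader's actual code):

```python
s = "\uFFE2\uFF80\uFF93"                   # step 1: three Unicode characters
low = bytes(ord(c) & 0xFF for c in s)      # step 2: high 8 bits thrown away
print(low.hex())                           # e28093
recovered = low.decode("utf-8")            # step 3: bytes re-read as UTF-8
print(f"U+{ord(recovered):04X}")           # U+2013 (EN DASH)
```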

Jain, Pankaj (MED, TCS) wrote:

  James,
Thanks, it's working for me now.
But I still have a doubt: why does \uFFE2\uFF80\uFF93 give an ndash in
HTML?
If you have any information on this, please let me know.

Thanks
-Pankaj

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 10, 2003 7:59 PM
To: Jain, Pankaj (MED, TCS)
Cc: '[EMAIL PROTECTED]'
Subject: Re: Unicode character transformation through XSLT


.
Pankaj Jain wrote,

  
  
My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
from resource bundle property file which is equivalent to ndash(-) and
its 

  
  
U+2013 is the ndash (–).  It is represented in UTF-8 by three
hex bytes: E2 80 93.

But \uFFE2 is the fullwidth not sign,
\uFF80 is the halfwidth katakana letter ta,
and \uFF93 is the halfwidth katakana letter mo.

Perhaps the reason you see three question marks is that the font
you are using doesn't support full width and half width characters.

What happens if you replace your string \uFFE2\uFF80\uFF93 with
\u2013 ?

Best regards,

James Kass
.

  






pesonal comments about http://www.w3.org/TR/xml11/

2003-03-10 Thread Yung-Fong Tang
  ...purports not to modify the interpretation of that coded character
  sequence.

  If a noncharacter which does not have a specific internal use is
  unexpectedly encountered in processing, an implementation may signal an
  error or delete or ignore the noncharacter. If these options are not
  taken, the noncharacter should be treated as an unassigned code point.
  For example, an API that returned a character property value for a
  noncharacter would return the same value as the default value for an
  unassigned code point.
in http://www.unicode.org/reports/tr27/

Therefore, should the following production be changed from

  

  [2]
 Char
 ::=
 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 /* any Unicode character, excluding the surrogate blocks, FFFE, and 
FFFF. */

  

to

  

  [2]
 Char
 ::=
 #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFDCF]
 | [#xFDF0-#xFFFD] | [#x10000-#x1FFFD] | [#x20000-#x2FFFD]
 | [#x30000-#x3FFFD] | [#x40000-#x4FFFD] | [#x50000-#x5FFFD]
 | [#x60000-#x6FFFD] | [#x70000-#x7FFFD] | [#x80000-#x8FFFD]
 | [#x90000-#x9FFFD] | [#xA0000-#xAFFFD] | [#xB0000-#xBFFFD]
 | [#xC0000-#xCFFFD] | [#xD0000-#xDFFFD] | [#xE0000-#xEFFFD]
 | [#xF0000-#xFFFFD] | [#x100000-#x10FFFD]
 /* any Unicode character, excluding the surrogate blocks, FDD0 to
FDEF, nFFFE, and nFFFF. */

  


2. A similar change should apply to 
[4] NameStartChar
#xFDD0-#xFDEF should not be allowed in NameStartChar,
and nFFFE and nFFFF should not be allowed in NameStartChar either.

It looks like NameStartChar does not allow the private use area [#xE000-#xF8FF].
If we follow that principle, then [#xF0000-#x10FFFF] should not be in
NameStartChar either, since http://www.unicode.org/Public/3.2-Update/Blocks-3.2.0.txt
defines them as Supplementary Private Use Areas:
F0000..FFFFF; Supplementary Private Use Area-A
100000..10FFFF; Supplementary Private Use Area-B

Also, I doubt we should allow 
E0000..E007F; Tags
to be used in NameStartChar.



Frank Yung-Fong Tang




Re: length of text by different languages

2003-03-07 Thread Yung-Fong Tang






Ram Viswanadha wrote:

  
  
  
 
  
 

  There is also some information at
  http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html#Test_Results

  Not sure if this is what you are looking for.

Thanks, but not really. I am not looking at the ratio caused by the encoding, but
rather the ratio caused by the language itself. For example, in order to communicate
the idea "I want to eat chicken for dinner tonight", French and German text in
the same encoding may use different numbers of characters to communicate the
same "IDEA".
Misha's paper helps a lot, but unfortunately it lacks Japanese and German
data.
data.





Re: Need program to convert UTF-8 - Hex sequences

2003-03-06 Thread Yung-Fong Tang
1. Open your file with Netscape 7 and change the encoding to UTF-8.
2. Select and copy all the text.
3. Paste it into the first textarea of the attached HTML file.
David Oftedal wrote:

Hello!

Sorry to make this a mass spam, but I need a program to convert UTF-8 
to hex sequences. This is useful for embedding text in non-UTF web 
pages, but also for creating a Yudit keymap file, which I'm doing at 
the moment.

For example, a file with the content æøå would yield the output 
0x00E6 0x00F8 0x00E5, and the Japanese expression あの人 would yield 
0x3042 0x306E 0x4EBA.

Can anyone tell me how to do it without making a program for it 
myself? It would be VERY helpful, and I've already made 2 programs for 
assembling this file and I'm not starting on another just yet.

Best regards

David J. Oftedal



[Attachment: u.html — an HTML form with a "Text:" input area and output
fields for: Unicode Value (\u), Unicode Value (&#ddd;), Unicode Value
(&#x...;), and UTF-8 value for C (\xhhh)]



Re: length of text by different languages

2003-03-06 Thread Yung-Fong Tang






Francois Yergeau wrote:

  [EMAIL PROTECTED] wrote:
  
  
I remember there were some study to show although UTF-8 encode each 
Japanese/Chinese characters in 3 bytes, Japanese/Chinese usually use 
LESS characters in writting to communicate information than 
alphabetic base langauges.

Any one can point to me such research?

  
  
I don't know of exactly what you want, but I vaguely remember a paper given
at a Unicode conference long ago that compared various translations of the
charter (or some such) of the Voice of America in a couple or three
encodings.  Hmm, let's see... could be this:

http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU) 
Misha Wolf

Yeah, that could be it. I got a hard copy, and it looks like Fig 2 is the
one I am looking for.


  

No paper online, alas.  I remember that Chinese was a clear winner in terms
of # of characters.  In fact, I kind of remember that Chinese was so much
denser that it still won after RCSU (now SCSU) compression, which would mean
that a Han character contains more than twice as much info on average as a
Latin letter as used in (say) English.

This is all on pretty shaky ground, distant memories.  Perhaps Misha still
has the figures (if that's in fact the right paper).

  






Re: length of text by different languages

2003-03-06 Thread Yung-Fong Tang
Francois Yergeau wrote:

http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU) 
Misha Wolf
 

Unfortunately, no information about German or Japanese. :(

It only has Chinese, Farsi, Urdu, Russian, Arabic, Hindi, Korean, 
Creole, Thai, French, Czech, Turkish, Polish, Armenian, Greek, English, 
Vietnamese, Albanian, and Spanish.

Does anyone have data about those two languages (German and Japanese)?






Re: length of text by different languages

2003-03-06 Thread Yung-Fong Tang
Thanks, everyone. But I want to point out that punctuation and spaces themselves 
should also be considered in your future calculation. Japanese, Chinese, and 
Thai do not use spaces between words, while Latin-based scripts (and Greek, 
Korean, Cyrillic, Arabic, Armenian, Georgian, etc.) do use spaces; when 
estimating size, those should also be counted.




length of text by different languages

2003-03-05 Thread Yung-Fong Tang
I remember there were some studies showing that although UTF-8 encodes each 
Japanese/Chinese character in 3 bytes, Japanese/Chinese writing usually uses 
FEWER characters to communicate information than alphabet-based 
languages.

Can anyone point me to such research? Martin, do you have some papers 
about that?

I would like to find out the average ratio between
English,
German,
French,
Japanese,
Chinese,
Korean
in terms of the number of characters, and in terms of the bytes needed to 
encode them in UTF-8.

If such research has not been done, maybe one way to get the 
result is to take translated Bibles for these languages from the Sword 
project, strip out the XML tags to leave the pure text, and measure 
the size. Since all the Bible translations communicate the same 
information and the volume is large enough, that could be a good way to 
find out the result. Of course, the markup needs to be taken out to 
reduce the noise.
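As a rough illustration of the character-count versus UTF-8-byte-count trade-off, here is a small Python sketch. The sample sentences are my own, not from any study:

```python
# Same idea expressed in two languages: the Chinese version uses far fewer
# characters, but each CJK character costs 3 bytes in UTF-8.
samples = {
    "English": "I want to eat chicken for dinner tonight.",
    "Chinese": "我今晚想吃鸡。",
}
for lang, text in samples.items():
    print(lang, len(text), "chars,", len(text.encode("utf-8")), "UTF-8 bytes")
```

A real comparison would need a large parallel corpus (such as the Bible translations suggested above) rather than a single sentence.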






Re: Unicode Arabic Rendering Problem

2003-03-03 Thread Yung-Fong Tang
Guess I am not the right person to answer that; putting it back on the 
unicode.org mailing list.
Let me ask you this way: is this a rendering-style issue, or is it a 
different way to combine characters?
How do you pronounce the following 3?
Is there a different pronunciation between 1 and 3?
Is there a different pronunciation between 2 and 3?

The answers to the two questions above may tell us whether it is an encoding 
issue or a presentation (glyph variant) issue.
 This is a unique spelling that is commonly found in the Quran.
Is that spelling also found in text OTHER than the Quran?

Mete Kural wrote:

Hello Yung-Fong,

Thank you very much for all the information. It was
very helpful. I'm still not clear about something
though. As far as I understand, the block of
characters
U+0644-U+0654-U+0627 would be rendered as such:
   c 
\  /
 \/
 /\
 \/

U+0644-U+0627-U+0654 would be rendered:

c 
\  /
 \/
 /\
 \/

So how would you encode this rendering?

 c 
\  /
 \/
 /\
 \/

in which the hamza is neither directly above the alef,
nor directly above the lam, but it's in between the
alef and lam. This is a unique spelling that is
commonly found in the Quran.
Thank you very much for the help.

Mete

--- Yung-Fong Tang [EMAIL PROTECTED] wrote:

 






Re: Unicode 4.0 BETA available for review

2003-02-28 Thread Yung-Fong Tang




Thanks for letting me know. I guess I haven't spent enough time on www.unicode.org
these days :) When did you add those PDFs there? It used to have only partial
sections available... but that is probably a story from several years ago.


Roozbeh Pournader wrote:

  On Thu, 27 Feb 2003, Mark Davis wrote:

  
  
The Unicode Standard *is* free of charge; the entire text is posted on
www.unicode.org.

  
  
Well, free of charge to *read personally on the screen*, of course. You
can't print the major versions yourself, Addison-Wesley must be asked for
that ;) And you can't copy and paste portions of its text into emails for
reference or discussion, as you can do with RFCs. You should retype it,
which I find very annoying.

roozbeh

  






Re: Unicode 4.0 BETA available for review

2003-02-28 Thread Yung-Fong Tang






Doug Ewell wrote:

  Yung-Fong Tang ftang at netscape dot com wrote:

  
  
So... in the future, in order to ensure we have a good software
environment, we not only need to make the Unicode 4.0 clear, but also
need to speed up the revision of those RFCs.

  
  
But the Unicode Consortium and UTC have no control over that.  And as
you can see, François is doing his best to move the new RFC along.

Sure, we all appreciate François' efforts on those. 

  

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


  






Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-28 Thread Yung-Fong Tang


Kenneth Whistler wrote:

Think of it this way. Does anyone expect the ASCII standard to tell,
in detail, what a process should or should not do if it receives
data which purports to be ASCII, but which contains an 0x80 byte
in it? All the ASCII standard can really do is tell you that
0x80 is not defined in ASCII, and a conformant process shall not
interpret 0x80 as an ASCII character. Beyond that, it is up to
the software engineers to figure out who goofed up in mislabelling
or corrupting the data, and what the process receiving the bad data
should do about it.
 

That is not a good comparison. ASCII is a single-byte character code 
standard, and when I get a 0x80 in an ASCII string, I know where the 
boundary is: the whole 8 bits of that 0x80 are bad. The scope is not the 
first 3 bits nor 9 bits, but exactly those 8 bits of data. I cannot 
tell whether the rest of the data is good or bad, but I know ASCII is only 
8 bits and 8 bits only.

Same thing for JIS X 0208 (a TWO and only TWO byte character set, not a 
variable-length character set). If I am processing an ISO-2022-JP message 
in JIS X 0208 mode and I get 0x24 0xA8, I know the boundary of 
the problem is 16 bits, not 8 bits nor 32 bits.

When you deal with encodings which need states (ISO-2022, ISO-2022-JP, 
etc.) or variable-length encodings (Shift_JIS, Big5, UTF-8), the 
situation is different.
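A small Python sketch of why the boundary question matters for a variable-length encoding: even a decoder that merely replaces errors has to decide how many bytes each ill-formed sequence covers. Python's UTF-8 decoder replaces the maximal subpart of each bad sequence:

```python
# 0xF0 promises three trailing bytes that never arrive; 0x8C is a stray
# continuation byte. Each maximal ill-formed subpart becomes one U+FFFD,
# and the 0x28 ('(') bytes survive untouched.
data = b"abc\xf0\x28\x8c\x28def"
print(data.decode("utf-8", errors="replace"))
```

For ASCII or JIS X 0208 the damaged span is always exactly one code unit; here the decoder must work out the span from the lead-byte structure.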






Re: Unicode Arabic Rendering Problem

2003-02-28 Thread Yung-Fong Tang
My test data generator at
http://people.netscape.com/ftang/testscript/arabic/arabic.html
can probably also help people look at the Arabic behavior.
Unfortunately, it is currently coded against windows-1256 instead of 
Unicode.




Re: Unicode Arabic Rendering Problem

2003-02-28 Thread Yung-Fong Tang





I think you have both problems, 1 and 2.

1. I think you used the wrong way to encode; you should probably encode figure
2 as
U+0644-U+0654-U+0627
and figure 3 as 
U+0644-U+0627-U+0654

2. I think there is also a font problem. From my test, none of the fonts shipped
with MS Windows work either way (the way you encode or the way I encode)
in IE or Mozilla. But I did see one font, which I got from an Arabic font
developer, that shows me U+0644-U+0654-U+0627 as figure 2 and U+0644-U+0627-U+0654
as figure 3.

I will send you a screenshot in private email; I don't want to send a big JPG
or PNG to the mailing list. 

I need to find out who designed that font I have on my hard drive, and will
probably let you know more details later. 


Mete Kural wrote:

  Hello Folks,

I wanted to ask a question to those of you who have
Unicode Arabic knowledge. We have this website
http://www.quranreader.org where we are trying to
display the text of the Quran with accurately encoded
Unicode text rather than the traditional images. Some
of the characters in the Quran aren't rendered
correctly. We are letting the browser to use its
default Unicode font on the website, which is Times
New Roman Unicode for the newer versions of Internet
Explorer I think. If we used a high-quality Unicode
font for Arabic, would this solve the problem? Or is
this a bigger problem that has to do with the
rendering engine provided by the operating system?

I would like to give you an example. In Arabic when
you have a Lam And Alef together, it is rendered in a
unique way instead of the regular rendering for these
letters that kind of looks like this:

 \  /
  \/
  /\
  \/
Figure 1

In the Quran, there is sometimes this combination of
characters: Lam-Hamza-Alif
In such a case, the Lam and Alif are still rendered
the way they would be had there not been a hamza
inbetween, and the hamza is simply put above the alef
and lam in the middle which looks kind of like this:

  c 
 \  /
  \/
  /\
  \/
Figure 2

Note that this is different than the case as
illustrated in Figure 3 where the hamza is directly
above the alef and not "in between" lam and alef.

c 
 \  /
  \/
  /\
  \/
Figure 3

So there is a subtle difference that the hamza is not
directly above the alef but rather in between the alef
and the lam. I am attaching a small gif file named
"Sample.gif" that will demostrate the subtle
difference of the positioning of the hamza. Attached
are two words from the Quran. Look for the second word
where the hamza is in between the alef and the lam
instead of directly above the alef.

When we encode this case with this combination of
Unicode characters: 0644-0627-0621
in Internet Explorer, instead of showing it like
Figure 2, it totally seperates all letters and shows
it like this:

|  |
|  |
| C \__/

which is totally wrong. 

Which one do you think is the problem here?

1) We are not encoding this combination of characters
in the correct way.
2) This is a font-related problem.
3) This is a bigger problem for which the rendering
engine on the operating system has to be modified.

Thank you very very much,
Mete Kural


  
  
  
  
  




[Attachment: Sample.gif (image/gif)]

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang


Stefan Persson wrote:

Kenneth Whistler wrote:

Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
sequences. There were two types:
  a. 0xC0 0x80 for U+0000 (instead of 0x00)
  b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 
0x80 0x80)
 

Ah, but encoding NULL as a surrogate character and then encoding those 
two surrogates as three bytes, making totally 6 bytes a character, 
would also be technically possible (though not legal), right? 
How? Surrogate pairs can only be used to represent U+10000 - U+10FFFF. 
It is IMPOSSIBLE to use a surrogate pair to represent any character in 
the range U+0000 - U+FFFF, including U+0000, which is NULL.
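A quick Python check of the byte counts involved, deliberately producing the ill-formed 6-byte surrogate-pair form via the non-standard 'surrogatepass' error handler:

```python
pair = "\ud800\udc00"                         # surrogate pair that "means" U+10000
cesu = pair.encode("utf-8", "surrogatepass")  # 6 bytes: ed a0 80 ed b0 80
print(cesu.hex())
print("\U00010000".encode("utf-8").hex())     # correct UTF-8: 4 bytes, f0 90 80 80
```

Since surrogates only encode U+10000 and above, the shortest legal UTF-8 for any such character is already the 4-byte form; a 6-byte surrogate-pair encoding of NULL simply cannot exist.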



Stefan

_
Gå före i kön och få din sajt värderad på nolltid med Yahoo! Express
Se mer på: http://se.docs.yahoo.com/info/express/help/index.html






Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang

 This discussion has been centered around UTF-8.  But I hope the
corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
- for UTF-32: occurrences of 'surrogates' are ill-formed.

How about a UTF-32 sequence in which the 4 bytes represent a value greater 
than U+10FFFF? Are they considered ill-formed? Should they be?




Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang


Kent Karlsson wrote:

The Unicode 4.0 text further strengthens Conformance Clause
C12, to make this crystal clear:
  C12 When a process generates a code unit sequence which
   purports to be in a Unicode character encoding form, it shall
   not emit ill-formed code unit sequences.
   
  C12a When a process interprets a code unit sequence which
   purports to be in a Unicode character encoding form, it
   shall treat ill-formed code unit sequences as an error
   condition, and shall not interpret such sequences as
   characters.
   
And just in case anyone still has any trouble reading the
painfully detailed specification of the UTF-8
encoding form, an explicit note is included there:

  * Because surrogate code points are not Unicode scalar
 values, any UTF-8 byte sequence that would otherwise
 map to code points D800..DFFF is ill-formed.
 
So I don't think there is any hole here. If anyone still
thinks that they can use these 3-octet/3-octet encodings
of supplementary characters and call it UTF-8, then they
are either engaging in wishful thinking or are not reading
the standard carefully enough.

The problem I need to deal with is not GENERATING that UTF-8, but how to 
handle the DATA when my code receives it. For example, when I receive a 
10K UTF-8 file which has 1000 lines of text, and one UTF-8 
sequence on line 990 is ill-formed, should I fire the error for:
1. the whole file (10K, 1000 lines),
2. all the lines after line 989,
3. line 990 itself,
4. the text between the leading byte of that ill-formed UTF-8 and the 
end of the file,
5. the text between the leading byte of that ill-formed UTF-8 sequence 
and the end of line 990, or
6. the text between the leading byte of that ill-formed UTF-8 and the 
next leading byte on line 990?

If there are other ways to scope the ERROR, I could probably go on 
and give you 10-20 other ways if I spent 20 more 
minutes.

I do believe the error handling should be application-specific.






Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available forreview)

2003-02-27 Thread Yung-Fong Tang


Likewise, the Unicode Standard tells you what a well-formed
UTF-8 byte sequence is. But it is the software designer who has
to be smart about determining what his/her software will do when
it encounters an error condition and finds itself dealing
with a sequence which is ill-formed according to the specification
of UTF-8 in the Unicode Standard.
...or a higher-level specification, such as the XML specification, SOAP 
specification, CSS2 specification, etc.
There are many, many layers between the Unicode standard and a software 
application, not just the code itself.

--Ken

 






Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang


   
I can keep answering these questions, but I can also assure
everyone that the UTC worked *very* hard this time around to
make the character encoding model much clearer in the Unicode 4.0
text, and to anticipate all these edge cases.

--Ken

The problems in the past came from two (or more) places:

1. the definitions in Unicode itself (3.0, 3.1),
2. the RFCs which summarize it.
I am sure you can control point 1, but we have to understand that 
point 2 is also important. The reason people refer to point 2 is 
usually that the RFC is much shorter and more focused than the Unicode 
standard itself, and also that an RFC is FREE of charge while the printed 
Unicode standard is not. 
So, in the future, in order to ensure we have a good software 
environment, we not only need to make Unicode 4.0 clear, but also 
need to speed up the revision of those RFCs.




quoted-string in for MIME Content-Type charset parameter

2003-02-27 Thread Yung-Fong Tang






Not sure this is the right forum to discuss this issue. I found this "problem"
while debugging a UTF-8 email message.
 
 When I looked into some email that we had a problem with, I saw a Content-Type
header like the following:
 
 Content-Type: text/html; charset="UTF-8"
 
 As I remembered, the MIME specification did not allow quotes around the charset
parameter, and it should only accept
 
 Content-Type: text/html; charset=UTF-8
 
 but not charset="UTF-8".
 
 So... I checked the MIME spec to figure out whether it is allowed or not. What 
shocked me is that the original MIME specification, RFC 1521, disallowed it: 
 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1521.html#sec-7.1.1
 and
 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc1521.html#sec-7.1.2
 
   
  The formal grammar for the content-type header field for text is
as follows:
 
 text-type := "text" "/" text-subtype [";" "charset" "=" charset]
 
 text-subtype := "plain" / extension-token
 
 charset := "us-ascii"/ "iso-8859-1"/ "iso-8859-2"/ "iso-8859-3"
    / "iso-8859-4"/ "iso-8859-5"/ "iso-8859-6"/ "iso-8859-7"
    / "iso-8859-8" / "iso-8859-9" / extension-token
 
but RFC 2045, which obsoleted RFC 1521, allows the quoted charset name:
 
see http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2045.html#sec-5.1


 
   
   parameter := attribute "=" value

   attribute := token
     ; Matching of attributes
     ; is ALWAYS case-insensitive.

   value := token / quoted-string

  Note that the value of a quoted string parameter does not include
the quotes.  That is, the quotation marks in a quoted-string are not a 
part of the value of the parameter, but are merely used to delimit 
that parameter value.  In addition, comments are allowed in accordance
with RFC 822 rules for structured header fields.  Thus the following two forms 

Content-type: text/plain; charset=us-ascii (Plain text)   

Content-type: text/plain; charset="us-ascii"   

  are completely equivalent.   
 
 I was never aware of this difference between RFC 1521 and RFC 2045. Not sure
whether you folks were aware of it or not. 
 
 I also checked HTTP 1.1 (RFC 2068) and HTTP 1.0 (RFC 1945). It looks like both 
specifications have conflicting language within the same specification about this
issue:
 http://www.w3.org/Protocols/rfc1945/rfc1945
 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2068.html
 
 While one place say:
 
 
 charset = "US-ASCII"
 | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
 | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
 | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
 | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
 | "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8"
 | token
 and 
 
   
 token  = 1*<any CHAR except CTLs or tspecials>

   tspecials  = "(" | ")" | "<" | ">" | "@"
  | "," | ";" | ":" | "\" | <">
  | "/" | "[" | "]" | "?" | "="
  | "{" | "}" | SP | HT

which rules out the use of quoted-string
  
 
 The other place says:
 
3.6  Media Types

   HTTP uses Internet Media Types [13] in the Content-Type header field
   (Section 10.5) in order to provide open and extensible data typing.

   media-type = type "/" subtype *( ";" parameter )

   parameter  = attribute "=" value

   value  = token | quoted-string
 
 
:( :( :( :(

 Therefore we need to make sure:
 1. All mailers which receive email deal not only with charset=value but 
also with charset="value". I am not sure whether Mozilla can deal with it or not. 
How about your email program?
 
 2. The browser can deal with
 Content-Type: text/html; charset="value"
 in addition to 
 Content-Type: text/html; charset=value
 
 3. Because we also use the META tag in HTML to reflect the HTTP header,
the browser not only has to deal with the following kinds of meta
tags
 
 <meta http-equiv="content-type" content="text/html; charset=value">
 <meta http-equiv="content-type" content='text/html; charset=value'>
 but also
 <meta http-equiv="content-type" content='text/html; charset="value"'>
 
 :( :( :( :( 
 
 Not sure whether Mozilla handles 2 or 3. How about IE?
 
 However, for email, since RFC 1521 does NOT allow it, to make sure it works 
with most email programs, when we send out internet email we 
should try to use
 
 Content-Type: text/html; charset=UTF-8
 
 instead of 
 Content-Type: text/html; charset="UTF-8"
 
Can you check this issue with the product that you are working on?
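On point 1, a MIME parser that follows RFC 2045 strips the quotes, so both forms yield the same charset. A quick sanity check with Python's stdlib (one parser, not a survey of real mailers):

```python
from email.message import Message

# Quoted form: per RFC 2045 the quotation marks are not part of the value.
msg = Message()
msg["Content-Type"] = 'text/html; charset="UTF-8"'
print(msg.get_content_charset())   # utf-8

# Unquoted form gives the same result.
msg2 = Message()
msg2["Content-Type"] = "text/html; charset=UTF-8"
print(msg2.get_content_charset())  # utf-8
```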
 
 
 
 
 
 
 
 





Re: Unicode 4.0 BETA available for review

2003-02-26 Thread Yung-Fong Tang


Kenneth Whistler wrote:

If you read through those definitions from Unicode 4.0 carefully,
you will see that UTF-8 representing a noncharacter is perfectly
valid, but UTF-8 representing an unpaired surrogate code point
is ill-formed (and therefore disallowed).
 

I see a hole here. How about UTF-8 representing a pair of surrogate 
code points as two 3-octet sequences instead of one 4-octet UTF-8 
sequence? It should be ill-formed, since it is also a non-shortest form, 
right? But we really need to watch the language used there so we 
won't create a new problem. I DO NOT want people to think that one 3-octet 
UTF-8 surrogate (low or high) is ill-formed but that a 3-octet UTF-8 
high surrogate followed by a 3-octet UTF-8 low surrogate is legal.
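A strict decoder agrees: the paired 3-octet form is rejected exactly like an unpaired one. A Python sketch:

```python
# High surrogate (ED A0 80) followed by low surrogate (ED B0 80),
# three octets each: ill-formed UTF-8 even though the pair is matched.
cesu = b"\xed\xa0\x80\xed\xb0\x80"
try:
    cesu.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)

# The 4-octet shortest form is the only legal encoding of U+10000.
print(b"\xf0\x90\x80\x80".decode("utf-8") == "\U00010000")
```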




Re: please review the paper for me

2003-02-26 Thread Yung-Fong Tang
I think that is a very common mistake people WILL make.

Doug Ewell wrote:

Thanks to all who pointed out that noncharacters, unlike surrogate code
points, are NOT illegal or invalid in UTF-8 or any other CES.  I don't
know why I said they were.  (Bad brain!  Bad, bad brain!)
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
 





