Re: Problems encoding the spanish o
One thing that may help you think about this kind of issue is my "under construction" paper, "Frank Tang's List of Common Bugs that Break Text Integrity":
http://people.netscape.com/ftang/paper/textintegrity.html
I am going to present a newer revision at the coming IUC25 if they accept my proposal.

It looks like the 4 bytes "ón M" got changed to the two code units U+DB7A and U+DC0D, which form a surrogate pair in UTF-16. Here is what I think happened.

1. The text "...ización Map.." is output by process A and passed to process B as bytes encoded in ISO-8859-1, so the 4 bytes "ón M" are encoded as 0xf3, 0x6e, 0x20, 0x4d.

2. Somehow process B thinks the incoming data is in UTF-8 instead of ISO-8859-1. You can find some possible causes as hints in my paper (URL above).

3. Process B tries to convert the data stream to UTF-16 using the "UTF-8 to UTF-16" conversion rule. However, the UTF-8 scanner in the converter is not well written. It implements the conversion in the following way:

3.a. It hits the byte 0xf3, looks it up in a table, and notices that 0xf3 in a legal UTF-8 sequence is the first byte of a 4-byte UTF-8 sequence.

3.b. It decodes that 4-byte UTF-8 sequence without checking the values of the next 3 bytes 0x6e, 0x20, 0x4d. It blindly assumes these bytes are the 2nd, 3rd and 4th bytes of the sequence. Of course, it first needs to get the UCS4 value. What it does is

    m1 = byte1 & 0x07
    m2 = byte2 & 0x3F
    m3 = byte3 & 0x3F
    m4 = byte4 & 0x3F

In your case, what it got is

    m1 = 0xf3 & 0x07 = 0x03
    m2 = 0x6e & 0x3F = 0x2e
    m3 = 0x20 & 0x3F = 0x20
    m4 = 0x4d & 0x3F = 0x0d

[Notice the problem is that such an algorithm does not check that byte2, byte3 and byte4 are in the range 0x80 - 0xBF at all. One possibility is that the code simply does not check. The other possibility is that the code does check the values but gets it wrong by comparing a (char) value with an (unsigned char) value using < and >. What I mean is the following:

    #include <stdio.h>

    int main(void)
    {
        char a = 0x23;
        printf("a is %x ", a);
        /* Where char is signed, (char)0x80 is typically -128,
           so this comparison is true even for 0x23. */
        if (a > (char)0x80)
            printf("and a is greater than 0x80\n");
        else
            printf("and a is less or equal than 0x80\n");
        return 0;
    }

    sh% ./b
    a is 23 and a is greater than 0x80
]

Then it calculates the UCS4 value using

    ucs4 = (m1 << 18) | (m2 << 12) | (m3 << 6) | (m4 << 0);

In your case, what it got is

    ucs4 = (0x03 << 18) | (0x2e << 12) | (0x20 << 6) | (0x0d << 0)
         = 0xc0000 | 0x2e000 | 0x800 | 0x0d
         = U+EE80D

3.c. Now it turns that UCS4 value into a UTF-16 surrogate pair:

    surrogate high = ((ucs4 - 0x10000) >> 10) | 0xd800
                   = ((0xee80d - 0x10000) >> 10) | 0xd800
                   = (0xde80d >> 10) | 0xd800
                   = 0x037a | 0xd800
                   = 0xdb7a

    surrogate low  = ((ucs4 - 0x10000) & 0x03FF) | 0xdc00
                   = ((0xee80d - 0x10000) & 0x03FF) | 0xdc00
                   = (0xde80d & 0x3FF) | 0xdc00
                   = 0x0d | 0xdc00
                   = 0xdc0d

So now you have the UTF-16 sequence DB7A DC0D.

4. Now process B (or some other code) tries to convert the UTF-16 into HTML NCRs. Unfortunately, that process does not handle the UTF-16 to NCR conversion correctly. Instead of doing it the right way, as below:

4.a. take DB7A DC0D and convert it to UCS4 as 0xEE80D
4.b. convert EE80D to decimal as 976909 and generate "&#976909;"

it converts DB7A to decimal 56186 and generates "&#56186;", and then it converts DC0D to decimal 56333 and generates "&#56333;".

So, in summary, there are three problems in your system, not just one.

Problem 1: Process A converts the data to ISO-8859-1 while process B is expecting UTF-8. You should either fix process A so that it generates UTF-8 or fix process B so that it treats the input as ISO-8859-1. The former is the preferred approach.

Problem 2: The UTF-8 converter in process B does not strictly implement the requirement in RFC 3629, which says it MUST protect against decoding invalid sequences. With a converter of this quality, a non-ASCII byte at the end of a line will probably cause your software to fold lines together, and one at the end of a record may even crash your software. You need to fix the scanning part of the converter.

Problem 3: The UTF-16 to NCR conversion is incorrect according to the HTML specification.

Hope the above analysis helps.

pepe pepe wrote:
> Hello:
>
> We have the following sequence of characters "...ización Map.." that is
> the same than "...ización Map..." that after suffering some
> transformations becomes to "...izaci&#56186;&56333;ap"
> AS you can see the two characters 56186 and 56333 seem to represent this
> sequences "ón M". Any idea?.
>
> Regards,
> Mario.

-- 
Frank Yung-Fong Tang
System Architect, International Development, AOL Interactive Services
AIM:yungfongta mailto:[EMAIL PROTECTED]
Tel:650-937-2913
Yahoo! Msg: frankyungfongtan

John 3:16 "For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life."
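The arithmetic in steps 3.c and 4 of the message above can be sketched in C as follows. This is only an illustration of the surrogate and NCR calculations; the function names `to_surrogates`, `from_surrogates` and `format_ncr` are made up for this sketch and do not come from any of the software discussed in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Split a supplementary code point (0x10000..0x10FFFF) into a UTF-16 pair. */
void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    *high = (uint16_t)(0xD800 | (cp >> 10));
    *low  = (uint16_t)(0xDC00 | (cp & 0x3FF));
}

/* Recombine a high/low surrogate pair into one code point. */
uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10) |
                      (uint32_t)(low - 0xDC00));
}

/* Write a decimal numeric character reference for one code point.
   Returns what snprintf returns (the length of the full reference). */
int format_ncr(uint32_t cp, char *out, size_t n)
{
    return snprintf(out, n, "&#%lu;", (unsigned long)cp);
}
```

Calling `format_ncr(from_surrogates(0xDB7A, 0xDC0D), buf, sizeof buf)` writes a single reference for the recombined code point 0xEE80D, rather than one reference per 16-bit unit.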
Re: Problems encoding the spanish o
Philippe Verdy wrote:

> If IE really wants to keep some compatibility, it may only accept the
> CESU-8 encoding only as a possible choice for its "automatic
> selection" of charsets, or display a visible replacement character
> (such as a narrow white box) for invalid characters (that could
> internally be handled as if these invalid sequences were representing
> U+).

1. CESU-8 should *never* be auto-detected. CESU-8 is intended for internal use only. Even the TR says this.

2. CESU-8 has nothing to do with overlong sequences. They're just as invalid there as in UTF-8.

So I really don't know how CESU-8 got dragged into this thread in the first place.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
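For readers unfamiliar with CESU-8: it differs from UTF-8 only in how it represents supplementary characters, encoding each UTF-16 surrogate as its own 3-byte sequence. The C sketch below illustrates that difference; the function names are invented for this example, and it handles supplementary code points only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UTF-8: one 4-byte sequence for a supplementary code point. */
size_t encode_utf8_supplementary(uint32_t cp, unsigned char out[4])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Helper: 3-byte form for one 16-bit unit in 0x0800..0xFFFF. */
void encode_3byte(uint16_t u, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}

/* CESU-8: the same code point as two 3-byte sequences, one per surrogate. */
size_t encode_cesu8_supplementary(uint32_t cp, unsigned char out[6])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    encode_3byte((uint16_t)(0xD800 | (cp >> 10)), out);
    encode_3byte((uint16_t)(0xDC00 | (cp & 0x3FF)), out + 3);
    return 6;
}
```

For U+10000 the first function yields F0 90 80 80 and the second yields ED A0 80 ED B0 80. Neither form involves overlong sequences.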
Re: Problems encoding the spanish o
From: "Marco Cimarosti" <[EMAIL PROTECTED]>
To: "'Pim Blokland'" <[EMAIL PROTECTED]>; "Unicode mailing list" <[EMAIL PROTECTED]>

> Pim Blokland wrote:
> > Not only that, but the process making the mistake of thinking it is
> > UTF-8 also makes the mistake of not generating an error for
> > encountering malformed byte sequences,
>
> BTW, this process has a name: "Internet Explorer".

Don't blame IE too much if it attempts to interpret the text using UTF-8, because the page is tagged explicitly with a UTF-8 charset.

Well, it's true that IE should stop using this erroneous charset tag as soon as it sees a violation of the UTF-8 rules, and should instead attempt to use its "automatic selection". But it's also true that IE still attempts to use the legacy UTF-8 encoding, which allowed interpreting non-shortest (overlong) sequences.

I do think this bug does not occur within recent updates of IE, notably since it was corrected to remove the security hole in MSHTML.DLL to avoid interpreting non-shortest sequences.

If IE really wants to keep some compatibility, it might accept the CESU-8 encoding only as a possible choice for its "automatic selection" of charsets, or display a visible replacement character (such as a narrow white box) for invalid characters (that could internally be handled as if these invalid sequences were representing U+).

But if the user forces the UTF-8 decoding in the GUI, IE should still not accept any invalid UTF-8 sequence, and should interpret it as an invalid character like U+ or, even better, disable this UTF-8 choice in the user interface.

So this is really an effect of the collision of multiple Unicode violations, both in the User-Agent interpreting the coded strings, and in the content of the page, incorrectly labelled UTF-8 when it is not (here: complain to your web page designer, or blame yourself if you created this page with invalid meta-tags).

When editing a UTF-8 page that explicitly includes the UTF-8 charset meta-tag, make sure that your editor does not save it as ISO-8859-1 just because it thinks that will save storage space. There are also some bogus "web site optimizers" that perform this kind of encoding optimization (in addition to removing unnecessary spaces and new lines, or compressing/obfuscating the JavaScript code and CSS stylesheet class names) and don't take care of changing the value of this meta-tag.

Changing the internal encoding of any text file without an explicit request from the user should never be done automatically without confirmation and logging of the actions taken.
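The idea that a consumer should stop trusting a UTF-8 label at the first violation can be sketched as a strict well-formedness check in C. This is a minimal sketch under the usual RFC 3629 rules (continuation bytes in 0x80-0xBF, no overlong forms, no surrogates, nothing above 0x10FFFF); the function name `utf8_is_well_formed` is hypothetical and this is not how IE or any other product mentioned here is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if s[0..n-1] is well-formed UTF-8, 0 at the first violation. */
int utf8_is_well_formed(const unsigned char *s, size_t n)
{
    size_t i = 0;
    assert(s != NULL || n == 0);
    while (i < n) {
        unsigned char b = s[i];
        size_t need;       /* number of continuation bytes expected */
        uint32_t cp, min;  /* decoded value and smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                    /* stray continuation or 0xF8..0xFF */

        if (n - i - 1 < need) return 0;   /* sequence truncated */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not 0x80..0xBF */
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                       /* overlong form */
        if (cp > 0x10FFFF) return 0;                  /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
        i += need + 1;
    }
    return 1;
}
```

A caller could run this over the page bytes and, when it returns 0, fall back to another charset or show replacement characters instead of decoding the bytes anyway. The Latin-1 bytes F3 6E 20 4D fail at the second byte.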
Re: Problems encoding the spanish o
Hello:

My knowledge about encoding is very poor and you seem to know a lot about this. Could you explain a bit more what you have said? I have done the following.

This is the problematic sequence:

    11110011-01101110-00100000-01001101 (F3-6E-20-4D)

If I follow the instructions that appear in the question "What is UTF-8?" in the UTF-8 FAQ, I obtain the following:

    011101110100000001101

instead of 1EE80D:

    111101110100000001101

(Have I made a mistake?) Following the UTF-16 encoding from my result, all works well.

So, to finalize: who do you think is responsible for this strange situation, the client for saying that the doc is UTF-8, or the parser?

Regards,
Mario.

From: Pim Blokland <[EMAIL PROTECTED]>
To: Unicode mailing list <[EMAIL PROTECTED]>
Subject: Re: Problems encoding the spanish o
Date: Mon, 17 Nov 2003 13:26:19 +0100

pepe pepe wrote:
> We have the following sequence of characters "...ización Map.." that is
> the same than "...ización Map..." that after suffering some
> transformations becomes to "...izaci&#56186;&56333;ap"
> AS you can see the two characters 56186 and 56333 seem to represent this
> sequences "ón M". Any idea?.

Yes, your input text obviously gets flagged as being in UTF-8 format, even if it is Latin-1 (or any codepage that has an ó at index 243).

Not only that, but the process making the mistake of thinking it is UTF-8 also makes the mistake of not generating an error for encountering malformed byte sequences, AND of outputting the result as two 16-bit numbers instead of one 21-bit number.

If you take the byte sequence (hex) F3 6E 20 4D and treat it as UTF-8 and don't care it's not valid, this maps to the value (hex) 1EE80D. Again, not caring this is not a valid codepoint, turning this into UTF-16 would yield U+DB7A U+DC0D, which is what you got in your output.

Pim Blokland
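The hand computation asked about above can be checked with a short C sketch. It assumes the standard 4-byte UTF-8 bit layout (3 payload bits from the first byte, 6 from each following byte); the names `naive_decode4` and `checked_decode4` are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode 4 bytes as if they were a 4-byte UTF-8 sequence, with no checks.
   This mirrors what a careless converter does. */
uint32_t naive_decode4(const unsigned char *b)
{
    assert(b != NULL);
    return ((uint32_t)(b[0] & 0x07) << 18) |
           ((uint32_t)(b[1] & 0x3F) << 12) |
           ((uint32_t)(b[2] & 0x3F) << 6)  |
            (uint32_t)(b[3] & 0x3F);
}

/* Same decode, but reject bad lead/continuation bytes and bad ranges.
   Returns 1 and stores the code point on success, 0 on malformed input. */
int checked_decode4(const unsigned char *b, uint32_t *cp)
{
    assert(b != NULL && cp != NULL);
    if ((b[0] & 0xF8) != 0xF0) return 0;            /* not a 4-byte lead */
    for (int k = 1; k < 4; k++)
        if ((b[k] & 0xC0) != 0x80) return 0;        /* not 0x80..0xBF */
    *cp = naive_decode4(b);
    return *cp >= 0x10000 && *cp <= 0x10FFFF;       /* overlong or too big */
}
```

With this masking, the bytes F3 6E 20 4D give 0xEE80D from `naive_decode4`, which is the 21-bit pattern 011 101110 100000 001101, and it is the value whose surrogate pair is DB7A DC0D. `checked_decode4` rejects the same bytes because 0x6E is not a continuation byte.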
RE: Problems encoding the spanish o
Pim Blokland wrote:

> Not only that, but the process making the mistake of thinking it is
> UTF-8 also makes the mistake of not generating an error for
> encountering malformed byte sequences,

BTW, this process has a name: "Internet Explorer".

> AND of outputting the result as two 16-bit numbers instead of one
> 21-bit number.

I guess that this resulted from copying & pasting the resulting text into an editor and saving it as UTF-16.

_ Marco
Re: Problems encoding the spanish o
pepe pepe wrote:
> We have the following sequence of characters "...ización Map.." that is
> the same than "...ización Map..." that after suffering some
> transformations becomes to "...izaci&#56186;&56333;ap"
> AS you can see the two characters 56186 and 56333 seem to represent this
> sequences "ón M". Any idea?.

Yes, your input text obviously gets flagged as being in UTF-8 format, even if it is Latin-1 (or any codepage that has an ó at index 243).

Not only that, but the process making the mistake of thinking it is UTF-8 also makes the mistake of not generating an error for encountering malformed byte sequences, AND of outputting the result as two 16-bit numbers instead of one 21-bit number.

If you take the byte sequence (hex) F3 6E 20 4D and treat it as UTF-8 and don't care it's not valid, this maps to the value (hex) 1EE80D. Again, not caring this is not a valid codepoint, turning this into UTF-16 would yield U+DB7A U+DC0D, which is what you got in your output.

Pim Blokland
RE: Problems encoding the spanish o
pepe pepe wrote:
> We have the following sequence of characters "...ización
> Map.." that is the same than "...ización Map..." that
> after suffering some transformations becomes to
> "...izaci&#56186;&56333;ap" AS you can see the two
> characters 56186 and 56333 seem to represent this
> sequences "ón M". Any idea?.

Yes. In the <head> of your HTML file, you should have a line like this:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Change "utf-8" to "iso-8859-1", or simply remove the whole line.

_ Marco