Dimitry, Thank you for taking the time to respond. You were exactly right and your explanation was very helpful.
thanks again,
-- John

-----Original Message-----
From: Voytenko, Dimitry [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 8:10 PM
To: '[EMAIL PROTECTED]'
Subject: RE: HELP ME! - Utf question

Hi John,

I pointed this out in my previous message. Here's a fragment from the XML specification:

<spec href="http://www.w3.org/TR/REC-xml#sec-references">
[Definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]
</spec>

This means that characters referred to via &###; are interpreted as Unicode characters. So your two inputs are different: in case (1) you have three bytes, which are read as one Unicode character; in case (2) you specified three Unicode characters.

The XML parser reads each character reference &###; and treats its number as the index of a character in Unicode. It does NOT work this way: read all the character references &###;, combine them as UTF-8 bytes, and then convert the result to Unicode. So each character reference &###; must already specify the index of a Unicode character.

So in your case the following two fragments are identical:

(1) <abc>垾</abc>   (the raw UTF-8 bytes e5 9e be)
(2) <abc>&#x57be;</abc>

Thanks,
Dmitry

-----Original Message-----
From: Colosi, John [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 16:39
To: '[EMAIL PROTECTED]'
Subject: HELP ME! - Utf question

All,

Let me refine my explanation. Dimitry, your feedback has been very helpful so far. I would really appreciate any other feedback as well.

Given:

    UTF-8       UTF-16
    e5 9e be  = 57be

Now, consider the following:

#1 <abc>垾</abc>

The <abc> element contains the values "e5", "9e", and "be" inside the brackets, but the values are in a raw binary format. The Xerces parser assumes these values are UTF-8 and converts them to UTF-16 (Unicode). A Java String of length 1 (one) is constructed whose value is 0x57be (see "Given" above).

#2 <abc>&#xe5;&#x9e;&#xbe;</abc>

Now the <abc> element contains the hex values "e5", "9e", and "be".
These values, however, are specified as hex values and are not interpreted as UTF-8. A Java String of length 3 (three) is constructed whose value is 0x00e5, 0x009e, 0x00be.

The input data is identical in the two cases. In both, the user wishes to specify the hex data "e5 9e be". The parser handles the data differently depending on the method of input, resulting in different output. Is there any way to rectify this?

thanks,
-- John

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
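
For anyone who wants to reproduce the behaviour discussed in this thread, here is a minimal sketch. It assumes the JAXP parser bundled with the JDK rather than a specific Xerces release, and the class and method names (CharRefDemo, wrap, rootText, reinterpretAsUtf8) are made up for illustration. It parses the raw bytes e5 9e be, then the three character references, and finally shows one possible application-level workaround for the question of how to rectify this: re-decoding the three characters as UTF-8 bytes. That workaround is an assumption, not something proposed in the thread, and it only makes sense if every character reference is known to carry a byte value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; none of these names come from the thread.
class CharRefDemo {

    // Wrap the given content bytes in <abc>...</abc> without altering them.
    static byte[] wrap(byte[] content) {
        byte[] open = "<abc>".getBytes(StandardCharsets.US_ASCII);
        byte[] close = "</abc>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(open, 0, open.length);
        out.write(content, 0, content.length);
        out.write(close, 0, close.length);
        return out.toByteArray();
    }

    // Parse the document and return the text of the root element.
    // With no XML declaration and no BOM, the bytes are read as UTF-8.
    static String rootText(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return doc.getDocumentElement().getTextContent();
    }

    // Workaround sketch (an assumption): map each char to one byte via
    // ISO-8859-1, then decode those bytes as UTF-8. Chars above 0xFF
    // cannot be mapped and would be replaced, so this is not general.
    static String reinterpretAsUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render each UTF-16 code unit as 0xNNNN for display.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("0x%04x ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // Case #1: the three raw bytes, decoded by the parser as UTF-8.
        byte[] rawBytes = {(byte) 0xE5, (byte) 0x9E, (byte) 0xBE};
        String fromBytes = rootText(wrap(rawBytes));
        System.out.println(fromBytes.length() + " " + hex(fromBytes));
        // prints: 1 0x57be

        // Case #2: three character references, each one a Unicode character.
        byte[] refs = "&#xE5;&#x9E;&#xBE;".getBytes(StandardCharsets.US_ASCII);
        String fromRefs = rootText(wrap(refs));
        System.out.println(fromRefs.length() + " " + hex(fromRefs));
        // prints: 3 0x00e5 0x009e 0x00be

        // A single reference to U+57BE matches case #1.
        byte[] oneRef = "&#x57BE;".getBytes(StandardCharsets.US_ASCII);
        System.out.println(rootText(wrap(oneRef)).equals(fromBytes));
        // prints: true

        // Workaround: collapse the three chars back into one character.
        System.out.println(hex(reinterpretAsUtf8(fromRefs)));
        // prints: 0x57be
    }
}
```

The cleaner fix, where the input can be changed, is to write the character reference for the Unicode character itself (&#x57BE;) instead of one reference per UTF-8 byte.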
