Dimitry,

Thank you for taking the time to respond.  You were exactly right and your
explanation was very helpful.

thanks again,
-- John

-----Original Message-----
From: Voytenko, Dimitry [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 8:10 PM
To: '[EMAIL PROTECTED]'
Subject: RE: HELP ME! - Utf question


Hi John,

I pointed this out in my previous letter. Here's the relevant fragment from
the XML specification:

<spec href="http://www.w3.org/TR/REC-xml#sec-references">
[Definition: A character reference refers to a specific character in the
ISO/IEC 10646 character set, for example one not directly accessible from
available input devices.]
</spec>

It means that characters referenced via &#NNN; are interpreted as Unicode
characters. So your two inputs are different: in case (1) you have three
bytes that are read as one Unicode character; in case (2) you specified
three Unicode characters.
In other words, the XML parser reads each character reference &#NNN; and
treats the number as the code point of a Unicode character. It does NOT
read all the character references, combine the resulting bytes into a UTF-8
sequence, and then convert that to Unicode. So each character reference
&#NNN; must already specify the index of a Unicode character.
So in your case the following two fragments are identical:
(1) <abc>垾</abc>   (the raw UTF-8 bytes e5 9e be)
(2) <abc>&#x57be;</abc>
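This can be checked with a short Java program -- a sketch using the JDK's
built-in DOM parser rather than Xerces specifically; the <abc> element and
the code point 0x57be are taken from your example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CharRefDemo {
    // Parse an XML string (serialized as UTF-8 bytes) and return the
    // text content of the document element.
    static String parseText(String xml) throws Exception {
        DocumentBuilder db =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // (1) the literal character: serialized as the raw bytes E5 9E BE
        String viaRawBytes = parseText("<abc>\u57be</abc>");
        // (2) a numeric character reference naming the same code point
        String viaCharRef  = parseText("<abc>&#x57be;</abc>");
        System.out.println(viaRawBytes.equals(viaCharRef)); // true
        System.out.println(viaRawBytes.length());           // 1
    }
}
```

Both fragments parse to the same one-character string.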

Thanks,
Dmitry

-----Original Message-----
From: Colosi, John [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 16:39
To: '[EMAIL PROTECTED]'
Subject: HELP ME! - Utf question


All,

        Let me refine my explanation.  Dimitry, your feedback has been very
helpful so far.  I would really appreciate any other feedback as well.

        Given:
                  UTF-8      UTF-16
                e5 9e be  =   57be



        Now, consider the following:


        #1
                <abc>垾</abc>
                The <abc> element contains the values "e5", "9e", and "be"
inside the brackets, but the values are in raw binary format.  The Xerces
parser assumes these bytes are UTF-8 and converts them to UTF-16 (Unicode).
A Java String of length 1 (one) is constructed whose value is 0x57be (see
"Given" above).


        #2
                <abc>&#xe5;&#x9e;&#xbe;</abc>
                Now the <abc> element contains the hex values "e5", "9e",
and "be".  These values, however, are specified as character references and
are not interpreted as UTF-8.  A Java String of length 3 (three) is
constructed whose value is 0x00e5, 0x009e, 0x00be.
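Both cases can be reproduced directly with Java's charset machinery -- a
minimal sketch independent of any XML parser, using the byte values from
the "Given" table above:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        // Case #1: the three raw bytes E5 9E BE, decoded as UTF-8,
        // yield ONE character whose value is 0x57be.
        byte[] raw = { (byte) 0xE5, (byte) 0x9E, (byte) 0xBE };
        String caseOne = new String(raw, StandardCharsets.UTF_8);
        System.out.printf("case 1: length=%d value=%04x%n",
                caseOne.length(), (int) caseOne.charAt(0));

        // Case #2: three character references name three separate
        // code points, so the String has length 3.
        String caseTwo = "\u00e5\u009e\u00be";
        System.out.printf("case 2: length=%d%n", caseTwo.length());
    }
}
```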


        The input data is identical in the two cases.  In both, the user
wishes to specify the hex data "e5 9e be".  The parser handles the data
differently depending on the method of input, resulting in different
output.  Is there any way to rectify this?
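One way to rectify it -- a sketch, assuming the goal is to produce a
reference the parser will read back as the intended character -- is to
decode the raw bytes as UTF-8 first, and only then write a numeric
character reference for the resulting code point:

```java
import java.nio.charset.StandardCharsets;

public class RefBuilder {
    public static void main(String[] args) {
        // The raw UTF-8 bytes the user wants to express
        byte[] raw = { (byte) 0xE5, (byte) 0x9E, (byte) 0xBE };
        // Decode to Unicode FIRST, then emit one reference per code point
        String decoded = new String(raw, StandardCharsets.UTF_8);
        StringBuilder refs = new StringBuilder();
        decoded.codePoints()
               .forEach(cp -> refs.append(String.format("&#x%x;", cp)));
        System.out.println(refs); // &#x57be;
    }
}
```

Writing one reference per byte, by contrast, names three unrelated code
points, which is why the two inputs diverge.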

thanks,
-- John

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
