Unicode Normalization forms on input

Elliotte Rusty Harold Sun, 13 Oct 2002 07:33:49 -0700

Does anyone happen to know which Unicode normalization form Xerces 
uses when reading a document in a non-UCS character set such as 
ISO-8859-6 or SJIS? The issue is that some characters in these sets 
can be decoded to more than one sequence of Unicode characters. For 
example, is e with acute accent decoded as &#xE9; (one precomposed 
character) or as &#x65;&#x301; (ASCII e plus combining acute accent)?
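
To make the two candidate decodings concrete, here is a minimal sketch 
in Java. It assumes java.text.Normalizer, which only arrived in Java 6 
and so is not available in Java 1.4; the class name NormalizationForms 
and the helper toNfc are made up for illustration.

```java
import java.text.Normalizer;

class NormalizationForms {
    // Convert any string to Normalization Form C (canonical composition).
    // Assumes Java 6 or later, where java.text.Normalizer exists.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // e with acute as a single character
        String decomposed = "e\u0301";  // ASCII e followed by combining acute

        // The two strings are different char sequences...
        System.out.println(precomposed.equals(decomposed));         // false
        // ...but they are canonically equivalent, so NFC makes them identical.
        System.out.println(toNfc(decomposed).equals(precomposed));  // true
        // Only the precomposed spelling is already in NFC.
        System.out.println(
            Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
    }
}
```

A parser that handed back the decomposed spelling would therefore need 
an explicit normalization pass before its output could be treated as 
form C.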


Normally the difference doesn't matter, but Canonical XML (and thus 
XML Encryption) requires that Unicode Normalization Form C (NFC) be 
used.

The Java 1.4 documentation also does not appear to state exactly 
which normalization form it uses.
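
Absent documentation, one way to probe the question is to decode a 
known byte and inspect what comes back. The sketch below does that for 
a single ISO-8859-1 byte using the JDK's own decoder. It is only an 
empirical spot check under several assumptions: Java 6 or later (for 
java.text.Normalizer and the String(byte[], Charset) constructor), one 
byte in one charset, and hypothetical names DecodeCheck, decode, and 
decodesToNfc. It says nothing about other encodings, and nothing about 
whether Xerces uses the JDK decoders for a given encoding.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

class DecodeCheck {
    // Decode bytes with the named charset using the JDK's built-in decoder.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Report whether the decoded text is already in Normalization Form C.
    static boolean decodesToNfc(byte[] bytes, String charsetName) {
        String text = decode(bytes, charsetName);
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        byte[] eAcute = { (byte) 0xE9 };  // e with acute in ISO-8859-1

        // Print each decoded UTF-16 code unit so the form is visible.
        String text = decode(eAcute, "ISO-8859-1");
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("U+%04X%n", (int) text.charAt(i));
        }
        System.out.println("NFC: " + decodesToNfc(eAcute, "ISO-8859-1"));
    }
}
```

The same check can be repeated with bytes from ISO-8859-6 or SJIS, 
provided the running JVM supports those charsets.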
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|          XML in a  Nutshell, 2nd Edition (O'Reilly, 2002)          |
|              http://www.cafeconleche.org/books/xian2/              |
|  http://www.amazon.com/exec/obidos/ISBN%3D0596002920/cafeaulaitA/  |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/    |
+----------------------------------+---------------------------------+
