Re: [xml] UTF-8 validation

Daniel Veillard Wed, 10 Oct 2007 01:50:39 -0700

On Fri, Oct 05, 2007 at 04:10:56PM -0700, Norbert Lindenberg wrote:
> Hi there,
> 
> Can you tell me whether libxml2 does complete validation of UTF-8  
> when input is provided in this character encoding? By complete  
> validation I mean:
> 
> - Verifying that each character is represented by a byte sequence  
> that matches one of the patterns described in section 3 of RFC 3629.
> 
> - Verifying that each character is represented by the shortest  
> possible byte sequence (ruling out, for example, the use of 0xC0 0x80  
> for U+0000).
> 
> - Verifying that supplementary characters are represented by a 4-byte  
> sequence, not by a pair of surrogate characters.
> 
> - Verifying that illegal code points, such as the not-a-character  
> characters, U+FFFE, U+FFFF, etc., do not occur.
> 
> Bug report 305333 implies that some of this validation occurs, but  
> the references to the obsolete RFC 2044 in the documentation worry me  
> a bit.


  libxml2 checks UTF-8 sequences when parsing documents. It does not
perform those checks in the APIs used to modify or create a document;
xmlChar* values passed there are assumed to be correct UTF-8 strings.
  The parse-time checks are based on the XML character ranges,
 see http://www.w3.org/TR/REC-xml/#NT-Char
This ensures that U+0000 or surrogates, for example, generate
fatal errors if encountered.
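
  To make the range framework concrete, here is a minimal sketch of the
Char production from the XML 1.0 Recommendation, written as a check on an
already decoded code point. The name is_xml_char is hypothetical and is
not a libxml2 function; this only illustrates the ranges and is not how
the parser itself is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of libxml2.
 * Returns true when the code point matches the XML 1.0 Char production:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 */
static bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||      /* excludes U+0000 and most C0 controls */
           (cp >= 0xE000 && cp <= 0xFFFD) ||    /* skips surrogates, U+FFFE, U+FFFF */
           (cp >= 0x10000 && cp <= 0x10FFFF);   /* supplementary planes */
}
```

  Under this production U+0000, the surrogate block U+D800..U+DFFF,
U+FFFE and U+FFFF are all rejected. Other noncharacters such as U+FDD0
or U+1FFFE still fall inside the allowed ranges, and shortest-form
(overlong) encoding is a byte-level question that a code point range
check alone does not answer.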
  If my answer sounds incomplete to you, could you explain your concerns
in terms of the XML character range framework?

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/