On Fri, Oct 05, 2007 at 04:10:56PM -0700, Norbert Lindenberg wrote:
> Hi there,
>
> Can you tell me whether libxml2 does complete validation of UTF-8
> when input is provided in this character encoding? By complete
> validation I mean:
>
> - Verifying that each character is represented by a byte sequence
> that matches one of the patterns described in section 3 of RFC 3629.
>
> - Verifying that each character is represented by the shortest
> possible byte sequence (ruling out, for example, the use of 0xC0 0x80
> for U+0000).
>
> - Verifying that supplementary characters are represented by a 4-byte
> sequence, not by a pair of surrogate characters.
>
> - Verifying that illegal code points, such as the not-a-character
> characters, U+FFFE, U+FFFF, etc., do not occur.
>
> Bug report 305333 implies that some of this validation occurs, but
> the references to the obsolete RFC 2044 in the documentation worry me
> a bit.

libxml2 checks UTF-8 sequences when parsing documents. It doesn't do
checks in the APIs that modify or create a document; xmlChar * strings
are assumed to be correct UTF-8.

With respect to the checks, they are based on the character ranges, see
  http://www.w3.org/TR/REC-xml/#NT-Char
This ensures that U+0000 or surrogates, for example, generate fatal
errors if encountered.

Could you explain your concerns in terms of the XML character range
framework, in case my answer sounds incomplete to you?

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
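
For readers mapping the questions above onto the XML character range
framework, here is a minimal sketch of the Char production from the XML
1.0 Recommendation written as a C predicate. The function name
is_xml_char is hypothetical and is not part of the libxml2 API; the
ranges are taken from the production itself, not from libxml2's source.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a libxml2 function.
 * Returns true if the code point matches the XML 1.0 Char production:
 *   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
 */
bool is_xml_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)        /* excludes U+0000 and most C0 controls */
        || (cp >= 0xE000 && cp <= 0xFFFD)      /* skips surrogates, U+FFFE and U+FFFF */
        || (cp >= 0x10000 && cp <= 0x10FFFF);  /* supplementary planes */
}
```

Under this production U+0000, the surrogate range U+D800 to U+DFFF,
U+FFFE and U+FFFF are all rejected. Other noncharacters, such as U+FDD0
or U+1FFFF, fall inside the allowed ranges, so a check based on the
production alone would not reject them. A predicate like this works on
decoded code points, so it does not by itself answer the shortest-form
question about byte sequences such as 0xC0 0x80.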
