[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Martijn Faassen Mon, 15 Jan 2007 13:09:30 -0800

Tres Seaver wrote:


Andreas Jung wrote:

--On 14. Januar 2007 18:14:45 +0000 Chris Withers <[EMAIL PROTECTED]> wrote:

Dieter Maurer wrote:

A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.

The trivial fix I use in Twiddler is as follows:

if isinstance(source, unicode):
    source = source.encode('utf-8')

Of course, this assumes a heading of either <?xml version="1.0"
encoding="utf-8"?> or a missing encoding attribute, in which case the xml
spec states that the string must be utf-8 encoded.
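The assumption above can be checked with a minimal sketch. It is written for Python 3, where `str` plays the role of Python 2's `unicode`, and it uses the standard library's `xml.etree.ElementTree` rather than zope.tal; the helper name `parse_text` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_text(source):
    """Encode text to UTF-8 before parsing; pass bytes through unchanged."""
    if isinstance(source, str):  # `unicode` in the Python 2 snippet above
        source = source.encode("utf-8")
    return ET.fromstring(source)


# No encoding declaration: the parser assumes UTF-8, which matches the bytes.
ok = parse_text("<a>\u00e9</a>")

# The declaration names iso-8859-1, but the bytes are really UTF-8.
# The parser trusts the declaration and decodes the bytes the wrong way.
bad = parse_text('<?xml version="1.0" encoding="iso-8859-1"?><a>\u00e9</a>')

print(repr(ok.text))
print(repr(bad.text))
```

With no declaration the single character survives the round trip; with the conflicting declaration it comes back as two Latin-1 characters, so the blind encode is only safe under the stated assumption.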

The encoding of the XML preamble should not matter when parsing an XML
document stored as unicode string.


That encoding is a *lie*, which is the real problem.  Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).

Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.

While I agree that the encoding declaration is ambiguous at best and
should be rejected, you can find a bit in the spec which supports XML as
Python unicode strings. A Python unicode string can be seen as a string
with "external character encoding information": it's the native encoding
of Python. Therefore we can make sense of it in an XML parser. For my
previous analysis of the spec see here:


http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html

What is bad and evil, however, is to just ignore conflicting encoding
declarations in an XML document itself. I'd choose either one of:


* bail with a clear error when unicode is supplied at all

* bail with a clear error when unicode is supplied with any explicit
  encoding declaration in the XML.
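
The second option could look roughly like the following sketch (Python 3, where `str` stands in for Python 2's `unicode`). The function name `check_text_source` and the regular expression are illustrative assumptions, not code from zope.tal or lxml.

```python
import re

# Matches an XML declaration at the very start of the document that
# carries an explicit encoding pseudo-attribute.
_ENCODING_DECL = re.compile(
    r"""<\?xml\s[^>]*?encoding\s*=\s*["'][A-Za-z][A-Za-z0-9._-]*["']"""
)


def check_text_source(source):
    """Reject text input that also carries an explicit encoding declaration.

    Bytes are passed through untouched: for them the declaration is
    meaningful and the parser is expected to honour it.
    """
    if isinstance(source, str) and _ENCODING_DECL.match(source):
        raise ValueError(
            "text input must not carry an XML encoding declaration"
        )
    return source
```

Text without a declaration, or with a declaration that has no encoding, passes; text with any explicit encoding raises `ValueError`. The first option would simply raise for every `str` input.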

It is of importance as soon as you convert the document back to a stream,
e.g. when we deliver the content back to a browser or an FTP client. The
ZPublisher (for Zope 2) deals with that by changing the encoding parameter
of the preamble for XML documents based on the desired output encoding.
utf-8 is always a good choice; however, other encodings like iso-8859-15
might raise UnicodeEncodeErrors. The Zope 2 publisher "avoids" this problem
by converting the unicode result using errors='replace' (which is likely
something we might discuss :-))

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF-8.
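
The output-side behaviour described for the publisher (rewrite the declared encoding, then encode, optionally with errors='replace') can be sketched as below. This is not the ZPublisher code; `serialize` and the regular expression are hypothetical names chosen for illustration, and the sketch targets Python 3.

```python
import re

# Captures the text around the encoding value inside an XML declaration.
_DECL_ENCODING = re.compile(
    r"""(<\?xml[^>]*?encoding\s*=\s*["'])[^"']*(["'])"""
)


def serialize(text, encoding="utf-8", errors="strict"):
    """Make the declaration match the output encoding, then encode."""
    text = _DECL_ENCODING.sub(
        lambda m: m.group(1) + encoding + m.group(2), text, count=1
    )
    return text.encode(encoding, errors)


# U+4E2D has no representation in iso-8859-15.
doc = '<?xml version="1.0" encoding="utf-8"?><a>\u4e2d</a>'

# Strict encoding raises UnicodeEncodeError for this document;
# errors="replace" silently substitutes "?" for the character.
lossy = serialize(doc, "iso-8859-15", errors="replace")
print(lossy)
```

The lossy variant never fails, but it drops data without telling anyone, which is presumably why the choice of errors='replace' is flagged as worth discussing.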

You can in lxml. :) libxml2 as a C API doesn't even support any unicode
string type as far as I am aware.


Regards,

Martijn

