Tres Seaver wrote:
Andreas Jung wrote:
--On 14. Januar 2007 18:14:45 +0000 Chris Withers <[EMAIL PROTECTED]>
wrote:
Dieter Maurer wrote:
A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.
The trivial fix I use in Twiddler is as follows:
    if isinstance(source, unicode):
        source = source.encode('utf-8')
Of course, this assumes a heading of either <?xml version="1.0"
encoding="utf-8"?> or a missing encoding attribute, in which case the xml
spec states that the string must be utf-8 encoded.
The encoding of the XML preamble should not matter when parsing an XML
document stored as a unicode string.
That encoding is a *lie*, which is the real problem. Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).
Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.
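The lookup order described above (an encoding supplied by the application from the request headers, else the declared encoding, else UTF-8) can be sketched roughly as follows. This is only an illustration: `pick_encoding` and its `http_charset` parameter are hypothetical names, it is written for Python 3 where `bytes` plays the role of the Python 2 `str`, it assumes the transport-supplied charset takes precedence, and it ignores byte order marks and UTF-16/UTF-32 detection, which a real parser must handle.

```python
import re

# Simplified match for an XML declaration carrying an encoding
# pseudo-attribute; a real parser follows the XMLDecl grammar exactly.
_ENCODING_DECL = re.compile(
    rb"""^<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def pick_encoding(data, http_charset=None):
    """Hypothetical helper: choose the encoding to decode `data` (bytes) with."""
    if http_charset:
        # Assumption: external information from the transport wins.
        return http_charset
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No declaration and no external information: fall back to UTF-8.
    return "utf-8"
```

Note that nothing here applies to a unicode string: the function only makes sense for a byte stream, which is the point being made above.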
While I agree that the encoding declaration is ambiguous at best and
should be rejected, you can find a bit in the spec which supports XML as
Python unicode strings. A Python unicode string can be seen as a string
with "external character encoding information": it's the native encoding
of Python. Therefore we can make sense of it in an XML parser. For my
previous analysis of the spec see here:
http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html
What however is bad and evil is to just ignore conflicting encoding
declarations in an XML document itself. I'd choose either one of:
* bail with a clear error when unicode is supplied at all
* bail with a clear error when unicode is supplied with any explicit
encoding declaration in the XML.
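The second option could look something like the sketch below. It is not taken from lxml, Twiddler or Zope: `reject_declared_encoding` is a hypothetical helper, the regular expression is a simplification of the XML declaration grammar, and the code is written for Python 3, where `str` corresponds to the Python 2 `unicode` type.

```python
import re

# Simplified match for an XML declaration with an encoding pseudo-attribute.
_ENCODING_DECL = re.compile(
    r"""^<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def reject_declared_encoding(source):
    """Hypothetical helper: refuse text input that declares its own encoding."""
    if isinstance(source, str):  # Python 3 str == Python 2 unicode
        match = _ENCODING_DECL.match(source)
        if match:
            raise ValueError(
                "unicode input must not carry an encoding declaration "
                "(found %r)" % match.group(1)
            )
    # Byte strings and undeclared unicode pass through unchanged.
    return source
```

The first option is the same check without the regular expression: raise as soon as `isinstance(source, str)` is true.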
It is of importance as soon as you
convert the document back to a stream e.g. when we deliver the content
back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with
that by changing the encoding parameter of the preamble for XML documents
based on the desired output encoding. utf-8 is always a good choice; however,
other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
publisher "avoids" this problem by converting the unicode result using
errors='replace' (which is likely something we might discuss :-))
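To make the trade-off concrete, here is a small sketch of what happens when a unicode result is encoded to a limited output encoding. It is not the ZPublisher code; `encode_for_output` is a hypothetical helper, written for Python 3. A strict encode raises `UnicodeEncodeError` for characters the target encoding lacks, while `errors='replace'` silently substitutes `?`.

```python
def encode_for_output(text, encoding):
    """Hypothetical helper: return (encoded bytes, True if data was lost)."""
    try:
        # Strict encoding fails on any character the encoding cannot express.
        return text.encode(encoding), False
    except UnicodeEncodeError:
        # Lossy fallback: unencodable characters become '?'.
        return text.encode(encoding, errors="replace"), True


# The euro sign exists in iso-8859-15, a CJK character does not.
print(encode_for_output("10 \u20ac", "iso-8859-15"))
print(encode_for_output("\u4e2d", "iso-8859-15"))
print(encode_for_output("\u4e2d", "utf-8"))
```

Returning the flag at least lets the caller log or reject the lossy case instead of hiding it.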
Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump. The API requires that you pass it strings encoded as UTF-8.
You can in lxml. :) libxml2 as a C API doesn't even support any unicode
string type as far as I am aware.
Regards,
Martijn