Ben Spiller <ben.spil...@softwareag.com> added the comment:

To help anyone else struggling with this bug, based on 
https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/
 the best workaround I've currently found is to define this:

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
        return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', 
replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates 
XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every python developer to 
copy+paste this kind of thing into their own project to avoid buggy XML 
generation, so would be better to have the escape_xml_illegal_chars function in 
the python standard library (maybe alongside xml.sax.utils.escape - which 
notably does _not_ escape all the unicode characters that aren't valid XML), 
and built-in support for this as part of document.toxml. 

I guess we'd want it to be user-configurable for any users who are prepared to 
tolerate the possibility unparseable XML documents will be generated in return 
for improved performance for the common case where these characters are not 
present, not not having the capability at all just means most python 
applications that do XML generate with special-casing this have a bug. I 
suggest we definitely need some clear warnings about this in the doc.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue5166>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to