New submission from Dan Callaghan:
Python 2.7.3 (default, Jul 24 2012, 10:05:38)
[GCC 4.7.0 20120507 (Red Hat 4.7.0-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> c = u'\u65e5\u672c\u8a9e'
>>> import xml.dom.minidom
Encoded as UTF-8, everything is fine:
>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="UTF-8" ?><x>%s</x>' % c.encode('UTF-8'))
<xml.dom.minidom.Document instance at 0x7f310d27dcf8>
but not ISO-2022-JP:
>>> xml.dom.minidom.parseString('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 48
lxml can handle it fine though:
>>> import lxml.etree
>>> lxml.etree.fromstring('<?xml version="1.0" encoding="ISO-2022-JP" ?><x>%s</x>' % c.encode('ISO-2022-JP'))
<Element x at 0x7f310d284960>
>>> _.text == c
True
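A possible workaround, offered only as a rough sketch and not as a fix for the underlying problem: if the encoding is already known to the caller, the bytes can be decoded with Python's own codec and handed to minidom as UTF-8. The helper name parse_transcoded is hypothetical, and the sketch assumes it is acceptable to drop the XML declaration (which also discards any version or standalone information it carried).

```python
import re
import xml.dom.minidom


def parse_transcoded(data, encoding):
    """Hypothetical helper: decode `data` with Python's codec, then parse as UTF-8."""
    text = data.decode(encoding)
    # Remove the XML declaration so its encoding attribute cannot
    # contradict the UTF-8 bytes passed to the parser below.
    text = re.sub(r'^\s*<\?xml[^>]*\?>', '', text, count=1)
    # Without a declaration, the parser assumes UTF-8.
    return xml.dom.minidom.parseString(text.encode('utf-8'))


c = u'\u65e5\u672c\u8a9e'
raw = (b'<?xml version="1.0" encoding="ISO-2022-JP" ?><x>'
       + c.encode('iso-2022-jp')
       + b'</x>')

doc = parse_transcoded(raw, 'iso-2022-jp')
text = doc.documentElement.firstChild.data
```

Here text should compare equal to c, matching what lxml returns above. This only sidesteps the parser's encoding handling; it does not help when the encoding has to be read from the declaration itself.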
----------
components: XML
messages: 169974
nosy: dcallagh
priority: normal
severity: normal
status: open
title: xml.dom.minidom cannot parse ISO-2022-JP
versions: Python 2.7
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue15877>
_______________________________________