Adam Funk, 25.11.2011 14:50:
I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules.  I get the input this way:

input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')

It doesn't make sense to use codecs.open() with a "b" mode: the whole point of codecs.open() is to decode bytes into text, so asking for binary mode contradicts that.


big_json = json.load(input_source)

You shouldn't decode the input before passing it into json.load(); just open the file in binary mode. Serialised JSON is defined as being UTF-8 encoded (or BOM-prefixed), not decoded Unicode.


input_source.close()

If json.load() raises an exception, the file will never be closed. All in all, use this instead:

    with open(input_file, 'rb') as f:
        big_json = json.load(f)


Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary)

If the name "big_json" is supposed to hint at a large set of data, you may want to use something other than minidom. Take a look at the xml.etree.cElementTree module instead, which is substantially more memory efficient.


and I save the document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()

Same mistakes as above. In particular, the double encoding (codecs.open() encodes the output once, and then writexml() is told to encode it again) is both unnecessary and likely to fail. This is also most likely the source of your problems.
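The fix is to encode exactly once. toxml() with an encoding argument returns an already-encoded byte string, which you write to a plain binary-mode file (the output path below is just a placeholder):

```python
import xml.dom.minidom as minidom

# Build a trivial document with a non-ASCII character in it.
doc = minidom.Document()
root = doc.createElement('root')
root.appendChild(doc.createTextNode(u'caf\u00e9'))
doc.appendChild(root)

# Encode exactly once: toxml() with an encoding returns bytes,
# which go straight into a binary-mode file -- no codecs wrapper.
xml_bytes = doc.toxml(encoding='UTF-8')
with open('output.xml', 'wb') as f:  # placeholder path
    f.write(xml_bytes)
```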


I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

PCDATA invalid Char value 7
PCDATA invalid Char value 31

This strongly hints at a broken encoding, which can easily be triggered by your erroneous encode-and-encode cycles above.
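As a minimal illustration of the kind of corruption a stray re-encode produces (Latin-1 here stands in for whatever wrong codec gets applied along the way; on Python 2 the same mistake often shows up as an implicit ASCII decode error instead):

```python
# The UTF-8 bytes for u'\u00e9' ('e' with acute accent) ...
original = u'\u00e9'.encode('utf-8')
# ... misread as Latin-1 and encoded again become mojibake:
mojibake = original.decode('latin-1').encode('utf-8')
```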

Also, the kind of problem you present here makes it pretty clear that you are using Python 2.x. In Python 3, you'd get the appropriate exceptions when trying to write binary data to a Unicode file.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
