On 2011-11-28, Stefan Behnel wrote:

> Adam Funk, 25.11.2011 14:50:
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules.  I get the input this way:
>>
>> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>> errors='replace')
>
> It doesn't make sense to use codecs.open() with a "b" mode.
OK, thanks.

>> big_json = json.load(input_source)
>
> You shouldn't decode the input before passing it into json.load(), just
> open the file in binary mode.  Serialised JSON is defined as being UTF-8
> encoded (or BOM-prefixed), not decoded Unicode.

So just do

    input_source = open(input_file, 'rb')
    big_json = json.load(input_source)

?

>> input_source.close()
>
> In case of a failure, the file will not be closed safely.  All in all,
> use this instead:
>
>     with open(input_file, 'rb') as f:
>         big_json = json.load(f)

OK, thanks.

>> Then I recurse through the contents of big_json to build an instance
>> of xml.dom.minidom.Document (the recursion includes some code to
>> rewrite dict keys as valid element names if necessary)
>
> If the name "big_json" is supposed to hint at a large set of data, you
> may want to use something other than minidom.  Take a look at the
> xml.etree.cElementTree module instead, which is substantially more
> memory efficient.

Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file.  The output files range from 600 to 6000 bytes.

>> and I save the document:
>>
>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>> errors='replace')
>> doc.writexml(xml_file, encoding='UTF-8')
>> xml_file.close()
>
> Same mistakes as above.  Especially the double encoding is both
> unnecessary and likely to fail.  This is also most likely the source of
> your problems.

Well actually, I had the problem with the occasional control characters
in the output *before* I started sticking encoding="UTF-8" all over the
place (in an unsuccessful attempt to beat them down).
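Putting Stefan's two corrections together, a minimal sketch of the load-and-save cycle with exactly one encode on each side might look like this (file names and the trivial one-element document are my own placeholders, not code from the thread; `io.open` stands in for `codecs.open` and works the same way on 2.7):

```python
import io
import json
import os
import tempfile
from xml.dom.minidom import Document

# Placeholder paths so the sketch is self-contained.
input_file = os.path.join(tempfile.gettempdir(), 'sample.json')
output_fullpath = os.path.join(tempfile.gettempdir(), 'sample.xml')

# Create a small JSON input to stand in for the real data.
with open(input_file, 'wb') as f:
    f.write(b'[{"title": "first"}, {"title": "second"}]')

# Read the serialised JSON as raw bytes: json.load() handles the
# UTF-8 decoding itself, so no codecs.open() wrapper is needed.
with open(input_file, 'rb') as f:
    big_json = json.load(f)

# Build a trivial document from the first list element.
doc = Document()
root = doc.createElement('item')
root.appendChild(doc.createTextNode(big_json[0]['title']))
doc.appendChild(root)

# Encode exactly once on output: the file object does the UTF-8
# encoding, and the encoding= argument only controls the XML
# declaration that writexml() emits.
with io.open(output_fullpath, 'w', encoding='UTF-8') as xml_file:
    doc.writexml(xml_file, encoding='UTF-8')
```

The key point is that decoding and encoding each happen in one place only, so there is no double-encode step left to corrupt the output.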
>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>>
>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> This strongly hints at a broken encoding, which can easily be triggered
> by your erroneous encode-and-encode cycles above.

No, I've checked the JSON input and those exact control characters are
there too.  I want to suppress them (delete or replace with spaces).

> Also, the kind of problem you present here makes it pretty clear that
> you are using Python 2.x.  In Python 3, you'd get the appropriate
> exceptions when trying to write binary data to a Unicode file.

Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

-- 
In the 1970s, people began receiving utility bills for
-£999,999,996.32 and it became harder to sustain the myth of the
infallible electronic brain.  (Stob 2001)
-- 
http://mail.python.org/mailman/listinfo/python-list
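Since the offending control characters are already present in the JSON input, the cleanest fix is to scrub each string before it goes into the DOM.  A minimal sketch (the helper function and its name are my own, not from the thread) that replaces the C0 controls forbidden in XML 1.0 PCDATA, i.e. everything below 0x20 except tab, newline and carriage return, plus DEL:

```python
import re

# XML 1.0 forbids these characters in character data: the C0
# controls other than \t, \n and \r, plus DEL (0x7f).
_INVALID_XML_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')

def suppress_control_chars(text, replacement=u' '):
    """Replace XML-invalid control characters with `replacement`
    (use u'' to delete them outright)."""
    return _INVALID_XML_CHARS.sub(replacement, text)

# The two values xmlstarlet complained about: 7 = BEL, 31 = US.
dirty = u'alarm\x07 and unit separator\x1f here'
print(suppress_control_chars(dirty))
```

Applying this to every string value during the recursion over `big_json` would make the complaints about char values 7 and 31 go away regardless of how the files are opened.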