On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:

> I'm converting JSON data to XML using the standard library's json and
> xml.dom.minidom modules. I get the input this way:
>
> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>                            errors='replace')
> big_json = json.load(input_source)
> input_source.close()
>
> Then I recurse through the contents of big_json to build an instance of
> xml.dom.minidom.Document (the recursion includes some code to rewrite
> dict keys as valid element names if necessary),

How are you doing that? What do you consider valid?

> and I save the document:
>
> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>                        errors='replace')
> doc.writexml(xml_file, encoding='UTF-8')
> xml_file.close()
>
>
> I thought this would force all the output to be valid, but xmlstarlet
> gives some errors like these on a few documents:

It will force the output to be valid UTF-8 encoded to bytes, not
necessarily valid XML.

> PCDATA invalid Char value 7
> PCDATA invalid Char value 31

What's xmlstarlet, and at what point does it give this error? It doesn't
appear to be in the standard library.

> I guess I need to process each piece of PCDATA to clean out the control
> characters before creating the text node:
>
> text = doc.createTextNode(j)
> root.appendChild(text)
>
> What's the best way to do that, bearing in mind that there can be
> multibyte characters in the strings?

Are you mixing unicode and byte strings?

Are you sure that the input source is actually UTF-8? If not, then all
bets are off: even if the decoding step works, and returns a string, it
may contain the wrong characters. This might explain why you are getting
unexpected control characters in the output: they've come from a badly
decoded input.

Another possibility is that your data actually does contain control
characters where there shouldn't be any.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
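
P.S. If it turns out the control characters really are in the data, one
option is to strip them from each string before it goes into
createTextNode(). What follows is only a minimal sketch, not a tested
fix for your program. The name strip_invalid_xml_chars is made up for
illustration, and I'm assuming the strings are already decoded unicode
text, so the substitution works on code points rather than bytes and
multibyte characters are left untouched. It removes the C0 control
characters that XML 1.0 disallows (everything below U+0020 except tab,
LF and CR, which covers your 7 and 31) plus U+FFFE and U+FFFF. It
deliberately leaves surrogates alone.

```python
import re
from xml.dom import minidom

# Characters XML 1.0 does not allow in character data: C0 controls other
# than tab (0x09), LF (0x0A) and CR (0x0D), plus U+FFFE and U+FFFF.
# Surrogates are intentionally not handled in this sketch.
_INVALID_XML_CHARS = re.compile(
    u'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def strip_invalid_xml_chars(text, replacement=u''):
    """Return text with characters that are invalid in XML 1.0 removed.

    Hypothetical helper: expects decoded (unicode) text, not bytes.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


# Usage: clean each string just before creating the text node.
doc = minidom.Document()
root = doc.createElement('root')
doc.appendChild(root)

j = u'bell\x07 and unit separator\x1f, caf\xe9'
text = doc.createTextNode(strip_invalid_xml_chars(j))
root.appendChild(text)

xml_text = doc.toxml()
```

Whether to drop the characters or replace them with something visible
(pass replacement=u'?', say) depends on what the data is for. Either
way, I'd still find out where they come from before hiding them.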