I'm converting JSON data to XML using the standard library's json and xml.dom.minidom modules. I get the input this way:

    input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
    big_json = json.load(input_source)
    input_source.close()

Then I recurse through the contents of big_json to build an instance of xml.dom.minidom.Document (the recursion includes some code to rewrite dict keys as valid element names if necessary), and I save the document:

    xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
    doc.writexml(xml_file, encoding='UTF-8')
    xml_file.close()

I thought this would force all the output to be valid, but xmlstarlet gives some errors like these on a few documents:

    PCDATA invalid Char value 7
    PCDATA invalid Char value 31

I guess I need to process each piece of PCDATA to clean out the control characters before creating the text node:

    text = doc.createTextNode(j)
    root.appendChild(text)

What's the best way to do that, bearing in mind that there can be multibyte characters in the strings? I found some suggestions on the web involving filter with string.printable, which AFAICT isn't unicode-friendly. Is there a unicode.printable or something like that?
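
One possible approach, offered as a sketch rather than a definitive answer: instead of keeping only "printable" characters, remove just the characters that XML 1.0 forbids in character data. The errors quoted above (values 7 and 31) are C0 control characters, which the XML 1.0 Char production excludes apart from tab, line feed, and carriage return. The names `_INVALID_XML_CHARS` and `strip_invalid_xml_chars` below are hypothetical, and the pattern assumes the input is already a decoded unicode string, as json.load returns.

```python
import re

# Characters that fall outside the XML 1.0 "Char" production:
# C0 controls other than tab (0x09), LF (0x0A) and CR (0x0D),
# plus the noncharacters U+FFFE and U+FFFF.
# Assumption: surrogate code points are left out of this sketch.
_INVALID_XML_CHARS = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def strip_invalid_xml_chars(text, replacement=u''):
    """Return text with characters that XML 1.0 forbids replaced.

    Non-ASCII characters (accented letters, CJK, and so on) are untouched,
    because the pattern matches only the specific forbidden code points.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


# BEL (0x07) and US (0x1f) are dropped; the accented character survives.
cleaned = strip_invalid_xml_chars(u'bell\x07 and separator\x1f, clich\xe9s')
```

With a helper like this, the text node could be created as `doc.createTextNode(strip_invalid_xml_chars(j))`. Passing a visible `replacement` such as `u'\ufffd'` instead of the empty string is an alternative if silently dropping characters is undesirable.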