On 2011-11-28, Steven D'Aprano wrote:

> On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:
>
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules.  I get the input this way:
>>
>>     input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>>                                errors='replace')
>>     big_json = json.load(input_source)
>>     input_source.close()
>>
>> Then I recurse through the contents of big_json to build an instance
>> of xml.dom.minidom.Document (the recursion includes some code to
>> rewrite dict keys as valid element names if necessary),
>
> How are you doing that? What do you consider valid?
Regex-replacing all whitespace ('\s+') with '_', and adding 'a_' to the
beginning of any potential tag that doesn't start with a letter.  This
is good enough for my purposes.

>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>
> It will force the output to be valid UTF-8 encoded to bytes, not
> necessarily valid XML.

Yes!

>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> What's xmlstarlet, and at what point does it give this error? It
> doesn't appear to be in the standard library.

It's a command-line tool I use a lot for finding the bad bits in XML;
it has nothing to do with Python.

http://xmlstar.sourceforge.net/

>> I guess I need to process each piece of PCDATA to clean out the
>> control characters before creating the text node:
>>
>>     text = doc.createTextNode(j)
>>     root.appendChild(text)
>>
>> What's the best way to do that, bearing in mind that there can be
>> multibyte characters in the strings?
>
> Are you mixing unicode and byte strings?

I don't think I am.

> Are you sure that the input source is actually UTF-8? If not, then all
> bets are off: even if the decoding step works, and returns a string,
> it may contain the wrong characters. This might explain why you are
> getting unexpected control characters in the output: they've come from
> a badly decoded input.

I'm pretty sure that the input really is UTF-8, but it has a few
control characters (fairly rare).

> Another possibility is that your data actually does contain control
> characters where there shouldn't be any.

I think that's the problem, and I'm looking for an efficient way to
delete them from unicode strings.

-- 
Some say the world will end in fire; some say in segfaults.
                                                 [XKCD 312]
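For what it's worth, the key-rewriting I described comes down to
something like this (the function name is mine, just for illustration;
the thread doesn't show the actual code):

```python
import re

def sanitize_tag(key):
    """Rewrite a dict key into an element name that's good enough here:
    collapse runs of whitespace to '_', and prefix 'a_' if the result
    doesn't start with a letter."""
    tag = re.sub(r'\s+', '_', key)
    if not tag[:1].isalpha():
        tag = 'a_' + tag
    return tag
```

So e.g. 'foo bar' becomes 'foo_bar' and '1st item' becomes 'a_1st_item'.
Note this doesn't handle every character the XML Name production
forbids (punctuation, for instance), but as I said, it's good enough
for my data.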
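One way to do that stripping is a single compiled regex over the code
points XML 1.0 forbids in character data. A sketch (modern spelling;
the character class and names are mine, following the XML 1.0 Char
production, which allows only tab, newline, and carriage return among
the C0 controls):

```python
import re

# C0 controls except \t \n \r, plus DEL and the C1 range -- all
# invalid (or at least troublesome) in XML 1.0 character data.
_control_chars = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]')

def clean_pcdata(s):
    """Delete control characters that can't appear in XML 1.0 text."""
    return _control_chars.sub('', s)
```

Since the regex operates on the decoded string rather than the bytes,
multibyte characters aren't an issue: each code point is matched as a
whole character.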