Re: suppressing bad characters in output PCDATA (converting JSON to XML)
On 2011-11-29, Stefan Behnel wrote:

> Adam Funk, 29.11.2011 13:57:
>> On 2011-11-28, Stefan Behnel wrote:
>>> If the name "big_json" is supposed to hint at a large set of data,
>>> you may want to use something other than minidom. Take a look at
>>> the xml.etree.cElementTree module instead, which is substantially
>>> more memory efficient.
>>
>> Well, the input file in this case contains one big JSON list of
>> reasonably sized elements, each of which I'm turning into a separate
>> XML file. The output files range from 600 to 6000 bytes.
>
> It's also substantially easier to use, but if your XML writing code
> works already, why change it.

That module looks useful --- thanks for the tip. (TBH, I'm using
minidom mainly because I've used it before and the API is similar to
the DOM APIs I've used in other languages.)

> You should read up on Unicode a bit.

It wouldn't do me any harm. :-)

>>>> I thought this would force all the output to be valid, but
>>>> xmlstarlet gives some errors like these on a few documents:
>>>>
>>>> PCDATA invalid Char value 7
>>>> PCDATA invalid Char value 31
>>>
>>> This strongly hints at a broken encoding, which can easily be
>>> triggered by your erroneous encode-and-encode cycles above.
>>
>> No, I've checked the JSON input and those exact control characters
>> are there too.
>
> Ah, right, I didn't look closely enough. Those are forbidden in XML:
>
> http://www.w3.org/TR/REC-xml/#charsets
>
> It's sad that minidom (apparently) lets them pass through without
> even a warning.

Yes, it is! I've now found this, which seems to fix the problem:

http://bitkickers.blogspot.com/2011/05/stripping-control-characters-in-python.html

--
The internet is quite simply a glorious place. Where else can you
find bootlegged music and films, questionable women, deep seated
xenophobia and amusing cats all together in the same place?
[Tom Belshaw]
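A minimal sketch of that stripping approach (the pattern below is an
assumption based on the XML 1.0 character rules, not code taken from
the blog post):

    import re

    # XML 1.0 forbids code points below U+0020 except tab (U+0009),
    # newline (U+000A) and carriage return (U+000D).
    _XML_ILLEGAL = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')

    def strip_control_chars(text, replacement=u' '):
        # Replace each forbidden character before building text nodes.
        return _XML_ILLEGAL.sub(replacement, text)

Applied at node-creation time, this would look something like
doc.createTextNode(strip_control_chars(j)).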
Re: suppressing bad characters in output PCDATA (converting JSON to XML)
Adam Funk, 29.11.2011 13:57:
> On 2011-11-28, Stefan Behnel wrote:
>> Adam Funk, 25.11.2011 14:50:
>>> Then I recurse through the contents of big_json to build an
>>> instance of xml.dom.minidom.Document (the recursion includes some
>>> code to rewrite dict keys as valid element names if necessary)
>>
>> If the name "big_json" is supposed to hint at a large set of data,
>> you may want to use something other than minidom. Take a look at
>> the xml.etree.cElementTree module instead, which is substantially
>> more memory efficient.
>
> Well, the input file in this case contains one big JSON list of
> reasonably sized elements, each of which I'm turning into a separate
> XML file. The output files range from 600 to 6000 bytes.

It's also substantially easier to use, but if your XML writing code
works already, why change it.

>>> and I save the document:
>>>
>>>     xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>>>                            errors='replace')
>>>     doc.writexml(xml_file, encoding='UTF-8')
>>>     xml_file.close()
>>
>> Same mistakes as above. Especially the double encoding is both
>> unnecessary and likely to fail. This is also most likely the source
>> of your problems.
>
> Well actually, I had the problem with the occasional control
> characters in the output *before* I started sticking
> encoding="UTF-8" all over the place (in an unsuccessful attempt to
> beat them down).

You should read up on Unicode a bit.

>>> I thought this would force all the output to be valid, but
>>> xmlstarlet gives some errors like these on a few documents:
>>>
>>> PCDATA invalid Char value 7
>>> PCDATA invalid Char value 31
>>
>> This strongly hints at a broken encoding, which can easily be
>> triggered by your erroneous encode-and-encode cycles above.
>
> No, I've checked the JSON input and those exact control characters
> are there too.

Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without
even a warning.

> I want to suppress them (delete or replace with spaces).

Ok, then you need to process your string content while creating XML
from it. If replacing is enough, take a look at string.maketrans() in
the string module and str.translate(), a method on strings. Or maybe
just use a regular expression that matches any whitespace character
and replace it with a space. Or whatever suits your data best.

>> Also, the kind of problem you present here makes it pretty clear
>> that you are using Python 2.x. In Python 3, you'd get the
>> appropriate exceptions when trying to write binary data to a
>> Unicode file.
>
> Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

Yep, Py2 makes Unicode handling harder than it should be.

Stefan
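A note on the translate() suggestion: for Python 2 unicode strings,
translate() takes a mapping of code points rather than a
string.maketrans() table. A sketch of the replace-with-spaces variant
(which characters to keep is an assumption drawn from the thread, not
something Stefan specified):

    # unicode.translate() maps code points to replacements; mapping
    # to u' ' replaces the character, mapping to None would delete it.
    REPLACE_CONTROLS = dict(
        (c, u' ') for c in range(32) if c not in (9, 10, 13))

    def clean(text):
        return text.translate(REPLACE_CONTROLS)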
Re: suppressing bad characters in output PCDATA (converting JSON to XML)
On 2011-11-28, Steven D'Aprano wrote:

> On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:
>
>> I'm converting JSON data to XML using the standard library's json
>> and xml.dom.minidom modules. I get the input this way:
>>
>>     input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>>                                errors='replace')
>>     big_json = json.load(input_source)
>>     input_source.close()
>>
>> Then I recurse through the contents of big_json to build an
>> instance of xml.dom.minidom.Document (the recursion includes some
>> code to rewrite dict keys as valid element names if necessary),
>
> How are you doing that? What do you consider valid?

Regex-replacing all whitespace ('\s+') with '_', and adding 'a_' to
the beginning of any potential tag that doesn't start with a letter.
This is good enough for my purposes.

>> I thought this would force all the output to be valid, but
>> xmlstarlet gives some errors like these on a few documents:
>
> It will force the output to be valid UTF-8 encoded to bytes, not
> necessarily valid XML.

Yes!

>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> What's xmlstarlet, and at what point does it give this error? It
> doesn't appear to be in the standard library.

It's a command-line tool I use a lot for finding the bad bits in XML,
nothing to do with python.

http://xmlstar.sourceforge.net/

>> I guess I need to process each piece of PCDATA to clean out the
>> control characters before creating the text node:
>>
>>     text = doc.createTextNode(j)
>>     root.appendChild(text)
>>
>> What's the best way to do that, bearing in mind that there can be
>> multibyte characters in the strings?
>
> Are you mixing unicode and byte strings?

I don't think I am.

> Are you sure that the input source is actually UTF-8? If not, then
> all bets are off: even if the decoding step works, and returns a
> string, it may contain the wrong characters. This might explain why
> you are getting unexpected control characters in the output: they've
> come from a badly decoded input.

I'm pretty sure that the input is really UTF-8, but has a few control
characters (fairly rare).

> Another possibility is that your data actually does contain control
> characters where there shouldn't be any.

I think that's the problem, and I'm looking for an efficient way to
delete them from unicode strings.

--
Some say the world will end in fire; some say in segfaults.
[XKCD 312]
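A sketch of that key-rewriting rule as described (hypothetical helper
name; real XML Name rules are stricter than letters-and-underscores,
so this is only "good enough" in the sense above):

    import re

    def element_name(key):
        # Replace runs of whitespace with '_', then prefix anything
        # that doesn't start with a letter.
        name = re.sub(r'\s+', '_', key)
        if not name[:1].isalpha():
            name = 'a_' + name
        return name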
Re: suppressing bad characters in output PCDATA (converting JSON to XML)
On 2011-11-28, Stefan Behnel wrote:

> Adam Funk, 25.11.2011 14:50:
>> I'm converting JSON data to XML using the standard library's json
>> and xml.dom.minidom modules. I get the input this way:
>>
>>     input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>>                                errors='replace')
>
> It doesn't make sense to use codecs.open() with a "b" mode.

OK, thanks.

>> big_json = json.load(input_source)
>
> You shouldn't decode the input before passing it into json.load(),
> just open the file in binary mode. Serialised JSON is defined as
> being UTF-8 encoded (or BOM-prefixed), not decoded Unicode.

So just do

    input_source = open(input_file, 'rb')
    big_json = json.load(input_source)

?

>> input_source.close()
>
> In case of a failure, the file will not be closed safely. All in
> all, use this instead:
>
>     with open(input_file, 'rb') as f:
>         big_json = json.load(f)

OK, thanks.

>> Then I recurse through the contents of big_json to build an
>> instance of xml.dom.minidom.Document (the recursion includes some
>> code to rewrite dict keys as valid element names if necessary)
>
> If the name "big_json" is supposed to hint at a large set of data,
> you may want to use something other than minidom. Take a look at the
> xml.etree.cElementTree module instead, which is substantially more
> memory efficient.

Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file. The output files range from 600 to 6000 bytes.

>> and I save the document:
>>
>>     xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>>                            errors='replace')
>>     doc.writexml(xml_file, encoding='UTF-8')
>>     xml_file.close()
>
> Same mistakes as above. Especially the double encoding is both
> unnecessary and likely to fail. This is also most likely the source
> of your problems.

Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding="UTF-8"
all over the place (in an unsuccessful attempt to beat them down).

>> I thought this would force all the output to be valid, but
>> xmlstarlet gives some errors like these on a few documents:
>>
>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> This strongly hints at a broken encoding, which can easily be
> triggered by your erroneous encode-and-encode cycles above.

No, I've checked the JSON input and those exact control characters
are there too. I want to suppress them (delete or replace with
spaces).

> Also, the kind of problem you present here makes it pretty clear
> that you are using Python 2.x. In Python 3, you'd get the
> appropriate exceptions when trying to write binary data to a
> Unicode file.

Sorry, I forgot to mention the version I'm using, which is "2.7.2+".

--
In the 1970s, people began receiving utility bills for
-£999,999,996.32 and it became harder to sustain the myth of the
infallible electronic brain. (Stob 2001)
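Putting those corrections together, one plausible shape for the whole
cycle (a sketch, not something confirmed in the thread; it uses
toxml() rather than writexml() so that exactly one encoding step
happens, on the way into a binary file):

    import json

    with open(input_file, 'rb') as f:
        big_json = json.load(f)

    # ... recurse over big_json to build doc, a minidom Document ...

    with open(output_fullpath, 'wb') as xml_file:
        # toxml(encoding=...) returns UTF-8 bytes, so the file is
        # opened in binary mode and nothing gets encoded twice.
        xml_file.write(doc.toxml(encoding='UTF-8'))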
Re: suppressing bad characters in output PCDATA (converting JSON to XML)
Adam Funk, 25.11.2011 14:50:
> I'm converting JSON data to XML using the standard library's json
> and xml.dom.minidom modules. I get the input this way:
>
>     input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>                                errors='replace')

It doesn't make sense to use codecs.open() with a "b" mode.

> big_json = json.load(input_source)

You shouldn't decode the input before passing it into json.load(),
just open the file in binary mode. Serialised JSON is defined as
being UTF-8 encoded (or BOM-prefixed), not decoded Unicode.

> input_source.close()

In case of a failure, the file will not be closed safely. All in all,
use this instead:

    with open(input_file, 'rb') as f:
        big_json = json.load(f)

> Then I recurse through the contents of big_json to build an instance
> of xml.dom.minidom.Document (the recursion includes some code to
> rewrite dict keys as valid element names if necessary)

If the name "big_json" is supposed to hint at a large set of data,
you may want to use something other than minidom. Take a look at the
xml.etree.cElementTree module instead, which is substantially more
memory efficient.

> and I save the document:
>
>     xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>                            errors='replace')
>     doc.writexml(xml_file, encoding='UTF-8')
>     xml_file.close()

Same mistakes as above. Especially the double encoding is both
unnecessary and likely to fail. This is also most likely the source
of your problems.

> I thought this would force all the output to be valid, but
> xmlstarlet gives some errors like these on a few documents:
>
> PCDATA invalid Char value 7
> PCDATA invalid Char value 31

This strongly hints at a broken encoding, which can easily be
triggered by your erroneous encode-and-encode cycles above.

Also, the kind of problem you present here makes it pretty clear that
you are using Python 2.x. In Python 3, you'd get the appropriate
exceptions when trying to write binary data to a Unicode file.

Stefan
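For illustration, the Python 3 behaviour Stefan alludes to (a sketch;
the exact error wording varies by version):

    # Python 3: a text-mode file only accepts str, so writing
    # already-encoded bytes fails loudly instead of silently
    # producing double-encoded output.
    with open('out.xml', 'w', encoding='UTF-8') as f:
        f.write('<root/>'.encode('UTF-8'))
        # -> TypeError: write() argument must be str, not bytes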
Re: suppressing bad characters in output PCDATA (converting JSON to XML)
On Fri, 25 Nov 2011 13:50:01 +0000, Adam Funk wrote:

> I'm converting JSON data to XML using the standard library's json
> and xml.dom.minidom modules. I get the input this way:
>
>     input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>                                errors='replace')
>     big_json = json.load(input_source)
>     input_source.close()
>
> Then I recurse through the contents of big_json to build an instance
> of xml.dom.minidom.Document (the recursion includes some code to
> rewrite dict keys as valid element names if necessary),

How are you doing that? What do you consider valid?

> and I save the document:
>
>     xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
>                            errors='replace')
>     doc.writexml(xml_file, encoding='UTF-8')
>     xml_file.close()
>
> I thought this would force all the output to be valid, but
> xmlstarlet gives some errors like these on a few documents:

It will force the output to be valid UTF-8 encoded to bytes, not
necessarily valid XML.

> PCDATA invalid Char value 7
> PCDATA invalid Char value 31

What's xmlstarlet, and at what point does it give this error? It
doesn't appear to be in the standard library.

> I guess I need to process each piece of PCDATA to clean out the
> control characters before creating the text node:
>
>     text = doc.createTextNode(j)
>     root.appendChild(text)
>
> What's the best way to do that, bearing in mind that there can be
> multibyte characters in the strings?

Are you mixing unicode and byte strings?

Are you sure that the input source is actually UTF-8? If not, then
all bets are off: even if the decoding step works, and returns a
string, it may contain the wrong characters. This might explain why
you are getting unexpected control characters in the output: they've
come from a badly decoded input.

Another possibility is that your data actually does contain control
characters where there shouldn't be any.

--
Steven
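One cheap way to test Steven's first hypothesis (a sketch, assuming
the input is decoded with errors='replace' as in the original code):
undecodable bytes come out as U+FFFD, so its presence flags input
that wasn't really UTF-8.

    with open(input_file, 'rb') as f:
        raw = f.read()
    decoded = raw.decode('UTF-8', 'replace')
    # U+FFFD is the REPLACEMENT CHARACTER substituted for bad bytes.
    if u'\ufffd' in decoded:
        print('input contains byte sequences that are not valid UTF-8')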
suppressing bad characters in output PCDATA (converting JSON to XML)
I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules. I get the input this way:

    input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
                               errors='replace')
    big_json = json.load(input_source)
    input_source.close()

Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary), and I save
the document:

    xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
                           errors='replace')
    doc.writexml(xml_file, encoding='UTF-8')
    xml_file.close()

I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

    PCDATA invalid Char value 7
    PCDATA invalid Char value 31

I guess I need to process each piece of PCDATA to clean out the
control characters before creating the text node:

    text = doc.createTextNode(j)
    root.appendChild(text)

What's the best way to do that, bearing in mind that there can be
multibyte characters in the strings? I found some suggestions on the
WWW involving filter with string.printable, which AFAICT isn't
unicode-friendly --- is there a unicode.printable or something like
that?

--
"Mrs CJ and I avoid clichés like the plague."
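There is no unicode.printable, but the unicodedata module offers a
unicode-aware equivalent. A sketch of filtering by character category
(keeping tab, newline and carriage return is an assumption about what
the output XML should retain):

    import unicodedata

    def remove_control_chars(text):
        # Drop anything in the Unicode Cc (control) category except
        # tab, newline and carriage return, which XML 1.0 allows.
        return u''.join(ch for ch in text
                        if ch in u'\t\n\r'
                        or unicodedata.category(ch) != 'Cc')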