Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-12-02 Thread Adam Funk
On 2011-11-29, Stefan Behnel wrote:

 Adam Funk, 29.11.2011 13:57:
 On 2011-11-28, Stefan Behnel wrote:

 If the name big_json is supposed to hint at a large set of data, you may
 want to use something other than minidom. Take a look at the
 xml.etree.cElementTree module instead, which is substantially more memory
 efficient.

 Well, the input file in this case contains one big JSON list of
 reasonably sized elements, each of which I'm turning into a separate
 XML file.  The output files range from 600 to 6000 bytes.

 It's also substantially easier to use, but if your XML writing code works 
 already, why change it.

That module looks useful --- thanks for the tip.  (TBH, I'm using
minidom mainly because I've used it before and the API is similar to
the DOM APIs I've used in other languages.)


 You should read up on Unicode a bit.

It wouldn't do me any harm.  :-)


 I thought this would force all the output to be valid, but xmlstarlet
 gives some errors like these on a few documents:

 PCDATA invalid Char value 7
 PCDATA invalid Char value 31

 This strongly hints at a broken encoding, which can easily be triggered by
 your erroneous encode-and-encode cycles above.

 No, I've checked the JSON input and those exact control characters are
 there too.

 Ah, right, I didn't look closely enough. Those are forbidden in XML:

 http://www.w3.org/TR/REC-xml/#charsets

 It's sad that minidom (apparently) lets them pass through without even a 
 warning.

Yes, it is!  I've now found this, which seems to fix the problem:

http://bitkickers.blogspot.com/2011/05/stripping-control-characters-in-python.html
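
In case that link rots: the fix boils down to a regex substitution
over each unicode string before it goes into a text node.  A rough
sketch of the idea (my own transcription of the XML 1.0 Char rule,
not copied verbatim from that post; xml_safe is just an illustrative
name):

  import re

  # Code points forbidden outright by the XML 1.0 Char production:
  # everything below U+0020 except tab, LF and CR, plus the
  # surrogates and U+FFFE/U+FFFF.
  _INVALID_XML = re.compile(
      u'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')

  def xml_safe(text):
      # Replace forbidden characters with spaces (use u'' to delete).
      return _INVALID_XML.sub(u' ', text)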


-- 
The internet is quite simply a glorious place. Where else can you find
bootlegged music and films, questionable women, deep seated xenophobia
and amusing cats all together in the same place? [Tom Belshaw]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-29 Thread Adam Funk
On 2011-11-28, Stefan Behnel wrote:

 Adam Funk, 25.11.2011 14:50:
 I'm converting JSON data to XML using the standard library's json and
 xml.dom.minidom modules.  I get the input this way:

 input_source = codecs.open(input_file, 'rb', encoding='UTF-8', 
 errors='replace')

 It doesn't make sense to use codecs.open() with a b mode.

OK, thanks.

 big_json = json.load(input_source)

 You shouldn't decode the input before passing it into json.load(), just 
 open the file in binary mode. Serialised JSON is defined as being UTF-8 
 encoded (or BOM-prefixed), not decoded Unicode.

So just do
  input_source = open(input_file, 'rb')
  big_json = json.load(input_source)
?

 input_source.close()

 In case of a failure, the file will not be closed safely. All in all, use 
 this instead:

  with open(input_file, 'rb') as f:
      big_json = json.load(f)

OK, thanks.

 Then I recurse through the contents of big_json to build an instance
 of xml.dom.minidom.Document (the recursion includes some code to
 rewrite dict keys as valid element names if necessary)

 If the name big_json is supposed to hint at a large set of data, you may 
 want to use something other than minidom. Take a look at the 
 xml.etree.cElementTree module instead, which is substantially more memory 
 efficient.

Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file.  The output files range from 600 to 6000 bytes.


 and I save the document:

 xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', 
 errors='replace')
 doc.writexml(xml_file, encoding='UTF-8')
 xml_file.close()

 Same mistakes as above. Especially the double encoding is both unnecessary 
 and likely to fail. This is also most likely the source of your problems.

Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding=UTF-8
all over the place (in an unsuccessful attempt to beat them down).


 I thought this would force all the output to be valid, but xmlstarlet
 gives some errors like these on a few documents:

 PCDATA invalid Char value 7
 PCDATA invalid Char value 31

 This strongly hints at a broken encoding, which can easily be triggered by 
 your erroneous encode-and-encode cycles above.

No, I've checked the JSON input and those exact control characters are
there too.  I want to suppress them (delete or replace with spaces).

 Also, the kind of problem you present here makes it pretty clear that you 
 are using Python 2.x. In Python 3, you'd get the appropriate exceptions 
 when trying to write binary data to a Unicode file.

Sorry, I forgot to mention the version I'm using, which is 2.7.2+.


-- 
In the 1970s, people began receiving utility bills for
-£999,999,996.32 and it became harder to sustain the 
myth of the infallible electronic brain.  (Stob 2001)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-29 Thread Adam Funk
On 2011-11-28, Steven D'Aprano wrote:

 On Fri, 25 Nov 2011 13:50:01 +, Adam Funk wrote:

 I'm converting JSON data to XML using the standard library's json and
 xml.dom.minidom modules.  I get the input this way:
 
 input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
 errors='replace')
 big_json = json.load(input_source)
 input_source.close()
 
 Then I recurse through the contents of big_json to build an instance of
 xml.dom.minidom.Document (the recursion includes some code to rewrite
 dict keys as valid element names if necessary), 

 How are you doing that? What do you consider valid?

Regex-replacing all whitespace ('\s+') with '_', and adding 'a_' to
the beginning of any potential tag that doesn't start with a letter.
This is good enough for my purposes.
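
Roughly like this, if it helps (a simplified sketch rather than the
exact code; fix_tag is just an illustrative name):

  import re

  def fix_tag(key):
      # Collapse each run of whitespace to a single underscore.
      tag = re.sub(r'\s+', '_', key.strip())
      # Prefix anything that doesn't start with a letter.
      if not tag[:1].isalpha():
          tag = 'a_' + tag
      return tag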

 I thought this would force all the output to be valid, but xmlstarlet
 gives some errors like these on a few documents:

 It will force the output to be valid UTF-8 encoded to bytes, not 
 necessarily valid XML.

Yes!

 PCDATA invalid Char value 7
 PCDATA invalid Char value 31

 What's xmlstarlet, and at what point does it give this error? It doesn't 
 appear to be in the standard library.

It's a command-line tool I use a lot for finding the bad bits in XML,
nothing to do with Python.

http://xmlstar.sourceforge.net/

 I guess I need to process each piece of PCDATA to clean out the control
 characters before creating the text node:
 
   text = doc.createTextNode(j)
   root.appendChild(text)
 
 What's the best way to do that, bearing in mind that there can be
 multibyte characters in the strings?

 Are you mixing unicode and byte strings?

I don't think I am.

 Are you sure that the input source is actually UTF-8? If not, then all 
 bets are off: even if the decoding step works, and returns a string, it 
 may contain the wrong characters. This might explain why you are getting 
 unexpected control characters in the output: they've come from a badly 
 decoded input.

I'm pretty sure that the input is really UTF-8, but has a few control
characters (fairly rare).

 Another possibility is that your data actually does contain control 
 characters where there shouldn't be any.

I think that's the problem, and I'm looking for an efficient way to
delete them from unicode strings.


-- 
Some say the world will end in fire; some say in segfaults.
 [XKCD 312]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-29 Thread Stefan Behnel

 Adam Funk, 29.11.2011 13:57:

 On 2011-11-28, Stefan Behnel wrote:

 Adam Funk, 25.11.2011 14:50:

 Then I recurse through the contents of big_json to build an instance
 of xml.dom.minidom.Document (the recursion includes some code to
 rewrite dict keys as valid element names if necessary)


 If the name big_json is supposed to hint at a large set of data, you may
 want to use something other than minidom. Take a look at the
 xml.etree.cElementTree module instead, which is substantially more memory
 efficient.


 Well, the input file in this case contains one big JSON list of
 reasonably sized elements, each of which I'm turning into a separate
 XML file.  The output files range from 600 to 6000 bytes.


It's also substantially easier to use, but if your XML writing code works 
already, why change it.




 and I save the document:

 xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
 doc.writexml(xml_file, encoding='UTF-8')
 xml_file.close()


 Same mistakes as above. Especially the double encoding is both unnecessary
 and likely to fail. This is also most likely the source of your problems.


 Well actually, I had the problem with the occasional control
 characters in the output *before* I started sticking encoding=UTF-8
 all over the place (in an unsuccessful attempt to beat them down).


You should read up on Unicode a bit.



 I thought this would force all the output to be valid, but xmlstarlet
 gives some errors like these on a few documents:

 PCDATA invalid Char value 7
 PCDATA invalid Char value 31


 This strongly hints at a broken encoding, which can easily be triggered by
 your erroneous encode-and-encode cycles above.


 No, I've checked the JSON input and those exact control characters are
 there too.


Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a 
warning.




 I want to suppress them (delete or replace with spaces).


Ok, then you need to process your string content while creating XML from 
it. If replacing is enough, take a look at string.maketrans() in the string 
module and str.translate(), a method on strings. Or maybe just use a 
regular expression that matches any whitespace character and replace it 
with a space. Or whatever suits your data best.
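
For a unicode string in 2.x, the translate() route would look
something like this (a sketch; the table below only covers the C0
controls that XML forbids, and maps them to spaces rather than
deleting them):

  # unicode.translate() takes a dict keyed by code point; mapping a
  # code point to None deletes it instead of replacing it.
  FORBIDDEN = dict((c, u' ') for c in range(0x20)
                   if c not in (0x09, 0x0A, 0x0D))

  def clean(text):
      return text.translate(FORBIDDEN)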




 Also, the kind of problem you present here makes it pretty clear that you
 are using Python 2.x. In Python 3, you'd get the appropriate exceptions
 when trying to write binary data to a Unicode file.


 Sorry, I forgot to mention the version I'm using, which is 2.7.2+.


Yep, Py2 makes Unicode handling harder than it should be.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-28 Thread Steven D'Aprano
On Fri, 25 Nov 2011 13:50:01 +, Adam Funk wrote:

 I'm converting JSON data to XML using the standard library's json and
 xml.dom.minidom modules.  I get the input this way:
 
 input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
 errors='replace')
 big_json = json.load(input_source)
 input_source.close()
 
 Then I recurse through the contents of big_json to build an instance of
 xml.dom.minidom.Document (the recursion includes some code to rewrite
 dict keys as valid element names if necessary), 

How are you doing that? What do you consider valid?


 and I save the document:
 
 xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
 errors='replace')
 doc.writexml(xml_file, encoding='UTF-8')
 xml_file.close()
 
 
 I thought this would force all the output to be valid, but xmlstarlet
 gives some errors like these on a few documents:

It will force the output to be valid UTF-8 encoded to bytes, not 
necessarily valid XML.


 PCDATA invalid Char value 7
 PCDATA invalid Char value 31

What's xmlstarlet, and at what point does it give this error? It doesn't 
appear to be in the standard library.



 I guess I need to process each piece of PCDATA to clean out the control
 characters before creating the text node:
 
   text = doc.createTextNode(j)
   root.appendChild(text)
 
 What's the best way to do that, bearing in mind that there can be
 multibyte characters in the strings?

Are you mixing unicode and byte strings?

Are you sure that the input source is actually UTF-8? If not, then all 
bets are off: even if the decoding step works, and returns a string, it 
may contain the wrong characters. This might explain why you are getting 
unexpected control characters in the output: they've come from a badly 
decoded input.

Another possibility is that your data actually does contain control 
characters where there shouldn't be any.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-28 Thread Stefan Behnel

 Adam Funk, 25.11.2011 14:50:

 I'm converting JSON data to XML using the standard library's json and
 xml.dom.minidom modules.  I get the input this way:

 input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')


It doesn't make sense to use codecs.open() with a b mode.



 big_json = json.load(input_source)


You shouldn't decode the input before passing it into json.load(), just 
open the file in binary mode. Serialised JSON is defined as being UTF-8 
encoded (or BOM-prefixed), not decoded Unicode.




 input_source.close()


In case of a failure, the file will not be closed safely. All in all, use 
this instead:


with open(input_file, 'rb') as f:
    big_json = json.load(f)



 Then I recurse through the contents of big_json to build an instance
 of xml.dom.minidom.Document (the recursion includes some code to
 rewrite dict keys as valid element names if necessary)


If the name big_json is supposed to hint at a large set of data, you may 
want to use something other than minidom. Take a look at the 
xml.etree.cElementTree module instead, which is substantially more memory 
efficient.




 and I save the document:

 xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
 doc.writexml(xml_file, encoding='UTF-8')
 xml_file.close()


Same mistakes as above. Especially the double encoding is both unnecessary 
and likely to fail. This is also most likely the source of your problems.
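
A single-encoding version of the writing side could be as simple as
this (a sketch; toxml() with an encoding argument returns an
already-encoded byte string, so a plain binary file is enough):

  with open(output_fullpath, 'wb') as xml_file:
      # Encode exactly once, inside minidom, and write the bytes.
      xml_file.write(doc.toxml(encoding='UTF-8'))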




 I thought this would force all the output to be valid, but xmlstarlet
 gives some errors like these on a few documents:

 PCDATA invalid Char value 7
 PCDATA invalid Char value 31


This strongly hints at a broken encoding, which can easily be triggered by 
your erroneous encode-and-encode cycles above.


Also, the kind of problem you present here makes it pretty clear that you 
are using Python 2.x. In Python 3, you'd get the appropriate exceptions 
when trying to write binary data to a Unicode file.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-25 Thread Adam Funk
I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules.  I get the input this way:

input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
big_json = json.load(input_source)
input_source.close()

Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary), and I save the
document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()


I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

PCDATA invalid Char value 7
PCDATA invalid Char value 31

I guess I need to process each piece of PCDATA to clean out the
control characters before creating the text node:

  text = doc.createTextNode(j)
  root.appendChild(text)

What's the best way to do that, bearing in mind that there can be
multibyte characters in the strings?  I found some suggestions on the
WWW involving filter with string.printable, which AFAICT isn't
unicode-friendly --- is there a unicode.printable or something like
that?
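
The closest thing I've been able to think of is filtering on
unicodedata.category(), something like this (just a guess at an
approach, and I have no idea how efficient it is):

  import unicodedata

  def drop_control_chars(text):
      # 'Cc' is the category of control characters; keep tab, LF and
      # CR, which XML allows.
      return u''.join(ch for ch in text
                      if ch in u'\t\n\r'
                      or unicodedata.category(ch) != 'Cc')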


-- 
Mrs CJ and I avoid clichés like the plague.
-- 
http://mail.python.org/mailman/listinfo/python-list