Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-12-02 Thread Adam Funk
On 2011-11-29, Stefan Behnel wrote:

> Adam Funk, 29.11.2011 13:57:
>> On 2011-11-28, Stefan Behnel wrote:

>>> If the name "big_json" is supposed to hint at a large set of data, you may
>>> want to use something other than minidom. Take a look at the
>>> xml.etree.cElementTree module instead, which is substantially more memory
>>> efficient.
>>
>> Well, the input file in this case contains one big JSON list of
>> reasonably sized elements, each of which I'm turning into a separate
>> XML file.  The output files range from 600 to 6000 bytes.
>
> It's also substantially easier to use, but if your XML writing code works 
> already, why change it.

That module looks useful --- thanks for the tip.  (TBH, I'm using
minidom mainly because I've used it before and the API is similar to
the DOM APIs I've used in other languages.)
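
For anyone who finds this in the archive later: the whole ElementTree
version of the write step is only a few lines --- an untested sketch,
with output_fullpath standing in for the target path:

  import xml.etree.cElementTree as ET

  root = ET.Element('doc')
  ET.SubElement(root, 'item').text = u'some text'
  # write() does the encoding itself, so no codecs wrapper is needed
  ET.ElementTree(root).write(output_fullpath, encoding='UTF-8')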


> You should read up on Unicode a bit.

It wouldn't do me any harm.  :-)


>>>> I thought this would force all the output to be valid, but xmlstarlet
>>>> gives some errors like these on a few documents:
>>>>
>>>> PCDATA invalid Char value 7
>>>> PCDATA invalid Char value 31
>>>
>>> This strongly hints at a broken encoding, which can easily be triggered by
>>> your erroneous encode-and-encode cycles above.
>>
>> No, I've checked the JSON input and those exact control characters are
>> there too.
>
> Ah, right, I didn't look closely enough. Those are forbidden in XML:
>
> http://www.w3.org/TR/REC-xml/#charsets
>
> It's sad that minidom (apparently) lets them pass through without even a 
> warning.

Yes, it is!  I've now found this, which seems to fix the problem:

http://bitkickers.blogspot.com/2011/05/stripping-control-characters-in-python.html
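
The gist of it (a sketch from memory, not the exact code in that
post) is a precompiled regex covering the character ranges XML
forbids:

  import re

  # XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and the
  # supplementary planes; this matches the BMP characters outside those
  ILLEGAL_XML = re.compile(
      u'[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')

  def clean_pcdata(text):
      return ILLEGAL_XML.sub(u' ', text)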


-- 
The internet is quite simply a glorious place. Where else can you find
bootlegged music and films, questionable women, deep seated xenophobia
and amusing cats all together in the same place? [Tom Belshaw]


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-29 Thread Stefan Behnel

Adam Funk, 29.11.2011 13:57:

> On 2011-11-28, Stefan Behnel wrote:
>
>> Adam Funk, 25.11.2011 14:50:
>>
>>> Then I recurse through the contents of big_json to build an instance
>>> of xml.dom.minidom.Document (the recursion includes some code to
>>> rewrite dict keys as valid element names if necessary)
>>
>> If the name "big_json" is supposed to hint at a large set of data, you may
>> want to use something other than minidom. Take a look at the
>> xml.etree.cElementTree module instead, which is substantially more memory
>> efficient.
>
> Well, the input file in this case contains one big JSON list of
> reasonably sized elements, each of which I'm turning into a separate
> XML file.  The output files range from 600 to 6000 bytes.


It's also substantially easier to use, but if your XML writing code works
already, why change it.


>>> and I save the document:
>>>
>>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
>>> doc.writexml(xml_file, encoding='UTF-8')
>>> xml_file.close()


>> Same mistakes as above. Especially the double encoding is both unnecessary
>> and likely to fail. This is also most likely the source of your problems.
>
> Well actually, I had the problem with the occasional control
> characters in the output *before* I started sticking encoding="UTF-8"
> all over the place (in an unsuccessful attempt to beat them down).


You should read up on Unicode a bit.


>>> I thought this would force all the output to be valid, but xmlstarlet
>>> gives some errors like these on a few documents:
>>>
>>> PCDATA invalid Char value 7
>>> PCDATA invalid Char value 31
>>
>> This strongly hints at a broken encoding, which can easily be triggered by
>> your erroneous encode-and-encode cycles above.
>
> No, I've checked the JSON input and those exact control characters are
> there too.


Ah, right, I didn't look closely enough. Those are forbidden in XML:

http://www.w3.org/TR/REC-xml/#charsets

It's sad that minidom (apparently) lets them pass through without even a
warning.

> I want to suppress them (delete or replace with spaces).


Ok, then you need to process your string content while creating XML from 
it. If replacing is enough, take a look at string.maketrans() in the string 
module and str.translate(), a method on strings. Or maybe just use a 
regular expression that matches any whitespace character and replace it 
with a space. Or whatever suits your data best.
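
One wrinkle: for unicode strings it's unicode.translate() you want,
which takes a mapping of code points instead of a maketrans() table.
An untested sketch, with "text" standing in for whatever you're about
to pass to createTextNode():

  # map the C0 controls (minus tab, newline, CR) to None, i.e. delete them
  CONTROL_CHARS = dict.fromkeys(
      c for c in range(0x20) if c not in (0x09, 0x0a, 0x0d))

  cleaned = text.translate(CONTROL_CHARS)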




>> Also, the kind of problem you present here makes it pretty clear that you
>> are using Python 2.x. In Python 3, you'd get the appropriate exceptions
>> when trying to write binary data to a Unicode file.
>
> Sorry, I forgot to mention the version I'm using, which is "2.7.2+".


Yep, Py2 makes Unicode handling harder than it should be.

Stefan



Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-29 Thread Adam Funk
On 2011-11-28, Steven D'Aprano wrote:

> On Fri, 25 Nov 2011 13:50:01 +, Adam Funk wrote:
>
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules.  I get the input this way:
>> 
>> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
>> errors='replace') big_json = json.load(input_source)
>> input_source.close()
>> 
>> Then I recurse through the contents of big_json to build an instance of
>> xml.dom.minidom.Document (the recursion includes some code to rewrite
>> dict keys as valid element names if necessary), 
>
> How are you doing that? What do you consider valid?

Regex-replacing all whitespace ('\s+') with '_', and adding 'a_' to
the beginning of any potential tag that doesn't start with a letter.
This is good enough for my purposes.
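
In other words, roughly this (a simplified sketch, not the actual
code):

  import re

  def element_name(key):
      name = re.sub(r'\s+', '_', key)
      if not name[:1].isalpha():   # also catches the empty string
          name = 'a_' + name
      return name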

>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>
> It will force the output to be valid UTF-8 encoded to bytes, not 
> necessarily valid XML.

Yes!

>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> What's xmlstarlet, and at what point does it give this error? It doesn't 
> appear to be in the standard library.

It's a command-line tool I use a lot for finding the bad bits in XML,
nothing to do with Python.

http://xmlstar.sourceforge.net/

>> I guess I need to process each piece of PCDATA to clean out the control
>> characters before creating the text node:
>> 
>>   text = doc.createTextNode(j)
>>   root.appendChild(text)
>> 
>> What's the best way to do that, bearing in mind that there can be
>> multibyte characters in the strings?
>
> Are you mixing unicode and byte strings?

I don't think I am.

> Are you sure that the input source is actually UTF-8? If not, then all 
> bets are off: even if the decoding step works, and returns a string, it 
> may contain the wrong characters. This might explain why you are getting 
> unexpected control characters in the output: they've come from a badly 
> decoded input.

I'm pretty sure that the input is really UTF-8, but has a few control
characters (fairly rare).

> Another possibility is that your data actually does contain control 
> characters where there shouldn't be any.

I think that's the problem, and I'm looking for an efficient way to
delete them from unicode strings.


-- 
Some say the world will end in fire; some say in segfaults.
 [XKCD 312]


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-29 Thread Adam Funk
On 2011-11-28, Stefan Behnel wrote:

> Adam Funk, 25.11.2011 14:50:
>> I'm converting JSON data to XML using the standard library's json and
>> xml.dom.minidom modules.  I get the input this way:
>>
>> input_source = codecs.open(input_file, 'rb', encoding='UTF-8', 
>> errors='replace')
>
> It doesn't make sense to use codecs.open() with a "b" mode.

OK, thanks.

>> big_json = json.load(input_source)
>
> You shouldn't decode the input before passing it into json.load(), just 
> open the file in binary mode. Serialised JSON is defined as being UTF-8 
> encoded (or BOM-prefixed), not decoded Unicode.

So just do
  input_source = open(input_file, 'rb')
  big_json = json.load(input_source)
?

>> input_source.close()
>
> In case of a failure, the file will not be closed safely. All in all, use 
> this instead:
>
>  with open(input_file, 'rb') as f:
>  big_json = json.load(f)

OK, thanks.

>> Then I recurse through the contents of big_json to build an instance
>> of xml.dom.minidom.Document (the recursion includes some code to
>> rewrite dict keys as valid element names if necessary)
>
> If the name "big_json" is supposed to hint at a large set of data, you may 
> want to use something other than minidom. Take a look at the 
> xml.etree.cElementTree module instead, which is substantially more memory 
> efficient.

Well, the input file in this case contains one big JSON list of
reasonably sized elements, each of which I'm turning into a separate
XML file.  The output files range from 600 to 6000 bytes.


>> and I save the document:
>>
>> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', 
>> errors='replace')
>> doc.writexml(xml_file, encoding='UTF-8')
>> xml_file.close()
>
> Same mistakes as above. Especially the double encoding is both unnecessary 
> and likely to fail. This is also most likely the source of your problems.

Well actually, I had the problem with the occasional control
characters in the output *before* I started sticking encoding="UTF-8"
all over the place (in an unsuccessful attempt to beat them down).


>> I thought this would force all the output to be valid, but xmlstarlet
>> gives some errors like these on a few documents:
>>
>> PCDATA invalid Char value 7
>> PCDATA invalid Char value 31
>
> This strongly hints at a broken encoding, which can easily be triggered by 
> your erroneous encode-and-encode cycles above.

No, I've checked the JSON input and those exact control characters are
there too.  I want to suppress them (delete or replace with spaces).

> Also, the kind of problem you present here makes it pretty clear that you 
> are using Python 2.x. In Python 3, you'd get the appropriate exceptions 
> when trying to write binary data to a Unicode file.

Sorry, I forgot to mention the version I'm using, which is "2.7.2+".


-- 
In the 1970s, people began receiving utility bills for
-£999,999,996.32 and it became harder to sustain the 
myth of the infallible electronic brain.  (Stob 2001)


Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-28 Thread Stefan Behnel

Adam Funk, 25.11.2011 14:50:

> I'm converting JSON data to XML using the standard library's json and
> xml.dom.minidom modules.  I get the input this way:
>
> input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')


It doesn't make sense to use codecs.open() with a "b" mode.



> big_json = json.load(input_source)


You shouldn't decode the input before passing it into json.load(), just 
open the file in binary mode. Serialised JSON is defined as being UTF-8 
encoded (or BOM-prefixed), not decoded Unicode.




> input_source.close()


In case of a failure, the file will not be closed safely. All in all, use 
this instead:


with open(input_file, 'rb') as f:
big_json = json.load(f)



> Then I recurse through the contents of big_json to build an instance
> of xml.dom.minidom.Document (the recursion includes some code to
> rewrite dict keys as valid element names if necessary)


If the name "big_json" is supposed to hint at a large set of data, you may 
want to use something other than minidom. Take a look at the 
xml.etree.cElementTree module instead, which is substantially more memory 
efficient.




> and I save the document:
>
> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
> doc.writexml(xml_file, encoding='UTF-8')
> xml_file.close()


Same mistakes as above. Especially the double encoding is both unnecessary 
and likely to fail. This is also most likely the source of your problems.
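
Encoding once is enough; one way (an untested sketch) is to let
toxml() produce the UTF-8 bytes and hand them to a plain binary file:

  with open(output_fullpath, 'wb') as xml_file:
      # toxml(encoding=...) returns an already-encoded byte string,
      # so the file object must not encode it a second time
      xml_file.write(doc.toxml(encoding='UTF-8'))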




> I thought this would force all the output to be valid, but xmlstarlet
> gives some errors like these on a few documents:
>
> PCDATA invalid Char value 7
> PCDATA invalid Char value 31


This strongly hints at a broken encoding, which can easily be triggered by 
your erroneous encode-and-encode cycles above.


Also, the kind of problem you present here makes it pretty clear that you 
are using Python 2.x. In Python 3, you'd get the appropriate exceptions 
when trying to write binary data to a Unicode file.


Stefan



Re: suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-28 Thread Steven D'Aprano
On Fri, 25 Nov 2011 13:50:01 +, Adam Funk wrote:

> I'm converting JSON data to XML using the standard library's json and
> xml.dom.minidom modules.  I get the input this way:
> 
> input_source = codecs.open(input_file, 'rb', encoding='UTF-8',
> errors='replace') big_json = json.load(input_source)
> input_source.close()
> 
> Then I recurse through the contents of big_json to build an instance of
> xml.dom.minidom.Document (the recursion includes some code to rewrite
> dict keys as valid element names if necessary), 

How are you doing that? What do you consider valid?


> and I save the document:
> 
> xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8',
> errors='replace') doc.writexml(xml_file, encoding='UTF-8')
> xml_file.close()
> 
> 
> I thought this would force all the output to be valid, but xmlstarlet
> gives some errors like these on a few documents:

It will force the output to be valid UTF-8 encoded to bytes, not 
necessarily valid XML.
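
For instance:

  u'\x07'.encode('UTF-8')   # succeeds and returns '\x07'

The codec is perfectly happy with a BEL character that no XML parser
will accept.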


> PCDATA invalid Char value 7
> PCDATA invalid Char value 31

What's xmlstarlet, and at what point does it give this error? It doesn't 
appear to be in the standard library.



> I guess I need to process each piece of PCDATA to clean out the control
> characters before creating the text node:
> 
>   text = doc.createTextNode(j)
>   root.appendChild(text)
> 
> What's the best way to do that, bearing in mind that there can be
> multibyte characters in the strings?

Are you mixing unicode and byte strings?

Are you sure that the input source is actually UTF-8? If not, then all 
bets are off: even if the decoding step works, and returns a string, it 
may contain the wrong characters. This might explain why you are getting 
unexpected control characters in the output: they've come from a badly 
decoded input.

Another possibility is that your data actually does contain control 
characters where there shouldn't be any.


-- 
Steven


suppressing bad characters in output PCDATA (converting JSON to XML)

2011-11-25 Thread Adam Funk
I'm converting JSON data to XML using the standard library's json and
xml.dom.minidom modules.  I get the input this way:

input_source = codecs.open(input_file, 'rb', encoding='UTF-8', errors='replace')
big_json = json.load(input_source)
input_source.close()

Then I recurse through the contents of big_json to build an instance
of xml.dom.minidom.Document (the recursion includes some code to
rewrite dict keys as valid element names if necessary), and I save the
document:

xml_file = codecs.open(output_fullpath, 'w', encoding='UTF-8', errors='replace')
doc.writexml(xml_file, encoding='UTF-8')
xml_file.close()


I thought this would force all the output to be valid, but xmlstarlet
gives some errors like these on a few documents:

PCDATA invalid Char value 7
PCDATA invalid Char value 31

I guess I need to process each piece of PCDATA to clean out the
control characters before creating the text node:

  text = doc.createTextNode(j)
  root.appendChild(text)

What's the best way to do that, bearing in mind that there can be
multibyte characters in the strings?  I found some suggestions on the
WWW involving filter with string.printable, which AFAICT isn't
unicode-friendly --- is there a unicode.printable or something like
that?
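
The nearest unicode-aware equivalent I can think of would be
filtering on unicodedata.category --- a sketch, with "text" standing
in for the string in question:

  import unicodedata

  # keep tab/newline/CR, drop the other control characters (category Cc)
  cleaned = u''.join(ch for ch in text
                     if ch in u'\t\n\r'
                     or unicodedata.category(ch) != 'Cc')

But is that reasonable, or is there something better?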


-- 
"Mrs CJ and I avoid clichés like the plague."