Re: [CODE4LIB] XML Parsing and Python

2013-03-07 Thread Michael Beccaria
I ended up writing a regular-expression find-and-replace function that replaces 
all illegal XML characters with a dash or something similar. I was more 
disappointed that, on the XML creation end, minidom was able to create 
non-compliant XML files. I assumed that anything minidom could produce would be 
compliant, but that doesn't seem to be the case. Now I have to add a 
find-and-replace step on the creation side to avoid this issue in the future. 
Good learning experience, I guess. Thanks for all your suggestions.
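
For readers finding this thread later, the find-and-replace step described above could look roughly like the sketch below. It is written for Python 3 (the thread itself uses Python 2), the name scrub_xml_text is made up for illustration, and the character class is derived from the XML 1.0 "Char" production rather than from Mike's actual code, which was not posted.

```python
import re

# Anything outside the XML 1.0 "Char" production is illegal in a document:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def scrub_xml_text(text, replacement="-"):
    """Replace every character that XML 1.0 forbids with `replacement`."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


cleaned = scrub_xml_text("Club\uffff")
```

Running the scrub just before each setAttribute() call would keep the illegal characters out of the file in the first place, so the parsing side never sees them.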

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!




Re: [CODE4LIB] XML Parsing and Python

2013-03-07 Thread Jay Luker
On Thu, Mar 7, 2013 at 10:49 AM, Michael Beccaria
mbecca...@paulsmiths.edu wrote:

 I ended up doing a regular expression find and replace function to replace
 all illegal xml characters with a dash or something.


:(

A string translation map might be a better approach. Here's what I do as
one part of a general-purpose text cleanup method.

{{{
illegal_unichrs = [ (0x00, 0x08), (0x0B, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
                    (0xD800, 0xDFFF), (0xFDD0, 0xFDDF), (0xFFFE, 0xFFFF),
                    (0x1FFFE, 0x1FFFF), (0x2FFFE, 0x2FFFF), (0x3FFFE, 0x3FFFF),
                    (0x4FFFE, 0x4FFFF), (0x5FFFE, 0x5FFFF), (0x6FFFE, 0x6FFFF),
                    (0x7FFFE, 0x7FFFF), (0x8FFFE, 0x8FFFF), (0x9FFFE, 0x9FFFF),
                    (0xAFFFE, 0xAFFFF), (0xBFFFE, 0xBFFFF), (0xCFFFE, 0xCFFFF),
                    (0xDFFFE, 0xDFFFF), (0xEFFFE, 0xEFFFF), (0xFFFFE, 0xFFFFF),
                    (0x10FFFE, 0x10FFFF) ]
tmap = dict.fromkeys(r for start, end in illegal_unichrs
                     for r in range(start, end+1))
...
text = text.translate(tmap)
}}}

See the str.translate() method at
http://docs.python.org/2/library/stdtypes.html#string-methods
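
A small Python 3 sketch of the same translation-map idea follows. It uses only a shortened subset of the ranges above, and the names delete_map and dash_map are made up for illustration. In Python 3 str.translate() accepts a dict keyed by code point; in Python 2 the dict form applies to unicode objects rather than byte strings.

```python
# Shortened subset of the ranges listed above, for illustration only.
ranges = [(0x00, 0x08), (0x0B, 0x1F), (0xFFFE, 0xFFFF)]

# A code point mapped to None is deleted by str.translate().
delete_map = dict.fromkeys(cp for lo, hi in ranges for cp in range(lo, hi + 1))

# A code point mapped to a string is replaced by that string instead.
dash_map = {cp: "-" for cp in delete_map}

text = "Buffalo\uffffLaunch\x00"
cleaned = text.translate(delete_map)
dashed = text.translate(dash_map)
```

The map is built once and reused, so each call does a single pass over the string without any regular-expression matching.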

--jay


Re: [CODE4LIB] XML Parsing and Python

2013-03-07 Thread Al Matthews
Hello Mike,

I realize minidom is a pure-Python library, but I wonder if ElementTree
isn't preferred here since you're already using lxml?

I think the latter must be modeled on the former's API.
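
As a rough illustration of the ElementTree route, the sketch below builds the same Page/String structure with the standard library's xml.etree.ElementTree. It is written for Python 3, the function name build_page and the sample words are hypothetical, and I would not assume that switching libraries removes the need to scrub the text first.

```python
import xml.etree.ElementTree as ET


def build_page(words, height="3296", width="2609"):
    """Return UTF-8 encoded XML with one <String> element per word."""
    page = ET.Element("Page", HEIGHT=height, WIDTH=width)
    for word in words:
        ET.SubElement(page, "String", CONTENT=word)
    return ET.tostring(page, encoding="utf-8")


xml_bytes = build_page(["BuffaloLaunch", "Club"])

# Round-trip through the parser to confirm the output is well-formed.
root = ET.fromstring(xml_bytes)
contents = [s.get("CONTENT") for s in root]
```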

Or, for a bit of snark, see e.g.
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Bicking: "I don't recommend using minidom for anything."


--
Al Matthews

Software Developer, Digital Services Unit
Atlanta University Center, Robert W. Woodruff Library
email: amatth...@auctr.edu; office: 1 404 978 2057






Re: [CODE4LIB] XML Parsing and Python

2013-03-05 Thread Jon Stroop

Mike,
I haven't used minidom extensively, but my guess is that 
doc.toprettyxml(indent="  ", encoding="utf-8") isn't actually changing the 
encoding because it can't parse the string in your content variable. I'm 
surprised that you're not getting tossed a UnicodeError, but the docs 
for Node.toxml() [1] might shed some light:


To avoid UnicodeError exceptions in case of unrepresentable text data, 
the encoding argument should be specified as “utf-8”.


So what happens if you're not explicit about the encoding, i.e. just 
doc.toprettyxml()? This would hopefully at least move your exception to 
a more appropriate place.


In any case, one solution would be to scrub the string in your content 
variable to get rid of the invalid characters (hopefully they're 
insignificant). Maybe something like this:


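# Python 2: drop any byte that does not decode as UTF-8 on its own
# (in practice, every non-ASCII byte).
def unicode_filter(char):
    try:
        unicode(char, encoding='utf-8', errors='strict')
        return char
    except UnicodeDecodeError:
        return ''

content = 'abc\xFF'
content = ''.join(map(unicode_filter, content))
print content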
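
For comparison, a Python 3 sketch of the same byte-level scrub is below; the variable names are illustrative only. It also shows a limitation worth keeping in mind: a character such as U+FFFF has a perfectly valid UTF-8 encoding, so a filter that only checks whether the bytes decode will leave it in place.

```python
# The byte 0xFF never occurs in valid UTF-8, so decoding with
# errors="ignore" silently drops it.
raw = b"abc\xff"
text = raw.decode("utf-8", errors="ignore")

# U+FFFF encodes to the valid three-byte sequence EF BF BF, so a
# decode-based filter keeps it even though XML 1.0 forbids the character.
nonchar_bytes = "Club\uffff".encode("utf-8")
still_there = nonchar_bytes.decode("utf-8", errors="ignore")
```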

Not really my area of expertise, but maybe worth a shot
-Jon

1. 
http://docs.python.org/2/library/xml.dom.minidom.html#xml.dom.minidom.Node.toxml


--
Jon Stroop
Digital Initiatives Programmer/Analyst
Princeton University Library
jstr...@princeton.edu




On 03/04/2013 03:00 PM, Michael Beccaria wrote:

I'm working on a project that takes the OCR data found in a PDF and places it 
in a custom XML file.

I use Python scripts to create the XML file. Something like this (trimmed down 
a bit):

from xml.dom.minidom import Document
doc = Document()
Page = doc.createElement("Page")
doc.appendChild(Page)
f = StringIO(txt)
lines = f.readlines()
for line in lines:
    word = doc.createElement("String")
    ...
    word.setAttribute("CONTENT", content)
    Page.appendChild(word)
return doc.toprettyxml(indent="  ", encoding="utf-8")


This creates a file, simply, that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Page HEIGHT="3296" WIDTH="2609">
   <String CONTENT="BuffaloLaunch" />
   <String CONTENT="Club" />
   <String CONTENT="Offices" />
   <String CONTENT="Installed" />
   ...
</Page>

I am able to create this document and save it to an XML file. The 
problem occurs when I try to read it using the lxml library:

from lxml import etree
doc = etree.parse(filename)


I am running across errors like "XMLSyntaxError: Char 0xFFFF out of allowed 
range, line 94, column 19", which, when I look at the file, is true: there is a 
0xFFFF character in the content field.

How can a file be created using minidom (which I assumed would create a 
valid XML file) and then fail when parsed with lxml? What should I do to 
fix this on the encoding side so that errors don't show up on the parsing side?
Thanks,
Mike

Mike Beccaria
Systems Librarian
Head of Digital Initiative
Paul Smith's College
518.327.6376
mbecca...@paulsmiths.edu
Become a friend of Paul Smith's Library on Facebook today!


Re: [CODE4LIB] XML Parsing and Python

2013-03-05 Thread Chris Beer
I'll note that 0xFFFF is a Unicode noncharacter, and "these noncharacters 
should never be included in text interchange between implementations." [1] I 
assume the OCR engine may be using 0xFFFF when it can't recognize a character? 
So, it's not wrong for a parser to complain (or not complain) about 0xFFFF, 
and you can just scrub the string like Jon suggests.
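
The mismatch between writer and parser can be reproduced with the standard library alone, as in the Python 3 sketch below. It substitutes the stdlib expat-based parser for lxml (an assumption made to keep the example free of dependencies), and the sample word is made up.

```python
from xml.dom.minidom import Document, parseString
from xml.parsers.expat import ExpatError

doc = Document()
page = doc.createElement("Page")
doc.appendChild(page)
word = doc.createElement("String")
word.setAttribute("CONTENT", "Club\uffff")  # U+FFFF is a Unicode noncharacter
page.appendChild(word)

# minidom serializes the noncharacter without raising an error.
xml_bytes = doc.toprettyxml(indent="  ", encoding="utf-8")

# A parser is entitled to reject it; here the expat-based parser does.
try:
    parseString(xml_bytes)
    rejected = False
except ExpatError:
    rejected = True
```

The writer does no character validation of its own, so the check has to happen either before the text goes in or when the file is read back.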

Chris


[1] http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters



Re: [CODE4LIB] XML Parsing and Python

2013-03-04 Thread Stuart Myles
It sounds like your code isn't recognizing the XML file as UTF-8 (even
though the encoding is correctly marked in your example).

You could try telling the parser explicitly to use UTF-8, like this:

parser = XMLParser(encoding="utf-8")

As discussed in
http://www.daniweb.com/software-development/python/threads/435360/using-xml.etree-with-xml-files-containing-a-symbol

There's also a bit of discussion about using lxml to parse UTF-8 in
http://stackoverflow.com/questions/3402520/is-there-a-way-to-force-lxml-to-parse-unicode-strings-that-specify-an-encoding-i
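
A minimal Python 3 sketch of the explicit-encoding suggestion, using the standard library's xml.etree.ElementTree.XMLParser, is below. The sample document is made up. The encoding argument controls how the input bytes are decoded; I would not expect it to make a parser accept a character that XML itself forbids.

```python
import xml.etree.ElementTree as ET

# Sample input: UTF-8 bytes containing a non-ASCII character (é).
xml_bytes = (
    b'<Page HEIGHT="3296" WIDTH="2609">'
    b'<String CONTENT="Caf\xc3\xa9" />'
    b'</Page>'
)

# Tell the parser explicitly which encoding the bytes use.
parser = ET.XMLParser(encoding="utf-8")
root = ET.fromstring(xml_bytes, parser=parser)
contents = [s.get("CONTENT") for s in root.iter("String")]
```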

Hope this helps!

Regards,

Stuart






