Re: [Tutor] Encoding and XML troubles

2006-11-05 Thread Kent Johnson
William O'Higgins Witteman wrote:
 I've been struggling with encodings in my XML input to Python programs.
 
 Here's the situation - my program has no declared encoding, so it
 defaults to ASCII.  It's written in Unicode, but apparently that isn't
 confusing to the parser.  Fine by me.  I import some XML, probably
 encoded in the Windows character set (I don't remember what that's
 called now).  I can read it for the most part - but it throws exceptions
 when it hits accented characters (some data is being input by French
 speakers).  I am using ElementTree for my XML parsing.
 
 What I'm trying to do is figure out what I need to do to get my program
 to not barf when it hits an accented character.  I've tried adding an
 encoding line as suggested here:
 
 http://www.python.org/dev/peps/pep-0263/
 
 What these do is make the program fail to parse the XML at all.  Has
 anyone encountered this?  Suggestions?  Thanks.

As Luke says, the encoding of your program has nothing to do with the 
encoding of the XML or the types of data your program will accept. PEP 
263 only affects the encoding of string literals in your program.

It sounds like your XML is not well-formed. XML files can have an
encoding declaration *in the XML*. If it is not present, the file is
assumed to be in UTF-8 (or UTF-16 if it starts with a byte-order mark).
If your XML is in Cp1252 but lacks a correct encoding declaration, it is
not well-formed XML because the Cp1252 bytes for accented characters are
not valid UTF-8.

Try including the line
<?xml version="1.0" encoding="windows-1252"?>
or
<?xml version="1.0" encoding="Cp1252"?>

as the first line of the XML. (windows-1252 is the official 
IANA-registered name for Cp1252; I'm not sure which name will actually 
work correctly.)
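
A minimal sketch of the difference the declaration makes, assuming Python 3
and the standard-library xml.etree.ElementTree (both newer than this thread).
The element name and sample text are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: "café" encoded as Cp1252, where é is the single
# byte 0xE9. With no declaration the parser assumes UTF-8 and rejects it.
body = "<drink>caf\u00e9</drink>".encode("cp1252")

try:
    ET.fromstring(body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# Prepending the declaration tells the parser how to decode the bytes.
declared = b'<?xml version="1.0" encoding="windows-1252"?>\n' + body
text = ET.fromstring(declared).text
```

Here parsed_without_declaration ends up False, and text is the Unicode
string "café".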

Kent

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Encoding and XML troubles

2006-11-05 Thread Dustin J. Mitchell
For what it's worth, the vast majority of the XML out there (especially if
you're parsing RSS feeds, etc.) is written by monkeys and is totally
ill-formed.  It seems the days of 'it looked OK in my browser' are still here.

To find out if it's your app or the XML, you could try running the XML through
a validating parser.  There are also various tools out there which might be
able to parse the XML anyway -- xmllint, I believe, can do this.
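
Short of a separate tool, a quick first check is to ask a parser where it
gives up. A rough sketch, assuming Python 3's xml.etree.ElementTree;
first_parse_error is a made-up helper name, and this only tests
well-formedness, not validity against a DTD or schema:

```python
import xml.etree.ElementTree as ET


def first_parse_error(data):
    """Return None if data is well-formed XML, else (message, line, column)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position
        return (str(err), line, column)
    return None


# Hypothetical inputs: the second contains a raw Cp1252 byte (0xE9) but no
# encoding declaration, so the parser treats it as (invalid) UTF-8.
good = first_parse_error(b"<item>plain ascii</item>")
bad = first_parse_error(b"<item>caf\xe9</item>")
```

If the reported position lands on an accented character, the XML's encoding
(or its missing declaration) is the likely culprit, not your program.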

Dustin (not by *any* stretch an expert on XML *or* Unicode)


[Tutor] Encoding and XML troubles

2006-11-04 Thread William O'Higgins Witteman
I've been struggling with encodings in my XML input to Python programs.

Here's the situation - my program has no declared encoding, so it
defaults to ASCII.  It's written in Unicode, but apparently that isn't
confusing to the parser.  Fine by me.  I import some XML, probably
encoded in the Windows character set (I don't remember what that's
called now).  I can read it for the most part - but it throws exceptions
when it hits accented characters (some data is being input by French
speakers).  I am using ElementTree for my XML parsing.

What I'm trying to do is figure out what I need to do to get my program
to not barf when it hits an accented character.  I've tried adding an
encoding line as suggested here:

http://www.python.org/dev/peps/pep-0263/

What these do is make the program fail to parse the XML at all.  Has
anyone encountered this?  Suggestions?  Thanks.
-- 

yours,

William


Re: [Tutor] Encoding and XML troubles

2006-11-04 Thread Luke Paireepinart
Inputting XML into a Python program has nothing to do with what encoding the
Python source is in. So it seems to me that that particular PEP doesn't apply
in this case at all. I'm guessing that the ElementTree module has an option to
use Unicode input.
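
One way to give ElementTree Unicode input, sketched under the assumptions of
Python 3, the standard-library xml.etree.ElementTree, and a file that really
is Cp1252 but carries no encoding declaration: decode the bytes yourself, then
parse the resulting string. The sample data is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw bytes, as they might be read from a file opened in "rb" mode.
raw = "<name>Fran\u00e7ois</name>".encode("cp1252")

# Decode with the encoding you believe the file uses, then parse the str.
# This sidesteps the missing declaration; it does not detect a wrong guess.
root = ET.fromstring(raw.decode("cp1252"))
name = root.text
```

If the file does carry a correct declaration, it is simpler to pass the bytes
to the parser unchanged and let it do the decoding.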
