Re: [Tutor] symbol encoding and processing problem

Tim Golden Fri, 19 Oct 2007 08:48:26 -0700

Timmie wrote:
> I am totally lost:
> * python has ascii as default encoding
> * my linux uses UTF-8 (therefore all files created on linux are UTF-8)
> * windows uses cp1250
> * IPython something else: on the machine I am currently on, stdin is set
> to cp850
> 
> So what encoding do I use to display and process characters that exceed the
> standard English alphabet?


> My initial question was:
> 
> 1) get a coordinate (DEG° MIN' SEC'') as input from user via easygui
> 2) split that string into its components: degrees, minutes and seconds
> 3) do some processing of the 3 variables
> 4) print the output with easygui.
> 
> I am not really interested which is the best encoding. I want to know:
> * how do I do this so that I don't get an encoding error?
> * how do I code it so that the code runs on linux and windows
> from file and in IPython

I realise that I am running the risk of confusing you further,
but I'm afraid that your attitude of "This isn't my problem;
it's Python's" isn't really going to wash. If you're going to
be using characters which fall outside the realm of 7-bit
ASCII you're going to have to get some understanding of how
the various input, output and language mechanisms deal with
them. And all the more so if you're trying to do this cross-platform.

Maybe there's some kind of sealed environment in some other
language or operating system which takes care of all of this
for you transparently. I wouldn't know. What I do know is that,
if you're using the Python interpreter under Windows and Linux
and whatever else then you're at the mercy of those operating
systems at a certain level.

There are at least two points you have to understand:

1) Python needs to know what encoding was used to save a text file
which it is compiling to bytecode: usually a .py file. It has a default
which you can override in a couple of ways. If whatever encoding you've
specified turns out not to match the text in, say, a literal string with
a degree symbol, then Python will not know what to do and will stop with
an exception. Of course how you encoded the file in question is between
you and your editor.

2) When you are reading or writing text to or from a console or GUI window
or database or PDF or whatever, you also need to know what encoding to use.
If you're writing out, then whatever you're writing to needs to be able to
make sense of the encoding you're supplying -- and you may need to say
which one it was. If you're reading in, you are at the mercy of libraries:
some will always return unicode (BeautifulSoup springs to mind), others will
return raw bytes leaving it up to you to decode, others will return an
encoded string. This is pretty much an historical artefact (or, sadly in some
cases, a case of ignorance) and you're going to have to cope with it.
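
A minimal sketch of that second point, assuming a library hands you
raw bytes in cp1250 (the helper name to_text and the sample value are
made up for illustration): decode to unicode as soon as the data arrives,
work in unicode, and encode only at the point of writing out.

```python
def to_text(value, encoding):
    """Return value as unicode, decoding it first if it is raw bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already unicode, nothing to do

# Pretend a library returned cp1250-encoded bytes (an assumption).
raw_from_library = u"12\u00b0 30' 15\"".encode("cp1250")

text = to_text(raw_from_library, "cp1250")  # unicode from here on
for_utf8_target = text.encode("utf-8")      # encode only on the way out
```

The degree sign is one byte in cp1250 and two bytes in UTF-8, which is
why guessing the wrong encoding at either end produces errors or garbage.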

On my Windows box, easygui handles unicode perfectly well, and the
console running cp437 displays the degree sign. If it didn't, I'd
have to compromise on the display (or use chcp to switch code pages
first). To illustrate, the following program works:

<code>
import easygui

sample = u"DEG\u00b0 MIN' SEC\""
from_user = easygui.enterbox (u"Enter " + sample)
#
# Paste in values from your email since I can't
# be bothered to work out how to get the degree
# sign
#

print from_user

</code>

and from_user is a perfectly good unicode string. Now, if you
want to write that out to a file, or a database or what-have-you
which can't store unicode natively, then you'll have to encode
it, probably as UTF-8, which can encode anything.
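
As a rough sketch of the remaining steps, assuming the user typed
something like 12° 30' 15'' (the pattern, the helper split_dms and the
file name coords.txt are all invented for this example): once the input
is unicode, splitting it is ordinary string handling, and the encoding
only matters again when the result is written out.

```python
import io
import os
import re
import tempfile

# Hypothetical pattern: digits, degree sign, digits, ', digits, '' or "
DMS = re.compile(u"(\\d+)\u00b0\\s*(\\d+)'\\s*(\\d+)(?:''|\")")

def split_dms(text):
    """Return (degrees, minutes, seconds) as ints from a unicode string."""
    match = DMS.match(text.strip())
    if match is None:
        raise ValueError("expected DEG MIN SEC input")
    return tuple(int(part) for part in match.groups())

from_user = u"12\u00b0 30' 15''"  # stands in for the easygui result
degrees, minutes, seconds = split_dms(from_user)
decimal = degrees + minutes / 60.0 + seconds / 3600.0

# io.open encodes on write and decodes on read, here as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "coords.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"%s = %.4f\u00b0\n" % (from_user, decimal))
with io.open(path, "r", encoding="utf-8") as handle:
    read_back = handle.read()
```

Naming the encoding explicitly on both the write and the read is what
keeps the file portable between the Linux and Windows machines.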

For this email, I've used the unicode-escape, but if -- as
you did -- you wanted to use the string literal, then you'd
need to save the .py file in a certain encoding and to place
a line at the top of the file indicating what that encoding
was. If you're happy using unicode-escapes then that saves
a bit of finicking about.
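
To make the two options concrete, here is a small sketch; it assumes
your editor really does save the .py file as UTF-8, and the variable
names are only for illustration:

```python
# -*- coding: utf-8 -*-
# The line above must be the first or second line of the file and must
# match the encoding the editor actually used when saving it.

literal = u"DEG° MIN' SEC\""       # relies on the coding line above
escaped = u"DEG\u00b0 MIN' SEC\""  # plain ASCII source, no coding line needed

same = (literal == escaped)
```

Both spellings produce the same unicode string; the escape form simply
removes any dependence on how the source file was saved.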

TJG
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
