Re: [Tutor] Encoding

Dave Angel Fri, 05 Mar 2010 11:45:40 -0800

Giorgio wrote:

2010/3/5 Dave Angel <da...@ieee.org>

In other words, you don't understand my paragraph above.



Maybe. But please don't be angry. I'm here to learn, and as I've run into a
very difficult concept I want to fully understand it.

I'm not angry, and I'm sorry if I seemed angry. Tone of voice is hard to convey in a text message.

Once the string is stored in t as an 8 bit string, it's irrelevant what the
source file encoding was.



Ok, you've said this 2 times, but, please, can you tell me why? I think
that's the key passage to understand how encoding of strings works. The
source file encoding affects all file lines, also strings.

Nope, not strings.  It only affects string literals.

 If my encoding is
UTF8 python will read the string "ciao è ciao" as 'ciao \xc3\xa8 ciao' but
if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant?

I think the problem is that i can't find any difference between 2 lines
quoted above:

s = u"ciao è ciao"

and

t = "ciao è ciao"
c = unicode(t)

[**  I took the liberty of making the variable names different so I can refer 
to them **]

I'm still not sure whether your confusion is about what the rules are, or why the rules were made that way. The rules are that an unqualified conversion, such as the unicode() function with no second argument, uses the default encoding, in strict mode. Thus the error.

Quoting the help: "If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if /object/ is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode."
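
The rule quoted above can be sketched as follows. This is only an illustration, not the actual implementation of unicode(): decode_like_unicode is a hypothetical helper that assumes the default encoding is ASCII, and the bytes in t assume a UTF-8 source file. It is written with bytes.decode() so that it runs on Python 2.6+ and on Python 3.

```python
# t holds the bytes Python 2 would store for the literal "ciao è ciao"
# if the source file were saved as UTF-8 (an assumption for this sketch).
t = b"ciao \xc3\xa8 ciao"


def decode_like_unicode(data):
    """Hypothetical helper: roughly what unicode(data) does with no
    encoding argument, assuming the default encoding is ASCII."""
    return data.decode("ascii", "strict")


try:
    decode_like_unicode(t)
    failed = False
except UnicodeDecodeError:
    # The bytes 0xc3 and 0xa8 are not ASCII, so strict mode refuses them.
    failed = True

# Naming the encoding explicitly removes the guesswork.
c = t.decode("utf-8")
```

In Python 2 the last line could equally be written unicode(t, "utf-8"); either way the caller, not the interpreter, supplies the encoding.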

As for why the rules are that, I'd have to ask you what you'd prefer. The unicode() function has no idea that t was created from a literal (and no idea what source file that literal was in), so it has to pick some encoding, called the default encoding. The designers decided to use a default encoding of ASCII, because manipulating ASCII strings is always safe, while many functions won't behave as expected when given UTF-8 encoded strings. For example, what's the 7th character of t? That is not necessarily the same as the 7th character of s, since one or more of the characters in between might have taken up multiple bytes in t. In a UTF-8 source file that is the case for your accented character (è is stored as the two bytes \xc3\xa8), though not in a latin1 one (the single byte \xe8), and it is certainly the case for other languages as well.
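
To make the "7th character" question concrete, here is a small sketch. The variable names t_utf8 and t_latin1 are made up for the illustration; they stand for what t would contain under each source file encoding.

```python
s = u"ciao \xe8 ciao"            # text: one item per character
t_utf8 = s.encode("utf-8")       # what t holds with a UTF-8 source file
t_latin1 = s.encode("latin-1")   # what t holds with a latin1 source file

# The byte count depends on the encoding; the character count does not.
lengths = (len(s), len(t_utf8), len(t_latin1))

# Position 7 (index 6). Slicing keeps the result a byte string on
# both Python 2 and Python 3.
seventh_char = s[6]
seventh_byte_utf8 = t_utf8[6:7]
seventh_byte_latin1 = t_latin1[6:7]
```

With UTF-8 the seventh byte is the second half of the encoded è, not the space that is the seventh character of s, which is why indexing into an encoded string is unreliable.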

If you then (whether it's in the next line, or ten thousand calls later)
try to convert to unicode without specifying a decoder, it uses the default
encoder, which is an application-wide thing, and not a source file thing.  To
see what it is on your system, use sys.getdefaultencoding().
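
A minimal check of that setting might look like this. On a stock Python 2.6 install it is expected to report ascii; Python 3 reports utf-8, so the value depends on the interpreter you run it with.

```python
import sys

# The interpreter-wide default used for implicit str <-> unicode conversions.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```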


And this is ok. Spir said that it uses ASCII, you now say that it uses the
default encoder. I think that ASCII on spir's system is the default encoder
so.

I don't know, but I think it's the default in every country, at least on version 2.6. It might make sense to get some value from the OS that defined the locally preferred encoding, but then a program that worked fine in one locale might fail miserably in another.

The point is that there isn't just one global value, and it's a good thing.
 You should figure out everywhere characters come into your program (e.g. source
files, raw_input, file i/o...) and everywhere characters go out of your
program, and deal with each of them individually.
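
One way to handle the file i/o boundary is sketched below, assuming the data file is UTF-8. The file name example.txt and the temporary directory are made up for the illustration; io.open is available from Python 2.6 and behaves like the built-in open on Python 3.

```python
import io
import os
import tempfile

text = u"ciao \xe8 ciao"   # kept as unicode inside the program

# Hypothetical data file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Going out: encode explicitly at the boundary.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# What actually landed on disk is bytes.
with io.open(path, "rb") as f:
    raw = f.read()

# Coming in: decode explicitly at the boundary.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```

Each boundary names its own encoding, so nothing depends on the default encoding or on how the source file was saved.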



Ok. But it always happens this way. I hardly ever have to work with strings
defined in the file.

Not sure what you mean by "the file." If you mean the source file, that's what your examples are about. If you mean a data file, that's dealt with differently.

Don't store anything internally as strings, and you won't create the
ambiguity you have with your 't' variable above.

DaveA


Thank you Dave

Giorgio


