Giorgio wrote:
2010/3/5 Dave Angel <da...@ieee.org>
In other words, you don't understand my paragraph above.


Maybe. But please don't be angry. I'm here to learn, and as I've run into a
very difficult concept I want to fully understand it.


I'm not angry, and I'm sorry if I seemed angry. Tone of voice is hard to convey in a text message.
Once the string is stored in t as an 8 bit string, it's irrelevant what the
source file encoding was.


Ok, you've said this 2 times, but, please, can you tell me why? I think
that's the key passage to understand how encoding of strings works. The
source file encoding affects all file lines, also strings.
Nope, not strings.  It only affects string literals.
 If my encoding is
UTF8 python will read the string "ciao è ciao" as 'ciao \xc3\xa8 ciao' but
if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant?
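
To make the two byte sequences above concrete, here is a small sketch. It assumes Python 3 syntax (the thread itself is about Python 2.6), and the variable names are only illustrative. The text is written with an escape so the encoding of the example file itself does not matter.

```python
# The same text, written with an escape instead of a raw accented letter.
text = u"ciao \xe8 ciao"

# What an 8-bit string literal would contain for each source file encoding.
utf8_bytes = text.encode("utf-8")      # e-grave becomes two bytes
latin1_bytes = text.encode("latin-1")  # e-grave becomes one byte

print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(len(utf8_bytes), len(latin1_bytes))
```

The source encoding decides which of these byte sequences ends up in the 8-bit string. Once the bytes exist, nothing in the string records which encoding produced them.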

I think the problem is that I can't find any difference between the 2 snippets
quoted above:

s = u"ciao è ciao"

and

t = "ciao è ciao"
c = unicode(t)

[**  I took the liberty of making the variable names different so I can refer 
to them **]
I'm still not sure whether your confusion is about what the rules are, or about why the rules were made that way. The rules are that an unqualified conversion, such as the unicode() function with no second argument, uses the default encoding, in strict mode. Thus the error.
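
A rough illustration of that rule, written as a Python 3 analogue rather than the exact Python 2 call: decoding with "ascii" in strict mode stands in for unicode(t) under an ASCII default encoding. The names t, c and error are just for the example.

```python
# Bytes that a UTF-8 source file would produce for the literal in the thread.
t = b"ciao \xc3\xa8 ciao"

error = None
try:
    # 'strict' is the default error mode; spelled out here for clarity.
    t.decode("ascii", "strict")
except UnicodeDecodeError as exc:
    error = exc
    print("strict ASCII decoding failed: %s" % exc)

# Naming the real encoding removes the guesswork.
c = t.decode("utf-8")
print(repr(c))
```

The first decode fails because the byte \xc3 is outside the ASCII range. The second succeeds only because the caller supplied the encoding that the string itself cannot remember.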

Quoting the help: "If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if object is a Unicode string or subclass it will return that Unicode string without any additional decoding applied.

For objects which provide a __unicode__() method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode."

As for why the rules are that, I'd have to ask you what you'd prefer. The unicode() function has no idea that t was created from a literal (and no idea what source file that literal was in), so it has to pick some encoding, called the default encoding. The designers decided to use a default encoding of ASCII, because manipulating ASCII strings is always safe, while many functions won't behave as expected when given UTF-8 encoded strings. For example, what's the 7th character of t? That is not necessarily the same as the 7th character of s, since one or more of the characters before it might have taken up multiple bytes in t. That doesn't happen with your accented character if the source file is Latin-1, where è is the single byte \xe8, but it does if the file is UTF-8, where è becomes the two bytes \xc3\xa8. It also happens for many other European symbols, and certainly for other languages as well.
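
The "7th character" question can be checked directly. This is a minimal sketch in Python 3 syntax, with illustrative variable names; slices are used on the byte strings so the result is a byte string in both Python 2 and Python 3.

```python
s = u"ciao \xe8 ciao"            # text: 11 characters
t_utf8 = s.encode("utf-8")       # 12 bytes
t_latin1 = s.encode("latin-1")   # 11 bytes

seventh_char = s[6]                  # the space after the accented letter
seventh_byte_utf8 = t_utf8[6:7]      # second half of the two-byte e-grave
seventh_byte_latin1 = t_latin1[6:7]  # same space as in s

print(repr(seventh_char))
print(repr(seventh_byte_utf8))
print(repr(seventh_byte_latin1))
```

Indexing the UTF-8 bytes lands in the middle of a character, which is the kind of surprise an ASCII-only default avoids.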
If you then (whether it's in the next line, or ten thousand calls later)
try to convert to unicode without specifying an encoding, it uses the default
encoding, which is an application-wide thing, and not a source file thing.  To
see what it is on your system, use sys.getdefaultencoding().
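
A quick way to check this value is sketched below. The snippet runs on both Python 2 and Python 3, but the result differs: Python 2.x normally reports 'ascii', while Python 3 reports 'utf-8'.

```python
import sys

# Interpreter-wide default used for implicit str/unicode conversions.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```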


And this is ok. Spir said that it uses ASCII, you now say that it uses the
default encoding. So I think ASCII is the default encoding on spir's
system.


I don't know, but I think it's the default in every country, at least on version 2.6. It might make sense to get some value from the OS that defines the locally preferred encoding, but then a program that worked fine in one locale might fail miserably in another.
The point is that there isn't just one global value, and it's a good thing.
You should figure out everywhere characters come into your program (e.g. source
files, raw_input, file i/o...) and everywhere characters go out of your
program, and deal with each of them individually.
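
One way to follow this advice is to decode at each input boundary and encode at each output boundary, naming the encoding every time. The sketch below assumes Python 3 syntax; read_text and write_text are hypothetical helpers, and in-memory streams stand in for real files.

```python
import io

def read_text(stream, encoding):
    """Decode everything from a byte stream using an explicitly named encoding."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding):
    """Encode text with an explicitly named encoding and write the bytes."""
    stream.write(text.encode(encoding))

# Input boundary: pretend this data arrived from a Latin-1 file.
incoming = io.BytesIO(b"ciao \xe8 ciao")
text = read_text(incoming, "latin-1")

# Output boundary: the destination wants UTF-8.
outgoing = io.BytesIO()
write_text(outgoing, text, "utf-8")
result = outgoing.getvalue()
print(repr(result))
```

Inside the program the data stays as text, so the choice of encoding is made once per boundary and never left to a default.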


Ok. But it always happens this way. I hardly ever have to work with strings
defined in the file.

Not sure what you mean by "the file." If you mean the source file, that's what your examples are about. If you mean a data file, that's dealt with differently.
Don't store text internally as 8-bit strings; keep it as unicode, and you won't
create the ambiguity you have with your 't' variable above.

DaveA


Thank you Dave

Giorgio



