May 29 2018 11:12 AM, "Thomas Jollans" <t...@tjol.eu> wrote:
> On 2018-05-29 09:55, f...@lutix.org wrote:
>
>> Hello,
>> Using Python 2.7 (will switch to Py3 soon but before I'd like to understand
>> how string encoding
>> worked)
>
> Oh dear. This is probably the exact wrong way to go about it: the
> interplay between string encoding, unicode and bytes is much less clear
> and easy to understand in Python 2.

Ok, I will quickly jump into py3 then.

>> Could you please tell me if I understood well what occurs in Python's mind:
>> in a .py file:
>> if I write s="héhéhé", if my file is declared as unicode coding, python will
>> store in memory
>> s='hx82hx82hx82'
>
> No, it doesn't. At the very least, you're missing some backslashes – and
> I don't know of any character encoding that uses 0x82 to encode é.

Surprisingly, the backslashes were removed from my initial text...
OK, so the stored raw bytes are the ones processed by the system encoder. If my
console were UTF-8, I would have the same raw byte string as you.

> On my system, I see
>
> >>> s = 'héhéhé'
> >>> s
> 'h\xc3\xa9h\xc3\xa9h\xc3\xa9'
>
> My system uses UTF-8. If your PC is set up to use an encoding like ISO
> 8859-15 or Windows-1252, you should see
>
> 'h\xe9h\xe9h\xe9'
>
> The \x?? are just Python notation.
>
>> however this is not yet unicode for python interpreter this is just raw
>> bytes. Right?
>
> Right, this is a bunch of bytes:
>
> >>> s
> 'h\xe9h\xe9h\xe9'
> >>> [ord(c) for c in s]
> [104, 233, 104, 233, 104, 233]
> >>> [hex(ord(c)) for c in s]
> ['0x68', '0xe9', '0x68', '0xe9', '0x68', '0xe9']
>
>> By the way, why 'h' is not turned into hexa value? Because it is already in
>> the ASCII table?
>
> That's just how Python 2 likes to display stuff.
>
>> If I want python interpreter to recognize my string as unicode I have to
>> declare it as unicode
>> s=u'héhéhé' and magically python will look for those
>> hex values 'x82' in the Unicode table. Still OK?
>
> In principle, the unicode table has nothing to do with anything here. It
> so happens that for some characters in some encodings the value is equal
> to the code point, but that's neither here nor there.
>
>> Now: how come when I declare s='héhéhé', print(s) displays well 'héhéhé'? Is
>> it because of my shell
>> windows that is dealing well with unicode? Or is it
>> because the print function is magic?
>
> It's because the print statement is magic.
>
> Actually, this *only* works if the encoding of your file matches the
> default encoding required by your console. This is usually the case as
> long as you stay on the same PC, but this assumption can fall apart
> quite easily when you move code and data between systems, especially if
> they use different operating systems or (human) languages.
>
> Just use Python 3. There, the print function is not magic, which makes
> life so much more logical.

Thanks

> -- Thomas
> --
> https://mail.python.org/mailman/listinfo/python-list
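
For reference, here is a minimal Python 3 sketch of the points discussed in this thread. It is not taken from either poster. The variable names are illustrative, and the cp850 line is only a guess at where the 0x82 in the original question may have come from, since the thread never says which console code page was in use.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
s = 'héhéhé'

# The same text yields different raw bytes depending on the encoding.
utf8_bytes = s.encode('utf-8')
latin_bytes = s.encode('iso-8859-15')
# Assumption: a Western European Windows console code page such as cp850.
cp850_bytes = s.encode('cp850')

print(utf8_bytes)   # b'h\xc3\xa9h\xc3\xa9h\xc3\xa9'
print(latin_bytes)  # b'h\xe9h\xe9h\xe9'
print(cp850_bytes)  # b'h\x82h\x82h\x82'

# Iterating over bytes gives integers, like [ord(c) for c in s] in Python 2.
print(list(latin_bytes))  # [104, 233, 104, 233, 104, 233]

# Decoding with the codec used for encoding recovers the text.
print(utf8_bytes.decode('utf-8') == s)  # True

# Decoding with a different codec silently produces the wrong text
# (here 'hÃ©hÃ©hÃ©'), which is what happens when file and console
# encodings do not match.
mojibake = utf8_bytes.decode('cp1252')
print(mojibake == s)  # False
```

In Python 3 the conversion between text and bytes is always explicit, so a mismatch shows up at a visible encode() or decode() call rather than inside print.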