On Tuesday, 7 August 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:
> On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:
>
> > If I have a string "abcd" then, with 8-bit encoding of each character,
> > there is a corresponding 32-bit binary integer. How could I best obtain
> > that integer and from that integer backwards again obtain the original
> > string? Thanks in advance.
>
> First you have to know the encoding, as that will define the integers you
> get. There are many 8-bit encodings, but of course they can't all encode
> arbitrary 4-character strings. Since there are tens of thousands of
> different characters, and an 8-bit encoding can only code for 256 of
> them, there are many strings that an encoding cannot handle.
>
> For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
>
> Sticking to one-byte encodings: since most of them are compatible with
> ASCII, examples with "abcd" aren't very interesting:
>
> py> 'abcd'.encode('latin1')
> b'abcd'
>
> Even though the bytes object b'abcd' is printed as if it were a string,
> it is actually treated as an array of one-byte ints:
>
> py> b'abcd'[0]
> 97
>
> Here's a more interesting example, using Python 3: it uses at least one
> character (the Greek letter π) which cannot be encoded in Latin1, and two
> which cannot be encoded in ASCII:
>
> py> "aπ©d".encode('iso-8859-7')
> b'a\xf0\xa9d'
>
> Most encodings will round-trip successfully:
>
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('iso-8859-7') == text
> True
>
> (although the ability to round-trip is a property of the encoding itself,
> not of the encoding system).
>
> Naturally if you encode with one encoding, and then decode with another,
> you are likely to get different strings:
>
> py> text = 'aπ©Z!'
> py> data = text.encode('iso-8859-7')
> py> data.decode('latin1')
> 'að©Z!'
> py> data.decode('iso-8859-14')
> 'aŵ©Z!'
>
> Both the encode and decode methods take an optional argument, errors,
> which specifies the error handling scheme. The default is errors='strict',
> which raises an exception. Others include 'ignore' and 'replace'.
>
> py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
> b'aZ!'
> py> 'aŵðπ©Z!'.encode('ascii', 'replace')
> b'a????Z!'
>
> --
> Steven
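
Coming back to the original question of turning "abcd" into one 32-bit integer and back, here is a minimal sketch for Python 3. It assumes Latin-1 as the 8-bit encoding and big-endian byte order, neither of which was specified in the question, and the helper names str_to_int and int_to_str are made up for illustration.

```python
def str_to_int(text, encoding="latin1"):
    """Encode text to bytes, then read those bytes as one unsigned integer."""
    data = text.encode(encoding)          # raises UnicodeEncodeError if not encodable
    return int.from_bytes(data, "big")    # first character ends up in the high byte


def int_to_str(n, length, encoding="latin1"):
    """Reverse of str_to_int; length is the number of bytes to produce."""
    return n.to_bytes(length, "big").decode(encoding)


n = str_to_int("abcd")
print(hex(n))             # 0x61626364
print(int_to_str(n, 4))   # abcd
```

The length has to be passed back in explicitly, because an integer does not remember leading zero bytes: a string starting with '\x00' would otherwise come back shorter than it went in.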

I think a UTF-8 or UTF-16 codec is necessary; just recall the Microsoft
encoding codecs of Win98 and NT, which collected taxes all over the world.

For each character encoding, please develop a codec to UTF-8 or UTF-16.
That way one can convert between any two of the supported character sets.

--
http://mail.python.org/mailman/listinfo/python-list
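
The idea of converting between any two character sets by going through Unicode can be sketched in Python 3 as follows. This is only an illustration using the codecs that ship with Python; the function name transcode is hypothetical.

```python
def transcode(data, source, target, errors="strict"):
    """Convert bytes from one encoding to another via a Unicode string."""
    text = data.decode(source)           # bytes in source encoding -> str
    return text.encode(target, errors)   # str -> bytes in target encoding


greek = "aπ©Z!".encode("iso-8859-7")
utf8 = transcode(greek, "iso-8859-7", "utf-8")
print(utf8)                                          # b'a\xcf\x80\xc2\xa9Z!'
print(transcode(utf8, "utf-8", "iso-8859-7") == greek)   # True
print(transcode(greek, "iso-8859-7", "latin1", "replace"))  # b'a?\xa9Z!'
```

Going to UTF-8 or UTF-16 never loses characters, but converting onward to a one-byte character set can, which is why the errors argument is passed through to encode.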