Re: [newbie] String to binary conversion

2012-08-07 Thread 88888 Dihedral
On Tuesday, 7 August 2012 at 10:01:05 AM UTC+8, Steven D'Aprano wrote:
 On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:

  If I have a string abcd then, with 8-bit encoding of each character,
  there is a corresponding 32-bit binary integer. How could I best obtain
  that integer and from that integer backwards again obtain the original
  string? Thanks in advance.

 First you have to know the encoding, as that will define the integers you
 get. There are many 8-bit encodings, but of course they can't all encode
 arbitrary 4-character strings. Since there are tens of thousands of
 different characters, and an 8-bit encoding can only code for 256 of
 them, there are many strings that an encoding cannot handle.

 For those, you need multi-byte encodings like UTF-8, UTF-16, etc.

 Sticking to one-byte encodings: since most of them are compatible with
 ASCII, examples with abcd aren't very interesting:

 py> 'abcd'.encode('latin1')
 b'abcd'

 Even though the bytes object b'abcd' is printed as if it were a string,
 it is actually treated as an array of one-byte ints:

 py> b'abcd'[0]
 97

 Here's a more interesting example, using Python 3: it uses at least one
 character (the Greek letter π) which cannot be encoded in Latin1, and two
 which cannot be encoded in ASCII:

 py> 'aπ©d'.encode('iso-8859-7')
 b'a\xf0\xa9d'

 Most encodings will round-trip successfully:

 py> text = 'aπ©Z!'
 py> data = text.encode('iso-8859-7')
 py> data.decode('iso-8859-7') == text
 True

 (although the ability to round-trip is a property of the encoding itself,
 not of the encoding system).

 Naturally if you encode with one encoding, and then decode with another,
 you are likely to get different strings:

 py> text = 'aπ©Z!'
 py> data = text.encode('iso-8859-7')
 py> data.decode('latin1')
 'að©Z!'
 py> data.decode('iso-8859-14')
 'aŵ©Z!'

 Both the encode and decode methods take an optional argument, errors,
 which specifies the error handling scheme. The default is errors='strict',
 which raises an exception. Others include 'ignore' and 'replace'.

 py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
 b'aZ!'
 py> 'aŵðπ©Z!'.encode('ascii', 'replace')
 b'a????Z!'

 --
 Steven




I think a UTF-8 or UTF-16 codec is necessary; just recall the MS encoding
codecs of Win98 and NT, which collected taxes all over the world.


In fact, for each character encoding, please develop a codec to and from
UTF-8 or UTF-16.

That way one can convert between any two of the supported character sets.
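
As a rough illustration of using Unicode as the pivot between two character sets, here is a minimal Python 3 sketch. The helper name transcode is made up for this example; it relies only on the built-in bytes.decode and str.encode methods, and the choice of ISO-8859-7 and UTF-8 is just an assumption borrowed from the examples quoted above.

```python
def transcode(data, source, target):
    """Convert bytes from one encoding to another via Unicode text."""
    text = data.decode(source)    # bytes -> str (Unicode)
    return text.encode(target)    # str -> bytes in the target encoding


greek = 'aπ©d'.encode('iso-8859-7')              # b'a\xf0\xa9d'
utf8 = transcode(greek, 'iso-8859-7', 'utf-8')
print(utf8)                                       # b'a\xcf\x80\xc2\xa9d'

# Going back through the same pivot restores the original bytes.
print(transcode(utf8, 'utf-8', 'iso-8859-7') == greek)   # True
```

The conversion only succeeds when every character in the text exists in the target character set; otherwise encode raises UnicodeEncodeError under the default error handling.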

-- 

Re: [newbie] String to binary conversion

2012-08-06 Thread Tobiah

The binascii module looks like it might have
something for you.  I've never used it.

Tobiah

http://docs.python.org/library/binascii.html

On 08/06/2012 01:46 PM, Mok-Kong Shen wrote:


If I have a string abcd then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.

M. K. Shen


--
http://mail.python.org/mailman/listinfo/python-list


Re: [newbie] String to binary conversion

2012-08-06 Thread Tobiah

On 08/06/2012 01:59 PM, Tobiah wrote:

The binascii module looks like it might have
something for you. I've never used it.


Having actually read some of that doc, I see
it's not what you want at all.  Sorry.




Re: [newbie] String to binary conversion

2012-08-06 Thread Mok-Kong Shen

Am 06.08.2012 22:59, schrieb Tobiah:

The binascii module looks like it might have
something for you.  I've never used it.


Thanks for the hint, but if I don't err, the module binascii doesn't
seem to work. I typed:

import binascii

and a line that's given as example in the document:

crc = binascii.crc32("hello")

but got the following error message:

TypeError: 'str' does not support the buffer interface.

The same error message appeared when I tried the other functions.

M. K. Shen
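
For what it's worth, this TypeError is what Python 3 reports when a function that expects bytes is handed a str; the documentation example appears to have been written for Python 2. A minimal sketch, assuming Python 3 and assuming ASCII is an acceptable encoding for the text:

```python
import binascii

# In Python 3, crc32 wants bytes, so either use a bytes literal
# or encode the str first.
crc_literal = binascii.crc32(b"hello")
crc_encoded = binascii.crc32("hello".encode("ascii"))
print(crc_literal == crc_encoded)    # True

# hexlify and unhexlify also operate on bytes, not str.
hexed = binascii.hexlify(b"abcd")
print(hexed)                         # b'61626364'
print(binascii.unhexlify(hexed))     # b'abcd'
```

Note that crc32 computes a checksum, not a reversible conversion, so it does not answer the original question even once the type error is out of the way.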



Re: [newbie] String to binary conversion

2012-08-06 Thread MRAB

On 06/08/2012 21:46, Mok-Kong Shen wrote:


If I have a string abcd then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.


Try this (Python 3, in which strings are Unicode):

>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd".encode("latin-1"))[0]
1684234849
>>> hex(_)
'0x64636261'

or this (Python 2, in which strings are bytestrings):

>>> import struct
>>> # For a little-endian integer
>>> struct.unpack("<I", "abcd")[0]
1684234849
>>> hex(_)
'0x64636261'
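
The question also asked for the way back from the integer to the string. A possible sketch for Python 3, assuming the same Latin-1 encoding and exactly four characters (the "I" format code always covers four bytes); the variable names are only illustrative:

```python
import struct

text = "abcd"

# Little-endian unsigned 32-bit integer from the four Latin-1 bytes.
(value,) = struct.unpack("<I", text.encode("latin-1"))
print(value)        # 1684234849

# Pack the integer back into four bytes and decode them.
restored = struct.pack("<I", value).decode("latin-1")
print(restored)     # abcd

# Big-endian reading, so that 'a' ends up as the most significant byte.
(big,) = struct.unpack(">I", text.encode("latin-1"))
print(hex(big))     # 0x61626364
```

Whichever byte order is used for unpacking has to be used again for packing, otherwise the characters come back reversed.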



Re: [newbie] String to binary conversion

2012-08-06 Thread Emile van Sebille

On 8/6/2012 1:46 PM Mok-Kong Shen said...


If I have a string abcd then, with 8-bit encoding of each character,
there is a corresponding 32-bit binary integer. How could I best
obtain that integer and from that integer backwards again obtain the
original string? Thanks in advance.


It's easy to write one:

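def str2val(str, _val=0):
    # Fold the characters into one integer, first character most significant.
    if len(str) > 1: return str2val(str[1:], 256*_val + ord(str[0]))
    return 256*_val + ord(str[0])


def val2str(val, _str=""):
    # Peel off the low byte until what is left fits in a single character.
    if val >= 256: return val2str(val // 256, _str) + chr(val % 256)
    return _str + chr(val)


print(str2val("abcd"))
print(val2str(str2val("abcd")))
print(val2str(str2val("good")))
print(val2str(str2val("longer")))
print(val2str(str2val("verymuchlonger")))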

Flavor to taste.

Emile
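
On Python 3.2 or later, the same idea can probably be expressed without recursion using int.from_bytes and int.to_bytes. This is only a sketch: the function names text_to_int and int_to_text are invented for the example, and Latin-1 is assumed so that each character maps to exactly one byte.

```python
def text_to_int(text, encoding="latin-1"):
    """Interpret the encoded text as one big-endian unsigned integer."""
    return int.from_bytes(text.encode(encoding), "big")


def int_to_text(value, encoding="latin-1"):
    """Inverse of text_to_int, using the fewest bytes that hold the value."""
    length = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(length, "big").decode(encoding)


n = text_to_int("abcd")
print(n)                 # 1633837924
print(hex(n))            # 0x61626364
print(int_to_text(n))    # abcd
print(int_to_text(text_to_int("verymuchlonger")))   # verymuchlonger
```

As with the recursive version, leading NUL characters are lost on the way back, because they contribute nothing to the integer's value.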



Re: [newbie] String to binary conversion

2012-08-06 Thread Steven D'Aprano
On Mon, 06 Aug 2012 22:46:38 +0200, Mok-Kong Shen wrote:

 If I have a string abcd then, with 8-bit encoding of each character,
 there is a corresponding 32-bit binary integer. How could I best obtain
 that integer and from that integer backwards again obtain the original
 string? Thanks in advance.

First you have to know the encoding, as that will define the integers you 
get. There are many 8-bit encodings, but of course they can't all encode 
arbitrary 4-character strings. Since there are tens of thousands of 
different characters, and an 8-bit encoding can only code for 256 of 
them, there are many strings that an encoding cannot handle.

For those, you need multi-byte encodings like UTF-8, UTF-16, etc.
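
To make the difference concrete, here is a small Python 3 sketch (the sample string and variable names are just assumptions for illustration) showing how many bytes the same four characters take in a few multi-byte encodings, and that a one-byte encoding such as Latin-1 refuses the string outright:

```python
text = 'aπ©d'

# Byte length of the same four characters in several encodings.
sizes = {}
for encoding in ('utf-8', 'utf-16-le', 'utf-32-le'):
    sizes[encoding] = len(text.encode(encoding))
print(sizes)   # {'utf-8': 6, 'utf-16-le': 8, 'utf-32-le': 16}

# Latin-1 has no code for the Greek letter, so strict encoding fails.
try:
    text.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)   # False
```

Because the encoded length is no longer four bytes, a 4-character string does not in general correspond to a single 32-bit integer once a multi-byte encoding is involved.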

Sticking to one-byte encodings: since most of them are compatible with 
ASCII, examples with abcd aren't very interesting:

py> 'abcd'.encode('latin1')
b'abcd'

Even though the bytes object b'abcd' is printed as if it were a string, 
it is actually treated as an array of one-byte ints:

py> b'abcd'[0]
97

Here's a more interesting example, using Python 3: it uses at least one 
character (the Greek letter π) which cannot be encoded in Latin1, and two 
which cannot be encoded in ASCII:

py> 'aπ©d'.encode('iso-8859-7')
b'a\xf0\xa9d'

Most encodings will round-trip successfully:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('iso-8859-7') == text
True


(although the ability to round-trip is a property of the encoding itself, 
not of the encoding system).

Naturally if you encode with one encoding, and then decode with another, 
you are likely to get different strings:

py> text = 'aπ©Z!'
py> data = text.encode('iso-8859-7')
py> data.decode('latin1')
'að©Z!'
py> data.decode('iso-8859-14')
'aŵ©Z!'


Both the encode and decode methods take an optional argument, errors, 
which specifies the error handling scheme. The default is errors='strict', 
which raises an exception. Others include 'ignore' and 'replace'.

py> 'aŵðπ©Z!'.encode('ascii', 'ignore')
b'aZ!'
py> 'aŵðπ©Z!'.encode('ascii', 'replace')
b'a????Z!'
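
A short Python 3 sketch of the default behaviour and of two further handlers that the standard codecs accept when encoding, 'backslashreplace' and 'xmlcharrefreplace'. The variable names are only for illustration:

```python
text = 'aŵðπ©Z!'

# errors='strict' is the default, so this raises UnicodeEncodeError.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)    # True

# These handlers keep the information instead of dropping it.
escaped = text.encode('ascii', 'backslashreplace')
print(escaped)          # b'a\\u0175\\xf0\\u03c0\\xa9Z!'
xml = text.encode('ascii', 'xmlcharrefreplace')
print(xml)              # b'a&#373;&#240;&#960;&#169;Z!'

# Decoding accepts an errors argument too.
print(b'a\xf0\xa9d'.decode('ascii', 'ignore'))   # ad
```

With 'ignore' and 'replace' the original text cannot be recovered from the result, so they are best kept for display or logging rather than for data that has to round-trip.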



-- 
Steven