Re: Some questions about decode/encode

John Machin Thu, 24 Jan 2008 01:56:28 -0800

On Jan 24, 2:49 pm, glacier <[EMAIL PROTECTED]> wrote:
> I use Chinese characters as an example here.
>
> >>>s1='你好吗'
> >>>repr(s1)
>
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>
> >>>b1=s1.decode('GBK')
>
> My first question is: what strategy does 'decode' use to tell where
> to separate the words? I mean, since s1 is a multi-byte-character string,
> how does it determine whether to split the string every 2 bytes or 1 byte?
>


The usual strategy for encodings like GBK is:
1. If the current byte is less than 0x80, then it's a 1-byte
character.
2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
make up a two-byte character.
3. Current byte 0x80: undefined (or used, e.g. in cp936, for the 1-byte
euro character).
4. Current byte 0xFF: undefined.
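
The steps above can be sketched roughly as follows. This is only an
illustration, not how the real codec is implemented: it is written for
Python 3 (where indexing a bytes object gives an int), the function name
split_gbk_like is made up for this example, and it looks only at the lead
byte without checking that the second byte is a valid trail byte.

```python
def split_gbk_like(data):
    """Split a bytes object into 1- and 2-byte chunks by lead byte.

    Hypothetical helper for illustration; it follows the simple
    lead-byte rule only and does not validate the trail byte.
    """
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1            # rule 1: single-byte character
        elif 0x81 <= lead <= 0xFE:
            width = 2            # rule 2: lead byte of a two-byte character
        else:
            # rules 3 and 4: 0x80 and 0xFF are treated as undefined here
            raise ValueError("undefined lead byte 0x%02X at offset %d" % (lead, i))
        if i + width > len(data):
            raise ValueError("truncated two-byte character at offset %d" % i)
        chunks.append(data[i:i + width])
        i += width
    return chunks


s1 = b'\xc4\xe3\xba\xc3\xc2\xf0'
print(split_gbk_like(s1))
# [b'\xc4\xe3', b'\xba\xc3', b'\xc2\xf0']
print([chunk.decode('gbk') for chunk in split_gbk_like(s1)])
# ['你', '好', '吗']
```

Mixed input works the same way: b'a\xc4\xe3b' comes out as one 1-byte
chunk, one 2-byte chunk, and another 1-byte chunk.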

Cheers,
John

-- 
http://mail.python.org/mailman/listinfo/python-list
