On Jan 24, 5:51 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 24, 2:49 pm, glacier <[EMAIL PROTECTED]> wrote:
>
> > I use Chinese characters as an example here.
>
> > >>>s1='你好吗'
> > >>>repr(s1)
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > >>>b1=s1.decode('GBK')
>
> > My first question is: what strategy does 'decode' use to tell how
> > to separate the words? I mean, since s1 is a multi-byte-char string,
> > how does it determine whether to separate the string every 2 bytes or every 1 byte?
>
> The usual strategy for encodings like GBK is:
> 1. If the current byte is less than 0x80, then it's a 1-byte
> character.
> 2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
> make up a two-byte character.
> 3. Current byte 0x80: undefined (or used e.g. in cp936 for the 1-byte
> euro character)
> 4. Current byte 0xFF: undefined
>
> Cheers,
> John

Thanks John, I will try to write a function to test whether the strategy above caused the problem I described in the first post :)
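
For reference, here is a rough sketch of such a function, following only the four lead-byte rules John listed. It is written for Python 3 (bytes literals) rather than the Python 2 session quoted above, and the name split_gbk is just something I made up for illustration. It treats 0x80 and 0xFF as errors and does not validate the second byte of a pair, so it is a test aid under those assumptions, not a description of what the real codec does internally.

```python
def split_gbk(data):
    """Split GBK-encoded bytes into per-character chunks.

    Hypothetical helper: applies only the lead-byte rules quoted above.
    The trail byte of a two-byte character is NOT validated.
    """
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1                      # rule 1: single-byte character
        elif 0x81 <= lead <= 0xFE:
            width = 2                      # rule 2: lead byte plus one more byte
        else:
            # rules 3 and 4: 0x80 and 0xFF are treated as undefined here
            raise ValueError("undefined lead byte 0x%02X at offset %d" % (lead, i))
        if i + width > len(data):
            raise ValueError("truncated character at offset %d" % i)
        chunks.append(data[i:i + width])
        i += width
    return chunks


# The bytes shown by repr(s1) in the original post
s1 = b'\xc4\xe3\xba\xc3\xc2\xf0'
chunks = split_gbk(s1)
decoded = [c.decode('gbk') for c in chunks]
print(chunks)
print(decoded)
```

Comparing ''.join(decoded) with s1.decode('gbk') is a quick way to check whether the split boundaries agree with what the codec itself produces.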