On Jan 24, 1:41 pm, Ben Finney <[EMAIL PROTECTED]> wrote:
> Ben Finney <[EMAIL PROTECTED]> writes:
> > glacier <[EMAIL PROTECTED]> writes:
>
> > > I use Chinese characters as an example here.
>
> > > >>>s1='你好吗'
> > > >>>repr(s1)
> > > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > > >>>b1=s1.decode('GBK')
>
> > > My first question is: what strategy does 'decode' use to tell the
> > > way to separate the words. I mean since s1 is a multi-byte-char
> > > string, how did it determine to separate the string every 2 bytes
> > > or 1 byte?
>
> > The codec you specified ("GBK") is, like any character-encoding
> > codec, a precise mapping between characters and bytes. It's almost
> > certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.
>
> --
>  \       "He who laughs last, thinks slowest." -- Anonymous |
>   `\                                                         |
> _o__)                                                        |
> Ben Finney

Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
Given your reply, what will happen if I try to decode a long string
piece by piece? I mean:

######################################
a='你好吗'*100000
s1 = u''
cur = 0
while cur < len(a):
    # take at most 1023 bytes per step
    d = min(len(a)-cur,1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

Could the code above produce any bogus characters in s1?

Thanks :)

--
http://mail.python.org/mailman/listinfo/python-list
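
For what it's worth, here is a minimal sketch of the chunking question above. It is an illustration under stated assumptions, not a definitive answer: it uses Python 3 bytes syntax rather than the Python 2 str in the post, it uses the 'gbk' codec in place of the Windows-only 'mbcs', and the helper names decode_naive and decode_incremental are made up for this example. Each character of the sample text is 2 bytes in GBK, so an odd chunk size such as 1023 cuts a character in half at the chunk boundary. Decoding each chunk on its own then cannot reproduce the original text, while an incremental decoder from the standard codecs module carries the leftover byte over to the next chunk.

```python
import codecs

TEXT = '你好吗' * 1000
DATA = TEXT.encode('gbk')   # 2 bytes per character, 6000 bytes in total
CHUNK = 1023                # odd size, so chunk boundaries fall inside characters


def decode_naive(raw, size):
    # Decode every chunk independently. A character split across a chunk
    # boundary cannot be recovered; errors='replace' substitutes U+FFFD
    # instead of raising UnicodeDecodeError.
    return ''.join(
        raw[i:i + size].decode('gbk', errors='replace')
        for i in range(0, len(raw), size)
    )


def decode_incremental(raw, size):
    # The incremental decoder buffers an incomplete trailing byte and
    # prepends it to the next chunk it is given.
    decoder = codecs.getincrementaldecoder('gbk')()
    parts = [decoder.decode(raw[i:i + size]) for i in range(0, len(raw), size)]
    parts.append(decoder.decode(b'', final=True))   # flush; raises if bytes are left over
    return ''.join(parts)


if __name__ == '__main__':
    print(decode_naive(DATA, CHUNK) == TEXT)
    print(decode_incremental(DATA, CHUNK) == TEXT)
```

An even chunk size would happen to work for this particular string, because every character in it is exactly 2 bytes, but GBK also encodes ASCII as single bytes, so in general the incremental decoder is the safer choice.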