On Jan 27, 7:20 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 27, 9:17 pm, glacier <[EMAIL PROTECTED]> wrote:
> > On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> > > On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[EMAIL PROTECTED]> wrote:
> > > > According to your reply, what will happen if I try to decode a long
> > > > string separately?
> > > > I mean:
> > > > ######################################
> > > > a = '你好吗' * 100000
> > > > s1 = u''
> > > > cur = 0
> > > > while cur < len(a):
> > > >     d = min(len(a) - cur, 1023)  # fixed: original had len(a)-i, but i is undefined
> > > >     s1 += a[cur:cur+d].decode('mbcs')
> > > >     cur += d
> > > > ######################################
> > > > May the code above produce any bogus characters in s1?
> > > Don't do that. You might be splitting the input string at a point that is
> > > not a character boundary. You won't get bogus output; decode will raise a
> > > UnicodeDecodeError instead.
> > > You can control how errors are handled, see
> > > http://docs.python.org/lib/string-methods.html#l2h-237
> > > --
> > > Gabriel Genellina
> > Thanks, Gabriel.
> > I guess I understand what will happen if I don't split the string at
> > a character boundary.
> > I'm not sure whether the decode method will mis-split at a boundary.
> > Can you tell me?
> > Thanks a lot.
> *IF* the file is well-formed GBK, then the codec will not mess up when
> decoding it to Unicode. The usual cause of mess is a combination of a
> human and a text editor :-)
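For what it's worth, the chunked decoding in the quoted code can be done safely with an incremental decoder, which buffers a trailing partial multibyte sequence until the next chunk arrives. A minimal sketch (Python 3 syntax, using the cross-platform 'gbk' codec instead of the Windows-only 'mbcs'; the function name and chunk size are just for illustration):

```python
import codecs

def decode_in_chunks(data: bytes, encoding: str = 'gbk', chunk_size: int = 1023) -> str:
    """Decode `data` in fixed-size chunks without breaking multibyte characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(data), chunk_size):
        # The incremental decoder holds back an incomplete trailing character
        # and completes it when the next chunk is fed in, so an odd chunk_size
        # that splits a 2-byte GBK character mid-way is harmless.
        parts.append(decoder.decode(data[start:start + chunk_size]))
    # Flush; this raises UnicodeDecodeError if the input ends mid-character.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)

text = '你好吗' * 100000
assert decode_in_chunks(text.encode('gbk')) == text
```

This sidesteps the character-boundary problem entirely, so it works even when the input is not a whole number of characters per chunk.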
I guess, firstly, I should check whether the file I used to test is well-formed GBK. :)
--
http://mail.python.org/mailman/listinfo/python-list