On Jan 27, 9:17 pm, glacier <[EMAIL PROTECTED]> wrote:
> On Jan 24, 3:29 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> > On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[EMAIL PROTECTED]> wrote:
> >
> > > According to your reply, what will happen if I try to decode a long
> > > string separately?
> > > I mean:
> > > ######################################
> > > a = '你好吗' * 100000
> > > s1 = u''
> > > cur = 0
> > > while cur < len(a):
> > >     d = min(len(a) - cur, 1023)
> > >     s1 += a[cur:cur+d].decode('mbcs')
> > >     cur += d
> > > ######################################
> > >
> > > May the code above produce any bogus characters in s1?
> >
> > Don't do that. You might be splitting the input string at a point that is
> > not a character boundary. You won't get bogus output; decode will raise a
> > UnicodeDecodeError instead. You can control how errors are handled, see
> > http://docs.python.org/lib/string-methods.html#l2h-237
> >
> > --
> > Gabriel Genellina
>
> Thanks Gabriel,
>
> I guess I understand what will happen if I don't split the string at
> a character boundary.
> What I'm not sure about is whether the decode method itself can
> mis-detect a boundary. Can you tell me?
>
> Thanks a lot.
*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of a mess is a combination of a
human and a text editor :-)
--
http://mail.python.org/mailman/listinfo/python-list
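For what it's worth, the chunked decoding the OP wants can be done safely with an incremental decoder, which buffers any trailing partial character until the next chunk arrives, so no chunk boundary can ever split a multi-byte character. A minimal sketch in modern Python 3 syntax, using 'gbk' in place of the Windows-only 'mbcs' codec (the 1023-byte chunk size is kept from the original post):

```python
import codecs

# A GBK byte string: each of the 3 characters encodes to 2 bytes,
# so 1023-byte chunks will routinely cut a character in half.
data = '你好吗'.encode('gbk') * 1000

# The incremental decoder keeps any incomplete trailing bytes in an
# internal buffer instead of raising UnicodeDecodeError on them.
decoder = codecs.getincrementaldecoder('gbk')()

parts = []
for start in range(0, len(data), 1023):
    parts.append(decoder.decode(data[start:start + 1023]))
parts.append(decoder.decode(b'', final=True))  # flush the buffer

result = ''.join(parts)
assert result == '你好吗' * 1000
```

The same buffering is what `codecs.open` / `io.TextIOWrapper` do internally, so for a file it is usually simpler to open it with `encoding='gbk'` and let the stream reader handle the boundaries.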