John Salerno wrote:
> Interesting. So then the read() method, if given a numeric argument for
> bytes to read, would act differently depending on if you were using
> Unicode or not?

The read method currently returns a byte string, not a Unicode string. It's not clear to me how the numeric argument should be interpreted if it returns characters some day; it might be best to take the number as a count of characters then. However, not supporting a numeric argument at all might also be reasonable.

> As it is now, it seems to equate the bytes with number
> of characters, but if the document was written using Unicode characters,
> is it possible that read(2) might only pull out one character?

Unicode isn't a character encoding (*all* documents in the world are "written in Unicode", including those encoded with ASCII or Latin-1).

In any case, it doesn't matter what encoding the document is in: read(2) always returns two bytes. How many characters that constitutes depends on the encoding, but read() doesn't return a character string.

It might be that these two bytes are only part of a character, e.g. if you need three bytes to encode a character, or it might be that they are parts of two characters, e.g. when you get the second byte of the first character and the first byte of the second one. In some encodings (e.g. ISO-2022), these bytes may indicate *no* character at all, e.g. when they just signal an in-stream change of character set.

Regards,
Martin
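
To make the byte-versus-character point concrete, here is a minimal sketch. It assumes Python 3, UTF-8 as the encoding, and io.BytesIO standing in for a file opened in binary mode; the helper names read_two and decodes_cleanly are made up for illustration and are not part of any library.

```python
import io


def read_two(data, skip=0):
    """Return the result of read(2) after discarding `skip` bytes."""
    stream = io.BytesIO(data)  # stands in for a file opened in binary mode
    stream.read(skip)
    return stream.read(2)      # two bytes, whatever the encoding is


def decodes_cleanly(chunk, encoding="utf-8"):
    """True if `chunk` is a sequence of complete characters in `encoding`."""
    try:
        chunk.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Two bytes that are two characters: plain ASCII.
ascii_chunk = read_two("ab".encode("utf-8"))

# Two bytes that are only part of one character: the euro sign
# needs three bytes in UTF-8 (E2 82 AC).
partial_chunk = read_two("\u20ac".encode("utf-8"))

# Two bytes that belong to two different characters: each e-acute is
# C3 A9 in UTF-8, so skipping one byte yields A9 C3.
straddling_chunk = read_two("\u00e9\u00e9".encode("utf-8"), skip=1)

for chunk in (ascii_chunk, partial_chunk, straddling_chunk):
    print(len(chunk), decodes_cleanly(chunk))
```

In all three cases read(2) hands back exactly two bytes; only the first chunk happens to decode to whole characters.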