Walter Dörwald wrote:
> The stateful decoder has a little problem: At least three bytes
> have to be available from the stream until the StreamReader
> decides whether these bytes are a BOM that has to be skipped.
> This means that if the file only contains "ab", the user will
> never see these two characters.

This can be improved, of course: If the first byte is "a", it most definitely is *not* a UTF-8 signature.

So we only need a second byte for the characters between U+F000
and U+FFFF, and a third byte only for the characters
U+FEC0...U+FEFF. But with the first byte being \xef, we need
three bytes *anyway*, so we can always decide with the first
byte only whether we need to wait for three bytes.
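
To illustrate the point, here is a minimal sketch of the check in Python. The function name bom_status and its return strings are made up for this example; only codecs.BOM_UTF8 comes from the standard library. It reports "need more" only while the bytes seen so far are still a proper prefix of the signature, so a first byte other than \xef settles the question immediately.

```python
import codecs


def bom_status(prefix: bytes) -> str:
    """Classify the first bytes of a stream (hypothetical helper).

    Returns "bom" if the data starts with the UTF-8 signature,
    "need more" if it could still turn out to be one, and
    "no bom" if that is already ruled out.
    """
    if prefix.startswith(codecs.BOM_UTF8):
        return "bom"
    # Fewer than three bytes so far, and all of them match the signature.
    if len(prefix) < 3 and codecs.BOM_UTF8.startswith(prefix):
        return "need more"
    return "no bom"
```

Note that this sketch already answers "no bom" after two bytes such as \xef\xbc, which is slightly finer than waiting for all three; either way the first byte alone tells us whether any waiting is needed.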

> A solution for this would be to add an argument named final to
> the decode and read methods that tells the decoder that the
> stream has ended and the remaining buffered bytes have to be
> handled now.
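
A rough sketch of how such a final argument could behave, assuming the simple "wait for three bytes" design described at the top of this message. The class SigDecoder is hypothetical and written only for illustration; it delegates the actual decoding to the standard library's incremental UTF-8 decoder.

```python
import codecs


class SigDecoder:
    """Hypothetical decoder that skips a leading UTF-8 signature."""

    def __init__(self):
        self.buffer = b""   # bytes held back until we can decide
        self.first = True   # still looking at the start of the stream?
        self.utf8 = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data, final=False):
        data = self.buffer + data
        self.buffer = b""
        if self.first:
            # Without final, wait until three bytes are available.
            if len(data) < 3 and not final:
                self.buffer = data
                return ""
            self.first = False
            if data.startswith(codecs.BOM_UTF8):
                data = data[3:]  # drop the signature
        return self.utf8.decode(data, final)
```

With this sketch, decode(b"ab") returns an empty string because the decoder is still waiting, and a following decode(b"", final=True) releases "ab".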

Shouldn't an empty read from the underlying stream be taken as an EOF?
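
As a sketch of what I mean, assuming a blocking file-like object whose read() returns b"" only at end of data: the read loop itself can turn the empty read into final=True. The helper name read_text and the chunk_size parameter are made up for this example; the decoder is the incremental "utf-8-sig" decoder found in the codecs module of current Python versions.

```python
import codecs
import io


def read_text(stream, chunk_size=1):
    """Decode a byte stream, treating an empty read as EOF (illustrative)."""
    decoder = codecs.getincrementaldecoder("utf-8-sig")()
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk  # empty read: the stream has ended
        parts.append(decoder.decode(chunk, final=at_eof))
        if at_eof:
            break
    return "".join(parts)


# Example: a file that only contains "ab", read one byte at a time.
text = read_text(io.BytesIO(b"ab"))
```

The caller never has to pass final explicitly here; whether that is acceptable for streams where an empty read does not mean EOF is a separate question.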

Regards,
Martin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev