Steve Dower <steve.do...@python.org> added the comment:
The file that fails starts with a UTF-8 BOM, a multibyte sequence marking the file as definitely UTF-8. Unfortunately, none of Python's default settings will handle this, because it's a convention that only really exists on Windows.

On Windows we currently still default to your console encoding, since that is what we have always done and changing the default is very complex. Apparently your console encoding does not include the character represented by the first byte of the BOM. In any case, it's not a character you'd ever want to see, so if it _had_ worked, you'd just have garbage at the start of the data you read.

The immediate fix for your scenario is to use "open(filename, 'r', encoding='utf-8-sig')", which will handle the BOM correctly.

For the core team, I still think it's worth having the default encoding be able to read and drop the UTF-8 BOM from the start of a file. Since we shouldn't do it for any arbitrary decode operation (which may not be at the start of a file), it'd have to be a special default object for the TextIOWrapper case, but it would have solved this issue. If the BOM is there, it can switch to UTF-8 (or UTF-16, if that BOM exists); if not, it can use whatever the default would have been (based on all the other available settings).

----------
nosy: +methane
title: file.read() UnicodeDecodeError with large files on Windows -> file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows
versions: +Python 3.11 -Python 3.6, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue44510>
_______________________________________
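
For anyone landing on this issue, here is a minimal sketch of both ideas from the comment: the immediate 'utf-8-sig' workaround, and a user-level approximation of the "check for a BOM, otherwise use the default" behaviour. The helper names (detect_bom_encoding, read_text, write_sample) are hypothetical and are not part of CPython or of any proposed TextIOWrapper change; the sketch also ignores UTF-32 BOMs.

```python
import codecs
import locale
import os
import tempfile


def detect_bom_encoding(path, fallback=None):
    """Pick an encoding from a leading BOM, else return the fallback.

    Hypothetical helper. UTF-32 BOMs are not handled in this sketch.
    """
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes as UTF-8 and drops the BOM
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec consumes the BOM to pick byte order
    # No BOM: use whatever the default would have been.
    return fallback or locale.getpreferredencoding(False)


def read_text(path, fallback=None):
    """Read a whole text file, honouring a BOM if one is present."""
    with open(path, "r", encoding=detect_bom_encoding(path, fallback)) as f:
        return f.read()


def write_sample(data):
    """Write raw bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


if __name__ == "__main__":
    path = write_sample(codecs.BOM_UTF8 + b"hello\n")
    try:
        # The immediate fix: the BOM is consumed by the codec.
        with open(path, "r", encoding="utf-8-sig") as f:
            print(repr(f.read()))  # 'hello\n'
        # Plain UTF-8 decodes, but the BOM stays in the data as U+FEFF.
        with open(path, "r", encoding="utf-8") as f:
            print(repr(f.read()))  # '\ufeffhello\n'
        # BOM sniffing gives the same result as 'utf-8-sig' here.
        print(repr(read_text(path)))  # 'hello\n'
    finally:
        os.remove(path)
```

Sniffing the first bytes only makes sense when reading from the start of a file, which is the same restriction the comment places on any built-in version of this behaviour.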