Steve Dower <steve.do...@python.org> added the comment:

The file that fails starts with a UTF-8 BOM: the three bytes EF BB BF, which 
encode the single character U+FEFF and mark the file as definitely UTF-8.

Unfortunately, none of Python's default settings will handle this, because it's 
a convention that only really exists on Windows.

On Windows we currently still default to your console encoding, since that is 
what we have always done and changing it by default is very complex. Apparently 
your console encoding does not include the character represented by the first 
byte of the BOM - in any case, it's not a character you'd ever want to see, so 
if it _had_ worked, you'd just have garbage in your read data.

The immediate fix for your scenario is to use "open(filename, 'r', 
encoding='utf-8-sig')" which will handle the BOM correctly.
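
As a minimal sketch of the difference (the temporary file and the variable 
names here are only for illustration, not taken from the report), this writes a 
small file that starts with the BOM and reads it back with both codecs:

```python
import os
import tempfile

# Write a small file that starts with the UTF-8 BOM (bytes EF BB BF).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfhello\n")

# 'utf-8-sig' decodes as UTF-8 and drops a leading BOM if one is present.
with open(path, "r", encoding="utf-8-sig") as f:
    with_sig = f.read()

# Plain 'utf-8' also decodes the file, but keeps the BOM as U+FEFF.
with open(path, "r", encoding="utf-8") as f:
    plain = f.read()

os.remove(path)
print(repr(with_sig))
print(repr(plain))
```

'utf-8-sig' also reads files that have no BOM, so it is a safe choice when you 
know the file is UTF-8 but not whether it carries the marker.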

For the core team, I still think it's worth having the default encoding be able 
to read and drop the UTF-8 BOM from the start of a file. Since we shouldn't do 
it for any arbitrary operation (which may not be at the start of a file), it'd 
have to be a special default object for the TextIOWrapper case, but it would 
have solved this issue. If the BOM is there, it can switch to UTF-8 (or UTF-16, 
if that BOM exists); if not, it can use whatever the default would have been 
(based on all the other available settings).
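
To make the idea concrete, here is a rough user-level sketch. It is not how 
TextIOWrapper works today and not a proposed patch; open_sniffing_bom and its 
fallback parameter are hypothetical names. It peeks at the first bytes, picks a 
codec from the BOM if there is one, and otherwise leaves the choice to open(). 
It only covers the UTF-8 and UTF-16 cases mentioned above and does not try to 
tell a UTF-32 BOM apart from a UTF-16 one.

```python
import codecs
import os
import tempfile


def open_sniffing_bom(path, fallback=None):
    """Hypothetical helper: choose the encoding from a leading BOM, if any."""
    with open(path, "rb") as f:
        head = f.read(3)
    if head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"  # decodes UTF-8 and drops the BOM
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"     # uses the BOM for byte order and drops it
    else:
        encoding = fallback     # None lets open() use its normal default
    return open(path, "r", encoding=encoding)


# Usage: a UTF-8 file with a BOM is read without the marker.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + "caf\u00e9\n".encode("utf-8"))
with open_sniffing_bom(demo_path) as f:
    demo_text = f.read()
os.remove(demo_path)
```

Reading the file twice is wasteful, which is part of why this would want to 
live in the default decoder rather than in a wrapper like this one.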

----------
nosy: +methane
title: file.read() UnicodeDecodeError with large files on Windows -> 
file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows
versions: +Python 3.11 -Python 3.6, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue44510>
_______________________________________