[issue44510] file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows

2021-06-25 Thread Eryk Sun


Eryk Sun added the comment:

> On Windows we currently still default to your console encoding

On Windows, the default encoding for open() is the ANSI code page of the 
current process [1], as returned by GetACP(). It is based on the system locale 
unless it's overridden to UTF-8 in the application manifest. The console 
encoding is unrelated, and we have not used it much since 
io._WindowsConsoleIO was introduced in Python 3.6.
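
To see which encoding open() will pick when no encoding= argument is passed, 
something like the following should work. This is a minimal sketch: the helper 
name default_text_encoding is made up for illustration, and it assumes the 
documented behaviour that open() follows locale.getpreferredencoding(False).

```python
import locale
import sys


def default_text_encoding():
    """Return the encoding open() is documented to use by default.

    Hypothetical helper for illustration only. On Windows this is normally
    the ANSI code page of the process (e.g. 'cp1252'), not the console
    code page.
    """
    # False means: report the preference without calling setlocale().
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    print("open() default encoding:", default_text_encoding())
    # The encoding of the standard streams is chosen separately and may differ.
    print("sys.stdout encoding:   ", getattr(sys.stdout, "encoding", None))
```

If the two printed values differ, that is expected: the first applies to files, 
the second to the standard streams.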

--
nosy: +eryksun
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed
versions: +Python 3.6, Python 3.9 -Python 3.11

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



2021-06-25 Thread Steve Dower


Steve Dower added the comment:

The file that fails starts with a UTF-8 BOM: the three bytes EF BB BF, which 
encode the character U+FEFF and indicate that the file is definitely UTF-8.

Unfortunately, none of Python's default settings will handle this, because it's 
a convention that only really exists on Windows.

On Windows we currently still default to your console encoding, since that is 
what we have always done and changing it by default is very complex. Apparently 
your console encoding does not include the character represented by the first 
byte of the BOM - in any case, it's not a character you'd ever want to see, so 
if it _had_ worked, you'd just have garbage in your read data.

The immediate fix for your scenario is to use 
"open(filename, 'r', encoding='utf-8-sig')", which decodes the file as UTF-8 
and drops a leading BOM if one is present.
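
As a small illustration of the difference, the sketch below writes a temporary 
file that starts with a BOM and reads it back twice. The helper names 
(read_text_dropping_bom, demo) are made up for this example; only open(), 
codecs.BOM_UTF8 and the 'utf-8-sig' codec come from the standard library.

```python
import codecs
import os
import tempfile


def read_text_dropping_bom(path):
    """Read a UTF-8 file, silently skipping a leading BOM if there is one."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


def demo():
    """Return (text read as 'utf-8', text read as 'utf-8-sig')."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # Write the BOM followed by some UTF-8 text, as many Windows tools do.
        with os.fdopen(fd, "wb") as f:
            f.write(codecs.BOM_UTF8 + "h\u00e9llo\n".encode("utf-8"))
        # Plain 'utf-8' decodes the BOM as a real character (U+FEFF).
        with open(path, "r", encoding="utf-8") as f:
            with_bom = f.read()
        without_bom = read_text_dropping_bom(path)
    finally:
        os.remove(path)
    return with_bom, without_bom
```

Reading with plain 'utf-8' leaves U+FEFF at the start of the string; reading 
with 'utf-8-sig' does not. A file without a BOM reads the same either way.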

For the core team, I still think it's worth having the default encoding be able 
to read and drop the UTF-8 BOM from the start of a file. Since we shouldn't do 
it for any arbitrary operation (which may not be at the start of a file), it'd 
have to be a special default object for the TextIOWrapper case, but it would 
have solved this issue. If the BOM is there, it can switch to UTF-8 (or UTF-16, 
if that BOM exists); if not, it can use whatever the default would have been 
(based on all the other available settings).
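
In user code, the idea could be approximated roughly as follows. This is only a 
sketch of the behaviour described above, not how TextIOWrapper works or would 
work: sniff_encoding and open_text are hypothetical names, the file is opened 
twice, and UTF-32 BOMs are deliberately ignored.

```python
import codecs
import locale

# BOM prefixes checked, with the codec to use when one matches.
# Both codecs consume the BOM while decoding. UTF-32 is not handled here;
# note that the UTF-32-LE BOM also begins with the UTF-16-LE BOM bytes.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_encoding(path, fallback=None):
    """Pick a codec from a leading BOM, else fall back to the default."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # No BOM: use the caller's choice, or whatever open() would have used.
    return fallback or locale.getpreferredencoding(False)


def open_text(path, fallback=None):
    """Open a file for reading as text, honouring a UTF-8 or UTF-16 BOM."""
    return open(path, "r", encoding=sniff_encoding(path, fallback))
```

A call such as open_text(filename).read() would then return the text without 
the BOM for UTF-8 and UTF-16 files, and behave like a plain open() otherwise.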

--
nosy: +methane
title: file.read() UnicodeDecodeError with large files on Windows -> 
file.read() UnicodeDecodeError with UTF-8 BOM in files on Windows
versions: +Python 3.11 -Python 3.6, Python 3.9
