STINNER Victor added the comment:

> The BOM (byte order mark) appears in the standard input stream. When using 
> cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as 
> CP65001.

How do you change the console encoding? Using the chcp command?

I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you 
please check that sys.stdin.encoding is "cp1252"?


I tested PowerShell with Python 3.5 on Windows 7 with an OEM code page 850 and 
ANSI code page 1252:

- by default, the stdin encoding is cp850 (OEM code page) and 
os.device_encoding(0) returns "cp850". sys.stdin.readline() does not contain a 
BOM.

- when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes 
cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is 
the result of locale.getpreferredencoding(False) (ANSI code page). 
sys.stdin.readline() does not contain a BOM.
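
To reproduce these checks, the three values compared above can be printed 
from the consumer side of the pipe. This is only a minimal sketch: the helper 
name describe_stdin() is made up for illustration, and it assumes that file 
descriptor 0 is the one sys.stdin wraps.

```python
import locale
import os
import sys


def describe_stdin():
    """Collect the encoding-related values discussed in this comment."""
    try:
        # os.device_encoding() returns None when the descriptor is not
        # attached to a terminal (for example when stdin is a pipe).
        device = os.device_encoding(sys.stdin.fileno())
    except (AttributeError, OSError, ValueError):
        # sys.stdin may be None or may not wrap a real file descriptor.
        device = None
    return {
        "stdin_encoding": getattr(sys.stdin, "encoding", None),
        "device_encoding": device,
        "preferred_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_stdin().items():
        print("%s = %r" % (name, value))
```

Run it once directly and once behind a pipe (ex: echo "abc"|python 
script.py) to compare the two cases.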

If I change the console encoding using the command "chcp 65001":

- by default, the stdin encoding = os.device_encoding(0) = "cp65001".  
sys.stdin.readline() does not contain a BOM.

- when stdin is a pipe, stdin encoding = locale.getpreferredencoding(False) = 
"cp1252" and sys.stdin.readline() *contains* the UTF-8 BOM

Note: The UTF-8 BOM is only written once, before the first character.

So the UTF-8 BOM is only written when all of these conditions are met:

- Python is running in PowerShell (The UTF-8 BOM is not written in cmd.exe, 
even with chcp 65001)
- sys.stdin is a pipe
- the console encoding was set manually to cp65001
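
In that case the three BOM bytes (EF BB BF) are decoded as cp1252, so the 
line read from sys.stdin starts with three stray characters instead of 
U+FEFF. A possible workaround on the Python side, as a sketch only: the 
helper name strip_misdecoded_bom() is hypothetical, and it assumes that stdin 
was decoded with the encoding you pass in.

```python
import codecs

# What the UTF-8 BOM looks like once decoded with the ANSI code page 1252.
BOM_AS_CP1252 = codecs.BOM_UTF8.decode("cp1252")


def strip_misdecoded_bom(line, encoding="cp1252"):
    """Drop a leading UTF-8 BOM from text that was decoded with `encoding`."""
    marker = codecs.BOM_UTF8.decode(encoding)
    if line.startswith(marker):
        return line[len(marker):]
    return line
```

Example: strip_misdecoded_bom(sys.stdin.readline(), sys.stdin.encoding) for 
the first line only, since the BOM is written once.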

--

It looks like PowerShell decodes the output of the producer program (echo, 
type, ...) and then encodes the output to the consumer program (ex: python).

It's possible to change the encoding of the encoder by setting $OutputEncoding 
variable. Example to encode to UTF-8 without the BOM:

   $OutputEncoding = New-Object System.Text.UTF8Encoding($False)

Example to encode to UTF-8 with the BOM:

   $OutputEncoding = [System.Text.Encoding]::UTF8

Using [System.Text.Encoding]::UTF8, sys.stdin.readline() starts with a BOM even 
if the console encoding is cp850. If you set the console encoding to 65001 
(chcp 65001) and $OutputEncoding to [System.Text.Encoding]::UTF8, you get... 
two UTF-8 BOMs... yeah!
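
If the consumer cannot control $OutputEncoding, it can read raw bytes and 
drop every leading BOM itself. A sketch, assuming the piped data is valid 
UTF-8 and that the two BOMs arrive back to back as raw bytes (I did not dump 
the bytes to confirm that); the function name is made up for the example. 
Note that the "utf-8-sig" codec alone would only remove the first one.

```python
import codecs


def decode_utf8_dropping_boms(data):
    """Decode UTF-8 bytes after removing any number of leading BOMs."""
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

Usage: text = decode_utf8_dropping_boms(sys.stdin.buffer.read())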

I tried different producer programs: [MS-DOS] echo "abc", [PowerShell] 
write-output "abc", [MS-DOS] type document.txt, [PowerShell] Get-Content 
document.txt, python -c "print('abc')". It doesn't look like using a different 
program changes anything. The UTF-8 BOM is added somewhere by PowerShell 
between the producer and the consumer programs.

To show the console input and output encodings in PowerShell, type 
"[console]::InputEncoding" and "[console]::OutputEncoding".

See also:
http://stackoverflow.com/questions/22349139/utf8-output-from-powershell

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21927>
_______________________________________