Python under PowerShell adds characters

2017-03-29 Thread lyngwyst
I wrote a Python script, which executed as intended on Linux and from cmd.exe 
on Windows.  Then I ran it from the PowerShell command line, and all print 
statements added ^@ after every character.

Have you seen this?  Do you know how to prevent this?

Thank you,
Jay
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python under PowerShell adds characters

2017-03-29 Thread Jussi Piitulainen
lyngw...@gmail.com writes:

> I wrote a Python script, which executed as intended on Linux and from
> cmd.exe on Windows.  Then I ran it from the PowerShell command line,
> and all print statements added ^@ after every character.
>
> Have you seen this?  Do you know how to prevent this?

The script is printing UTF-16 or something, and the viewer is expecting
ASCII or some eight-bit code and making the null bytes visible as ^@.

Python gets some default encoding from its environment. There are ways
to set the default, and ways to override the default in the script. For
example, you can specify an encoding when you open a file.
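
A minimal sketch of that last point, assuming Python 3; the file name
out.txt is only a placeholder. Passing encoding= to open() fixes the bytes
that end up on disk, whatever default Python picked up from its
environment:

```python
import os
import tempfile

# Placeholder file in a throwaway directory, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# encoding= pins the on-disk encoding; newline="\n" disables newline
# translation so the result is the same on every platform.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("Hello\n")

# Read the file back as raw bytes to see exactly what was written.
with open(path, "rb") as f:
    data = f.read()

print(data)
```

This prints b'Hello\n': six bytes, with no NUL bytes in between.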


Re: Python under PowerShell adds characters

2017-03-29 Thread eryk sun
On Wed, Mar 29, 2017 at 4:06 PM,   wrote:
> I wrote a Python script, which executed as intended on Linux and
> from cmd.exe on Windows.  Then I ran it from the PowerShell
> command line, and all print statements added ^@ after every character.

ISE is the only command-line environment that's specific to
PowerShell. Surely you wouldn't be running Python scripts in ISE.

If powershell.exe is run normally, then it's a console application.
python.exe would inherit the console handle, and that's the end of its
interaction with PowerShell. At most PowerShell (or any process that's
attached to the console) may have set the console to a different
output codepage via SetConsoleOutputCP or set the mode on the screen
buffer via SetConsoleMode. As far as I know, neither of these can make
the console print "^@" as a representation of NUL. It only shows "^@"
in the input buffer when you type Ctrl+2, which is what most terminals
do. For example:

>>> import sys
>>> s = sys.stdin.read(6)
spam^@
>>> s
'spam\x00\n'
>>> print(s)
spam


Re: Python under PowerShell adds characters

2017-03-29 Thread Jay Braun
On Wednesday, March 29, 2017 at 10:28:58 AM UTC-7, eryk sun wrote:
> On Wed, Mar 29, 2017 at 4:06 PM,   wrote:
> > I wrote a Python script, which executed as intended on Linux and
> > from cmd.exe on Windows.  Then I ran it from the PowerShell
> > command line, and all print statements added ^@ after every character.
> 
> ISE is the only command-line environment that's specific to
> PowerShell. Surely you wouldn't be running Python scripts in ISE.
> 
> If powershell.exe is run normally, then it's a console application.
> python.exe would inherit the console handle, and that's the end of its
> interaction with PowerShell. At most PowerShell (or any process that's
> attached to the console) may have set the console to a different
> output codepage via SetConsoleOutputCP or set the mode on the screen
> buffer via SetConsoleMode. As far as I know, neither of these can make
> the console print "^@" as a representation of NUL. It only shows "^@"
> in the input buffer when you type Ctrl+2, which is what most terminals
> do. For example:
> 
> >>> s = sys.stdin.read(6)
> spam^@
> >>> s
> 'spam\x00\n'
> >>> print(s)
> spam

I'm not using ISE.  I'm using a pre-edited script, and running it with the 
python command.

Consider the following simple script named hello.py (Python 2.7):

print "Hello"

If I enter:

python hello.py > out.txt

from cmd.exe I get a 6-character file (characters plus new-line).

from PowerShell I get an extra ^@ character after every character.

j


Re: Python under PowerShell adds characters

2017-03-29 Thread Chris Angelico
On Thu, Mar 30, 2017 at 4:42 AM, Jay Braun  wrote:
> Consider the following simple script named hello.py (Python 2.7):
>
> print "Hello"
>
> If I enter:
>
> python hello.py > out.txt
>
> from cmd.exe I get a 6-character file (characters plus new-line).
>
> from PowerShell I get an extra ^@ character after every character

Sounds like cmd and PS are setting the output encodings differently.
Does the same thing occur with Python 3.6?

Try adding this to your script:

import sys
print(sys.stdout.encoding)

and run it in the same two environments (no redirection needed). Do
they give you the same thing?
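
A slightly fuller version of that check, as a sketch: describe_stream is a
hypothetical helper name, and the report goes to stderr so that it stays
visible even if stdout is later redirected.

```python
import sys

def describe_stream(stream):
    """Return a short description of a text stream's encoding and target."""
    encoding = getattr(stream, "encoding", None)
    try:
        is_console = stream.isatty()
    except (AttributeError, ValueError):
        # Missing method or closed stream: treat it as "not a console".
        is_console = False
    return "encoding=%s, console=%s" % (encoding, is_console)

# stderr is used for the report so that "> out.txt" does not swallow it.
sys.stderr.write("stdout: %s\n" % describe_stream(sys.stdout))
```

If the two shells report different encodings, or one reports
console=False where the other reports True, the streams are being set up
differently.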

My suspicion is that you're running cmd.exe in the default Windows
console, and PowerShell in some different console. But it's hard to be
sure.

ChrisA


Re: Python under PowerShell adds characters

2017-03-29 Thread eryk sun
On Wed, Mar 29, 2017 at 5:42 PM, Jay Braun  wrote:
>
> I'm not using ISE.  I'm using a pre-edited script, and running it with the 
> python command.
>
> Consider the following simple script named hello.py (Python 2.7):
>
> print "Hello"
>
> If I enter:
> python hello.py > out.txt
>
> from cmd.exe I get a 6-character file (characters plus new-line).
> from PowerShell I get an extra ^@ character after every character

You didn't say you were redirecting the output to a file. That's a
completely different story for PowerShell -- and far more frustrating.

cmd.exe implements redirecting a program's output to a file by
temporarily changing its own StandardOutput to the file; spawning the
process, which inherits the StandardOutput handle; and then changing
back to its original StandardOutput (typically a console screen
buffer). The program can write whatever it wants to the file, and cmd
isn't involved in any way.

PowerShell is far more invasive. Instead of giving the child process a
handle for the file, it gives it a handle for a *pipe*. PowerShell
reads from the pipe, and like an annoying busybody that no one asked for,
decodes the output as text, processes it (e.g. replacing newlines),
and writes the processed data to the file. For example:

PS C:\Temp> $script = "import sys; sys.stdout.buffer.write(b'\n')"
PS C:\Temp> python -c $script > test.txt
PS C:\Temp> python -c "print(open('test.txt', 'rb').read())"
b'\xff\xfe\r\x00\n\x00'

I wrote a single byte, b'\n', but PowerShell decoded it, replaced "\n"
with "\r\n", and wrote it as UTF-16 with a BOM.
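
The bytes above can be reproduced in Python itself. This is only a model
of the observed result, not PowerShell's actual implementation;
powershell_like_redirect and its source_encoding parameter are names made
up for the illustration, and the ASCII default is an assumption.

```python
import codecs

def powershell_like_redirect(raw, source_encoding="ascii"):
    """Mimic the observed result: decode, normalize newlines, write UTF-16."""
    text = raw.decode(source_encoding)
    # Normalize first so an existing "\r\n" does not become "\r\r\n".
    text = text.replace("\r\n", "\n").replace("\n", "\r\n")
    # UTF-16 little-endian, preceded by a byte order mark.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(powershell_like_redirect(b"\n"))
print(powershell_like_redirect(b"Hello\n"))
```

The first call gives b'\xff\xfe\r\x00\n\x00', matching test.txt. The
second shows where the ^@ in the original report comes from: every ASCII
character gains a trailing NUL byte once it is encoded as UTF-16-LE.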


Re: Python under PowerShell adds characters

2017-03-29 Thread Chris Angelico
On Thu, Mar 30, 2017 at 5:19 AM, eryk sun  wrote:
> PowerShell is far more invasive. Instead of giving the child process a
> handle for the file, it gives it a handle for a *pipe*. PowerShell
> reads from the pipe, and like an annoying busybody that no one asked for,
> decodes the output as text, processes it (e.g. replacing newlines),
> and writes the processed data to the file. For example:
>
> PS C:\Temp> $script = "import sys; sys.stdout.buffer.write(b'\n')"
> PS C:\Temp> python -c $script > test.txt
> PS C:\Temp> python -c "print(open('test.txt', 'rb').read())"
> b'\xff\xfe\r\x00\n\x00'
>
> I wrote a single byte, b'\n', but PowerShell decoded it, replaced "\n"
> with "\r\n", and wrote it as UTF-16 with a BOM.

Lolwut?

So PS can't handle binary redirection whatsoever. Fascinating.

ChrisA


Re: Python under PowerShell adds characters

2017-03-29 Thread Rob Gaddi

On 03/29/2017 11:23 AM, Chris Angelico wrote:
> On Thu, Mar 30, 2017 at 5:19 AM, eryk sun  wrote:
>> PowerShell is far more invasive. Instead of giving the child process a
>> handle for the file, it gives it a handle for a *pipe*. PowerShell
>> reads from the pipe, and like an annoying busybody that no one asked for,
>> decodes the output as text, processes it (e.g. replacing newlines),
>> and writes the processed data to the file. For example:
>>
>> PS C:\Temp> $script = "import sys; sys.stdout.buffer.write(b'\n')"
>> PS C:\Temp> python -c $script > test.txt
>> PS C:\Temp> python -c "print(open('test.txt', 'rb').read())"
>> b'\xff\xfe\r\x00\n\x00'
>>
>> I wrote a single byte, b'\n', but PowerShell decoded it, replaced "\n"
>> with "\r\n", and wrote it as UTF-16 with a BOM.
>
> Lolwut?
>
> So PS can't handle binary redirection whatsoever. Fascinating.
>
> ChrisA

Engineer 1: Man, that old DOS shell we keep emulating is just getting
older and clunkier.

Engineer 2: I know, we should rewrite it.  You know, whole new thing,
really modernize it.

E1: So, like, bring in bash like everyone else?

E2: No, better.  How about something that integrates with no preexisting
workflow in the world.

E1: Wait, but what commands would it use?

E2: New ones.

E1: But, then how would it behave?

E2: Totally new.  Never before seen.  New commands, new semantics, whole
9 yards.  If anyone's ever used it, I don't want to.

E1: I love this plan.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.


Re: Python under PowerShell adds characters

2017-03-29 Thread Jay Braun
On Wednesday, March 29, 2017 at 11:20:45 AM UTC-7, eryk sun wrote:
> On Wed, Mar 29, 2017 at 5:42 PM, Jay Braun  wrote:
> >
> > I'm not using ISE.  I'm using a pre-edited script, and running it with the 
> > python command.
> >
> > Consider the following simple script named hello.py (Python 2.7):
> >
> > print "Hello"
> >
> > If I enter:
> > python hello.py > out.txt
> >
> > from cmd.exe I get a 6-character file (characters plus new-line).
> > from PowerShell I get an extra ^@ character after every character
> 
> You didn't say you were redirecting the output to a file. That's a
> completely different story for PowerShell -- and far more frustrating.
> 
> cmd.exe implements redirecting a program's output to a file by
> > temporarily changing its own StandardOutput to the file; spawning the
> process, which inherits the StandardOutput handle; and then changing
> back to its original StandardOutput (typically a console screen
> buffer). The program can write whatever it wants to the file, and cmd
> isn't involved in any way.
> 
> PowerShell is far more invasive. Instead of giving the child process a
> handle for the file, it gives it a handle for a *pipe*. PowerShell
> > reads from the pipe, and like an annoying busybody that no one asked for,
> decodes the output as text, processes it (e.g. replacing newlines),
> and writes the processed data to the file. For example:
> 
> PS C:\Temp> $script = "import sys; sys.stdout.buffer.write(b'\n')"
> PS C:\Temp> python -c $script > test.txt
> PS C:\Temp> python -c "print(open('test.txt', 'rb').read())"
> b'\xff\xfe\r\x00\n\x00'
> 
> I wrote a single byte, b'\n', but PowerShell decoded it, replaced "\n"
> with "\r\n", and wrote it as UTF-16 with a BOM.

You are correct.  Sorry I omitted that in my first post.  Thank you for your 
help.

j


Re: Python under PowerShell adds characters

2017-03-29 Thread Chris Angelico
On Thu, Mar 30, 2017 at 5:29 AM, Rob Gaddi wrote:
> Engineer 1: Man, that old DOS shell we keep emulating is just getting older
> and clunkier.
>
> Engineer 2: I know, we should rewrite it.  You know, whole new thing, really
> modernize it.
>
> E1: So, like, bring in bash like everyone else?
>
> E2: No, better.  How about something that integrates with no preexisting
> workflow in the world.
>
> E1: Wait, but what commands would it use?
>
> E2: New ones.

My understanding of PowerShell is more like this:

E1: Man, batch files are so clunky. It's a ridiculous hodge-podge.

E2: Yeah, every shell ever written is clunky. Let's do our own thing
and make it more like a scripting language.

E1: Cool! Only, we won't use an existing language, we'll make our own,
because it'll be better.

E2: I love this plan.

We've had discussions on this list about using Python as a job control
shell, and the usual response is: Python sucks as a command executor.
It's just not designed for that. The clunkiness of bash is precisely
BECAUSE it's designed to be convenient and comfortable for a sysadmin.
All those weird splitting and escaping rules are because (a) the
easiest way to do piping, command sequencing, etc is with symbols, (b)
you can't stop people from using those symbols in file names or
arguments, and (c) it's a pain to have to quote every single string.

AIUI PowerShell is somewhat like VBScript, only it isn't quite that
either. But it's definitely more like a scripting language than a
shell language.

ChrisA


Re: Python under PowerShell adds characters

2017-03-29 Thread Marko Rauhamaa
eryk sun :
> PowerShell is far more invasive. Instead of giving the child process a
> handle for the file, it gives it a handle for a *pipe*. PowerShell
> reads from the pipe, and like an annoying busybody that no one asked for,
> decodes the output as text,

You mean, a bit like Python3 does?


Marko


Re: Python under PowerShell adds characters

2017-03-29 Thread Chris Angelico
On Thu, Mar 30, 2017 at 6:13 AM, Marko Rauhamaa  wrote:
> eryk sun :
>> PowerShell is far more invasive. Instead of giving the child process a
>> handle for the file, it gives it a handle for a *pipe*. PowerShell
>> reads from the pipe, and like an annoying busybody that no one asked for,
>> decodes the output as text,
>
> You mean, a bit like Python3 does?

If you open a file in Python 3, you can choose whether to open it as
text or binary. When you print text to stdout, well, it's text, so of
course it has to be encoded appropriately; if you're doing something
unusual (like a CGI script creating an image file), you can override
the default and change stdout to be binary. But normally, the standard
streams are connected ultimately to a human, so they're text.

The problem is that PS is decoding and then re-encoding instead of
simply telling the process what encoding to use. That's just wrong.
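
For a child that happens to be Python, one existing way for a parent to
tell the process what encoding to use is the PYTHONIOENCODING environment
variable. A sketch, assuming Python 3 on both sides; it says nothing about
what PowerShell does or could do:

```python
import os
import subprocess
import sys

# Copy the parent's environment and pin the child's stdio encoding.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# The child prints one non-ASCII character (U+00E9).
child_code = "print('\\u00e9')"
out = subprocess.check_output([sys.executable, "-c", child_code], env=env)

# Strip the line ending, which is "\n" or "\r\n" depending on platform.
print(out.strip())
```

The parent receives b'\xc3\xa9', the UTF-8 encoding of that character,
without having to decode and re-encode anything itself.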

ChrisA


Re: Python under PowerShell adds characters

2017-03-29 Thread Marko Rauhamaa
Chris Angelico :
> But normally, the standard streams are connected ultimately to a
> human, so they're text.

Huh? The standard input is the workload and the standard output is the
result of the computation.

Arguably, the standard error stream is meant for humans.


Marko


Re: Python under PowerShell adds characters

2017-03-29 Thread eryk sun
On Wed, Mar 29, 2017 at 7:13 PM, Marko Rauhamaa  wrote:
> eryk sun :
>> PowerShell is far more invasive. Instead of giving the child process a
>> handle for the file, it gives it a handle for a *pipe*. PowerShell
>> reads from the pipe, and like an annoying busybody that no one asked for,
>> decodes the output as text,
>
> You mean, a bit like Python3 does?

The closest to what we're talking about here would be using
subprocess.Popen and friends to create pipelines and redirect output
to files. Opening a file defaults to text mode in Python, but that only
affects how Python itself accesses the file. If you pass a file
descriptor as Popen's stdout argument, Python isn't acting as a middle
man. The child process writes directly to the file.
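
A sketch of the Popen case, assuming Python 3; the path is a throwaway
temporary file. The child writes a single newline byte, and the file ends
up containing exactly that byte:

```python
import os
import subprocess
import sys
import tempfile

# Throwaway output file, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# The child writes one raw byte to its standard output.
child_code = "import sys; sys.stdout.buffer.write(b'\\n')"

with open(path, "wb") as f:
    # The child inherits a handle to the file itself; the parent never
    # sees, decodes, or re-encodes the bytes.
    subprocess.check_call([sys.executable, "-c", child_code], stdout=f)

with open(path, "rb") as f:
    data = f.read()

print(data)
```

This prints b'\n', with no BOM, no "\r" and no NUL bytes added.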

PowerShell makes itself a middle man in the cases of file redirection
and pipelines. It does this to enable all of the capabilities of its
object pipeline. That's fine. But, IMO, there should be a simple way
to get the plain-old redirection and piping in which the shell is not
a middle man. The simplest way I know to do that in PowerShell is to
run the command line via `cmd /c`. But I'm no PowerShell expert.


Re: Python under PowerShell adds characters

2017-03-29 Thread Steve D'Aprano
On Thu, 30 Mar 2017 06:47 am, Marko Rauhamaa wrote:

> Chris Angelico :
>> But normally, the standard streams are connected ultimately to a
>> human, so they're text.
> 
> Huh? The standard input is the workload 

Which is usually typed by a human, read from a file containing
human-readable text, a file-name intended to be read by a human, or some
other data in human-readable form.


> and the standard output is the result of the computation.

Which is generally intended to be read by a human.


> Arguably, the standard error stream is meant for humans.

Just like the rest of the computation. Relatively few command-line
computations are performed by machines, for machines, using
machine-friendly human-hostile formats.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



Re: Python under PowerShell adds characters

2017-03-29 Thread Chris Angelico
On Thu, Mar 30, 2017 at 11:14 AM, Steve D'Aprano wrote:
> Just like the rest of the computation. Relatively few command-line
> computations are performed by machines, for machines, using
> machine-friendly human-hostile formats.

And even most computations performed by machines for machines are done
using human-friendly transmission formats. Trying to brain-storm
actually human-hostile formats... let's see. There's extremely
low-level protocols like IP, TCP, DNS, and X11. There are
(de)compression tools like gzip, where byte size is the entire point
of the utility. There's XML, of course, which isn't exactly
machine-friendly, but is pretty human-hostile. And most crypto is done
with byte streams rather than text. Beyond that, pretty much
everything is text. Ever since I started git-managing my /etc
directory, I've been seeing changes made by various programs being
reflected there - in text files. Internet protocols are almost
exclusively text. Unless you say otherwise, most Unix utilities
consume and emit text, even when they're in "machine readable" mode -
for example, a number of git commands support a "--porcelain" option
that removes color codes and pretty formatting, and also uses a format
that's guaranteed to be stable, but it's still text.

Despite fundamentally working with bit operations and numbers,
computers today are still heavily text-aligned. Might have something
to do with the fact that they're programmed by humans...

ChrisA


Re: Python under PowerShell adds characters

2017-03-29 Thread Marko Rauhamaa
Steve D'Aprano :

> On Thu, 30 Mar 2017 06:47 am, Marko Rauhamaa wrote:
>> Huh? The standard input is the workload 
>
> Which is usually typed by a human, read from a file containing
> human-readable text, a file-name intended to be read by a human, or
> some other data in human-readable form.

The main point is that it is supposed to be processed programmatically.
It is somewhat rare that you type in the standard input. In particular,
you want to be able to form useful pipelines from commands.

Of course, in grand UNIX tradition, you strive to design your
interchange formats to be also marginally applicable for human
interaction (XML, base64 etc).

>> and the standard output is the result of the computation.
> Which is generally intended to be read by a human.

That is more often the case. However, you want the format to be rigorous
so it can be easily parsed programmatically.

> Relatively few command-line computations are performed by machines,
> for machines, using machine-friendly human-hostile formats.

Didn't count them. Still, I'd expect not having to deal with Unicode
decoding exceptions with arbitrary input.

There recently was a related debate on the Guile mailing list. Like
Python3, Guile2 is sensitive to illegal UTF-8 on the command line and in
the standard streams. An emacs developer was urging Guile developers to
follow emacs's example and support a superset of UTF-8 and Unicode where
all byte strings can be bijectively mapped into text.

Python3 partially does a similar thing, but only when dealing with
pathnames.
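
The mechanism Python 3 uses for pathnames is the "surrogateescape" error
handler, and it can also be requested explicitly when decoding other data.
A small sketch with a made-up sample:

```python
raw = b"aa\n\xdd\naa"

# "surrogateescape" maps each undecodable byte to a lone surrogate
# (0xDD becomes U+DCDD) instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(repr(text))
print(round_tripped == raw)
```

This prints 'aa\n\udcdd\naa' and then True. The resulting string is not
valid Unicode text, so encoding it without the same error handler fails.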


Marko


Re: Python under PowerShell adds characters

2017-03-29 Thread Steven D'Aprano
On Thu, 30 Mar 2017 07:29:48 +0300, Marko Rauhamaa wrote:

[...]
> I'd expect not having to deal with Unicode
> decoding exceptions with arbitrary input.

That's just silly. If you have *arbitrary* bytes, not all byte-sequences 
are valid Unicode, so you have to expect decoding exceptions, if you're 
processing text.

Coming back to your complaint: Python 3 might default to automatically 
decoding stdin to Unicode, but you can choose to read stdin as bytes if 
you so wish.


> There recently was a related debate on the Guile mailing list. Like
> Python3, Guile2 is sensitive to illegal UTF-8 on the command line and in
> the standard streams. An emacs developer was urging Guile developers to
> follow emacs's example and support a superset of UTF-8 and Unicode where
> all byte strings can be bijectively mapped into text.

I'd like to read that. Got a link?



-- 
Steve


Re: Python under PowerShell adds characters

2017-03-29 Thread Marko Rauhamaa
Steven D'Aprano :

> On Thu, 30 Mar 2017 07:29:48 +0300, Marko Rauhamaa wrote:
>> I'd expect not having to deal with Unicode decoding exceptions with
>> arbitrary input.
>
> That's just silly. If you have *arbitrary* bytes, not all
> byte-sequences are valid Unicode, so you have to expect decoding
> exceptions, if you're processing text.

The input is not in my control, and bailing out may not be an option:

   $ echo $'aa\n\xdd\naa' | grep aa
   aa
   aa
   $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
   $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
   Traceback (most recent call last):
     File "<string>", line 1, in <module>
     File "/usr/lib64/python3.5/codecs.py", line 321, in decode
       (result, consumed) = self._buffer_decode(data, self.errors, final)
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0: invalid continuation byte

Note that "grep" is also locale-aware.

>> There recently was a related debate on the Guile mailing list. Like
>> Python3, Guile2 is sensitive to illegal UTF-8 on the command line and
>> in the standard streams. An emacs developer was urging Guile
>> developers to follow emacs's example and support a superset of UTF-8
>> and Unicode where all byte strings can be bijectively mapped into
>> text.
>
> I'd like to read that. Got a link?

http://lists.gnu.org/archive/html/guile-user/2017-02/msg00054.html


Marko


Re: Python under PowerShell adds characters

2017-03-29 Thread Chris Angelico
On Thu, Mar 30, 2017 at 4:43 PM, Marko Rauhamaa  wrote:
> The input is not in my control, and bailing out may not be an option:
>
>    $ echo $'aa\n\xdd\naa' | grep aa
>    aa
>    aa
>    $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
>    $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
>    Traceback (most recent call last):
>      File "<string>", line 1, in <module>
>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0: invalid continuation byte
>
> Note that "grep" is also locale-aware.

So what exactly does byte value 0xDD mean in your stream?

And if you say "it doesn't matter", then why are you assigning meaning
to byte value 0x0A in your first example? Truly binary data doesn't
give any meaning to 0x0A.

ChrisA


Re: Python under PowerShell adds characters

2017-03-29 Thread Marko Rauhamaa
Chris Angelico :

> On Thu, Mar 30, 2017 at 4:43 PM, Marko Rauhamaa  wrote:
>> The input is not in my control, and bailing out may not be an option:
>>
>>    $ echo $'aa\n\xdd\naa' | grep aa
>>    aa
>>    aa
>>    $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
>>    $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
>>    Traceback (most recent call last):
>>      File "<string>", line 1, in <module>
>>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0: invalid continuation byte
>>
>> Note that "grep" is also locale-aware.
>
> So what exactly does byte value 0xDD mean in your stream?
>
> And if you say "it doesn't matter", then why are you assigning meaning
> to byte value 0x0A in your first example? Truly binary data doesn't
> give any meaning to 0x0A.

What I'm saying is that every program must behave in a minimally
controlled manner regardless of its inputs (which are not in its
control). With UTF-8, it is dangerously easy to write programs that
explode surprisingly. What's more, resyncing after such exceptions is
not at all easy. I would venture to guess that few Python programs even
try to do that.


Marko


Re: Python under PowerShell adds characters

2017-03-29 Thread Chris Angelico
On Thu, Mar 30, 2017 at 4:57 PM, Marko Rauhamaa  wrote:
> What I'm saying is that every program must behave in a minimally
> controlled manner regardless of its inputs (which are not in its
> control). With UTF-8, it is dangerously easy to write programs that
> explode surprisingly. What's more, resyncing after such exceptions is
> not at all easy. I would venture to guess that few Python programs even
> try to do that.

If you expect to get a series of decimal integers, and you find a "Q"
in the middle, is it dangerously easy for your program to blow up? How do
you resync after that? Do these questions even make sense? Not in my
opinion; you got invalid data, so you throw an exception and stop
reading data.

ChrisA



Re: Python under PowerShell adds characters

2017-03-30 Thread Steve D'Aprano
On Thu, 30 Mar 2017 04:43 pm, Marko Rauhamaa wrote:

> Steven D'Aprano :
> 
>> On Thu, 30 Mar 2017 07:29:48 +0300, Marko Rauhamaa wrote:
>>> I'd expect not having to deal with Unicode decoding exceptions with
>>> arbitrary input.
>>
>> That's just silly. If you have *arbitrary* bytes, not all
>> byte-sequences are valid Unicode, so you have to expect decoding
>> exceptions, if you're processing text.
> 
> The input is not in my control, and bailing out may not be an option:


You have to deal with bad input *somehow*. You can't just say it will never
happen. If bailing out is not an option, then perhaps the solution is not
to read stdin as Unicode text, if there's a chance that it actually doesn't
contain Unicode text. Otherwise, you have to deal with any errors.

("Deal with" can include the case of not dealing with them at all, and just
letting your script raise an exception.)



>    $ echo $'aa\n\xdd\naa' | grep aa
>    aa
>    aa
>    $ echo $'\xdd' | python2 -c 'import sys; sys.stdin.read(1)'
>    $ echo $'\xdd' | python3 -c 'import sys; sys.stdin.read(1)'
>    Traceback (most recent call last):
>      File "<string>", line 1, in <module>
>      File "/usr/lib64/python3.5/codecs.py", line 321, in decode
>        (result, consumed) = self._buffer_decode(data, self.errors, final)
>    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdd in position 0: invalid continuation byte

As I said, what did you expect? You chose to read from stdin as Unicode
text, then fed it something that wasn't Unicode text. That's no different
from expecting to read a file name, then passing an ASCII NUL byte.
Something is going to break, somewhere, so you have to deal with it.

I'm not sure if there are better ways, but one way of dealing with this is
to bypass the text layer and read from the raw byte-oriented stream:

[steve@ando ~]$ echo $'\xdd' | python3 -c 'import sys; print(sys.stdin.buffer.read(1))'
b'\xdd'


You have a choice. The default choice is aimed at the most-common use-case,
which is that input will be text, but it's not the only choice.
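
One way to combine the two, sketched under the assumption that the input
is line-oriented: read bytes, decode each line separately, and let a bad
line be reported instead of ending the whole run. decode_lines is a
hypothetical helper name, and io.BytesIO stands in for sys.stdin.buffer.

```python
import io

def decode_lines(byte_stream, encoding="utf-8"):
    """Yield (line_number, text_or_None, raw_bytes) for each input line."""
    for number, raw in enumerate(byte_stream, start=1):
        try:
            yield number, raw.decode(encoding), raw
        except UnicodeDecodeError:
            # Bad line: report it as None and carry on with the next one.
            yield number, None, raw

# Sample input with one undecodable line in the middle.
sample = io.BytesIO(b"aa\n\xdd\naa\n")
results = list(decode_lines(sample))

for number, text, raw in results:
    print(number, repr(text), raw)
```

Line 2 comes back as None with its raw bytes attached, and lines 1 and 3
decode normally. Whether to skip, log, or abort on such a line is then the
caller's decision.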



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
