Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread eryk sun
On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull
 wrote:
>
> So it's not just invalid surrogate *pairs*, it's invalid surrogates of
> all kinds.  This means that it's theoretically possible (though I
> gather that it's unlikely in the extreme) for a real Windows filename
> to be indistinguishable from one generated by Python's surrogateescape
> handler.

Absolutely if the filesystem is one of Microsoft's such as NTFS,
FAT32, exFAT, ReFS, NPFS (named pipes), MSFS (mailslots) -- and I'm
pretty sure it's also possible with CDFS and UDFS. UDF allows any
Unicode character except NUL.

> What happens when Python's directory manipulation functions on Windows
> encounter such a filename?  Do they try to write it to the disk
> directory?  Do they succeed?  Does that depend on surrogateescape?

Python allows these 'Unicode' (but not strictly UTF compatible)
strings, so it doesn't have a problem with such filenames, as long as
it's calling the Windows wide-character APIs.
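
A minimal sketch of that point, runnable on any platform. The name is the one used in the Ext2Fsd test in this message; the variable names are only illustrative. A str can hold lone surrogates, strict UTF-8 encoding rejects them, and UTF-16-LE with the surrogatepass handler passes the 16-bit code units through unchanged, which is roughly what the wide-character APIs receive:

```python
# A str may contain lone surrogates; only a strict encoder objects to them.
name = "\udc00b\udc00a\udc00d"

try:
    name.encode("utf-8")          # strict UTF-8 refuses lone surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass emits each surrogate as a plain 16-bit code unit.
wide = name.encode("utf-16-le", errors="surrogatepass")
round_trip = wide.decode("utf-16-le", errors="surrogatepass")

print(strict_ok, len(wide), round_trip == name)
```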

> Is there a reason in practice to allow surrogateescape at all on names
> in Windows filesystems, at least when using the *W API?  You mention
> non-Microsoft filesystems; are they common enough to matter?

Previously I gave an example with a VirtualBox shared folder, which
rejects names with invalid surrogates. I don't know how common that is
in general. I typically switch between 2 guests on a Linux host and
share folders between systems. In Windows I mount shared folders as
directory symlinks in C:\Mount.

I just tested another example that led to different results. Ext2Fsd
is a free ext2/ext3 filesystem driver for Windows. I mounted an ext2
disk in Windows 10. Next, in Python I created a file named
"\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to
using UTF-8 as the drive codepage, so I expected it to reject this
filename, just like VBoxSF does. But it worked:

>>> os.listdir('.')[-1]
'\udc00b\udc00a\udc00d'

As expected the ANSI API substitutes question marks for the surrogate codes:

>>> os.listdir(b'.')[-1]
b'?b?a?d'

So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I
mounted the disk in Linux to check:

>>> os.listdir(b'.')[-1]
b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

It blindly encoded the surrogate codes, creating invalid UTF-8. I
think it's called WTF-8 (Wobbly Transformation Format). The file
manager in Linux displays this file as "���b���a���d (invalid
encoding)", and ls prints "???b???a???d". Python uses its
surrogateescape error handler:

>>> os.listdir('.')[-1]
'\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

The original name can be decoded using the surrogatepass error handler:

>>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
'\udc00b\udc00a\udc00d'
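
The three views of the name shown above can be reproduced without any filesystem, using only the codec error handlers. This is a sketch of the codec behaviour, not of what Ext2Fsd does internally:

```python
# The name created on the Windows side, containing lone low surrogates.
name = "\udc00b\udc00a\udc00d"

# Encoding the surrogates "blindly" gives the bytes seen from Linux.
raw = name.encode("utf-8", errors="surrogatepass")

# What the str API reports on Linux: each undecodable byte is escaped
# as a lone surrogate in the range U+DC80..U+DCFF.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Recovering the original name needs surrogatepass on the way back.
recovered = raw.decode("utf-8", errors="surrogatepass")

print(raw)
print(ascii(escaped))
print(recovered == name)
```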
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower
"You consistently ignore Makefiles, .ini, etc."

Do people really do open('makefile', 'rb'), extract filenames and try to use 
them without ever decoding the file contents?

I've honestly never seen that, and it certainly looks like the sort of thing 
Python 3 was intended to discourage. (As soon as you open(..., 'r') you're only 
affected by this change if you explicitly encode again with mbcs.)

Top-posted from my Windows Phone

-Original Message-
From: "Stephen J. Turnbull" 
Sent: 8/17/2016 19:43
To: "Steve Dower" 
Cc: "Paul Moore" ; "Python-Ideas" 
Subject: Re: [Python-ideas] Fix default encodings on Windows

Steve Dower writes:
 > On 17Aug2016 0235, Stephen J. Turnbull wrote:

 > > So a full statement is, "How do we best represent Windows file
 > > system paths in bytes for interoperability with systems that
 > > natively represent paths in bytes?"  ("Other systems" refers to
 > > both other platforms and existing programs on Windows.)
 > 
 > That's incorrect, or at least possible to interpret correctly as
 > the wrong thing. The goal is "code compatibility with systems ...",
 > not interoperability.

You're right, I stated that incorrectly.  I don't have anything to add
to your corrected version.

 > > In a properly set up POSIX locale[1], it Just Works by design,
 > > especially if you use UTF-8 as the preferred encoding.  It's
 > > Windows developers and users who suffer, not those who wrote the
 > > code, nor their primary audience which uses POSIX platforms.
 > 
 > You mentioned "locale", "preferred" and "encoding" in the same sentence, 
 > so I hope you're not thinking of locale.getpreferredencoding()? Changing 
 > that function is orthogonal to this discussion,

You consistently ignore Makefiles, .ini, etc.  It is *not* orthogonal,
it is *the* reason for all opposition to your proposal or request that
it be delayed.  Filesystem names *are* text in part because they are
*used as filenames in text*.

 > When Windows developers and users suffer, I see it as my responsibility 
 > to reduce that suffering. Changing Python on Windows should do that 
 > without affecting developers on Linux, even though the Right Way is to 
 > change all the developers on Linux to use str for paths.

I resent that.  If I were a partisan Linux fanboy, I'd be cheering you
on because I think your proposal is going to hurt an identifiable and
large class of *Windows* users.  I know about and fear this possibility
because they use a language I love (Japanese) and an encoding I hate
but have achieved a state of peaceful coexistence with (Shift JIS).

And on the general principle, *I* don't disagree.  I mentioned earlier
that I use only the str interfaces in my own code on Linux and Mac OS
X, and that I suspect that there are no real efficiency implications
to using str rather than bytes for those interfaces.

On the other hand, the programming convenience of reading the
occasional "text" filename (or other text, such as XML tags) out of a
binary stream and passing it directly to filesystem APIs cannot be
denied.  I think that the kind of usage you propose (a fixed,
universal codec, universally accepted; ie, 'utf-8') is the best way to
handle that in the long run.  But as Grandmaster Lasker said, "Before
the end game, the gods have placed the middle game."  (Lord Keynes
isn't relevant here, Python will outlive all of us. :-)

 > I don't think there's any reasonable way to noisily deprecate these
 > functions within Python, but certainly the docs can be made
 > clearer. People who explicitly encode with
 > sys.getfilesystemencoding() should not get the deprecation message,
 > but we can't tell whether they got their bytes from the right
 > encoding or a RNG, so there's no way to discriminate.

I agree with you within Python; the custom is for DeprecationWarnings
to be silent by default.

As for "making noise", how about announcing the deprecation as like
the top headline for 3.6, postponing the actual change to 3.7, and in
the meantime you and Nick do a keynote duet at PyCon?  (Your partner
could be Guido, too, but Nick has been the most articulate proponent
for this particular aspect of "inclusion".  I think having a
representative from the POSIX world explaining the importance of this
for "all of us" would greatly multiply the impact.)  Perhaps, given my
proposed timing, a discussion at the language summit in '17 and the
keynote in '18 would be the best timing.

(OT, political: I've been strongly influenced in this proposal by
recently reading http://blog.aurynn.com/contempt-culture.  There's not
as much of it in Python as in other communities I'm involved in, but I
think this would be a good symbolic opportunity to express our
opposition to it.  "Inclusion" isn't just about gender and race!)

 > I'm going to put together a summary post here (hopefully today) and get 
 > those who have been contributing to basically sign off on it, then I'll 
 > take it to python-dev. The possible outcomes I'll propose will basical

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

Summary for python-dev.

This is the email I'm proposing to take over to the main mailing list to 
get some actual decisions made. As I don't agree with some of the 
possible recommendations, I want to make sure that they're represented 
fairly.


I also want to summarise the background leading to why we should 
consider making a change here at all, rather than simply leaving it 
alone. There's a chance this will all make its way into a PEP, depending 
on how controversial the core team thinks this is.


Please let me know if you think I've misrepresented (or unfairly 
represented) any of the positions, or if you think I can 
simplify/clarify anything in here. Please don't treat this like a PEP 
review - it's just going to be an email to python-dev - but the more we 
can avoid having the discussions there we've already had here the better.


Cheers,
Steve

---

Background
==========

File system paths are almost universally represented as text in some 
encoding determined by the file system. In Python, we expose these paths 
via a number of interfaces, such as the os and io modules. Paths may be 
passed either direction across these interfaces, that is, from the 
filesystem to the application (for example, os.listdir()), or from the 
application to the filesystem (for example, os.unlink()).


When paths are passed between the filesystem and the application, they 
are either passed through as a bytes blob or converted to/from str using 
sys.getfilesystemencoding(). The result of encoding a string with 
sys.getfilesystemencoding() is a blob of bytes in the native format for 
the default file system.
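
For illustration, the conversion described above is exposed through os.fsencode() and os.fsdecode(). This is a minimal sketch; the file name is arbitrary, and sys.getfilesystemencodeerrors() assumes Python 3.6 or later:

```python
import os
import sys

# The codec and error handler Python uses at the str/bytes path boundary.
encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()   # available from Python 3.6

name = "example.txt"        # arbitrary ASCII-only name for the sketch
raw = os.fsencode(name)     # str -> bytes using the filesystem encoding
back = os.fsdecode(raw)     # bytes -> str using the same encoding

print(encoding, errors, raw, back)
```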


On Windows, the native format for the filesystem is utf-16-le. The 
recommended platform APIs for accessing the filesystem all accept and 
return text encoded in this format. However, prior to Windows NT (and 
possibly further back), the native format was a configurable machine 
option and a separate set of APIs existed to accept this format. The 
option (the "active code page") and these APIs (the "*A functions") 
still exist in recent versions of Windows for backwards compatibility, 
though new functionality often only has a utf-16-le API (the "*W 
functions").


In Python, we recommend using str as the default format on Windows 
because it can correctly round-trip all the characters representable in 
utf-16-le. Our support for bytes explicitly uses the *A functions and 
hence the encoding for the bytes is "whatever the active code page is". 
Since the active code page cannot represent all Unicode characters, the 
conversion of a path into bytes can lose information without warning.


As a demonstration of this:

>>> open('test\uAB00.txt', 'wb').close()
>>> import glob
>>> glob.glob('test*')
['test\uab00.txt']
>>> glob.glob(b'test*')
[b'test?.txt']

The Unicode character in the second call to glob is missing information. 
You can observe the same results in os.listdir() or any function that 
matches its result type to the parameter type.
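
The same loss can be imitated on any platform without touching the filesystem. This is only a sketch: cp1252 is an assumed stand-in for the active code page, and errors="replace" approximates what the *A functions do inside the OS:

```python
name = "test\uab00.txt"

# U+AB00 has no representation in cp1252, so it becomes b"?".
lossy = name.encode("cp1252", errors="replace")

# Decoding cannot bring the original character back.
recovered = lossy.decode("cp1252")

print(lossy, recovered == name)
```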


Why is this a problem?
======================

While the obvious and correct answer is to just use str everywhere, it 
remains well known that on Linux and MacOS it is perfectly okay to use 
bytes when taking values from the filesystem and passing them back. 
Doing so also avoids the cost of decoding and reencoding, such that 
(theoretically), code like below should be faster because of the `b'.'`:


>>> for f in os.listdir(b'.'):
...     os.stat(f)
...

On Windows, if a filename exists that cannot be encoded with the active 
code page, you will receive an error from the above code. These errors 
are why in Python 3.3 the use of bytes paths on Windows was deprecated 
(listed in the What's New, but not clearly obvious in the documentation 
- more on this later). The above code produces multiple deprecation 
warnings in 3.3, 3.4 and 3.5 on Windows.
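
For comparison, here is a str-only variant of the loop above, sketched against a temporary directory. The helper name stat_sizes is hypothetical, and the sketch assumes the filesystem encoding can represent U+AB00 (true for utf-8):

```python
import os
import tempfile

def stat_sizes(directory):
    """Return {name: size} for a directory, using str paths throughout."""
    sizes = {}
    for name in os.listdir(directory):        # str in, str names out
        sizes[name] = os.stat(os.path.join(directory, name)).st_size
    return sizes

with tempfile.TemporaryDirectory() as root:
    # Same file name as in the glob demonstration.
    with open(os.path.join(root, "test\uab00.txt"), "wb") as f:
        f.write(b"abc")
    sizes = stat_sizes(root)

print(sizes)
```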


However, we still keep seeing libraries use bytes paths, which can cause 
unexpected issues on Windows. Given that the current approach of quietly 
recommending that library developers either write their code twice (once 
for bytes and once for str) or use str exclusively is not working, we 
should consider alternative mitigations.


Proposals
=========

There are two dimensions here - the fix and the timing. We can basically 
choose any fix and any timing.


The main differences between the fixes are the balance between incorrect 
behaviour and backwards-incompatible behaviour. The main issue with 
respect to timing is whether or not we believe using bytes as paths on 
Windows was correctly deprecated in 3.3 and sufficiently advertised 
since to allow us to change the behaviour in 3.6.


Fixes
-----

Fix #1: Change sys.getfilesystemencoding() to utf-8 on Windows

Currently the default filesystem encoding is 'mbcs', which is a 
meta-encoder that uses the active code page. In reality, our 
implementation uses the *A APIs and we don't explicitly decode bytes in 
order to pass them to the filesystem. This allows the OS to quietly 
rep

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 1:25 AM, Steve Dower  wrote:
> >>> open('test\uAB00.txt', 'wb').close()
> >>> import glob
> >>> glob.glob('test*')
> ['test\uab00.txt']
> >>> glob.glob(b'test*')
> [b'test?.txt']
>
> The Unicode character in the second call to glob is missing information. You
> can observe the same results in os.listdir() or any function that matches
> its result type to the parameter type.

Apologies if this is just noise, but I'm a little confused by this.
The second call to glob doesn't have any Unicode characters at all,
the way I see it - it's all bytes. Am I completely misunderstanding
this?

ChrisA


[Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Philipp A.
Hi, I originally posted this via google groups, which didn’t make it
through to the list proper, sorry! Read it here please:
https://groups.google.com/forum/#!topic/python-ideas/V1U6DGL5J1s

My arguments are basically:

   1. f-literals are semantically not strings, but expressions.
   2. Their escape sequences in the code parts are fundamentally both
   detrimental and superfluous (they’re only in for convenience, as confirmed
   by Guido in the quote below)
   3. They’re detrimental because Syntax highlighters are (by design)
   unable to handle this part of Python 3.6a4’s grammar. This will cause code
   to be highlighted as parts of a string and therefore overlooked. i’m very
   sure this will cause bugs.
   4. The fact that people see the embedded expressions as somehow “part of
   the string” is confusing.

My proposal is to redo their grammar:
They shouldn’t be parsed as strings and post-processed, but be their own
thing. (This also opens the door to potentially extending them with
something like JavaScript’s tagged templates.)

Without the limitations of the string tokenization code/rules, only the
string parts would have escape sequences, and the expression parts would be
regular python code (“holes” in the literal).

Below the mentioned quote and some replies to the original thread:

Guido van Rossum wrote on Wed, Aug 17, 2016 at 20:11:

> The explanation is honestly that the current approach is the most
> straightforward for the implementation (it's pretty hard to intercept the
> string literal before escapes have been processed) and nobody cares enough
> about the edge cases to force the implementation to jump through more hoops.
>
> I really don't think this discussion should be reopened. If you disagree,
> please start a new thread on python-ideas.
>

I really think it should. Please look at python code with f-literals. if
they’re highlighted as strings throughout, you won’t be able to spot which
parts are code. if they’re highlighted as code, the escaping rules
guarantee that most highlighters can’t correctly highlight python anymore.
i think that’s a big issue for readability.

Brett Cannon wrote on Wed, Aug 17, 2016 at 20:28:

> They are still strings, there is just post-processing on the string itself
> to do the interpolation.
>

Sounds hacky to me. I’d rather see a proper parser for them, which of
course would make my vision easy.


> By doing it this way the implementation can use Python itself to do the
> tokenizing of the string, while if you do the string interpolation
> beforehand you would then need to do it entirely at the C level which is
> very messy and painful since you're explicitly avoiding Python's automatic
> handling of Unicode, etc.
>

of course we reuse the tokenization for the string parts. as said, you can
view an f-literal as interleaved sequence of strings and expressions with
an attached format specification.

<f'> starts the f-literal, string contents follow. the only difference to
other strings is
<{> which starts expression tokenization. once the expression ends, an
optional
format specification follows, then a
<}> to switch back to string tokenization
this repeats until (in string parsing mode) a
<'> is encountered which ends the f-literal.
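
The interleaving described above can be approximated with the standard library. This is only a sketch: string.Formatter.parse() understands str.format() replacement fields, not arbitrary expressions, and the helper name split_template is hypothetical:

```python
from string import Formatter

def split_template(template):
    """Split a format string into alternating text and field parts."""
    parts = []
    for literal, field, spec, _conversion in Formatter().parse(template):
        if literal:
            parts.append(("text", literal))       # string-mode content
        if field is not None:
            parts.append(("expr", field, spec))   # content between { and }
    return parts

parts = split_template("x = {x:>4}, done")
print(parts)
```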

> You also make it harder to work with Unicode-based variable names (or at
> least explain it). If you have Unicode in a variable name but you can't use
> \N{} in the string to help express it you then have to say "normal Unicode
> support in the string applies everywhere *but* in the string interpolation
> part".
>

i think you’re just proving my point that the way f-literals work now is
confusing.

the embedded expressions are just normal python. the embedded strings just
normal strings. you can simply switch between both using <{> and
<[format]}>.

unicode in variable names works exactly the same as in all other python
code because it is regular python code.

> Or another reason is you can explain f-strings as "basically
> str.format_map(**locals(), **globals()), but without having to make the
> actual method call" (and worrying about clashing keys but I couldn't think
> of a way of using dict.update() in a single line). But with your desired
> change it kills this explanation by saying f-strings aren't like this but
> some magical string that does all of this stuff before normal string
> normalization occurs.
>

no, it’s simply the expression parts (that for normal formatting are inside
of the braces of  .format(...)) are *interleaved* in between string parts.
they’re not part of the string. just regular plain python code.

Cheers, and i really hope i’ve made a strong case,
philipp

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Random832
On Thu, Aug 18, 2016, at 11:29, Chris Angelico wrote:
> > >>> glob.glob('test*')
> > ['test\uab00.txt']
> > >>> glob.glob(b'test*')
> > [b'test?.txt']
> >
> > The Unicode character in the second call to glob is missing information. 
> 
> Apologies if this is just noise, but I'm a little confused by this.
> The second call to glob doesn't have any Unicode characters at all,
> the way I see it - it's all bytes. Am I completely misunderstanding
> this?

The unicode character is in the actual name of the actual file being
matched. That the byte string returned by glob fails to represent that
character in any encoding is the problem. Glob results don't exist in a
vacuum, they're supposed to represent, and be usable to access, files
that actually exist on the real filesystem.
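
One way to see why such a result cannot identify a file: distinct names collapse to the same bytes. This is a sketch, again assuming cp1252 with errors="replace" as a stand-in for the active code page conversion; the second file name is made up for the comparison:

```python
# Two different file names, neither representable in cp1252.
names = ["test\uab00.txt", "test\uab01.txt"]

as_bytes = [n.encode("cp1252", errors="replace") for n in names]

# Both collapse to the same byte string, so the bytes name neither file.
print(as_bytes, as_bytes[0] == as_bytes[1])
```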


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

On 18Aug2016 0829, Chris Angelico wrote:
> The second call to glob doesn't have any Unicode characters at all,
> the way I see it - it's all bytes. Am I completely misunderstanding
> this?


You're not the only one - I think this has been the most common 
misunderstanding.


On Windows, the paths as stored in the filesystem are actually all text 
- more precisely, utf-16-le encoded bytes, represented as 16-bit 
character strings.


Converting to an 8-bit character representation only exists for 
compatibility with code written for other platforms (either Linux, or 
much older versions of Windows). The operating system has one way to do 
the conversion to bytes, which Python currently uses, but since we 
control that transformation I'm proposing an alternative conversion that 
is more reliable than compatible (with Windows 3.1... shouldn't affect 
compatibility with code that properly handles multibyte encodings, which 
should include anything developed for Linux in the last decade or two).


Does that help? I tried to keep the explanation short and focused :)

Cheers,
Steve


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower  wrote:
> On 18Aug2016 0829, Chris Angelico wrote:
>>
>> The second call to glob doesn't have any Unicode characters at all,
>> the way I see it - it's all bytes. Am I completely misunderstanding
>> this?
>
>
> You're not the only one - I think this has been the most common
> misunderstanding.
>
> On Windows, the paths as stored in the filesystem are actually all text -
> more precisely, utf-16-le encoded bytes, represented as 16-bit character
> strings.
>
> Converting to an 8-bit character representation only exists for
> compatibility with code written for other platforms (either Linux, or much
> older versions of Windows). The operating system has one way to do the
> conversion to bytes, which Python currently uses, but since we control that
> transformation I'm proposing an alternative conversion that is more reliable
> than compatible (with Windows 3.1... shouldn't affect compatibility with
> code that properly handles multibyte encodings, which should include
> anything developed for Linux in the last decade or two).
>
> Does that help? I tried to keep the explanation short and focused :)

Ah, I think I see what you mean. There's a slight ambiguity in the
word "missing" here.

1) The Unicode character in the result lacks some of the information
it should have

2) The Unicode character in the file name is information that has now been lost.

My reading was the first, but AIUI you actually meant the second. If
so, I'd be inclined to reword it very slightly, eg:

"The Unicode character in the second call to glob is now lost information."

Is that a correct interpretation?

ChrisA


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

On 18Aug2016 0900, Chris Angelico wrote:
> On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower  wrote:
>> On 18Aug2016 0829, Chris Angelico wrote:
>>> The second call to glob doesn't have any Unicode characters at all,
>>> the way I see it - it's all bytes. Am I completely misunderstanding
>>> this?
>>
>> You're not the only one - I think this has been the most common
>> misunderstanding.
>>
>> On Windows, the paths as stored in the filesystem are actually all text -
>> more precisely, utf-16-le encoded bytes, represented as 16-bit character
>> strings.
>>
>> Converting to an 8-bit character representation only exists for
>> compatibility with code written for other platforms (either Linux, or much
>> older versions of Windows). The operating system has one way to do the
>> conversion to bytes, which Python currently uses, but since we control that
>> transformation I'm proposing an alternative conversion that is more reliable
>> than compatible (with Windows 3.1... shouldn't affect compatibility with
>> code that properly handles multibyte encodings, which should include
>> anything developed for Linux in the last decade or two).
>>
>> Does that help? I tried to keep the explanation short and focused :)
>
> Ah, I think I see what you mean. There's a slight ambiguity in the
> word "missing" here.
>
> 1) The Unicode character in the result lacks some of the information
> it should have
>
> 2) The Unicode character in the file name is information that has now been lost.
>
> My reading was the first, but AIUI you actually meant the second. If
> so, I'd be inclined to reword it very slightly, eg:
>
> "The Unicode character in the second call to glob is now lost information."
>
> Is that a correct interpretation?


I think so, though I find the wording a little awkward (and on 
rereading, my original wording was pretty bad). How about:


"The second call to glob has replaced the Unicode character with '?', 
which means the actual filename cannot be recovered and the path is no 
longer valid."


Cheers,
Steve



Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 1:05 AM, Philipp A.  wrote:
> the embedded expressions are just normal python. the embedded strings just
> normal strings. you can simply switch between both using <{> and
> <[format]}>.
>

The trouble with that way of thinking is that, to a human, the braces
contain something. They don't "uncontain" it. Those braced expressions
are still part of a string; they just have this bit of magic that gets
them evaluated. Consider this:

>>> "This is a number: {:0\u07c4}".format(13)
'This is a number: 0013'

Format codes are just text, so I should be able to use Unicode
escapes. Okay. Now let's make that an F-string.

>>> f"This is a number: {13:0\u07c4}"
'This is a number: 0013'

Format codes are still just text. So you'd have to say that the rules
of text stop at an unbracketed colon, which is a pretty complicated
rule to follow. The only difference between .format and f-strings is
that the bit before the colon is the actual expression, rather than a
placeholder that drags the value in from the format arguments. In
human terms, that's not all that significant.
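
A small sketch of that equivalence, using a plain ASCII width so it does not depend on how a given Python version treats escapes inside f-string format specs. The variable names are illustrative only:

```python
value = 13

# The format code after the colon is the same text in both spellings.
via_format = "This is a number: {:04}".format(value)
via_fstring = f"This is a number: {value:04}"

print(via_format, via_fstring, via_format == via_fstring)
```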

IMO it doesn't matter that much either way - people will have to
figure stuff out anyway. I like the idea that everything in the quotes
is a string (and then parts of it get magically evaluated), but could
live with there being some non-stringy parts in it. My suspicion is
that what's easiest to code (ie easiest for the CPython parser) is
also going to be easiest for all or most other tools (eg syntax
highlighters).

ChrisA


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 2:07 AM, Steve Dower  wrote:
> I think so, though I find the wording a little awkward (and on rereading, my
> original wording was pretty bad). How about:
>
> "The second call to glob has replaced the Unicode character with '?', which
> means the actual filename cannot be recovered and the path is no longer
> valid."

I like that. Very clear and precise, without losing too much concision.

Thank you for explaining, as Cameron Baum often says.

ChrisA


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Random832
On Thu, Aug 18, 2016, at 12:17, Chris Angelico wrote:
> The trouble with that way of thinking is that, to a human, the braces
> contain something. They don't "uncontain" it. Those braced expressions
> are still part of a string; they just have this bit of magic that gets
> them evaluated. Consider this:

There's a precedent. "$()" works this way in bash - call it a recursive
parser context or whatever you like, but the point is that "$(command
"argument with spaces")" works fine, and humans don't seem to have any
trouble with it. Really it all comes down to what exactly the "bit of
magic" is and how magical it is.

> IMO it doesn't matter that much either way - people will have to
> figure stuff out anyway. I like the idea that everything in the quotes
> is a string (and then parts of it get magically evaluated), but could
> live with there being some non-stringy parts in it. My suspicion is
> that what's easiest to code (ie easiest for the CPython parser) is
> also going to be easiest for all or most other tools (eg syntax
> highlighters).

Except the parser has to actually parse string literals into what string
they represent (so it can apply a further transformation to the result).
Syntax highlighters generally don't.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread eryk sun
On Thu, Aug 18, 2016 at 4:07 PM, Steve Dower  wrote:
> On 18Aug2016 0900, Chris Angelico wrote:
>>
>> On Fri, Aug 19, 2016 at 1:54 AM, Steve Dower 
>> wrote:
>>>
>>> On 18Aug2016 0829, Chris Angelico wrote:


>>>> The second call to glob doesn't have any Unicode characters at all,
>>>> the way I see it - it's all bytes. Am I completely misunderstanding
>>>> this?
>>>
>>>
>>>
>>> You're not the only one - I think this has been the most common
>>> misunderstanding.
>>>
>>> On Windows, the paths as stored in the filesystem are actually all text -
>>> more precisely, utf-16-le encoded bytes, represented as 16-bit character
>>> strings.
>>>
>>> Converting to an 8-bit character representation only exists for
>>> compatibility with code written for other platforms (either Linux, or
>>> much
>>> older versions of Windows). The operating system has one way to do the
>>> conversion to bytes, which Python currently uses, but since we control
>>> that
>>> transformation I'm proposing an alternative conversion that is more
>>> reliable
>>> than compatible (with Windows 3.1... shouldn't affect compatibility with
>>> code that properly handles multibyte encodings, which should include
>>> anything developed for Linux in the last decade or two).
>>>
>>> Does that help? I tried to keep the explanation short and focused :)
>>
>>
>> Ah, I think I see what you mean. There's a slight ambiguity in the
>> word "missing" here.
>>
>> 1) The Unicode character in the result lacks some of the information
>> it should have
>>
>> 2) The Unicode character in the file name is information that has now been
>> lost.
>>
>> My reading was the first, but AIUI you actually meant the second. If
>> so, I'd be inclined to reword it very slightly, eg:
>>
>> "The Unicode character in the second call to glob is now lost
>> information."
>>
>> Is that a correct interpretation?
>
>
> I think so, though I find the wording a little awkward (and on rereading, my
> original wording was pretty bad). How about:
>
> "The second call to glob has replaced the Unicode character with '?', which
> means the actual filename cannot be recovered and the path is no longer
> valid."

They're all just characters in the context of Unicode, so I think it's
clearest to use the character code, e.g.:

The second call to glob has replaced the U+AB00 character with '?',
which means ...


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 2:39 AM, eryk sun  wrote:
> They're all just characters in the context of Unicode, so I think it's
> clearest to use the character code, e.g.:
>
> The second call to glob has replaced the U+AB00 character with '?',
> which means ...

Technically the character has been replaced with the byte value 63,
although at this point, we're getting into dangerous areas of bytes
being interpreted in one way or another.
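
For illustration, a minimal sketch of that replacement. It assumes cp1252
as a stand-in for the active code page and uses Python's own "replace"
error handler rather than the Windows conversion itself, so treat it as
an approximation of what glob does with a bytes path:

```python
# cp1252 is only an assumed example code page; the real one varies by machine.
name = "\uab00.txt"

# Characters the code page cannot represent are replaced, not preserved.
encoded = name.encode("cp1252", errors="replace")
roundtrip = encoded.decode("cp1252")

print(encoded[0])          # 63, the byte value of "?"
print(roundtrip == name)   # False: the original character is gone
```

Once the name has been through this conversion there is nothing left in
the bytes that identifies the original character.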

ChrisA


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Steve Dower
I'm generally inclined to agree, especially as someone who is very 
likely to be implementing syntax highlighting and completion support 
within f-literals.


I stepped out of the original discussion near the start as it looked 
like we were going to end up with interleaved strings and normal 
expressions, but if that's not the case then it is going to make it very 
difficult to provide a nice coding experience for them.


On 18Aug2016 0805, Philipp A. wrote:

My proposal is to redo their grammar:
They shouldn’t be parsed as strings and post-processed, but be their own
thing. This also opens the door to potentially extending them with
something like JavaScript’s tagged templates.

Without the limitations of the string tokenization code/rules, only the
string parts would have escape sequences, and the expression parts would
be regular python code (“holes” in the literal).


This is where I thought we'd end up - the '{' character (unless escaped 
by, e.g. \N, which addresses a concern below) would terminate the string 
literal and start an expression, which may be followed by a ':' and a 
format code literal. The '}' character would open the next string 
literal, and this continues until the closing quote.



They are still strings, there is just post-processing on the string
itself to do the interpolation.


Sounds hacky to me. I’d rather see a proper parser for them


I believe the proper parser is already used, but the issue is that 
escapes have already been dealt with. Of course, it shouldn't be too 
difficult for the tokenizer to recognize {} quoted expressions within an 
f-literal and not modify escapes. There are multiple ways to handle this.



Or another reason is you can explain f-strings as "basically
str.format_map(**locals(), **globals()), but without having to make
the actual method call" (and worrying about clashing keys but I
couldn't think of a way of using dict.update() in a single line).
But with your desired change it kills this explanation by saying
f-strings aren't like this but some magical string that does all of
this stuff before normal string normalization occurs.


no, it’s simply the expression parts (that for normal formatting are
inside of the braces of  .format(...)) are *interleaved* in between
string parts. they’re not part of the string. just regular plain python
code.


Agreed. The .format_map() analogy breaks down very quickly when you 
consider f-literals like:


>>> f'a { \'b\' }'
'a b'

If the contents of the braces were simply keys in the namespace then we 
wouldn't be able to put string literals in there. But because it is an 
arbitrary expression, if we want to put string literals in the f-literal 
(bearing in mind that we may be writing something more like 
f'{x.partition(\'-\')[0]}'), the escaping rules become very messy very 
quickly.
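
A small sketch of where the analogy stops holding, using a hypothetical
value for x: str.format_map() treats the text between the braces as a
lookup key, while an f-string evaluates it as an expression.

```python
x = "left-right"   # hypothetical sample value

# A plain name works as a key in the mapping.
via_format_map = "{x}".format_map({"x": x})

# An expression is looked up as the literal key 'x * 2', which is missing.
try:
    "{x * 2}".format_map({"x": x})
    expression_as_key_ok = True
except KeyError:
    expression_as_key_ok = False

# The f-string evaluates the same text as ordinary Python code.
via_fstring = f"{x * 2}"
first_part = f"{x.partition('-')[0]}"
```

The last line uses different outer and inner quotes, so it needs no
escaping in any Python version that has f-strings.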


I don't think f'{x.partition('-')[0]}' is any less readable as a result 
of the reused quotes, and it will certainly be easier for highlighters 
to handle (assuming they're doing anything more complicated than simply 
displaying the entire expression in a different colour).


So I too would like to see escapes made unnecessary within the 
expression part of a f-literal. Possibly if we put together a simple 
enough patch for the tokenizer it will be accepted?


Cheers,
Steve



Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Steve Dower

On 18Aug2016 0950, Steve Dower wrote:

I'm generally inclined to agree, especially as someone who is very
likely to be implementing syntax highlighting and completion support
within f-literals.


I also really don't like the subject line. "Do not require string 
escapes within expressions in f-literals" more accurately represents the 
topic and the suggestion.


"Let's make  impossible" is just asking for a highly 
emotionally-charged discussion, which is best avoided in basically all 
circumstances, especially for less-frequent contributors to a community, 
and extra-especially when you haven't met most of the other contributors 
in person.


Cheers,
Steve


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread MRAB

On 2016-08-16 16:56, Steve Dower wrote:

I just want to clearly address two points, since I feel like multiple
posts have been unclear on them.

1. The bytes API was deprecated in 3.3 and it is listed in
https://docs.python.org/3/whatsnew/3.3.html. Lack of mention in the docs
is an unfortunate oversight, but it was certainly announced and the
warning has been there for three released versions. We can freely change
or remove the support now, IMHO.

2. Windows file system encoding is *always* UTF-16. There's no "assuming
mbcs" or "assuming ACP" or "assuming UTF-8" or "asking the OS what
encoding it is". We know exactly what the encoding is on every supported
version of Windows. UTF-16.

This discussion is for the developers who insist on using bytes for
paths within Python, and the question is, "how do we best represent
UTF-16 encoded paths in bytes?"

The choices are:

* don't represent them at all (remove bytes API)
* convert and drop characters not in the (legacy) active code page
* convert and fail on characters not in the (legacy) active code page
* convert and fail on invalid surrogate pairs
* represent them as UTF-16-LE in bytes (with embedded '\0' everywhere)

Currently we have the second option.

My preference is the fourth option, as it will cause the least breakage
of existing code and enable the most code to just work in the
presence of non-ACP characters.

The fifth option is the best for round-tripping within Windows APIs.
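
As a rough sketch of how the conversion choices differ, here is what
Python's own codecs do with a hypothetical file name. cp1252 is assumed
as a stand-in for the (legacy) active code page, and the error handlers
shown are Python's, not the Windows API's:

```python
# Hypothetical file name with one character inside cp1252 and one outside it.
name = "caf\u00e9_\u4e2d.txt"

# Option 2: convert and drop characters not in the code page.
dropped = name.encode("cp1252", errors="replace")

# Option 3: convert and fail on characters not in the code page.
try:
    name.encode("cp1252")
    strict_acp_ok = True
except UnicodeEncodeError:
    strict_acp_ok = False

# Option 4: UTF-8 accepts every character and only fails on lone surrogates.
as_utf8 = name.encode("utf-8")
try:
    "\udc00".encode("utf-8")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False

# Option 5: UTF-16-LE in bytes, with a NUL byte after every ASCII character.
as_utf16 = name.encode("utf-16-le")
```

Only the UTF-8 and UTF-16-LE results can be decoded back to the original
name; the code-page result has already lost a character.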

The only code that will break with any change is code that was using an
already deprecated API. Code that correctly uses str to represent
"encoding agnostic text" is unaffected.

If you see an alternative choice to those listed above, feel free to
contribute it. Otherwise, can we focus the discussion on these (or any
new) choices?


Could we still call it 'mbcs', but use 'surrogateescape'?



Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Random832
On Thu, Aug 18, 2016, at 13:18, MRAB wrote:
> > If you see an alternative choice to those listed above, feel free to
> > contribute it. Otherwise, can we focus the discussion on these (or any
> > new) choices?
> >
> Could we still call it 'mbcs', but use 'surrogateescape'?

Er, this discussion is about converting *from* unicode (including
arbitrary but usually valid characters) *to* bytes.


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

On 18Aug2016 1018, MRAB wrote:

Could we still call it 'mbcs', but use 'surrogateescape'?


surrogateescape is used for escaping undecodable values when you want to 
represent arbitrary bytes in Unicode.


It's the wrong direction for this situation - we are starting with valid 
Unicode and encoding to bytes (for the convenience of the Python 
developer who wants to use bytes everywhere). Bytes correctly encoded 
under mbcs can always be correctly decoded to Unicode ('correctly' 
implies that they were encoded with the same configuration as the 
machine doing the decoding - mbcs changes from machine to machine).


So there's nothing to escape from mbcs->Unicode, and we don't control 
the definition of Unicode->mbcs well enough to be able to invent an 
escaping scheme while remaining compatible with the operating system's 
interpretation of mbcs (CP_ACP).
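
A short sketch of the direction surrogateescape is designed for, using
UTF-8 and cp1252 as assumed example codecs (mbcs itself is only
available on Windows):

```python
# bytes -> str: an undecodable byte is smuggled through as a lone surrogate.
raw = b"caf\xe9.txt"                      # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
back = text.encode("utf-8", errors="surrogateescape")

# str -> bytes: a real character outside the target encoding is not helped,
# because the handler only knows how to restore escaped bytes.
try:
    "\u4e2d".encode("cp1252", errors="surrogateescape")
    real_character_ok = True
except UnicodeEncodeError:
    real_character_ok = False
```

The round trip succeeds for bytes that started life as bytes, and the
encode still fails for a valid character with no code-page equivalent.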


(One way to look at the utf-8 proposal is saying "we will escape 
arbitrary Unicode characters within Python bytes strings and decode them 
at the Python-OS boundary". The main concern about this is the backwards 
compatibility issues around people taking arbitrarily encoded bytes and 
sharing them without including the encoding. Previously that would work 
on a subset of machines without Unicode support, but this change would 
only make it work within Python 3.6 and later. Hence the discussion 
about whether this whole thing was deprecated already or not.)


Cheers,
Steve


Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Terry Reedy

On 8/18/2016 11:25 AM, Steve Dower wrote:


In this case, we would announce in 3.6 that using bytes as paths on
Windows is no longer deprecated,


My understanding is that the first 2 fixes refine the deprecation rather
than reversing it.  And #3 simply applies it.



--
Terry Jan Reedy



Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Brett Cannon
On Thu, 18 Aug 2016 at 08:32 Philipp A.  wrote:

> [SNIP]
> Brett Cannon wrote on Wed, 17 Aug 2016 at 20:28:
>
>> They are still strings, there is just post-processing on the string
>> itself to do the interpolation.
>>
>
> Sounds hacky to me. I’d rather see a proper parser for them, which of
> course would make my vision easy.
>

You say "hacky", I say "pragmatic". And Python's code base is actually
rather top-notch and so it isn't bad code, but simply a design decision you
are disagreeing with.

Please remember that you're essentially asking people to spend their
personal time to remove working code and re-implement something that you
have not volunteered to actually code up yourself. Don't forget that none
of us get paid to work on Python full-time; a lucky couple of us get to
spend one day a week on Python and we all take time away from our family to
work on things when we can. Insulting someone's hard work that they did for
free to try and improve Python is not going to motivate people to want to
help out with this idea. And considering that Eric Smith, who originally
implemented all of this, is possibly the person in the best position to
implement your idea, having his work called "hacky" by you is not really a
great motivator for him.

IOW you really need to be mindful of the tone of your emails (as does
anyone else who ever asks for something to change while not being willing
to put in the time and effort to actually produce the code to facilitate
the change). You have now had both Steve and me point out your tone and so
you're quickly approaching a threshold where people will stop pointing this
out and simply ignore your emails, so please be mindful of how you phrase
things.

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Steve Dower

On 18Aug2016 1036, Terry Reedy wrote:

On 8/18/2016 11:25 AM, Steve Dower wrote:


In this case, we would announce in 3.6 that using bytes as paths on
Windows is no longer deprecated,


My understanding is that the first 2 fixes refine the deprecation rather
than reversing it.  And #3 simply applies it.


#3 certainly just applies the deprecation.

As for the first two, I don't see any reason to deprecate the 
functionality once the issues are resolved. If using utf-8 encoded bytes 
is going to work fine in all the same cases as using str, why discourage it?




Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread eryk sun
On Thu, Aug 18, 2016 at 4:44 PM, Chris Angelico  wrote:
> On Fri, Aug 19, 2016 at 2:39 AM, eryk sun  wrote:
>> They're all just characters in the context of Unicode, so I think it's
>> clearest to use the character code, e.g.:
>>
>> The second call to glob has replaced the U+AB00 character with '?',
>> which means ...
>
> Technically the character has been replaced with the byte value 63,
> although at this point, we're getting into dangerous areas of bytes
> being interpreted in one way or another.

Windows NLS codepages are all supersets of ASCII (no EBCDIC to worry
about), and the default character when encoding is always b"?". The
default Unicode character when decoding is also almost always "?",
except Japanese uses U+30FB.


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Terry Reedy

On 8/18/2016 12:50 PM, Steve Dower wrote:
> I'm generally inclined to agree, especially as someone who is very
> likely to be implementing syntax highlighting and completion support
> within f-literals.

I consider these separate issues.  IDLE currently provides filename 
completion support within strings while still highlighting the string 
one color.  Even if it were enhanced to do name completion within an 
f-string, I am not sure I would want to put a mixture of colors within 
the string rather than leave it all one color.


> I stepped out of the original discussion near the start as it looked
> like we were going to end up with interleaved strings and normal
> expressions, but if that's not the case then it is going to make it
> very difficult to provide a nice coding experience for them.

This is the crux of this thread.  Is an f-string a single string that 
contains magically handled code, or interleaved strings using { and } as 
closing and opening quotes (which is backwards from their normal 
function of being opener and closer) and expressions?  The latter view 
makes the grammar context sensitive, I believe, as } could only open a 
string if there is a previous f-tagged string an indefinite number of 
alternations back.


It is not uncommon to write strings that consist completely of code.
  "for i in iterable: a.append(f(i))"
to be written out or eval()ed or exec()ed.
Does your environment have a mode to provide syntax highlighting and 
completion support for such things?


What I think would be more useful would be the ability to syntax check 
such code strings while editing.  A python-coded editor could just pass 
the extracted string to compile().



I don't think f'{x.partition('-')[0]}' is any less readable as a result
of the reused quotes,


I find it hard to not read f'{x.partition(' + ')[0]}' as string 
concatenation.



and it will certainly be easier for highlighters
to handle (assuming they're doing anything more complicated than simply
displaying the entire expression in a different colour).


Without the escapes, existing f-unaware highlighters like IDLE's will be 
broken in that they will highlight the single f-string as two strings 
with differently highlighted content in the middle.  For 
f'{x.partition('if')[0]}', the 'if' is and will be erroneously 
highlighted as a keyword.  I consider this breakage unacceptable.


--
Terry Jan Reedy



Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Random832
On Thu, Aug 18, 2016, at 15:15, Terry Reedy wrote:
> This is the crux of this thread.  Is an f-string a single string that 
> contains magically handled code, or interleaved strings using { and } as 
> closing and opening quotes (which is backwards from their normal 
> function of being opener and closer)

I'd rather conceptualize it as a sequence of two* kinds of thing:
literal character sequences [as sequences of characters other than {]
and expressions [started with {, and ended with a } that is not
otherwise part of the expression] rather than treating { as a closing
quote.

In particular, treating } as an opening quote doesn't really work, since
expressions can contain both strings (which may contain an unbalanced })
and dictionary/set literals (which contain balanced }'s which are not in
quotes) - what ends the expression is a } at the top level.

*or three, considering that escapes are used in the non-expression
parts.
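
A minimal sketch of that "top level" rule, written as a hypothetical
helper named find_expression_end. It tracks only single-line string
literals and bracket depth; triple-quoted strings, format specs and
conversions are deliberately ignored, so this is an illustration and not
the tokenizer's actual algorithm:

```python
def find_expression_end(text, start):
    """Return the index of the top-level '}' closing the '{' at text[start]."""
    depth = 0       # nesting level of (), [] and {} inside the expression
    quote = None    # the quote character of the string we are inside, if any
    i = start + 1
    while i < len(text):
        ch = text[i]
        if quote:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == quote:
                quote = None        # string literal closed
        elif ch in "'\"":
            quote = ch              # string literal opened
        elif ch in "([{":
            depth += 1
        elif ch in ")]}":
            if ch == "}" and depth == 0:
                return i            # a '}' outside strings and brackets
            depth -= 1
        i += 1
    raise ValueError("unterminated expression")


body = "{d['}'] + len({1, 2})} tail"
end = find_expression_end(body, 0)
expression = body[1:end]
```

Neither the '}' inside the string literal nor the one closing the set
literal ends the expression; only the final one does.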

> and expressions?  The latter view 
> makes the grammar context sensitive, I believe, as } could only open a 
> string if there is a previous f-tagged string an indefinite number of 
> alternations back.

} at the top level is otherwise a syntax error. I don't know enough
about the theoretical constructs involved to know if this makes it
formally 'context sensitive' or not - I don't know that it's any more
context sensitive than ) being valid if there is a matching (. Honestly,
I'd be more worried about : than }.


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Steve Dower

On 18Aug2016 1215, Terry Reedy wrote:

On 8/18/2016 12:50 PM, Steve Dower wrote:

I don't think f'{x.partition('-')[0]}' is any less readable as a result
of the reused quotes,


I find it hard to not read f'{x.partition(' + ')[0]}' as string
concatenation.


That's a fair counter-example. Though f'{x.partition(\' + \')[0]}' still 
reads like string concatenation to me at first glance. YMMV.



and it will certainly be easier for highlighters
to handle (assuming they're doing anything more complicated than simply
displaying the entire expression in a different colour).


Without the escapes, existing f-unaware highlighters like IDLE's will be
broken in that they will highlight the single f-string as two strings
with differently highlighted content in the middle.  For
f'{x.partition('if')[0]}', the 'if' is and will be erroneously
highlighted as a keyword.  I consider this breakage unacceptable.


Won't it be broken anyway because of the new prefix?

I'm sure there's a fairly straightforward way for a regex to say that a 
closing quote must not be preceded immediately by a backslash or by an 
open brace at all without a closing brace in between.


Not having escapes within the expression makes it harder for everyone 
except the Python developer, in my opinion, and the rest of us ought to 
go out of our way for them.


Cheers,
Steve



Re: [Python-ideas] Allow manual creation of DirEntry objects

2016-08-18 Thread Brendan Moloney
Thanks, opened an issue here: http://bugs.python.org/issue27796

-Brendan

From: gvanros...@gmail.com [gvanros...@gmail.com] on behalf of Guido van Rossum 
[gu...@python.org]
Sent: Wednesday, August 17, 2016 7:20 AM
To: Nick Coghlan; Brendan Moloney
Cc: Victor Stinner; python-ideas@python.org
Subject: Re: [Python-ideas] Allow manual creation of DirEntry objects

Brendan,

The conclusion is that you should just file a bug asking for a working 
constructor -- or upload a patch if you want to.

--Guido

On Wed, Aug 17, 2016 at 12:18 AM, Nick Coghlan <ncogh...@gmail.com> wrote:
On 17 August 2016 at 09:56, Victor Stinner <victor.stin...@gmail.com> wrote:
> 2016-08-17 1:50 GMT+02:00 Guido van Rossum <gu...@python.org>:
>> We could expose the class with a
>> constructor that always fails (the C code could construct instances through
>> a backdoor).
>
> Oh, in fact you cannot create an instance of os.DirEntry, it has no
> (Python) constructor:
>
> $ ./python
> Python 3.6.0a4+ (default:e615718a6455+, Aug 17 2016, 00:12:17)
> >>> import os
> >>> os.DirEntry(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: cannot create 'posix.DirEntry' instances
>
> Only os.scandir() can produce such objects.
>
> The question is still if it makes sense to allow to create DirEntry
> objects in Python :-)

I think it does, as it isn't really any different from someone calling
the stat() method on a DirEntry instance created by os.scandir(). It
also prevents folks attempting things like:

    def slow_constructor(dirname, entryname):
        for entry in os.scandir(dirname):
            if entry.name == entryname:
                entry.stat()
                return entry

Allowing DirEntry construction from Python further gives us a
straightforward answer to the "stat caching" question: "just use
os.DirEntry instances and call stat() to make the snapshot"
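
A small sketch of the snapshot behaviour being described, using a
temporary directory and a hypothetical file name. It relies only on
os.scandir() as it exists today, not on the proposed constructor:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")   # hypothetical file name
    with open(path, "w") as fh:
        fh.write("abc")

    # Obtain a DirEntry the only way currently possible: via os.scandir().
    with os.scandir(tmp) as entries:
        entry = next(e for e in entries if e.name == "data.txt")

    snapshot_size = entry.stat().st_size   # the first call fills the cache

    with open(path, "a") as fh:
        fh.write("defgh")                  # the file grows after the snapshot

    cached_size = entry.stat().st_size     # still the cached result
    fresh_size = os.stat(path).st_size     # a new system call sees the change
```

The DirEntry keeps reporting the size from the snapshot, while os.stat()
reflects the current state of the file.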

If folks ask why os.DirEntry caches results when pathlib.Path doesn't,
we have the answer that cache invalidation is a hard problem, and
hence we consider it useful in the lower level interface that is
optimised for speed, but problematic in the higher level one that is
more focused on cross-platform correctness of filesystem interactions.

I don't know whether it would make sense to allow a pre-existing stat
result to be passed to DirEntry, but it does seem like it might be
useful for adapting existing stat-based backend APIs to a more user
friendly DirEntry based front end API.

Cheers,
Nick.

--
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia



--
--Guido van Rossum (python.org/~guido)

Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Chris Barker
On Thu, Aug 18, 2016 at 6:23 AM, Steve Dower  wrote:

> "You consistently ignore Makefiles, .ini, etc."
>
> Do people really do open('makefile', 'rb'), extract filenames and try to
> use them without ever decoding the file contents?
>

I'm sure they do :-(

But this has always confused me - back in the python2 "good old days" text
and binary mode were exactly the same on *nix -- so folks sometimes fell
into the trap of opening binary files as text on *nix, and then having it
fail on Windows, but I can't imagine why anyone would have done the opposite.

So in porting to py3, they would have had to *add* that 'b' (and a bunch of
b'filename') to keep the good old "bytes is text" interface.

Why would anyone do that?

Honestly confused.

> I've honestly never seen that, and it certainly looks like the sort of
> thing Python 3 was intended to discourage.
>

exactly -- we really don't need to support folks reading text files in
binary mode and not considering encoding...

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov

Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Greg Ewing

Chris Angelico wrote:


f"This is a number: {13:0\u07c4}"


If I understand correctly, the proposal intends to make
it easier for a syntax highlighter to treat

   f"This is a number: {foo[42]:0\u07c4}"

as

   f"This is a number: {foo[42]:0\u07c4}"
   ---------------------^^^^^^^----------
   highlight as string  highlight  highlight as string
                           as
                          code

I'm not sure an RE-based syntax highlighter would
have any easier a time with that, because for the
second part it would need to recognise ':' as starting
a string, but only if it followed some stuff that was
preceded by the beginning of an f-string.

I'm not very familiar with syntax highlighters, so I
don't know if they're typically smart enough to cope
with things like that.

--
Greg


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Steven D'Aprano
On Fri, Aug 19, 2016 at 02:17:29AM +1000, Chris Angelico wrote:

> Format codes are just text, 

I really think that is wrong. They're more like executable code.

https://www.python.org/dev/peps/pep-0498/#expression-evaluation

"Just text" implies it is data:

result = "function(arg)"

like the string on the right hand side of the = is data. You wouldn't 
say that a function call was data (although it may *return* data):

result = function(arg)

or that it was "just text", and you shouldn't say the same about:

result = f"{function(arg)}"

either since they are functionally equivalent. Format codes are "just
text" only in the sense that source code is "just text". It's technically
correct and horribly misleading.


> so I should be able to use Unicode
> escapes. Okay. Now let's make that an F-string.
> 
> >>> f"This is a number: {13:0\u07c4}"
> 'This is a number: 0013'

If your aim is to write obfuscated code, then, yes, you should be able 
to write something like that.

*wink*

I seem to recall that Java allows string escapes in ordinary 
expressions, so that instead of writing:

result = function(arg)

you could write:

result = \x66\x75\x6e\x63\x74\x69\x6f\x6e\x28\x61\x72\x67\x29

instead. We can't, and shouldn't, allow anything like this in Python 
code. Should we allow it inside f-strings?

result = f"{\x66\x75\x6e\x63\x74\x69\x6f\x6e\x28\x61\x72\x67\x29}"



-- 
Steve


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Steven D'Aprano
On Thu, Aug 18, 2016 at 12:26:26PM -0400, Random832 wrote:

> There's a precedent. "$()" works this way in bash - call it a recursive
> parser context or whatever you like, but the point is that "$(command
> "argument with spaces")" works fine, and humans don't seem to have any
> trouble with it.

This is the first time I've ever seen anyone claim that humans don't 
have any trouble with bash escaping and evaluation rules.


-- 
Steve


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Eric V. Smith

On 8/18/2016 3:15 PM, Terry Reedy wrote:

On 8/18/2016 12:50 PM, Steve Dower wrote:
I find it hard to not read f'{x.partition(' + ')[0]}' as string
concatenation.


and it will certainly be easier for highlighters
to handle (assuming they're doing anything more complicated than simply
displaying the entire expression in a different colour).


Without the escapes, existing f-unaware highlighters like IDLE's will be
broken in that they will highlight the single f-string as two strings
with differently highlighted content in the middle.  For
f'{x.partition('if')[0]}', the 'if' is and will be erroneously
highlighted as a keyword.  I consider this breakage unacceptable.


Right. Because all strings (regardless of prefixes) are first parsed as
strings, and then have their prefix "operator" applied, it's easy for a
parser to ignore any string prefix character.


So something that parses or scans a Python file and currently 
understands u, b, and r to be string prefixes, just needs to add f to 
the prefixes it uses, and it can now at least understand f-strings (and 
fr-strings). It doesn't need to implement a full-blown expression parser 
just to find out where the end of a f-string is.
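
As a rough sketch of such a scanner, here is a hypothetical regex for
single-quoted strings where supporting f-strings only means adding f
(and fr/rf) to the prefix alternatives. The pattern names are made up
for this example and do not come from any real tool:

```python
import re

# Hypothetical prefix pattern: b, br, rb, u, r, plus the new f, fr and rf.
PREFIX = r"(?:[bB][rR]?|[rR][bB]?|[uU]|[fF][rR]?|[rR][fF]?)?"

# A single-quoted, single-line string body with backslash escapes.
STRING = re.compile(PREFIX + r"'[^'\\\n]*(?:\\.[^'\\\n]*)*'")

src = r"label = f'{x.partition(\'-\')[0]}' + r'\d'"
found = [m.group() for m in STRING.finditer(src)]
```

With escaped quotes inside the expression, the f-string is found as one
string token without parsing the expression at all.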


Eric.




Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Chris Angelico
On Fri, Aug 19, 2016 at 10:18 AM, Steven D'Aprano  wrote:
> On Fri, Aug 19, 2016 at 02:17:29AM +1000, Chris Angelico wrote:
>
>> Format codes are just text,
>
> I really think that is wrong. They're more like executable code.
>
> https://www.python.org/dev/peps/pep-0498/#expression-evaluation
>
> "Just text" implies it is data:
>
> result = "function(arg)"
>
> like the string on the right hand side of the = is data. You wouldn't
> say that a function call was data (although it may *return* data):
>
> result = function(arg)
>
> or that it was "just text", and you shouldn't say the same about:
>
> result = f"{function(arg)}"
>
> either since they are functionally equivalent. Format codes are "just
> text" only in the sense that source code is "just text". It's technically
> correct and horribly misleading.
>

By "format code", I'm talking about the bit after the colon, which
isn't executable code, but is a directive that says how the result is
to be formatted. These have existed since str.format() was introduced,
and have always been text, not code.
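
A brief sketch of what "text, not code" means here. Spy is a
hypothetical class that simply reports the format spec it receives:

```python
class Spy:
    """Hypothetical helper that shows the format spec it is given."""

    def __format__(self, spec):
        return repr(spec)


width = 4

plain = format(13, "04")            # the spec is passed as an ordinary string
via_str_format = "{:04}".format(13)
via_fstring = f"{13:04}"
nested = f"{13:0{width}}"           # only the nested {width} is evaluated
seen = f"{Spy():>10}"               # the object receives the text '>10'
```

In every spelling the part after the colon reaches __format__ as a plain
str; it is interpreted by the object being formatted, not executed.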

ChrisA


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Terry Reedy

On 8/18/2016 3:30 PM, Steve Dower wrote:

On 18Aug2016 1215, Terry Reedy wrote:

On 8/18/2016 12:50 PM, Steve Dower wrote:

I don't think f'{x.partition('-')[0]}' is any less readable as a result
of the reused quotes,


Why are you reusing the single quote ', which needs the escaping that you
don't like, instead of any of at least 6 alternatives that do not need 
any escaping?


f'{x.partition("-")[0]}'
f'{x.partition("""-""")[0]}'
f"{x.partition('-')[0]}"
f'''{x.partition('-')[0]}'''
f"""{x.partition('-')[0]}"""
f"""{x.partition('''-''')[0]}"""

It seems to me that this is at least somewhat a strawman issue.

If you want to prohibit backslashed quote reuse in expressions, as in 
f'{x.partition(\'-\')[0]}', that is okay with me, as this is 
unnecessary* and arguably bad.  The third alternative above is better. 
What breaks colorizers, and what I therefore object to, is the 
innovation of adding magical escaping of ' or " without \.
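
For what it's worth, the six escape-free spellings listed above can be
checked with a quick sketch, assuming a hypothetical value for x:

```python
x = "left-right"   # hypothetical sample value

# Each spelling avoids reusing the outer quote inside the expression.
results = {
    f'{x.partition("-")[0]}',
    f'{x.partition("""-""")[0]}',
    f"{x.partition('-')[0]}",
    f'''{x.partition('-')[0]}''',
    f"""{x.partition('-')[0]}""",
    f"""{x.partition('''-''')[0]}""",
}
```

All six produce the same string, so the set collapses to one element.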


Or add a new style rule to PEP 8.

F-strings: avoid unnecessary escaping in the expression part of f-strings.
Good: f"{x.partition('-')[0]}"
Bad: f'{x.partition(\'-\')[0]}'

Then PEP-8 checkers will flag such usage.

*I am sure that there are possible complex expressions that would be 
prohibited by the rule that would be otherwise possible.  But they 
should be extremely rare and possibly not the best solution anyway.



I find it hard to not read f'{x.partition(' + ')[0]}' as string
concatenation.



That's a fair counter-example. Though f'{x.partition(\' + \')[0]}' still
reads like string concatenation to me at first glance. YMMV.


When the outer and inner quotes are no longer the same, the effect is 
greatly diminished if not eliminated.



and it will certainly be easier for highlighters
to handle (assuming they're doing anything more complicated than simply
displaying the entire expression in a different colour).


Without the escapes, existing f-unaware highlighters like IDLE's will be
broken in that they will highlight the single f-string as two strings
with differently highlighted content in the middle.  For
f'{x.partition('if')[0]}', the 'if' is and will be erroneously
highlighted as a keyword.  I consider this breakage unacceptable.


Won't it be broken anyway because of the new prefix?


No.  IDLE currently handles f-strings just fine other than not coloring 
the 'f'. This is a minor issue and easily fixed by adding '|f' and, if 
allowed, '|F' at the end of the current stringprefix re.



I'm sure there's a fairly straightforward way for a regex to say that a
closing quote must not be preceded immediately by a backslash or by an
open brace at all without a closing brace in between.


I do not know that this is possible.
Here is IDLE's current re for an unprefixed single quoted string.
   r"'[^'\\\n]*(\\.[^'\\\n]*)*'?"
The close quote is optional because it must match a string that is in 
the process of being typed and is not yet closed.  I consider providing 
a tested augmented re to be required for this proposal.
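
To make the breakage concrete, here is a small sketch that runs the re quoted above over three inputs. The helper string_spans and the sample inputs are invented for illustration, and compiling with re.DOTALL is an assumption.

```python
import re

# The pattern quoted above for an unprefixed single-quoted string.
SQSTRING = r"'[^'\\\n]*(\\.[^'\\\n]*)*'?"
# re.DOTALL is an assumption; it lets "\\." also cover backslash-newline.
sq_re = re.compile(SQSTRING, re.DOTALL)


def string_spans(source):
    """Return the substrings the pattern would treat as string literals."""
    return [m.group(0) for m in sq_re.finditer(source)]


unescaped = "f'{x.partition('if')[0]}'"      # quotes reused without backslashes
escaped = r"f'{x.partition(\'if\')[0]}'"     # quotes reused with backslashes
unclosed = "s = 'still typing"               # a string still being typed

for sample in (unescaped, escaped, unclosed):
    print(sample, "->", string_spans(sample))
```

With the unescaped form the pattern finds two separate strings and leaves the if between them outside any string, which is why it would be coloured as a keyword. The escaped form is found as one string, and the unclosed string still matches because the closing quote is optional.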


Even then, making the parsing out of strings in Python code for 
colorizing version dependent is a problem in itself for colorizers not 
tied to a particular x.y version.  Leaving prefixes aside, I can't 
remember string delimiter syntax changing since I learned it in 1.3.



Not having escapes within the expression makes it harder for everyone
except the Python developer, in my opinion, and the rest of us ought to
go out of our way for them.


I am not sure that this says what you mean.

--
Terry Jan Reedy



Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Terry Reedy

On 8/18/2016 8:27 PM, Eric V. Smith wrote:

On 8/18/2016 3:15 PM, Terry Reedy wrote:



Without the escapes, existing f-unaware highlighters like IDLE's will be
broken in that they will highlight the single f-string as two strings
with differently highlighted content in the middle.  For
f'{x.partition('if')[0]}', the 'if' is and will be erroneously
highlighted as a keyword.  I consider this breakage unacceptable.


Right. Because all strings (regardless of prefixes) are first parsed as
strings, and then have their prefix "operator" applied, it's easy for a
parser to ignore any string prefix character.

So something that parses or scans a Python file and currently
understands u, b, and r to be string prefixes, just needs to add f to
the prefixes it uses, and it can now at least understand f-strings (and
fr-strings). It doesn't need to implement a full-blown expression parser
just to find out where the end of a f-string is.


Indeed, IDLE has one prefix re, which has changed occasionally and which 
I need to change for 3.6, and 4 res for the 4 unprefixed strings, which 
have been the same, AFAIK, for decades.  It then prefixes all 4 string 
res with the prefix re and ors the results together to get the 
'string' re.
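
A sketch of that construction, not IDLE's actual source: the helper any_of and the prefix re are illustrative, the single-quoted pattern is the one quoted elsewhere in this thread, and the other three patterns are written by analogy and are assumptions.

```python
import re


def any_of(name, alternates):
    """Hypothetical helper: a named group matching any of the alternates."""
    return "(?P<%s>" % name + "|".join(alternates) + ")"


# Illustrative prefix re with f, fr and rf added; matched case-insensitively.
STRINGPREFIX = r"(?i:r|u|f|fr|rf|b|br|rb)?"

# One re per unprefixed string form. Closing delimiters are optional so a
# string that is still being typed matches too.
SQSTRING = r"'[^'\\\n]*(\\.[^'\\\n]*)*'?"
DQSTRING = r'"[^"\\\n]*(\\.[^"\\\n]*)*"?'
SQ3STRING = r"'''[^'\\]*((\\.|'(?!''))[^'\\]*)*(''')?"
DQ3STRING = r'"""[^"\\]*((\\.|"(?!""))[^"\\]*)*(""")?'

# Prefix each of the four and or the results together. Triple-quoted forms
# go first so that ''' is not read as an empty string followed by a quote.
STRING = any_of(
    "STRING",
    [STRINGPREFIX + s for s in (SQ3STRING, DQ3STRING, SQSTRING, DQSTRING)],
)
string_re = re.compile(STRING, re.DOTALL)

print(string_re.match("f'{x}' + y").group("STRING"))
```

Because the prefix is a separate piece, supporting f-strings in a scanner built this way only means touching STRINGPREFIX; the four delimiter patterns stay as they are.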


--
Terry Jan Reedy



Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Terry Reedy

On 8/18/2016 8:18 PM, Steven D'Aprano wrote:

On Fri, Aug 19, 2016 at 02:17:29AM +1000, Chris Angelico wrote:


Format codes are just text,


I really think that is wrong. They're more like executable code.

https://www.python.org/dev/peps/pep-0498/#expression-evaluation


I agree with you here.  I just note that the strings passed to exec, 
eval, and compile are also executable code strings (and nothing but!). 
But I don't remember a suggestion that *they* should be colored as 
anything other than a string.  However, this thread has suggested to me 
that perhaps there *should* be a way to syntax check such strings in the 
editor rather than waiting for the runtime call.
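
One possible shape for such a check, as a minimal sketch: parse the string with ast.parse, which builds a syntax tree without executing anything. The function name syntax_error_in and the filename label are made up, and how an editor would locate the string arguments in the first place is left open.

```python
import ast


def syntax_error_in(source, mode="exec"):
    """Return None if source parses in the given mode, else the SyntaxError.

    mode mirrors the modes of compile(): "exec" for statements and "eval"
    for a single expression. Nothing is executed.
    """
    try:
        ast.parse(source, filename="<code string>", mode=mode)
    except SyntaxError as err:
        return err
    return None


for text, mode in [("x + 1", "eval"), ("x +", "eval"), ("x = 1", "exec")]:
    err = syntax_error_in(text, mode)
    print(repr(text), mode, "ok" if err is None else "syntax error")
```

This only catches syntax errors; names that are undefined at runtime are not detected.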


--
Terry Jan Reedy



Re: [Python-ideas] Fix default encodings on Windows

2016-08-18 Thread Terry Reedy

On 8/18/2016 1:39 PM, Steve Dower wrote:

On 18Aug2016 1036, Terry Reedy wrote:

On 8/18/2016 11:25 AM, Steve Dower wrote:


In this case, we would announce in 3.6 that using bytes as paths on
Windows is no longer deprecated,


My understanding is that the first 2 fixes refine the deprecation rather
than reversing it.  And #3 simply applies it.


#3 certainly just applies the deprecation.

As for the first two, I don't see any reason to deprecate the
functionality once the issues are resolved. If using utf-8 encoded bytes
is going to work fine in all the same cases as using str, why discourage
it?


As I understand it, you are still proposing to remove the use of bytes 
encoded with anything other than utf-8 (and the corresponding *A 
internal functions) and in particular stop lossy path transformations. 
Am I wrong?


--
Terry Jan Reedy



Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Steven D'Aprano
On Thu, Aug 18, 2016 at 08:27:50PM -0400, Eric V. Smith wrote:

> Right. Because all strings (regardless of prefixes) are first parsed as 
> strings, and then have their prefix "operator" applied, it's easy for a 
> parser to ignore any string prefix character.

Is that why raw strings can't end with a backslash?

If so, that's the first time I've seen an explanation of that fact which 
makes sense!
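
For reference, a small sketch of the behaviour in question. The helper is_valid_literal is hypothetical, and compile() is used only so that the invalid literal can be tested as text without breaking the example itself.

```python
def is_valid_literal(src):
    """Return True if src compiles as a Python expression."""
    try:
        compile(src, "<literal>", "eval")
    except SyntaxError:
        return False
    return True


BACKSLASH = "\\"
trailing = "r'abc" + BACKSLASH + "'"   # source text: r'abc\'
inner = "r'a" + BACKSLASH + "'b'"      # source text: r'a\'b'

print(is_valid_literal(trailing))      # the backslash keeps the quote from closing
print(is_valid_literal(inner))
print(len(eval(inner)))                # the backslash stays in the value
```

The end of the literal is found with backslash-quote treated as not closing the string, even though in a raw string the backslash is then kept in the value rather than removed.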


-- 
Steve


Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread אלעזר
On Fri, 19 Aug 2016, 08:29, Terry Reedy wrote:

> On 8/18/2016 8:18 PM, Steven D'Aprano wrote:
> > On Fri, Aug 19, 2016 at 02:17:29AM +1000, Chris Angelico wrote:
> >
> >> Format codes are just text,
> >
> > I really think that is wrong. They're more like executable code.
> >
> > https://www.python.org/dev/peps/pep-0498/#expression-evaluation
>
> I agree with you here.  I just note that the strings passed to exec,
> eval, and compile are also executable code strings (and nothing but!).
> But I don't remember a suggestion that *they* should be colored as
> anything other than a string.


But these are objects of type str, not string literals. If they were, I
guess someone would have suggested such coloring.

~Elazar

Re: [Python-ideas] Let’s make escaping in f-literals impossible

2016-08-18 Thread Serhiy Storchaka

On 19.08.16 08:07, Terry Reedy wrote:

On 8/18/2016 3:30 PM, Steve Dower wrote:

On 18Aug2016 1215, Terry Reedy wrote:

On 8/18/2016 12:50 PM, Steve Dower wrote:

I don't think f'{x.partition('-')[0]}' is any less readable as a result
of the reused quotes,


Why are you reusing the single quote ', which needs the escaping that you
don't like, instead of any of at least 6 alternatives that do not need
any escaping?

f'{x.partition("-")[0]}'
f'{x.partition("""-""")[0]}'
f"{x.partition('-')[0]}"
f'''{x.partition('-')[0]}'''
f"""{x.partition('-')[0]}"""
f"""{x.partition('''-''')[0]}"""

It seems to me that this is at least somewhat a strawman issue.

If you want to prohibit backslashed quote reuse in expressions, as in
f'{x.partition(\'-\')[0]}', that is okay with me, as this is
unnecessary* and arguably bad.  The third alternative above is better.
What breaks colorizers, and what I therefore object to, is the
innovation of adding magical escaping of ' or " without \.

Or add a new style rule to PEP 8.


+1. It is even possible to add a SyntaxWarning in future.

