[Python-Dev] Re: pth file encoding

2021-03-19 Thread Dan Stromberg
On Wed, Mar 17, 2021 at 1:11 AM Michał Górny  wrote:

> On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> > OK. setuptools doesn't specify encoding at all. So locale-specific
> > encoding is used.
> > We cannot fix it in the short term.
>
> How about writing paths as bytestrings in the long term?  I think this
> should eliminate the necessity of knowing the correct encoding for
> the filesystem.
>
On Linux and many Unixes, there is no "correct" filesystem encoding.  ASCII
and UTF-8 are probably the most common encodings for individual files,
maybe even large collections of files, but nevertheless, paths are
bytestrings.  Treating paths as UTF-8 works fine for most files, but once
in a while there'll be a filename that fails to convert, and that's not the
fault of the filename.

For example, what happens if you need a file with the name that this
command creates?

    touch "Ma$(echo | tr '\012' '\361')ana"

That is "Mañana" with the ñ stored as the single Latin-1 byte 0xF1 (octal
361), which is not valid UTF-8.

For a presentation application (for example), assuming UTF-8 is probably fine,
maybe even a good thing.  But for a filesystem backup tool, it's important
not to assume an encoding, so you can back up and restore all filenames
irrespective of what the files' creators intended encoding-wise.
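
To make the point concrete, here is a minimal Python sketch (not part of the original message) that uses the byte string the `touch` command above would produce. The variable names are illustrative. `surrogateescape` is the error handler Python applies to undecodable filename bytes on POSIX systems.

```python
# The filename bytes from the example: "Ma", the single byte 0xF1, "ana".
raw_name = b"Ma\xf1ana"

# Strict UTF-8 decoding rejects the name, although the filesystem accepts it.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding as Latin-1 works, but only because we guessed the creator's intent.
as_latin1 = raw_name.decode("latin-1")

# surrogateescape carries the undecodable byte through str and back, so the
# original bytes survive a round trip without knowing the encoding.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok, round_tripped == raw_name)  # False True
```

A backup tool that keeps names as bytes, or as `str` with `surrogateescape`, can restore such a file exactly. One that insists on strict UTF-8 cannot.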
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HLTFATPMRA57UU3KQOXHIMELZZGXUUJJ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Ivan Pozdeev via Python-Dev


On 17.03.2021 23:04, Steve Dower wrote:
> On 3/17/2021 7:34 PM, Ivan Pozdeev via Python-Dev wrote:
>> On 17.03.2021 20:30, Steve Dower wrote:
>>> On 3/17/2021 8:00 AM, Michał Górny wrote:
>>>> How about writing paths as bytestrings in the long term?  I think this
>>>> should eliminate the necessity of knowing the correct encoding for
>>>> the filesystem.
>>>
>>> That's what we're trying to do, the problem is that they start as
>>> strings, and so we need to convert them to a bytestring.
>>>
>>> That conversion is the encoding ;)
>>>
>>> And yeah, for reading, I'd use a UTF-8 reader that falls back to locale
>>> on failure (and restarts reading the file). But for writing, we need the
>>> tools that create these files (including Notepad!) to use the encoding
>>> we want.
>>
>> I don't see a problem with using a file encoding specification like in
>> Python source files.
>> Since site.py is under our control, we can introduce it easily.
>>
>> We can opt to allow only UTF-8 here -- then we wait out a transitional
>> period and disallow anything other than UTF-8 (then the specification can
>> be removed, too).
>
> The only thing we can introduce *easily* is an error when the
> (exclusively third-party) tools that create them aren't up to date.
> Getting everyone to specify the encoding we want is a much bigger
> problem with a much slower solution.

I don't see a problem with either.
If we want to standardize something, we have to encourage, then ultimately
enforce compliance, one way or another.

> This particular file is probably the worst case scenario, but preferring
> UTF-8 and handling existing files with a fallback is the best we can do
> (especially since an assumption of UTF-8 can be invalidated on a
> particular file, whereas most locale encodings cannot). Once we openly
> document that it should be UTF-8, tools will have a chance to catch up,
> and eventually the fallback will become harmless.
>
> Cheers,
> Steve


--
Regards,
Ivan

___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/LN3MHC7O7NHBCCROZGZJOZ5DY76KFLJP/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Steve Dower

On 3/17/2021 7:34 PM, Ivan Pozdeev via Python-Dev wrote:
> On 17.03.2021 20:30, Steve Dower wrote:
>> On 3/17/2021 8:00 AM, Michał Górny wrote:
>>> How about writing paths as bytestrings in the long term?  I think this
>>> should eliminate the necessity of knowing the correct encoding for
>>> the filesystem.
>>
>> That's what we're trying to do, the problem is that they start as
>> strings, and so we need to convert them to a bytestring.
>>
>> That conversion is the encoding ;)
>>
>> And yeah, for reading, I'd use a UTF-8 reader that falls back to
>> locale on failure (and restarts reading the file). But for writing, we
>> need the tools that create these files (including Notepad!) to use the
>> encoding we want.
>
> I don't see a problem with using a file encoding specification like in
> Python source files.
>
> Since site.py is under our control, we can introduce it easily.
>
> We can opt to allow only UTF-8 here -- then we wait out a transitional
> period and disallow anything other than UTF-8 (then the specification can
> be removed, too).

The only thing we can introduce *easily* is an error when the
(exclusively third-party) tools that create them aren't up to date.
Getting everyone to specify the encoding we want is a much bigger
problem with a much slower solution.

This particular file is probably the worst case scenario, but preferring
UTF-8 and handling existing files with a fallback is the best we can do
(especially since an assumption of UTF-8 can be invalidated on a
particular file, whereas most locale encodings cannot). Once we openly
document that it should be UTF-8, tools will have a chance to catch up,
and eventually the fallback will become harmless.


Cheers,
Steve
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/5B53GCQNYXFBYAHSJKI6I34XAV6S67HN/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Ivan Pozdeev via Python-Dev

On 17.03.2021 20:30, Steve Dower wrote:
> On 3/17/2021 8:00 AM, Michał Górny wrote:
>> How about writing paths as bytestrings in the long term?  I think this
>> should eliminate the necessity of knowing the correct encoding for
>> the filesystem.
>
> That's what we're trying to do, the problem is that they start as
> strings, and so we need to convert them to a bytestring.
>
> That conversion is the encoding ;)
>
> And yeah, for reading, I'd use a UTF-8 reader that falls back to locale
> on failure (and restarts reading the file). But for writing, we need the
> tools that create these files (including Notepad!) to use the encoding
> we want.

I don't see a problem with using a file encoding specification like in Python
source files.
Since site.py is under our control, we can introduce it easily.

We can opt to allow only UTF-8 here -- then we wait out a transitional period
and disallow anything other than UTF-8 (then the specification can be
removed, too).

> Cheers,
> Steve

--
Regards,
Ivan

___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/WZJ5EIP47AQV6X4MBN7427O4TNN5F4WY/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Steve Dower

On 3/17/2021 6:08 PM, Stefan Ring wrote:
> A somewhat radical idea carrying this to the extreme would be to use
> UTF-16 (LE) on Windows. After all, this _is_ the native file system
> encoding, and Notepad will happily read and write it.


I'm not opposed to detecting a BOM by default (when no other encoding is 
specified), but that won't help most UTF-8 files which these days come 
with no marker at all.


I wouldn't change the default file encoding for writing though (except 
to unmarked UTF-8, and only with the compatibility approach Inada is 
working on). Everyone has basically come around to the idea that UTF-8 
is the only needed encoding, and I'm sure if it had existed when Windows 
decided to support a universal character set, it would have been chosen. 
But with what we have now, UTF-16-LE is not a good choice for anything 
apart from compatibility with Windows.
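
As an illustration of the BOM detection mentioned above, here is a minimal sketch. The helper name `sniff_bom` is hypothetical and this is not how site.py behaves; it only shows that a BOM, when present, identifies the encoding, while unmarked UTF-8 gives no signal.

```python
import codecs

def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None.

    Hypothetical helper for illustration; UTF-32 BOMs are deliberately ignored.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # this codec strips the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # this codec reads the BOM to pick the byte order
    return None

utf16_le = codecs.BOM_UTF16_LE + "C:\\libs".encode("utf-16-le")
print(sniff_bom(utf16_le))                     # utf-16
print(utf16_le.decode(sniff_bom(utf16_le)))    # C:\libs
print(sniff_bom("C:\\libs".encode("utf-8")))   # None
```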


Cheers,
Steve

___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/LTEJSNOH6EHESXSMXSW352JFG2SF7ZMX/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Stefan Ring
On Wed, Mar 17, 2021 at 6:37 PM Steve Dower  wrote:
>
> On 3/17/2021 8:00 AM, Michał Górny wrote:
> > How about writing paths as bytestrings in the long term?  I think this
> > should eliminate the necessity of knowing the correct encoding for
> > the filesystem.
>
> That's what we're trying to do, the problem is that they start as
> strings, and so we need to convert them to a bytestring.
>
> That conversion is the encoding ;)
>
> And yeah, for reading, I'd use a UTF-8 reader that falls back to locale
> on failure (and restarts reading the file). But for writing, we need the
> tools that create these files (including Notepad!) to use the encoding
> we want.

A somewhat radical idea carrying this to the extreme would be to use
UTF-16 (LE) on Windows. After all, this _is_ the native file system
encoding, and Notepad will happily read and write it.
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/WRAW4UI3X3WYMQ3FMIERDKTVD6WKD5S2/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Steve Dower

On 3/17/2021 8:00 AM, Michał Górny wrote:
> How about writing paths as bytestrings in the long term?  I think this
> should eliminate the necessity of knowing the correct encoding for
> the filesystem.


That's what we're trying to do, the problem is that they start as 
strings, and so we need to convert them to a bytestring.


That conversion is the encoding ;)

And yeah, for reading, I'd use a UTF-8 reader that falls back to locale 
on failure (and restarts reading the file). But for writing, we need the 
tools that create these files (including Notepad!) to use the encoding 
we want.
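
The reading strategy described above (UTF-8 first, locale as a fallback, restarting from the beginning of the file) could be sketched roughly as follows. The function name `read_pth_text` and its `fallback_encoding` parameter are hypothetical, and this is not the site.py implementation.

```python
import locale

def read_pth_text(path, fallback_encoding=None):
    """Decode a .pth file as UTF-8, falling back to another encoding.

    Hypothetical helper, not the site.py implementation. If
    fallback_encoding is None, the locale's preferred encoding is used.
    """
    with open(path, "rb") as f:
        data = f.read()
    try:
        # utf-8-sig also accepts (and strips) an optional UTF-8 BOM.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Start over with the fallback: the whole file is re-decoded,
        # rather than mixing two encodings within one file.
        if fallback_encoding is None:
            fallback_encoding = locale.getpreferredencoding(False)
        return data.decode(fallback_encoding)
```

The order matters. Text written in a legacy encoding usually fails to decode as UTF-8 once it contains non-ASCII characters, whereas many single-byte encodings accept almost any byte sequence, so trying the locale encoding first would rarely report an error.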


Cheers,
Steve

___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/MVD67FOAJRCNR2XXLJ4JDVFPYGZWYLDP/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Paul Moore
On Wed, 17 Mar 2021 at 09:26, Paul Moore  wrote:
> The problem is with the transition - we need to find a way to deal
> with existing `.pth` files, and with people using older versions of
> tools (like setuptools and pipx) that write `.pth` files (so we can't
> assume, for example, that Python 3.12 will never see a .pth file using
> the old-style encoding).

Hmm, I just checked and pipx uses UTF-8 when writing .pth files. See
https://github.com/pipxproject/pipx/blob/master/src/pipx/venv.py#L176
(and lol, it was my mistake, I wrote that code -
https://github.com/pipxproject/pipx/pull/168). I'm inclined to report
that as a bug, even though it appears no-one has complained about it.
But that seems counter-productive given the context here.

Paul
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/V5AJK4WZY2JCGZVFI5KY3QD4DYVSSIBB/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Paul Moore
On Wed, 17 Mar 2021 at 08:52, Inada Naoki  wrote:
> On Windows, it must be UTF-8. For example, we use `chcp 65001` in
> `activate.bat` to support Unicode paths.
> On Unix, a raw path is a bytestring, so paths can be written as-is. Python
> decodes them with fsencoding.

Remember that .pth files contain executable code as well as paths, so
fsencoding is not correct for a .pth file as a whole.
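
As an illustration of why a .pth file is more than a list of paths, the sketch below labels each line roughly the way site.py treats it: comment and blank lines are skipped, lines starting with `import` are executed, and everything else is a candidate path. The function name and sample content are hypothetical, and nothing is actually executed.

```python
def classify_pth_lines(text):
    """Label each line of .pth content roughly the way site.py treats it.

    Illustrative only: nothing is executed or added to sys.path here.
    """
    result = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue                                 # comments and blanks are skipped
        if line.startswith(("import ", "import\t")):
            result.append(("code", line))            # site.py would exec() this line
        else:
            result.append(("path", line.rstrip()))   # candidate sys.path entry
    return result

sample = "# example\n/opt/libs/\u00fcn\u00efcode\nimport sys; sys.dont_write_bytecode = True\n"
for kind, line in classify_pth_lines(sample):
    print(kind, ascii(line))
```

The "code" lines are Python source, for which UTF-8 is the natural default, while the "path" lines are filesystem names, which is why neither convention alone fits the whole file.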

> So I think this is the ideal solution. But this solution requires
> platform-specific code in site.py.
> I don't think pth files are important enough for this complexity.

.pth files are pretty important in the packaging community. I'd
strongly support making their format and behaviour more precisely
defined.

> A sub-optimal idea is to use UTF-8. It is the best encoding for Windows.
> And most Unix systems use UTF-8 too.

+1. IMO, UTF-8 is the only reasonable choice here.

The problem is with the transition - we need to find a way to deal
with existing `.pth` files, and with people using older versions of
tools (like setuptools and pipx) that write `.pth` files (so we can't
assume, for example, that Python 3.12 will never see a .pth file using
the old-style encoding).

It's worth noting that using the default encoding is the *correct* way
of writing .pth files at the moment (as that's how site.py reads them
- see https://github.com/python/cpython/blob/master/Lib/site.py#L173)
so this is technically a file format change - tools writing .pth files
will *have* to include version-specific code if they want to support
multiple versions of Python. We need to be very clear about this -
it's not just a case of "tools need to specify the encoding".
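
To illustrate what "version-specific code" might look like in a tool, here is a hedged sketch. The function names and the `utf8_since` cutoff are hypothetical: the thread does not settle on a Python version at which UTF-8 would become the expected encoding, so the caller has to supply it. The sketch also assumes the tool runs under the interpreter that will later read the file.

```python
import locale
import sys

def pth_encoding_for(target_version, utf8_since):
    """Pick the encoding a tool should use when writing a .pth file.

    Both parameters are (major, minor) tuples. utf8_since is a HYPOTHETICAL
    cutoff supplied by the caller; the thread does not fix such a version.
    """
    if tuple(target_version[:2]) >= tuple(utf8_since):
        return "utf-8"
    # Older interpreters read .pth files with the locale's default encoding.
    return locale.getpreferredencoding(False)

def write_pth(path, lines, utf8_since):
    """Write lines to a .pth file for the currently running interpreter."""
    encoding = pth_encoding_for(sys.version_info, utf8_since)
    with open(path, "w", encoding=encoding) as f:
        f.write("\n".join(lines) + "\n")
```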

Paul
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/MIZLKDTX2EXEHFKKHO33FRSO7EH62DGW/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Antoine Pitrou
On Tue, 16 Mar 2021 11:44:13 +0900
Inada Naoki  wrote:
> Hi, all.
> 
> I found that .pth files are decoded with the default (i.e. locale-specific) encoding.
> https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa822/Lib/site.py#L173
> 
> pth files contain:
> 
> * import statements
> * paths
> 
> For import statements, UTF-8 is the default Python code encoding.
> For paths, fsencoding is the right encoding. It is UTF-8 on Windows
> (unless PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific
> encoding on Linux.
> 
> What encoding should we use?
> 
> * UTF-8
> * sys.getfilesystemencoding()
> * Keep status-quo.

You could add special markup to specify utf8 encoding:

# -*- encoding: utf8 -*-

If no markup is present, use locale encoding.  If markup is present,
use utf8 encoding.  Bail out if markup specifies something other than
utf8.

Then update all pth-producing tools to write utf8-encoded pth files
(at least on the Python versions that support the encoding markup).
In 15 years, you can switch to utf8 by default.
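
The three rules above could be sketched like this. The helper name, the exact marker pattern it accepts, and the choice of `ValueError` for bailing out are all assumptions made for illustration.

```python
import locale
import re

# Matches the markup suggested above, e.g. "# -*- encoding: utf8 -*-".
_MARKUP = re.compile(rb"^#\s*-\*-\s*encoding:\s*([-\w.]+)\s*-\*-")

def pth_encoding_from_markup(data: bytes) -> str:
    """Choose the encoding for .pth bytes based on a first-line marker.

    Hypothetical helper; the marker syntax and error type are assumptions.
    """
    first_line = data.split(b"\n", 1)[0]
    match = _MARKUP.match(first_line)
    if match is None:
        return locale.getpreferredencoding(False)   # no markup: locale encoding
    declared = match.group(1).decode("ascii").lower()
    if declared.replace("-", "").replace("_", "") == "utf8":
        return "utf-8"                              # markup present: UTF-8
    raise ValueError(f"unsupported .pth encoding: {declared!r}")  # bail out
```

The marker line itself is pure ASCII, so it can be inspected as bytes before any decoding decision has been made.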

Regards

Antoine.


___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/J2IM4IQ3L3XEN6XBRFSDLQ2S2FORN3PP/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Inada Naoki
On Wed, Mar 17, 2021 at 5:33 PM Paul Moore  wrote:
>
> On Wed, 17 Mar 2021 at 08:13, Michał Górny  wrote:
> >
> > On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> > > OK. setuptools doesn't specify encoding at all. So locale-specific
> > > encoding is used.
> > > We cannot fix it in the short term.
> >
> > How about writing paths as bytestrings in the long term?  I think this
> > should eliminate the necessity of knowing the correct encoding for
> > the filesystem.
>
> If I have a path in my Python program that is "a£b" (a unicode string)
> and I want to write it to a .pth file, what encoding should I use to
> "write it as a bytestring"? I don't understand what you're trying to
> suggest here.
> Paul

On Windows, it must be UTF-8. For example, we use `chcp 65001` in
`activate.bat` to support Unicode paths.
On Unix, a raw path is a bytestring, so paths can be written as-is. Python
decodes them with fsencoding.

So I think this is the ideal solution. But this solution requires
platform-specific code in site.py.
I don't think pth files are important enough for this complexity.

A sub-optimal idea is to use UTF-8. It is the best encoding for Windows,
and most Unix systems use UTF-8 too.

Regards,

-- 
Inada Naoki  
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/NWBYQHLUIIWU2U2MX4KZXJH4PBTNJYAW/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Paul Moore
On Wed, 17 Mar 2021 at 08:13, Michał Górny  wrote:
>
> On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> > OK. setuptools doesn't specify encoding at all. So locale-specific
> > encoding is used.
> > We cannot fix it in the short term.
>
> How about writing paths as bytestrings in the long term?  I think this
> should eliminate the necessity of knowing the correct encoding for
> the filesystem.

If I have a path in my Python program that is "a£b" (a unicode string)
and I want to write it to a .pth file, what encoding should I use to
"write it as a bytestring"? I don't understand what you're trying to
suggest here.
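
A small illustration of the question (not part of the original message): the same three-character string becomes different bytes depending on the encoding chosen, so "write it as a bytestring" still requires picking one. The three encodings are arbitrary examples.

```python
path = "a\u00a3b"  # the string "a£b" from the question above

# Encode the same text with three different encodings.
encoded = {enc: path.encode(enc) for enc in ("utf-8", "latin-1", "cp437")}

for enc, raw in encoded.items():
    print(enc, raw)
# utf-8 b'a\xc2\xa3b'
# latin-1 b'a\xa3b'
# cp437 b'a\x9cb'
```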
Paul
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/YBE6D37V73OXZYNEW36JO24ZBD7EKAJQ/


[Python-Dev] Re: pth file encoding

2021-03-17 Thread Michał Górny
On Wed, 2021-03-17 at 13:55 +0900, Inada Naoki wrote:
> OK. setuptools doesn't specify encoding at all. So locale-specific
> encoding is used.
> We cannot fix it in the short term.

How about writing paths as bytestrings in the long term?  I think this
should eliminate the necessity of knowing the correct encoding for
the filesystem.

-- 
Best regards,
Michał Górny


___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/EKG7ELEWSG6ZPFYOVTCNVJCGV5W7S7J3/


[Python-Dev] Re: pth file encoding

2021-03-16 Thread Inada Naoki
OK. setuptools doesn't specify encoding at all. So locale-specific
encoding is used.
We cannot fix it in the short term.

On Wed, Mar 17, 2021 at 4:56 AM Brett Cannon  wrote:
>
>
>
> On Mon, Mar 15, 2021 at 7:53 PM Inada Naoki  wrote:
>>
>> Hi, all.
>>
>> I found that .pth files are decoded with the default (i.e. locale-specific) encoding.
>> https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa822/Lib/site.py#L173
>>
>> pth files contain:
>>
>> * import statements
>> * paths
>>
>> For import statements, UTF-8 is the default Python code encoding.
>> For paths, fsencoding is the right encoding. It is UTF-8 on Windows
>> (unless PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific
>> encoding on Linux.
>>
>> What encoding should we use?
>>
>> * UTF-8
>> * sys.getfilesystemencoding()
>> * Keep status-quo.
>
>
> What are packaging tools like pip and setuptools writing .pth files out as?



-- 
Inada Naoki  
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/B5EWSS6GT5O4HBUJTMCKWKZMTC6U6VTV/


[Python-Dev] Re: pth file encoding

2021-03-16 Thread Brett Cannon
On Mon, Mar 15, 2021 at 7:53 PM Inada Naoki  wrote:

> Hi, all.
>
> I found that .pth files are decoded with the default (i.e. locale-specific)
> encoding.
>
> https://github.com/python/cpython/blob/0269ce87c9347542c54a653dd78b9f60bb9fa822/Lib/site.py#L173
>
> pth files contain:
>
> * import statements
> * paths
>
> For import statements, UTF-8 is the default Python code encoding.
> For paths, fsencoding is the right encoding. It is UTF-8 on Windows
> (unless PYTHONLEGACYWINDOWSFSENCODING is set), and the locale-specific
> encoding on Linux.
>
> What encoding should we use?
>
> * UTF-8
> * sys.getfilesystemencoding()
> * Keep status-quo.
>

What are packaging tools like pip and setuptools writing .pth files out as?
___
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/GMVXCNGP2J5JFE4ISANASZ5D67UWWVM7/