[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

  ... So Antoine and Martin: which encoding do you prefer?
 
 I still propose to drop the fsname encoding. Then this question goes away.

You mean that we should use the following encoding for the command line 
arguments, environment variables and all filenames/paths:
 - Mac OS X: utf-8
 - Windows: unicode for command line/env, mbcs to decode filenames
 - other OSes: locale encoding

To do that, we have to:
 - other OSes: delete the PYTHONFSENCODING variable
 - Mac OS X: use utf-8 to decode the command line arguments (we can use 
PyUnicode_DecodeUTF8()+PyUnicode_AsWideCharString() before Python is 
initialized)

On other OSes, we continue to use the FS encoding to encode command 
line/env vars, because the FS encoding will always be the locale encoding. And 
it's more practical to use sys.getfilesystemencoding() than mbstowcs(), 
wcstombs(), _Py_wchar2char(), _Py_char2wchar(), etc. because the FS encoding 
doesn't depend on the current locale, and it uses Python codecs which support 
more error handlers.
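
To make the comparison concrete, here is a minimal sketch (not taken from the patch) of encoding and decoding OS data through the FS encoding with Python codecs. The helper names to_os_bytes() and from_os_bytes() are made up for illustration; os.fsencode() and os.fsdecode() do essentially the same thing in Python 3.2 and later.

```python
import sys

def to_os_bytes(text, errors="surrogateescape"):
    """Encode text for the OS using the FS encoding (hypothetical helper)."""
    return text.encode(sys.getfilesystemencoding(), errors)

def from_os_bytes(raw, errors="surrogateescape"):
    """Decode OS bytes using the FS encoding (hypothetical helper)."""
    return raw.decode(sys.getfilesystemencoding(), errors)

# Bytes that may be undecodable in the FS encoding still round-trip,
# because surrogateescape maps them to lone surrogates and back.
raw = b"caf\xe9.txt"
assert to_os_bytes(from_os_bytes(raw)) == raw

# A Python codec accepts other error handlers as well, e.g. "replace".
print(ascii(from_os_bytes(raw, errors="replace")))
```

The point of the sketch is only that the error handler is a parameter here, which the C conversion functions listed above do not offer.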

I like this solution because it doesn't change a lot of things. I agree to 
drop PYTHONFSENCODING because it looks like PYTHONFSENCODING introduced more 
inconsistencies than it solved.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9992
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

STINNER Victor wrote:
 
 I like this solution because it doesn't change a lot of things. I agree to 
 drop PYTHONFSENCODING because it looks like PYTHONFSENCODING introduced more 
 inconsistencies than it solved.

If you remove the PYTHONFSENCODING, then we have to reconsider
removal of sys.setfilesystemencoding().

The main argument for removal of the sys function was having
the environment variable.

If you remove both, Python will get very poor grades for OS
interoperability on platforms that often deal with multiple
different encodings for file names.

I am repeating myself, but please keep in mind that the locale
is an application scope setting. It doesn't have anything
to do with what's actually stored in file systems or what the
OS uses internally.

Python therefore has to provide a way to customize the file system
encoding and to override the locale guessing that's currently
happening.

You can't just tell people to go with whatever encoding setup
you prefer to make Python's guessing easier or more correct. Python
has to adapt to what the users actually use, not the other way
around. Where that's not easily possible, there have to be ways
to explicitly tell Python what to use... telling the user to adjust
his or her locale settings just to be able to run Python is not
an option.

The world is still moving towards Unicode - it's not 100% there
yet.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 You mean that we should use the following encoding for the command line 
 arguments, environment variables and all filenames/paths:
  - Mac OS X: utf-8
  - Windows: unicode for command line/env, mbcs to decode filenames

No: unicode for filenames also.

  - other OSes: locale encoding

Yes, that is my proposal.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 If you remove both, Python will get very poor grades for OS
 interoperability on platforms that often deal with multiple
 different encodings for file names.

Why that? It will work very well in such a setting, much better
than, say, Java.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Ronald Oussoren

Ronald Oussoren ronaldousso...@mac.com added the comment:

On 09 Oct, 2010,at 02:07 PM, Antoine Pitrou rep...@bugs.python.org wrote:

Antoine Pitrou pit...@free.fr added the comment:

 For the command line, it would mean that we 
 introduced a new encoding: command line encoding, which will be utf-8 on 
 OSX.

Or more generally environment encoding, if it's also used for env
vars. This could solve the subprocess issue neatly.
 

Note that the command-line and environment encoding on OSX is generally UTF-8, 
even if that is not always reflected in the locale settings.

On recent OSX releases LANG will be set to a UTF-8 aware locale (en_US.UTF-8 
on my machine) when you start a shell using Terminal.app.

The correct locale environment variables are AFAIK not set in two important 
situations: on OSX 10.4 and when running code from an application bundle. In 
both cases the environment/command-line encoding should be treated as UTF-8.

There is one reason for not wanting to assume that the encoding is always 
UTF-8: the user might access the system from a non-UTF8 terminal (such as when 
logging in with an SSH session from a system not using UTF-8, or using an 
alternate terminal application). IMHO these are minor enough use-cases that we 
could just enforce that the encoding is UTF-8 on OSX. 

That would ensure that the filesystem encoding and environment/command-line 
encoding are consistent and we'd no longer run into the problem that triggered 
this issue.

Ronald

--

Added file: http://bugs.python.org/file19184/unnamed




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 There is one reason for not wanting to assume that the encoding is
 always UTF-8: the user might access the system from a non-UTF8
 terminal (such as when logging in with an SSH session from a system
 not using UTF-8, or using an alternate terminal application). IMHO
 these are minor enough use-cases that we could just enforce that the
 encoding is UTF-8 on OSX.

Ok, that's enough of an expert statement for me to settle the OSX
case: we will always assume that environment data is UTF-8 on OSX
(leaving the rest to the surrogate escape handler).
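
A rough sketch of what that decision means in practice, written in Python rather than the C code used at startup. The function names are hypothetical; only the UTF-8 codec and the surrogateescape error handler (PEP 383) come from the discussion.

```python
def decode_osx_environment_data(raw):
    """Always decode as UTF-8; undecodable bytes become lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")

def encode_osx_environment_data(text):
    """Inverse operation: lone surrogates turn back into the original bytes."""
    return text.encode("utf-8", "surrogateescape")

utf8_arg = "xxx\u00e9.txt".encode("utf-8")      # what a UTF-8 terminal sends
latin1_arg = "xxx\u00e9.txt".encode("latin-1")  # what a Latin-1 terminal sends

assert decode_osx_environment_data(utf8_arg) == "xxx\u00e9.txt"
# The Latin-1 byte 0xE9 is not valid UTF-8 here, so it is escaped as U+DCE9.
assert decode_osx_environment_data(latin1_arg) == "xxx\udce9.txt"
# Either way the original bytes can be recovered unchanged.
assert encode_osx_environment_data(decode_osx_environment_data(latin1_arg)) == latin1_arg
```

So a user on a non-UTF-8 terminal gets a string that may not display correctly, but the bytes handed back to the OS are still the ones that came in.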




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
 
 Martin v. Löwis mar...@v.loewis.de added the comment:
 
 If you remove both, Python will get very poor grades for OS
 interoperability on platforms that often deal with multiple
 different encodings for file names.
 
 Why that? It will work very well in such a setting, much better
 than, say, Java.

Well, Java pretty much fails completely in this respect, so being
better than Java is not exactly the benchmark I had in mind :-)

I think the proper benchmark would be a Python2 application that
has no problems with these things, since file names are just
bytes that refer to files on the disk, with no associated encoding -
at least on Unix and related platforms.

Being pedantic about forcing some encoding onto things that don't
have an encoding won't really work out in practice. Dealing with
file names, OS environments, pipes and sockets is dirty work, so
I think we should go with the 80-20 approach in making 80% easy
and 20% harder, but still possible.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 Being pedantic about forcing some encoding onto things that don't
 have an encoding won't really work out in practice. Dealing with
 file names, OS environments, pipes and sockets is dirty work, so
 I think we should go with the 80-20 approach in making 80% easy
 and 20% harder, but still possible.

Unix applications can always use the byte-oriented file name APIs
if they need to. Then you are back to the state that things have
in Python 2. No need to have a user-tunable file system encoding
there.
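
For illustration, a small sketch of the byte-oriented route, assuming a Unix file system that accepts arbitrary bytes in file names (some file systems, notably on OSX, reject names that are not valid UTF-8). The helper list_raw_names() is a made-up name.

```python
import os
import tempfile

def list_raw_names(directory):
    """Return directory entries as bytes, with no decoding involved."""
    return sorted(os.listdir(os.fsencode(directory)))

with tempfile.TemporaryDirectory() as tmp:
    # A Latin-1 encoded name; it is not valid UTF-8.
    raw_name = b"xxx\xe9.txt"
    raw_path = os.path.join(os.fsencode(tmp), raw_name)
    with open(raw_path, "wb") as f:
        f.write(b"data")
    # Passing a bytes path makes os.listdir() return bytes names.
    names = list_raw_names(tmp)
    print(names)
```

No encoding is ever chosen here: the application sees the same bytes the kernel stores, which is the Python 2 behaviour referred to above.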

However, I completely fail to see the advantage that the
PYTHONFSENCODING variable has over the LANG variable. If it's
possible to set PYTHONFSENCODING in some application, it surely
is also possible to set LANG (or LC_CTYPE), no? Setting the
latter also gives you the advantage that environment variables
and command line arguments use the same encoding as file names.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 However, I completely fail to see the advantage that the
 PYTHONFSENCODING variable has over the LANG variable. If it's
 possible to set PYTHONFSENCODING in some application, it surely
 is also possible to set LANG (or LC_CTYPE), no? Setting the
 latter also gives you the advantage that environment variables
 and command line arguments use the same encoding as file names.

I guess LANG and LC_CTYPE can be used for other purposes such as
internationalization.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Marc-Andre Lemburg

Marc-Andre Lemburg m...@egenix.com added the comment:

Martin v. Löwis wrote:
 
 Martin v. Löwis mar...@v.loewis.de added the comment:
 
 Being pedantic about forcing some encoding onto things that don't
 have an encoding won't really work out in practice. Dealing with
 file names, OS environments, pipes and sockets is dirty work, so
 I think we should go with the 80-20 approach in making 80% easy
 and 20% harder, but still possible.
 
 Unix applications can always use the byte-oriented file name APIs
 if they need to. Then you are back to the state that things have
 in Python 2. No need to have a user-tunable file system encoding
 there.

Right, and if you take the position of refusing to guess,
which we usually do in Python, then interfacing to file names
using bytes would be the appropriate way to handle the situation.

However, since Python3 has chosen to regard file names as
text regardless of platform, we're now in the situation that
we have to come up with some educated guess on the encoding.

 However, I completely fail to see the advantage that the
 PYTHONFSENCODING variable has over the LANG variable. If it's
 possible to set PYTHONFSENCODING in some application, it surely
 is also possible to set LANG (or LC_CTYPE), no? Setting the
 latter also gives you the advantage that environment variables
 and command line arguments use the same encoding as file names.

The advantage is that you can change the Python file system
encoding *without* having to change your locale settings.

You can't possibly expect a user to switch to using UTF-8 for
all his/her applications just because Python needs this to
properly decode file names.

Users of applications written in Python will most likely not
even know how to change the locale encoding.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

MvL   - Windows: unicode for command line/env, mbcs to decode filenames
MvL No: unicode for filenames also.

Yes, I mean unicode for everything, but decode bytes data from the mbcs 
encoding.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

MAL If you remove the PYTHONFSENCODING, then we have to reconsider
MAL removal of sys.setfilesystemencoding().

Please, Marc, read my comments. You never consider the technical problems; 
you just propose to ensure that Python just works, without answering my 
technical questions. I already explained 2 or 3 times that 
sys.setfilesystemencoding() was completely buggy and not usable in practice. You 
proposed PYTHONFSENCODING and I implemented it. But then I explained in an 
email to python-dev and in this issue that this environment variable 
introduced many problems.

I don't see how sys.setfilesystemencoding() would solve this issue; it's out of 
scope.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 You can't possibly expect a user to switch to using UTF-8 for
 all his/her applications just because Python needs this to
 properly decode file names.

If the user hasn't switched to UTF-8, why would Python need that
to properly decode file names?




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

MAL You can't just tell people to go with whatever encoding setup
MAL you prefer to make Python's guessing easier or more correct.

Python doesn't really *guess* the encoding, it just reads the encoding from the 
locale.

What do you mean by more correct? How can Python know the right encoding 
better than the user? Python should not guess anything. If the environment is 
not correctly configured, it's not Python's fault. The user has to fix their 
environment.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

 I guess LANG and LC_CTYPE can be used for other purposes
 such as internationalization.

That's why there are different environment variables:
 * LC_MESSAGES for i18n (messages)
 * LC_CTYPE for the encoding
 * LC_TIME for time and date
 * etc.
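
As a small illustration of that separation, the sketch below adopts only the LC_CTYPE category from the environment and asks for the resulting encoding. The function name ctype_encoding() is made up; locale.setlocale() and locale.getpreferredencoding() are the standard library calls.

```python
import codecs
import locale

def ctype_encoding():
    """Return the encoding selected by LC_CTYPE (via LC_ALL, LC_CTYPE or LANG)."""
    try:
        # Adopt the user's LC_CTYPE only; LC_MESSAGES, LC_TIME, etc. are untouched.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale name: stay with the current setting
    return locale.getpreferredencoding(False)

encoding = ctype_encoding()
# Show the raw name and the Python codec it maps to.
print(encoding, codecs.lookup(encoding).name)
```

Setting LC_CTYPE alone therefore changes the encoding without changing the language of messages or the date format.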




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

issue9992.patch:
 - Remove PYTHONFSENCODING environment variable
 - Mac OS X: Use utf-8 to decode command line arguments
 - Fix issue #9992 (this issue): the attached test, locale_fs_encoding.py, passes
 - Fix issue #9988
 - Fix issue #10014
 - Fix issue #10039

$ diffstat issue9992.patch 
 Doc/using/cmdline.rst       |   12 
 Doc/whatsnew/3.2.rst        |    6 --
 Lib/test/test_os.py         |   30 --
 Lib/test/test_subprocess.py |    4 
 Lib/test/test_sys.py        |   29 -
 Modules/main.c              |    3 ---
 Modules/python.c            |   10 +-
 Python/pythonrun.c          |   22 ++
 8 files changed, 15 insertions(+), 101 deletions(-)

I like this kind of patch: it removes more code than it adds, yet it fixes 4 
different issues!

I haven't tested the OSX-specific part of the patch (using utf8 to decode 
command line arguments).

--
Added file: http://bugs.python.org/file19190/issue9992.patch




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-11 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

I think that issue9992.patch also fixes #4388 because it uses the same encoding 
(FS encoding, utf8) on OSX to encode and to decode command line arguments.
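
The property at stake can be checked with a sketch like the following, which assumes a POSIX system and Python 3.5 or later (for subprocess.run()). It is not the test attached to the issue; roundtrip_argument() and CHILD_CODE are hypothetical names.

```python
import os
import subprocess
import sys

# Child: print the raw bytes of its first argument as hex (ASCII-safe output).
CHILD_CODE = "import os, sys; print(os.fsencode(sys.argv[1]).hex())"

def roundtrip_argument(raw):
    """Send raw bytes as a command line argument and return what the child saw."""
    arg = os.fsdecode(raw)  # decoded with the FS encoding + surrogateescape
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, arg],
        stdout=subprocess.PIPE,
        check=True,
    )
    return bytes.fromhex(result.stdout.decode("ascii").strip())

print(roundtrip_argument(b"xxx\xe9.txt"))
```

If the parent encodes the argument with one encoding and the child decodes it with another, the bytes returned by the child differ from the bytes sent.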




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-10 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

 We run into problems because we have two inconsistent
 encodings, ...

What? No. We have problems because we don't use the same encoding to decode and 
to encode the same data type. It's not a problem to use a different encoding 
for each data type (stdout, filenames, environment variables, ...).
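
A tiny sketch of that failure mode, using plain codecs only (the roundtrip() helper is a made-up name): the data survives when the same encoding is used in both directions, and is corrupted or rejected otherwise.

```python
def roundtrip(raw, decode_with, encode_with):
    """Decode bytes with one codec, then encode the text with another."""
    return raw.decode(decode_with).encode(encode_with)

original = "xxx\u00e9.txt".encode("utf-8")  # b'xxx\xc3\xa9.txt'

# Same codec in both directions: the bytes survive.
assert roundtrip(original, "utf-8", "utf-8") == original

# Decode as Latin-1 but encode as UTF-8: the name is silently corrupted.
assert roundtrip(original, "latin-1", "utf-8") != original

# Decode as UTF-8 but encode as ASCII: encoding fails outright.
try:
    roundtrip(original, "utf-8", "ascii")
except UnicodeEncodeError as err:
    print("encode error:", err.reason)
```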

--

About the 3rd encoding: it will be just the locale encoding. Using the locale 
encoding to encode/decode command line arguments and environment variables is 
completely compatible with Python 3.1, because Python 3.1 initializes the 
filesystem encoding with the locale encoding. Using the locale encoding helps 
interoperability because other programs use the same encoding.

Mac OS X is a special case. Filesystem encoding is utf-8 on this OS, whereas 
the locale encoding depends on the LANG variable. If I understood MvL's proposal 
correctly, we should not rely on the locale on Mac OS X. So the 3rd encoding 
and the filesystem encodings should be hardcoded to utf-8?

--

The third encoding is no longer controllable by a special environment variable, 
only by the classic locale environment variables (LC_ALL, LC_CTYPE, LANG). Is it a 
problem? I remember a comment from MAL saying that it may be a problem for CGI 
for the environment variables because some (all?) variables are not encoded 
with the locale encoding (but the HTML encoding?). I don't know if Python 
should work around CGI specific issues. In Python 3.2, we now have os.environb: 
it's now possible to use a different encoding for each variable.
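
A minimal sketch of per-variable decoding with os.environb, which is only available on platforms where os.supports_bytes_environ is true (not Windows). The helper getenv_decoded() and the variable name DEMO_LATIN1 are invented for the example.

```python
import os

def getenv_decoded(name, encoding, default=None):
    """Decode one environment variable with an explicitly chosen encoding."""
    raw = os.environb.get(os.fsencode(name))
    if raw is None:
        return default
    return raw.decode(encoding)

# Store a Latin-1 encoded value, bypassing the default encoding entirely.
os.environb[b"DEMO_LATIN1"] = "caf\u00e9".encode("latin-1")
print(ascii(getenv_decoded("DEMO_LATIN1", "latin-1")))
```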




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-10 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

On 10.10.2010 17:51, STINNER Victor wrote:
 
 STINNER Victor victor.stin...@haypocalc.com added the comment:
 
 We run into problems because we have two inconsistent encodings,
 ...
 
 What? No. We have problems because we don't use the same encoding to
 decode and to encode the same data type. It's not a problem to use a
 different encoding for each data type (stdout, filenames, environment
 variables, ...).

This is exactly the very problem that we face. In particular, the
question is what encoding to use if something is *both* a filename
and an environment variable value, or both a filename and a command
line argument.

 Mac OS X is a special case. Filesystem encoding is utf-8 on this OS,
 whereas the locale encoding depends on the LANG variable. If I understood
 MvL's proposal correctly, we should not rely on the locale on Mac OS
 X.

Not rely on is perhaps a bit harsh. It's not clear (to me) under what
conditions the locale's encoding will be more correct than just assuming
UTF-8 - there may actually be use cases for it.

However, with the surrogate escapes, we could just always decode using
UTF-8, and leave any mojibake problems that may arise from this
to the application. I do think that these problems will be rare,
since a) many OSX installations use UTF-8, anyway, and b) those that
don't likely experience the proper round-tripping of the escape mechanism.

 So the 3rd encoding and the filesystem encodings should be
 hardcoded to utf-8?

That's an option to consider, yes - I'd like an OSX expert to
comment.

 The third encoding is no longer controllable by a special environment
 variable, only by classic locale environment variables (LC_ALL,
 LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL
 saying that it may be a problem for CGI for the environment variables
 because some (all?) variables are not encoded with the locale
 encoding (but the HTML encoding?). I don't know if Python should
 work around CGI specific issues. In Python 3.2, we now have
 os.environb: it's now possible to use a different encoding for each
 variable.

I think these problems are sufficiently resolved now: either by
PEP , PEP 444, PEP 383, or os.environb.

I think you misunderstood MAL's comment, though: the environment
variables are not encoded in *any* specific encoding. Instead,
they are copied literally from the HTTP request, using whatever
bytes the browser originally put in there - which may or may
not have followed a particular encoding. HTTP is silent on
this most of the time, and HTML is out of scope.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-10 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

  What? No. We have problems because we don't use the same encoding to
  decode and to encode the same data type. It's not a problem to use a
  different encoding for each data type (stdout, filenames, environment
  variables, ...).
 
 This is exactly the very problem that we face. In particular, the
 question is what encoding to use if something is *both* a filename
 and an environment variable value, or both a filename and a command
 line argument.

The question is: what is the best default encoding for a specific data type? 
There is no perfect answer (well, except maybe using byte strings :-)). Each 
solution has its own use cases and disadvantages.

If an application knows exactly the encoding of some data, and it is not the 
default encoding, it can still redecode the data. Using os.environb, it's a 
little bit better: the application just has to decode (it doesn't have to encode, 
or to know which encoding was used to decode the data initially). For 
sys.argv, I still want to create sys.argvb (bytes version) ;-)
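
Redecoding could look like the sketch below. It assumes the value was originally decoded with the FS encoding and the surrogateescape error handler; redecode() is a hypothetical helper, not an existing API.

```python
import sys

def redecode(text, actual_encoding):
    """Undo the default decoding, then decode with the encoding the app knows."""
    raw = text.encode(sys.getfilesystemencoding(), "surrogateescape")
    return raw.decode(actual_encoding)

# Simulate an argument that was really Latin-1 but decoded the default way.
default_decoded = b"caf\xe9".decode(sys.getfilesystemencoding(), "surrogateescape")
print(ascii(redecode(default_decoded, "latin-1")))
```

With a bytes view such as os.environb the first step disappears, which is the "little bit better" case described above.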

For the command line arguments and environment variables, we don't have a lot 
of choices: locale or filesystem encodings. So Antoine and Martin: which 
encoding do you prefer? We should maybe try to find some use cases.

Here is a dummy script bla.py:
---
import sys
print(sys.argv)
# Try to open the file named by the first command line argument.
try:
    open(sys.argv[1]).close()
except Exception as err:
    print("open error: %s" % err)
else:
    print("open ok")
---

Locale encoding = FS encoding = utf-8:

$ ./python bla.py xxxé.txt 
['bla.py', 'xxxé.txt']
open ok

Locale encoding = utf8, FS encoding = ascii:

$ PYTHONFSENCODING=ascii ./python bla.py xxxé.txt 
['bla.py', 'xxxé.txt']
open error: 'ascii' codec can't encode character '\xe9' ...

The filename is displayed correctly, but we are unable to open the file if 
PYTHONFSENCODING is used :-/ Should the filename be displayed differently if 
PYTHONFSENCODING is used?
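
The failing step can be reproduced without the environment variable. The sketch below only imitates, roughly, what happens when the decoded argument is encoded again for the open() call; encode_filename() is a made-up helper and the encodings are passed explicitly.

```python
def encode_filename(name, fs_encoding):
    """Encode a file name roughly the way open() would for a given FS encoding."""
    return name.encode(fs_encoding, "surrogateescape")

name = "xxx\u00e9.txt"  # decoded from the command line with the locale encoding

print(encode_filename(name, "utf-8"))   # FS encoding = locale encoding: fine
try:
    encode_filename(name, "ascii")      # FS encoding forced to ASCII
except UnicodeEncodeError as err:
    print("open error: %s" % err)
```

The argument was decoded correctly, so it prints correctly; it is the second, different encoding applied on the way back to the OS that fails.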

 I think these problems are sufficiently resolved now: either by
 PEP , PEP 444, PEP 383, or os.environb.

Ok, cool :-)

 I think you misunderstood MAL's comment, though: the environment
 variables are not encoded in *any* specific encoding. Instead,
 they are copied literally from the HTTP request, using whatever
 bytes the browser originally put in there - which may or may
 not have followed a particular encoding. HTTP is silent on
 this most of the time, and HTML is out of scope.

Ah yes, thanks for your explanation. I was unable to find his comment.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-10 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 For the command line arguments and environment variables, we don't have a lot 
 of choices: locale or filesystem encodings. So Antoine and Martin: which 
 encoding do you prefer?

I still propose to drop the fsname encoding. Then this question goes away.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-10 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

On Sunday, 10 October 2010 at 18:23 +, Martin v. Löwis wrote:
 Martin v. Löwis mar...@v.loewis.de added the comment:
 
  For the command line arguments and environment variables, we don't have a 
  lot 
  of choices: locale or filesystem encodings. So Antoine and Martin: which 
  encoding do you prefer?
 
 I still propose to drop the fsname encoding. Then this question goes away.

I don't know what you mean by dropping, since OS X by construction needs
a filesystem encoding (utf-8) different from the locale encoding; and
Windows hardwires the decoding/encoding of bytes filenames using mbcs
regardless of the current codepage, IIRC.

So do you just mean the filesystem encoding should be hidden from the
user? What would be the benefit?




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-10 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

 I don't know what you mean by dropping, since OS X by construction needs
 a filesystem encoding (utf-8) different from the locale encoding;

See above. I propose to stop using the locale encoding for command line
arguments and environment variables on OSX, and use UTF-8 instead.

 and
 Windows hardwires the decoding/encoding of bytes filenames using mbcs
 regardless of the current codepage, IIRC.

I wish byte-oriented file names could be dropped on Windows. But that
is probably too incompatible.

 So do you just mean the filesystem encoding should be hidden from the
 user? What would be the benefit?

That the very issue that this bug report (re-read the title) is about
would go away.
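
To illustrate why the choice of decoder matters, here is a small sketch that decodes the same argument bytes twice. It is not the interpreter's actual start-up code: `decode_argument` is a hypothetical helper, and ISO-8859-1 merely stands in for a locale encoding that differs from UTF-8.

```python
def decode_argument(raw, encoding):
    """Decode one byte-string argument, keeping undecodable bytes as surrogates."""
    return raw.decode(encoding, errors="surrogateescape")


# Bytes that a UTF-8 terminal would pass for the argument "café".
raw_arg = "caf\u00e9".encode("utf-8")

# Decoding with UTF-8 recovers the original text.
as_utf8 = decode_argument(raw_arg, "utf-8")

# Decoding with a different (assumed) locale encoding silently gives mojibake.
as_latin1 = decode_argument(raw_arg, "iso-8859-1")

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(as_utf8))
print(ascii(as_latin1))
```

The second result contains two characters where the user typed one, which is the kind of wrong decoding the bug report describes.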




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-09 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

 Perhaps. We could also declare that command line arguments and
 environment variables are always UTF-8-encoded on OSX (which I think
 would be fairly accurate)

Python uses the filesystem encoding to encode/decode environment variables, 
and on OSX the fs encoding is utf-8. For the command line, it would mean 
introducing a new encoding: a command line encoding, which would be utf-8 on 
OSX.
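
As an illustration of converting between bytes and text with the filesystem encoding, the sketch below uses `os.fsdecode()` and `os.fsencode()`, which exist in Python 3.2 and later. The sample bytes are an assumption chosen for the example; they are not read from a real environment.

```python
import os

# Bytes as they might appear in the process environment (UTF-8 for "café").
raw_value = b"caf\xc3\xa9"

# os.fsdecode() applies the filesystem encoding and its error handler.
text_value = os.fsdecode(raw_value)

# os.fsencode() is the inverse conversion; for these bytes the original
# value is expected to come back unchanged.
round_tripped = os.fsencode(text_value)

print(ascii(text_value))
print(round_tripped == raw_value)
```

On POSIX systems the error handler is `surrogateescape`, so the round trip preserves the bytes even when they are not valid in the filesystem encoding.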




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-09 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 For the command line, it would mean that we 
 introduced a new encoding: command line encoding, which will be utf-8 on 
 OSX.

Or more generally environment encoding, if it's also used for env
vars. This could solve the subprocess issue neatly.




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-09 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

 So perhaps it would be best if Python had two external default encodings:
 the IO one (command line arguments, environment variables, text files),
 and the file name encoding (defaulting to the IO encoding if not set)

Hum, I prefer to consider the FS encoding as an *internal* encoding. ... But 
it's not completely true: it is used for the environment variables.

Let's consider that the FS encoding is only an internal encoding. We need 3 
encodings:
 - FS encoding: any operation on the filesystem
 - IO encoding: text file contents (including stdin, stdout and stderr, which 
are text files)
 - a 3rd encoding (let's call it the command line encoding): used for the 
command line arguments and the environment variables

For technical reasons (bootstrap: Python initialization issues), I would 
like the 3rd encoding to be set from the locale encoding. The user can only 
control it using the classical locale variables (LC_ALL, LC_CTYPE, LANG).
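
The three locale variables named above are consulted in a fixed order on POSIX systems. The following sketch models that order in plain Python; `effective_ctype_locale` is a hypothetical helper written for this illustration, and the fallback to "C" is an assumption (POSIX leaves the default implementation-defined).

```python
def effective_ctype_locale(environ):
    """Return the locale name POSIX would pick for LC_CTYPE from `environ`."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    # Assumption: fall back to the "C" locale when nothing is set.
    return "C"


if __name__ == "__main__":
    sample = {"LANG": "fr_FR.UTF-8", "LC_CTYPE": "en_US.ISO-8859-1"}
    print(effective_ctype_locale(sample))
```

In practice a user would change the encoding by exporting one of these variables before starting Python.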




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-09 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

On 09.10.2010 at 14:07, Antoine Pitrou wrote:
 
 Antoine Pitrou pit...@free.fr added the comment:
 
 For the command line, it would mean that we 
 introduced a new encoding: command line encoding, which will be utf-8 on 
 OSX.
 
 Or more generally environment encoding, if it's also used for env
 vars. This could solve the subprocess issue neatly.

Please no. We run into problems because we have two inconsistent
encodings, and now you propose to introduce another one, allowing
for even more inconsistencies???




[issue9992] Command line arguments are not correctly decoded if locale and filesystem encodings are different

2010-10-09 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

 Please no. We run into problems because we have two inconsistent
 encodings, and now you propose to introduce another one, allowing
 for even more inconsistencies???

It would not really be a third encoding, since it would replace the
locale encoding for all practical purposes, if I understand Victor's
proposal correctly.
