Re: [Python-Dev] Import and unicode: part two

2011-01-27 Thread Martin v. Löwis

> When switching to a UTF-8 locale, they can also change the file
> names of their modules to be encoded in UTF-8. It would be fairly easy
> to write a script that identifies non-ASCII file names in a directory
> and offers to transcode their names from their current encoding to
> UTF-8.

In fact, convmv (http://j3e.de/linux/convmv/) does exactly that;
it comes as a Debian package also.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Glenn Linderman


On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:

> There's one further case that I am worried about that has no real
> "transfer".  Since people here seem to think that unicode module names are
> the future (for instance, the comments about redefining the C locale to
> include utf-8 and the comments about archiving tools needing to support
> encoding bits), there are eventually going to be unicode modules that become
> dependencies of other modules and programs.  These will need to be installed
> on systems.  Linux distributions that ship these will need to choose
> a filesystem encoding for the filenames of these.  Likely the sensible thing
> for them to do is to use utf-8 since all the ones I can think of default to
> utf-8.  But, as Stephen and Victor have pointed out, users change their
> locale settings to things that aren't utf-8 and save their modules using
> filenames in that encoding.  When they update their OS to a version that has
> utf-8 python module names, they will find that they have to make a choice.
> They can either change their locale settings to a utf-8 encoding and have
> the system installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site work.
>
> This is not a good position to put users of these systems in.


The way this case should work is that programs that install files 
(installation is a form of transfer) should transform their names from 
the encoding used in the transfer medium to the encoding of the 
filesystem on which they are installed.


Python3 should access the files, transforming the names from the 
encoding of the filesystem on which they are installed to Unicode for 
use by the program.


I think Python3 is trying to do its part, and Victor is trying to make 
that more robust on more platforms, specifically Windows.


The programs that install files, which may include programs that install 
Python files (I don't know), may or may not be doing their part, but 
clearly there are cases where they do not.


Systems that have different encodings for names on the same or different 
file systems need to have a way to obtain the encoding for the file 
names, so they can be properly decoded.  If they don't have such a way, 
they are broken.


=
The rest of this is an attempt to describe the problem of Linux and 
other systems which use byte strings instead of character strings as 
file names.  No problem, as long as programs allow byte strings as file 
names.  Python3 does not, for the import statement, thus the problem is 
relevant for discussion here, as has been ongoing.

=

Since file names are defined to be byte strings, there is no way to 
obtain the encoding for file names, so they cannot always be decoded, 
and sometimes not properly decoded, because no one knows which encoding 
was used to create them, _if any_.
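A minimal sketch of what "cannot always be decoded" looks like in practice, assuming Python 3.1 or later and its surrogateescape error handler (PEP 383). The file name bytes are made up for illustration; this is not a claim about how the import machinery itself handles them.

```python
# Raw file name bytes as the kernel might hand them back: Latin-1 "café.py".
raw_name = b"caf\xe9.py"

# Strict UTF-8 decoding fails: 0xE9 opens a multi-byte sequence that the
# following byte does not continue.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The surrogateescape handler maps each undecodable byte to a lone surrogate,
# so the name becomes a str that can still be turned back into the same bytes.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The round trip preserves the bytes, but the resulting str is not a name anyone would type in an import statement, which is the heart of the problem described here.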


Hence, Linux programs that use character strings as file names 
internally and expect them to match the byte strings in the file system 
are promoting a fiction: that there is a transformation (encoding) from 
character strings to byte strings that will match.


When using ASCII character strings, they can be transformed to bytes 
using a simple transformation: identity... but that isn't necessarily 
correct, if the files were created using EBCDIC (unlikely on Linux 
systems, but not impossible, since Linux files are byte strings).


When using non-ASCII character strings, the fiction promoted is even 
bigger, and the transformation even harder.  Any 8-bit character 
encoding can pretend that identity is the correct transformation, but 
the result is mojibake if it isn't.  Unicode and other multi-byte encodings 
have an even harder job, because there can be 8-bit sequences that are 
not legal for some transformations, but are legal for others.  This is 
when the fiction is exposed!
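To illustrate the two failure modes just described, here is a small sketch using Latin-1 and UTF-8 as stand-ins for "any 8-bit encoding" and "a multi-byte encoding". The choice of encodings and of the sample name is an assumption for the example only.

```python
name = "café"

latin1_bytes = name.encode("latin-1")   # b'caf\xe9'
utf8_bytes = name.encode("utf-8")       # b'caf\xc3\xa9'

# An 8-bit codec accepts every byte value, so decoding with the wrong one
# "succeeds" and silently yields mojibake.
mojibake = utf8_bytes.decode("latin-1")

# A multi-byte codec has illegal byte sequences, so a wrong guess can fail
# loudly instead.
try:
    latin1_bytes.decode("utf-8")
    utf8_rejects_latin1 = False
except UnicodeDecodeError:
    utf8_rejects_latin1 = True
```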


As the recent description of glib points out, when the file names are 
read as bytes, and shown to the user for selection, possibly using some 
mojibake-generating transformation to characters, the user has a 
fighting chance to pick the right file, less chance if the 
transformation is lossy ('?' substitutions, etc.) and/or the names are 
redundant in their lossless characters.


However, when the specification of the name is in characters (such as 
for Python import, or file names specified as character constants in any 
application system that provides/permits such), and there are large 
numbers of transformations that could be used to convert characters to 
bytes, the problem is harder, and error-prone... programs that want to 
promote the fiction of using characters for filenames must work harder.  
It seems that Python on Linux is such a program.


One technique is to have conventions agreed on by applications and users 
to limit the number of encodings used on a particular system to one 
(optimal) or a few; the latter requires understanding that files created 
in one encoding may not be accessible by systems that use a diffe

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Neil Hodgson

Toshio Kuratomi:

> When they update their OS to a version that has
> utf-8 python module names, they will find that they have to make a choice.
> They can either change their locale settings to a utf-8 encoding and have
> the system installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site work.

   When switching to a UTF-8 locale, they can also change the file
names of their modules to be encoded in UTF-8. It would be fairly easy
to write a script that identifies non-ASCII file names in a directory
and offers to transcode their names from their current encoding to
UTF-8.
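A rough sketch of such a script, assuming Python 3. The function names plan_renames and apply_plan are made up for this example, and the "already valid UTF-8 means already converted" test is only a heuristic, so a real tool should show the plan to the user before renaming anything.

```python
import os


def plan_renames(raw_names, current_encoding):
    """Return (old_bytes, new_bytes) pairs for names that need transcoding."""
    plan = []
    for raw in raw_names:
        if all(byte < 0x80 for byte in raw):
            continue                      # pure ASCII: nothing to do
        try:
            raw.decode("utf-8")
            continue                      # heuristic: looks like UTF-8 already
        except UnicodeDecodeError:
            pass
        try:
            new = raw.decode(current_encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue                      # not valid in the claimed encoding
        plan.append((raw, new))
    return plan


def apply_plan(directory, plan):
    """Rename the files; call this only after the user has accepted the plan."""
    base = os.fsencode(directory)
    for old, new in plan:
        os.rename(os.path.join(base, old), os.path.join(base, new))
```

The byte names would come from os.listdir(os.fsencode(directory)), which returns bytes when given a bytes path, so no decoding happens before the script has had a chance to look at the raw names.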

   Neil

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Toshio Kuratomi

On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:
> On 26.01.2011 10:40, Victor Stinner wrote:
> > On Monday, 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
> >> Why not locale:
> >> * Relying on locale is simply not portable. (...)
> >> * Mixing of modules from different locales won't work. (...)
> > 
> > I don't understand what you are talking about.
> 
> I think by "portability", he means "moving files from one computer to
> another". He argues that if Python would mandate UTF-8 for all file
> names on Unix, moving files in such a way would support portability,
> whereas using the locale's filename might not (if the locale uses a
> different charset on the target system).
> 
> While this is technically true, I don't think it's a helpful way of
> thinking: by mandating that file names are UTF-8 when accessed from
> Python, we make the actual files inaccessible on both the source and
> the target system.
> 
> > I don't understand the relation between the local filesystem encoding
> > and the portability. I suppose that you are talking about the
> > distribution of a module to other computers. Here the question is how
> > the filenames are stored during the transfer. The user is free to use
> > any tool, and try to find a tool handling Unicode correctly :-) But it's
> > no more the Python problem.
> 
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
> 
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.
> 
Thanks Martin, I think that you understand my view even if you don't share
it.

There's one further case that I am worried about that has no real
"transfer".  Since people here seem to think that unicode module names are
the future (for instance, the comments about redefining the C locale to
include utf-8 and the comments about archiving tools needing to support
encoding bits), there are eventually going to be unicode modules that become
dependencies of other modules and programs.  These will need to be installed
on systems.  Linux distributions that ship these will need to choose
a filesystem encoding for the filenames of these.  Likely the sensible thing
for them to do is to use utf-8 since all the ones I can think of default to
utf-8.  But, as Stephen and Victor have pointed out, users change their
locale settings to things that aren't utf-8 and save their modules using
filenames in that encoding.  When they update their OS to a version that has
utf-8 python module names, they will find that they have to make a choice.
They can either change their locale settings to a utf-8 encoding and have
the system installed modules work or they can leave their encoding on their
non-utf-8 encoding and have the modules that they've created on-site work.

This is not a good position to put users of these systems in.

-Toshio


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Martin v. Löwis

> If NFSv3 doesn't reencode filenames for each client and the clients
> don't reencode filenames, all clients have to use the same locale
> encoding as the server. Otherwise, I don't see how it can work.

In practice, users accept that they get mojibake - their editors can
still open the files, and they can double-click them in a file browser
just fine. So it doesn't really need to work, and users can still use
it.

> Again, I don't think that Python should do anything special to
> work around these issues.

I agree, and I'm certainly in favor of keeping the current code base.
Just make sure you understand the reasoning of those opposing.

Regards,
Martin

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread James Y Knight


On Jan 26, 2011, at 11:47 AM, Victor Stinner wrote:
> Not exactly. Gtk+ uses the glib library, and to encode/decode filenames,
> the glib library uses:
> 
> - UTF-8 on Windows
> - G_FILENAME_ENCODING environment variable if set (comma-separated list
> of encodings)
> - UTF-8 if G_BROKEN_FILENAMES env var is set
> - or the locale encoding


But the documentation says:

> On Unix, the character sets are determined by consulting the environment 
> variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES. On Windows, the 
> character set used in the GLib API is always UTF-8 and said environment 
> variables have no effect.
> 
> G_FILENAME_ENCODING may be set to a comma-separated list of character set 
> names. The special token "@locale" is taken to mean the character set for 
> the current locale. If G_FILENAME_ENCODING is not set, but G_BROKEN_FILENAMES 
> is, the character set of the current locale is taken as the filename 
> encoding. If neither environment variable is set, UTF-8 is taken as the 
> filename encoding, but the character set of the current locale is also put in 
> the list of encodings.

Which indicates to me that (unless you override the behavior with env vars) it 
encodes filenames in UTF-8 regardless of the locale, and attempts decoding in 
UTF-8 primarily. And that only when the filename doesn't make sense in UTF-8, 
it will also try decoding it in the locale encoding.
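Here is how I read the quoted rules for the Unix case, written out as a small sketch. The function name filename_charsets and its parameters are invented for this example; it paraphrases the documentation text above and is not GLib's actual code.

```python
def filename_charsets(environ, locale_charset):
    """Ordered list of charsets to try for file names, per the quoted docs."""
    spec = environ.get("G_FILENAME_ENCODING")
    if spec:
        tokens = [token.strip() for token in spec.split(",") if token.strip()]
        # "@locale" stands for the character set of the current locale.
        return [locale_charset if token == "@locale" else token
                for token in tokens]
    if "G_BROKEN_FILENAMES" in environ:
        return [locale_charset]
    # Neither variable set: UTF-8 first, with the locale charset also listed.
    return ["UTF-8", locale_charset]
```

With an empty environment and an EUC-JP locale this gives UTF-8 followed by EUC-JP, which is the behaviour I described: UTF-8 primarily, the locale encoding only as a fallback.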

James

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Victor Stinner

On Wednesday, 26 January 2011 at 08:24 -0500, James Y Knight wrote:
> On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:
> > During
> > Python 3.2 development, we tried to be able to use a filesystem encoding
> > different than the locale encoding (PYTHONFSENCODING environment
> > variable): but it doesn't work simply because Python is not alone in the
> > OS. Except Python, all programs speak the same "language": the locale
> > encoding. Let's try to give you an example: if you create a module with a
> > name encoded to UTF-8, your file browser will display mojibake.
> 
> Is that really true? I'm pretty sure GTK+ treats all filenames as
> UTF-8 no matter what the locale says. (over-rideable by
> G_FILENAME_ENCODING or G_BROKEN_FILENAMES)

Not exactly. Gtk+ uses the glib library, and to encode/decode filenames,
the glib library uses:

 - UTF-8 on Windows
 - G_FILENAME_ENCODING environment variable if set (comma-separated list
of encodings)
 - UTF-8 if G_BROKEN_FILENAMES env var is set
 - or the locale encoding

glib has no type to store a filename; a filename is a raw byte string
(char*). It has a nice function to work around mojibake issues:
g_filename_display_name(). This function tries to decode the filename
from each encoding of the filename encoding list; if all decodings
fail, it uses UTF-8 and escapes undecodable bytes.
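The fallback strategy can be sketched in Python as follows. This only mimics the behaviour described above; display_name is a made-up helper, and the "replace" error handler (which substitutes U+FFFD) is my stand-in for whatever escaping glib really applies.

```python
def display_name(raw, charsets):
    """Decode a raw file name for display, never raising on bad bytes."""
    for charset in charsets:
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue                      # wrong guess or unknown charset
    # Last resort: UTF-8, with undecodable bytes replaced by U+FFFD.
    return raw.decode("utf-8", "replace")
```

The result is only good for showing to the user; to open the file, the program still has to keep and pass along the original bytes.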

So yes, if you set G_FILENAME_ENCODING you can fix mojibake issues. But
you have to pass the raw bytes filenames to other libraries and
programs.

The problem with PYTHONFSENCODING is that sys.getfilesystemencoding() is
not only used for the filenames, but also for the command line arguments
and the environment variables.

For more information about glib, see g_filename_to_utf8(),
g_filename_display_name() and g_get_filename_charsets() documentation:

http://library.gnome.org/devel/glib/2.26/glib-Character-Set-Conversion.html

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread James Y Knight

On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:
> During
> Python 3.2 development, we tried to be able to use a filesystem encoding
> different than the locale encoding (PYTHONFSENCODING environment
> variable): but it doesn't work simply because Python is not alone in the
> OS. Except Python, all programs speak the same "language": the locale
> encoding. Let's try to give you an example: if you create a module with a
> name encoded to UTF-8, your file browser will display mojibake.

Is that really true? I'm pretty sure GTK+ treats all filenames as UTF-8 no 
matter what the locale says. (over-rideable by G_FILENAME_ENCODING or 
G_BROKEN_FILENAMES)

James

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Victor Stinner

On Wednesday, 26 January 2011 at 11:12 +0100, "Martin v. Löwis" wrote:
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).

Python encodes the module name to the locale encoding to create a
filename. If the locale encoding is not the encoding used on the NFS
server, it doesn't work, but I don't think that Python has to work around
this issue. If a user plays with non-ASCII module names, (s)he has to
understand that (s)he will have to fight against badly configured
systems and tools unable to handle Unicode correctly. We might warn
him/her in the documentation.

If NFSv3 doesn't reencode filenames for each client and the clients
don't reencode filenames, all clients have to use the same locale
encoding as the server. Otherwise, I don't see how it can work.

> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

Except Mac OS X and Windows, no kernel supports Unicode and so all users
of the same computer have to use the same locale encoding, or they will
not be able to share non-ASCII filenames.

--

Again, I don't think that Python should do anything special to
work around these issues.

(Hardcoding the module filename encoding to UTF-8 doesn't work for all the
reasons explained in other emails.)

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Oleg Broytman

On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
> 
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

   I have a solution for all these problems, with a price, of course.
Let's use utf8+base64. Base64 uses a very restricted subset of ASCII, so
filenames will never need to be interpreted, whatever the filesystem
encoding is. The price is that users lose standard OS tools like ls and find.
   I am partially joking, of course, but only partially.
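A sketch of what that would mean, for the sake of argument only. The helper names are invented, and I assume the URL-safe Base64 alphabet because the standard one contains "/", which cannot appear in a file name.

```python
import base64


def encode_module_filename(module_name):
    """Map a Unicode module name to a pure-ASCII file name."""
    token = base64.urlsafe_b64encode(module_name.encode("utf-8"))
    return token.decode("ascii") + ".py"


def decode_module_filename(filename):
    """Recover the Unicode module name from such a file name."""
    token = filename[:-len(".py")]
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
```

Besides ls and find becoming useless, Base64 is case-sensitive, so this would also break on case-insensitive filesystems.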

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Martin v. Löwis

On 26.01.2011 10:40, Victor Stinner wrote:
> On Monday, 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
>> Why not locale:
>> * Relying on locale is simply not portable. (...)
>> * Mixing of modules from different locales won't work. (...)
> 
> I don't understand what you are talking about.

I think by "portability", he means "moving files from one computer to
another". He argues that if Python would mandate UTF-8 for all file
names on Unix, moving files in such a way would support portability,
whereas using the locale's filename might not (if the locale uses a
different charset on the target system).

While this is technically true, I don't think it's a helpful way of
thinking: by mandating that file names are UTF-8 when accessed from
Python, we make the actual files inaccessible on both the source and
the target system.

> I don't understand the relation between the local filesystem encoding
> and the portability. I suppose that you are talking about the
> distribution of a module to other computers. Here the question is how
> the filenames are stored during the transfer. The user is free to use
> any tool, and try to find a tool handling Unicode correctly :-) But it's
> no more the Python problem.

There are cases where there is no real "transfer", in the sense in which
you are using the word. For example, with NFS, you can access the very
same file simultaneously on two systems, with no file name conversion
(unless you are using NFSv4, and unless your NFSv4 implementations
support the UTF-8 mandate in NFS well).

Also, if two users of the same machine have different locale settings,
the same file name might be interpreted differently.

Regards,
Martin

Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Victor Stinner

On Monday, 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
> Why not locale:
> * Relying on locale is simply not portable. (...)
> * Mixing of modules from different locales won't work. (...)

I don't understand what you are talking about.

When you import a module, the module name becomes a filename. On
Windows, you can reuse the Unicode name directly as a filename. On the
other OSes, you have to encode the name to filesystem encoding. During
Python 3.2 development, we tried to be able to use a filesystem encoding
different than the locale encoding (PYTHONFSENCODING environment
variable): but it doesn't work simply because Python is not alone in the
OS. Except Python, all programs speak the same "language": the locale
encoding. Let's try to give you an example: if you create a module with a
name encoded to UTF-8, your file browser will display mojibake.
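A simplified model of that lookup, to make the mismatch concrete. The helpers module_filename and can_import are hypothetical and ignore everything the real import machinery does besides encoding the name; the module name and encodings are sample values.

```python
def module_filename(module_name, fs_encoding):
    """Bytes file name that a lookup would search for under this encoding."""
    return (module_name + ".py").encode(fs_encoding)


def can_import(module_name, fs_encoding, names_on_disk):
    """True if the encoded name matches one of the raw names on disk."""
    try:
        return module_filename(module_name, fs_encoding) in names_on_disk
    except UnicodeEncodeError:
        return False                      # name not representable at all


# A module saved by a user whose locale encoding is EUC-JP.
names_on_disk = {module_filename("日本語", "euc_jp")}
```

The same module name is found when the filesystem encoding is EUC-JP and missed when it is UTF-8, because the two encodings produce different bytes for the same characters.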

I don't understand the relation between the local filesystem encoding
and the portability. I suppose that you are talking about the
distribution of a module to other computers. Here the question is how
the filenames are stored during the transfer. The user is free to use
any tool, and try to find a tool handling Unicode correctly :-) But it's
no more the Python problem.

Each computer may use a different locale encoding. You have to use it to
cooperate with other programs and avoid mojibake. But I don't understand
why you write that "Mixing of modules from different locales won't
work". If you use a tool storing filenames in your locale encoding (e.g.
the TAR file format... and sometimes the ZIP format), the problem comes from
your tool and you should use another tool.

I created http://bugs.python.org/issue10972 to work around ZIP tools
supposing that ZIP files use the locale encoding instead of cp437: this
issue adds an option to force the usage of the Unicode flag (and so
store filenames in UTF-8), even though I initially created the issue to
work around a bootstrap issue (#10955).
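For reference, a small sketch that inspects the Unicode flag (bit 11, 0x800, of the general purpose flags) with the Python 3 zipfile module. The helper utf8_flag_set is made up for this example; it shows the default behaviour as I understand it for current Python 3 versions, not the option proposed in the issue.

```python
import io
import zipfile


def utf8_flag_set(member_name):
    """Write one empty member to an in-memory ZIP and read its flags back."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, b"")
    with zipfile.ZipFile(buffer) as archive:
        info = archive.infolist()[0]
    # Bit 11 of the general purpose flags marks the file name as UTF-8.
    return bool(info.flag_bits & 0x800), info.filename
```

A pure-ASCII member name leaves the flag clear, while a non-ASCII name is stored as UTF-8 with the flag set.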

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-26 Thread Stephen J. Turnbull

Toshio Kuratomi writes:

 > Sure ... but with these systems, neither read-modules-as-locale nor
 > read-modules-as-utf-8 is a workable solution, correct?

Good solution, no, but I believe that read-modules-as-locale *should*
work to a great extent.  AFAIK Python 3 reads Python programs as str
(ie, converting to Unicode -- if it doesn't, it *should*).

 > Especially if the OS does get upgraded but the filesystems with
 > user data (and user created modules) are migrated as-is, you'll run
 > into situations where system installed modules are in utf-8 and
 > user created modules are shift-jis and so something will always be
 > broken.

I don't know what you mean by "system-installed modules".  If you're
talking about Python itself, it's not a problem.  Python doesn't have
any Japanese-named modules in any encoding.

On the other hand, *everything* that involves scripting (shell
scripts, make, etc) related to those filesystems will be broken
*unless* the system, after upgrade but before going live, is converted
to have an appropriate locale encoding.  So I don't really see a
problem here.

The problem is portability across systems, and that is a problem that
only the third-party transports can really deal with.  tar and unzip
need to be taught how to change file names to the locale, etc.

 > The only way to make sure that modules work is to restrict them to ASCII-only
 > on the filesystem.  But because unicode module names are seen as
 > a necessary feature, the question is which way forward is going to lead to
 > the least brokenness.  Which could be locale... but from the python2
 > locale-related bugs that I get to look at, I doubt it.

AFAICS this is going to be site-specific.  End of story.  Or, if you
prefer, "maru-nage".

IMHO, Python 2 locale bugs are unlikely to be a good guide to Python 3
locale bugs because in Python 2 most people just ignore locale and use
"native" strings (~= bytes in Python 3), and that typically "just
works".  In Python 3 that just *doesn't* work any more because you get
a UnicodeError on import, etc, etc.

IMHO, YMMV, and all that.  I know *of* such systems (there remain
quite a few here used by student and research labs), but the ones I
maintain were easy to convert to UTF-8 because I don't export file
systems (except my private files for my own use); everything is
mediated by Apache and Zope, and browsers are happy to cope if I
change from EUC-JP to UTF-8 and then flip the Apache switch to change
default encodings.


Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Toshio Kuratomi

On Wed, Jan 26, 2011 at 11:24:54AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
> 
>  > On Linux there's no defined encoding that will work; file names are just
>  > bytes to the Linux kernel so based on people's argument that the convention
>  > is and should be that filenames are utf-8 and anything else is
>  > a misconfigured system -- python should mandate that its module filenames on
>  > Linux are utf-8 rather than using the user's locale settings.
> 
> This isn't going to work where I live (Tsukuba).  At the national
> university alone there are hundreds of pre-existing *nix systems whose
> filesystems were often configured a decade or more ago.  Even if the
> hardware and OS have been upgraded, the filesystems are usually
> migrated as-is, with OS configuration tweaks to accommodate them.  Many
> of them use EUC-JP (and servers often Shift JIS).  That means that you
> won't be able to read module names with ls, and that will make Python
> unacceptable for this purpose.  I imagine that in Russia the same is
> true for the various Cyrillic encodings.
> 
Sure ... but with these systems, neither read-modules-as-locale nor
read-modules-as-utf-8 is a workable solution, correct?  Especially if
the OS does get upgraded but the filesystems with user data (and user
created modules) are migrated as-is, you'll run into situations where system
installed modules are in utf-8 and user created modules are shift-jis and so
something will always be broken.

The only way to make sure that modules work is to restrict them to ASCII-only
on the filesystem.  But because unicode module names are seen as
a necessary feature, the question is which way forward is going to lead to
the least brokenness.  Which could be locale... but from the python2
locale-related bugs that I get to look at, I doubt it.

> I really don't think there is anything that can be done here except to
> warn people that "Kids, these stunts are performed by highly-trained
> professionals.  Don't try this at home!"  Of course they will anyway,
> but at least they will have been warned in sufficiently strong terms
> that they might pay attention and be able to recover when they run
> into bizarre import exceptions.
> 
So on the subject of warnings... I think a reason it's better to pick an
encoding for the platform/filesystem rather than to use locale is because
people will get an error or a warning at the appropriate time if that's the
case -- the first time they attempt to create and import a module with
a filename that's not encoded in the correct encoding for the platform.
It's all very well to say: "We wrote in the documentation on
http://docs.python.org/distutils/introduction.html#Choosing-a-name that only
ASCII names should be used when distributing python modules" but if the
interpreter doesn't complain when people use a non-ASCII filename we all
know that they aren't going to look in the documentation; they'll try it and
if it works they'll learn that habit.  
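As a rough illustration of the kind of check I have in mind, assuming UTF-8 were the mandated encoding on the platform. The function check_module_filename is hypothetical and not something the interpreter does today.

```python
import warnings


def check_module_filename(raw_name, required_encoding="utf-8"):
    """Return the decoded name, or warn and return None if the bytes don't fit."""
    try:
        return raw_name.decode(required_encoding)
    except UnicodeDecodeError:
        warnings.warn(
            "module file name %r is not valid %s" % (raw_name, required_encoding),
            UnicodeWarning,
        )
        return None
```

The point is that the complaint shows up the first time the badly encoded file name is used, rather than later when the module is moved to another system.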

-Toshio



Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Stephen J. Turnbull

Toshio Kuratomi writes:

 > On Linux there's no defined encoding that will work; file names are just
 > bytes to the Linux kernel so based on people's argument that the convention
 > is and should be that filenames are utf-8 and anything else is
 > a misconfigured system -- python should mandate that its module filenames on
 > Linux are utf-8 rather than using the user's locale settings.

This isn't going to work where I live (Tsukuba).  At the national
university alone there are hundreds of pre-existing *nix systems whose
filesystems were often configured a decade or more ago.  Even if the
hardware and OS have been upgraded, the filesystems are usually
migrated as-is, with OS configuration tweaks to accommodate them.  Many
of them use EUC-JP (and servers often Shift JIS).  That means that you
won't be able to read module names with ls, and that will make Python
unacceptable for this purpose.  I imagine that in Russia the same is
true for the various Cyrillic encodings.

I really don't think there is anything that can be done here except to
warn people that "Kids, these stunts are performed by highly-trained
professionals.  Don't try this at home!"  Of course they will anyway,
but at least they will have been warned in sufficiently strong terms
that they might pay attention and be able to recover when they run
into bizarre import exceptions.

Oh, yeah, don't forget to apply Victor's patch, which allows Python to
keep the promises it can make about consistency.


Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Toshio Kuratomi

On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> > 
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> >  MacOS
> 
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right 
> here you've already broken Python modules on OSX.
>
Others have been saying that Mac OSX's HFS+ uses UTF-8.  But the question is
not whether UTF-16 or UTF-8 is used by HFS+.  It's whether you can sensibly
decide on an encoding from the type of system that is being run on.  This
could be querying the filesystem or a check on sys.platform or some other
method.  I don't know what detection the current code does.

On Linux there's no defined encoding that will work; file names are just
bytes to the Linux kernel.  So, based on people's argument that the
convention is and should be that filenames are utf-8, and that anything else
is a misconfigured system, python should mandate that its module filenames
on Linux are utf-8 rather than using the user's locale settings.
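
For reference, a minimal Python 3 sketch of how one can ask the interpreter
which encoding it applies to file names.  It is not a description of the
import code under discussion; the helper name describe_filename_encoding is
made up for illustration.

```python
import os
import sys


def describe_filename_encoding():
    """Report the platform and the codec Python uses for file names."""
    return {
        "platform": sys.platform,
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


info = describe_filename_encoding()
print(info)

# os.fsencode/os.fsdecode apply that codec; pure-ASCII names round-trip
# under every encoding Python supports as a filesystem encoding.
roundtrip = os.fsdecode(os.fsencode("module_name"))
print(roundtrip)
```

On Linux the reported encoding follows the user's locale, which is exactly
the dependency being argued about here.
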
> 
> And as far as I know, Linux software/FS generally use NFC (I've already seen 
> this issue cause trouble)
> 
Linux FS's are bytes with a small blacklist (so you can't use the NULL byte
in a filename, for instance).  Linux software would be free to use any
normal form that they want.  If one software used NFC and another used NFD,
the FS would record two separate files with two separate filenames.  Other
programs might or might not display this correctly.

Example:
$ touch cafe
$ python
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) 
>>> import os
>>> import unicodedata
>>> a=u'café'
>>> b=unicodedata.normalize('NFC', a)
>>> c=unicodedata.normalize('NFD', a)
>>> open(b.encode('utf8'), 'w').close()
>>> open(c.encode('utf8'), 'w').close()
>>> os.listdir(u'.')
[u'people-etc-changes.txt', u'cafe\u0301', u'cafe',
u'people-etc-changes.sha256sum', u'caf\xe9']
>>> os.listdir('.')
['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe',
'people-etc-changes.sha256sum', 'caf\xc3\xa9']
>>> ^D

$ ls -al .
drwxrwxr-x.  2 badger badger  4096 Jan 25 07:46 .
drwxr-xr-x. 17 badger badger  4096 Jan 24 18:27 ..
-rw-rw-r--.  1 badger badger 0 Jan 25 07:45 cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 café

$ ls -al cafe
-rw-rw-r--.  1 badger badger 0 Jan 25 07:45 cafe
$ ls -al cafe?
-rw-rw-r--.  1 badger badger 0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed
incorrectly and the shell treats the decomposed character as two characters
instead of one.  However, when you view these files in dolphin (the KDE file
manager) you properly see café repeated twice.
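
The same point can be sketched without touching the disk.  This is a small
Python 3 illustration (the session above is Python 2) of why the kernel sees
two different names: the two normal forms encode to different UTF-8 bytes.

```python
import unicodedata

name = "caf\u00e9"  # 'café' written with a precomposed e-acute
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)  # b'caf\xc3\xa9'
print(nfd_bytes)  # b'cafe\xcc\x81'

# Identical text to a reader, different byte strings to the file system.
same_bytes = nfc_bytes == nfd_bytes
# Normalizing both sides to one form makes them comparable again.
same_after_normalizing = unicodedata.normalize("NFC", nfd) == nfc
```

A tool that compares names only after normalizing both sides would treat the
two files as a collision instead of as unrelated names.
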

-Toshio



Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread exarkun


On 09:22 am, catch-...@masklinn.net wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> >
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> >  MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD).
> Right here you've already broken Python modules on OSX.

Are you sure about the UTF-16 part?  Evidence strongly points towards
UTF-8:

 $ python
 Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
 [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import unicodedata, os
 >>> file(u'\N{SNOWMAN}', 'w').close()
 >>> os.listdir('.')
 ['\xe2\x98\x83']
 >>> unicodedata.name('\xe2\x98\x83'.decode('utf-8'))
 'SNOWMAN'
 >>>

Jean-Paul

Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Xavier Morel

On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> 
> * If you can pick a set of encodings that are valid (utf-8 for Linux and
>  MacOS

HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right 
here you've already broken Python modules on OSX.

And as far as I know, Linux software/FS generally use NFC (I've already seen 
this issue cause trouble)


Re: [Python-Dev] Import and unicode: part two

2011-01-25 Thread Stephen J. Turnbull

As Nick points out, nobody really seems to think this is an
argument against your patch.  I'm going to bow out of this thread
after this post, as I'm clearly out of my technical depth.

Victor Stinner writes:

 > Le lundi 24 janvier 2011 11:35:22, Stephen J. Turnbull a écrit :
 > > ... VFAT-formatted file systems and Shift JIS file names ...
 > 
 > I missed something: VFAT stores filenames as unicode (whereas FAT only 
 > supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte 
 > string and as a 255-character unicode (UTF-16-LE) string.

I don't know what it is; I didn't have char-device-level access to the
file system, nor did I have the specs (it was a proprietary phone by a
Japanese OEM).  It *presented* filenames in Shift JIS when mounted on
Linux with the vfat filesystem (either "mount -t vfat /dev/sde1
/mnt/gadget" or "mount -t auto /dev/sde1 /mnt/gadget").  Maybe there
is some unusual layer to translate from Unicode there, I'm not
familiar with Linux kernel drivers and libc facilities (such
special-casing is a common pattern in programming for Japanese;
remember, the Japanese had to deal with these issues before there was
any standard for them).

 > On which OS do you access this VFAT file system? On Windows, you have two
 > APIs: bytes (*A) and wide character (*W). If you use the wide character,
 > there is explicit encoding at all. Linux has two mount options to control
 > unicode on a VFAT filesystem: "codepage" for the byte filenames (use Shift
 > JIS here) and "iocharset" for the unicode filenames (I don't understand
 > this option).

I didn't either, in fact this is the first I've heard of it, so I've
never tried it.

 > I suppose that Shift JIS is used to encode the filename in the 8+3 byte
 > string form.

Could be, but I'm pretty sure these were long filenames, although
maybe they were just short enough (that is, I don't recall noticing
any truncation when mounted compared to the way they were presented on
the phone itself).  I don't use that phone anymore; it's in a box of
junk equipment somewhere.

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Toshio Kuratomi

On Thu, Jan 20, 2011 at 03:27:08PM -0500, Glyph Lefkowitz wrote:
> 
> On Jan 20, 2011, at 11:46 AM, Guido van Rossum wrote:
> Same here. *Most* code will never be shared, or will only be shared
> between users in the same community. When it goes wrong it's also a
> learning opportunity. :-)
> 
> 
> Despite my usual proclivity for being contrarian, I find myself in agreement
> here.  Linux users with locales that don't specify UTF-8 frankly _should_ have
> to deal with all kinds of nastiness until they can transcode their 
> filesystems.
>  MacOS and Windows both have a "right" answer here and your third-party tools
> shouldn't create mojibake in your filenames.
> 
However, if this is the consensus, it makes a lot more sense to pick utf-8
as *the* encoding for python module filenames on Linux.

Why UTF-8:

* UTF-8 can cover the whole range of unicode whereas most (all?) other
  locale friendly encodings cannot.
* UTF-8 is becoming a standard for Linux distributions whether or not Linux
  users are adopting it.
* Third party tools are gaining support for UTF-8 even when they aren't
  gaining support for generic encodings (If I read the spec on zip
  correctly, this is actually what's happening there).
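
As an illustration of the zip point, here is a hedged Python 3 sketch using
the standard zipfile module.  It assumes the "UTF-8 file name" marker is
general purpose bit 11 (0x800) of each entry, which is how I read the zip
spec; the constant name UTF8_NAME_FLAG is mine, not zipfile's.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general purpose bit 11: entry name is UTF-8

# Build a small archive in memory with one ASCII and one non-ASCII name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.py", "")
    archive.writestr("caf\u00e9.py", "")

# Read it back and record, per entry, whether the UTF-8 flag is set.
with zipfile.ZipFile(buffer) as archive:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
        for info in archive.infolist()
    }

print(flags)
```

The flag is stored per archived file, so one archive can mix flagged and
unflagged names.
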

Why not locale:
* Relying on locale is simply not portable.  If nothing prevents people from
  distributing a unicode filename then they will go ahead and do so.  If
  the result works (say, because it's utf-8 and 80% of the Linux userbase is
  using utf-8) then it will get packaged and distributed and people won't
  know that it's a problem until someone with a non-utf-8 locale decides to
  use it.
* Mixing of modules from different locales won't work.  Suppose that the
  system python installs the previous module.  The local site has other
  modules that it has installed using a different filename encoding.
  The users at the site will find that either one or the other of the two
  modules won't work.
* Because of the portability problems you have no choice but to tell people
  not to distribute python modules with non-ASCII names.  This makes the use
  of unicode names second class indefinitely (until the kernel devs decide
  that they're wrong to not enforce a filesystem encoding or Linux becomes
  irrelevant as a platform).
* If you can pick a set of encodings that are valid (utf-8 for Linux and
  MacOS, wide unicode for windows [I get the feeling from other parts of the
  conversation that Windows won't be so lucky, though]) tools to convert
  python names become easier to write.  If you restrict it far enough, you
  could even write tools/importers that automatically do the detection.
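
A minimal sketch of the kind of detection meant in the last point, assuming
utf-8 is the mandated encoding.  The function name split_by_utf8_validity
and the sample names are invented for the example.

```python
def split_by_utf8_validity(raw_names):
    """Partition byte-string file names into (valid, invalid) as UTF-8."""
    valid, invalid = [], []
    for raw in raw_names:
        try:
            valid.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            invalid.append(raw)
    return valid, invalid


sample = [
    b"plain.py",
    "caf\u00e9.py".encode("utf-8"),
    "caf\u00e9.py".encode("latin-1"),  # a name saved under a non-utf-8 locale
]
valid, invalid = split_by_utf8_validity(sample)
print(valid)
print(invalid)
```

On a real directory the byte names could come from os.listdir(b'.'), which
returns bytes when given a bytes path.  Note that passing this check does not
prove a name was meant as utf-8, only that it could be.
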

PS: Sorry for not replying immediately, the team I'm on is dealing with an
issue at my work and I'm also preparing for a conference later this week.

-Toshio


Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Martin v. Löwis

Am 24.01.2011 16:39, schrieb Victor Stinner:
> Le lundi 24 janvier 2011 11:35:22, Stephen J. Turnbull a écrit :
>> ... VFAT-formatted file systems and Shift JIS file names ...
> 
> I missed something: VFAT stores filenames as unicode (whereas FAT only 
> supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte 
> string and as a 255-character unicode (UTF-16-LE) string.

Stephen may not have meant VFAT. Instead, he might have meant FAT32,
or, more likely, exFAT. VFAT is patented by Microsoft, so vendors of
devices using flash memory cards often don't support VFAT.

In any case, file names are encoded in the OEM code page even on VFAT.

> On which OS do you access this VFAT file system? On Windows, you have two
> APIs: bytes (*A) and wide character (*W). If you use the wide character,
> there is explicit encoding at all.

Right ("no explicit encoding"). However, this is actually where things
can go wrong: Windows needs to guess the file system's encoding, and will
guess it uses the OEM code page. If the device writing the file system uses
a different OEM code page than the Windows installation reading it, you
get mojibake.

This will actually happen with the *A APIs as well: they do *not* give
you the file name from disk. Instead, Windows converts the OEM
characters on disk to Unicode, and then the Unicode characters to the
ANSI code page.
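
A rough Python 3 sketch of the mismatch, using codecs only.  The choice of
cp932 for the writing device and cp437 for the reading system is an
assumption made for the example, not something measured on a real device.

```python
name = "\u30c6\u30b9\u30c8"  # katakana for "test"

# Bytes as a device with a Japanese OEM code page would write them.
on_disk = name.encode("cp932")

# A reader that guesses a US OEM code page gets garbage, but no error,
# because cp437 assigns a character to every byte value.
wrong = on_disk.decode("cp437")

# A reader that guesses the matching code page recovers the name.
right = on_disk.decode("cp932")

print(repr(wrong))
print(repr(right))
```

Nothing fails loudly in the wrong case, which is why this kind of damage
tends to be noticed only when someone looks at the names.
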

> Linux has two mount options to control unicode on 
> a VFAT filesystem: "codepage" for the byte filenames (use Shift JIS here) and 
> "iocharset" for the unicode filenames (I don't understand this option). 
> Anyway, both systems support unicode filenames.

Linux doesn't support "unicode file names". Instead, it can support
UTF-8. As Oleg explains: you need one encoding for the bytes on disk
(to know what they mean, when converted to Unicode), and one encoding
to then convert the "abstract" unicode to bytes again to present to
the application. This is similar to how *A works on Windows.

The iocharset is needed even if the file system is known to use UTF-16
(say, NTFS, VFAT, or Joliet).

Regards,
Martin

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Oleg Broytman

On Mon, Jan 24, 2011 at 04:39:39PM +0100, Victor Stinner wrote:
> I missed something: VFAT stores filenames as unicode (whereas FAT only 
> supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte 
> string and as a 255-character unicode (UTF-16-LE) string.
> 
> On which OS do you access this VFAT file system? On Windows, you have two
> APIs: bytes (*A) and wide character (*W). If you use the wide character,
> there is explicit encoding at all. Linux has two mount options to control
> unicode on a VFAT filesystem: "codepage" for the byte filenames (use Shift
> JIS here) and "iocharset" for the unicode filenames (I don't understand
> this option).

   AFAIU, `codepage` is "remote charset" while `iocharset` is "local
charset". I.e., to mount windows-1251 filesystem to my linux with koi8-r
locale I use codepage=cp866,iocharset=koi8-r (cp866 is OEM encoding for
cp1251 ANSI).
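
   In Python 3 terms, the two options can be modelled roughly like this.
It is only a sketch of the idea for the byte names; present_name is a
made-up helper, and the kernel of course does not run Python codecs.

```python
def present_name(on_disk, codepage, iocharset):
    """Model a mount: decode with the 'remote' charset, encode with the local one."""
    text = on_disk.decode(codepage)   # what the bytes on disk mean
    return text.encode(iocharset)     # what the application gets to see


name = "\u043f\u0440\u0438\u0432\u0435\u0442"  # Russian "privet"
on_disk = name.encode("cp866")

presented = present_name(on_disk, "cp866", "koi8_r")

print(on_disk)
print(presented)
```

The application never sees the cp866 bytes, only their koi8-r rendering.
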

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Victor Stinner

Le lundi 24 janvier 2011 16:39:39, Victor Stinner a écrit :
> Le lundi 24 janvier 2011 11:35:22, Stephen J. Turnbull a écrit :
> > ... VFAT-formatted file systems and Shift JIS file names ...
> 
> I missed something: VFAT stores filenames as unicode (whereas FAT only
> supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte
> string and as a 255-character unicode (UTF-16-LE) string.
> 
> On which OS do you access this VFAT file system? On Windows, you have two
> APIs: bytes (*A) and wide character (*W). If you use the wide character,
> there is explicit encoding at all.

Oops, there is *not* explicit encoding at all.

Victor

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Victor Stinner

Le lundi 24 janvier 2011 11:35:22, Stephen J. Turnbull a écrit :
> ... VFAT-formatted file systems and Shift JIS file names ...

I missed something: VFAT stores filenames as unicode (whereas FAT only 
supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte 
string and as a 255-character unicode (UTF-16-LE) string.

On which OS do you access this VFAT file system? On Windows, you have two 
APIs: bytes (*A) and wide character (*W). If you use the wide character, there 
is explicit encoding at all. Linux has two mount options to control unicode on 
a VFAT filesystem: "codepage" for the byte filenames (use Shift JIS here) and 
"iocharset" for the unicode filenames (I don't understand this option). 
Anyway, both systems support unicode filenames.

I suppose that Shift JIS is used to encode the filename in the 8+3 byte string 
form.

Victor

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Nick Coghlan

On Mon, Jan 24, 2011 at 8:35 PM, Stephen J. Turnbull  wrote:
> First of all, these aren't just phones; these are all kinds of gadgets
> (the example I gave was a camera).  They're not as smart as an Android
> or iPhone-like device, and I don't know what OS they use.

We're getting a little far afield from the original question though -
once it was pointed out that non-ASCII module names already work on
some systems but not others, it became fairly clear that Victor's
patch is about fixing an existing feature to be more robust rather
than adding something new.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Stephen J. Turnbull

"Martin v. Löwis" writes:

 > It's one thing how the file systems are formatted, but another thing
 > how they are presented to APIs. For example, the phones using Windows CE
 > would have to convert the file names to Unicode in the OS kernel.
 > 
 > So: for these phones - do you know how they present file names to the
 > application?

First of all, these aren't just phones; these are all kinds of gadgets
(the example I gave was a camera).  They're not as smart as an Android
or iPhone-like device, and I don't know what OS they use.

As for "presentation to the application", as I said, my older phones
presented themselves as "removable memory devices" (specifically on
the USB port), with VFAT-formatted file systems and Shift JIS file
names.  In that case you can surely have the kinds of problems
described, even if the app is not running on the device itself.  I
don't know if this is still true of more modern devices, but I was a
little shocked that it was true at all, even 5 or 6 years ago.

That may be one reason why the phone I have now doesn't provide a USB
interface at all.  That kind of interface is not only unnecessary with
Bluetooth, but Bluetooth uses more robust protocols.

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Martin v. Löwis

>  > Really? I would have thought that cell phones have long been the
>  > platforms most supportive of Unicode.
> 
> I would think so too, except in Japan.
> 
> However, my previous phones exposed file systems with names encoded in
> Shift JIS to USB and IR browsers, though.  (My current one uses
> Bluetooth, and I don't know how to "get at" the filesystem itself.)  A
> lot of these devices also tend to present themselves as VFAT-formatted
> drives (a la a USB memory stick), and Shift JIS is very commonly used
> on those for reasons I don't really understand.

It's one thing how the file systems are formatted, but another thing
how they are presented to APIs. For example, the phones using Windows CE
would have to convert the file names to Unicode in the OS kernel.

So: for these phones - do you know how they present file names to the
application?

Regards,
Martin

Re: [Python-Dev] Import and unicode: part two

2011-01-24 Thread Stephen J. Turnbull

Guido van Rossum writes:

 > Really? I would have thought that cell phones have long been the
 > platforms most supportive of Unicode.

I would think so too, except in Japan.

However, my previous phones exposed file systems with names encoded in
Shift JIS to USB and IR browsers, though.  (My current one uses
Bluetooth, and I don't know how to "get at" the filesystem itself.)  A
lot of these devices also tend to present themselves as VFAT-formatted
drives (a la a USB memory stick), and Shift JIS is very commonly used
on those for reasons I don't really understand.

In any case, AIUI here the problem is like the problem of refactoring
a "make"-based system.  There are identifiers which are "spelled" one
way inside of files which need to match the "spelling" of names of
external filesystem objects.  If you transport such a set of files to
a POSIX system (which AFAIK most servers still are), then it's quite
possible that the file names will get translated to the locale's
encoding while the identifiers will not.
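
A small Python 3 sketch of that mismatch.  It assumes the file name ended up
in EUC-JP while the importing process decodes names as UTF-8; both choices,
and the module name itself, are just for the example.

```python
module_name = "\u65e5\u672c\u8a9e"  # the identifier as spelled in the source

# The file name after being transcoded to an EUC-JP locale.
on_disk = (module_name + ".py").encode("euc_jp")

# What a process that assumes UTF-8 sees when it lists the directory.
# surrogateescape keeps undecodable bytes as lone surrogates, no error.
seen = on_disk.decode("utf-8", errors="surrogateescape")

# The spelling in the source no longer matches any name in the directory.
matches = seen == module_name + ".py"

# The original bytes can still be recovered, so nothing is lost on disk.
recovered = seen.encode("utf-8", errors="surrogateescape")

print(matches)
```

The file is still there and still reachable by its bytes, but a lookup by
the identifier's spelling will not find it.
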

Re: [Python-Dev] Import and unicode: part two

2011-01-23 Thread Guido van Rossum

On Sun, Jan 23, 2011 at 6:33 PM, Stephen J. Turnbull  wrote:
> "Martin v. Löwis" writes:
>  > Actually, as long as people only involve Windows, or only involve Mac,
>  > it will all work just fine. It's only when they use non-Mac Unix
>  > (such as Linux), or try to move files across systems using sub-prime
>  > technology (such as your typical Windows zip utility) they will run
>  > into problems.
>
> I believe that the kind of thing that Ishimoto-san has in mind is
> things like "smart cameras" that will upload your photos to your blog
> with one touch on the camera's screen and other "Web 2.0 for the rest
> of us" apps.  What with the popularity of Linux and *BSD for such
> sites, it's easy to imagine problems of the kind he describes
> occurring between those (which will probably be using Shift JIS in
> Japan) apps and the websites.

Really? I would have thought that cell phones have long been the
platforms most supportive of Unicode. IIRC Nokia's Python port to S60
*required* Unicode strings for all system interfaces. Android, using
Java, also is pretty much all Unicode inside. Am I naive to generalize
from these two examples?

(This is not meant as a rhetorical question -- I may well be missing
something and am genuinely curious about the answer.)

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Import and unicode: part two

2011-01-23 Thread Stephen J. Turnbull

"Martin v. Löwis" writes:

 > Actually, as long as people only involve Windows, or only involve Mac,
 > it will all work just fine. It's only when they use non-Mac Unix
 > (such as Linux), or try to move files across systems using sub-prime
 > technology (such as your typical Windows zip utility) they will run
 > into problems.

I believe that the kind of thing that Ishimoto-san has in mind is
things like "smart cameras" that will upload your photos to your blog
with one touch on the camera's screen and other "Web 2.0 for the rest
of us" apps.  What with the popularity of Linux and *BSD for such
sites, it's easy to imagine problems of the kind he describes
occurring between those (which will probably be using Shift JIS in
Japan) apps and the websites.

Why people with the skills to be actually using Python would have a
problem like that, I don't know, but my experience with Japanese
vendors is no different from anywhere else: they put the blame for
bugs in systems on any convenient component other than their own or
close business partners'.  Open source is especially convenient
because of the NO WARRANTY section prominently displayed in all
licenses.

 > So the more people get confronted with the poor support of non-ASCII
 > file names in tools, the faster the tools will improve. It took PKWARE
 > many years to come up with a reasonable Unicode story - but now it's
 > really the tools that need to catch up, not the spec.

I still agree with this point of view, but there is some scope for
discussion of whether these tools should be "included batteries" or
not.  (Unfortunately I'm not in a position to volunteer to help with
them for some time. :-( )


Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Martin v. Löwis

>> I don't think anybody is *encouraging* it.  The argument is for
>> *permitting* it, partly for consistency with other identifiers, and
>> partly because of Python's usual "consenting adults" standard for
>> permitting "dangerous" practices.
> 
> I'm sorry, I was not clear. I was afraid that saying "learning
> opportunity" would tempt people to try non-ASCII module names.
> These days, even non-technical people have access to Windows, Mac
> and Linux boxes at the same time. So the chances of being annoyed by broken
> non-ASCII named files are pretty high.

Actually, as long as people only involve Windows, or only involve Mac,
it will all work just fine. It's only when they use non-Mac Unix
(such as Linux), or try to move files across systems using sub-prime
technology (such as your typical Windows zip utility) that they will run
into problems. But then it will be clear whom to blame - and people
run into the same problems regardless of whether they move Python modules,
or regular files (say, Word documents).

So the more people get confronted with the poor support of non-ASCII
file names in tools, the faster the tools will improve. It took PKWARE
many years to come up with a reasonable Unicode story - but now it's
really the tools that need to catch up, not the spec.

Regards,
Martin

Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Atsuo Ishimoto

On Fri, Jan 21, 2011 at 5:45 PM, Stephen J. Turnbull  wrote:
> Nick Coghlan writes:
>  > On Fri, Jan 21, 2011 at 3:44 PM, Atsuo Ishimoto  wrote:
>
>  > > I don't want Python to encourage people to use non-ascii module names.
>
> I don't think anybody is *encouraging* it.  The argument is for
> *permitting* it, partly for consistency with other identifiers, and
> partly because of Python's usual "consenting adults" standard for
> permitting "dangerous" practices.

I'm sorry, I was not clear. I was afraid that saying "learning
opportunity" would tempt people to try non-ASCII module names.
These days, even non-technical people have access to Windows, Mac
and Linux boxes at the same time. So the chances of being annoyed by broken
non-ASCII named files are pretty high.

>
> I still don't see this as a reason to give up on non-ASCII module
> names.  Just have the documentation warn that many non-ASCII names
> will be non-portable, so use on multiple systems will require care
> (maybe gloss that with "probably more care than you want to take").
>
Nice gloss.

-- 
Atsuo Ishimoto
Mail: ishim...@gembook.org
Blog: http://d.hatena.ne.jp/atsuoishimoto/
Twitter: atsuoishimoto

Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Antoine Pitrou

On Thu, 20 Jan 2011 22:25:17 -0500
James Y Knight  wrote:
> 
> On Jan 20, 2011, at 3:55 PM, Antoine Pitrou wrote:
> 
> > On Thu, 20 Jan 2011 15:27:08 -0500
> > Glyph Lefkowitz  wrote:
> >> 
> >> To support the latter, could we just make sure that zipimport has a 
> >> consistent,
> >> non-locale-or-operating-system-dependent interpretation of encoding?
> > 
> > It already has, but it's dependent on a flag in the zip file itself
> > (actually, one flag per archived file in the zip it seems).
> > 
> > (by the way, it would be nice if your text/mail editor wrapped lines at
> > 80 characters or something)
> 
> You could complain to Apple, but it seems unlikely that they'd change it. 
> They broke it intentionally in OSX 10.6.2 for better compatibility with MS 
> Outlook.
> 
> (for the technically inclined: It still wraps lines at 80 characters in the 
> raw message, but it uses quoted-printable encoding to escape the line-breaks, 
> so mail readers which decode quoted-printable but can't flow text are now 
> S.O.L. Apple used to use the nice format=flowed standard instead.)

I think most mail readers are able to word-wrap raw text correctly
(even though it still makes your messages look bad amongst a thread of
nicely-formatted 80-column messages).
The real annoyance is when reading Web archives of mailing-lists, e.g.
http://twistedmatrix.com/pipermail/twisted-python/2011-January/023346.html

Regards

Antoine.

Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Stephen J. Turnbull

Atsuo Ishimoto writes:

 > Java, a leading language of the IT industry, has supported
 > non-ASCII class files for years. But I've never seen such files in
 > production in Japan, and it hasn't improved the situation so far.

So why wouldn't Python work the same way?  The rest of the world can
use non-ASCII modules names sparingly, and Japanese programmers can
avoid them diligently.  Or learn to use them properly and teach each
other; if anybody has the experience of multiple encodings needed to
figure out a good way to use the native language in program
identifiers despite the encoding problem, my bet is it would be Japan.


Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Martin v. Löwis


> I don't want Python to encourage people to use non-ascii module names.

I don't think the feature is open for debate anymore. PEP 3131 has been
accepted (after *long* debates), and I'll pronounce that supporting
non-ASCII module names is a direct consequence of having it accepted.
Of course, there may be limitations with respect to operating systems,
and in the way Python modules integrate with the file system - but
that non-ASCII module names must be supported is really out of question.

If you would like this to be reverted, you need to write another PEP.
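
For readers who have not followed PEP 3131, a short Python 3 sketch of what
it already gives us for identifiers in general.  The names are illustrative;
the behaviour shown is that identifiers are converted to NFKC while parsing.

```python
import unicodedata

composed = "caf\u00e9"     # precomposed e-acute (NFC)
decomposed = "cafe\u0301"  # 'e' plus combining acute (NFD)

# Both spellings are legal identifiers under PEP 3131.
both_valid = composed.isidentifier() and decomposed.isidentifier()

# The parser normalizes identifiers to NFKC, so the decomposed
# spelling in source code binds the composed name.
namespace = {}
exec(decomposed + " = 1", namespace)
bound_composed = composed in namespace
bound_decomposed = decomposed in namespace

same_after_nfkc = unicodedata.normalize("NFKC", decomposed) == composed
```

So the name the compiler asks for is in NFKC, regardless of which form the
file system stores; that is one of the operating system limitations meant
above.
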

Regards,
Martin

Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Stephen J. Turnbull

Nick Coghlan writes:
 > On Fri, Jan 21, 2011 at 3:44 PM, Atsuo Ishimoto  wrote:

 > > I don't want Python to encourage people to use non-ascii module names.

I don't think anybody is *encouraging* it.  The argument is for
*permitting* it, partly for consistency with other identifiers, and
partly because of Python's usual "consenting adults" standard for
permitting "dangerous" practices.

I realize this is a somewhat problematic distinction in Japan, for
several reasons, but it's really not one that can be avoided in
computing in any case.  The sooner novice programmers learn it, the
better.

 > > Today, seeing UnicodeEncodeError is one of the popular reasons for
 > > newbies to abandon learning Python in Japan. Non-ascii module names are
 > > another source of confusion for newbies.
 > >
 > > Experienced Japanese programmers may not use non-ascii module names to
 > > avoid encoding issues.
 > >
 > > But novice programmers or non-programmers willing to learn programming
 > > with Python will wish to use Japanese module names. Their programs
 > > will stop working if they copy them to another environment. Sooner or
 > > later, they will see strange ImportError and will start complaining
 > > "Python sucks! Python doesn't support Japanese!" on Twitter.

So ask them, "What language *does* 'support Japanese'?" ;-)

Seriously, "support Japanese" is an impossibly hard standard in the
current environment.  Not only does Japan have 5 more or less standard
encodings still in daily use (EUC-JP, ISO-2022-JP, Shift JIS, UTF-8,
and UTF-16LE), but many major IT companies have their own variants of
the JIS standard character repertoire (all of the variant ideographs
I've seen in the wild are in Unicode, but many corporate repertoires
add extra symbols that are not), and of course some Microsoft
utilities insist on using the deprecated UTF-8 signature with UTF-8.

That said, I really don't see module names as a particular problem.
By the time your novice is using her own modules (as opposed to
importing stdlib and PyPI add-on modules, all with ASCII-only names),
she'll be doing file I/O which has all the same problems, AFAICS.
True, file names will be strings rather than identifiers, but I don't
see why that matters.

 > > Copying files with non-ASCII file names across platforms is not as
 > > easy as it sounds.

Agreed, it's not trivial.  But it's not that hard, either[1], and web
hosts and others *could* help by providing checkers for languages that
they support.

 > > What happens if I copy such files from OS X to my web hosting
 > > server? Results might differ depending on the tools I use to copy
 > > and the platforms.

I don't see why this problem is specific to Python modules, as opposed
to any file name.

 > These all sound like good reasons to continue to *advise* against
 > using non-ASCII module names.

+1

 > But aside from that, they sound exactly like a lot of the arguments
 > we heard when Py3k started enforcing the bytes/text distinction
 > more rigorously: "you're going to break stuff!".

Well, not exactly.  Enforcing the bytes/text distinction was a change
in the definition of Python; breakage was our fault.  The change was
made because in the (not so) long run it would reduce new breakage.

Here, Python is fine (or at least we have some pretty good ideas how
to fix it), it's the world that's broken.  *Especially* Japan, with
its five standard encodings in daily use and scads of private variant
repertoires masquerading as standard encodings on top of that.  But
the whole world is broken because of the NFD/NFC thing.  AFAIK, the
only file system that tries to enforce an NF is Mac OS X HFS+, and
(unfortunately for portability *from* Mac OS X *to* other systems)
they chose NFD.  Proper NFD support is arguably better for a number of
reasons (for one, people regularly invent new composition sequences
that will not have precomposed glyphs in any font), but NFC has the
advantage that existing fonts support precomposed standard characters
while many display engines do not support composition properly yet.
And it's likely to stay broken for a while: the move to conformant
display engines is going to take more time.
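
To make the NFD/NFC point concrete, here is a minimal sketch using the
standard unicodedata module. It only shows that the two normalization
forms of the same name are different strings and different UTF-8 byte
sequences, which is why a file written under one form may not be found
under the other. The name "café" is just the example used elsewhere in
this thread.

```python
import unicodedata

name = "caf\u00e9"  # "café" written with the precomposed character U+00E9

nfc = unicodedata.normalize("NFC", name)  # precomposed: é is one code point
nfd = unicodedata.normalize("NFD", name)  # decomposed: 'e' followed by U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(len(nfc), nfc_bytes)
print(len(nfd), nfd_bytes)
print(nfc == nfd)  # the strings compare unequal unless normalized first
```

Comparing names only after normalizing both sides to the same form is one
way to sidestep the mismatch.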

I still don't see this as a reason to give up on non-ASCII module
names.  Just have the documentation warn that many non-ASCII names
will be non-portable, so use on multiple systems will require care
(maybe gloss that with "probably more care than you want to take").

Footnotes: 
[1]  I actually find copying file names with spaces to be a bigger
problem, because it's so hard to get shell quoting right.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Atsuo Ishimoto

On Fri, Jan 21, 2011 at 2:59 PM, Nick Coghlan  wrote:
>
> These all sound like good reasons to continue to *advise* against
> using non-ASCII module names. But aside from that, they sound exactly
> like a lot of the arguments we heard when Py3k started enforcing the
> bytes/text distinction more rigorously: "you're going to break
> stuff!".

No, non-ASCII module names are new breakage you are going to introduce now :)
If the advice against using non-ASCII module names is reasonable, why
bother supporting them?

>
> Yes, we know. But if core software development components like Python
> don't try to improve their Unicode support, how is the situation ever
> going to get better?
>

Java, a leading language of the IT industry, has supported non-ASCII
class files for years. But I've never seen such files in production in
Japan, and that support has not improved the situation so far.

-- 
Atsuo Ishimoto
Mail: ishim...@gembook.org
Blog: http://d.hatena.ne.jp/atsuoishimoto/
Twitter: atsuoishimoto

Re: [Python-Dev] Import and unicode: part two

2011-01-21 Thread Atsuo Ishimoto

On Fri, Jan 21, 2011 at 1:46 AM, Guido van Rossum  wrote:
> On Thu, Jan 20, 2011 at 5:16 AM, Nick Coghlan  wrote:
>> On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross
>>  wrote:
>>> I'm changing my vote on this to a +1 for two reasons:
>>>
>>> * Initially I thought this wasn't supported by Python at all but I see
>>> that currently it is supported but that support is broken (or at least
>>> limited to UTF-8 filesystem encodings). Since support is there, might
>>> as well make it better (especially if it tidies up the code base at
>>> the same time).
>>>
>>> * I still don't think it's a good idea to give modules non-ASCII names
>>> but the "consenting adults" approach suggests we should let people
>>> shoot themselves in the foot if they believe they have good reason to
>>> do so.
>>
>> I'm also +1 on this for the reasons Simon gives.
>
> Same here. *Most* code will never be shared, or will only be shared
> between users in the same community. When it goes wrong it's also a
> learning opportunity. :-)
>

I don't want Python to encourage people to use non-ascii module names.
Today, seeing UnicodeEncodeError is one of the most common reasons for
newbies in Japan to abandon learning Python. Non-ASCII module names are
another source of confusion for newbies.

Experienced Japanese programmers may not use non-ASCII module names,
to avoid encoding issues.

But novice programmers or non-programmers willing to learn programming
with Python will wish to use Japanese module names. Their programs
will stop working if they copy them to another environment. Sooner or
later, they will see a strange ImportError and will start complaining
"Python sucks! Python doesn't support Japanese!" on Twitter.

Copying files with non-ASCII file names across platforms is not as easy
as it sounds. What happens if I copy such files from OS X to my web
hosting server? Results might differ depending on the tools I use to
copy and the platforms.

Is it a good opportunity to start learning about encodings? I don't
think so. They would have to learn the concepts of character sets and
encodings, the Unicode and JIS character sets, several Japanese
encodings, a number of platform-specific issues, the non-standard
extensions of Microsoft and Apple, and so on. I think they should defer
learning about this mess until they are ready.

-- 
Atsuo Ishimoto
Mail: ishim...@gembook.org
Blog: http://d.hatena.ne.jp/atsuoishimoto/
Twitter: atsuoishimoto

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Nick Coghlan

On Fri, Jan 21, 2011 at 4:44 PM, Atsuo Ishimoto  wrote:
> On Fri, Jan 21, 2011 at 2:59 PM, Nick Coghlan  wrote:
>>
>> These all sound like good reasons to continue to *advise* against
>> using non-ASCII module names. But aside from that, they sound exactly
>> like a lot of the arguments we heard when Py3k started enforcing the
>> bytes/text distinction more rigorously: "you're going to break
>> stuff!".
>
> No, non-ASCII module names are new breakage you are going to introduce now :)

No, they're not. Non-ASCII module names *already work* in Python 3.1
on UTF-8 filesystems. The portability problem you're complaining about
exists now, and Victor is trying to at least partially alleviate it by
making these filenames work correctly on more properly configured
systems (such as Windows). It won't go away until all filesystem
manipulation tools are properly Unicode aware, but that's no reason
for us to continue to unnecessarily exacerbate the problem.

Given imp_cafe.py:

import café

And café.py:

print('Hello world from {}'.format(__name__))

I get the following result:

~$ python3.1 imp_cafe.py
Hello world from café

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Terry Reedy


On 1/20/2011 12:44 PM, Toshio Kuratomi wrote:


The problem occurs in
that the code that one of the parties develops (either the students or the
professors) is developed on one of those OS's and then used on the other OS.


The problem that I reported and hope will be fixed is that private code 
written and tested on one machine, which will never be distributed, 
could not be imported on the *same* machine, with nothing changed on 
that machine except for writing a second file that does the import.


If filenames get mangled when files are transported (admittedly more 
likely with non-ascii chars), that is a different issue.


--
Terry Jan Reedy


Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Nick Coghlan

On Fri, Jan 21, 2011 at 3:44 PM, Atsuo Ishimoto  wrote:
> I don't want Python to encourage people to use non-ascii module names.
> Today, seeing UnicodeEncodeError is one of the most common reasons
> for newbies in Japan to abandon learning Python. Non-ASCII module
> names are another source of confusion for newbies.
>
> Experienced Japanese programmers may not use non-ASCII module names,
> to avoid encoding issues.
>
> But novice programmers or non-programmers willing to learn programming
> with Python will wish to use Japanese module names. Their programs
> will stop working if they copy them to another environment. Sooner or
> later, they will see a strange ImportError and will start complaining
> "Python sucks! Python doesn't support Japanese!" on Twitter.
>
> Copying files with non-ASCII file names across platforms is not as
> easy as it sounds. What happens if I copy such files from OS X to my
> web hosting server? Results might differ depending on the tools I use
> to copy and the platforms.

These all sound like good reasons to continue to *advise* against
using non-ASCII module names. But aside from that, they sound exactly
like a lot of the arguments we heard when Py3k started enforcing the
bytes/text distinction more rigorously: "you're going to break
stuff!".

Yes, we know. But if core software development components like Python
don't try to improve their Unicode support, how is the situation ever
going to get better?

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread James Y Knight

On Jan 20, 2011, at 3:55 PM, Antoine Pitrou wrote:

> On Thu, 20 Jan 2011 15:27:08 -0500
> Glyph Lefkowitz  wrote:
>> 
>> To support the latter, could we just make sure that zipimport has a 
>> consistent,
>> non-locale-or-operating-system-dependent interpretation of encoding?
> 
> It already has, but it's dependent on a flag in the zip file itself
> (actually, one flag per archived file in the zip it seems).
> 
> (by the way, it would be nice if your text/mail editor wrapped lines at
> 80 characters or something)

You could complain to Apple, but it seems unlikely that they'd change it. They 
broke it intentionally in OSX 10.6.2 for better compatibility with MS Outlook.

(for the technically inclined: It still wraps lines at 80 characters in the raw 
message, but it uses quoted-printable encoding to escape the line-breaks, so 
mail readers which decode quoted-printable but can't flow text are now S.O.L. 
Apple used to use the nice format=flowed standard instead.)

James

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Nick Coghlan

On Fri, Jan 21, 2011 at 5:27 AM, Toshio Kuratomi  wrote:
> I think that both ideas are inferior to mandating that every python module
> filename is ascii.  From what I'm getting from Victor's posts is that he, at
> least, considers the portability problems to be ignorable because dealing
> with ambiguous file name encodings is something that he'd like to force
> third party tools to deal with.

I think you're starting from an incorrect premise: we *already* allow
non-ASCII module names in Py3k. They just don't always work properly,
hence why people are currently much, much better off using pure ASCII
for their module names (as ASCII is still the lowest common
denominator for internet communication).

However, you are proposing that, instead of attempting to fix at least
some of the cases where it doesn't work, we throw up our hands and
tell people "Since some poorly configured systems have trouble with
this feature, we're taking it away from everybody. Sorry if this
breaks your code." While there may be situations where that's a valid
approach, this isn't one of them.

Yes, non-ASCII filenames are problems for all sorts of reasons (with
Python's historically poor support being one of them). The idea is
that we're striving to no longer be part of that problem, even if it
isn't within our power to fix it entirely. Once we fix the core to
handle various Unicode issues, then over time that support can ripple
out through the rest of the Python ecosystem - we don't expect
everything to magically "just work" as soon as the basic issue in the
core is fixed. It's going to be *years* before non-ASCII file names
are as portable as pure ASCII ones (it kind of reminds me of the era
when you had to avoid spaces in filenames because so many applications
choked on them, even after the OS had been updated to support them).

As far as the question of filenames not being re-encoded properly when
copied between two systems, then yes, that *is* a problem with the
third party tools used to do the copying. Such tools will break any
code that uses the str APIs to access the filesystem.

To deal with the case of undecodable filenames that the import system
skips over, it is certainly possible that importlib or runpy (probably
the former) could acquire a function that allowed a named file to be
imported directly (with a specific module name) rather than requiring
the import system to search for it.
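
As a rough illustration of that idea: later Python releases do provide
building blocks for this in importlib.util. The sketch below is only an
assumption about how such a function might look; import_file_as, the
file name and the module name are hypothetical and are not part of any
proposal in this thread.

```python
import importlib.util
import os
import sys
import tempfile


def import_file_as(module_name, path):
    """Hypothetical helper: load the file at `path` under `module_name`."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module   # register before executing the code
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # The on-disk name never has to match (or even be decodable as)
    # the module name that the program uses.
    path = os.path.join(tmp, "odd_name_on_disk.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write("GREETING = 'Hello from ' + __name__\n")
    mod = import_file_as("cafe_module", path)

print(mod.GREETING)
```

The point of the design is that the import system's directory search,
which is where undecodable names get skipped, is bypassed entirely.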

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Georg Brandl

On 20.01.2011 12:51, Victor Stinner wrote:

> You only give theorical arguments

Read Anathem lately? ;)

Georg


Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Glenn Linderman


On 1/20/2011 12:27 PM, Glyph Lefkowitz wrote:

To support the latter, could we just make sure that zipimport has a
consistent, non-locale-or-operating-system-dependent interpretation of
encoding?  That way a distributed egg would be importable from a zipfile
regardless of how screwed up the distribution target machine's
filesystem is.  (And this is yet more motivation for distributors to set
zip_safe=True.)


I guess zip_safe is a distutils thing, and I haven't (yet) used distutils.

But regarding zip files, I was trying to figure out if ZipFile module 
supported the CP437/UTF-8 flag, but its documentation seems to predate 
that concept, and just talks about unencoded byte streams.  Yet, I think 
I have Python3 code that passes str to the filenames, and that works, so 
some amount of encoding and decoding to something must be happening 
behind the documentation's back?


It does seem that if a ZipFile is created with the UTF-8 flag turned on, 
that Python should respect that, and that should be independent of the 
file system configured encoding on the local machine on which the 
ZipFile is used (as long as the name of the ZipFile is usable).
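
A small experiment along these lines, as a sketch only: it writes an
in-memory archive with the zipfile module and reads back the
per-member general purpose flags. Bit 11 (0x800) is the UTF-8 file
name flag in the ZIP format; UTF8_FLAG is a name made up here for
readability, and the observed behaviour is that of the Python 3
zipfile module as I understand it, not a documented guarantee.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is UTF-8

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("cafe.py", "x = 1\n")        # pure ASCII name
    zf.writestr("caf\u00e9.py", "x = 2\n")   # non-ASCII name (café.py)

# Re-open the archive and inspect the flag stored for each member.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}

for member, is_utf8 in flags.items():
    print(member, "UTF-8 flag set:", is_utf8)
```

The flag is stored per member, so a single archive can mix flagged and
unflagged names.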


I do know that listing filenames from a zip file created without the 
UTF-8 flag, using ZipFile to access it and placing the names inside a web 
page that specifies its encoding to be UTF-8, produces illegal 
characters, so I've recently become aware that zip files do have 
such a flag, and have been learning the right options to turn it on for 
the command line tools I use to create zip files... but was surprised 
when investigating the same for ZipFile.



Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Antoine Pitrou

On Thu, 20 Jan 2011 15:27:08 -0500
Glyph Lefkowitz  wrote:
> 
> To support the latter, could we just make sure that zipimport has a 
> consistent,
> non-locale-or-operating-system-dependent interpretation of encoding?

It already has, but it's dependent on a flag in the zip file itself
(actually, one flag per archived file in the zip it seems).

(by the way, it would be nice if your text/mail editor wrapped lines at
80 characters or something)

Regards

Antoine.



Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Neil Hodgson

Toshio Kuratomi:

> My examples that you're replying to involve two "properly
> configured" OS's.  The Linux workstations are configured with a UTF-8
> locale.  The Windows OS's use wide character unicode.  The problem occurs in
> that the code that one of the parties develops (either the students or the
> professors) is developed on one of those OS's and then used on the other OS.

   This implies a symmetric issue, but I cannot see how there can be
a problem with non-ASCII module names on Windows as the file system
allows all Unicode characters so can represent any module name.

   OS X is also based on Unicode file names. While it is possible to
mount file systems on Windows or OS X that do not support Unicode file
names these are a very unusual situation that will cause problems in
other ways.

   Common Linux distributions like Ubuntu and Fedora now default to
UTF-8 locales. The situations in which users may encounter
installations that do not support Unicode file names have reduced
greatly.

   Neil

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Glyph Lefkowitz

On Jan 20, 2011, at 11:46 AM, Guido van Rossum wrote:

> On Thu, Jan 20, 2011 at 5:16 AM, Nick Coghlan  wrote:
>> On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross
>>  wrote:
>>> I'm changing my vote on this to a +1 for two reasons:
>>> 
>>> * Initially I thought this wasn't supported by Python at all but I see
>>> that currently it is supported but that support is broken (or at least
>>> limited to UTF-8 filesystem encodings). Since support is there, might
>>> as well make it better (especially if it tidies up the code base at
>>> the same time).
>>> 
>>> * I still don't think it's a good idea to give modules non-ASCII names
>>> but the "consenting adults" approach suggests we should let people
>>> shoot themselves in the foot if they believe they have good reason to
>>> do so.
>> 
>> I'm also +1 on this for the reasons Simon gives.
> 
> Same here. *Most* code will never be shared, or will only be shared
> between users in the same community. When it goes wrong it's also a
> learning opportunity. :-)

Despite my usual proclivity for being contrarian, I find myself in agreement 
here.  Linux users with locales that don't specify UTF-8 frankly _should_ have 
to deal with all kinds of nastiness until they can transcode their filesystems. 
 MacOS and Windows both have a "right" answer here and your third-party tools 
shouldn't create mojibake in your filenames.

However, I feel that we should not necessarily be making non-ASCII programmers 
second-class citizens, if they are to be supported at all.  The obvious outcome 
of the current regime is, if you want your code to work in the wider world, you 
have to make everything ASCII, so non-ASCII programmers have to do a huge 
amount of extra work to prepare their stuff for distribution.  As an English 
speaker I'd be happy about that, but as a person with a lot of Chinese in-laws, 
it gives me pause.

There is a difference between sharing code for inspection and editing (where a 
little codec pain is good for the soul: set your locale to UTF-8 and forget it 
already!) and sharing code so that a (non-programming) user can just run it.  
If I can write software in English and distribute it to Chinese people, fair's 
fair, they should be able to write it in Chinese and have it work on my 
computer.

To support the latter, could we just make sure that zipimport has a consistent, 
non-locale-or-operating-system-dependent interpretation of encoding?  That way 
a distributed egg would be importable from a zipfile regardless of how screwed 
up the distribution target machine's filesystem is.  (And this is yet more 
motivation for distributors to set zip_safe=True.)

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Toshio Kuratomi

On Thu, Jan 20, 2011 at 01:43:03PM -0500, Alexander Belopolsky wrote:
> On Thu, Jan 20, 2011 at 12:44 PM, Toshio Kuratomi  wrote:
> > .. My examples that you're replying to involve two "properly
> > configured" OS's.  The Linux workstations are configured with a UTF-8
> > locale.  The Windows OS's use wide character unicode.  The problem occurs in
> > that the code that one of the parties develops (either the students or the
> > professors) is developed on one of those OS's and then used on the other OS.
> >
> 
> I re-read your posts on this thread, but could not find the examples
> that you refer to.
>
Examples might be a bad word in this context.  Victor was commenting on the
two brainstorm ideas for alternatives to ascii-only that I had.  One was:

* Mandate that every python module on a platform has a specific encoding
  (rather than the value of the locale)

The other was:
* allow using byte strings for import

I think that both ideas are inferior to mandating that every python module
filename is ascii.  From what I'm getting from Victor's posts is that he, at
least, considers the portability problems to be ignorable because dealing
with ambiguous file name encodings is something that he'd like to force
third party tools to deal with.

-Toshio



Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Alexander Belopolsky

On Thu, Jan 20, 2011 at 12:44 PM, Toshio Kuratomi  wrote:
> .. My examples that you're replying to involve two "properly
> configured" OS's.  The Linux workstations are configured with a UTF-8
> locale.  The Windows OS's use wide character unicode.  The problem occurs in
> that the code that one of the parties develops (either the students or the
> professors) is developed on one of those OS's and then used on the other OS.
>

I re-read your posts on this thread, but could not find the examples
that you refer to.  ISTM, your hypothetical students should have no
problem as long as their professor uses proper tools to package her
code.  For example, if she uses a recent version of zip that supports
the Info-ZIP Unicode Path Extra Field (see
http://www.pkware.com/documents/casestudies/APPNOTE.TXT) and students
use a similarly up-to-date unzip tool, the shared code should work as
expected.  Similarly, I would be surprised if a Samba server were not
able to present a shared Linux partition that uses UTF-8 encoding
to a Windows client in a way that will make wopen() work as expected.

The problem with current Python import mechanism is that it does not
use wopen() on Windows and instead, attempts to encode Unicode module
name into a mythical single-byte filesystem encoding (locale ANSI code
page?)  and calls byte-oriented open(char *) on the result.
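
A sketch of why going through a byte-oriented code page is lossy: the
same module name may or may not be encodable depending on the encoding
in effect. The encodings and names below are chosen only for
illustration (cp1252 and cp932 stand in for typical Windows ANSI code
pages); encodable is a hypothetical helper, not part of Python's
import machinery.

```python
def encodable(name, encoding):
    """Return True if `name` can be represented in `encoding`."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cafe, café, and カフェ (written with escapes to keep this file ASCII)
names = ["cafe", "caf\u00e9", "\u30ab\u30d5\u30a7"]
encodings = ("ascii", "cp1252", "cp932", "utf-8")

results = {enc: [encodable(n, enc) for n in names] for enc in encodings}

for enc in encodings:
    print(enc, results[enc])
```

A wide-character API never has to make this choice, since the name
stays Unicode all the way to the file system.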

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Alexander Belopolsky

On Thu, Jan 20, 2011 at 11:45 AM, Andy Teijelo  wrote:
..
> but if the code said:
>
> import café
>
> then Python would look, in any platform, for a file named:
>
> café.py  or  café.py  or something nicer.
>
> Something along the lines of xmlcharrefreplace.
> Just an idea.

Curiously, something like this already happens on OSX when a filename
is not valid UTF-8.  For example,

>>> open(b'\xdb\xcd', 'w').close()
>>> open(b'\xdb\xcd')
<_io.TextIOWrapper name=b'\xdb\xcd' mode='r' encoding='UTF-8'>

but the actual file created is named "%DB%CD".  (Looks like URL-encoding).
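
For comparison, percent-encoding the same two bytes with urllib.parse
gives an identical string. This only illustrates the resemblance to
URL-encoding; it is not a claim about how OSX derives the name
internally.

```python
from urllib.parse import quote

raw = b"\xdb\xcd"

# The byte pair is not valid UTF-8: 0xCD is not a continuation byte.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

escaped = quote(raw)  # percent-encode every byte outside the safe set

print(is_valid_utf8)
print(escaped)
```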

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Toshio Kuratomi

On Thu, Jan 20, 2011 at 12:51:29PM +0100, Victor Stinner wrote:
> On Wednesday, 19 January 2011 at 20:39 -0800, Toshio Kuratomi wrote:
> > Teaching students to write non-portable code (relying on filesystem encoding
> > where your solution is, don't upload to pypi anything that has non-ascii
> > filenames) seems like the exact opposite of how you'd want to shape a young
> > student's understanding of good programming practices.
> 
> That was already discussed before: see PEP 3131.
> http://www.python.org/dev/peps/pep-3131/#common-objections
> 
> If the teacher chooses to use non-ASCII, (s)he is responsible for explaining
> the consequences to his/her students :-)
> 
It's not discussed in that PEP section.

The PEP section says this: "People claim that they will not be able to use
a library if to do so they have to use characters they cannot type on their
keyboards."

Whether you can type it at your keyboard or not is not the problem here.
The problem is portability.  The students and professors are sharing code
with each other.  But because of a mixture of operating systems (let alone
locale settings), the code written by one partner is unable to run on the
computer of the other.

If non-ascii filenames without a defined encoding are considered a feature,
python cannot even issue a descriptive error when this occurs.  It can only
say that it could not find the module but not why.  A restriction on module
names to ascii only could actually state that module names are not allowed
to be non-ASCII when it encounters the import line.
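
To illustrate the kind of error message I have in mind, here is a
purely hypothetical check. require_ascii_module_name is an invented
name and nothing like it exists in Python's import machinery; it is
only a sketch of what an ASCII-only rule could report.

```python
def require_ascii_module_name(name):
    """Hypothetical check: reject module names with non-ASCII characters."""
    bad = sorted({c for c in name if ord(c) > 127})
    if bad:
        listing = ", ".join("U+%04X" % ord(c) for c in bad)
        raise ImportError(
            "module name %r contains non-ASCII characters (%s); "
            "module names must be ASCII" % (name, listing))
    return name


for candidate in ("cafe", "caf\u00e9"):
    try:
        require_ascii_module_name(candidate)
        print(candidate, "is accepted")
    except ImportError as exc:
        print("rejected:", exc)
```

The message names the offending characters, which a plain "module not
found" error cannot do.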

> > > In a school, you can use the same configuration
> > > (encoding) on all computers.
> > > 
> > In a school computer lab perhaps.  But not on all the students' and
> > professors' machines.  How many professors will be cursing python when they
> > discover that the example code that they wrote on their Linux workstation
> > doesn't work when the students try to use it in their windows computer lab?
> 
> Because some students use a stupid or misconfigured OS, Python should
> only accept ASCII names?

Just a note -- you'll get much farther if you refrain from calling names.
It just makes me think that you aren't reading and understanding the issue
I'm raising.  My examples that you're replying to involve two "properly
configured" OS's.  The Linux workstations are configured with a UTF-8
locale.  The Windows OS's use wide character unicode.  The problem occurs in
that the code that one of the parties develops (either the students or the
professors) is developed on one of those OS's and then used on the other OS.

> So, why does Python 3 support non-ASCII
> filenames? It is very well known that non-ASCII filenames are the root of
> many troubles! Should we simply drop unicode support for all filenames?
> And maybe restrict bytes filenames to bytes in [0; 127]? Or better,
> restrict to [32; 126] (U+007f causes some troubles in some terminals).
> 
If you want to argue that because python3 supports non-ascii filenames in
other code, then the logical extension is that the import mechanism should
support importing module names defined by byte sequences.  I happen to think
that import has a lot of differences between it and other filenames as I've
said three times now.

> I think that in 2011, non-ASCII filenames are well supported on all
> (modern) operating systems. Issues with non-ASCII filenames are OS
> specific and should be fixed by the user (the admin of the computer).
> 
> > Additionally, those other filesystem operations have
> > been growing the ability to take byte values and encoding parameters because
> > unicode translation via a single filesystem encoding is a good default but
> > not a complete solution.
> 
> If you are unable to configure your system to decode/encode filenames
> correctly, you should just avoid non-ASCII characters in the
> module names.
> 
This seems like an argument to only have unicode versions of all filesystem
operations.  Since you've been spearheading the effort to have bytes
versions of things that access filenames, environment variables, etc,
I don't think that you seriously mean that.  Perhaps there is a language
issue here.

> You only give theorical arguments: did you at least try to use non-ASCII
> module names on your system with Python 3.2? I suppose that it will just
> work and you will never notice that the unicode module name (on "import
> café") is encoded to bytes.
> 
Yes I did, and I got it to fail in a corner case, as I showed twice with the
same example in other posts.  However, I want to make clear here that the issue
is not that I can create a non-ascii filename and then import it.  The issue
is that I can create a non-ascii filename and then try to share it with the
usual tools and it won't work on the recipient's system.  (A tangent is
whether the recipient's system is physically distinct from mine or only has
a different environment on the same physical host.)

> It fails on OSes using filesystem encodings other than UTF-8 (eg

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Andy Teijelo

(Hi, I'm writing from an address different to the one I'm subscribed 
with to the list because I don't have reverse dns in my mail server and 
mail.python.org rejects my messages. I hope that's not much trouble)


Maybe Python should always use an ASCII encodable filename for modules: 
a translation of the module name into an ASCII encodable string that, 
preferably, was the same as the module name if the module name didn't 
have any non-ASCII characters. Like, if the code said:


import cafe

Python would look for a file named:

cafe.py

but if the code said:

import café

then Python would look, in any platform, for a file named:

café.py  or  café.py  or something nicer.

Something along the lines of xmlcharrefreplace.
Just an idea.
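
A minimal sketch of the mapping suggested above, assuming the stock
"xmlcharrefreplace" error handler is applied to the module name as-is. The
helper name ascii_module_filename is hypothetical; nothing like it exists in
the import machinery:

```python
def ascii_module_filename(module_name, suffix=".py"):
    """Map a module name to an ASCII-only filename (illustrative only).

    ASCII names pass through unchanged; every other character becomes an
    XML numeric character reference such as "&#233;".
    """
    encoded = module_name.encode("ascii", "xmlcharrefreplace")
    return encoded.decode("ascii") + suffix


plain = ascii_module_filename("cafe")
accented = ascii_module_filename("caf\u00e9")  # "café" with a precomposed é
```

The "&" and ";" in the result are awkward in shells, which is probably why
"something nicer" would be wanted in practice.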

Andy.

On 1/20/11 12:21 a.m., Glyph Lefkowitz wrote:
> On Jan 20, 2011, at 12:19 AM, Glenn Linderman wrote:
>
>> Now if the stuff after m_ was the hex UTF-8 of "café", that could get
>> interesting :)
>
> (As it happens, it's the hex digest of the MD5 of the UTF-8 of café... ;-))

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Guido van Rossum

On Thu, Jan 20, 2011 at 5:16 AM, Nick Coghlan  wrote:
> On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross
>  wrote:
>> I'm changing my vote on this to a +1 for two reasons:
>>
>> * Initially I thought this wasn't supported by Python at all but I see
>> that currently it is supported but that support is broken (or at least
>> limited to UTF-8 filesystem encodings). Since support is there, might
>> as well make it better (especially if it tidies up the code base at
>> the same time).
>>
>> * I still don't think it's a good idea to give modules non-ASCII names
>> but the "consenting adults" approach suggests we should let people
>> shoot themselves in the foot if they believe they have good reason to
>> do so.
>
> I'm also +1 on this for the reasons Simon gives.

Same here. *Most* code will never be shared, or will only be shared
between users in the same community. When it goes wrong it's also a
learning opportunity. :-)

> I should have a chance to look at the patch this weekend.

-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Nick Coghlan

On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross
 wrote:
> I'm changing my vote on this to a +1 for two reasons:
>
> * Initially I thought this wasn't supported by Python at all but I see
> that currently it is supported but that support is broken (or at least
> limited to UTF-8 filesystem encodings). Since support is there, might
> as well make it better (especially if it tidies up the code base at
> the same time).
>
> * I still don't think it's a good idea to give modules non-ASCII names
> but the "consenting adults" approach suggests we should let people
> shoot themselves in the foot if they believe they have good reason to
> do so.

I'm also +1 on this for the reasons Simon gives.

I should have a chance to look at the patch this weekend.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Simon Cross

On Wed, Jan 19, 2011 at 5:01 PM, Simon Cross
 wrote:
> On Wed, Jan 19, 2011 at 2:34 PM, Victor Stinner
>  wrote:
>>  (a) Python 3 doesn't support non-ASCII module names
>
> -0: I'm vaguely against this being supported because I'd rather not
> have to deal with what happens when the guess regarding the filesystem
> encoding is wrong. On the other hand, a general encouragement to stick
> to ASCII module names is probably functionally equivalent without
> imposing a hard restriction.

I'm changing my vote on this to a +1 for two reasons:

* Initially I thought this wasn't supported by Python at all but I see
that currently it is supported but that support is broken (or at least
limited to UTF-8 filesystem encodings). Since support is there, might
as well make it better (especially if it tidies up the code base at
the same time).

* I still don't think it's a good idea to give modules non-ASCII names
but the "consenting adults" approach suggests we should let people
shoot themselves in the foot if they believe they have good reason to
do so.

Schiavo
Simon

Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread Victor Stinner

On Wednesday 19 January 2011 at 20:39 -0800, Toshio Kuratomi wrote:
> Teaching students to write non-portable code (relying on filesystem encoding
> where your solution is, don't upload to pypi anything that has non-ascii
> filenames) seems like the exact opposite of how you'd want to shape a young
> student's understanding of good programming practices.

That was already discussed before: see PEP 3131.
http://www.python.org/dev/peps/pep-3131/#common-objections

If the teacher chooses to use non-ASCII, (s)he is responsible for explaining
the consequences to his/her students :-)

> > In a school, you can use the same configuration
> > (encoding) on all computers.
> > 
> In a school computer lab perhaps.  But not on all the students' and
> professors' machines.  How many professors will be cursing python when they
> discover that the example code that they wrote on their Linux workstation
> doesn't work when the students try to use it in their windows computer lab?

Because some students use a stupid or misconfigured OS, Python should
only accept ASCII names? So why does Python 3 support non-ASCII
filenames? It is very well known that non-ASCII filenames are the root of
many troubles! Should we simply drop unicode support for all filenames?
And maybe restrict bytes filenames to bytes in [0; 127]? Or better,
restrict to [32; 126] (U+007f causes some troubles in some terminals).

I think that in 2011, non-ASCII filenames are well supported on all
(modern) operating systems. Issues with non-ASCII filenames are OS
specific and should be fixed by the user (the admin of the computer).

> Additionally, those other filesystem operations have
> been growing the ability to take byte values and encoding parameters because
> unicode translation via a single filesystem encoding is a good default but
> not a complete solution.

If you are unable to configure your system to decode/encode filenames
correctly, you should just avoid non-ASCII characters in module
names.

You only give theoretical arguments: did you at least try to use non-ASCII
module names on your system with Python 3.2? I suppose that it will just
work and you will never notice that the unicode module name (on "import
café") is encoded to bytes.
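
To make the encoding step concrete, here is a simplified sketch of what
happens to the name under different filesystem encodings. The helper
encode_module_filename is hypothetical, and the real import machinery does
more than this (search paths, suffixes, error handlers):

```python
import sys


def encode_module_filename(module_name, encoding=None):
    """Return the bytes that would name module_name + '.py' on disk.

    With encoding=None the interpreter's own filesystem encoding is used.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return (module_name + ".py").encode(encoding)


name = "caf\u00e9"  # "café" with a precomposed é
utf8_bytes = encode_module_filename(name, "utf-8")
latin1_bytes = encode_module_filename(name, "iso-8859-1")

# An ASCII-only locale cannot represent the name at all.
try:
    encode_module_filename(name, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The same unicode name yields different bytes per encoding, so a file created
under one locale is not found under another.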

It fails on OSes using filesystem encodings other than UTF-8 (eg.
Windows)... because of a Python bug, and I just asked whether I should fix
this bug (or whether we should deny non-ASCII names). If the bug is fixed, it
will work everywhere.

> Your solution creates modules which aren't portable

More and more operating systems use a filesystem encoding able to encode
any Unicode character. ASCII-only always gives you the best portability,
but I think that today you can start to play with (at least) ISO-8859-1
characters (café should work on all operating systems). If you don't like
Unicode issues (I personally love them!), just use ASCII everywhere.

> One of my proposals creates python code which isn't portable.  The other one
> suffers some of the same disadvantages as your solution in portability but
> allows for tools that could automatically correct modules.

__import__('café'.encode('UTF-8')) or
__import__('café'.encode('ISO-8859-1')) is less portable than
__import__('café').

> You think that if a module is named appropriately on one system but is not 
> portable to another
> system, that's fine.

No, I am not saying that.

I say that if a name gets broken while you transfer your project from one
system to another (eg. decompressing an archive creates filenames with
mojibake), you should fix your transfer procedure (eg.
use another archive format, use a script to fix filenames, or anything
else), but don't try to handle invalid filenames.

> Setting system locale to ASCII for use in system-wide scripts

This is stupid :-) Yes, on such a system you cannot open *any* non-ASCII
file with Python 3 (except if you work, as in Python 2, on bytes
filenames).

Python cannot do anything to improve Unicode support on such a system:
only the administrator can do something about that.

I know that you can give me many examples of systems where Unicode
doesn't work because the system is not correctly configured. But my
opinion is that we should support non-ASCII names because there are
"some" systems somewhere on which Unicode is fully functional :-)

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-20 Thread James Y Knight

On Jan 19, 2011, at 11:39 PM, Toshio Kuratomi wrote:
> Tangent: This is not true about Linux.  UTF-8 is a matter of the
> interpretation of the filesystem bytes that the user specifies by setting
> their system locale.  Setting system locale to ASCII for use in system-wide
> scripts, is quite common as is changing locale settings in other parts of
> the world (as I can tell you from the bug reports colleagues CC me on to fix
> for the problems with unicode support in their python2 programs).

Fortunately, there's been some (slow) movement towards adding a "C.UTF-8" 
locale and using that by default where "C" (ASCII) is currently used. So that 
may be less of a problem in a few years time.

James

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Glenn Linderman


On 1/19/2011 11:20 PM, Toshio Kuratomi wrote:
> On Wed, Jan 19, 2011 at 09:02:17PM -0800, Glenn Linderman wrote:
>> On 1/19/2011 8:39 PM, Toshio Kuratomi wrote:
>>
>>> use this::
>>>
>>>    import cafe as café
>>>
>>> When you do things this way you do not have to translate between unknown
>>> encodings into unicode.  Everything is within python source where you have
>>> a defined encoding.
>>
>> This is a great way of converting non-portable module names, if the module
>> ever leaves the bounds of its computer, and runs into problems there.
>
> You're missing a piece here.  If you mandate ascii you can convert to
> a unicode name using "import as" because python knows that it has ascii text
> from the filesystem when it converts it to an abstract unicode string that
> you've specified in the program text.  You cannot go the other way because
> python lacks the information (the encoding of the filename on the
> filesystem) to do the transformation.
>
>> Your demonstration of such an easy solution to the concerns you raise
>> convinces me more than ever that it is acceptable to allow non-ASCII module
>> names.  For those programmers in a single locale environment, it'll just
>> work.  And for those not in a single locale environment, there is your above
>> simple solution to achieve portability without changing large numbers of
>> lines of code.
>
> Does my demonstration that you can't do that mean that it's no longer
> acceptable?  :-)
>
> /me guesses that the relative merits of being forced to write portable code
> vs convenience of writing a module name in your native script still has
> a different balance than in mine, thus the smiley :-)
>
> -Toshio


Sadly, you didn't demonstrate it, you seem to have misunderstood my 
statement, which was probably not all that clear, somehow.  Let me try 
again.


User codes module   café.py, tests, debugs, completes, is happy.
User moves code to a different computer, different locale, no é 
character, module can't be found, is sad.

User renames file to cafefromuser.py, changes the import statement from

import café

to

import cafefromuser as café

module now imports successfully, no other code changes needed.  User is 
happy again, thanks Toshio for great solution to file system encoding 
problem.
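
The rename-and-alias workaround can be sketched end to end. This assumes a
throwaway temporary directory stands in for the user's project; the file name
cafefromuser.py comes from the message above, while MENU and the function
name are made up for illustration:

```python
import importlib
import pathlib
import sys
import tempfile


def import_ascii_module_as_alias():
    """Create an ASCII-named module on disk and import it under a non-ASCII alias."""
    with tempfile.TemporaryDirectory() as tmp:
        # The file name is pure ASCII, so no filesystem encoding is involved.
        pathlib.Path(tmp, "cafefromuser.py").write_text(
            "MENU = ['espresso']\n", encoding="utf-8"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # The alias lives only in Python source, which has a declared encoding.
            import cafefromuser as café
        finally:
            sys.path.remove(tmp)
    return café


module = import_ascii_module_as_alias()
```

The rest of the program keeps referring to café, so only the import line
changes.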

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi

On Wed, Jan 19, 2011 at 09:02:17PM -0800, Glenn Linderman wrote:
> On 1/19/2011 8:39 PM, Toshio Kuratomi wrote:
> 
> use this::
> 
>import cafe as café
> 
> When you do things this way you do not have to translate between unknown
> encodings into unicode.  Everything is within python source where you have
> a defined encoding.
> 
> 
> This is a great way of converting non-portable module names, if the module 
> ever
> leaves the bounds of its computer, and runs into problems there.
> 
You're missing a piece here.  If you mandate ascii you can convert to
a unicode name using "import as" because python knows that it has ascii text
from the filesystem when it converts it to an abstract unicode string that
you've specified in the program text.  You cannot go the other way because
python lacks the information (the encoding of the filename on the
filesystem) to do the transformation.

> Your demonstration of such an easy solution to the concerns you raise 
> convinces
> me more than ever that it is acceptable to allow non-ASCII module names.  For
> those programmers in a single locale environment, it'll just work.  And for
> those not in a single locale environment, there is your above simple solution
> to achieve portability without changing large numbers of lines of code.
> 
Does my demonstration that you can't do that mean that it's no longer
acceptable?  :-)

/me guesses that the relative merits of being forced to write portable code
vs convenience of writing a module name in your native script still has
a different balance than in mine, thus the smiley :-)

-Toshio



Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Alexander Belopolsky

On Thu, Jan 20, 2011 at 12:11 AM, Glyph Lefkowitz
 wrote:
..
>> But for local code, having to think up an ASCII name for a module rather
>> than use the obvious native-language name, is just brain-burden when
>> creating the code.
>
> Is it really?  You already had to type 'import', presumably if you can think
> in Python you can think in ASCII.

Yes, it is a burden.  For example, Russian word "щи" can be
transliterated into ASCII as "schi", "shchi", "stchi", or even "wji".
There are many incompatible standards and none of them is well-known or
"natural".  Reading transliterated Cyrillic text is not hard, but
guessing the correct spelling is nearly impossible.  Good programming
style guides recommend avoiding arbitrary contractions in variable
names for the same reason.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Glyph Lefkowitz

On Jan 20, 2011, at 12:19 AM, Glenn Linderman wrote:

> Now if the stuff after m_ was the hex UTF-8 of  "café", that could get 
> interesting :)

(As it happens, it's the hex digest of the MD5 of the UTF-8 of café... ;-))

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Glenn Linderman


On 1/19/2011 9:11 PM, Glyph Lefkowitz wrote:
> On Jan 20, 2011, at 12:02 AM, Glenn Linderman wrote:
>
>> But for local code, having to think up an ASCII name for a module
>> rather than use the obvious native-language name, is just
>> brain-burden when creating the code.
>
> Is it really?  You already had to type 'import', presumably if you can
> think in Python you can think in ASCII.

There is a difference between memorizing and typing keywords, and 
inventing new names in non-native scripts.  It is hard to even invent 
all the names in one's native language; if restricted to inventing them, 
even some of them, in some non-native script such as ASCII, it is just 
brain-burden indeed.

> (After my experiences with namespace crowding in Twisted, I'm inclined
> to suggest something more like "import
> m_07117FE4A1EBD544965DC19573183DA2 as café" - then I never need to
> worry about "café2" looking ugly or "cafe" being incompatible :).)




Now if the stuff after m_ was the hex UTF-8 of  "café", that could get 
interesting :)  But now you are talking about automating the creation of 
ASCII file names from the actual non-ASCII names of the modules, or 
something.  Sadly, the module is not required to contain its name, so if 
it differs from the filename, some global view or non-Python annotation 
would be required to create/maintain the mapping.  [This paragraph is 
only semi-serious, like yours.]
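
For the curious, both naming schemes are easy to sketch. The two function
names below are hypothetical. Whether the MD5 variant reproduces the exact
digest quoted above depends on how "café" was normalized, which is not
checked here:

```python
import hashlib


def hex_utf8_module_name(name):
    """ASCII module name built from the hex of the UTF-8 bytes (reversible)."""
    return "m_" + name.encode("utf-8").hex().upper()


def md5_module_name(name):
    """ASCII module name built from the MD5 of the UTF-8 bytes (not reversible)."""
    return "m_" + hashlib.md5(name.encode("utf-8")).hexdigest().upper()


def name_from_hex_utf8(module_name):
    """Recover the original name from a hex_utf8_module_name() result."""
    return bytes.fromhex(module_name[2:]).decode("utf-8")


hex_name = hex_utf8_module_name("caf\u00e9")
md5_name = md5_module_name("caf\u00e9")
```

The hex form can be mapped back to the original name without any external
table; the MD5 form needs the mapping stored somewhere.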




Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Alexander Belopolsky

On Wed, Jan 19, 2011 at 11:39 PM, Toshio Kuratomi  wrote:
..
> Teaching students to write non-portable code (relying on filesystem encoding
> where your solution is, don't upload to pypi anything that has non-ascii
> filenames) seems like the exact opposite of how you'd want to shape a young
> student's understanding of good programming practices.
>

Let's not confuse language definition with the quality of
implementation.   It would be a perfectly valid Python implementation
that would run on a system that does not even have a traditional
filesystem and "import foo" looks up foo module code in an in-memory
database.   Should Python be redefined so that module names are case
insensitive simply because case insensitive filesystems are still
popular?  I don't think so.
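
To illustrate the point, here is a rough sketch of an importer that never
touches a filesystem, written against the current importlib API (which is
newer than this thread). SOURCES, InMemoryFinder and SPECIAL are invented
names:

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory "database" mapping module names to source text.
SOURCES = {"caf\u00e9": "SPECIAL = 'croissant'\n"}


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finds and loads modules from a dict; no filename encoding is involved."""

    def __init__(self, sources):
        self._sources = sources

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self._sources:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the normal finders handle everything else

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        source = self._sources[module.__name__]
        code = compile(source, "<memory:%s>" % module.__name__, "exec")
        exec(code, module.__dict__)


sys.meta_path.insert(0, InMemoryFinder(SOURCES))

import café  # resolved by the finder above, not by a file on disk
```

The module name stays a unicode string from the import statement to the
lookup key, so the question of a filesystem encoding never arises.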

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Glyph Lefkowitz


On Jan 20, 2011, at 12:02 AM, Glenn Linderman wrote:

> But for local code, having to think up an ASCII name for a module rather than 
> use the obvious native-language name, is just brain-burden when creating the 
> code.

Is it really?  You already had to type 'import', presumably if you can think in 
Python you can think in ASCII.

(After my experiences with namespace crowding in Twisted, I'm inclined to 
suggest something more like "import m_07117FE4A1EBD544965DC19573183DA2 as café" 
- then I never need to worry about "café2" looking ugly or "cafe" being 
incompatible :).)


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Glenn Linderman


On 1/19/2011 8:39 PM, Toshio Kuratomi wrote:
> use this::
>
>    import cafe as café
>
> When you do things this way you do not have to translate between unknown
> encodings into unicode.  Everything is within python source where you have
> a defined encoding.


This is a great way of converting non-portable module names, if the 
module ever leaves the bounds of its computer, and runs into problems there.


It may be that the best practices for writing platform portable modules 
should include

* ASCII module filenames
* Code that can handle 16 or 32 bit Unicode
* and likely some other things.

But for local code, having to think up an ASCII name for a module rather 
than use the obvious native-language name, is just brain-burden when 
creating the code.


Your demonstration of such an easy solution to the concerns you raise 
convinces me more than ever that it is acceptable to allow non-ASCII 
module names.  For those programmers in a single locale environment, 
it'll just work.  And for those not in a single locale environment, 
there is your above simple solution to achieve portability without 
changing large numbers of lines of code.


Glenn

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Alexander Belopolsky

On Wed, Jan 19, 2011 at 9:07 PM, Toshio Kuratomi  wrote:
..
> Do you have a solution to the problem?  I haven't looked at your patch so
> perhaps you have an ingenious method of translating from the unicode
> representation of the module in the import statement to the bytes in
> arbitrary encodings on the filesystem that I haven't thought of.

If I understand what Victor's patch does correctly, it allows Python
on Windows to bypass translation from Unicode to bytes by using
Windows "wide character" APIs.

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi

On Thu, Jan 20, 2011 at 03:51:05AM +0100, Victor Stinner wrote:
> For a lesson at school, it is nice to write examples in the
> mother language, instead of using "raw" english with ASCII identifiers
> and filenames.

Then use this::
   import cafe as café

When you do things this way you do not have to translate between unknown
encodings into unicode.  Everything is within python source where you have
a defined encoding.

Teaching students to write non-portable code (relying on the filesystem
encoding, where your solution is "don't upload anything that has non-ascii
filenames to pypi") seems like the exact opposite of how you'd want to shape
a young student's understanding of good programming practices.

> In a school, you can use the same configuration
> (encoding) on all computers.
> 
In a school computer lab perhaps.  But not on all the students' and
professors' machines.  How many professors will be cursing python when they
discover that the example code that they wrote on their Linux workstation
doesn't work when the students try to use it in their windows computer lab?
How many students will be upset when the code they turn in runs on their
professor's test machine if the lab computers were booted into the Linux
partition but not if they were booted into Windows?

> 
> > > > * Specify an encoding per platform and stick to that.
> > > 
> > > It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> > > programs will use it.
> > > 
> > (...) This prevents getting a mixture of encodings of modules (...)
> 
> If you have an issue with encodings, you have to fix it when you create
> a module (on disk), not when you load a module (it is too late).
> 
It's not too late to throw a clear error of what's wrong.

> > I haven't looked at your patch so
> > perhaps you have an ingenious method of translating from the unicode
> > representation of the module in the import statement to the bytes in
> > arbitrary encodings on the filesystem that I haven't thought of.
> 
> On Windows, My patch tries to avoid any conversion: it uses unicode
> everywhere.
> 
> On other OSes, it uses the Python filesystem encoding to encode a module
> name (as it is done for any other operation on the filesystem with an
> unicode filename).
> 
The other interfaces are somewhat of a red herring here.  As I wrote in
another email, importing modules has ramifications that open(), for
instance, does not.  Additionally, those other filesystem operations have
been growing the ability to take byte values and encoding parameters because
unicode translation via a single filesystem encoding is a good default but
not a complete solution.

I think that this problem demands a complete solution, however, and it seems
to me that limiting the scope of the problem is the most pleasant method to
accomplish this.  Your solution creates modules which aren't portable.  One
of my proposals creates python code which isn't portable.  The other one
suffers some of the same disadvantages as your solution in portability but
allows for tools that could automatically correct modules.

> --
> 
> Python 3 supports bytes filename to be able to read/copy/delete
> undecodable filenames, filenames stored in a encoding different than the
> system encoding, broken filenames. It is also possible to access these
> files using PEP 383 (with surrogate characters). This is useful to use
> Python on an old system.
> 
> > If you don't, however, then really - ASCII-only seems like the sanest 
> > of the three solutions I can think of.
> 
> But a (Python 3) module is not supposed to have a broken filename. If that
> is the case, you had better fix its name, instead of trying to fix
> the problem later (in Python).
> 
We agree that there should not be broken module names.  However it seems we
very hotly disagree about the definition of that.  You think that if
a module is named appropriately on one system but is not portable to another
system, that's fine.  I think that portability between systems is very
important and sacrificing that so that someone can locally use a module with
non-ASCII characters doesn't have a justifiable reward.

> With UTF-8 filesystem encoding (eg. on Mac OS X, and most Linux setups),
> it is already possible to use non-ASCII module names.
> 
Tangent: This is not true about Linux.  UTF-8 is a matter of the
interpretation of the filesystem bytes that the user specifies by setting
their system locale.  Setting system locale to ASCII for use in system-wide
scripts, is quite common as is changing locale settings in other parts of
the world (as I can tell you from the bug reports colleagues CC me on to fix
for the problems with unicode support in their python2 programs).  Allowing
module names incompatible with ascii without specifying an encoding will
just lead to bug reports down the line.

Relatively few programmers understand the difference between the python
unicode abstraction and the byte representations possible for those strings.
Allowing

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Victor Stinner

On Wednesday 19 January 2011 at 18:07 -0800, Toshio Kuratomi wrote:
> Saying that multiple encodings on a single system is a misconfiguration
> every time it comes up does not make it true.

Yes, each filesystem can have its own encoding. For example, this is
supported by Linux. Python doesn't support such a configuration, but this
limitation is wider than the import machinery. If you consider it important
enough, please open an issue.

> To the existing list I'd add getting a package from pypi --
> neither tar nor zip files contain encoding information about the filenames.

ZIP contains a flag to indicate the encoding: cp437 or UTF-8.

TAR has an extension called "PAX" which stores filenames as UTF-8. But
yes, most tarballs store filenames as raw byte strings.
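
Both points can be checked with the standard library. This is a sketch using
in-memory archives; the helper names are made up, and 0x800 is the
general-purpose flag bit that the ZIP format uses to mark UTF-8 names:

```python
import io
import tarfile
import zipfile

UTF8_FLAG = 0x800  # ZIP general-purpose bit 11: name is UTF-8


def zip_utf8_flags(names):
    """Write empty members to an in-memory ZIP and report the UTF-8 flag of each."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    buf.seek(0)
    with zipfile.ZipFile(buf) as archive:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in archive.infolist()
        }


def pax_roundtrip(name):
    """Store one empty member in a PAX-format tar and read the names back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as archive:
        info = tarfile.TarInfo(name)
        info.size = 0
        archive.addfile(info, io.BytesIO(b""))
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as archive:
        return archive.getnames()


flags = zip_utf8_flags(["cafe.py", "caf\u00e9.py"])
tar_names = pax_roundtrip("caf\u00e9.py")
```

Python's zipfile sets the flag only for names that cannot be stored as ASCII,
so pure-ASCII members are unaffected.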

Anyway, if you would like to share your code on PyPI, you should not use
non-ASCII module names (or any other non-ASCII name/identifier :-)).

Python 3 supports non-ASCII identifiers (PEP 3131), but the developer is
responsible for deciding whether to use them, depending on the
audience. For a lesson at school, it is nice to write examples in the
mother language, instead of using "raw" English with ASCII identifiers
and filenames. In a school, you can use the same configuration
(encoding) on all computers.


> > > * Specify an encoding per platform and stick to that.
> > 
> > It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> > programs will use it.
> > 
> (...) This prevents getting a mixture of encodings of modules (...)

If you have an issue with encodings, you have to fix it when you create
a module (on disk), not when you load a module (it is too late).

> (...) I mean something at the python code level::
> 
>import café encoded_as('latin1')

Import a module using its byte name? You mean that the café filename was
not encoded to the Python filesystem encoding, but to another (wrong)
encoding, when the module was created. As written before, you should
fix your filename, instead of using an (ugly) workaround in Python.

> I haven't looked at your patch so
> perhaps you have an ingenious method of translating from the unicode
> representation of the module in the import statement to the bytes in
> arbitrary encodings on the filesystem that I haven't thought of.

On Windows, my patch tries to avoid any conversion: it uses unicode
everywhere.

On other OSes, it uses the Python filesystem encoding to encode a module
name (as it is done for any other operation on the filesystem with an
unicode filename).

--

Python 3 supports bytes filenames to be able to read/copy/delete
undecodable filenames, filenames stored in an encoding different from the
system encoding, and broken filenames. It is also possible to access these
files using PEP 383 (with surrogate characters). This is useful when using
Python on an old system.
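
A small sketch of the PEP 383 mechanism, using the "surrogateescape" error
handler directly and assuming a UTF-8 filesystem encoding that meets a
Latin-1 encoded name:

```python
# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.py"

# PEP 383: the undecodable byte becomes the lone surrogate U+DCE9.
as_text = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The text form is good enough to pass back to the OS, but the name is still
not "café" as far as an import statement is concerned.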

> If you don't, however, then really - ASCII-only seems like the sanest 
> of the three solutions I can think of.

But a (Python 3) module is not supposed to have a broken filename. If that
is the case, you had better fix its name, instead of trying to fix
the problem later (in Python).

With UTF-8 filesystem encoding (eg. on Mac OS X, and most Linux setups),
it is already possible to use non-ASCII module names.

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi

On Thu, Jan 20, 2011 at 01:26:01AM +0100, Victor Stinner wrote:
> On Wednesday 19 January 2011 at 15:44 -0800, Toshio Kuratomi wrote:
> > Additionally, many unix filesystems don't specify a filesystem encoding for
> > filenames; they deal in legal and illegal bytes which could lead to
> > troubles.  This problem of which encoding to use is a problem that can be
> > seen on UNIX systems even now.
> 
> If the system is not correctly configured, it is not a bug in Python,
> but a bug in the system config. Python relies on the locale to choose
> the filesystem encoding (sys.getfilesystemencoding()). Python uses this
> encoding to decode and encode all filenames.
> 
Saying that multiple encodings on a single system is a misconfiguration
every time it comes up does not make it true.  There have been multiple
examples of how you can end up with multiple encodings of filenames on
a single system listed in past threads: multiple users with different
encodings for their locales, mounting remote filesystems, downloading
a file.  To the existing list I'd add getting a package from pypi --
neither tar nor zip files contain encoding information about the filenames.
Therefore if I create an sdist of a python module using non-ascii filenames
using a locale of latin1 and then upload to pypi, people downloading that
on a utf-8 using locale will end up not being able to use the module.
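
The failure mode can be sketched with plain bytes, leaving tar and pypi out
of the picture. The helper decode_filename is hypothetical; it only mimics a
strict decode of a name read back from disk:

```python
def decode_filename(raw_name, encoding):
    """Decode on-disk bytes with a locale's encoding; None if they are invalid."""
    try:
        return raw_name.decode(encoding)
    except UnicodeDecodeError:
        return None


# The author packages the module under a latin1 locale ...
latin1_bytes = "caf\u00e9.py".encode("latin-1")
# ... and a user under a utf-8 locale cannot decode the name at all.
seen_on_utf8_system = decode_filename(latin1_bytes, "utf-8")

# The reverse direction decodes, but to the wrong name (mojibake).
utf8_bytes = "caf\u00e9.py".encode("utf-8")
seen_on_latin1_system = decode_filename(utf8_bytes, "latin-1")
```

Either way "import café" looks for a name that does not match what is on
disk.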

> > * Specify an encoding per platform and stick to that.
> 
> It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> programs will use it.
> 
The proposal is that you ignore that when talking about loading and creating
python modules.  (I mentioned distutils because my thought was that distutils
could grow the ability to translate from the system locale to a chosen neutral
encoding when running any of the setup.py dist commands, but that doesn't
address the issue when testing a module that you've just written, so perhaps
that's not necessary.)  Python modules would have a set of defined
filesystem encodings per system.  This prevents getting a mixture of
encodings of modules and having things work in one location but fail when
used somewhere else.  Instead, you get an upfront failure until you correct
the encoding.

> Anyway, I don't see why it is a problem to have different encodings on
> different systems. Each system can use its own encoding. The bug that
> I'm trying to solve is a Python bug, not an OS bug.
> 
There is no OS bug here.  There is perhaps an OS design flaw but it's not
a flaw that will be going away soon (in part, because the present OS
designers do not see it as an OS flaw... to them it's a bug in code that
attempts to build a simpler interface on top of it.)

> > * Change import semantics to allow specifying the encoding of the module on
> >   the filesystem (seems really icky).
> 
> This is a very bad idea. I introduced PYTHONFSENCODING environment
> variable in Python 3.2, but then quickly removed it, because it
> introduced a lot of inconsistencies.
> 
Thanks for getting rid of that, PYTHONFSENCODING is a bad idea because it
doesn't solve the underlying issues.  However, when I say specifying the
encoding of the module on the filesystem, I don't mean something global like
PYTHONFSENCODING -- I mean something at the python code level::

   import café encoded_as('latin1')

After thinking about this one, though, I don't think it will work either.
This takes care of importing modules where the fs encoding of the module is
known but it doesn't where the fs encoding may be translated between
platforms.  I believe that this could arise when untarring a module on
windows using winzip or similar that gives you the option of translating
from utf-8 bytes into bytes that have meaning as characters on that
platform, for instance.
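
For readers wondering what decoupling the module name from the on-disk
name could look like: the sketch below uses importlib.util, an API that
was added in Python 3.4 and so was not available when this thread was
written.  It is not the encoded_as syntax proposed above; the helper
load_module_from_path and the file name cafe_on_disk.py are made up for
illustration.

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, path):
    """Hypothetical helper: load `path` under an explicitly chosen module name."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    # The name on disk is plain ASCII; the module name is not.
    path = os.path.join(tmp, "cafe_on_disk.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write("greeting = 'hi'\n")
    cafe = load_module_from_path("café", path)

print(cafe.greeting)         # hi
print(ascii(cafe.__name__))  # 'caf\xe9'
```

This sidesteps the import statement entirely, so it does not answer the
question of how "import café" should find its file.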

Do you have a solution to the problem?  I haven't looked at your patch so
perhaps you have an ingenious method of translating from the unicode
representation of the module in the import statement to the bytes in
arbitrary encodings on the filesystem that I haven't thought of.  If you
don't, however, then really - ASCII-only seems like the sanest of the three
solutions I can think of.

-Toshio



Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi

On Wed, Jan 19, 2011 at 07:11:52PM -0500, James Y Knight wrote:
> On Jan 19, 2011, at 6:44 PM, Toshio Kuratomi wrote:
> > This problem of which encoding to use is a problem that can be
> > seen on UNIX systems even now.  Try this:
> > 
> >  echo 'print("hi")' > café.py
> >  convmv -f utf-8 -t latin1 café.py
> >  python3 -c 'import café'
> > 
> > ASCII seems very sensible to me when faced with these ambiguities.
> > 
> > Other options I can brainstorm that could be explored:
> > 
> > * Specify an encoding per platform and stick to that.  (So, for instance,
> >  all module names on posix platforms would have to be utf-8).  Force
> >  translation between encoding when installing packages (But that doesn't
> >  help for people that are creating their modules using their own build
> >  scripts rather than distutils, copying the files using raw tar, etc.)
> > * Change import semantics to allow specifying the encoding of the module on
> >  the filesystem (seems really icky).
> 
> None of this is unique to import -- the same exact issue occurs with 
> open(u'café'). I don't see any reason why import café should be thought of as 
> more of a problem, or treated any differently.
> 
It's unique in several ways:

1) With open, you can specify a byte string::
   open(b'caf\xe9.py').read()

   I don't know of any way to do that with import.
   This is needed when the filename is not compatible with your current
   locale.

2) import assigns a name to the module that it imports whereas open lets the
   programmer assign the name.  So even if you can specify what to use as
   a byte string for this filename on this particular filesystem you'd still
   end up with some ugly pseudo-representation of bytes when attempting to
   access it in code::
   import caf\xe9

   caf\xe9.do_something()
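
A small sketch of point 2, assuming the surrogateescape error handler
(PEP 383) that Python 3 uses for undecodable filename bytes on POSIX.
The escaped name round-trips back to the original bytes, but it is not
a valid identifier, so it could never appear in an import statement.

```python
raw_name = b"caf\xe9"  # latin1 bytes for the name, undecodable as utf-8

# surrogateescape smuggles the undecodable byte through as a lone surrogate.
escaped = raw_name.decode("utf-8", errors="surrogateescape")

print(ascii(escaped))          # 'caf\udce9'
print(escaped.isidentifier())  # False
print(escaped.encode("utf-8", errors="surrogateescape") == raw_name)  # True
```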

-Toshio



Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread James Y Knight

On Jan 19, 2011, at 6:44 PM, Toshio Kuratomi wrote:
> This problem of which encoding to use is a problem that can be
> seen on UNIX systems even now.  Try this:
> 
>  echo 'print("hi")' > café.py
>  convmv -f utf-8 -t latin1 café.py
>  python3 -c 'import café'
> 
> ASCII seems very sensible to me when faced with these ambiguities.
> 
> Other options I can brainstorm that could be explored:
> 
> * Specify an encoding per platform and stick to that.  (So, for instance,
>  all module names on posix platforms would have to be utf-8).  Force
>  translation between encoding when installing packages (But that doesn't
>  help for people that are creating their modules using their own build
>  scripts rather than distutils, copying the files using raw tar, etc.)
> * Change import semantics to allow specifying the encoding of the module on
>  the filesystem (seems really icky).

None of this is unique to import -- the same exact issue occurs with 
open(u'café'). I don't see any reason why import café should be thought of as 
more of a problem, or treated any differently.

It's reasonable to recommend that people use ASCII in their module names if 
they want wide portability, but it should still be supported to use non-ASCII.

James

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Victor Stinner

Le mercredi 19 janvier 2011 à 15:44 -0800, Toshio Kuratomi a écrit : 
> Additionally, many unix filesystem don't specify a filesystem encoding for
> filenames; they deal in legal and illegal bytes which could lead to
> troubles.  This problem of which encoding to use is a problem that can be
> seen on UNIX systems even now.

If the system is not correctly configured, it is not a bug in Python,
but a bug in the system config. Python relies on the locale to choose
the filesystem encoding (sys.getfilesystemencoding()). Python uses this
encoding to decode and encode all filenames.

> * Specify an encoding per platform and stick to that.

It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
programs will use it.

Anyway, I don't see why it is a problem to have different encodings on
different systems. Each system can use its own encoding. The bug that
I'm trying to solve is a Python bug, not an OS bug.

> * Change import semantics to allow specifying the encoding of the module on
>   the filesystem (seems really icky).

This is a very bad idea. I introduced PYTHONFSENCODING environment
variable in Python 3.2, but then quickly removed it, because it
introduced a lot of inconsistencies.

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Terry Reedy


On 1/19/2011 6:05 PM, Alexander Belopolsky wrote:
> On Wed, Jan 19, 2011 at 5:47 PM, Brett Cannon  wrote:
> ..
>>> Indeed.  Last time I looked, we still had cProfile in stdlib.
>>
>> Yes, but that is because no one got around to hiding cProfile behind
>> profile before we released Python 3.0. I would still like to see it
>> (slowly) go away from being directly visible.
>
> Another big offender is the idlelib package.  Most of the modules
> there are in mixed case.

Given that the individual modules are not documented and that the only 
programs importing the individual modules are other idlelib modules 
(true?) then a rename should be possible. On the other hand, the same 
facts sort of make it unnecessary ;-).


--
Terry Jan Reedy


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Terry Reedy


On 1/19/2011 6:44 PM, Toshio Kuratomi wrote:
>> I believe we now have the situation that a package that works on *nix
>> could fail on Windows, whereas I believe that patch would *improve*
>> portability.
>
> I'm not so sure about this

Forget that claim if it is not true. The patch will certainly improve 
consistency within a box so that files that run can also be imported.


--
Terry Jan Reedy


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Toshio Kuratomi

On Wed, Jan 19, 2011 at 04:40:24PM -0500, Terry Reedy wrote:
> On 1/19/2011 4:05 PM, Simon Cross wrote:
> 
> >I have no problem with non-ASCII module identifiers being valid
> >syntax. It's a question of whether attempting to translate a non-ASCII
> 
> If the names are the same, ie, produced with the same sequence of
> keystrokes in the save-as box and importing box, then there is no
> translation, at least from the user's view.
> 
> >module name into a file name (so the file can be imported) is a good
> >idea and whether these sorts of files can be safely transferred among
> >diverse filesystems.
> 
> I believe we now have the situation that a package that works on *nix
> could fail on Windows, whereas I believe that patch would *improve*
> portability.
> 
I'm not so sure about this.  You may have something that works on Windows
and on *NIX under certain circumstances but it seems likely to fail when
moving files between them (for instance, as packages downloaded from pypi).
Additionally, many unix filesystems don't specify a filesystem encoding for
filenames; they deal in legal and illegal bytes which could lead to
troubles.  This problem of which encoding to use is a problem that can be
seen on UNIX systems even now.  Try this:

  echo 'print("hi")' > café.py
  convmv -f utf-8 -t latin1 café.py
  python3 -c 'import café'

ASCII seems very sensible to me when faced with these ambiguities.

Other options I can brainstorm that could be explored:

* Specify an encoding per platform and stick to that.  (So, for instance,
  all module names on posix platforms would have to be utf-8).  Force
  translation between encoding when installing packages (But that doesn't
  help for people that are creating their modules using their own build
  scripts rather than distutils, copying the files using raw tar, etc.)
* Change import semantics to allow specifying the encoding of the module on
  the filesystem (seems really icky).

-Toshio



Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Alexander Belopolsky

On Wed, Jan 19, 2011 at 5:47 PM, Brett Cannon  wrote:
..
>> Indeed.  Last time I looked, we still had cProfile in stdlib.
>
> Yes, but that is because no one got around to hiding cProfile behind
> profile before we released Python 3.0. I would still like to see it
> (slowly) go away from being directly visible.
>

Another big offender is the idlelib package.  Most of the modules
there are in mixed case.

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Brett Cannon

On Wed, Jan 19, 2011 at 14:23, Alexander Belopolsky
 wrote:
> On Wed, Jan 19, 2011 at 4:40 PM, Terry Reedy  wrote:
> ..
>>> For similar reasons we tend to avoid capital letters in module names.
>>
>> That is a stdlib style guide followed by many, but intentionally not
>> enforced.
>
> Indeed.  Last time I looked, we still had cProfile in stdlib.

Yes, but that is because no one got around to hiding cProfile behind
profile before we released Python 3.0. I would still like to see it
(slowly) go away from being directly visible.

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Alexander Belopolsky

On Wed, Jan 19, 2011 at 4:40 PM, Terry Reedy  wrote:
..
>> For similar reasons we tend to avoid capital letters in module names.
>
> That is a stdlib style guide followed by many, but intentionally not
> enforced.

Indeed.  Last time I looked, we still had cProfile in stdlib.

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Victor Stinner

Le mercredi 19 janvier 2011 à 12:19 -0800, Glenn Linderman a écrit :
> Since Python allows non-ASCII variable names, I think it should allow 
> non-ASCII module names also, on any platform that supports the 
> appropriate characters in the filesystem.
> 
> Since some platforms already accept them, dropping them would be 
> incompatible.

ok

> If Victor already has a patch coded (i.e. the work is basically done, no 
> waiting in line 3), I'm even more in favor of it.  If it took lots of 
> future hard work, and no one volunteered to do it, that would perhaps be 
> justification for retaining module name restrictions.  I guess that is 
> not the case here, so...

I volunteer to do the work, and I already have a working patch (but
it is not ready to be committed yet; it requires a long review).

FYI, I have rewritten the patch 4 times over the past year, for different
reasons:

 - the patch is huge and complex, and I was unable to "write it correctly"
   the first time
 - I split the work into two big parts: support for non-ASCII paths
   (done in Python 3.2) and the other changes in part two
 - Updating a huge patchset on the py3k tree is hard, even with git-svn
   (and git svn rebase)
 - In my first tries, I didn't patch the import machinery to support
   non-ASCII module names, I only patched the support for non-ASCII
   paths

But I don't want to apply such a huge patch if Python core developers
don't want to support non-ASCII module names and unencodable paths.

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Terry Reedy


On 1/19/2011 4:05 PM, Simon Cross wrote:

> I have no problem with non-ASCII module identifiers being valid
> syntax. It's a question of whether attempting to translate a non-ASCII

If the names are the same, ie, produced with the same sequence of 
keystrokes in the save-as box and importing box, then there is no 
translation, at least from the user's view.

> module name into a file name (so the file can be imported) is a good
> idea and whether these sorts of files can be safely transferred among
> diverse filesystems.

I believe we now have the situation that a package that works on *nix 
could fail on Windows, whereas I believe that patch would *improve* 
portability.

> For similar reasons we tend to avoid capital letters in module names.

That is a stdlib style guide followed by many, but intentionally not 
enforced.



--
Terry Jan Reedy


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Georg Brandl

Am 19.01.2011 21:32, schrieb Terry Reedy:
> On 1/19/2011 7:34 AM, Victor Stinner wrote:
>> Hi,
>>
>> I patched Python 3.2 to support modules with non-ASCII paths (*). It
>> works well on all operating systems. But the task is not completly
>> done:
>>
>> (a) Python 3 doesn't support non-ASCII module names (b) Python 3
>> doesn't support unencodable characters in the module path
>>
>> I would like to know if we need to support that. Terry J. Reedy
>> wrote (issue #10828): "I think bugs in core syntax should have high
>> priority. I appreciate your work toward fixing it."
> 
> I am a little shocked at the so-far tepid response to (a), so let me
> defend and explain my claim that it is a bug.
> 
> In the simplest case (from 6.11. The import statement and  2.3. 
> Identifiers and keywords)
> 
> import_stmt ::= "import" module
> module  ::= indentifier
> identifier  ::= 
> 
> There is nothing, nothing, about any restriction on identifiers.

+1.  The restriction on valid identifiers is very sensible (obviously,
since "m" needs to be accessible after "import m"), but a further
restriction seems just arbitrary.


Georg


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Simon Cross

On Wed, Jan 19, 2011 at 10:32 PM, Terry Reedy  wrote:
> I am a little shocked at the so-far tepid response to (a), so let me
> defend and explain my claim that it is a bug.
>
> In the simplest case (from 6.11. The import statement and  2.3. Identifiers
> and keywords)
>
> import_stmt ::= "import" module
> module      ::= indentifier
> identifier  ::= 
>
> There is nothing, nothing, about any restriction on identifiers.

I have no problem with non-ASCII module identifiers being valid
syntax. It's a question of whether attempting to translate a non-ASCII
module name into a file name (so the file can be imported) is a good
idea and whether these sorts of files can be safely transferred among
diverse filesystems.

For similar reasons we tend to avoid capital letters in module names.

Schiavo
Simon

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Victor Stinner

Le mercredi 19 janvier 2011 à 13:38 -0500, Alexander Belopolsky a
écrit :
> PEP 3131 does not distinguish between different types of identifiers,
> so I think it assumes that non-ascii module names should be supported.

My opinion is that we should support non-ASCII module names and
unencodable paths if it doesn't introduce an overhead (make Python
slower and add a lot of code). My patch adds ~400 lines of code (I think
that it is small: the patch adds many functions), but I think that it
makes Python as fast, or maybe faster.

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Terry Reedy


On 1/19/2011 7:34 AM, Victor Stinner wrote:
> Hi,
>
> I patched Python 3.2 to support modules with non-ASCII paths (*). It
> works well on all operating systems. But the task is not completely
> done:
>
> (a) Python 3 doesn't support non-ASCII module names (b) Python 3
> doesn't support unencodable characters in the module path
>
> I would like to know if we need to support that. Terry J. Reedy
> wrote (issue #10828): "I think bugs in core syntax should have high
> priority. I appreciate your work toward fixing it."


I am a little shocked at the so-far tepid response to (a), so let me
defend and explain my claim that it is a bug.

In the simplest case (from 6.11. The import statement and  2.3. 
Identifiers and keywords)


import_stmt ::= "import" module
module  ::= identifier
identifier  ::= 

There is nothing, nothing, about any restriction on identifiers.

The rest of 6.11 discusses the complex import algorithm but leaves out 
the simple semantics that cover 99% of cases (import a ???.py file in a 
directory on sys.path), and never mentions ".py".


So let's go to Tutorial 6. Modules, which does explain the simple case: "A 
module is a file containing Python definitions and statements. The file 
name is the module name with the suffix .py appended." So, if xyz is a 
legal identifier and xyz.py exists on sys.path, it is reasonable from 
the docs to expect 'import xyz' to work. (Sys.path is mentioned in the 
reference.)


But we now have the following possibility:

Let xyz.py be

def double(x): return 2*x

if __name__=="__main__":
  if double(2) == 4: print("test passed")

We run the file, get "test passed", and write zyx.py:

import xyz
...

We run zyx and Python says "No module named xyz".

Bad, and quite puzzling to anyone who does not understand the subtle 
difference between running and importing a file.
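
As a minimal sketch of the rule the docs describe (not the import
machinery itself), the hypothetical helper module_filename below applies
"module name plus .py" to any legal identifier.  The non-ASCII names are
the ones used elsewhere in this thread; nothing in the rule singles
them out.

```python
def module_filename(module_name):
    """Hypothetical helper: the tutorial's rule, module name plus '.py'."""
    if not module_name.isidentifier():
        raise ValueError("not a valid identifier: %r" % (module_name,))
    return module_name + ".py"


# ASCII and non-ASCII identifiers are treated alike by the rule.
for candidate in ("xyz", "gui_jämföra", "саша"):
    print(ascii(module_filename(candidate)))
```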


--
Terry Jan Reedy


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Glenn Linderman


On 1/19/2011 11:31 AM, Victor Stinner wrote:
> If we decide to reject non-ASCII module names, it should be done on any
> operating systems, not only on Windows.


Since Python allows non-ASCII variable names, I think it should allow 
non-ASCII module names also, on any platform that supports the 
appropriate characters in the filesystem.


Since some platforms already accept them, dropping them would be 
incompatible.


If Victor already has a patch coded (i.e. the work is basically done, no 
waiting in line 3), I'm even more in favor of it.  If it took lots of 
future hard work, and no one volunteered to do it, that would perhaps be 
justification for retaining module name restrictions.  I guess that is 
not the case here, so...


+1  on supporting full Unicode module names on all platforms.

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Victor Stinner

Le mercredi 19 janvier 2011 à 10:42 -0800, Brett Cannon a écrit :
> > I am not sure what exactly is not supported.  On my OSX system:
> 
> Victor said this is a Windows-specific issue.

Quoting myself: "(a) (...) doesn't work with a locale encoding different than
UTF-8"

Hum, it's not exactly the locale encoding, but the Python filesystem
encoding. On Mac OS X, this encoding is *hardcoded* to UTF-8, so it is
possible to use non-ASCII module names on this OS. It is also possible
on other BSD/UNIX systems using UTF-8 locale encoding.
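
To see which case a given machine falls into, a quick check along these
lines may help.  It is only a sketch: it reports the two encodings and
whether the filesystem encoding normalizes to UTF-8, and the output
depends on the platform and locale it runs under.

```python
import codecs
import locale
import sys

fs_encoding = sys.getfilesystemencoding()             # used for filenames
locale_encoding = locale.getpreferredencoding(False)  # from the current locale

# codecs.lookup() normalizes aliases such as "UTF8" or "utf_8" to "utf-8".
fs_is_utf8 = codecs.lookup(fs_encoding).name == "utf-8"

print("filesystem encoding:", fs_encoding)
print("locale encoding:", locale_encoding)
print("filesystem encoding is UTF-8:", fs_is_utf8)
```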

But this issue only concerns any BSD/UNIX using a locale encoding
different than UTF-8. Eg. MvL's buildbot (x86 debian parallel 3.x) uses
ISO-8859-15 (see #10492, issue fixed 13 days ago). Even if UTF-8 becomes
a de facto standard locale encoding, many systems still use something
else. And Python 2 users will complain that their script works with
Python 2 but not with Python 3 :-)

If we decide to reject non-ASCII module names, it should be done on all
operating systems, not only on Windows.

Victor


Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Alexander Belopolsky

On Wed, Jan 19, 2011 at 1:42 PM, Brett Cannon  wrote:
..
>> I am not sure what exactly is not supported.  On my OSX system:
>
> Victor said this is a Windows-specific issue.

I missed that part.  In this case, I change my vote to +0 to reflect
my lack of knowledge or exposure to Windows-only issues.  However, if
Victor's patch simplifies the code (as many of his changes in this
area do), I will be happy to review it.

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Brett Cannon

On Wed, Jan 19, 2011 at 10:38, Alexander Belopolsky
 wrote:
> On Wed, Jan 19, 2011 at 1:23 PM, Brett Cannon  wrote:
> ..
>>>>  (a) Python 3 doesn't support non-ASCII module names
> ..
>> -0 from me (unless the Unicode variable naming PEP says otherwise).
>>
>
> I am not sure what exactly is not supported.  On my OSX system:

Victor said this is a Windows-specific issue.

-Brett

>
> $ ./python.exe
> Python 3.2b2+ ..
>
> >>> import саша
> >>> саша.foo
> 42
> >>> from саша import foo
> >>> foo
> 42
>
>
> PEP 3131 does not distinguish between different types of identifiers,
> so I think it assumes that non-ascii module names should be supported.
>
> +1 on fixing any remaining bugs
>

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Alexander Belopolsky

On Wed, Jan 19, 2011 at 1:23 PM, Brett Cannon  wrote:
..
>>>  (a) Python 3 doesn't support non-ASCII module names
..
> -0 from me (unless the Unicode variable naming PEP says otherwise).
>

I am not sure what exactly is not supported.  On my OSX system:

$ ./python.exe
Python 3.2b2+ ..

>>> import саша
>>> саша.foo
42
>>> from саша import foo
>>> foo
42


PEP 3131 does not distinguish between different types of identifiers,
so I think it assumes that non-ascii module names should be supported.

+1 on fixing any remaining bugs

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Brett Cannon

On Wed, Jan 19, 2011 at 07:01, Simon Cross
 wrote:
> On Wed, Jan 19, 2011 at 2:34 PM, Victor Stinner
>  wrote:
>>  (a) Python 3 doesn't support non-ASCII module names
>
> -0: I'm vaguely against this being supported because I'd rather not
> have to deal with what happens when the guess regarding the filesystem
> encoding is wrong. On the other hand, a general encouragement to stick
> to ASCII module names is probably functionally equivalent without
> imposing a hard restriction.

-0 from me (unless the Unicode variable naming PEP says otherwise).

>
>>  (b) Python 3 doesn't support unencodable characters in the module path
>
> +1: It'd be nice if Python could import modules regardless of what
> folder names people happen to have on their module path.

+1 from me as well (nervously hoping importlib already supports it  =) .

Re: [Python-Dev] Import and unicode: part two

2011-01-19 Thread Simon Cross

On Wed, Jan 19, 2011 at 2:34 PM, Victor Stinner
 wrote:
>  (a) Python 3 doesn't support non-ASCII module names

-0: I'm vaguely against this being supported because I'd rather not
have to deal with what happens when the guess regarding the filesystem
encoding is wrong. On the other hand, a general encouragement to stick
to ASCII module names is probably functionally equivalent without
imposing a hard restriction.

>  (b) Python 3 doesn't support unencodable characters in the module path

+1: It'd be nice if Python could import modules regardless of what
folder names people happen to have on their module path.

Schiavo
Simon

[Python-Dev] Import and unicode: part two

2011-01-19 Thread Victor Stinner

Hi,

I patched Python 3.2 to support modules with non-ASCII paths (*). It
works well on all operating systems. But the task is not completely done:

 (a) Python 3 doesn't support non-ASCII module names 
 (b) Python 3 doesn't support unencodable characters in the module path

I would like to know if we need to support that. Terry J. Reedy wrote
(issue #10828): "I think bugs in core syntax should have high priority.
I appreciate your work toward fixing it."

I wrote a patch (issue #3080) fixing both points. If you agree that both
issues should be fixed, I will fix them in Python 3.3.

(a) is the issue #10828 reported recently (January 2011): "import
gui_jämföra" doesn't work with a locale encoding other than UTF-8
(so it doesn't work on Windows).

(b) is specific to Windows: FAT32 and NTFS filesystems store filenames
in Unicode, but Python encodes paths to the ANSI code page (which covers a
very small subset of Unicode). If a character cannot be encoded to the
code page, you cannot load a module. E.g. add a Japanese character to a
directory name on a Windows system using the cp1252 (English) code page.
I don't think that (b) has been reported by a user yet; it's more of a
theoretical problem.
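
Both failure modes come down to a name that cannot be encoded to the
encoding in use.  A minimal sketch, where the helper can_encode and the
sample directory name are illustrative assumptions rather than anything
from the patch:

```python
def can_encode(name, encoding):
    """Hypothetical helper: True if `name` encodes strictly to `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# (a) a non-ASCII module name under different encodings
print(can_encode("gui_jämföra", "utf-8"))   # True
print(can_encode("gui_jämföra", "ascii"))   # False

# (b) a Japanese directory name under the cp1252 code page
print(can_encode("日本語", "cp1252"))        # False
print(can_encode("日本語", "utf-8"))         # True
```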

My patch is huge, but it simplifies the code. We no longer need to
regularly convert from/to UTF-8. And the functions using
PyUnicodeObject objects (and not a Py_UNICODE* buffer) benefit too:
PyUnicodeObject stores the string length (it avoids calls to strlen()) and
PyUnicode_FromFormat() doesn't need a buffer size (no risk of buffer
overflow). I suppose that it makes Python faster, but I haven't measured it.

(*) Python 3.2 doesn't support non-ASCII in the module *name*, only in
the path (sys.path).

Victor Stinner
