Re: [Python-Dev] Import and unicode: part two
> When switching to a UTF-8 locale, they can also change the file
> names of their modules to be encoded in UTF-8. It would be fairly easy
> to write a script that identifies non-ASCII file names in a directory
> and offers to transcode their names from their current encoding to
> UTF-8.

In fact, convmv (http://j3e.de/linux/convmv/) does exactly that; it is also available as a Debian package.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:
> There's one further case that I am worried about that has no real "transfer". Since people here seem to think that unicode module names are the future (for instance, the comments about redefining the C locale to include utf-8 and the comments about archiving tools needing to support encoding bits), there are eventually going to be unicode modules that become dependencies of other modules and programs. These will need to be installed on systems. Linux distributions that ship these will need to choose a filesystem encoding for the filenames of these. Likely the sensible thing for them to do is to use utf-8 since all the ones I can think of default to utf-8. But, as Stephen and Victor have pointed out, users change their locale settings to things that aren't utf-8 and save their modules using filenames in that encoding. When they update their OS to a version that has utf-8 python module names, they will find that they have to make a choice. They can either change their locale settings to a utf-8 encoding and have the system installed modules work or they can leave their encoding on their non-utf-8 encoding and have the modules that they've created on-site work. This is not a good position to put users of these systems in.

The way this case should work is that programs that install files (installation is a form of transfer) should transform their names from the encoding used in the transfer medium to the encoding of the filesystem on which they are installed. Python3 should access the files, transforming the names from the encoding of the filesystem on which they are installed to Unicode for use by the program.

I think Python3 is trying to do its part, and Victor is trying to make that more robust on more platforms, specifically Windows. The programs that install files, which may include programs that install Python files (I don't know), may or may not be doing their part, but clearly there are cases where they do not.
Systems that have different encodings for names on the same or different file systems need to have a way to obtain the encoding for the file names, so they can be properly decoded. If they don't have such a way, they are broken.

=

The rest of this is an attempt to describe the problem of Linux and other systems which use byte strings instead of character strings as file names. No problem, as long as programs allow byte strings as file names. Python3 does not, for the import statement, thus the problem is relevant for discussion here, as has been ongoing.

=

Since file names are defined to be byte strings, there is no way to obtain the encoding for file names, so they cannot always be decoded, and sometimes not properly decoded, because no one knows which encoding was used to create them, _if any_.

Hence, Linux programs that use character strings as file names internally and expect them to match the byte strings in the file system are promoting a fiction: that there is a transformation (encoding) from character strings to byte strings that will match.

When using ASCII character strings, they can be transformed to bytes using a simple transformation: identity... but that isn't necessarily correct, if the files were created using EBCDIC (unlikely on Linux systems, but not impossible, since Linux file names are byte strings).

When using non-ASCII character strings, the fiction promoted is even bigger, and the transformation even harder. Any 8-bit character encoding can pretend that identity is the correct transformation, but the result is mojibake if it isn't. Unicode and other multi-byte encodings have an even harder job, because there can be 8-bit sequences that are not legal for some transformations, but are legal for others. This is when the fiction is exposed!
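A minimal Python 3 sketch of the point above, assuming a hypothetical module file named café.py. It shows that an 8-bit encoding will decode any byte string (right or wrong), while UTF-8 rejects byte sequences that are not legal for it. The last two lines use the "surrogateescape" error handler from PEP 383, which is how Python 3 itself carries undecodable file name bytes; that handler is not discussed in the message and is included here only as an illustration.

```python
# Hypothetical file name used only for this illustration.
name = "caf\u00e9.py"

latin1_bytes = name.encode("iso-8859-1")  # one byte for the accented letter
utf8_bytes = name.encode("utf-8")         # two bytes for the accented letter

# An 8-bit encoding accepts every byte string, so decoding UTF-8 bytes as
# ISO-8859-1 "succeeds" and silently produces mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")

# UTF-8 is stricter: the ISO-8859-1 bytes are not a legal UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# PEP 383: undecodable bytes become lone surrogates and can be restored.
escaped = latin1_bytes.decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```

The round trip through "surrogateescape" preserves the original bytes, but the intermediate string still does not tell anyone which encoding the file name was created in.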
As the recent description of glib points out, when the file names are read as bytes, and shown to the user for selection, possibly using some mojibake-generating transformation to characters, the user has a fighting chance to pick the right file, less chance if the transformation is lossy ('?' substitutions, etc.) and/or the names are redundant in their lossless characters. However, when the specification of the name is in characters (such as for Python import, or file names specified as character constants in any application system that provides/permits such), and there are large numbers of transformations that could be used to convert characters to bytes, the problem is harder, and error-prone... programs that want to promote the fiction of using characters for filenames must work harder. It seems that Python on Linux is such a program. One technique is to have conventions agreed on by applications and users to limit the number of encodings used on a particular system to one (optimal) or a few, the latter requires understanding that files created in one encoding may not be accessible by systems that use a diffe
Re: [Python-Dev] Import and unicode: part two
Toshio Kuratomi:
> When they update their OS to a version that has
> utf-8 python module names, they will find that they have to make a choice.
> They can either change their locale settings to a utf-8 encoding and have
> the system installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site work.

When switching to a UTF-8 locale, they can also change the file names of their modules to be encoded in UTF-8. It would be fairly easy to write a script that identifies non-ASCII file names in a directory and offers to transcode their names from their current encoding to UTF-8.

Neil
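A script of the kind described could look roughly like the following Python 3 sketch. It is not convmv and not code from this thread; the function names and the dry_run parameter are made up for illustration, and the source encoding has to be supplied by the user because the file system does not record it.

```python
import os


def find_non_ascii_names(directory):
    """Return the raw byte names in `directory` that contain non-ASCII bytes."""
    # Passing a bytes path makes os.listdir() return undecoded bytes names.
    return sorted(
        name
        for name in os.listdir(os.fsencode(directory))
        if any(byte > 0x7F for byte in name)
    )


def plan_renames(names, source_encoding):
    """Map each old byte name to its UTF-8 form, skipping names that
    cannot be decoded with the assumed source encoding."""
    plan = {}
    for name in names:
        try:
            new_name = name.decode(source_encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue  # not valid in the assumed encoding; leave it alone
        if new_name != name:
            plan[name] = new_name
    return plan


def transcode_names(directory, source_encoding, dry_run=True):
    """Rename non-ASCII names to UTF-8; with dry_run=True only report the plan."""
    directory = os.fsencode(directory)
    plan = plan_renames(find_non_ascii_names(directory), source_encoding)
    if not dry_run:
        for old, new in plan.items():
            os.rename(os.path.join(directory, old), os.path.join(directory, new))
    return plan
```

Calling transcode_names(".", "iso-8859-1") would only list the proposed renames. One caveat with this sketch: a name that is already UTF-8 still decodes "successfully" as ISO-8859-1, so running it with the wrong source encoding would double-encode such names; a real tool needs a check for that.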
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:
> On 26.01.2011 10:40, Victor Stinner wrote:
> > On Monday 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
> >> Why not locale:
> >> * Relying on locale is simply not portable. (...)
> >> * Mixing of modules from different locales won't work. (...)
> >
> > I don't understand what you are talking about.
>
> I think by "portability", he means "moving files from one computer to
> another". He argues that if Python would mandate UTF-8 for all file
> names on Unix, moving files in such a way would support portability,
> whereas using the locale's filename might not (if the locale uses a
> different charset on the target system).
>
> While this is technically true, I don't think it's a helpful way of
> thinking: by mandating that file names are UTF-8 when accessed from
> Python, we make the actual files inaccessible on both the source and
> the target system.
>
> > I don't understand the relation between the local filesystem encoding
> > and the portability. I suppose that you are talking about the
> > distribution of a module to other computers. Here the question is how
> > the filenames are stored during the transfer. The user is free to use
> > any tool, and try to find a tool handling Unicode correctly :-) But it's
> > no more the Python problem.
>
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
>
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

Thanks Martin, I think that you understand my view even if you don't share it.

There's one further case that I am worried about that has no real "transfer".
Since people here seem to think that unicode module names are the future (for instance, the comments about redefining the C locale to include utf-8 and the comments about archiving tools needing to support encoding bits), there are eventually going to be unicode modules that become dependencies of other modules and programs. These will need to be installed on systems. Linux distributions that ship these will need to choose a filesystem encoding for the filenames of these. Likely the sensible thing for them to do is to use utf-8 since all the ones I can think of default to utf-8. But, as Stephen and Victor have pointed out, users change their locale settings to things that aren't utf-8 and save their modules using filenames in that encoding. When they update their OS to a version that has utf-8 python module names, they will find that they have to make a choice. They can either change their locale settings to a utf-8 encoding and have the system installed modules work or they can leave their encoding on their non-utf-8 encoding and have the modules that they've created on-site work. This is not a good position to put users of these systems in.

-Toshio
Re: [Python-Dev] Import and unicode: part two
> If NFSv3 doesn't reencode filenames for each client and the clients
> don't reencode filenames, all clients have to use the same locale
> encoding than the server. Otherwise, I don't see how it can work.

In practice, users accept that they get mojibake - their editors can still open the files, and they can double-click them in a file browser just fine. So it doesn't really need to work, and users can still use it.

> Again, I don't think that Python should do anything special to
> workaround these issues.

I agree, and I'm certainly in favor of keeping the current code base. Just make sure you understand the reasoning of those opposing.

Regards,
Martin
Re: [Python-Dev] Import and unicode: part two
On Jan 26, 2011, at 11:47 AM, Victor Stinner wrote:
> Not exactly. Gtk+ uses the glib library, and to encode/decode filenames,
> the glib library uses:
>
> - UTF-8 on Windows
> - G_FILENAME_ENCODING environment variable if set (comma-separated list
>   of encodings)
> - UTF-8 if G_BROKEN_FILENAMES env var is set
> - or the locale encoding

But the documentation says:

> On Unix, the character sets are determined by consulting the environment
> variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES. On Windows, the
> character set used in the GLib API is always UTF-8 and said environment
> variables have no effect.
>
> G_FILENAME_ENCODING may be set to a comma-separated list of character set
> names. The special token "@locale" is taken to mean the character set for
> the current locale. If G_FILENAME_ENCODING is not set, but G_BROKEN_FILENAMES
> is, the character set of the current locale is taken as the filename
> encoding. If neither environment variable is set, UTF-8 is taken as the
> filename encoding, but the character set of the current locale is also put in
> the list of encodings.

Which indicates to me that (unless you override the behavior with env vars) it encodes filenames in UTF-8 regardless of the locale, and attempts decoding in UTF-8 primarily. And that only when the filename doesn't make sense in UTF-8, it will also try decoding it in the locale encoding.

James
Re: [Python-Dev] Import and unicode: part two
On Wednesday 26 January 2011 at 08:24 -0500, James Y Knight wrote:
> On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:
> > During
> > Python 3.2 development, we tried to be able to use a filesystem encoding
> > different than the locale encoding (PYTHONFSENCODING environment
> > variable): but it doesn't work simply because Python is not alone in the
> > OS. Except Python, all programs speak the same "language": the locale
> > encoding. Let's try to give you an example: if you create a module with a
> > name encoded to UTF-8, your file browser will display mojibake.
>
> Is that really true? I'm pretty sure GTK+ treats all filenames as
> UTF-8 no matter what the locale says. (over-rideable by
> G_FILENAME_ENCODING or G_BROKEN_FILENAMES)

Not exactly. Gtk+ uses the glib library, and to encode/decode filenames, the glib library uses:

- UTF-8 on Windows
- G_FILENAME_ENCODING environment variable if set (comma-separated list of encodings)
- UTF-8 if G_BROKEN_FILENAMES env var is set
- or the locale encoding

glib has no type to store a filename; a filename is a raw byte string (char*). It has a nice function to work around mojibake issues: g_filename_display_name(). This function tries to decode the filename from each encoding of the filename encoding list; if all decodings fail, it uses UTF-8 and escapes undecodable bytes.

So yes, if you set G_FILENAME_ENCODING you can fix mojibake issues. But you have to pass the raw bytes filenames to other libraries and programs.

The problem with PYTHONFSENCODING is that sys.getfilesystemencoding() is not only used for the filenames, but also for the command line arguments and the environment variables.
For more information about glib, see g_filename_to_utf8(), g_filename_display_name() and g_get_filename_charsets() documentation:
http://library.gnome.org/devel/glib/2.26/glib-Character-Set-Conversion.html

Victor
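The decoding strategy described for g_filename_display_name() (try each encoding in the list, then fall back to UTF-8 with escaping) can be sketched in Python 3 as follows. This is not glib's implementation: the function name display_name, its default encoding list, and the use of the "backslashreplace" error handler as the escaping step are all assumptions made for illustration.

```python
def display_name(raw_name, encodings=("utf-8",)):
    """Decode a raw bytes file name for display only.

    Tries each candidate encoding in order and returns the first result
    that decodes cleanly. If none does, falls back to UTF-8 and escapes
    the undecodable bytes so that something readable can still be shown.
    """
    for encoding in encodings:
        try:
            return raw_name.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the decodable parts, show the rest as \xNN escapes.
    return raw_name.decode("utf-8", errors="backslashreplace")
```

For example, display_name(raw, ("utf-8", "euc-jp")) would return the EUC-JP interpretation of a name that is not valid UTF-8. The result is only suitable for showing to a user; as noted above, the raw bytes still have to be passed to other libraries and programs.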
Re: [Python-Dev] Import and unicode: part two
On Jan 26, 2011, at 4:40 AM, Victor Stinner wrote:
> During
> Python 3.2 development, we tried to be able to use a filesystem encoding
> different than the locale encoding (PYTHONFSENCODING environment
> variable): but it doesn't work simply because Python is not alone in the
> OS. Except Python, all programs speak the same "language": the locale
> encoding. Let's try to give you an example: if you create a module with a
> name encoded to UTF-8, your file browser will display mojibake.

Is that really true? I'm pretty sure GTK+ treats all filenames as UTF-8 no matter what the locale says. (over-rideable by G_FILENAME_ENCODING or G_BROKEN_FILENAMES)

James
Re: [Python-Dev] Import and unicode: part two
On Wednesday 26 January 2011 at 11:12 +0100, "Martin v. Löwis" wrote:
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).

Python encodes the module name to the locale encoding to create a filename. If the locale encoding is not the encoding used on the NFS server, it doesn't work, but I don't think that Python has to work around this issue. If a user plays with non-ASCII module names, (s)he has to understand that (s)he will have to fight against badly configured systems and tools unable to handle Unicode correctly. We might warn him/her in the documentation.

If NFSv3 doesn't reencode filenames for each client and the clients don't reencode filenames, all clients have to use the same locale encoding as the server. Otherwise, I don't see how it can work.

> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

Except for Mac OS X and Windows, no kernel supports Unicode, and so all users of the same computer have to use the same locale encoding, or they will not be able to share non-ASCII filenames.

--

Again, I don't think that Python should do anything special to work around these issues. (Hardcoding the module filename encoding to UTF-8 doesn't work for all the reasons explained in other emails.)

Victor
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:12:02AM +0100, "Martin v. Löwis" wrote:
> There are cases where there is no real "transfer", in the sense in which
> you are using the word. For example, with NFS, you can access the very
> same file simultaneously on two systems, with no file name conversion
> (unless you are using NFSv4, and unless your NFSv4 implementations
> support the UTF-8 mandate in NFS well).
>
> Also, if two users of the same machine have different locale settings,
> the same file name might be interpreted differently.

I have a solution for all these problems, with a price, of course. Let's use utf8+base64. Base64 uses a very restricted subset of ASCII, so filenames will never be misinterpreted, whatever the filesystem encoding is. The price is that users lose standard OS tools like ls and find.

I am partially joking, of course, but only partially.

Oleg.
--
Oleg Broytman http://phdru.name/ p...@phdru.name
Programmers don't die, they just GOSUB without RETURN.
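For what it is worth, the utf8+base64 idea can be sketched in a few lines of Python 3. The function names are hypothetical, and the URL-safe alphabet is an assumption made here because the standard Base64 alphabet contains "/", which cannot appear in a file name.

```python
import base64


def encode_module_filename(module_name):
    """Encode a Unicode module name as an ASCII-only file name stem."""
    raw = module_name.encode("utf-8")
    # urlsafe_b64encode uses "-" and "_" instead of "+" and "/".
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_module_filename(stem):
    """Recover the Unicode module name from an encoded file name stem."""
    return base64.urlsafe_b64decode(stem.encode("ascii")).decode("utf-8")
```

Besides making names unreadable in ls and find, this sketch has a further cost: Base64 is case-sensitive, so the encoded names would collide or break on case-insensitive file systems.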
Re: [Python-Dev] Import and unicode: part two
On 26.01.2011 10:40, Victor Stinner wrote:
> On Monday 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
>> Why not locale:
>> * Relying on locale is simply not portable. (...)
>> * Mixing of modules from different locales won't work. (...)
>
> I don't understand what you are talking about.

I think by "portability", he means "moving files from one computer to another". He argues that if Python would mandate UTF-8 for all file names on Unix, moving files in such a way would support portability, whereas using the locale's filename might not (if the locale uses a different charset on the target system).

While this is technically true, I don't think it's a helpful way of thinking: by mandating that file names are UTF-8 when accessed from Python, we make the actual files inaccessible on both the source and the target system.

> I don't understand the relation between the local filesystem encoding
> and the portability. I suppose that you are talking about the
> distribution of a module to other computers. Here the question is how
> the filenames are stored during the transfer. The user is free to use
> any tool, and try to find a tool handling Unicode correctly :-) But it's
> no more the Python problem.

There are cases where there is no real "transfer", in the sense in which you are using the word. For example, with NFS, you can access the very same file simultaneously on two systems, with no file name conversion (unless you are using NFSv4, and unless your NFSv4 implementations support the UTF-8 mandate in NFS well).

Also, if two users of the same machine have different locale settings, the same file name might be interpreted differently.

Regards,
Martin
Re: [Python-Dev] Import and unicode: part two
On Monday 24 January 2011 at 19:26 -0800, Toshio Kuratomi wrote:
> Why not locale:
> * Relying on locale is simply not portable. (...)
> * Mixing of modules from different locales won't work. (...)

I don't understand what you are talking about.

When you import a module, the module name becomes a filename. On Windows, you can reuse the Unicode name directly as a filename. On the other OSes, you have to encode the name to the filesystem encoding. During Python 3.2 development, we tried to be able to use a filesystem encoding different than the locale encoding (PYTHONFSENCODING environment variable): but it doesn't work simply because Python is not alone in the OS. Except Python, all programs speak the same "language": the locale encoding. Let's try to give you an example: if you create a module with a name encoded to UTF-8, your file browser will display mojibake.

I don't understand the relation between the local filesystem encoding and the portability. I suppose that you are talking about the distribution of a module to other computers. Here the question is how the filenames are stored during the transfer. The user is free to use any tool, and try to find a tool handling Unicode correctly :-) But it's no longer Python's problem.

Each computer uses a different locale encoding. You have to use it to cooperate with other programs and avoid mojibake. But I don't understand why you write that "Mixing of modules from different locales won't work". If you use a tool storing filenames in your locale encoding (e.g. the TAR file format... and sometimes the ZIP format), the problem comes from your tool and you should use another tool.

I created http://bugs.python.org/issue10972 to work around ZIP tools supposing that ZIP files use the locale encoding instead of cp437: this issue adds an option to force the usage of the Unicode flag (and so store filenames as UTF-8). Initially, though, I created the issue to work around a bootstrap issue (#10955).
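The Unicode flag mentioned above is bit 11 (0x800) of a ZIP entry's general-purpose flags. The following Python 3 sketch uses only the standard zipfile module to show when that flag ends up set; the helper name stored_name_flags is made up for illustration, and the sketch shows current zipfile behaviour, not the option proposed in issue10972.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the file name is stored as UTF-8


def stored_name_flags(names):
    """Write empty entries with the given names to an in-memory ZIP file
    and report, per name, whether the UTF-8 flag was set on the entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    with zipfile.ZipFile(buffer) as archive:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in archive.infolist()
        }
```

With this module, pure ASCII names are stored without the flag, and names that cannot be encoded as ASCII are stored as UTF-8 with the flag set. A reader that ignores the flag and assumes cp437 or the locale encoding would display such names as mojibake.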
Victor
Re: [Python-Dev] Import and unicode: part two
Toshio Kuratomi writes:
> Sure ... but with these systems, neither read-modules-as-locale or
> read-modules-as-utf-8 are a good solution to work, correct?

Good solution, no, but I believe that read-modules-as-locale *should* work to a great extent. AFAIK Python 3 reads Python programs as str (ie, converting to Unicode -- if it doesn't, it *should*).

> Especially if the OS does get upgraded but the filesystems with
> user data (and user created modules) are migrated as-is, you'll run
> into situations where system installed modules are in utf-8 and
> user created modules are shift-jis and so something will always be
> broken.

I don't know what you mean by "system-installed modules". If you're talking about Python itself, it's not a problem. Python doesn't have any Japanese-named modules in any encoding. On the other hand, *everything* that involves scripting (shell scripts, make, etc) related to those filesystems will be broken *unless* the system, after upgrade but before going live, is converted to have an appropriate locale encoding.

So I don't really see a problem here. The problem is portability across systems, and that is a problem that only the third-party transports can really deal with. tar and unzip need to be taught how to change file names to the locale, etc.

> The only way to make sure that modules work is to restrict them to ASCII-only
> on the filesystem. But because unicode module names are seen as
> a necessary feature, the question is which way forward is going to lead to
> the least brokenness. Which could be locale... but from the python2
> locale-related bugs that I get to look at, I doubt.

AFAICS this is going to be site-specific. End of story. Or, if you prefer, "maru-nage".

IMHO, Python 2 locale bugs are unlikely to be a good guide to Python 3 locale bugs because in Python 2 most people just ignore locale and use "native" strings (~= bytes in Python 3), and that typically "just works".
In Python 3 that just *doesn't* work any more because you get a UnicodeError on import, etc, etc.

IMHO, YMMV, and all that. I know *of* such systems (there remain quite a few here used by student and research labs), but the ones I maintain were easy to convert to UTF-8 because I don't export file systems (except my private files for my own use); everything is mediated by Apache and Zope, and browsers are happy to cope if I change from EUC-JP to UTF-8 and then flip the Apache switch to change default encodings.
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 26, 2011 at 11:24:54AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
>
> > On Linux there's no defined encoding that will work; file names are just
> > bytes to the Linux kernel so based on people's argument that the convention
> > is and should be that filenames are utf-8 and anything else is
> > a misconfigured system -- python should mandate that its module filenames on
> > Linux are utf-8 rather than using the user's locale settings.
>
> This isn't going to work where I live (Tsukuba). At the national
> university alone there are hundreds of pre-existing *nix systems whose
> filesystems were often configured a decade or more ago. Even if the
> hardware and OS have been upgraded, the filesystems are usually
> migrated as-is, with OS configuration tweaks to accommodate them. Many
> of them use EUC-JP (and servers often Shift JIS). That means that you
> won't be able to read module names with ls, and that will make Python
> unacceptable for this purpose. I imagine that in Russia the same is
> true for the various Cyrillic encodings.

Sure ... but with these systems, neither read-modules-as-locale or read-modules-as-utf-8 are a good solution to work, correct? Especially if the OS does get upgraded but the filesystems with user data (and user created modules) are migrated as-is, you'll run into situations where system installed modules are in utf-8 and user created modules are shift-jis and so something will always be broken.

The only way to make sure that modules work is to restrict them to ASCII-only on the filesystem. But because unicode module names are seen as a necessary feature, the question is which way forward is going to lead to the least brokenness. Which could be locale... but from the python2 locale-related bugs that I get to look at, I doubt it.

> I really don't think there is anything that can be done here except to
> warn people that "Kids, these stunts are performed by highly-trained
> professionals. Don't try this at home!"
> Of course they will anyway,
> but at least they will have been warned in sufficiently strong terms
> that they might pay attention and be able to recover when they run
> into bizarre import exceptions.

So on the subject of warnings... I think a reason it's better to pick an encoding for the platform/filesystem rather than to use locale is because people will get an error or a warning at the appropriate time if that's the case -- the first time they attempt to create and import a module with a filename that's not encoded in the correct encoding for the platform.

It's all very well to say: "We wrote in the documentation on http://docs.python.org/distutils/introduction.html#Choosing-a-name that only ASCII names should be used when distributing python modules" but if the interpreter doesn't complain when people use a non-ASCII filename we all know that they aren't going to look in the documentation; they'll try it and if it works they'll learn that habit.

-Toshio
Re: [Python-Dev] Import and unicode: part two
Toshio Kuratomi writes:
> On Linux there's no defined encoding that will work; file names are just
> bytes to the Linux kernel so based on people's argument that the convention
> is and should be that filenames are utf-8 and anything else is
> a misconfigured system -- python should mandate that its module filenames on
> Linux are utf-8 rather than using the user's locale settings.

This isn't going to work where I live (Tsukuba). At the national university alone there are hundreds of pre-existing *nix systems whose filesystems were often configured a decade or more ago. Even if the hardware and OS have been upgraded, the filesystems are usually migrated as-is, with OS configuration tweaks to accommodate them. Many of them use EUC-JP (and servers often Shift JIS). That means that you won't be able to read module names with ls, and that will make Python unacceptable for this purpose. I imagine that in Russia the same is true for the various Cyrillic encodings.

I really don't think there is anything that can be done here except to warn people that "Kids, these stunts are performed by highly-trained professionals. Don't try this at home!" Of course they will anyway, but at least they will have been warned in sufficiently strong terms that they might pay attention and be able to recover when they run into bizarre import exceptions.

Oh, yeah, don't forget to apply Victor's patch, which allows Python to keep the promises it can make about consistency.
Re: [Python-Dev] Import and unicode: part two
On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> >
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> > MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right
> here you've already broken Python modules on OSX.

Others have been saying that Mac OSX's HFS+ uses UTF-8. But the question is not whether UTF-16 or UTF-8 is used by HFS+. It's whether you can sensibly decide on an encoding from the type of system that is being run on. This could be querying the filesystem or a check on sys.platform or some other method. I don't know what detection the current code does.

On Linux there's no defined encoding that will work; file names are just bytes to the Linux kernel so based on people's argument that the convention is and should be that filenames are utf-8 and anything else is a misconfigured system -- python should mandate that its module filenames on Linux are utf-8 rather than using the user's locale settings.

> And as far as I know, Linux software/FS generally use NFC (I've already seen
> this issue cause trouble)

Linux FS's are bytes with a small blacklist (so you can't use the NULL byte in a filename, for instance). Linux software would be free to use any normal form that they want. If one software used NFC and another used NFD, the FS would record two separate files with two separate filenames. Other programs might or might not display this correctly.
Example:

    $ touch cafe
    $ python
    Python 2.7 (r27:82500, Sep 16 2010, 18:02:00)
    >>> import os
    >>> import unicodedata
    >>> a=u'café'
    >>> b=unicodedata.normalize('NFC', a)
    >>> c=unicodedata.normalize('NFD', a)
    >>> open(b.encode('utf8'), 'w').close()
    >>> open(c.encode('utf8'), 'w').close()
    >>> os.listdir(u'.')
    [u'people-etc-changes.txt', u'cafe\u0301', u'cafe',
     u'people-etc-changes.sha256sum', u'caf\xe9']
    >>> os.listdir('.')
    ['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe',
     'people-etc-changes.sha256sum', 'caf\xc3\xa9']
    >>> ^D
    $ ls -al .
    drwxrwxr-x.  2 badger badger 4096 Jan 25 07:46 .
    drwxr-xr-x. 17 badger badger 4096 Jan 24 18:27 ..
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:45 cafe
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:46 cafe
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:46 café
    $ ls -al cafe
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:45 cafe
    $ ls -al cafe?
    -rw-rw-r--.  1 badger badger    0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed incorrectly and the shell treats the decomposed character as two characters instead of one. However, when you view these files in dolphin (the KDE file manager) you properly see café repeated twice.

-Toshio
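The same NFC/NFD distinction can be checked in Python 3 without touching the file system. This is a small sketch, not part of the session above; the helper same_name is hypothetical and simply compares two names after normalizing both to NFC.

```python
import unicodedata

# Start from the composed spelling, written with an escape so the source
# file's own normalization cannot affect the result.
composed = unicodedata.normalize("NFC", "caf\u00e9")
decomposed = unicodedata.normalize("NFD", "caf\u00e9")

# The two spellings look identical but are different strings, and they
# encode to different UTF-8 byte sequences, so a byte-oriented file system
# treats them as two different file names.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")


def same_name(a, b):
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A program that wants the two spellings to refer to the same module has to normalize before comparing, as same_name does; the kernel will not do it on Linux.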
Re: [Python-Dev] Import and unicode: part two
On 09:22 am, catch-...@masklinn.net wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> >
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> >   MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right
> here you've already broken Python modules on OSX.

Are you sure about the UTF-16 part? Evidence strongly points towards UTF-8:

  $ python
  Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
  [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import unicodedata, os
  >>> file(u'\N{SNOWMAN}', 'w').close()
  >>> os.listdir('.')
  ['\xe2\x98\x83']
  >>> unicodedata.name('\xe2\x98\x83'.decode('utf-8'))
  'SNOWMAN'
  >>>

Jean-Paul
Re: [Python-Dev] Import and unicode: part two
On 2011-01-25, at 04:26 , Toshio Kuratomi wrote: > > * If you can pick a set of encodings that are valid (utf-8 for Linux and > MacOS HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right here you've already broken Python modules on OSX. And as far as I know, Linux software/FS generally use NFC (I've already seen this issue cause trouble) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
As Nick points out, nobody really seems to think this is an argument against your patch. I'm going to bow out of this thread after this post, as I'm clearly out of my technical depth. Victor Stinner writes: > Le lundi 24 janvier 2011 11:35:22, Stephen J. Turnbull a écrit : > > ... VFAT-formatted file systems and Shift JIS file names ... > > I missed something: VFAT stores filenames as unicode (whereas FAT only > supports byte filenames). Well, VFAT stores filenames twice: as a 8+3 byte > strings and as a 255 unicode (UTF-16-LE) string (UTF-16-LE). I don't know what it is; I didn't have char-device-level access to the file system, nor did I have the specs (it was a proprietary phone by a Japanese OEM). It *presented* filenames in Shift JIS when mounted on Linux with the vfat filesystem (either "mount -t vfat /dev/sde1 /mnt/gadget" or "mount -t auto /dev/sde1 /mnt/gadget"). Maybe there is some unusual layer to translate from Unicode there, I'm not familiar with Linux kernel drivers and libc facilities (such special-casing is a common pattern in programming for Japanese; remember, the Japanese had to deal with these issues before there was any standard for them). > On which OS do you access this VFAT file system? On Windows, you have two > APIs: bytes (*A) and wide character (*W). If you use the wide character, > there > is explicit encoding at all. Linux has two mount options to control unicode > on > a VFAT filesystem: "codepage" for the byte filenames (use Shift JIS here) > and > "iocharset" for the unicode filenames (I don't understand this > option). I didn't either, in fact this is the first I've heard of it, so I've never tried it. > I suppose that Shift JIS is used to encode the filename in the 8+3 byte > string > form. Could be, but I'm pretty sure these were long filenames, although maybe they were just short enough (that is, I don't recall noticing any truncation when mounted compared to the way they were presented on the phone itself). 
I don't use that phone anymore, it's in a box of junk equipment somewhere
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 03:27:08PM -0500, Glyph Lefkowitz wrote:
>
> On Jan 20, 2011, at 11:46 AM, Guido van Rossum wrote:
>
>   Same here. *Most* code will never be shared, or will only be shared
>   between users in the same community. When it goes wrong it's also a
>   learning opportunity. :-)
>
> Despite my usual proclivity for being contrarian, I find myself in agreement
> here. Linux users with locales that don't specify UTF-8 frankly _should_ have
> to deal with all kinds of nastiness until they can transcode their
> filesystems. MacOS and Windows both have a "right" answer here and your
> third-party tools shouldn't create mojibake in your filenames.
>
However, if this is the consensus, it makes a lot more sense to pick utf-8 as *the* encoding for python module filenames on Linux.

Why UTF-8:

* UTF-8 can cover the whole range of unicode whereas most (all?) other locale friendly encodings cannot.
* UTF-8 is becoming a standard for Linux distributions whether or not Linux users are adopting it.
* Third party tools are gaining support for UTF-8 even when they aren't gaining support for generic encodings (If I read the spec on zip correctly, this is actually what's happening there).

Why not locale:

* Relying on locale is simply not portable. If nothing prevents people from distributing a unicode filename then they will go ahead and do so. If the result works (say, because it's utf-8 and 80% of the Linux userbase is using utf-8) then it will get packaged and distributed and people won't know that it's a problem until someone with a non-utf-8 locale decides to use it.
* Mixing of modules from different locales won't work. Suppose that the system python installs the previous module. The local site has other modules that it has installed using a different filename encoding. The users at the site will find that either one or the other of the two modules won't work.
* Because of the portability problems you have no choice but to tell people not to distribute python modules with non-ASCII names. This makes the use of unicode names second class indefinitely (until the kernel devs decide that they're wrong to not enforce a filesystem encoding or Linux becomes irrelevant as a platform).
* If you can pick a set of encodings that are valid (utf-8 for Linux and MacOS, wide unicode for windows [I get the feeling from other parts of the conversation that Windows won't be so lucky, though]) tools to convert python names become easier to write. If you restrict it far enough, you could even write tools/importers that automatically do the detection.

PS: Sorry for not replying immediately, the team I'm on is dealing with an issue at my work and I'm also preparing for a conference later this week.

-Toshio
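A detection tool of the kind described in the last point could start from something like the sketch below. It is Python 3, it is only a heuristic (bytes in a legacy encoding can occasionally happen to be valid UTF-8), and the function names `classify_name` and `scan_directory` are made up for illustration.

```python
import os


def classify_name(raw):
    """Return 'ascii', 'utf-8', or 'other' for a raw byte file name."""
    if all(byte < 0x80 for byte in raw):
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: probably written under some other locale encoding.
        return "other"
    return "utf-8"


def scan_directory(path):
    """Map each raw file name found in `path` to its classification."""
    # Passing bytes to os.listdir() makes it return the names as raw bytes,
    # so no locale-dependent decoding happens behind our back.
    return {raw: classify_name(raw) for raw in os.listdir(os.fsencode(path))}
```

Names reported as 'other' are the ones a transcoding tool would have to ask the user about, since the original encoding cannot be recovered from the bytes alone.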
Re: [Python-Dev] Import and unicode: part two
On 24.01.2011 16:39, Victor Stinner wrote:
> On Monday 24 January 2011 11:35:22, Stephen J. Turnbull wrote:
>> ... VFAT-formatted file systems and Shift JIS file names ...
>
> I missed something: VFAT stores filenames as unicode (whereas FAT only
> supports byte filenames). Well, VFAT stores filenames twice: as a 8+3 byte
> strings and as a 255 unicode (UTF-16-LE) string (UTF-16-LE).

Stephen may not have meant VFAT. Instead, he might have meant FAT32, or, more likely, exFAT. VFAT is patented by Microsoft, so vendors of devices using flash memory cards often don't support VFAT. In any case, file names are encoded in the OEM code page even on VFAT.

> On which OS do you access this VFAT file system? On Windows, you have two
> APIs: bytes (*A) and wide character (*W). If you use the wide character,
> there is explicit encoding at all.

Right ("no explicit encoding"). However, this is actually where things can go wrong: Windows needs to guess the file system, and will guess it uses the OEM code page. If the device writing the file system uses a different OEM code page than the Windows installation reading it, you get moji-bake. This will actually happen with the *A APIs as well: they do *not* give you the file name from disk. Instead, Windows converts the OEM characters on disk to Unicode, and then the Unicode characters to the ANSI code page.

> Linux has two mount options to control unicode on
> a VFAT filesystem: "codepage" for the byte filenames (use Shift JIS here) and
> "iocharset" for the unicode filenames (I don't understand this option).
> Anyway, both systems support unicode filenames.

Linux doesn't support "unicode file names". Instead, it can support UTF-8. As Oleg explains: you need one encoding for the bytes on disk (to know what they mean, when converted to Unicode), and one encoding to then convert the "abstract" unicode to bytes again to present to the application. This is similar to how *A works on Windows.

The iocharset is needed even if the file system is known to use UTF-16 (say, NTFS, VFAT, or Joliet).

Regards,
Martin
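The two-step conversion described here (bytes on disk to Unicode, then Unicode to the bytes handed to the application) can be imitated in a few lines of Python 3. This is only a model of the idea, not of what the kernel driver actually does; `present_name` is a hypothetical name, and its parameters merely borrow the names of the mount options discussed in this thread.

```python
def present_name(disk_bytes, codepage, iocharset):
    """Decode on-disk bytes with `codepage`, re-encode with `iocharset`."""
    text = disk_bytes.decode(codepage)   # bytes on disk -> abstract Unicode
    return text.encode(iocharset)        # Unicode -> bytes the application sees


# U+0444 U+0430 U+0439 U+043B: a short Cyrillic word (Russian for "file").
name = "\u0444\u0430\u0439\u043b"
on_disk = name.encode("cp866")           # what a (hypothetical) device wrote

# Correct guess for the on-disk code page: the application gets the right
# characters, expressed in its own locale encoding.
good = present_name(on_disk, "cp866", "koi8-r")

# Wrong guess: nothing fails, but the characters are no longer the same.
wrong = present_name(on_disk, "cp437", "utf-8").decode("utf-8")
```

The wrong guess raises no error, which is what makes this kind of mojibake hard to notice until somebody reads the file names.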
Re: [Python-Dev] Import and unicode: part two
On Mon, Jan 24, 2011 at 04:39:39PM +0100, Victor Stinner wrote:
> I missed something: VFAT stores filenames as unicode (whereas FAT only
> supports byte filenames). Well, VFAT stores filenames twice: as a 8+3 byte
> strings and as a 255 unicode (UTF-16-LE) string (UTF-16-LE).
>
> On which OS do you access this VFAT file system? On Windows, you have two
> APIs: bytes (*A) and wide character (*W). If you use the wide character,
> there is explicit encoding at all. Linux has two mount options to control
> unicode on a VFAT filesystem: "codepage" for the byte filenames (use Shift
> JIS here) and "iocharset" for the unicode filenames (I don't understand
> this option).

AFAIU, `codepage` is the "remote charset" while `iocharset` is the "local charset". I.e., to mount a windows-1251 filesystem on my linux with koi8-r locale I use codepage=cp866,iocharset=koi8-r (cp866 is the OEM encoding for the cp1251 ANSI code page).

Oleg.
-- 
Oleg Broytman    http://phdru.name/    p...@phdru.name
Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] Import and unicode: part two
On Monday 24 January 2011 16:39:39, Victor Stinner wrote:
> On Monday 24 January 2011 11:35:22, Stephen J. Turnbull wrote:
> > ... VFAT-formatted file systems and Shift JIS file names ...
>
> I missed something: VFAT stores filenames as unicode (whereas FAT only
> supports byte filenames). Well, VFAT stores filenames twice: as a 8+3 byte
> strings and as a 255 unicode (UTF-16-LE) string (UTF-16-LE).
>
> On which OS do you access this VFAT file system? On Windows, you have two
> APIs: bytes (*A) and wide character (*W). If you use the wide character,
> there is explicit encoding at all.

Oops, there is *not* explicit encoding at all.

Victor
Re: [Python-Dev] Import and unicode: part two
Le lundi 24 janvier 2011 11:35:22, Stephen J. Turnbull a écrit : > ... VFAT-formatted file systems and Shift JIS file names ... I missed something: VFAT stores filenames as unicode (whereas FAT only supports byte filenames). Well, VFAT stores filenames twice: as a 8+3 byte strings and as a 255 unicode (UTF-16-LE) string (UTF-16-LE). On which OS do you access this VFAT file system? On Windows, you have two APIs: bytes (*A) and wide character (*W). If you use the wide character, there is explicit encoding at all. Linux has two mount options to control unicode on a VFAT filesystem: "codepage" for the byte filenames (use Shift JIS here) and "iocharset" for the unicode filenames (I don't understand this option). Anyway, both systems support unicode filenames. I suppose that Shift JIS is used to encode the filename in the 8+3 byte string form. Victor ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Mon, Jan 24, 2011 at 8:35 PM, Stephen J. Turnbull wrote: > First of all, these aren't just phones; these are all kinds of gadgets > (the example I gave was a camera). They're not as smart as an Android > or iPhone-like device, and I don't know what OS they use. We're getting a little far afield from the original question though - once it was pointed out that non-ASCII module names already work on some systems but not others, it became fairly clear that Victor's patch is about fixing an existing feature to be more robust rather than adding something new. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
"Martin v. Löwis" writes: > It's one thing how the file systems are formatted, but another thing > how they are presented to APIs. For example, the phones using Windows CE > would have to convert the file names to Unicode in the OS kernel. > > So: for these phones - do you know how they present file names to the > application? First of all, these aren't just phones; these are all kinds of gadgets (the example I gave was a camera). They're not as smart as an Android or iPhone-like device, and I don't know what OS they use. As for "presentation to the application", as I said, my older phones presented themselves as "removable memory devices" (specifically on the USB port), with VFAT-formatted file systems and Shift JIS file names. In that case you can surely have the kinds of problems described, even if the app is not running on the device itself. I don't know if this is still true of more modern devices, but I was a little shocked that is was true at all, even 5 or 6 years ago. That may be one reason why the phone I have now doesn't provide a USB interface at all. That kind of interface is not only unnecessary with Bluetooth, but Bluetooth uses more robust protocols. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
> > Really? I would have thought that cell phones have long been the > > platforms most supportive of Unicode. > > I would think so too, except in Japan. > > However, my previous phones exposed file systems with names encoded in > Shift JIS to USB and IR browsers, though. (My current one uses > Bluetooth, and I don't know how to "get at" the filesystem itself.) A > lot of these devices also tend to present themselves as VFAT-formatted > drives (a la a USB memory stick), and Shift JIS is very commonly used > on those for reasons I don't really understand. It's one thing how the file systems are formatted, but another thing how they are presented to APIs. For example, the phones using Windows CE would have to convert the file names to Unicode in the OS kernel. So: for these phones - do you know how they present file names to the application? Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
Guido van Rossum writes: > Really? I would have thought that cell phones have long been the > platforms most supportive of Unicode. I would think so too, except in Japan. However, my previous phones exposed file systems with names encoded in Shift JIS to USB and IR browsers, though. (My current one uses Bluetooth, and I don't know how to "get at" the filesystem itself.) A lot of these devices also tend to present themselves as VFAT-formatted drives (a la a USB memory stick), and Shift JIS is very commonly used on those for reasons I don't really understand. In any case, AIUI here the problem is like the problem of refactoring a "make"-based system. There are identifiers which are "spelled" one way inside of files which need to match the "spelling" of names of external filesystem objects. If you transport such a set of files to a POSIX system (which AFAIK most servers still are), then it's quite possible that the file names will get translated to the locale's encoding while the identifiers will not. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Sun, Jan 23, 2011 at 6:33 PM, Stephen J. Turnbull wrote: > "Martin v. Löwis" writes: > > Actually, as long people only involve Windows, or only involve Mac, > > it will all work just fine. It's only when they use non-Mac Unix > > (such as Linux), or try to move files across systems using sub-prime > > technology (such as your typical Windows zip utility) they will run > > into problems. > > I believe that the kind of thing that Ishimoto-san has in mind is > things like "smart cameras" that will upload your photos to your blog > with one touch on the cameras screen and other "Web 2.0 for the rest > of us" apps. What with the popularity of Linux and *BSD for such > sites, it's easy to imagine problems of the kind he describes > occurring between those (which will probably be using Shift JIS in > Japan) apps and the websites. Really? I would have thought that cell phones have long been the platforms most supportive of Unicode. IIRC Nokia's Python port to S60 *required* Unicode strings for all system interfaces. Android, using Java, also is pretty much all Unicode inside. Am I naive to generalize from these two examples? (This is not meant as a rhetorical question -- I may well be missing something and am genuinely curious about the answer.) -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
"Martin v. Löwis" writes: > Actually, as long people only involve Windows, or only involve Mac, > it will all work just fine. It's only when they use non-Mac Unix > (such as Linux), or try to move files across systems using sub-prime > technology (such as your typical Windows zip utility) they will run > into problems. I believe that the kind of thing that Ishimoto-san has in mind is things like "smart cameras" that will upload your photos to your blog with one touch on the cameras screen and other "Web 2.0 for the rest of us" apps. What with the popularity of Linux and *BSD for such sites, it's easy to imagine problems of the kind he describes occurring between those (which will probably be using Shift JIS in Japan) apps and the websites. Why people with the skills to be actually using Python would have a problem like that, I don't know, but my experience with Japanese vendors is no different from anywhere else: they put the blame for bugs in systems on any convenient component other than their own or close business partners'. Open source is especially convenient because of the NO WARRANTY section prominently displayed in all licenses. > So the more people get confronted with the poor support of non-ASCII > file names in tools, the faster the tools will improve. It took PKWARE > many years to come up with a reasonable Unicode story - but now it's > really the tools that need to catch up, not the spec. I still agree with this point of view, but there is some scope for discussion of whether these tools should be "included batteries" or not. (Unfortunately I'm not in a position to volunteer to help with them for some time. :-( ) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
>> I don't think anybody is *encouraging* it. The argument is for >> *permitting* it, partly for consistency with other identifiers, and >> partly because of Python's usual "consenting adults" standard for >> permitting "dangerous" practices. > > I'm sorry, I was not clear. I was afraid that saying "learning > opportunity" tempt people to try non-ASCII module names. > In these days, even non technical people have access to Windows, Mac > and Linux boxes at a time. So chances to be annoyed with broken > non-ASCII named files are pretty common. Actually, as long people only involve Windows, or only involve Mac, it will all work just fine. It's only when they use non-Mac Unix (such as Linux), or try to move files across systems using sub-prime technology (such as your typical Windows zip utility) they will run into problems. But then it will be clear whom to blame - and people run in the same problems regardless of whether they move Python modules, or regular files (say, Word documents). So the more people get confronted with the poor support of non-ASCII file names in tools, the faster the tools will improve. It took PKWARE many years to come up with a reasonable Unicode story - but now it's really the tools that need to catch up, not the spec. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Fri, Jan 21, 2011 at 5:45 PM, Stephen J. Turnbull wrote: > Nick Coghlan writes: > > On Fri, Jan 21, 2011 at 3:44 PM, Atsuo Ishimoto > wrote: > > > > I don't want Python to encourage people to use non-ascii module names. > > I don't think anybody is *encouraging* it. The argument is for > *permitting* it, partly for consistency with other identifiers, and > partly because of Python's usual "consenting adults" standard for > permitting "dangerous" practices. I'm sorry, I was not clear. I was afraid that saying "learning opportunity" tempt people to try non-ASCII module names. In these days, even non technical people have access to Windows, Mac and Linux boxes at a time. So chances to be annoyed with broken non-ASCII named files are pretty common. > > I still don't see this as a reason to give up on non-ASCII module > names. Just have the documentation warn that many non-ASCII names > will be non-portable, so use on multiple systems will require care > (maybe gloss that with "probably more care than you want to take"). > Nice gloss. -- Atsuo Ishimoto Mail: ishim...@gembook.org Blog: http://d.hatena.ne.jp/atsuoishimoto/ Twitter: atsuoishimoto ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Thu, 20 Jan 2011 22:25:17 -0500 James Y Knight wrote: > > On Jan 20, 2011, at 3:55 PM, Antoine Pitrou wrote: > > > On Thu, 20 Jan 2011 15:27:08 -0500 > > Glyph Lefkowitz wrote: > >> > >> To support the latter, could we just make sure that zipimport has a > >> consistent, > >> non-locale-or-operating-system-dependent interpretation of encoding? > > > > It already has, but it's dependent on a flag in the zip file itself > > (actually, one flag per archived file in the zip it seems). > > > > (by the way, it would be nice if your text/mail editor wrapped lines at > > 80 characters or something) > > You could complain to Apple, but it seems unlikely that they'd change it. > They broke it intentionally in OSX 10.6.2 for better compatibility with MS > Outlook. > > (for the technically inclined: It still wraps lines at 80 characters in the > raw message, but it uses quoted-printable encoding to escape the line-breaks, > so mail readers which decode quoted-printable but can't flow text are now > S.O.L. Apple used to use the nice format=flowed standard instead.) I think most mail readers are able to word-wrap raw text correctly (even though it still makes your messages look bad amongst a thread of nicely-formatted 80-column messages). The real annoyance is when reading Web archives of mailing-lists, e.g. http://twistedmatrix.com/pipermail/twisted-python/2011-January/023346.html Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
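On the zip point quoted above: the per-file flag is, as far as I know, bit 11 (0x800) of each member's general purpose bit flag, and Python 3's zipfile module exposes it as `ZipInfo.flag_bits`. The sketch below is an illustration rather than anything from zipimport itself; the constant name `UTF8_NAME_FLAG` and the function `utf8_flags` are made up.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose bit flag


def utf8_flags(names):
    """Write empty members with the given names; report each one's UTF-8 flag."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    buffer.seek(0)
    # Re-open the archive and read the flags back from its directory.
    with zipfile.ZipFile(buffer) as archive:
        return {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
                for info in archive.infolist()}


flags = utf8_flags(["plain.py", "caf\u00e9.py"])
```

With the zipfile module, the ASCII-only name is stored without the flag and the non-ASCII one with it, so a reader that honours the flag does not need to consult the locale to interpret member names.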
Re: [Python-Dev] Import and unicode: part two
Atsuo Ishimoto writes: > Java, a leading language of IT industry, have already support > non-ASCII class files for years. But I've never seen such files in > production in Japan, and didn't improve situation until now. So why wouldn't Python work the same way? The rest of the world can use non-ASCII modules names sparingly, and Japanese programmers can avoid them diligently. Or learn to use them properly and teach each other; if anybody has the experience of multiple encodings needed to figure out a good way to use the native language in program identifiers despite the encoding problem, my bet is it would be Japan. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
> I don't want Python to encourage people to use non-ascii module names. I don't think the feature is open for debate anymore. PEP 3131 has been accepted (after *long* debates), and I'll pronounce that supporting non-ASCII module names is a direct consequence of having it accepted. Of course, there may be limitations with respect to operating systems, and in the way Python modules integrate with the file system - but that non-ASCII module names must be supported is really out of question. If you would like this to be reverted, you need to write another PEP. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
Nick Coghlan writes: > On Fri, Jan 21, 2011 at 3:44 PM, Atsuo Ishimoto wrote: > > I don't want Python to encourage people to use non-ascii module names. I don't think anybody is *encouraging* it. The argument is for *permitting* it, partly for consistency with other identifiers, and partly because of Python's usual "consenting adults" standard for permitting "dangerous" practices. I realize this is a somewhat problematic distinction in Japan, for several reasons, but it's really not one that can be avoided in computing in any case. The sooner novice programmers learn it, the better. > > Today, seeing UnicodeEncodingError is one of popular reasons for > > newbies to abandon learning Python in Japan. Non-ascii module name is > > an another source of confusion for newbies. > > > > Experienced Japanese programmers may not use non-ascii module names to > > avoid encoding issues. > > > > But novice programmers or non-programmers willing to learn programming > > with Python will wish to use Japanese module names. Their programs > > will stop working if they copy them to another environment. Sooner or > > later, they will see storange ImportError and will start complaining > > "Python sucks! Python doesn't support Japanese!" on Twitter. So ask them, "What language *does* 'support Japanese'?" ;-) Seriously, "support Japanese" is an impossibly hard standard in the current environment. Not only does Japan have 5 more or less standard encodings still in daily use (EUC-JP, ISO-2022-JP, Shift JIS, UTF-8, and UTF-16LE), but many major IT companies have their own variants of the JIS standard character repertoire (all of the variant ideographs I've seen in the wild are in Unicode, but many corporate repertoires add extra symbols that are not), and of course some Microsoft utilities insist on using the deprecated UTF-8 signature with UTF-8. That said, I really don't see module names as a particular problem. 
By the time your novice is using her own modules (as opposed to importing stdlib and PyPI add-on modules, all with ASCII-only names), she'll be doing file I/O which has all the same problems, AFAICS. True, file names will be strings rather than identifiers, but I don't see why that matters. > > Copying files with non-ascii file name over platform is not easy as it > > sounds. Agreed, it's not trivial. But it's not that hard, either[1], and web hosts and others *could* help by providing checkers for languages that they support. > > What happen if I copy such files from OSX to my web hosting > > server ? Results might differ depending on tools I use to copy and > > platforms. I don't see why this problem is specific to Python modules, as opposed to any file name. > These all sound like good reasons to continue to *advise* against > using non-ASCII module names. +1 > But aside from that, they sound exactly like a lot of the arguments > we heard when Py3k started enforcing the bytes/text distinction > more rigorously: "you're going to break stuff!". Well, not exactly. Enforcing the bytes/text distinction was a change in the definition of Python; breakage was our fault. The change was made because in the (not so) long run it would reduce new breakage. Here, Python is fine (or at least we have some pretty good ideas how to fix it), it's the world that's broken. *Especially* Japan, with its five standard encodings in daily use and scads of private variant repertoires masquerading as standard encodings on top of that. But the whole world is broken because of the NFD/NFC thing. AFAIK, the only file system that tries to enforce an NF is Mac OS X HFS+, and (unfortunately for portability *from* Mac OS X *to* other systems) they chose NFD. 
Proper NFD support is arguably better for a number of reasons (for one, people regularly invent new composition sequences that will not have precomposed glyphs in any font), but NFC has the advantage that existing fonts support precomposed standard characters while many display engines do not support composition properly yet. And it's likely to stay broken for a while: the move to conformant display engines is going to take more time.

I still don't see this as a reason to give up on non-ASCII module names. Just have the documentation warn that many non-ASCII names will be non-portable, so use on multiple systems will require care (maybe gloss that with "probably more care than you want to take").

Footnotes:
[1] I actually find copying file names with spaces to be a bigger problem, because it's so hard to get shell quoting right.
Re: [Python-Dev] Import and unicode: part two
On Fri, Jan 21, 2011 at 2:59 PM, Nick Coghlan wrote:
>
> These all sound like good reasons to continue to *advise* against
> using non-ASCII module names. But aside from that, they sound exactly
> like a lot of the arguments we heard when Py3k started enforcing the
> bytes/text distinction more rigorously: "you're going to break
> stuff!".

No, non-ASCII module names are new breakage you are going to introduce now :) If the advice against using non-ASCII module names is reasonable, why bother supporting them?

>
> Yes, we know. But if core software development components like Python
> don't try to improve their Unicode support, how is the situation ever
> going to get better?
>

Java, a leading language of the IT industry, has supported non-ASCII class files for years. But I've never seen such files in production in Japan, and that support has not improved the situation so far.

-- 
Atsuo Ishimoto
Mail: ishim...@gembook.org
Blog: http://d.hatena.ne.jp/atsuoishimoto/
Twitter: atsuoishimoto
Re: [Python-Dev] Import and unicode: part two
On Fri, Jan 21, 2011 at 1:46 AM, Guido van Rossum wrote:
> On Thu, Jan 20, 2011 at 5:16 AM, Nick Coghlan wrote:
>> On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross wrote:
>>> I'm changing my vote on this to a +1 for two reasons:
>>>
>>> * Initially I thought this wasn't supported by Python at all but I see that currently it is supported but that support is broken (or at least limited to UTF-8 filesystem encodings). Since support is there, might as well make it better (especially if it tidies up the code base at the same time).
>>>
>>> * I still don't think it's a good idea to give modules non-ASCII names but the "consenting adults" approach suggests we should let people shoot themselves in the foot if they believe they have good reason to do so.
>>
>> I'm also +1 on this for the reasons Simon gives.
>
> Same here. *Most* code will never be shared, or will only be shared between users in the same community. When it goes wrong it's also a learning opportunity. :-)

I don't want Python to encourage people to use non-ASCII module names. Today, seeing UnicodeEncodeError is one of the most common reasons for newbies in Japan to abandon learning Python. Non-ASCII module names would be another source of confusion for newbies.

Experienced Japanese programmers may avoid non-ASCII module names to steer clear of encoding issues.

But novice programmers, or non-programmers who want to learn programming with Python, will want to use Japanese module names. Their programs will stop working if they copy them to another environment. Sooner or later, they will see a strange ImportError and will start complaining "Python sucks! Python doesn't support Japanese!" on Twitter.

Copying files with non-ASCII file names across platforms is not as easy as it sounds. What happens if I copy such files from OSX to my web hosting server? The results might differ depending on the tools I use to copy and on the platforms.

Is it a good opportunity to start learning about encodings? I don't think so. They would have to learn the concepts of character sets and encodings, the Unicode and JIS character sets, several Japanese encodings, a number of platform-specific issues, the non-standard extensions of Microsoft and Apple, and so on. I think they should defer learning this mess until they are ready.

--
Atsuo Ishimoto
Mail: ishim...@gembook.org
Blog: http://d.hatena.ne.jp/atsuoishimoto/
Twitter: atsuoishimoto
Re: [Python-Dev] Import and unicode: part two
On Fri, Jan 21, 2011 at 4:44 PM, Atsuo Ishimoto wrote:
> On Fri, Jan 21, 2011 at 2:59 PM, Nick Coghlan wrote:
>> These all sound like good reasons to continue to *advise* against using non-ASCII module names. But aside from that, they sound exactly like a lot of the arguments we heard when Py3k started enforcing the bytes/text distinction more rigorously: "you're going to break stuff!".
>
> No, non-ASCII module names are new breakage you are going to introduce now :)

No, they're not. Non-ASCII module names *already work* in Python 3.1 on UTF-8 filesystems. The portability problem you're complaining about exists now, and Victor is trying to at least partially alleviate it by making these filenames work correctly on more properly configured systems (such as Windows). It won't go away until all filesystem manipulation tools are properly Unicode aware, but that's no reason for us to continue to unnecessarily exacerbate the problem.

Given imp_cafe.py:

    import café

And café.py:

    print('Hello world from {}'.format(__name__))

I get the following result:

    ~$ python3.1 imp_cafe.py
    Hello world from café

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Import and unicode: part two
On 1/20/2011 12:44 PM, Toshio Kuratomi wrote:
> The problem occurs in that the code that one of the parties develops (either the students or the professors) is developed on one of those OS's and then used on the other OS.

The problem that I reported and hope will be fixed is that private code written and tested on one machine, which will never be distributed, could not be imported on the *same* machine, with nothing changed on that machine except for writing a second file that does the import.

If filenames get mangled when files are transported (admittedly more likely with non-ASCII chars), that is a different issue.

--
Terry Jan Reedy
Re: [Python-Dev] Import and unicode: part two
On Fri, Jan 21, 2011 at 3:44 PM, Atsuo Ishimoto wrote:
> I don't want Python to encourage people to use non-ASCII module names. Today, seeing UnicodeEncodeError is one of the most common reasons for newbies in Japan to abandon learning Python. Non-ASCII module names would be another source of confusion for newbies.
>
> Experienced Japanese programmers may avoid non-ASCII module names to steer clear of encoding issues.
>
> But novice programmers, or non-programmers who want to learn programming with Python, will want to use Japanese module names. Their programs will stop working if they copy them to another environment. Sooner or later, they will see a strange ImportError and will start complaining "Python sucks! Python doesn't support Japanese!" on Twitter.
>
> Copying files with non-ASCII file names across platforms is not as easy as it sounds. What happens if I copy such files from OSX to my web hosting server? The results might differ depending on the tools I use to copy and on the platforms.

These all sound like good reasons to continue to *advise* against using non-ASCII module names. But aside from that, they sound exactly like a lot of the arguments we heard when Py3k started enforcing the bytes/text distinction more rigorously: "you're going to break stuff!".

Yes, we know. But if core software development components like Python don't try to improve their Unicode support, how is the situation ever going to get better?

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Import and unicode: part two
On Jan 20, 2011, at 3:55 PM, Antoine Pitrou wrote:
> On Thu, 20 Jan 2011 15:27:08 -0500 Glyph Lefkowitz wrote:
>> To support the latter, could we just make sure that zipimport has a consistent, non-locale-or-operating-system-dependent interpretation of encoding?
>
> It already has, but it's dependent on a flag in the zip file itself (actually, one flag per archived file in the zip it seems).
>
> (by the way, it would be nice if your text/mail editor wrapped lines at 80 characters or something)

You could complain to Apple, but it seems unlikely that they'd change it. They broke it intentionally in OSX 10.6.2 for better compatibility with MS Outlook.

(for the technically inclined: It still wraps lines at 80 characters in the raw message, but it uses quoted-printable encoding to escape the line-breaks, so mail readers which decode quoted-printable but can't flow text are now S.O.L. Apple used to use the nice format=flowed standard instead.)

James
Re: [Python-Dev] Import and unicode: part two
On Fri, Jan 21, 2011 at 5:27 AM, Toshio Kuratomi wrote: > I think that both ideas are inferior to mandating that every python module > filename is ascii. From what I'm getting from Victor's posts is that he, at > least, considers the portability problems to be ignorable because dealing > with ambiguous file name encodings is something that he'd like to force > third party tools to deal with. I think you're starting from an incorrect premise: we *already* allow non-ASCII module names in Py3k. They just don't always work properly, hence why people are currently much, much better off using pure ASCII for their module names (as ASCII is still the lowest common denominator for internet communication). However, you are proposing that, instead of attempting to fix at least some of the cases where it doesn't work, we throw up our hands and tell people "Since some poorly configured systems have trouble with this feature, we're taking it away from everybody. Sorry if this breaks your code." While there may be situations where that's a valid approach, this isn't one of them. Yes, non-ASCII filenames are problems for all sorts of reasons (with Python's historically poor support being one of them). The idea is that we're striving to no longer be part of that problem, even if it isn't within our power to fix it entirely. Once we fix the core to handle various Unicode issues, then over time that support can ripple out through the rest of the Python ecosystem - we don't expect everything to magically "just work" as soon as the basic issue in the core is fixed. It's going to be *years* before non-ASCII file names are as portable as pure ASCII ones (it kind of reminds me of the era when you had to avoid spaces in filenames because so many applications choked on them, even after the OS had been updated to support them). 
As for the question of filenames not being re-encoded properly when copied between two systems: yes, that *is* a problem with the third party tools used to do the copying. Such tools will break any code that uses the str APIs to access the filesystem.

To deal with the case of undecodable filenames that the import system skips over, it is certainly possible that importlib or runpy (probably the former) could acquire a function that allows a named file to be imported directly (with a specific module name) rather than requiring the import system to search for it.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
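A facility along these lines did later appear in the standard library: in Python 3.5 and newer, `importlib.util.spec_from_file_location` and `importlib.util.module_from_spec` can load a named file under a chosen module name. The sketch below is illustrative only. The helper `import_file_as` and the file name `cafe_impl.py` are assumptions made for the example, not anything proposed in this thread.

```python
import importlib.util
import os
import sys
import tempfile

def import_file_as(module_name, path):
    """Hypothetical helper: load the file at `path` under `module_name`."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before executing, as import does
    spec.loader.exec_module(module)
    return module

with tempfile.TemporaryDirectory() as tmp:
    # The file name is plain ASCII; only the module name is non-ASCII.
    path = os.path.join(tmp, "cafe_impl.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write("greeting = 'Hello world from {}'.format(__name__)\n")
    cafe = import_file_as("café", path)

print(cafe.greeting)
```

Because the file is located by an explicit path, the module name never has to be encoded for the filesystem, which is the property the paragraph above is after.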
Re: [Python-Dev] Import and unicode: part two
On 20.01.2011 12:51, Victor Stinner wrote:
> You only give theorical arguments

Read Anathem lately? ;)

Georg
Re: [Python-Dev] Import and unicode: part two
On 1/20/2011 12:27 PM, Glyph Lefkowitz wrote:
> To support the latter, could we just make sure that zipimport has a consistent, non-locale-or-operating-system-dependent interpretation of encoding? That way a distributed egg would be importable from a zipfile regardless of how screwed up the distribution target machine's filesystem is. (And this is yet more motivation for distributors to set zip_safe=True.)

I guess zip_safe is a distutils thing, and I haven't (yet) used distutils.

But regarding zip files, I was trying to figure out whether the ZipFile module supports the CP437/UTF-8 flag, but its documentation seems to predate that concept, and just talks about unencoded byte streams. Yet I think I have Python3 code that passes str for the filenames, and that works, so some amount of encoding and decoding to something must be happening behind the documentation's back?

It does seem that if a ZipFile is created with the UTF-8 flag turned on, Python should respect that, and that this should be independent of the filesystem encoding configured on the local machine on which the ZipFile is used (as long as the name of the ZipFile is usable).

I do know that listing filenames from a zip file created without the UTF-8 flag, using ZipFile to access it and placing the names inside a web page that declares its encoding to be UTF-8, produces illegal characters. So I've recently become aware that zip files do have such a flag, and have been learning the right options to turn it on for the command line tools I use to create zip files... but I was surprised when investigating the same for ZipFile.
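The behaviour asked about above can be checked with the standard `zipfile` module. In current Python 3 versions, bit 11 (0x800) of each entry's general purpose flags marks the name as UTF-8, and `zipfile` sets it when a name is not pure ASCII. A minimal sketch, with made-up archive member names and an in-memory archive so that no local filesystem encoding is involved:

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: file name is encoded as UTF-8

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("cafe.py", "print('ascii name')\n")
    zf.writestr("café.py", "print('non-ASCII name')\n")

# Read the archive back and record, per member, whether the UTF-8 flag is set.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG) for info in zf.infolist()}

print(flags)
```

Names stored without the flag are decoded as CP437 by default when read, which matches the mojibake described above for archives created by tools that do not set it.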
Re: [Python-Dev] Import and unicode: part two
On Thu, 20 Jan 2011 15:27:08 -0500 Glyph Lefkowitz wrote:
> To support the latter, could we just make sure that zipimport has a consistent, non-locale-or-operating-system-dependent interpretation of encoding?

It already has, but it's dependent on a flag in the zip file itself (actually, one flag per archived file in the zip it seems).

(by the way, it would be nice if your text/mail editor wrapped lines at 80 characters or something)

Regards

Antoine.
Re: [Python-Dev] Import and unicode: part two
Toshio Kuratomi:
> My examples that you're replying to involve two "properly configured" OS's. The Linux workstations are configured with a UTF-8 locale. The Windows OS's use wide character unicode. The problem occurs in that the code that one of the parties develops (either the students or the professors) is developed on one of those OS's and then used on the other OS.

This implies a symmetric issue, but I cannot see how there can be a problem with non-ASCII module names on Windows, as the file system allows all Unicode characters and so can represent any module name. OS X is also based on Unicode file names. While it is possible to mount file systems on Windows or OS X that do not support Unicode file names, this is a very unusual situation that will cause problems in other ways.

Common Linux distributions like Ubuntu and Fedora now default to UTF-8 locales. The situations in which users may encounter installations that do not support Unicode file names have become much rarer.

Neil
Re: [Python-Dev] Import and unicode: part two
On Jan 20, 2011, at 11:46 AM, Guido van Rossum wrote: > On Thu, Jan 20, 2011 at 5:16 AM, Nick Coghlan wrote: >> On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross >> wrote: >>> I'm changing my vote on this to a +1 for two reasons: >>> >>> * Initially I thought this wasn't supported by Python at all but I see >>> that currently it is supported but that support is broken (or at least >>> limited to UTF-8 filesystem encodings). Since support is there, might >>> as well make it better (especially if it tidies up the code base at >>> the same time). >>> >>> * I still don't think it's a good idea to give modules non-ASCII names >>> but the "consenting adults" approach suggests we should let people >>> shoot themselves in the foot if they believe they have good reason to >>> do so. >> >> I'm also +1 on this for the reasons Simon gives. > > Same here. *Most* code will never be shared, or will only be shared > between users in the same community. When it goes wrong it's also a > learning opportunity. :-) Despite my usual proclivity for being contrarian, I find myself in agreement here. Linux users with locales that don't specify UTF-8 frankly _should_ have to deal with all kinds of nastiness until they can transcode their filesystems. MacOS and Windows both have a "right" answer here and your third-party tools shouldn't create mojibake in your filenames. However, I feel that we should not necessarily be making non-ASCII programmers second-class citizens, if they are to be supported at all. The obvious outcome of the current regime is, if you want your code to work in the wider world, you have to make everything ASCII, so non-ASCII programmers have to do a huge amount of extra work to prepare their stuff for distribution. As an english speaker I'd be happy about that, but as a person with a lot of Chinese in-laws, it gives me pause. 
There is a difference between sharing code for inspection and editing (where a little codec pain is good for the soul: set your locale to UTF-8 and forget it already!) and sharing code so that a (non-programming) user can just run it. If I can write software in English and distribute it to Chinese people, fair's fair, they should be able to write it in Chinese and have it work on my computer.

To support the latter, could we just make sure that zipimport has a consistent, non-locale-or-operating-system-dependent interpretation of encoding? That way a distributed egg would be importable from a zipfile regardless of how screwed up the distribution target machine's filesystem is. (And this is yet more motivation for distributors to set zip_safe=True.)
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 01:43:03PM -0500, Alexander Belopolsky wrote:
> On Thu, Jan 20, 2011 at 12:44 PM, Toshio Kuratomi wrote:
> > .. My examples that you're replying to involve two "properly configured" OS's. The Linux workstations are configured with a UTF-8 locale. The Windows OS's use wide character unicode. The problem occurs in that the code that one of the parties develops (either the students or the professors) is developed on one of those OS's and then used on the other OS.
>
> I re-read your posts on this thread, but could not find the examples that you refer to.

Examples might be a bad word in this context. Victor was commenting on the two brainstorm ideas for alternatives to ASCII-only that I had. One was:

* Mandate that every python module on a platform has a specific encoding (rather than the value of the locale)

The other was:

* allow using byte strings for import

I think that both ideas are inferior to mandating that every python module filename is ASCII. What I'm getting from Victor's posts is that he, at least, considers the portability problems to be ignorable because dealing with ambiguous file name encodings is something that he'd like to force third party tools to deal with.

-Toshio
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 12:44 PM, Toshio Kuratomi wrote:
> .. My examples that you're replying to involve two "properly configured" OS's. The Linux workstations are configured with a UTF-8 locale. The Windows OS's use wide character unicode. The problem occurs in that the code that one of the parties develops (either the students or the professors) is developed on one of those OS's and then used on the other OS.

I re-read your posts on this thread, but could not find the examples that you refer to. ISTM, your hypothetical students should have no problem as long as their professor uses proper tools to package her code. For example, if she uses a recent version of zip that supports the Info-ZIP Unicode Comment Extra Field (see http://www.pkware.com/documents/casestudies/APPNOTE.TXT) and the students use a similarly up to date unzip tool, the shared code should work as expected. Similarly, I would be surprised if a Samba server were not able to present a shared Linux partition that uses UTF-8 encoding to a Windows client in a way that makes wopen() work as expected.

The problem with the current Python import mechanism is that it does not use wopen() on Windows and instead attempts to encode the Unicode module name into a mythical single-byte filesystem encoding (locale ANSI code page?) and calls byte-oriented open(char *) on the result.
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 11:45 AM, Andy Teijelo wrote:
..
> but if the code said:
>
> import café
>
> then Python would look, in any platform, for a file named:
>
> café.py or café.py or something nicer.
>
> Something along the lines of xmlcharrefreplace.
> Just an idea.

Curiously, something like this already happens on OSX when a filename is not valid UTF-8. For example,

    >>> open(b'\xdb\xcd', 'w').close()
    >>> open(b'\xdb\xcd')
    <_io.TextIOWrapper name=b'\xdb\xcd' mode='r' encoding='UTF-8'>

but the actual file created is named "%DB%CD". (Looks like URL-encoding).
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 12:51:29PM +0100, Victor Stinner wrote:
> On Wednesday, 19 January 2011 at 20:39 -0800, Toshio Kuratomi wrote:
> > Teaching students to write non-portable code (relying on filesystem encoding where your solution is, don't upload to pypi anything that has non-ascii filenames) seems like the exact opposite of how you'd want to shape a young student's understanding of good programming practices.
>
> That was already discussed before: see PEP 3131.
> http://www.python.org/dev/peps/pep-3131/#common-objections
>
> If the teacher chooses to use non-ASCII, (s)he is responsible for explaining the consequences to his/her students :-)

It's not discussed in that PEP section. The PEP section says this:

"People claim that they will not be able to use a library if to do so they have to use characters they cannot type on their keyboards."

Whether you can type it at your keyboard or not is not the problem here. The problem is portability. The students and professors are sharing code with each other. But because of a mixture of operating systems (let alone locale settings), the code written by one partner is unable to run on the computer of the other.

If non-ascii filenames without a defined encoding are considered a feature, python cannot even issue a descriptive error when this occurs. It can only say that it could not find the module but not why. A restriction on module names to ascii only could actually state that module names are not allowed to be non-ASCII when it encounters the import line.

> > > In a school, you can use the same configuration (encoding) on all computers.
> >
> > In a school computer lab perhaps. But not on all the students' and professors' machines. How many professors will be cursing python when they discover that the example code that they wrote on their Linux workstation doesn't work when the students try to use it in their windows computer lab?
>
> Because some students use a stupid or misconfigured OS, Python should only accept ASCII names?

Just a note -- you'll get much farther if you refrain from calling names. It just makes me think that you aren't reading and understanding the issue I'm raising. My examples that you're replying to involve two "properly configured" OS's. The Linux workstations are configured with a UTF-8 locale. The Windows OS's use wide character unicode. The problem occurs in that the code that one of the parties develops (either the students or the professors) is developed on one of those OS's and then used on the other OS.

> So why does Python 3 support non-ASCII filenames: it is very well known that non-ASCII filenames are the root of many troubles! Should we simply drop unicode support for all filenames? And maybe restrict bytes filenames to bytes in [0; 127]? Or better, restrict to [32; 126] (U+007f causes some troubles in some terminals).

If you want to argue from the fact that python3 supports non-ascii filenames in other code, then the logical extension is that the import mechanism should support importing module names defined by byte sequences. I happen to think that import has a lot of differences between it and other filenames, as I've said three times now.

> I think that in 2011, non-ASCII filenames are well supported on all (modern) operating systems. Issues with non-ASCII filenames are OS specific and should be fixed by the user (the admin of the computer).
>
> > Additionally, those other filesystem operations have been growing the ability to take byte values and encoding parameters because unicode translation via a single filesystem encoding is a good default but not a complete solution.
>
> If you are unable to configure your system to decode/encode filenames correctly, you should just avoid non-ASCII characters in the module names.

This seems like an argument to only have unicode versions of all filesystem operations. Since you've been spearheading the effort to have bytes versions of things that access filenames, environment variables, etc, I don't think that you seriously mean that. Perhaps there is a language issue here.

> You only give theorical arguments: did you at least try to use non-ASCII module names on your system with Python 3.2? I suppose that it will just work and you will never notice that the unicode module name (on "import café") is encoded to bytes.

Yes I did, and I got it to fail in a corner case, as I showed twice with the same example in other posts. However, I want to make clear here that the issue is not that I can create a non-ascii filename and then import it. The issue is that I can create a non-ascii filename and then try to share it with the usual tools and it won't work on the recipient's system. (A tangent is whether the recipient's system is physically distinct from mine or only has a different environment on the same physical host.)

> It fails on OSes using filesystem encodings other than UTF-8 (eg
Re: [Python-Dev] Import and unicode: part two
(Hi, I'm writing from an address different to the one I'm subscribed with to the list because I don't have reverse dns in my mail server and mail.python.org rejects my messages. I hope that's not much trouble)

Maybe Python should always use an ASCII encodable filename for modules: a translation of the module name into an ASCII encodable string that, preferably, was the same as the module name if the module name didn't have any non-ASCII characters.

Like, if the code said:

import cafe

Python would look for a file named:

cafe.py

but if the code said:

import café

then Python would look, in any platform, for a file named:

café.py or café.py or something nicer.

Something along the lines of xmlcharrefreplace.
Just an idea.

Andy.

On 1/20/11 12:21 AM, Glyph Lefkowitz wrote:
> On Jan 20, 2011, at 12:19 AM, Glenn Linderman wrote:
>> Now if the stuff after m_ was the hex UTF-8 of "café", that could get interesting :)
>
> (As it happens, it's the hex digest of the MD5 of the UTF-8 of café... ;-))
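The `xmlcharrefreplace` error handler mentioned above is a real codec option, so the ASCII-only file name this idea would produce is easy to show. This is only a sketch of the suggestion; the function name `ascii_module_filename` is hypothetical, and nothing like it exists in the import system.

```python
def ascii_module_filename(module_name):
    """Hypothetical mapping from a module name to an ASCII-only file name."""
    # Non-ASCII characters become numeric XML character references.
    encoded = module_name.encode("ascii", "xmlcharrefreplace")
    return encoded.decode("ascii") + ".py"

print(ascii_module_filename("cafe"))
print(ascii_module_filename("café"))
```

Pure ASCII names pass through unchanged, while `café` maps to `caf&#233;.py`, a name that any filesystem encoding containing ASCII can represent.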
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 5:16 AM, Nick Coghlan wrote:
> On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross wrote:
>> I'm changing my vote on this to a +1 for two reasons:
>>
>> * Initially I thought this wasn't supported by Python at all but I see that currently it is supported but that support is broken (or at least limited to UTF-8 filesystem encodings). Since support is there, might as well make it better (especially if it tidies up the code base at the same time).
>>
>> * I still don't think it's a good idea to give modules non-ASCII names but the "consenting adults" approach suggests we should let people shoot themselves in the foot if they believe they have good reason to do so.
>
> I'm also +1 on this for the reasons Simon gives.

Same here. *Most* code will never be shared, or will only be shared between users in the same community. When it goes wrong it's also a learning opportunity. :-)

> I should have a chance to look at the patch this weekend.

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 10:08 PM, Simon Cross wrote:
> I'm changing my vote on this to a +1 for two reasons:
>
> * Initially I thought this wasn't supported by Python at all but I see that currently it is supported but that support is broken (or at least limited to UTF-8 filesystem encodings). Since support is there, might as well make it better (especially if it tidies up the code base at the same time).
>
> * I still don't think it's a good idea to give modules non-ASCII names but the "consenting adults" approach suggests we should let people shoot themselves in the foot if they believe they have good reason to do so.

I'm also +1 on this for the reasons Simon gives.

I should have a chance to look at the patch this weekend.

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 5:01 PM, Simon Cross wrote:
> On Wed, Jan 19, 2011 at 2:34 PM, Victor Stinner wrote:
>> (a) Python 3 doesn't support non-ASCII module names
>
> -0: I'm vaguely against this being supported because I'd rather not have to deal with what happens when the guess regarding the filesystem encoding is wrong. On the other hand, a general encouragement to stick to ASCII module names is probably functionally equivalent without imposing a hard restriction.

I'm changing my vote on this to a +1 for two reasons:

* Initially I thought this wasn't supported by Python at all but I see that currently it is supported but that support is broken (or at least limited to UTF-8 filesystem encodings). Since support is there, might as well make it better (especially if it tidies up the code base at the same time).

* I still don't think it's a good idea to give modules non-ASCII names but the "consenting adults" approach suggests we should let people shoot themselves in the foot if they believe they have good reason to do so.

Schiavo
Simon
Re: [Python-Dev] Import and unicode: part two
Le mercredi 19 janvier 2011 à 20:39 -0800, Toshio Kuratomi a écrit : > Teaching students to write non-portable code (relying on filesystem encoding > where your solution is, don't upload to pypi anything that has non-ascii > filenames) seems like the exact opposite of how you'd want to shape a young > student's understanding of good programming practices. That was already discuted before: see PEP 3131. http://www.python.org/dev/peps/pep-3131/#common-objections If the teacher choose to use non-ASCII, (s)he is responsible to explain the consequences to his/her students :-) > > In a school, you can use the same configuration > > (encoding) on all computers. > > > In a school computer lab perhaps. But not on all the students' and > professors' machines. How many professors will be cursing python when they > discover that the example code that they wrote on their Linux workstation > doesn't work when the students try to use it in their windows computer lab? Because some students use a stupid or misconfigured OS, Python should only accept ASCII names? So, why do Python 3 support non-ASCII filenames: it is very well known that non-ASCII filenames is the root in many troubles! Should we simply drop unicode support for all filenames? And maybe restrict bytes filenames to bytes in [0; 127]? Or better, restrict to [32; 126] (U+007f causes some troubles in some terminals). I think that in 2011, non-ASCII filenames are well supported on all (modern) operating systems. Issues with non-ASCII filenames are OS specific and should be fixed by the user (the admin of the computer). > Additionally, those other filesystem operations have > been growing the ability to take byte values and encoding parameters because > unicode translation via a single filesystem encoding is a good default but > not a complete solution. If you are unable to configure correctly your system to decode/encode correctly filenames, you should just avoid non-ASCII characters in the module names. 
You only give theoretical arguments: did you at least try to use non-ASCII module names on your system with Python 3.2? I suppose that it will just work and you will never notice that the unicode module name (in "import café") is encoded to bytes. It fails on OSes using filesystem encodings other than UTF-8 (e.g. Windows)... because of a Python bug, and I just asked whether I should fix this bug (or whether we should deny non-ASCII names). If the bug is fixed, it will work everywhere.

> Your solution creates modules which aren't portable

More and more operating systems use a filesystem encoding able to encode any Unicode character. ASCII-only always gives you the best portability, but I think that today you can start to play with (at least) ISO-8859-1 characters (café should work on all operating systems). If you don't like Unicode issues (I personally love them!), just use ASCII everywhere.

> One of my proposals creates python code which isn't portable. The other one
> suffers some of the same disadvantages as your solution in portability but
> allows for tools that could automatically correct modules.

__import__('café'.encode('UTF-8')) or __import__('café'.encode('ISO-8859-1')) is less portable than __import__('café').

> You think that if a module is named appropriately on one system but is not
> portable to another
> system, that's fine.

No, I am not saying that. I am saying that if your name gets broken while you transfer your project from one system to another (e.g. decompressing an archive creates filenames with mojibake), you should fix your transfer procedure (e.g. use another archive format, use a script to fix filenames, or anything else), but don't try to handle invalid filenames.

> Setting system locale to ASCII for use in system-wide scripts

This is stupid :-) Yes, on such a system, you cannot open *any* non-ASCII file with Python 3 (unless you work, as in Python 2, on bytes filenames).
Python cannot do anything to improve Unicode support on such a system: only the administrator can do something about that.

I know that you can give me many examples of systems where Unicode doesn't work because the system is not correctly configured. But my opinion is that we should support non-ASCII names because there are "some" systems somewhere where Unicode is fully functional :-)

Victor
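The "use a script to fix filenames" suggestion above can be sketched in Python. This is only a rough, hedged analogue of what a tool such as convmv does: the function names `transcode_name` and `fix_directory` are made up for illustration, and the heuristic assumes that any name which is not valid UTF-8 was written as Latin-1.

```python
import os


def transcode_name(raw, src="latin-1", dst="utf-8"):
    """Return raw re-encoded from src to dst, or raw unchanged if it already decodes as dst."""
    try:
        raw.decode(dst)
        return raw
    except UnicodeDecodeError:
        # Assumption: a name that is not valid dst was written using src.
        return raw.decode(src).encode(dst)


def fix_directory(directory, src="latin-1", dst="utf-8", dry_run=True):
    """Collect (old, new) byte names; rename on disk only when dry_run is False."""
    base = os.fsencode(directory)
    changes = []
    for raw in os.listdir(base):  # bytes in, bytes out: no decoding by Python
        new = transcode_name(raw, src, dst)
        if new != raw:
            changes.append((raw, new))
            if not dry_run:
                os.rename(os.path.join(base, raw), os.path.join(base, new))
    return changes


print(transcode_name(b"caf\xe9.py"))      # Latin-1 bytes get re-encoded
print(transcode_name(b"caf\xc3\xa9.py"))  # already UTF-8, left alone
```

The heuristic is deliberately conservative: a Latin-1 name whose bytes happen to form valid UTF-8 is left untouched, so a real migration would still need a human to review the dry-run list.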
Re: [Python-Dev] Import and unicode: part two
On Jan 19, 2011, at 11:39 PM, Toshio Kuratomi wrote:
> Tangent: This is not true about Linux. UTF-8 is a matter of the
> interpretation of the filesystem bytes that the user specifies by setting
> their system locale. Setting system locale to ASCII for use in system-wide
> scripts, is quite common as is changing locale settings in other parts of
> the world (as I can tell you from the bug reports colleagues CC me on to fix
> for the problems with unicode support in their python2 programs).

Fortunately, there's been some (slow) movement towards adding a "C.UTF-8" locale and using that by default where "C" (ASCII) is currently used. So that may be less of a problem in a few years' time.

James
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 11:20 PM, Toshio Kuratomi wrote: On Wed, Jan 19, 2011 at 09:02:17PM -0800, Glenn Linderman wrote: On 1/19/2011 8:39 PM, Toshio Kuratomi wrote: use this:: import cafe as café When you do things this way you do not have to translate between unknown encodings into unicode. Everything is within python source where you have a defined encoding. This is a great way of converting non-portable module names, if the module ever leaves the bounds of its computer, and runs into problems there. You're missing a piece here. If you mandate ascii you can convert to a unicode name using "import as" because python knows that it has ascii text from the filesystem when it converts it to an abstract unicode string that you've specified in the program text. You cannot go the other way because python lacks the information (the encoding of the filename on the filesystem) to do the transformation. Your demonstration of such an easy solution to the concerns you raise convinces me more than ever that it is acceptable to allow non-ASCII module names. For those programmers in a single locale environment, it'll just work. And for those not in a single locale environment, there is your above simple solution to achieve portability without changing large numbers of lines of code. Does my demonstration that you can't do that mean that it's no longer acceptable? :-) /me guesses that the relative merits of being forced to write portable code vs convenience of writing a module name in your native script still has a different balance than in mine, thus the smiley :-) -Toshio Sadly, you didn't demonstrate it, you seem to have misunderstood my statement, which was probably not all that clear, somehow. Let me try again. User codes module café.py, tests, debugs, completes, is happy. User moves code to a different computer, different locale, no é character, module can't be found, is sad. 
User renames file to cafefromuser.py, changes the import statement from

    import café

to

    import cafefromuser as café

module now imports successfully, no other code changes needed. User is happy again, thanks Toshio for great solution to file system encoding problem.
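The rename-and-alias steps described above can be tried end to end with a throwaway module. This is a minimal sketch: the module contents and the `menu` function are invented for the demonstration, and the file is written to a temporary directory rather than a real project.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # The file on disk has an ASCII-only name, so no filesystem encoding is involved.
    with open(os.path.join(tmp, "cafefromuser.py"), "w", encoding="utf-8") as f:
        f.write("def menu():\n    return ['espresso', 'croissant']\n")

    sys.path.insert(0, tmp)
    try:
        importlib.invalidate_caches()
        # The non-ASCII name lives only in the source text, which has a declared encoding.
        import cafefromuser as café
        items = café.menu()
    finally:
        sys.path.remove(tmp)

print(items)
print(café.__name__)
```

The rest of the program keeps referring to `café`, which is why only the import line has to change after the file is renamed.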
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 09:02:17PM -0800, Glenn Linderman wrote: > On 1/19/2011 8:39 PM, Toshio Kuratomi wrote: > > use this:: > >import cafe as café > > When you do things this way you do not have to translate between unknown > encodings into unicode. Everything is within python source where you have > a defined encoding. > > > This is a great way of converting non-portable module names, if the module > ever > leaves the bounds of its computer, and runs into problems there. > You're missing a piece here. If you mandate ascii you can convert to a unicode name using "import as" because python knows that it has ascii text from the filesystem when it converts it to an abstract unicode string that you've specified in the program text. You cannot go the other way because python lacks the information (the encoding of the filename on the filesystem) to do the transformation. > Your demonstration of such an easy solution to the concerns you raise > convinces > me more than ever that it is acceptable to allow non-ASCII module names. For > those programmers in a single locale environment, it'll just work. And for > those not in a single locale environment, there is your above simple solution > to achieve portability without changing large numbers of lines of code. > Does my demonstration that you can't do that mean that it's no longer acceptable? :-) /me guesses that the relative merits of being forced to write portable code vs convenience of writing a module name in your native script still has a different balance than in mine, thus the smiley :-) -Toshio pgpVg5DKpRDXA.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 12:11 AM, Glyph Lefkowitz wrote:
..
>> But for local code, having to think up an ASCII name for a module rather
>> than use the obvious native-language name, is just brain-burden when
>> creating the code.
>
> Is it really? You already had to type 'import', presumably if you can think
> in Python you can think in ASCII.

Yes, it is a burden. For example, the Russian word "щи" can be transliterated into ASCII as "schi", "shchi", "stchi", or even "wji". There are many incompatible standards and none of them is well-known or "natural". Reading transliterated Cyrillic text is not hard, but guessing the correct spelling is nearly impossible. Good programming style guides recommend avoiding arbitrary contractions in variable names for the same reason.
Re: [Python-Dev] Import and unicode: part two
On Jan 20, 2011, at 12:19 AM, Glenn Linderman wrote:
> Now if the stuff after m_ was the hex UTF-8 of "café", that could get
> interesting :)

(As it happens, it's the hex digest of the MD5 of the UTF-8 of café... ;-))
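For the curious, the naming scheme joked about here is easy to reproduce. The sketch below is only an illustration: the helper name `ascii_alias` is made up, and the digest you get depends on the exact code points in the string (a precomposed "é" and an "e" followed by a combining accent hash differently), so no particular output is claimed.

```python
import hashlib


def ascii_alias(module_name):
    """Build an ASCII-only module file name from the MD5 of the UTF-8 bytes of module_name."""
    digest = hashlib.md5(module_name.encode("utf-8")).hexdigest().upper()
    return "m_" + digest


alias = ascii_alias("café")
print(alias)
# Hypothetical use, following the message: import <alias> as café
```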
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 9:11 PM, Glyph Lefkowitz wrote: On Jan 20, 2011, at 12:02 AM, Glenn Linderman wrote: But for local code, having to think up an ASCII name for a module rather than use the obvious native-language name, is just brain-burden when creating the code. Is it really? You already had to type 'import', presumably if you can think in Python you can think in ASCII. There is a difference between memorizing and typing keywords, and inventing new names in non-native scripts. It is hard to even invent all the names in one's native language; if restricted to inventing them, even some of them, in some non-native script such as ASCII, it is just brain-burden indeed. (After my experiences with namespace crowding in Twisted, I'm inclined to suggest something more like "import m_07117FE4A1EBD544965DC19573183DA2 as café" - then I never need to worry about "café2" looking ugly or "cafe" being incompatible :).) Now if the stuff after m_ was the hex UTF-8 of "café", that could get interesting :) But now you are talking about automating the creation of ASCII file names from the actual non-ASCII names of the modules, or something. Sadly, the module is not required to contain its name, so if it differs from the filename, some global view or non-Python annotation would be required to create/maintain the mapping. [This paragraph is only semi-serious, like yours.] ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 11:39 PM, Toshio Kuratomi wrote:
..
> Teaching students to write non-portable code (relying on filesystem encoding
> where your solution is, don't upload to pypi anything that has non-ascii
> filenames) seems like the exact opposite of how you'd want to shape a young
> student's understanding of good programming practices.
>

Let's not confuse language definition with the quality of implementation. A Python implementation that runs on a system without a traditional filesystem, where "import foo" looks up the code of module foo in an in-memory database, would be perfectly valid.

Should Python be redefined so that module names are case insensitive simply because case insensitive filesystems are still popular? I don't think so.
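The point that import need not touch a filesystem can be shown with the standard import hooks. This is a minimal sketch, not a proposal from the thread: `MODULE_SOURCES` and `InMemoryFinder` are hypothetical names, and the "database" is just a dict.

```python
import importlib
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory "database": module name -> source text.
MODULE_SOURCES = {"café": "GREETING = 'bonjour'\n"}


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Finder and loader in one object; it never looks at the disk."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname in MODULE_SOURCES:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the regular finders handle everything else

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        exec(MODULE_SOURCES[module.__name__], module.__dict__)


sys.meta_path.insert(0, InMemoryFinder())

café = importlib.import_module("café")
print(café.GREETING)
```

Because the module name is matched as a Python string against the dict keys, no byte encoding of "café" is ever needed here; the encoding question only appears once names have to be mapped onto filesystem bytes.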
Re: [Python-Dev] Import and unicode: part two
On Jan 20, 2011, at 12:02 AM, Glenn Linderman wrote:
> But for local code, having to think up an ASCII name for a module rather than
> use the obvious native-language name, is just brain-burden when creating the
> code.

Is it really? You already had to type 'import', presumably if you can think in Python you can think in ASCII.

(After my experiences with namespace crowding in Twisted, I'm inclined to suggest something more like "import m_07117FE4A1EBD544965DC19573183DA2 as café" - then I never need to worry about "café2" looking ugly or "cafe" being incompatible :).)
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 8:39 PM, Toshio Kuratomi wrote: use this:: import cafe as café When you do things this way you do not have to translate between unknown encodings into unicode. Everything is within python source where you have a defined encoding. This is a great way of converting non-portable module names, if the module ever leaves the bounds of its computer, and runs into problems there. It may be that the best practices for writing platform portable modules should include * ASCII module filenames * Code that can handle 16 or 32 bit Unicode * and likely some other things. But for local code, having to think up an ASCII name for a module rather than use the obvious native-language name, is just brain-burden when creating the code. Your demonstration of such an easy solution to the concerns you raise convinces me more than ever that it is acceptable to allow non-ASCII module names. For those programmers in a single locale environment, it'll just work. And for those not in a single locale environment, there is your above simple solution to achieve portability without changing large numbers of lines of code. Glenn ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 9:07 PM, Toshio Kuratomi wrote:
..
> Do you have a solution to the problem? I haven't looked at your patch so
> perhaps you have an ingenous method of translating from the unicode
> representation of the module in the import statement to the bytes in
> arbitrary encodings on the filesystem that I haven't thought of.

If I understand what Victor's patch does correctly, it allows Python on Windows to bypass translation from Unicode to bytes by using Windows "wide character" APIs.
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 03:51:05AM +0100, Victor Stinner wrote: > For a lesson at school, it is nice to write examples in the > mother language, instead of using "raw" english with ASCII identifiers > and filenames. Then use this:: import cafe as café When you do things this way you do not have to translate between unknown encodings into unicode. Everything is within python source where you have a defined encoding. Teaching students to write non-portable code (relying on filesystem encoding where your solution is, don't upload to pypi anything that has non-ascii filenames) seems like the exact opposite of how you'd want to shape a young student's understanding of good programming practices. > In a school, you can use the same configuration > (encoding) on all computers. > In a school computer lab perhaps. But not on all the students' and professors' machines. How many professors will be cursing python when they discover that the example code that they wrote on their Linux workstation doesn't work when the students try to use it in their windows computer lab? How many students will be upset when the code they turn in runs on their professor's test machine if the lab computers were booted into the Linux partition but not if the they were booted into Windows? > > > > > * Specify an encoding per platform and stick to that. > > > > > > It doesn't work: on UNIX/BSD, the user chooses its own encoding and all > > > programs will use it. > > > > > (...) This prevents getting a mixture of encodings of modules (...) > > If you have an issue with encodings, when have to fix it when you create > a module (on disk), not when you load a module (it is too late). > It's not too late to throw a clear error of what's wrong. > > I haven't looked at your patch so > > perhaps you have an ingenous method of translating from the unicode > > representation of the module in the import statement to the bytes in > > arbitrary encodings on the filesystem that I haven't thought of. 
> > On Windows, My patch tries to avoid any conversion: it uses unicode > everywhere. > > On other OSes, it uses the Python filesystem encoding to encode a module > name (as it is done for any other operation on the filesystem with an > unicode filename). > The other interfaces are somewhat of a red herring here. As I wrote in another email, importing modules has ramifications that open(), for instance, does not. Additionally, those other filesystem operations have been growing the ability to take byte values and encoding parameters because unicode translation via a single filesystem encoding is a good default but not a complete solution. I think that this problem demands a complete solution, however, and it seems to me that limiting the scope of the problem is the most pleasant method to accomplish this. Your solution creates modules which aren't portable. One of my proposals creates python code which isn't portable. The other one suffers some of the same disadvantages as your solution in portability but allows for tools that could automatically correct modules. > -- > > Python 3 supports bytes filename to be able to read/copy/delete > undecodable filenames, filenames stored in a encoding different than the > system encoding, broken filenames. It is also possible to access these > files using PEP 383 (with surrogate characters). This is useful to use > Python on an old system. > > > If you don't, however, then really - ASCII-only seems like the sanest > > of the three solutions I can think of. > > But a (Python 3) module is not supposed to have a broken filename. If it > is the case, you have better to fix its name, instead of trying to fix > the problem later (in Python). > We agree that there should not be broken module names. However it seems we very hotly disagree about the definition of that. You think that if a module is named appropriately on one system but is not portable to another system, that's fine. 
I think that portability between systems is very important and sacrificing that so that someone can locally use a module with non-ASCII characters doesn't have a justifiable reward. > With UTF-8 filesystem encoding (eg. on Mac OS X, and most Linux setups), > it is already possible to use non-ASCII module names. > Tangent: This is not true about Linux. UTF-8 is a matter of the interpretation of the filesystem bytes that the user specifies by setting their system locale. Setting system locale to ASCII for use in system-wide scripts, is quite common as is changing locale settings in other parts of the world (as I can tell you from the bug reports colleagues CC me on to fix for the problems with unicode support in their python2 programs). Allowing module names incompatible with ascii without specifying an encoding will just lead to bug reports down the line. Relatively few programmers understand the difference between the python unicode abstraction and the byte representations possible for those strings. Allowing
Re: [Python-Dev] Import and unicode: part two
Le mercredi 19 janvier 2011 à 18:07 -0800, Toshio Kuratomi a écrit : > Saying that multiple encodings on a single system is a misconfiguration > every time it comes up does not make it true. Yes, each filesystem can have its own encoding. For example, this is supported by Linux. Python doesn't support such configuration, but this limitation is wider than the import machinery. If you consider it import enough, please open an issue. > To the existing list I'd add getting a package from pypi -- > neither tar nor zip files contain encoding information about the filenames. ZIP contain a flag to indicate the encoding: cp437 or UTF-8. TAR has an extension called "PAX" which stores filenames as UTF-8. But yes, most tarballs store filenames as raw byte strings. Anyway, if you would like to share your code on PyPI, you should not use non-ASCII module names (or any other non-ASCII name/identifier :-)). Python 3 supports non-ASCII identifiers (PEP 3131), but the developer is responsible to decide if (s)he uses it or not, depending on its audience. For a lesson at school, it is nice to write examples in the mother language, instead of using "raw" english with ASCII identifiers and filenames. In a school, you can use the same configuration (encoding) on all computers. > > > * Specify an encoding per platform and stick to that. > > > > It doesn't work: on UNIX/BSD, the user chooses its own encoding and all > > programs will use it. > > > (...) This prevents getting a mixture of encodings of modules (...) If you have an issue with encodings, when have to fix it when you create a module (on disk), not when you load a module (it is too late). > (...) I mean something at the python code level:: > >import café encoded_as('latin1') Import a module using its byte name? You mean that café filename was not encoded to the Python filesystem encoding, but to other (wrong) encoding, at the creation of the module. 
As written before, you should fix your filename, instead of using an (ugly) workaround in Python. > I haven't looked at your patch so > perhaps you have an ingenous method of translating from the unicode > representation of the module in the import statement to the bytes in > arbitrary encodings on the filesystem that I haven't thought of. On Windows, My patch tries to avoid any conversion: it uses unicode everywhere. On other OSes, it uses the Python filesystem encoding to encode a module name (as it is done for any other operation on the filesystem with an unicode filename). -- Python 3 supports bytes filename to be able to read/copy/delete undecodable filenames, filenames stored in a encoding different than the system encoding, broken filenames. It is also possible to access these files using PEP 383 (with surrogate characters). This is useful to use Python on an old system. > If you don't, however, then really - ASCII-only seems like the sanest > of the three solutions I can think of. But a (Python 3) module is not supposed to have a broken filename. If it is the case, you have better to fix its name, instead of trying to fix the problem later (in Python). With UTF-8 filesystem encoding (eg. on Mac OS X, and most Linux setups), it is already possible to use non-ASCII module names. Victor ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
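The ZIP encoding flag mentioned in this message is bit 11 (0x800) of each entry's general-purpose flags, and Python's zipfile module exposes it as `ZipInfo.flag_bits`. The sketch below builds a small archive in memory to show which entries carry it; the member names are invented for the example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the file name is stored as UTF-8

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("cafe.py", "print('ascii name')\n")
    zf.writestr("café.py", "print('non-ascii name')\n")

# Re-open the archive and inspect the flag stored for each member.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG) for info in zf.infolist()}

for name, is_utf8 in flags.items():
    print(name, "UTF-8 flag set" if is_utf8 else "no UTF-8 flag (name read as cp437)")
```

This only shows what Python's own writer does. Archives produced by other tools may store non-ASCII names without setting the flag, which is the situation that leads to mojibake on extraction.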
Re: [Python-Dev] Import and unicode: part two
On Thu, Jan 20, 2011 at 01:26:01AM +0100, Victor Stinner wrote: > Le mercredi 19 janvier 2011 à 15:44 -0800, Toshio Kuratomi a écrit : > > Additionally, many unix filesystem don't specify a filesystem encoding for > > filenames; they deal in legal and illegal bytes which could lead to > > troubles. This problem of which encoding to use is a problem that can be > > seen on UNIX systems even now. > > If the system is not correctly configured, it is not a bug in Python, > but a bug in the system config. Python relies on the locale to choose > the filesystem encoding (sys.getfilesystemencoding()). Python uses this > encoding to decode and encode all filenames. > Saying that multiple encodings on a single system is a misconfiguration every time it comes up does not make it true. There's been multiple examples of how you can end up with multiple encodings of filenames on a single system listed in past threads: multiple users with different encodings for their locales, mounting remote filesystems, downloading a file To the existing list I'd add getting a package from pypi -- neither tar nor zip files contain encoding information about the filenames. Therefore if I create an sdist of a python module using non-ascii filenames using a locale of latin1 and then upload to pypi, people downloading that on a utf-8 using locale will end up not being able to use the module. > > * Specify an encoding per platform and stick to that. > > It doesn't work: on UNIX/BSD, the user chooses its own encoding and all > programs will use it. > The proposal is that you ignore that when talking about loading and creating (I mentioned distutils because my thought was that distutils could grow the ability to translate from the system locale to a chosen neutral encoding when running setup.py any of the dist commands but that doesn't address the issue when testing a module that you've just written so perhaps that's not necessary.) python modules. 
Python modules would have a set of defined filesystem encodings per system. This prevents getting a mixture of encodings of modules and having things work in one location but fail when used somewhere else. Instead, you get an upfront failure until you correct the encoding. > Anyway, I don't see why it is a problem to have different encodings on > different systems. Each system can use its own encoding. The bug that > I'm trying to solve is a Python bug, not an OS bug. > There is no OS bug here. There is perhaps an OS design flaw but it's not a flaw that will be going away soon (in part, because the present OS designers do not see it as an OS flaw... to them it's a bug in code that attempts to build a simpler interface on top of it.) > > * Change import semantics to allow specifying the encoding of the module on > > the filesystem (seems really icky). > > This is a very bad idea. I introduced PYTHONFSENCODING environment > variable in Python 3.2, but then quickly removed it, because it > introduced a lot of inconsistencies. > Thanks for getting rid of that, PYTHONFSENCODING is a bad idea because it doesn't solve the underlying issues. However, when I say specifying the encoding of the module on the filesystem, I don't mean something global like PYTHONFSENCODING -- I mean something at the python code level:: import café encoded_as('latin1') After thinking about this one, though, I don't think it will work either. This takes care of importing modules where the fs encoding of the module is known but it doesn't where the fs encoding may be translated between platforms. I believe that this could arise when untarring a module on windows using winzip or similar that gives you the option of translating from utf-8 bytes into bytes that have meaning as characters on that platform, for instance. Do you have a solution to the problem? 
I haven't looked at your patch so perhaps you have an ingenous method of translating from the unicode representation of the module in the import statement to the bytes in arbitrary encodings on the filesystem that I haven't thought of. If you don't, however, then really - ASCII-only seems like the sanest of the three solutions I can think of. -Toshio pgpxKdCbo8dSk.pgp Description: PGP signature ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
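The sdist scenario described in this message (a name written under a latin1 locale, read under a utf-8 locale) comes down to the same bytes being decoded with the wrong codec. A small sketch, with `decode_or_none` as a made-up helper:

```python
def decode_or_none(raw, encoding):
    """Decode raw with encoding, returning None when the bytes are not valid in it."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


name = "café.py"
latin1_bytes = name.encode("latin-1")  # what a latin1 locale writes to disk
utf8_bytes = name.encode("utf-8")      # what a utf-8 locale writes to disk

# A utf-8 system looking at the latin1 name cannot decode it at all.
print(decode_or_none(latin1_bytes, "utf-8"))

# A latin1 system looking at the utf-8 name decodes it, but to the wrong text.
print(decode_or_none(utf8_bytes, "latin-1"))
```

In both directions the string "café" from the import statement no longer matches what is on disk, so the import fails even though the file is there.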
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 07:11:52PM -0500, James Y Knight wrote:
> On Jan 19, 2011, at 6:44 PM, Toshio Kuratomi wrote:
> > This problem of which encoding to use is a problem that can be
> > seen on UNIX systems even now. Try this:
> >
> >   echo 'print("hi")' > café.py
> >   convmv -f utf-8 -t latin1 café.py
> >   python3 -c 'import café'
> >
> > ASCII seems very sensible to me when faced with these ambiguities.
> >
> > Other options I can brainstorm that could be explored:
> >
> > * Specify an encoding per platform and stick to that. (So, for instance,
> >   all module names on posix platforms would have to be utf-8). Force
> >   translation between encoding when installing packages (But that doesn't
> >   help for people that are creating their modules using their own build
> >   scripts rather than distutils, copying the files using raw tar, etc.)
> > * Change import semantics to allow specifying the encoding of the module on
> >   the filesystem (seems really icky).
>
> None of this is unique to import -- the same exact issue occurs with
> open(u'café'). I don't see any reason why import café should be though of as
> more of a problem, or treated any differently.
>
It's unique in several ways:

1) With open, you can specify a byte string::

       open(b'caf\xe9.py').read()

   I don't know of any way to do that with import. This is needed when the filename is not compatible with your current locale.

2) import assigns a name to the module that it imports whereas open lets the programmer assign the name. So even if you can specify what to use as a byte string for this filename on this particular filesystem you'd still end up with some ugly pseudo-representation of bytes when attempting to access it in code::

       import caf\xe9
       caf\xe9.do_something()

-Toshio
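The first point above, that open() accepts a byte-string name while import does not, can be tried on a POSIX system. The sketch assumes a filesystem that stores file names as arbitrary bytes (typical on Linux); on platforms that reject such names the demonstration is skipped rather than failing. The helper name `read_by_bytes_name` is made up.

```python
import os
import tempfile

RAW_NAME = b"caf\xe9.py"  # "café.py" encoded as Latin-1; not valid UTF-8


def read_by_bytes_name(directory, raw_name):
    """Create and read back a file addressed purely by its byte name."""
    path = os.path.join(os.fsencode(directory), raw_name)
    with open(path, "wb") as f:
        f.write(b'print("hi")\n')
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    try:
        content = read_by_bytes_name(tmp, RAW_NAME)
        listing = os.listdir(os.fsencode(tmp))
    except (OSError, UnicodeError):
        # Assumption: this platform or filesystem does not allow such names.
        content, listing = None, []

print(content)
print(listing)
```

There is no equivalent spelling for the import statement: it takes an identifier, and the interpreter decides how that identifier is turned into bytes on disk.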
Re: [Python-Dev] Import and unicode: part two
On Jan 19, 2011, at 6:44 PM, Toshio Kuratomi wrote:
> This problem of which encoding to use is a problem that can be
> seen on UNIX systems even now. Try this:
>
>   echo 'print("hi")' > café.py
>   convmv -f utf-8 -t latin1 café.py
>   python3 -c 'import café'
>
> ASCII seems very sensible to me when faced with these ambiguities.
>
> Other options I can brainstorm that could be explored:
>
> * Specify an encoding per platform and stick to that. (So, for instance,
>   all module names on posix platforms would have to be utf-8). Force
>   translation between encoding when installing packages (But that doesn't
>   help for people that are creating their modules using their own build
>   scripts rather than distutils, copying the files using raw tar, etc.)
> * Change import semantics to allow specifying the encoding of the module on
>   the filesystem (seems really icky).

None of this is unique to import -- the same exact issue occurs with open(u'café'). I don't see any reason why import café should be thought of as more of a problem, or treated any differently.

It's reasonable to recommend that people use ASCII in their module names if they want wide portability, but using non-ASCII names should still be supported.

James
Re: [Python-Dev] Import and unicode: part two
On Wednesday, 19 January 2011 at 15:44 -0800, Toshio Kuratomi wrote:
> Additionally, many unix filesystem don't specify a filesystem encoding for
> filenames; they deal in legal and illegal bytes which could lead to
> troubles. This problem of which encoding to use is a problem that can be
> seen on UNIX systems even now.

If the system is not correctly configured, it is not a bug in Python, but a bug in the system config. Python relies on the locale to choose the filesystem encoding (sys.getfilesystemencoding()). Python uses this encoding to decode and encode all filenames.

> * Specify an encoding per platform and stick to that.

It doesn't work: on UNIX/BSD, users choose their own encoding and all programs will use it.

Anyway, I don't see why it is a problem to have different encodings on different systems. Each system can use its own encoding. The bug that I'm trying to solve is a Python bug, not an OS bug.

> * Change import semantics to allow specifying the encoding of the module on
> the filesystem (seems really icky).

This is a very bad idea. I introduced the PYTHONFSENCODING environment variable in Python 3.2, but then quickly removed it, because it introduced a lot of inconsistencies.

Victor
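To see which encoding a given interpreter has picked, and what bytes a module file name would turn into, something like the following can be run. It is a small sketch using `sys.getfilesystemencoding()` and `os.fsencode()`; the helper name `fs_bytes_for_module` is made up, and the byte view is mainly meaningful on POSIX systems.

```python
import os
import sys


def fs_bytes_for_module(name):
    """Return name + '.py' encoded with the filesystem encoding, or None if it cannot be encoded."""
    try:
        return os.fsencode(name + ".py")
    except UnicodeEncodeError:
        return None


print("filesystem encoding:", sys.getfilesystemencoding())
print(fs_bytes_for_module("cafe"))
print(fs_bytes_for_module("café"))
```

Under a UTF-8 locale the second call returns the UTF-8 bytes of the name; under an encoding that cannot represent "é" it returns None, which is the situation where "import café" has no chance of finding a file.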
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 6:05 PM, Alexander Belopolsky wrote:
> On Wed, Jan 19, 2011 at 5:47 PM, Brett Cannon wrote:
> ..
>>> Indeed. Last time I looked, we still had cProfile in stdlib.
>>
>> Yes, but that is because no one got around to hiding cProfile behind
>> profile before we released Python 3.0. I would still like to see it
>> (slowly) go away from being directly visible.
>
> Another big offender is the idlelib package. Most of the modules there
> are in mixed case.

Given that the individual modules are not documented and that the only programs importing the individual modules are other idlelib modules (true?) then a rename should be possible. On the other hand, the same facts sort of make it unnecessary ;-).

-- Terry Jan Reedy
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 6:44 PM, Toshio Kuratomi wrote:
>> I believe we now have the situation that a package that works on *nix
>> could fail on Windows, whereas I believe that patch would *improve*
>> portability.
>
> I'm not so sure about this

Forget that claim if it is not true. The patch will certainly improve consistency within a box so that files that run can also be imported.

-- Terry Jan Reedy
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 04:40:24PM -0500, Terry Reedy wrote:
> On 1/19/2011 4:05 PM, Simon Cross wrote:
>
> >I have no problem with non-ASCII module identifiers being valid
> >syntax. It's a question of whether attempting to translate a non-ASCII
>
> If the names are the same, ie, produced with the same sequence of
> keystrokes in the save-as box and importing box, then there is no
> translation, at least from the user's view.
>
> >module name into a file name (so the file can be imported) is a good
> >idea and whether these sorts of files can be safely transferred among
> >diverse filesystems.
>
> I believe we now have the situation that a package that works on *nix
> could fail on Windows, whereas I believe that patch would *improve*
> portability.
>

I'm not so sure about this. You may have something that works on Windows and on *NIX under certain circumstances but it seems likely to fail when moving files between them (for instance, as packages downloaded from pypi).

Additionally, many unix filesystems don't specify a filesystem encoding for filenames; they deal in legal and illegal bytes which could lead to troubles. This problem of which encoding to use is a problem that can be seen on UNIX systems even now. Try this:

  echo 'print("hi")' > café.py
  convmv -f utf-8 -t latin1 café.py
  python3 -c 'import café'

ASCII seems very sensible to me when faced with these ambiguities.

Other options I can brainstorm that could be explored:

* Specify an encoding per platform and stick to that. (So, for instance, all module names on posix platforms would have to be utf-8). Force translation between encoding when installing packages (But that doesn't help for people that are creating their modules using their own build scripts rather than distutils, copying the files using raw tar, etc.)
* Change import semantics to allow specifying the encoding of the module on the filesystem (seems really icky).
-Toshio
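The ambiguity behind Toshio's café.py example can be sketched without touching the filesystem. This is an illustration only, assuming the file name is stored as raw bytes and decoded by whatever encoding the reader's locale selects; decodes_as() is a hypothetical helper, not an API from the thread.

```python
name = "café.py"

# The same name produces different byte sequences on disk.
utf8_bytes = name.encode("utf-8")        # b'caf\xc3\xa9.py'
latin1_bytes = name.encode("latin-1")    # b'caf\xe9.py'


def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is a valid byte sequence in encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# A Latin-1 file name is not valid UTF-8, so a UTF-8 locale cannot
# decode it to the name the import statement asks for.
latin1_readable_as_utf8 = decodes_as(latin1_bytes, "utf-8")

# The reverse direction never fails, but silently yields a different
# name (mojibake), so the module is still not found.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes, latin1_bytes, latin1_readable_as_utf8)
```

Either way the import fails: in one direction the bytes cannot be decoded at all, in the other they decode to a name that does not match the identifier in the source.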
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 5:47 PM, Brett Cannon wrote:
..
>> Indeed. Last time I looked, we still had cProfile in stdlib.
>
> Yes, but that is because no one got around to hiding cProfile behind
> profile before we released Python 3.0. I would still like to see it
> (slowly) go away from being directly visible.
>

Another big offender is the idlelib package. Most of the modules there are in mixed case.
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 14:23, Alexander Belopolsky wrote:
> On Wed, Jan 19, 2011 at 4:40 PM, Terry Reedy wrote:
> ..
>>> For similar reasons we tend to avoid capital letters in module names.
>>
>> That is a stdlib style guide followed by many, but intentionally not
>> enforced.
>
> Indeed. Last time I looked, we still had cProfile in stdlib.

Yes, but that is because no one got around to hiding cProfile behind profile before we released Python 3.0. I would still like to see it (slowly) go away from being directly visible.
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 4:40 PM, Terry Reedy wrote:
..
>> For similar reasons we tend to avoid capital letters in module names.
>
> That is a stdlib style guide followed by many, but intentionally not
> enforced.

Indeed. Last time I looked, we still had cProfile in stdlib.
Re: [Python-Dev] Import and unicode: part two
On Wednesday, January 19, 2011 at 12:19 -0800, Glenn Linderman wrote:
> Since Python allows non-ASCII variable names, I think it should allow
> non-ASCII module names also, on any platform that supports the
> appropriate characters in the filesystem.
>
> Since some platforms already accept them, dropping them would be
> incompatible.

ok

> If Victor already has a patch coded (i.e. the work is basically done, no
> waiting in line 3), I'm even more in favor of it. If it took lots of
> future hard work, and no one volunteered to do it, that would perhaps be
> justification for retaining module name restrictions. I guess that is
> not the case here, so...

I volunteer to do the work, and I already have a working patch (but it is not ready to be committed yet; it requires a long review).

FYI, I have rewritten the patch 4 times over the past year, for different reasons:

- the patch is huge and complex, and I was unable to "write it correctly" the first time
- I split the work into two big parts: support for non-ASCII paths (done in Python 3.2) and the other changes in part two
- Updating a huge patchset on the py3k tree is hard, even with git-svn (and git svn rebase)
- In my first tries, I didn't patch the import machinery to support non-ASCII module names, I only patched the support of non-ASCII paths

But I don't want to apply such a huge patch if Python core developers don't want to support non-ASCII module names and unencodable paths.

Victor
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 4:05 PM, Simon Cross wrote:
> I have no problem with non-ASCII module identifiers being valid
> syntax. It's a question of whether attempting to translate a non-ASCII

If the names are the same, ie, produced with the same sequence of keystrokes in the save-as box and importing box, then there is no translation, at least from the user's view.

> module name into a file name (so the file can be imported) is a good
> idea and whether these sorts of files can be safely transferred among
> diverse filesystems.

I believe we now have the situation that a package that works on *nix could fail on Windows, whereas I believe that patch would *improve* portability.

> For similar reasons we tend to avoid capital letters in module names.

That is a stdlib style guide followed by many, but intentionally not enforced.

-- Terry Jan Reedy
Re: [Python-Dev] Import and unicode: part two
On 19.01.2011 21:32, Terry Reedy wrote:
> On 1/19/2011 7:34 AM, Victor Stinner wrote:
>> Hi,
>>
>> I patched Python 3.2 to support modules with non-ASCII paths (*). It
>> works well on all operating systems. But the task is not completely
>> done:
>>
>> (a) Python 3 doesn't support non-ASCII module names
>> (b) Python 3 doesn't support unencodable characters in the module path
>>
>> I would like to know if we need to support that. Terry J. Reedy
>> wrote (issue #10828): "I think bugs in core syntax should have high
>> priority. I appreciate your work toward fixing it."
>
> I am a little shocked at the so-far tepid response to (a), so let me
> defend and explain my claim that it is a bug.
>
> In the simplest case (from 6.11. The import statement and 2.3.
> Identifiers and keywords)
>
>   import_stmt ::= "import" module
>   module ::= identifier
>   identifier ::=
>
> There is nothing, nothing, about any restriction on identifiers.

+1. The restriction on valid identifiers is very sensible (obviously, since "m" needs to be accessible after "import m"), but a further restriction seems just arbitrary.

Georg
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 10:32 PM, Terry Reedy wrote:
> I am a little shocked at the so-far tepid response to (a), so let me
> defend and explain my claim that it is a bug.
>
> In the simplest case (from 6.11. The import statement and 2.3. Identifiers
> and keywords)
>
>   import_stmt ::= "import" module
>   module ::= identifier
>   identifier ::=
>
> There is nothing, nothing, about any restriction on identifiers.

I have no problem with non-ASCII module identifiers being valid syntax. It's a question of whether attempting to translate a non-ASCII module name into a file name (so the file can be imported) is a good idea and whether these sorts of files can be safely transferred among diverse filesystems. For similar reasons we tend to avoid capital letters in module names.

Schiavo
Simon
Re: [Python-Dev] Import and unicode: part two
On Wednesday, January 19, 2011 at 13:38 -0500, Alexander Belopolsky wrote:
> PEP 3131 does not distinguish between different types of identifiers,
> so I think it assumes that non-ascii module names should be supported.

My opinion is that we should support non-ASCII module names and unencodable paths if doing so doesn't introduce overhead (making Python slower or adding a lot of code). My patch adds ~400 lines of code (I think that is small: the patch adds many functions), and I think it leaves Python as fast, or maybe faster.

Victor
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 7:34 AM, Victor Stinner wrote:
> Hi,
>
> I patched Python 3.2 to support modules with non-ASCII paths (*). It
> works well on all operating systems. But the task is not completely
> done:
>
> (a) Python 3 doesn't support non-ASCII module names
> (b) Python 3 doesn't support unencodable characters in the module path
>
> I would like to know if we need to support that. Terry J. Reedy
> wrote (issue #10828): "I think bugs in core syntax should have high
> priority. I appreciate your work toward fixing it."

I am a little shocked at the so-far tepid response to (a), so let me defend and explain my claim that it is a bug.

In the simplest case (from 6.11. The import statement and 2.3. Identifiers and keywords)

  import_stmt ::= "import" module
  module ::= identifier
  identifier ::=

There is nothing, nothing, about any restriction on identifiers. The rest of 6.11 discusses the complex import algorithm but leaves out the simple semantics that cover 99% of cases (import a ???.py file in a directory on sys.path), and never mentions ".py".

So let's go to Tutorial 6. Modules, which does explain the simple case: "A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended."

So, if xyz is a legal identifier and xyz.py exists on sys.path, it is reasonable from the docs to expect 'import xyz' to work. (Sys.path is mentioned in the reference.) But we now have the following possibility:

Let xyz.py be

  def double(x):
      return 2*x

  if __name__ == "__main__":
      if double(2) == 4:
          print("test passed")

We run the file, get "test passed", and write zyx.py:

  import xyz
  ...

We run zyx and Python says "No module named xyz". Bad, and quite puzzling to anyone who does not understand the subtle difference between running and importing a file.

-- Terry Jan Reedy
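Terry's argument rests on two documented rules: any legal identifier is a legal module name, and the file name is the module name plus ".py". A minimal sketch of those two rules, assuming only the simple single-file case he describes; module_filename() is a hypothetical helper that ignores packages, extension modules and everything else the import machinery handles.

```python
import keyword


def module_filename(name: str) -> str:
    """Return the file name the tutorial rule implies for a module name.

    Hypothetical helper: applies only "module name + .py" and rejects
    anything that could not appear after the import keyword.
    """
    if not name.isidentifier() or keyword.iskeyword(name):
        raise ValueError(f"{name!r} is not a valid module identifier")
    return name + ".py"


# Identifiers taken from this thread; PEP 3131 makes the non-ASCII
# ones just as legal as the ASCII one.
candidates = ["xyz", "саша", "gui_jämföra", "2fast", "import"]
valid = {c: (c.isidentifier() and not keyword.iskeyword(c)) for c in candidates}
```

By the language rules alone, "саша" and "gui_jämföra" are as valid as "xyz"; whether the resulting file name can actually be found then depends on the filesystem encoding, which is the part the documentation does not mention.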
Re: [Python-Dev] Import and unicode: part two
On 1/19/2011 11:31 AM, Victor Stinner wrote:
> If we decide to reject non-ASCII module names, it should be done on all
> operating systems, not only on Windows.

Since Python allows non-ASCII variable names, I think it should allow non-ASCII module names also, on any platform that supports the appropriate characters in the filesystem.

Since some platforms already accept them, dropping them would be incompatible.

If Victor already has a patch coded (i.e. the work is basically done, no waiting in line 3), I'm even more in favor of it. If it took lots of future hard work, and no one volunteered to do it, that would perhaps be justification for retaining module name restrictions. I guess that is not the case here, so...

+1 on supporting full Unicode module names on all platforms.
Re: [Python-Dev] Import and unicode: part two
On Wednesday, January 19, 2011 at 10:42 -0800, Brett Cannon wrote:
>> I am not sure what exactly is not supported. On my OSX system:
>
> Victor said this is a Windows-specific issue.

Quoting myself: "(a) (...) doesn't work with a locale encoding different than UTF-8"

Hum, it's not exactly the locale encoding, but the Python filesystem encoding. On Mac OS X, this encoding is *hardcoded* to UTF-8, so it is possible to use non-ASCII module names on this OS. It is also possible on other BSD/UNIX systems using a UTF-8 locale encoding.

On BSD/UNIX, the issue only affects systems using a locale encoding other than UTF-8. Eg. MvL's buildbot (x86 debian parallel 3.x) uses ISO-8859-15 (see #10492, issue fixed 13 days ago). Even if UTF-8 becomes a de facto standard locale encoding, many systems still use something else. And Python 2 users will complain that their script works with Python 2 but not with Python 3 :-)

If we decide to reject non-ASCII module names, it should be done on all operating systems, not only on Windows.

Victor
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 1:42 PM, Brett Cannon wrote:
..
>> I am not sure what exactly is not supported. On my OSX system:
>
> Victor said this is a Windows-specific issue.

I missed that part. In this case, I change my vote to +0 to reflect my lack of knowledge or exposure to Windows-only issues. However, if Victor's patch simplifies the code (as many of his changes in this area do), I will be happy to review it.
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 10:38, Alexander Belopolsky wrote:
> On Wed, Jan 19, 2011 at 1:23 PM, Brett Cannon wrote:
> ..
>>>> (a) Python 3 doesn't support non-ASCII module names
> ..
>> -0 from me (unless the Unicode variable naming PEP says otherwise).
>>
>
> I am not sure what exactly is not supported. On my OSX system:

Victor said this is a Windows-specific issue.

-Brett

>
> $ ./python.exe
> Python 3.2b2+
> ..
> >>> import саша
> >>> саша.foo
> 42
> >>> from саша import foo
> >>> foo
> 42
>
>
> PEP 3131 does not distinguish between different types of identifiers,
> so I think it assumes that non-ascii module names should be supported.
>
> +1 on fixing any remaining bugs
>
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 1:23 PM, Brett Cannon wrote:
..
>>> (a) Python 3 doesn't support non-ASCII module names
..
> -0 from me (unless the Unicode variable naming PEP says otherwise).
>

I am not sure what exactly is not supported. On my OSX system:

$ ./python.exe
Python 3.2b2+
..
>>> import саша
>>> саша.foo
42
>>> from саша import foo
>>> foo
42

PEP 3131 does not distinguish between different types of identifiers, so I think it assumes that non-ascii module names should be supported.

+1 on fixing any remaining bugs
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 07:01, Simon Cross wrote:
> On Wed, Jan 19, 2011 at 2:34 PM, Victor Stinner wrote:
>> (a) Python 3 doesn't support non-ASCII module names
>
> -0: I'm vaguely against this being supported because I'd rather not
> have to deal with what happens when the guess regarding the filesystem
> encoding is wrong. On the other hand, a general encouragement to stick
> to ASCII module names is probably functionally equivalent without
> imposing a hard restriction.

-0 from me (unless the Unicode variable naming PEP says otherwise).

>
>> (b) Python 3 doesn't support unencodable characters in the module path
>
> +1: It'd be nice if Python could import modules regardless of what
> folder names people happen to have on their module path.

+1 from me as well (nervously hoping importlib already supports it =)
Re: [Python-Dev] Import and unicode: part two
On Wed, Jan 19, 2011 at 2:34 PM, Victor Stinner wrote:
> (a) Python 3 doesn't support non-ASCII module names

-0: I'm vaguely against this being supported because I'd rather not have to deal with what happens when the guess regarding the filesystem encoding is wrong. On the other hand, a general encouragement to stick to ASCII module names is probably functionally equivalent without imposing a hard restriction.

> (b) Python 3 doesn't support unencodable characters in the module path

+1: It'd be nice if Python could import modules regardless of what folder names people happen to have on their module path.

Schiavo
Simon
[Python-Dev] Import and unicode: part two
Hi,

I patched Python 3.2 to support modules with non-ASCII paths (*). It works well on all operating systems. But the task is not completely done:

(a) Python 3 doesn't support non-ASCII module names
(b) Python 3 doesn't support unencodable characters in the module path

I would like to know if we need to support that. Terry J. Reedy wrote (issue #10828): "I think bugs in core syntax should have high priority. I appreciate your work toward fixing it."

I wrote a patch (issue #3080) fixing both points. If you agree that both issues should be fixed, I will fix them in Python 3.3.

(a) is issue #10828, reported recently (January 2011): "import gui_jämföra" doesn't work with a locale encoding different than UTF-8 (so it doesn't work on Windows).

(b) is specific to Windows: FAT32 and NTFS filesystems store filenames in Unicode, but Python encodes paths to the ANSI code page (which covers only a very small subset of Unicode). If a character cannot be encoded to the code page, you cannot load a module. Eg. add a Japanese character to a directory name on a Windows system using the cp1252 (English) code page. I don't think that (b) has been reported by a user yet; it's more of a theoretical problem.

My patch is huge, but it simplifies the code. We don't need to regularly convert from/to UTF-8. And for the functions using PyUnicodeObject objects (and not a Py_UNICODE* buffer): PyUnicodeObject stores the string length (it avoids calls to strlen()) and PyUnicode_FromFormat() doesn't need a buffer size (no risk of buffer overflow). I suppose that it makes Python faster, but I didn't try.

(*) Python 3.2 doesn't support non-ASCII in the module *name*, only in the path (sys.path).

Victor Stinner
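Point (b) can be illustrated with the codec alone, without a Windows machine. This is a sketch under the assumption that the import machinery encodes each path with the ANSI code page, as described above; encodable() and the example directory names are hypothetical and chosen only for illustration.

```python
def encodable(path: str, codec: str) -> bool:
    """Return True if every character of path exists in codec."""
    try:
        path.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


# A Swedish directory name fits in the Western European code page...
western_ok = encodable("C:\\modules\\jämföra", "cp1252")

# ...but a Japanese one does not, even though NTFS stores it without
# any problem as Unicode.
japanese_ok = encodable("C:\\modules\\日本語", "cp1252")

# Encoded as UTF-8 (or kept as str throughout), both paths are fine.
japanese_utf8_ok = encodable("C:\\modules\\日本語", "utf-8")

print(western_ok, japanese_ok, japanese_utf8_ok)
```

Keeping paths as Unicode strings end to end, which is the direction the patch takes, avoids the lossy encoding step altogether.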