On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:
There's one further case that I am worried about that has no real
"transfer".  Since people here seem to think that unicode module names are
the future (for instance, the comments about redefining the C locale to
include utf-8 and the comments about archiving tools needing to support
encoding bits), there are eventually going to be unicode modules that become
dependencies of other modules and programs.  These will need to be installed
on systems.  Linux distributions that ship these will need to choose
a filesystem encoding for the filenames of these.  Likely the sensible thing
for them to do is to use utf-8 since all the ones I can think of default to
utf-8.  But, as Stephen and Victor have pointed out, users change their
locale settings to things that aren't utf-8 and save their modules using
filenames in that encoding.  When they update their OS to a version that has
utf-8 python module names, they will find that they have to make a choice.
They can either change their locale settings to a utf-8 encoding and have
the system installed modules work or they can leave their encoding on their
non-utf-8 encoding and have the modules that they've created on-site work.

This is not a good position to put users of these systems in.

The way this case should work, is that programs that install files (installation is a form of transfer) should transform their names from the encoding used in the transfer medium to the encoding of the filesystem on which they are installed.

Python3 should access the files, transforming the names from the encoding of the filesystem on which they are installed to Unicode for use by the program.
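As a concrete sketch of that decoding (with the encoding pinned to UTF-8 for determinism; Python really uses `sys.getfilesystemencoding()`), PEP 383's surrogateescape error handler lets even undecodable bytes round-trip through str and back:

```python
# Sketch of PEP 383 round-tripping, assuming a UTF-8 filesystem encoding.
raw = b'caf\xc3\xa9.py'                        # UTF-8 bytes for "café.py"
name = raw.decode('utf-8', 'surrogateescape')  # bytes -> str for the program
assert name == 'caf\u00e9.py'
assert name.encode('utf-8', 'surrogateescape') == raw  # lossless round-trip

# A byte that is invalid UTF-8 (0xE9, e.g. Latin-1 "é") becomes a lone
# surrogate instead of raising, and still encodes back to the exact bytes:
odd = b'caf\xe9.py'.decode('utf-8', 'surrogateescape')
assert odd == 'caf\udce9.py'
assert odd.encode('utf-8', 'surrogateescape') == b'caf\xe9.py'
```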

I think Python3 is trying to do its part, and Victor is trying to make that more robust on more platforms, specifically Windows.

The programs that install files (which may include programs that install Python files; I don't know) may or may not be doing their part, but clearly there are cases where they do not.

Systems that have different encodings for names on the same or different file systems need to have a way to obtain the encoding for the file names, so they can be properly decoded. If they don't have such a way, they are broken.
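For what it's worth, the only "way" Python 3 has on POSIX is to ask the locale; there is nothing to query per filesystem. A quick look at what a given system reports (the values depend entirely on the environment):

```python
import locale
import sys

# What Python will use to decode file names on POSIX: the locale's
# encoding, not anything obtained from the filesystem itself.
print(sys.getfilesystemencoding())         # e.g. 'utf-8', or 'ascii'
print(locale.getpreferredencoding(False))  # the locale's CODESET
```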

=====
The rest of this is an attempt to describe the problem of Linux and other systems that use byte strings instead of character strings as file names. That is no problem as long as programs allow byte strings as file names. Python3 does not, for the import statement, so the problem is relevant to the ongoing discussion here.
=====

Since file names are defined to be byte strings, there is no way to obtain the encoding of a file name, so names cannot always be decoded, and are sometimes decoded improperly, because no one knows which encoding, _if any_, was used to create them.

Hence, Linux programs that use character strings as file names internally and expect them to match the byte strings in the file system are promoting a fiction: that there is a transformation (encoding) from character strings to byte strings that will match.

When using ASCII character strings, they can be transformed to bytes using a simple transformation: identity. But that isn't necessarily correct if the files were created using EBCDIC (unlikely on Linux systems, but not impossible, since Linux file names are just byte strings).

When using non-ASCII character strings, the fiction promoted is even bigger, and the transformation even harder. Any 8-bit character encoding can pretend that identity is the correct transformation, but the result is mojibake if it isn't. UTF-8 and other multi-byte encodings have an even harder job, because there can be 8-bit sequences that are not legal for some transformations but are legal for others. This is when the fiction is exposed!
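The difference is easy to demonstrate (the byte strings here are invented examples): any 8-bit encoding will "successfully" decode any bytes whatsoever, while UTF-8 rejects sequences that are not legal for it:

```python
raw = b'caf\xe9.txt'        # "café.txt" as saved under Latin-1

# 8-bit encodings never fail; identity-style decoding always "works":
assert raw.decode('latin-1') == 'caf\u00e9.txt'   # happens to be right
assert b'caf\xc3\xa9.txt'.decode('latin-1') == 'caf\u00c3\u00a9.txt'  # mojibake: "cafÃ©.txt"

# A multi-byte encoding can expose the fiction by rejecting the bytes:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('not valid UTF-8:', e.reason)
```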

As the recent description of glib points out, when file names are read as bytes and shown to the user for selection, possibly via some mojibake-generating transformation to characters, the user has a fighting chance to pick the right file. There is less chance if the transformation is lossy ('?' substitutions, etc.) and/or the names differ only in the characters that were lost.
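That lossy-vs-lossless distinction can be made concrete (UTF-8 is assumed as the decoder here): a 'replace' transformation substitutes U+FFFD and destroys the original bytes, while surrogateescape preserves them; and two distinct byte names can collapse to the same lossy display, leaving the user no way to tell them apart:

```python
raw = b'r\xe9sum\xe9.txt'   # Latin-1 bytes for "résumé.txt", decoded as UTF-8

lossy = raw.decode('utf-8', 'replace')            # 'r\ufffdsum\ufffd.txt'
lossless = raw.decode('utf-8', 'surrogateescape')

# The lossy form cannot recover the original bytes...
assert lossy.encode('utf-8') != raw
# ...while the surrogateescape form round-trips exactly:
assert lossless.encode('utf-8', 'surrogateescape') == raw

# A different name ("rèsumè.txt" in Latin-1) yields the same lossy display:
other = b'r\xe8sum\xe8.txt'
assert other.decode('utf-8', 'replace') == lossy
```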

However, when the name is specified in characters (such as for Python import, or file names given as character constants in any application system that provides/permits such), and there are large numbers of transformations that could be used to convert characters to bytes, the problem is harder and more error-prone. Programs that want to promote the fiction of using characters for file names must work harder; it seems that Python on Linux is such a program.

One technique is to have conventions, agreed on by applications and users, that limit the number of encodings used on a particular system to one (optimal) or a few. The latter requires understanding that files created in one encoding may not be accessible to systems that use a different one... until they are renamed. Subsets of applications and users can then happily share files with others using their encoding, and with the subset of files that happen to decode successfully in their encoding even though it is not the correct one (often ASCII, or a few mojibake characters learned for cross-subset usage). When multiple encodings are used without such conventions, chaos results.

Another technique that would be amusing is to use Base64 (as Oleg suggested), URL-encoding, or some other mapping that transforms non-ASCII names into ASCII character sequences, with the identity mapping used to obtain bytes. Python could then ship such files to any system, as long as it always included that mapping among the encodings it tries when finding files. This would probably be the most powerful solution, but it would only need to be applied to those systems that do not use characters for file names. It could, in fact, be applied on any system that uses a subset of characters for file names, and hence transcends the need for Unicode support in a file system to use Unicode names in Python3 import statements. It would likely be problematic for use with 3rd-party libraries, however.
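A toy version of such a mapping (the `u8b64_` prefix is an invented convention, not anything Python or Oleg specified): non-ASCII names are UTF-8 encoded and Base64-armored with the URL-safe alphabet, so the resulting file name is pure ASCII and the identity mapping yields its bytes on any system:

```python
import base64

PREFIX = 'u8b64_'   # hypothetical marker for armored names

def armor(name):
    """ASCII names pass through; others are Base64-armored."""
    try:
        name.encode('ascii')
        return name
    except UnicodeEncodeError:
        b64 = base64.urlsafe_b64encode(name.encode('utf-8'))
        return PREFIX + b64.decode('ascii').rstrip('=')

def unarmor(fname):
    """Invert armor(); plain names are returned unchanged."""
    if not fname.startswith(PREFIX):
        return fname
    b64 = fname[len(PREFIX):]
    b64 += '=' * (-len(b64) % 4)     # restore stripped padding
    return base64.urlsafe_b64decode(b64).decode('utf-8')

assert unarmor(armor('m\u00f3dulo')) == 'm\u00f3dulo'   # "módulo" round-trips
assert armor('plain') == 'plain'
```

An importer supporting this would simply try the armored spelling as one more candidate file name when resolving a module.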

Another technique would be to try each possible encoding in turn, in some defined order, searching the filesystem for the resulting byte string as a file name, possibly matching files that shouldn't have been matched. To limit that search, such programs could allow configuration of a smaller ordered list of encodings to try, plus a specific one to use for the creation of new files; this opens up the possibility of never trying the "right" encoding for some rogue file name.
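A sketch of that search, over a directory listing obtained as bytes (the encoding list and file names are illustrative, not a proposal for specific defaults):

```python
def find_name(byte_names, wanted, encodings=('utf-8', 'latin-1')):
    """Return (bytes_name, encoding) for the first configured encoding
    under which `wanted` matches an entry, else None."""
    for enc in encodings:
        try:
            candidate = wanted.encode(enc)
        except UnicodeEncodeError:
            continue              # `wanted` is not expressible in enc
        if candidate in byte_names:
            return candidate, enc
    return None

# Both spellings of "café.py" can exist side by side; which one is
# found depends entirely on the configured search order:
names = {b'caf\xc3\xa9.py', b'caf\xe9.py'}
assert find_name(names, 'caf\u00e9.py') == (b'caf\xc3\xa9.py', 'utf-8')
assert find_name(names, 'caf\u00e9.py', ('latin-1',)) == (b'caf\xe9.py', 'latin-1')
```

In real use `byte_names` would come from `os.listdir(b'.')`, which returns byte strings when given a bytes path.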

This would be an issue, and an implementation burden, for Linux systems, but would not be needed on systems such as MacOS or Windows, each of which defines a particular encoding. When mounting filesystems that use byte-string file names on systems with a defined encoding, it should be the responsibility of the mounting system to do such transformations. It might also need such configuration, provide mapping or renaming facilities, and possibly prohibit access to files whose names cannot be transformed. (Of course, one can always punt by configuring latin-1 or another encoding that can match any byte string, but that produces mojibake, and then there is no surety that particular files will appear to have the names that programs expect.)

Of course, Victor's patch addresses Windows issues, and Windows has defined encodings; it is just a matter of using the proper APIs to see them, and the patch should be accepted.

It sounds like the current situation on Linux is that Python can access the subset of files whose names match the locale encoding under which it is run. It sounds like it would be inappropriate for Python to begin shipping files with non-ASCII names as part of its Linux distribution, unless facilities are created, or tools used, to remap non-ASCII names to the local locale encoding. Locales that are not ASCII supersets (in character repertoire, not encoding) could not be supported, nor could locales that do not support all the characters used in files shipped with Python. Since locales vary wildly in their available non-ASCII characters, that limits Python either to shipping ASCII names only, or to restricting the supported locales to those that support the characters used.

I suppose that Victor's patch would point out most or all of the places where such transformations would have to be implemented, if it is important to support systems having byte-string file names whose users cannot agree on a single encoding for transforming to/from characters.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev