On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:
There's one further case that I am worried about that has no real
"transfer".  Since people here seem to think that unicode module names are
the future (for instance, the comments about redefining the C locale to
include utf-8 and the comments about archiving tools needing to support
encoding bits), there are eventually going to be unicode modules that become
dependencies of other modules and programs.  These will need to be installed
on systems.  Linux distributions that ship these will need to choose
a filesystem encoding for the filenames of these.  Likely the sensible thing
for them to do is to use utf-8 since all the ones I can think of default to
utf-8.  But, as Stephen and Victor have pointed out, users change their
locale settings to things that aren't utf-8 and save their modules using
filenames in that encoding.  When they update their OS to a version that has
utf-8 python module names, they will find that they have to make a choice.
They can either change their locale settings to a utf-8 encoding and have
the system installed modules work or they can leave their encoding on their
non-utf-8 encoding and have the modules that they've created on-site work.

This is not a good position to put users of these systems in.

The way this case should work, is that programs that install files (installation is a form of transfer) should transform their names from the encoding used in the transfer medium to the encoding of the filesystem on which they are installed.

Python3 should access the files, transforming the names from the encoding of the filesystem on which they are installed to Unicode for use by the program.
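As a concrete sketch of that decoding (with the encoding pinned to UTF-8 for determinism; Python really uses `sys.getfilesystemencoding()`), PEP 383's surrogateescape error handler lets even undecodable bytes round-trip through str and back:

```python
# Sketch of PEP 383 round-tripping, assuming a UTF-8 filesystem encoding.
raw = b'caf\xc3\xa9.py'                        # UTF-8 bytes for "café.py"
name = raw.decode('utf-8', 'surrogateescape')  # bytes -> str for the program
assert name == 'caf\u00e9.py'
assert name.encode('utf-8', 'surrogateescape') == raw  # lossless round-trip

# A byte that is invalid UTF-8 (0xE9, e.g. Latin-1 "é") becomes a lone
# surrogate instead of raising, and still encodes back to the exact bytes:
odd = b'caf\xe9.py'.decode('utf-8', 'surrogateescape')
assert odd == 'caf\udce9.py'
assert odd.encode('utf-8', 'surrogateescape') == b'caf\xe9.py'
```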

I think Python3 is trying to do its part, and Victor is trying to make that more robust on more platforms, specifically Windows.

The programs that install files (which may include programs that install Python files; I don't know) may or may not be doing their part, but clearly there are cases where they do not.

Systems that have different encodings for names on the same or different file systems need to have a way to obtain the encoding for the file names, so they can be properly decoded. If they don't have such a way, they are broken.
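For what it's worth, the only "way" Python 3 has on POSIX is to ask the locale; there is nothing to query per filesystem. A quick look at what a given system reports (the values depend entirely on the environment):

```python
import locale
import sys

# What Python will use to decode file names on POSIX: the locale's
# encoding, not anything obtained from the filesystem itself.
print(sys.getfilesystemencoding())         # e.g. 'utf-8', or 'ascii'
print(locale.getpreferredencoding(False))  # the locale's CODESET
```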

=====
The rest of this is an attempt to describe the problem of Linux and other systems that use byte strings instead of character strings as file names. That is no problem as long as programs allow byte strings as file names. Python3 does not, for the import statement, so the problem is relevant to the ongoing discussion here.
=====

Since file names are defined to be byte strings, there is no way to obtain the encoding of a file name, so names cannot always be decoded, and are sometimes decoded improperly, because no one knows which encoding, _if any_, was used to create them.

Hence, Linux programs that use character strings as file names internally and expect them to match the byte strings in the file system are promoting a fiction: that there is a transformation (encoding) from character strings to byte strings that will match.

When using ASCII character strings, they can be transformed to bytes using a simple transformation: identity. But that isn't necessarily correct if the files were created using EBCDIC (unlikely on Linux systems, but not impossible, since Linux file names are just byte strings).

When using non-ASCII character strings, the fiction promoted is even bigger, and the transformation even harder. Any 8-bit character encoding can pretend that identity is the correct transformation, but the result is mojibake if it isn't. UTF-8 and other multi-byte encodings have an even harder job, because there can be 8-bit sequences that are not legal for some transformations but are legal for others. This is when the fiction is exposed!
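The difference is easy to demonstrate (the byte strings here are invented examples): any 8-bit encoding will "successfully" decode any bytes whatsoever, while UTF-8 rejects sequences that are not legal for it:

```python
raw = b'caf\xe9.txt'        # "café.txt" as saved under Latin-1

# 8-bit encodings never fail; identity-style decoding always "works":
assert raw.decode('latin-1') == 'caf\u00e9.txt'   # happens to be right
assert b'caf\xc3\xa9.txt'.decode('latin-1') == 'caf\u00c3\u00a9.txt'  # mojibake: "cafÃ©.txt"

# A multi-byte encoding can expose the fiction by rejecting the bytes:
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    print('not valid UTF-8:', e.reason)
```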

As the recent description of glib points out, when file names are read as bytes and shown to the user for selection, possibly via some mojibake-generating transformation to characters, the user has a fighting chance to pick the right file. There is less chance if the transformation is lossy ('?' substitutions, etc.) and/or the names differ only in the characters that were lost.
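That lossy-vs-lossless distinction can be made concrete (UTF-8 is assumed as the decoder here): a 'replace' transformation substitutes U+FFFD and destroys the original bytes, while surrogateescape preserves them; and two distinct byte names can collapse to the same lossy display, leaving the user no way to tell them apart:

```python
raw = b'r\xe9sum\xe9.txt'   # Latin-1 bytes for "résumé.txt", decoded as UTF-8

lossy = raw.decode('utf-8', 'replace')            # 'r\ufffdsum\ufffd.txt'
lossless = raw.decode('utf-8', 'surrogateescape')

# The lossy form cannot recover the original bytes...
assert lossy.encode('utf-8') != raw
# ...while the surrogateescape form round-trips exactly:
assert lossless.encode('utf-8', 'surrogateescape') == raw

# A different name ("rèsumè.txt" in Latin-1) yields the same lossy display:
other = b'r\xe8sum\xe8.txt'
assert other.decode('utf-8', 'replace') == lossy
```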

However, when the name is specified in characters (such as for Python import, or file names given as character constants in any application system that provides/permits such), and there are large numbers of transformations that could be used to convert characters to bytes, the problem is harder and more error-prone. Programs that want to promote the fiction of using characters for file names must work harder; it seems that Python on Linux is such a program.

One technique is to have conventions, agreed on by applications and users, that limit the number of encodings used on a particular system to one (optimal) or a few. The latter requires understanding that files created in one encoding may not be accessible to systems that use a different one... until they are renamed. Subsets of applications and users can then happily share files with others using their encoding, and with the subset of files that happen to decode successfully in their encoding even though it is not the correct one (often ASCII, or a few mojibake characters learned for cross-subset usage). When multiple encodings are used without such conventions, chaos results.

Another technique that would be amusing is to use Base64 (as Oleg suggested), URL-encoding, or some other mapping that transforms non-ASCII names into ASCII character sequences, with the identity mapping used to obtain bytes. Python could then ship such files to any system, as long as it always included that mapping among the encodings it tries when finding files. This would probably be the most powerful solution, but it would only need to be applied to those systems that do not use characters for file names. It could, in fact, be applied on any system that uses a subset of characters for file names, and hence transcends the need for Unicode support in a file system to use Unicode names in Python3 import statements. It would likely be problematic for use with 3rd-party libraries, however.
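A toy version of such a mapping (the `u8b64_` prefix is an invented convention, not anything Python or Oleg specified): non-ASCII names are UTF-8 encoded and Base64-armored with the URL-safe alphabet, so the resulting file name is pure ASCII and the identity mapping yields its bytes on any system:

```python
import base64

PREFIX = 'u8b64_'   # hypothetical marker for armored names

def armor(name):
    """ASCII names pass through; others are Base64-armored."""
    try:
        name.encode('ascii')
        return name
    except UnicodeEncodeError:
        b64 = base64.urlsafe_b64encode(name.encode('utf-8'))
        return PREFIX + b64.decode('ascii').rstrip('=')

def unarmor(fname):
    """Invert armor(); plain names are returned unchanged."""
    if not fname.startswith(PREFIX):
        return fname
    b64 = fname[len(PREFIX):]
    b64 += '=' * (-len(b64) % 4)     # restore stripped padding
    return base64.urlsafe_b64decode(b64).decode('utf-8')

assert unarmor(armor('m\u00f3dulo')) == 'm\u00f3dulo'   # "módulo" round-trips
assert armor('plain') == 'plain'
```

An importer supporting this would simply try the armored spelling as one more candidate file name when resolving a module.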

Another technique would be to try each possible encoding in turn, in some defined order, searching the filesystem for the resulting byte string as a file name, possibly matching files that shouldn't have been matched. To limit that search, such programs could allow configuration of a smaller ordered list of encodings to try, plus a specific one to use for the creation of new files; this opens up the possibility of never trying the "right" encoding for some rogue file name.
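A sketch of that search, over a directory listing obtained as bytes (the encoding list and file names are illustrative, not a proposal for specific defaults):

```python
def find_name(byte_names, wanted, encodings=('utf-8', 'latin-1')):
    """Return (bytes_name, encoding) for the first configured encoding
    under which `wanted` matches an entry, else None."""
    for enc in encodings:
        try:
            candidate = wanted.encode(enc)
        except UnicodeEncodeError:
            continue              # `wanted` is not expressible in enc
        if candidate in byte_names:
            return candidate, enc
    return None

# Both spellings of "café.py" can exist side by side; which one is
# found depends entirely on the configured search order:
names = {b'caf\xc3\xa9.py', b'caf\xe9.py'}
assert find_name(names, 'caf\u00e9.py') == (b'caf\xc3\xa9.py', 'utf-8')
assert find_name(names, 'caf\u00e9.py', ('latin-1',)) == (b'caf\xe9.py', 'latin-1')
```

In real use `byte_names` would come from `os.listdir(b'.')`, which returns byte strings when given a bytes path.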

This would be an issue, and an implementation burden, for Linux systems, but would not be needed on systems such as MacOS or Windows, each of which defines a particular encoding. When mounting filesystems that use byte-string file names on systems with a defined encoding, it should be the responsibility of the mounting system to do such transformations. It might also need such configuration, provide mapping or renaming facilities, and possibly prohibit access to files whose names cannot be transformed. (Of course, one can always punt by configuring latin-1 or another encoding that can match any byte string, but that produces mojibake, and then there is no surety that particular files will appear to have the names that programs expect.)

Of course, Victor's patch addresses Windows issues, and Windows has defined encodings; it is just a matter of using the proper APIs to see them, and the patch should be accepted.

It sounds like the current situation on Linux is that Python can access the subset of files whose names match the locale encoding under which it is run. It sounds like it would be inappropriate for Python to begin shipping files with non-ASCII names as part of its Linux distribution, unless facilities are created, or tools used, to remap non-ASCII names to the local locale encoding. Locales that are not ASCII supersets (in character repertoire, not encoding) could not be supported, nor could locales that do not support all the characters used in files shipped with Python. Since locales vary wildly in their available non-ASCII characters, that limits Python either to shipping ASCII names only, or to restricting the supported locales to those that support the characters used.

I suppose that Victor's patch would point out most or all of the places where such transformations would have to be implemented, if it is important to support systems having byte-string file names whose users cannot agree on a single encoding for transforming to/from characters.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev