[issue10952] Don't normalize module names to NFKC?
Changes by Atsuo Ishimoto ishim...@gembook.org: -- nosy: +ishimoto ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue10952 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10952] Don't normalize module names to NFKC?
Atsuo Ishimoto added the comment:

Converting identifiers to NFKC makes it problematic to work with FULLWIDTH letters such as 'ａ' (FULLWIDTH LATIN SMALL LETTER A). We can create a module named 'ａａａ.py', but this module cannot be imported on any platform I know of.

>>> import ａａａ
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'aaa'

Speaking of the Japanese environment, I don't see any benefit in normalizing variable names. FULLWIDTH/HALFWIDTH compatibility characters are commonly used here, and they are recognized as different characters.

It may be too late to argue, but converting to normal form NFC instead of NFKC would be better. Python distinguishes lowercase and uppercase letters, but doesn't distinguish fullwidth and halfwidth ones. This is pretty surprising behavior to me.
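To make the fullwidth case concrete, here is a minimal sketch using only the standard unicodedata module. It compares NFKC with a canonical form (NFC) on three FULLWIDTH LATIN SMALL LETTER A characters; the variable names are illustrative only.

```python
import unicodedata

# Three FULLWIDTH LATIN SMALL LETTER A characters (U+FF41).
fullwidth = "\uff41\uff41\uff41"

# NFKC folds the compatibility characters down to plain ASCII letters ...
nfkc_name = unicodedata.normalize("NFKC", fullwidth)

# ... while the canonical form NFC leaves them as they are.
nfc_name = unicodedata.normalize("NFC", fullwidth)

print(ascii(nfkc_name))
print(ascii(nfc_name))
```

Under NFKC the module name the importer looks for no longer matches a file saved under the fullwidth name, which is the failure shown in the traceback.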
[issue10952] Don't normalize module names to NFKC?
Changes by Ezio Melotti ezio.melo...@gmail.com: -- nosy: +ezio.melotti versions: -Python 3.1
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

It looks like there is nothing interesting to do here, so I am closing the issue (which is not a bug :-)).

--
resolution: -> invalid
status: open -> closed
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

> b) what if the file system implementation mangles file names.
> I'd use the same approach as with case-insensitive lookups: verify
> that the file we read is really the one we want.

Only Mac OS X and the HFS+ filesystem normalize filenames (to a variant of NFD). But such normalization is a good thing! I mean that I don't think that we have anything to do for that.

---
The user creates a café.py file, with the name typed on the keyboard in NFD: cafe\u0301 (this is very unlikely, since all operating systems prefer NFC for keyboard input, but it's just to give an example). Mac OS X normalizes the filename to NFD: cafe\u0301.py is created in the filesystem.

Then (s)he tries to import the café module: (s)he writes import café with his/her NFD keyboard. Python normalizes café to NFKC (caf\xe9) and then tries to read caf\xe9.py. Mac OS X normalizes the filename to NFD: cafe\u0301.py, and this file exists, so it works as expected.
---

I suppose that any filesystem normalization is good, because it avoids surprising behaviours (e.g. having two files cafe\u0301 and caf\xe9 with names rendered exactly the same on screen). We should maybe patch Windows, Mac OS, Linux & co to normalize to NFKC :-)

> a) how can users make sure that they name the files correctly?
> For a), wrt. "I'm not able to write U+03BC with my keyboard", I say
> "tough luck - don't use that character in a module name, then".
> Somebody with a Greek keyboard will have no problems doing that.

Even if I try to agree with "don't use that character in a module name": it can be surprising for an English speaker who would like to use the µTorrent (U+00B5) module name in his/her project. She/he can create µTorrent.py with a non-Greek keyboard (\xb5Torrent.py), but then import µTorrent (import \xb5Torrent) fails: ImportError: No module named µTorrent.

The error message is ImportError: No module named \u03BCTorrent: the identifier is normalized, but remember that µ (U+00B5) and μ (U+03BC) are rendered exactly the same by most fonts.

We should at least document this surprising behaviour in the import documentation. Something like:

WARNING: Non-ASCII characters in module names are normalized to NFKC by the Python parser ([PEP 3131]). For example, import µTorrent (µ: U+00B5) is normalized to import μTorrent (μ: U+03BC): Python will try to open \u03BCTorrent.py (or \u03BCTorrent/__init__.py), and not \xB5Torrent.py (or \xB5Torrent/__init__.py).

> This is really the same as any other non-ASCII character which you
> are unable to type: it just means that you can't conveniently enter
> the respective Python identifier. Just try importing саша, for
> example. Get a different keyboard.

I disagree. For identifiers in the source code, it works (transparently) as expected. A Greek developer starts a project using the µTorrent (\u03BCTorrent) identifier in its source code (a variable name, not a module name). An English developer writes a patch using µTorrent written as \xB5Torrent: both forms are accepted by Python, and it works.
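The café walk-through can be followed step by step with unicodedata. This is only a sketch: plain NFD is used as a stand-in for the HFS+ behaviour, which the message describes as a variant of NFD, and the variable names are made up for the example.

```python
import unicodedata

# Name as typed on a (hypothetical) NFD keyboard: 'e' + COMBINING ACUTE ACCENT.
typed = "cafe\u0301"

# What the parser keeps for the identifier: the NFKC form.
as_parsed = unicodedata.normalize("NFKC", typed)

# Approximation of what an NFD-normalizing filesystem stores or looks up.
on_disk = unicodedata.normalize("NFD", as_parsed)

print(ascii(typed), ascii(as_parsed), ascii(on_disk))
```

Because the filesystem applies the same normalization when the file is created and when it is opened, the two spellings meet again on disk.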
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

It is the same problem as not being able to write U+03BC with a keyboard: in this setup, don't use U+00B5 or U+03BC. More generally: don't use non-ASCII characters if your setup is not fully Unicode compliant, or fix your setup :-)
[issue10952] Don't normalize module names to NFKC?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Jan 20, 2011 at 8:06 AM, STINNER Victor rep...@bugs.python.org wrote:
..
>> There is also issue c) what if the filesystem encoding can only
>> represent a compatibility character, say U+00B5, but not its NFKC
>> equivalent, U+03BC?
>
> It is the same problem than not being able to write U+03BC with a keyboard:

No. This is a different problem, and I agree with Martin that keyboard limitations are not an issue. With proper tools one can create a '\u03BCTorrent.py' file even if the keyboard does not have a '\u03BC' key, as long as the filesystem is capable of storing such a file. Python itself is one such tool:

>>> with open('\u03BCTorrent.py'.encode(fsencoding), 'w') as f: ...

However, if fsencoding = 'latin-1', the code above will fail.

One possible solution to this problem is to define a 'compat' error handler that would detect unencodable strings with encodable compatibility equivalents and produce the encoding of an NFKC equivalent string instead of raising an error.

ISTM that in the Latin-1 encoding, there are only five affected characters:

>>> from unicodedata import decomposition, name
>>> for i in range(128, 256):
...     dec = decomposition(chr(i))
...     if dec and dec.startswith('<compat>'):
...         print("U+00%02X '%s' (%s): %s" % (i, chr(i), name(chr(i)), dec))
...
U+00A8 '¨' (DIAERESIS): <compat> 0020 0308
U+00AF '¯' (MACRON): <compat> 0020 0304
U+00B4 '´' (ACUTE ACCENT): <compat> 0020 0301
U+00B5 'µ' (MICRO SIGN): <compat> 03BC
U+00B8 '¸' (CEDILLA): <compat> 0020 0327

I suspect that the number of affected characters in the other encodings is similarly small. If we further limit special handling to characters that are valid in identifiers, U+00B5 will end up being the only such character in Latin-1.

An import mechanism using encode(fsencoding, 'compat') will, when given either import \u00B5Torrent or import \u03BCTorrent in a source file, open \u03BCTorrent.py when fsencoding='utf-8' and \u00B5Torrent.py if fsencoding='latin-1'.

A packaging mechanism that prepares code developed on a Latin-1 filesystem for distribution would have to NFKC-normalize filenames before encoding them using UTF-8.
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

> A packaging mechanism that prepares code developed on a Latin-1
> filesystem for distribution, would have to NFKC-normalize filenames
> before encoding them using UTF-8.

It causes portability issues: if you copy a non-ASCII module to a new host, the program will work or not depending on the filesystem encoding. Having to transform the filename when you copy a file, just to fix a corner case, is a pain.

> One possible solution to this problem is to define a 'compat' error
> handler that would detect unencodable strings with encodable
> compatibility equivalents and produce encoding of an NFKC equivalent
> string instead of raising an error.

Only a few people use non-ASCII module names, and most operating systems are able to store all Unicode characters, so I don't think that we need to support U+00B5 in a module name with a Latin-1 filesystem at all. If you use an old system with a Latin-1 filesystem, you have to limit your expectations of Python's Unicode support :-)

os.fsencode() and os.fsdecode() already use a custom error handler: surrogateescape. compat will conflict with surrogateescape. Loading a module concatenates two parts: a path from sys.path (decoded using the filesystem encoding and the surrogateescape error handler) and a module name. If a custom handler such as compat is used to encode the filename, the module name will be encoded correctly, but not the path.
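The surrogateescape round trip that the path part relies on can be sketched with explicit codec calls. The byte string below is a made-up example, and UTF-8 is assumed as the filesystem encoding purely for illustration.

```python
# A path whose last byte is Latin-1 'é' (0xE9): not valid UTF-8.
raw = b"/home/user/caf\xe9"

# Decoding with surrogateescape smuggles the undecodable byte through
# as a lone surrogate (U+DCE9) instead of failing.
path = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes ...
round_trip = path.encode("utf-8", "surrogateescape")

# ... whereas any other handler has to deal with the lone surrogate itself.
try:
    path.encode("utf-8", "strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(path), round_trip, strict_ok)
```

A handler that only knows about compatibility characters would see the same lone surrogate in the sys.path part and could not encode it, which is the conflict described above.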
[issue10952] Don't normalize module names to NFKC?
Martin v. Löwis mar...@v.loewis.de added the comment:

> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

That should be considered as similar to file systems that just cannot represent certain characters at all - e.g. many of the non-ASCII characters, or no upper-case letters. If you have such a file system, you just cannot use these characters in a module name. Rename your modules, then, or put the modules in a zipfile (or use some other import hook).

> However, this code will always fail because '\xB5Torrent' will be
> normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py'
> cannot be created on a filesystem with Latin-1 encoding.

Tough luck. The filesystem just doesn't support GREEK SMALL LETTER MU, just as it doesn't support all the other Greek characters.

It may be fun coming up with these border cases. But I really don't see a need to support them. If you really need to have that letter in a module name, reformat your disk with a better file system.
[issue10952] Don't normalize module names to NFKC?
Martin v. Löwis mar...@v.loewis.de added the comment:

> Only Mac OS X and the HFS+ filesystem normalize filenames (to a
> variant of NFD). But such normalization is a good thing! I mean that
> I don't think that we have anything to do for that.

That may well be - I don't have a case where this would cause problems, either.

> We should at least document this surprising behaviour in the import
> documentation.

There are also better ways to support the user than mere documentation. For example, the exception message could be more helpful, and IDLE could warn the user when saving the file in the first place.

> WARNING: Non-ASCII characters in module names are normalized to NFKC
> by the Python parser ([PEP 3131]). For example, import µTorrent (µ:
> U+00B5) is normalized to import μTorrent (μ: U+03BC): Python will try
> to open \u03BCTorrent.py (or \u03BCTorrent/__init__.py), and not
> \xB5Torrent.py (or \xB5Torrent/__init__.py).

I can't believe this is a real problem. I'd defer warning about made-up problems until real users report them as a real problem.

> I disagree.

If you disagree strongly, please write a PEP.
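As a rough illustration of what a "more helpful" exception message could draw on, the sketch below looks for files whose name matches the requested module only after NFKC normalization. The helper nfkc_lookalikes is hypothetical and is not part of the import machinery; it works on a plain list of file names so that it does not depend on any particular filesystem.

```python
import unicodedata


def nfkc_lookalikes(module_name, filenames):
    """Return .py files whose stem equals module_name only after NFKC."""
    wanted = unicodedata.normalize("NFKC", module_name)
    hits = []
    for filename in filenames:
        stem, dot, ext = filename.rpartition(".")
        if not dot or ext != "py":
            continue  # not a plain .py source file
        if stem != wanted and unicodedata.normalize("NFKC", stem) == wanted:
            hits.append(filename)
    return hits


candidates = nfkc_lookalikes(
    "\u03bcTorrent",
    ["\xb5Torrent.py", "\u03bcTorrent.py", "other.py", "README"],
)
print([ascii(name) for name in candidates])
```

An error message or an editor warning could then mention the file that exists under the un-normalized spelling.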
[issue10952] Don't normalize module names to NFKC?
New submission from STINNER Victor victor.stin...@haypocalc.com:

The Python 3 parser normalizes all identifiers using NFKC (as described in PEP 3131). Examples:

- U+00B5 (µ: Micro sign) is normalized to U+03BC (μ: Greek small letter mu)
- U+FB03 (ﬃ: Latin small ligature ffi) is normalized to 'ffi'

The problem is that it also normalizes module names, but not the filename. The module name in the Python source code is typed with the keyboard (e.g. U+00B5 in my case) and then normalized to NFKC (= U+03BC). The filename is also typed with the keyboard (U+00B5), but it is never normalized.

The attached script tests the current behaviour using the µTorrent name with U+00B5 and U+03BC: import with U+00B5 or U+03BC uses the filename with U+03BC.

The problem is that I'm able to write 'µ' (U+00B5) with my keyboard, but not U+03BC (μ).

--
components: Interpreter Core, Unicode
files: module_name.py
messages: 126577
nosy: haypo
priority: normal
severity: normal
status: open
title: Don't normalize module names to NFKC?
versions: Python 3.1, Python 3.2, Python 3.3
Added file: http://bugs.python.org/file20459/module_name.py
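The normalization described in the report can be observed without touching the filesystem by parsing an import statement and reading the module name back from the AST. This is a small sketch using the standard ast and unicodedata modules; it is not the attached module_name.py script.

```python
import ast
import unicodedata

# Source text written with MICRO SIGN (U+00B5), as typed on the keyboard.
source = "import \u00b5Torrent"

# The parser stores the NFKC-normalized name in the Import node.
imported = ast.parse(source).body[0].names[0].name

# The same result, computed directly.
expected = unicodedata.normalize("NFKC", "\u00b5Torrent")

print(ascii(imported), ascii(expected))
print(ascii(unicodedata.normalize("NFKC", "\ufb03")))
```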
[issue10952] Don't normalize module names to NFKC?
Changes by Alexander Belopolsky belopol...@users.sourceforge.net: -- nosy: +belopolsky
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

The µTorrent.py filename example comes from #10754. That issue is unrelated to the Python parser or the import machinery: it is a surprising behaviour of the MBCS codec, which replaces unencodable characters with a similar glyph. I changed the MBCS codec in Python 3.2 to be strict (it now raises an error on an unencodable character).
[issue10952] Don't normalize module names to NFKC?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

This proposal makes sense because it would make

import µTorrent

behave the same as

µTorrent = __import__('µTorrent')

However, I think this is a feature request and a language change, because the current grammar is

import_stmt ::= "import" module ..
module ::= (identifier ".")* identifier

and in order to implement the proposed feature, module will have to become a separate AST node that won't be treated as an identifier.
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

New problem: if the parser doesn't normalize module names on import, it still normalizes module names in other statements. Example: import \xB5Torrent; del \xB5Torrent raises an error on del because the parser normalizes the del identifier (the second module name) => import \xB5Torrent; del \u03BCTorrent.
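The mismatch can be simulated today by binding the un-normalized name directly in a namespace dictionary, where no normalization applies, and then running a del statement, where it does. The dictionary ns and the variable outcome are just scaffolding for this sketch.

```python
# Bind the name under its un-normalized spelling (MICRO SIGN, U+00B5).
# Dictionary keys are plain strings, so nothing is normalized here.
ns = {"\u00b5Torrent": object()}

# The parser normalizes the identifier in 'del' to U+03BC, so the
# lookup misses the entry stored above.
try:
    exec("del \u00b5Torrent", ns)
    outcome = "deleted"
except NameError:
    outcome = "NameError"

print(outcome, "\u00b5Torrent" in ns)
```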
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

See also #3080 (which is not directly related).
[issue10952] Don't normalize module names to NFKC?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Jan 19, 2011 at 9:21 PM, STINNER Victor rep...@bugs.python.org wrote:
..
> New problem: if the parser doesn't normalize module names on import,
> it does still normalize module names on other instructions. Example:
> import \xB5Torrent; del \xB5Torrent raises an error on del because
> the parser normalized del identifier (the second module name) =>
> import \xB5Torrent; del \u03BCTorrent.

This won't be a problem if you make import \xB5Torrent behave as \xB5Torrent = __import__('\xB5Torrent'). The latter is equivalent to \u03BCTorrent = __import__('\xB5Torrent').
[issue10952] Don't normalize module names to NFKC?
STINNER Victor victor.stin...@haypocalc.com added the comment:

> This won't be a problem if you make import \xB5Torrent behave as
> (...) \u03BCTorrent = __import__('\xB5Torrent')

import name is compiled to the IMPORT_NAME(name); STORE_NAME(name) bytecode instructions. So you propose to compile it to IMPORT_NAME(name); STORE_NAME(normalized_name) if name is different from the normalized name. Ok, I think that it is possible.
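The current pairing of the two instructions can be inspected with the dis module. This is a sketch of the existing behaviour, not of the proposed change: both instructions receive the same, already normalized name.

```python
import dis

# Compile (but do not execute) an import written with MICRO SIGN, U+00B5.
code = compile("import \u00b5Torrent", "<example>", "exec")

# Keep only the two name-carrying instructions discussed above.
name_ops = [
    (ins.opname, ins.argval)
    for ins in dis.get_instructions(code)
    if ins.opname in ("IMPORT_NAME", "STORE_NAME")
]

for opname, argval in name_ops:
    print(opname, ascii(argval))
```

Under the proposal, IMPORT_NAME would carry the spelling from the source file while STORE_NAME would keep the normalized one.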
[issue10952] Don't normalize module names to NFKC?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Victor> Ok, I think that it is possible.

While it is possible, I am not sure it is a good idea. For example, if a filesystem uses an encoding that is capable of distinguishing between \xB5Torrent.py and \u03BCTorrent.py, should import \xB5Torrent and import \u03BCTorrent import different modules?

--
nosy: +loewis
[issue10952] Don't normalize module names to NFKC?
Martin v. Löwis mar...@v.loewis.de added the comment:

I think this issue falls into a similar category as support for case-insensitive but case-preserving file systems. Python uses regular file system lookups, but then may need to verify whether it got the right one.

I'd like to request that PEP 3131 is followed as it stands: identifier lookup uses NFKC, period. This gives two issues:

a) how can users make sure that they name the files correctly? and
b) what if the file system implementation mangles file names.

For b), I'd use the same approach as with case-insensitive lookups: verify that the file we read is really the one we want.

For a), wrt. "I'm not able to write U+03BC with my keyboard", I say "tough luck - don't use that character in a module name, then". Somebody with a Greek keyboard will have no problems doing that. This is really the same as any other non-ASCII character which you are unable to type: it just means that you can't conveniently enter the respective Python identifier. Just try importing саша, for example. Get a different keyboard.
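The "verify that the file we read is really the one we want" idea for b) could look roughly like the check below, which compares the requested name against the exact spellings the directory reports. The helper has_exact_entry is a hypothetical name, and this is not how the import machinery actually performs its case check.

```python
import os


def has_exact_entry(directory, filename):
    """True only if the directory lists an entry spelled exactly like filename."""
    # os.listdir() reports names as stored, so a lookup that succeeded only
    # through case folding or filesystem normalization is detected here.
    return filename in os.listdir(directory)
```

On a case-insensitive but case-preserving filesystem, opening module.py may succeed while the stored name is Module.py; the comparison above rejects that match. On a filesystem that rewrites names on creation, as HFS+ does according to this thread, the requested name would have to be compared in the filesystem's own normal form.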
[issue10952] Don't normalize module names to NFKC?
Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Jan 20, 2011 at 1:19 AM, Martin v. Löwis rep...@bugs.python.org wrote:
..
> I'd like to request that PEP 3131 is followed as it stands:
> identifier lookup uses NFKC, period. This gives two issues:
> a) how can users make sure that they name the files correctly? and
> b) what if the file system implementation mangles file names.

There is also issue c) what if the filesystem encoding can only represent a compatibility character, say U+00B5, but not its NFKC equivalent, U+03BC?

Suppose you have a system with both locale and FS encodings being Latin-1. You can write Python code using Latin-1, and the following is a valid bytestream:

b'# encoding: latin-1\nimport \xB5Torrent\n'

However, this code will always fail because '\xB5Torrent' will be normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py' cannot be created on a filesystem with Latin-1 encoding.
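Issue c) can be reproduced without a Latin-1 filesystem by compiling the bytestream and then trying to encode the resulting file name. The sketch only compiles the source (it does not execute the import), and the variable names are illustrative.

```python
# The Latin-1 encoded source from the message above.
source = b"# encoding: latin-1\nimport \xb5Torrent\n"

# compile() honours the encoding declaration and normalizes the identifier.
code = compile(source, "<latin1-example>", "exec")
wanted = code.co_names[0]

# The file name the importer would need cannot be expressed in Latin-1.
try:
    (wanted + ".py").encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(ascii(wanted), encodable)
```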