[issue10952] Don't normalize module names to NFKC?

2013-02-24 Thread Atsuo Ishimoto

Changes by Atsuo Ishimoto ishim...@gembook.org:


--
nosy: +ishimoto

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue10952
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10952] Don't normalize module names to NFKC?

2013-02-24 Thread Atsuo Ishimoto

Atsuo Ishimoto added the comment:

Converting identifiers to NFKC is problematic when working with FULLWIDTH 
letters such as 'ａ' (FULLWIDTH LATIN SMALL LETTER A).

We can create a module named 'ａａａ.py', but this module cannot be imported on 
any platform I know.

>>> import ａａａ
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'aaa'

Speaking of the Japanese environment, I don't see a benefit in normalizing 
variable names. FULLWIDTH/HALFWIDTH compatibility characters are commonly used 
here, and they are recognized as different characters.  It may be too late to 
argue, but converting to normal form NFC instead of NFKC would be better. 
Python distinguishes lowercase and uppercase letters, but doesn't distinguish 
fullwidth and halfwidth. This is pretty surprising behavior to me.
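
The difference between the two normal forms for fullwidth letters can be
checked with the standard unicodedata module. This is only a minimal sketch;
the three-letter name is the example module name from this comment.

```python
import unicodedata

fullwidth = "\uff41\uff41\uff41"  # three FULLWIDTH LATIN SMALL LETTER A

# NFKC folds compatibility characters, so fullwidth letters become ASCII.
nfkc = unicodedata.normalize("NFKC", fullwidth)
# NFC only composes canonical equivalents and leaves fullwidth letters alone.
nfc = unicodedata.normalize("NFC", fullwidth)

print(ascii(nfkc))  # 'aaa'
print(ascii(nfc))   # '\uff41\uff41\uff41'
```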

--




[issue10952] Don't normalize module names to NFKC?

2013-02-24 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
versions:  -Python 3.1




[issue10952] Don't normalize module names to NFKC?

2011-01-27 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

It looks like there is nothing interesting to do here, so I close the issue 
(which is not a bug :-)).

--
resolution:  -> invalid
status: open -> closed




[issue10952] Don't normalize module names to NFKC?

2011-01-20 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> b) what if the file system implementation mangles file names.
>
> I'd use the same approach as with case-insensitive lookups: verify
> that the file we read is really the one we want.

Only Mac OS X and the HFS+ filesystem normalize filenames (to a variant
of NFD). But such normalization is a good thing! I mean that I don't
think that we have anything to do for that.

---
The user creates a café.py file, with the name typed on the keyboard in NFD:
cafe\u0301 (this is very unlikely, since all operating systems prefer NFC for
keyboard input, but it's just to give an example). Mac OS X normalizes the
filename to NFD: cafe\u0301.py is created in the filesystem.

Then (s)he tries to import the café module, writing "import café" with
his/her NFD keyboard. Python normalizes café to NFKC (caf\xe9) and then
tries to read caf\xe9.py. Mac OS X normalizes the filename to NFD
(cafe\u0301.py) and finds this file, so it works as expected.
---
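
The two normalization steps in this scenario can be reproduced with the
unicodedata module. This is only a sketch: plain NFD stands in here for the
HFS+ behaviour, which is a variant of NFD and not exactly the same thing.

```python
import unicodedata

typed = "cafe\u0301"  # NFD spelling: 'e' followed by COMBINING ACUTE ACCENT

# What the parser does to the identifier (PEP 3131).
identifier = unicodedata.normalize("NFKC", typed)
# Approximation of what the filesystem does to the requested file name.
on_disk = unicodedata.normalize("NFD", identifier + ".py")

print(ascii(identifier))  # 'caf\xe9'
print(ascii(on_disk))     # 'cafe\u0301.py'
```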

I suppose that any filesystem normalization is good, because it avoids
surprising behaviours (eg. having two files cafe\u0301 and caf\xe9 with
names rendered exactly the same on screen). We should maybe patch
Windows, Mac OS, Linux & co to normalize to NFKC :-)

> a) how can users make sure that they name the files correctly?

> For a), wrt. "I'm not able to write U+03BC with my keyboard", I say
> "tough luck - don't use that character in a module name, then".
> Somebody with a Greek keyboard will have no problems doing that.

Even if I try to agree with "don't use that character in a module name",
it can be surprising for an English speaker who would like to use µTorrent
(U+00B5) as a module name in his/her project. She/He can create µTorrent.py
with a non-Greek keyboard (\xb5Torrent.py), but then "import µTorrent"
(import \xb5Torrent) fails with ImportError: No module named µTorrent. The
error message is really "ImportError: No module named \u03BCTorrent": the
identifier is normalized, but remember that µ (U+00B5) and μ (U+03BC)
are rendered exactly the same by most fonts.

We should at least document this surprising behaviour in the import
documentation. Something like:

 WARNING: Non-ASCII characters in module names are normalized to NFKC
by the Python parser ([PEP 3131]). For example, "import µTorrent" (µ:
U+00B5) is normalized to "import μTorrent" (μ: U+03BC): Python will try to
open \u03BCTorrent.py (or \u03BCTorrent/__init__.py), and not
\xB5Torrent.py (or \xB5Torrent/__init__.py). 

> This is really the same as any other non-ASCII character which you are
> unable to type: it just means that you can't conveniently enter the
> respective Python identifier. Just try importing саша, for example.
> Get a different keyboard.

I disagree. For identifiers in the source code, it works (transparently)
as expected.

A Greek developer starts a project using the µTorrent (\u03BCTorrent)
identifier in its source code (a variable name, not a module name). An
English developer writes a patch using µTorrent written as \xB5Torrent: both
forms are accepted by Python, and it works.

exec))
it works
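
A minimal sketch of the point above: a variable assigned through the U+00B5
spelling can be read through the U+03BC spelling, because both are normalized
to the same identifier. The namespace dictionary is only there to keep the
example self-contained.

```python
namespace = {}

# Assign through the MICRO SIGN spelling (U+00B5) ...
exec("\u00b5Torrent = 'it works'", namespace)
# ... and read through the GREEK SMALL LETTER MU spelling (U+03BC).
result = eval("\u03bcTorrent", namespace)

# Only the NFKC-normalized name is stored in the namespace.
stored_names = sorted(k for k in namespace if k != "__builtins__")

print(result)
print([ascii(n) for n in stored_names])
```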

--




[issue10952] Don't normalize module names to NFKC?

2011-01-20 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

It is the same problem as not being able to write U+03BC with a keyboard: in 
this setup, don't use U+00B5 or U+03BC. More generally: don't use non-ASCII 
characters if your setup is not fully Unicode compliant, or fix your setup :-)

--




[issue10952] Don't normalize module names to NFKC?

2011-01-20 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Jan 20, 2011 at 8:06 AM, STINNER Victor rep...@bugs.python.org wrote:
..
>> There is also issue c) what if the filesystem encoding can only
>> represent a compatibility character, say U+00B5, but not its NFKC
>> equivalent, U+03BC?
>
> It is the same problem than not being able to write U+03BC with a keyboard:

No.  This is a different problem and I agree with Martin that keyboard
limitations are not an issue.  With proper tools one can create a
'\u03BCTorrent.py' file even if the keyboard does not have a '\u03BC'
key as long as the filesystem is capable of storing such a file.  Python
itself is one such tool:

>>> with open('\u03BCTorrent.py'.encode(fsencoding), 'w') as f: ...

However, if fsencoding = 'latin-1', the code above will fail.

One possible solution to this problem is to define a 'compat' error
handler that would detect unencodable strings with encodable
compatibility equivalents and produce encoding of an NFKC equivalent
string instead of raising an error.  ISTM, that in the Latin-1
encoding, there are only five affected characters:

>>> from unicodedata import decomposition, name
>>> for i in range(0x80, 0x100):  # non-ASCII half of Latin-1
...     dec = decomposition(chr(i))
...     if dec and dec.startswith('<compat>'):
...         print("U+00%02X '%s' (%s): %s" % (i, chr(i), name(chr(i)), dec))
...
U+00A8 '¨' (DIAERESIS): <compat> 0020 0308
U+00AF '¯' (MACRON): <compat> 0020 0304
U+00B4 '´' (ACUTE ACCENT): <compat> 0020 0301
U+00B5 'µ' (MICRO SIGN): <compat> 03BC
U+00B8 '¸' (CEDILLA): <compat> 0020 0327

I suspect that the number of affected characters in the other
encodings is similarly small.  If we further limit special handling to
characters that are valid in identifiers, U+00B5 will end up being the
only such character in Latin-1.
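
The further restriction to identifier characters can be checked in the same
way. This is only a sketch: "valid in identifiers" is approximated here by
testing whether the character is accepted after a leading ASCII letter.

```python
import unicodedata

# Latin-1 characters whose decomposition is tagged <compat>.
compat_chars = [
    chr(i) for i in range(0x80, 0x100)
    if unicodedata.decomposition(chr(i)).startswith("<compat>")
]

# Accepting the character after a leading letter covers both identifier
# "start" and identifier "continue" characters.
in_identifiers = [ch for ch in compat_chars if ("a" + ch).isidentifier()]

print([ascii(ch) for ch in compat_chars])
print([ascii(ch) for ch in in_identifiers])
```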

An import mechanism using encode(fsencoding, 'compat') will, when
given either "import \u00B5Torrent" or "import \u03BCTorrent" in a
source file, open \u03BCTorrent.py when fsencoding='utf-8' and
\u00B5Torrent.py when fsencoding='latin-1'.  A packaging mechanism
that prepares code developed on a Latin-1 filesystem for distribution
would have to NFKC-normalize filenames before encoding them using
UTF-8.
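
A rough sketch of what such a 'compat' error handler might look like. Python
does not provide this handler; the names compat_errors and _find_compat are
hypothetical, and the search is limited to single-byte encodings by scanning
the bytes 0x80-0xFF for a character whose NFKC form is the unencodable one.

```python
import codecs
import unicodedata


def _find_compat(ch, encoding):
    """Return an encodable character whose NFKC form is `ch`, or None."""
    for byte in range(0x80, 0x100):
        try:
            candidate = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue
        if unicodedata.normalize("NFKC", candidate) == ch:
            return candidate
    return None


def compat_errors(exc):
    """Hypothetical 'compat' handler: substitute compatibility equivalents."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    replacement = []
    for ch in exc.object[exc.start:exc.end]:
        candidate = _find_compat(ch, exc.encoding)
        if candidate is None:
            raise exc  # no compatibility equivalent: keep the strict error
        replacement.append(candidate)
    # The codec encodes the returned string and resumes at exc.end.
    return "".join(replacement), exc.end


codecs.register_error("compat", compat_errors)

print("\u03bcTorrent.py".encode("latin-1", "compat"))  # b'\xb5Torrent.py'
```

With a UTF-8 filesystem encoding the handler is never called, so the
normalized name \u03BCTorrent.py is encoded as is.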

--




[issue10952] Don't normalize module names to NFKC?

2011-01-20 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> A packaging mechanism that prepares code developed on a Latin-1
> filesystem for distribution, would have to NFKC-normalize 
> filenames before encoding them using UTF-8.

It causes portability issues: if you copy a non-ASCII module to a new
host, the program will work or not depending on the filesystem encoding.
Having to transform the filename when you copy a file, just to fix a
corner case, is a pain.

> One possible solution to this problem is to define a 'compat' error
> handler that would detect unencodable strings with encodable
> compatibility equivalents and produce encoding of an NFKC equivalent
> string instead of raising an error.

Only a few people use non-ASCII module names and most operating systems
are able to store all Unicode characters, so I don't think that we need
to support U+00B5 in a module name on a Latin-1 filesystem at all. If you
use an old system with a Latin-1 filesystem, you have to limit your
expectations of Python's Unicode support :-)

os.fsencode() and os.fsdecode() already use a custom error handler:
surrogateescape. compat would conflict with surrogateescape. Loading a
module concatenates two parts: a path from sys.path (decoded using the
filesystem encoding and the surrogateescape error handler) and a module
name. If compat is used to encode the filename, the module name will be
encoded correctly, but not the path.
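
A small sketch of why the path part depends on surrogateescape. The byte
string is a made-up path containing a Latin-1 byte that is not valid UTF-8;
UTF-8 is assumed as the filesystem encoding for the example.

```python
raw = b"/home/caf\xe9"  # made-up path: \xe9 is not valid UTF-8 here

# surrogateescape maps each undecodable byte to a lone surrogate ...
decoded = raw.decode("utf-8", "surrogateescape")
# ... and maps it back to the original byte when encoding.
roundtrip = decoded.encode("utf-8", "surrogateescape")

print(ascii(decoded))
print(roundtrip == raw)
```

Encoding the same decoded path with any other error handler would not restore
the original byte, which is the conflict described above.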

--




[issue10952] Don't normalize module names to NFKC?

2011-01-20 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> There is also issue c) what if the filesystem encoding can only
> represent a compatibility character, say U+00B5, but not its NFKC
> equivalent, U+03BC?

That should be considered as similar to file systems that just cannot
represent certain characters at all - e.g. many of the non-ASCII
characters, or no upper-case letters. If you have such a file system,
you just cannot use these characters in a module name. Rename your
modules, then, or put the modules in a zipfile (or use some other
import hook).

> However, this code will always fail because '\xB5Torrent' will be
> normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py'
> cannot be created on a filesystem with Latin-1 encoding.

Tough luck. The filesystem just doesn't support GREEK SMALL LETTER MU,
just as it doesn't support all the other greek characters.

It may be fun coming up with these border cases. But I really don't
see a need to support them. If you really need to have that letter
in a module name, reformat your disk with a better file system.

--




[issue10952] Don't normalize module names to NFKC?

2011-01-20 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

> Only Mac OS X and the HFS+ filesystem normalize filenames (to a variant
> of NFD). But such normalization is a good thing! I mean that I don't
> think that we have anything to do for that.

That may well be - I don't have a case where this would cause problems,
either.

> We should at least document this surprising behaviour in the import
> documentation.

There are also better ways to support the user than mere
documentation. For example, the exception message could be more
helpful, and IDLE could warn the user when saving the file in the
first place.

>  WARNING: Non-ASCII characters in module names are normalized to NFKC
> by the Python parser ([PEP 3131]). For example, import µTorrent (µ:
> U+00B5) is normalized to import μTorrent (μ: U+03BC): Python will try to
> open \u03BCTorrent.py (or \u03BCTorrent/__init__.py), and not
> \xB5Torrent.py (or \xB5Torrent/__init__.py).

I can't believe this is a real problem. I'd defer warning about made-up
problems until real users report them as a real problem.

> I disagree.

If you disagree strongly, please write a PEP.

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread STINNER Victor

New submission from STINNER Victor victor.stin...@haypocalc.com:

The Python 3 parser normalizes all identifiers using NFKC (as described in 
PEP 3131). Examples:
 - U+00B5 (µ: Micro sign) is normalized to U+03BC (μ: Greek small letter mu)
 - U+FB03 (ﬃ: Latin small ligature ffi) is normalized to 'ffi'

The problem is that it also normalizes module names, but not the filename.

The module name in the Python source code is written with the keyboard (eg. 
U+00B5 in my case) and then normalized to NFKC (= U+03BC). The filename is 
also written using the keyboard (U+00B5), but it is never normalized.

The attached script tests the current behaviour using the µTorrent name with 
U+00B5 and U+03BC: an import with U+00B5 or U+03BC uses the filename with U+03BC.

The problem is that I'm able to write 'µ' (U+00B5) with my keyboard, but not 
U+03BC (μ).
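
The two normalization examples above can be verified with the unicodedata
module. This is only a small sketch of the NFKC mapping itself, not of the
import machinery.

```python
import unicodedata

for ch in ("\u00b5", "\ufb03"):
    normalized = unicodedata.normalize("NFKC", ch)
    code_points = " ".join("U+%04X" % ord(c) for c in normalized)
    print("U+%04X (%s) -> %s" % (ord(ch), unicodedata.name(ch), code_points))
```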

--
components: Interpreter Core, Unicode
files: module_name.py
messages: 126577
nosy: haypo
priority: normal
severity: normal
status: open
title: Don't normalize module names to NFKC?
versions: Python 3.1, Python 3.2, Python 3.3
Added file: http://bugs.python.org/file20459/module_name.py




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread Alexander Belopolsky

Changes by Alexander Belopolsky belopol...@users.sourceforge.net:


--
nosy: +belopolsky




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

µTorrent.py filename example comes from #10754.

That issue is unrelated to the Python parser and the import machinery: it is 
about a surprising behaviour of the MBCS codec, which replaces unencodable 
characters with a similar glyph. I changed the MBCS codec in Python 3.2 to be 
strict (it now raises an error on an unencodable character).

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

This proposal makes sense because it would make

import µTorrent

behave the same as

µTorrent = __import__('µTorrent')

However, I think this is a feature request and a language change because the 
current grammar is

import_stmt ::=  "import" module ..
module      ::=  (identifier ".")* identifier

and in order to implement the proposed feature, module will have to become a 
separate AST node that won't be treated as identifier.

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

New problem: if the parser doesn't normalize module names on import, it still 
normalizes module names in other statements.

Example: "import \xB5Torrent; del \xB5Torrent" raises an error on del because 
the parser normalizes the del identifier (the second module name) => "import 
\xB5Torrent; del \u03BCTorrent".

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

See also #3080 (which is not directly related).

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Wed, Jan 19, 2011 at 9:21 PM, STINNER Victor rep...@bugs.python.org wrote:
..
> New problem: if the parser doesn't normalize module names on import, it does
> still normalize module names on other instructions.
>
> Example: import \xB5Torrent; del \xB5Torrent raises an error on del because
> the parser normalized del identifier (the second module name) => import
> \xB5Torrent; del \u03BCTorrent.


This won't be a problem if you make "import \xB5Torrent" behave as
"\xB5Torrent = __import__('\xB5Torrent')".  The latter is equivalent
to "\u03BCTorrent = __import__('\xB5Torrent')".

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

> This won't be a problem if you make 
> "import \xB5Torrent" 
> behave as (...)
> "\u03BCTorrent = __import__('\xB5Torrent')"

import name is compiled to IMPORT_NAME(name); STORE_NAME(name) bytecode 
instructions. So you proposed to compile it to IMPORT_NAME(name); 
STORE_NAME(normalized_name) if name is different than the normalized name. Ok, 
I think that it is possible.
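
The current behaviour can be observed by compiling an import statement and
looking at the names stored in the code object. This is a sketch of today's
compilation only, not of the proposed change; the file name "<example>" is
arbitrary.

```python
import dis

# Source text spelled with MICRO SIGN (U+00B5).
code = compile("import \u00b5Torrent", "<example>", "exec")

# IMPORT_NAME and STORE_NAME share one entry in co_names, and that entry
# is already NFKC-normalized by the parser.
print([ascii(n) for n in code.co_names])

name_ops = [
    (ins.opname, ascii(ins.argval))
    for ins in dis.get_instructions(code)
    if ins.opname in ("IMPORT_NAME", "STORE_NAME")
]
print(name_ops)
```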

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

Victor> Ok, I think that it is possible.

While it is possible, I am not sure it is a good idea.  For example, if a 
filesystem uses encoding that is capable of distinguishing between 
\xB5Torrent.py and \u03BCTorrent.py, should import \xB5Torrent and 
import \u03BCTorrent import different modules?

--
nosy: +loewis




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

I think this issue falls into a similar category as support for 
case-insensitive but case-preserving file systems. Python uses regular file 
system lookups, but then may need to verify whether it got the right one.

I'd like to request that PEP 3131 is followed as it stands: identifier lookup 
uses NFKC, period. This gives two issues: a) how can users make sure that they 
name the files correctly? and b) what if the file system implementation mangles 
file names.

For b), I'd use the same approach as with case-insensitive lookups: verify that 
the file we read is really the one we want. For a), wrt. "I'm not able to write 
U+03BC with my keyboard", I say "tough luck - don't use that character in a 
module name, then". Somebody with a Greek keyboard will have no problems doing 
that. This is really the same as any other non-ASCII character which you are 
unable to type: it just means that you can't conveniently enter the respective 
Python identifier. Just try importing "саша", for example. Get a different 
keyboard.
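
A minimal sketch of the kind of verification meant for b), assuming it is
enough to compare the wanted name against the directory listing. The helper
has_exact_entry is hypothetical and is not what the import machinery uses.

```python
import os
import tempfile


def has_exact_entry(directory, filename):
    """Return True if `directory` lists an entry spelled exactly `filename`."""
    return filename in os.listdir(directory)


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "Spam.py"), "w").close()
    # On a case-insensitive filesystem open("spam.py") would succeed, but the
    # directory listing still reports the name as it was created.
    exact = has_exact_entry(tmp, "Spam.py")
    wrong_case = has_exact_entry(tmp, "spam.py")

print(exact, wrong_case)
```

The same comparison would also catch a filesystem that stores a differently
normalized name than the one requested.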

--




[issue10952] Don't normalize module names to NFKC?

2011-01-19 Thread Alexander Belopolsky

Alexander Belopolsky belopol...@users.sourceforge.net added the comment:

On Thu, Jan 20, 2011 at 1:19 AM, Martin v. Löwis rep...@bugs.python.org wrote:
..
> I'd like to request that PEP 3131 is followed as it stands: identifier lookup
> uses NFKC, period. This gives two issues: a) how can users make sure that
> they name the files correctly? and b) what if the file system implementation
> mangles file names.


There is also issue c) what if the filesystem encoding can only
represent a compatibility character, say U+00B5, but not its NFKC
equivalent, U+03BC?  Suppose you have a system with both locale and FS
encodings being Latin-1.  You can write Python code using Latin-1 and
the following is valid bytestream:

b'# encoding: latin-1\nimport \xB5Torrent\n'

However, this code will always fail because '\xB5Torrent' will be
normalized into '\u03BCTorrent' and a file named '\u03BCTorrent.py'
cannot be created on a filesystem with Latin-1 encoding.
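
A short sketch of issue c), assuming Latin-1 as the filesystem encoding: the
name as typed is encodable, but the NFKC-normalized name that the import
looks for is not.

```python
import unicodedata

source = b"# encoding: latin-1\nimport \xb5Torrent\n"
typed_name = source.decode("latin-1").split()[-1]  # module name as typed

# File name the import will look for after NFKC normalization.
wanted = unicodedata.normalize("NFKC", typed_name) + ".py"

try:
    wanted.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(ascii(typed_name), ascii(wanted), encodable)
```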

--
