[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread Ronald Oussoren

Ronald Oussoren  added the comment:

For completeness' sake: Apple's Cocoa APIs do not renormalize strings. That is, 
I created a file named 'één' in the Terminal, then ran (using a Python 3.2 
build):

# Terminal input seems NFC:
>>> len('één')
3

# Output from os.listdir isn't:
>>> import os
>>> os.listdir('.')
['één']
>>> len(_[0])
5

# Output from the Cocoa equivalent also isn't:
>>> import Foundation
>>> mgr = Foundation.NSFileManager.defaultManager()
>>> mgr.directoryContentsAtPath_('.')
(
"e\U0301e\U0301n"
)
>>> len(_[0])
5

BTW, fsdecode(fsencode(x)) cannot in general be a no-op: Unicode normalization 
can get in the way (with the now-withdrawn proposal, the expression would not 
be a no-op for NFD strings).
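
The length difference above can be reproduced without a Mac OS X filesystem. 
This is only a minimal sketch using the standard unicodedata module; the 
variable names are illustrative and not taken from the session above.

```python
import unicodedata

# 'één' written with precomposed characters (U+00E9), i.e. NFC.
name_nfc = unicodedata.normalize("NFC", "\u00e9\u00e9n")
# The same name with each 'é' split into 'e' + U+0301, i.e. NFD.
name_nfd = unicodedata.normalize("NFD", name_nfc)

print(len(name_nfc))         # 3
print(len(name_nfd))         # 5
print(name_nfc == name_nfd)  # False: equal for a reader, unequal as str
```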

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread STINNER Victor

Changes by STINNER Victor :


--
resolution:  -> fixed
status: open -> closed




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread STINNER Victor

STINNER Victor  added the comment:

I now agree with Martin: "Mac OS X: Decompose filenames on encode, and 
precompose filenames on decode" was a bad idea; fixing the test is the right 
solution.

test_pep277 now passes on the "x86 Tiger 3.x" buildbot, so I can close this 
issue and issue #8423.




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread STINNER Victor

STINNER Victor  added the comment:

> - r85897 disables the filenames that are normalized differently by Python and 
> by darwin
> - r85899 disables test_normalize and test_listdir tests

It looks like r85897 is enough to fix test_pep277 on the "x86 Tiger 3.x" 
buildbot. But r85899 should not make the situation worse :-)




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread STINNER Victor

STINNER Victor  added the comment:

> The problem with test_normalize() and test_listdir() of test_pep277
> is maybe that these tests are irrelevant on Mac OS X?

I tried a different approach (different than my patch and the svn branch):
 - r85897 disables the filenames that are normalized differently by Python and 
by darwin
 - r85899 disables test_normalize and test_listdir tests

Let's watch the buildbots...




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread STINNER Victor

STINNER Victor  added the comment:

> My question is rather why it failed in the first place, 
> when issue8207 had supposedly fixed it.

r79426 (of #8207) only disabled some tests.

The problem with test_normalize() and test_listdir() of test_pep277 is maybe 
that these tests are irrelevant on Mac OS X?

I still don't understand exactly why the tests fail and what the tests check.




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

> Yes, but not exactly... Mac OS X NFD normalization is a little bit
> different than Python's normalization: see msg105669 and 
> http://developer.apple.com/library/mac/#qa/qa2001/qa1173.html

I see. This is one more reason not to convert strings into NFD, no?

> I don't understand why test_pep277 passes on the issue10209 branch, but it
> works. I suppose that normalizing the filename to NFD in Python avoids
> some Mac OS X normalization bugs?

My question is rather why it failed in the first place, when issue8207
had supposedly fixed it.

> I propose to normalize to NFC because Qt does that.

Hmm. I find that a weak argument - in particular given that the
system will normalize then in turn anyway, and to a slightly different
normal form. So what is Qt's motivation to normalize?

> On Linux, the keyboard uses NFC.

I think this is technically incorrect. When you press é, then some
scan code is generated. That goes through various mapping layers.
The outcome will depend on how specifically these layers are
configured.

> Which norm is used on Mac OS X, eg. for the keyboard?

Same reasoning: pressing a key initially does not generate any Unicode
at all. My guess is that when eventually a character is generated
(e.g. on the terminal), no normal form is used; instead, it most likely
will always strive to generate a single character (even if that is not
normalized). See

http://developer.apple.com/library/mac/#qa/qa2001/qa1235.html

which says "Macintosh keyboards generally produce precomposed Unicode"

> Anyway, I think that os.fsencode(os.fsdecode(name)) should be equal
> to name.

I agree, and that is already the case.

> If it's different, "open(name, 'w').close(); name in
> listdir()" is False (on systems storing filenames as bytes). So if
> you change fsdecode(), fsencode() should also be changed.

I'm saying that fsdecode shouldn't change, either, the primary reason
being backwards compatibility here.
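
The round trip can be checked directly. What follows is only a small sketch, 
assuming a UTF-8 filesystem encoding; the byte strings are made-up examples. 
It shows that the existing os.fsdecode()/os.fsencode() pair returns the 
original bytes for both spellings of a name, because neither function 
renormalizes.

```python
import os
import unicodedata

# Two byte spellings of the same visible name, both valid UTF-8.
nfc_bytes = unicodedata.normalize("NFC", "\u00e9\u00e9n").encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", "\u00e9\u00e9n").encode("utf-8")

for raw in (nfc_bytes, nfd_bytes):
    # Decoding and re-encoding leaves the bytes untouched.
    roundtrip = os.fsencode(os.fsdecode(raw))
    print(raw, roundtrip == raw)
```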

--
title: Mac OS X: Decompose filenames on encode, and precompose filenames on 
decode -> Mac OS X: Decompose filenames on encode,  and precompose filenames on 
decode




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread STINNER Victor

STINNER Victor  added the comment:

Some pointers.

"MacFUSE"
http://code.google.com/p/macfuse/issues/detail?id=139#c2

"FILENAME_ENCODING_PROPOSAL" (MacFUSE)
http://code.google.com/p/macfuse/wiki/FILENAME_ENCODING_PROPOSAL

"Converting to Precomposed Unicode"
http://developer.apple.com/library/mac/#qa/qa2001/qa1235.html

"Unicode NFD and file attachment on Mac OS X" (filenames of email attachments)
http://lists.w3.org/Archives/Public/www-international/2003OctDec/0079.html
extract: " the applications dealing with these files names should convert it to 
NFC before sending it to the wire."

"Bug: TWiki on Mac OS X server with I18N generates odd looking file names"
http://twiki.org/cgi-bin/view/Codev/MacOSXFilesystemEncodingWithI18N
(search "NFD" or "HFS+")




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread STINNER Victor

STINNER Victor  added the comment:

> I'd like to see this patch reverted.

I created a specific branch to test the patch (I also patched 
PyUnicode_EncodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize()): 
issue10209. test_pep277 now passes in this branch!

> encoding with NFD should not be necessary, as the system will 
> do that, anyway.

Yes, but not exactly... Mac OS X NFD normalization is a little bit different 
than Python's normalization: see msg105669 and
http://developer.apple.com/library/mac/#qa/qa2001/qa1173.html

I don't understand why test_pep277 passes on the issue10209 branch, but it 
works. I suppose that normalizing the filename to NFD in Python avoids some 
Mac OS X normalization bugs?

> decoding with NFC is incompatible with previous Python releases,
> I can't see why NFC is conceptually better than NFD.

I propose to normalize to NFC because Qt does that.

On Linux, the keyboard uses NFC. E.g. pressing the é key writes U+00E9, not 
U+0065 U+0301. If you ask the user to type a filename, the filename will be 
stored in the same form. So indirectly, Linux stores filenames as NFC.

Which norm is used on Mac OS X, eg. for the keyboard?

To display a filename, the normalization form is not important. With my patch, 
the form also no longer matters when accessing the filesystem (no more strange 
Mac OS X normalization bug). So it only matters when comparing two filenames. 
If the two filenames are normalized to different forms (e.g. NFC vs NFD), they 
will be seen as different even if they are the same name.
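
One way to make such a comparison independent of the form is to normalize 
both sides first. This is only a sketch: same_filename() is a hypothetical 
helper, not an existing API, and it uses Python's own normalization, which 
(as noted above) is slightly different from what Mac OS X applies.

```python
import unicodedata

def same_filename(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare two names after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

typed = "\u00e9\u00e9n"        # NFC, e.g. as typed on a keyboard
listed = "e\u0301e\u0301n"     # NFD, e.g. as returned by os.listdir()

print(typed == listed)               # False
print(same_filename(typed, listed))  # True
```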

--

Anyway, I think that os.fsencode(os.fsdecode(name)) should be equal to name. If 
it's different, "open(name, 'w').close(); name in listdir()" is False (on 
systems storing filenames as bytes). So if you change fsdecode(), fsencode() 
should also be changed.




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-28 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

I'd like to see this patch reverted. I don't think it is useful.

1. encoding with NFD should not be necessary, as the system will do that, 
anyway.
2. decoding with NFC is incompatible with previous Python releases, and I can't 
see why NFC is conceptually better than NFD.

To give an analogy: if we have a case-insensitive file system, we don't 
normalize into lower-case, either, do we?




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-26 Thread STINNER Victor

STINNER Victor  added the comment:

Patch for os.fsencode/fsdecode that imports unicodedata inside the function 
(instead of a global import). The unicodedata module is not built in and is 
loaded dynamically. Maybe we should ignore ImportError if the module is not 
available? With a warning?

For PyUnicode_EncodeFSDefault() and PyUnicode_DecodeFSDefault(AndSize)() (C 
implementation), we can maybe use a hook (e.g. implemented as a configurable 
callback) and set the hook after loading the unicodedata module.

It would be easier if unicodedata were a builtin module :-)
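
The idea of a function-level import with a fallback could look roughly like 
the sketch below. It is not the content of the attached patch: 
normalize_filename() is a hypothetical name, and the choice of RuntimeWarning 
is an assumption.

```python
import warnings

def normalize_filename(name: str, form: str) -> str:
    """Hypothetical helper: normalize only if unicodedata can be loaded."""
    try:
        # Imported here, not at module level, because unicodedata is a
        # dynamically loaded extension module and may be missing.
        import unicodedata
    except ImportError:
        warnings.warn("unicodedata is unavailable; filename not normalized",
                      RuntimeWarning)
        return name
    return unicodedata.normalize(form, name)

print(ascii(normalize_filename("\u00e9", "NFD")))  # 'e\u0301'
```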

--
keywords: +patch
Added file: http://bugs.python.org/file19377/10209.patch




[issue10209] Mac OS X: Decompose filenames on encode, and precompose filenames on decode

2010-10-26 Thread STINNER Victor

New submission from STINNER Victor :

PyUnicode_EncodeFSDefault() and os.fsencode() should decompose the filename 
(NFD) before encoding it to utf-8.

PyUnicode_DecodeFSDefault(AndSize)() and os.fsdecode() should precompose the 
filename (NFC) after decoding it from utf-8.
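
In pure Python, the two steps proposed above could be sketched as follows. 
The function names are hypothetical, and the "surrogateescape" error handler 
is an assumption about how undecodable bytes would be kept. The sketch also 
shows the consequence for a name that is already NFD: it does not survive 
the round trip unchanged.

```python
import unicodedata

def proposed_fsencode(name: str) -> bytes:
    # Decompose (NFD) first, then encode to UTF-8.
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.encode("utf-8", "surrogateescape")

def proposed_fsdecode(raw: bytes) -> str:
    # Decode from UTF-8 first, then precompose (NFC).
    decoded = raw.decode("utf-8", "surrogateescape")
    return unicodedata.normalize("NFC", decoded)

nfc = "\u00e9t\u00e9"
nfd = unicodedata.normalize("NFD", nfc)

print(proposed_fsencode(nfc) == proposed_fsencode(nfd))   # True
print(proposed_fsdecode(proposed_fsencode(nfc)) == nfc)   # True
print(proposed_fsdecode(proposed_fsencode(nfd)) == nfd)   # False
```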

Qt library does this on Mac: see locale_encode()/locale_decode() (filename 
encoder/decoder) functions in src/corelib/io/qfile.cpp.

It should fix some issues of test_pep277 on Mac OS X (see #8423).

I'm not completely sure that we should do that :-)

(I used the nosy list from issues #4388 and #8423).

--

Technical Q&A QA1173, Text Encodings in VFS:
http://developer.apple.com/mac/library/qa/qa2001/qa1173.html

Q: I'm writing a file system (VFS) plug-in for Mac OS X. How do I handle text 
encodings correctly?
A: In Mac OS X's VFS API file names are, by definition, canonically decomposed 
Unicode, encoded using UTF-8. This raises a number of interesting issues. (...)

--
assignee: ronaldoussoren
components: Interpreter Core, Macintosh, Unicode
messages: 119662
nosy: MrJean1, amaury.forgeotdarc, db3l, flox, haypo, ixokai, loewis, 
mark.dickinson, michael.foord, ned.deily, piro, pitrou, ronaldoussoren, 
rpetrov, skip.montanaro, slmnhq
priority: normal
severity: normal
status: open
title: Mac OS X: Decompose filenames on encode, and precompose filenames on 
decode
versions: Python 3.2
