Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:52 PM, came the following characters from the keyboard of Martin v. Löwis:
>>>> C. File on disk with the invalid surrogate code, accessed via the str
>>>> interface, no decoding happens, matches in memory the file on disk with
>>>> the byte that translates to the same surrogate, accessed via the bytes
>>>> interface. Ambiguity.
>>>
>>> Is that an alternative to A and B?
>>
>> I guess it is an adjunct to case B, the current PEP. It is what happens
>> when using the PEP on a system that provides both bytes and str
>> interfaces, and both get used.
>
> Your formulation is a bit too stenographic for me, but please trust me
> that there is *no* ambiguity in the case you construct.

No Martin, the point of reviewing the PEP is to _not_ trust you, even though you are generally very knowledgeable and very trustworthy. It is much easier to find problems before something is released, or even coded, than it is afterwards.

> By "accessed via the str interface", I assume you do something like
>
>   fn = "some string"
>   open(fn)
>
> You are wrong in assuming "no decoding happens", and that "matches in
> memory the file on disk" (whatever that means - how do I match a file on
> disk in memory??). What happens instead is that fn gets *encoded* with
> the file system encoding, and the python-escape handler. This will *not*
> produce an ambiguity.

You assumed, and maybe I wasn't clear in my statement. By "accessed via the str interface" I mean that (on Windows) the wide string interface would be used to obtain a file name. Now, suppose that the file name returned contains "abc" followed by the half-surrogate U+DC10 -- four 16-bit codes. Then, ask for the same filename via the bytes interface, using UTF-8 encoding. The PEP says that the above name would get translated to "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes used to represent the half-surrogate that is actually in the file name, specifically U+DCED U+DCB0 U+DC90. This means that one name on disk can be seen as two different names in memory. Now posit another file which, when accessed via the str interface, has the name "abc" followed by U+DCED U+DCB0 U+DC90. Looks ambiguous to me. Now if you have a scheme for handling this case, fine, but I don't understand it from what is written in the PEP.

> If you think there is an ambiguity in that you can use both the byte
> interface and the string interface to access the same file: this would
> be a ridiculous interpretation. *Of course* you can access /etc/passwd
> both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous
> about that.

Yes, this would be a ridiculous interpretation of "ambiguous".

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
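[A sketch of the scenario Glenn describes, using the error handlers that later shipped in CPython: surrogatepass here only models a file system that stores a wide-API name containing a lone surrogate, and surrogateescape stands in for the PEP's python-escape handler. The file names are hypothetical.]

    # A file whose wide-API (str) name contains the lone surrogate U+DC10:
    str_name = 'abc\udc10'

    # The same name seen through a UTF-8 bytes interface: the lone surrogate
    # occupies three bytes on disk.
    raw = str_name.encode('utf-8', 'surrogatepass')      # b'abc\xed\xb0\x90'

    # Decoding those bytes under the PEP escapes each byte separately:
    via_bytes = raw.decode('utf-8', 'surrogateescape')   # 'abc\udced\udcb0\udc90'

    # The two in-memory names differ - Martin's point - but a second file
    # actually named 'abc\udced\udcb0\udc90' via the str interface would
    # collide with via_bytes - Glenn's point.
    print(str_name == via_bytes)                          # False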
Re: [Python-Dev] PEP 383 (again)
On Wed, Apr 29, 2009 at 07:45, "Martin v. Löwis" wrote:
> Your claim was that PEP 383 may have unfortunate effects on Windows,

No, I simply think that PEP 383 is not sufficiently specified to be able to tell.

> and I'm telling you that it won't, because the behavior of Python on
> Windows won't change at all.

A justification for your proposal is that there are differences between Python on UNIX and Windows that you would like to reduce. But depending on where you introduce utf-8b coding on UNIX, you may also have to introduce it on Windows in order to keep the platforms consistent.

> So whatever the problem - it's there already, and the PEP is not going
> to change it.

OK, so you are saying that under PEP 383, utf-8b wouldn't be used anywhere on Windows by default. That's not clear from your proposal.

It's also not clear from your proposal where utf-8b will get used on UNIX systems. Some of the places that have been suggested are: open, os.listdir, sys.argv, os.getenv. There are other potential ones, like print, write, and os.system. And what about text file and string conversions: will utf-8b become the default, or optional, or unavailable? Each of those choices potentially has significant implications. I'm just asking what those choices are so that one can then talk about the implications and see whether this proposal is a good one or whether other alternatives are better.

Tom
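[For reference, a sketch of how these questions were eventually settled in the final PEP as implemented in CPython, where python-escape landed under the name surrogateescape: the handler applies at OS boundaries - file names, environment variables, command-line arguments - on POSIX, while ordinary text I/O stays strict unless explicitly opted in.]

    import sys

    sys.getfilesystemencoding()    # the locale encoding, e.g. 'utf-8'

    # OS-boundary functions decode with errors='surrogateescape' on POSIX,
    # so os.listdir(), sys.argv and os.environ never fail on odd bytes.

    # Ordinary text streams keep strict error handling by default:
    f = open('out.txt', 'w', encoding='utf-8')                            # strict
    g = open('out.txt', 'w', encoding='utf-8', errors='surrogateescape')  # opt-in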
Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces
> I would like utility functions to perform:
>   os-bytes->funny-encoded
>   funny-encoded->os-bytes
> or explicit example code snippets for same in the PEP text.

Done!

Martin
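[A minimal sketch of what such helpers can look like, assuming a POSIX-style locale encoding and the PEP's python-escape handler under the name it eventually shipped with, surrogateescape; the function names follow Cameron's proposal elsewhere in this thread.]

    import sys

    def fsdecode(data: bytes) -> str:
        """os bytes -> funny-encoded str, as os.listdir() would produce."""
        return data.decode(sys.getfilesystemencoding(), 'surrogateescape')

    def fsencode(name: str) -> bytes:
        """funny-encoded str -> os bytes, as open() would consume."""
        return name.encode(sys.getfilesystemencoding(), 'surrogateescape')

    # Round-trips arbitrary bytes, decodable or not:
    assert fsencode(fsdecode(b'caf\xe9')) == b'caf\xe9'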
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> I'm more concerned with your (yours? someone else's?) mention of shift
> characters. I'm unfamiliar with these encodings: to translate such a
> thing into a Latin example, is it the case that there are schemes with
> valid encodings that look like:
>
>   [SHIFT] a b c
>
> which would produce "ABC" in unicode, which is ambiguous with:
>
>   A B C
>
> which would also produce "ABC"?

No: the "shift" in "shift-jis" is not really about the shift key. See

http://en.wikipedia.org/wiki/Shift-JIS

Regards,
Martin
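[Encodings with real mode-shifting do exist, though: the ISO 2022 family switches character sets with escape sequences, while Shift-JIS, despite its name, is stateless. A quick illustration in Python 3:]

    s = '\u3042'  # HIRAGANA LETTER A

    # ISO-2022-JP brackets the character with mode-switch escape sequences:
    print(s.encode('iso2022_jp'))   # b'\x1b$B$"\x1b(B'

    # Shift-JIS needs no shift bytes at all:
    print(s.encode('shift_jis'))    # b'\x82\xa0'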
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>> The Python UTF-8 codec will happily encode half-surrogates; people argue
>> that it is a bug that it does so, however, it would help in this
>> specific case.
>
> Can we use this encoding scheme for writing into files as well? We've
> turned the filename with undecodable bytes into a string with half
> surrogates. Putting that string into a file has to turn them into bytes
> at some level. Can we use the python-escape error handler to achieve
> that somehow?

Sure: if you are aware that what you write to the stream is actually a file name, you should encode it with the file system encoding, and the python-escape handler.

However, it's questionable that the same approach is right for the rest of the data that goes into the file. If you use a different encoding on the stream, yet still use the python-escape handler, you may end up with completely non-sensical bytes. In practice, it probably won't be that bad - python-escape has likely escaped all non-ASCII bytes, so that on re-encoding with a different encoding, only the ASCII characters get encoded, which likely will work fine.

Regards,
Martin
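[A sketch of what Martin describes, with python-escape under its eventual name surrogateescape; the file name is hypothetical.]

    # A name with an undecodable byte, funny-decoded at the OS boundary:
    fn = b'caf\xe9'.decode('utf-8', 'surrogateescape')   # 'caf\udce9'

    # Writing it out with the same encoding and handler round-trips the
    # original bytes exactly:
    fn.encode('utf-8', 'surrogateescape')                 # b'caf\xe9'

    # A strict encoder refuses the half surrogate, which is how such rare
    # names surface as late errors in serialization code:
    try:
        fn.encode('utf-8')
    except UnicodeEncodeError as e:
        print(e)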
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
>>> C. File on disk with the invalid surrogate code, accessed via the str
>>> interface, no decoding happens, matches in memory the file on disk with
>>> the byte that translates to the same surrogate, accessed via the bytes
>>> interface. Ambiguity.
>>
>> Is that an alternative to A and B?
>
> I guess it is an adjunct to case B, the current PEP.
>
> It is what happens when using the PEP on a system that provides both
> bytes and str interfaces, and both get used.

Your formulation is a bit too stenographic for me, but please trust me that there is *no* ambiguity in the case you construct.

By "accessed via the str interface", I assume you do something like

  fn = "some string"
  open(fn)

You are wrong in assuming "no decoding happens", and that "matches in memory the file on disk" (whatever that means - how do I match a file on disk in memory??). What happens instead is that fn gets *encoded* with the file system encoding, and the python-escape handler. This will *not* produce an ambiguity.

If you think there is an ambiguity in that you can use both the byte interface and the string interface to access the same file: this would be a ridiculous interpretation. *Of course* you can access /etc/passwd both as "/etc/passwd" and b"/etc/passwd", there is nothing ambiguous about that.

Regards,
Martin
Re: [Python-Dev] PEP 383 (again)
> The wide APIs use UTF-16. UTF-16 suffers from the same problem as
> UTF-8: not all sequences of words are valid UTF-16 sequences. In
> particular, sequences containing isolated surrogates are not
> well-formed according to the Unicode standard. Therefore, the existence
> of a wide character API function does not guarantee that the wide
> character strings it returns can be converted into valid unicode
> strings. And, in fact, Windows Vista happily creates files with
> malformed UTF-16 encodings, and os.listdir() happily returns them.

Whatever. What does that have to do with PEP 383? Your claim was that PEP 383 may have unfortunate effects on Windows, and I'm telling you that it won't, because the behavior of Python on Windows won't change at all. So whatever the problem - it's there already, and the PEP is not going to change it.

I personally don't see a problem here - *of course* os.listdir will report invalid utf-16 encodings, if that's what is stored on disk. It doesn't matter whether the file names are valid wrt. some specification. What matters is that you can access all the files.

Regards,
Martin
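[A small demonstration of the underlying fact: Python's str happily holds an isolated surrogate, but the string cannot be strictly encoded to any UTF form, which is all "not well-formed" means here.]

    name = 'abc\ud800'         # isolated high surrogate: a legal str value

    try:
        name.encode('utf-16')  # strict encoding rejects it
    except UnicodeEncodeError as e:
        print(e)               # "... surrogates not allowed"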
Re: [Python-Dev] PEP 383 (again)
> It cannot crash Python; it can only crash hypothetical third-party
> programs or libraries with deficient error checking and unreasonable
> assumptions about input data.

The error checking isn't necessarily deficient. For example, a safe and legitimate thing to do is for third party libraries to throw a C++ exception, raise a Python exception, or delete the half surrogate. Any of those would break one of the use cases people have been talking about, namely being able to present the output from os.listdir() to the user, say in a file selector, and then access that file.

> (and, of course, you haven't even proven those programs or libraries
> exist)

PEP 383 is a proposal that suggests changing Python such that malformed unicode strings become a required part of Python and such that Python writes illegal UTF-8 encodings to UTF-8 encoded file systems. Those are big changes, and it's legitimate to ask that PEP 383 address the implications of that choice before it's made.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson:
> I think I may be able to resolve Glenn's issues with the scheme lower
> down (through careful use of definitions and hand waving).

Close. You at least resolved what you thought my issue was. And, you did make me more comfortable with the idea that I, in programs I write, would not be adversely affected by the PEP if implemented. While I can see that the PEP no doubt solves the os.listdir / open problem on POSIX systems for Python 3 + PEP programs that don't use 3rd party libraries, it does require programs that do use 3rd party libraries to be recoded with your functions -- which so far the PEP hasn't embraced. Or, to use the bytes APIs directly to get file names for 3rd party libraries -- but the directly ported, filenames-as-strings type of applications that could call 3rd party filenames-as-bytes libraries in 2.x must be tweaked to do something different than they did before.

> On 27Apr2009 23:52, Glenn Linderman wrote:
>> On approximately 4/27/2009 7:11 PM, came the following characters from
>> the keyboard of Cameron Simpson:
> [...]
>>> There may be puns. So what? Use the right strings for the right purpose
>>> and all will be well.
>>>
>>> I think what is missing here, and missing from Martin's PEP, is some
>>> utility functions for the os.* namespace.
>>>
>>> PROPOSAL: add to the PEP the following functions:
>>>
>>>   os.fsdecode(bytes) -> funny-encoded Unicode
>>> This is what os.listdir() does to produce the strings it hands out.
>>>   os.fsencode(funny-string) -> bytes
>>> This is what open(filename,..) does to turn the filename into bytes
>>> for the POSIX open.
>>>   os.pathencode(your-string) -> funny-encoded-Unicode
>>> This is what you must do to a de novo string to turn it into a
>>> string suitable for use by open. Importantly, for most strings not
>>> hand crafted to have weird sequences in them, it is a no-op. But it
>>> will recode your puns for survival.
> [...]
>>>> So assume a non-decodable sequence in a name. That puts us into
>>>> Martin's funny-decode scheme. His funny-decode scheme produces a
>>>> bare string, indistinguishable from a bare string that would be
>>>> produced by a str API that happens to contain that same sequence.
>>>> Data puns.
>>>
>>> See my proposal above. Does it address your concerns? A program still
>>> must know the provenance of the string, and _if_ you're working with
>>> non-decodable sequences in names then you should transmute them into
>>> the funny encoding using the os.pathencode() function described above.
>>> In this way the punning issue can be avoided. _Lacking_ such a
>>> function, your punning concern is valid.
>>
>> Seems like one would also desire os.pathdecode to do the reverse.
>
> Yes.
>
>> And also versions that take or produce bytes from funny-encoded strings.
>
> Isn't that the first two functions above?

Yes, sorry.

>> Then, if programs were re-coded to perform these transformations on what
>> you call de novo strings, then the scheme would work. But I think a
>> large part of the incentive for the PEP is to try to invent a scheme
>> that intentionally allows for the puns, so that programs do not need to
>> be recoded in this manner, and yet still work. I don't think such a
>> scheme exists.
>
> I agree no such scheme exists. I don't think it can, just using strings.
>
> But _unless_ you have made a de novo handcrafted string with ill-formed
> sequences in it, you don't need to bother because you won't _have_ puns.
> If Martin's using half surrogates to encode "undecodable" bytes, then no
> normal string should conflict because a normal string will contain
> _only_ Unicode scalar values. Half surrogate code points are not such.
>
> The advantage here is that unless you've deliberately constructed an
> ill-formed unicode string, you _do_not_ need to recode into
> funny-encoding, because you are already compatible. Somewhat like one
> doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.

Right. And I don't intend to generate ill-formed Unicode strings, in my programs. But I might well read their names from other sources.

It is nice, and thank you for emphasizing (although I already did realize it, back there in the far reaches of the brain) that all the data puns are between ill-formed Unicode strings, and undecodable bytes strings. That is a nice property of the PEP's encoding/decoding method. I'm not sure it outweighs the disadvantage of taking unreadable gibberish, and producing indecipherable gibberish (codepoints with no glyphs), though, when there are ways to produce decipherable gibberish instead... or at least mostly-decipherable gibberish. Another idea forms, described below.

>> If there is going to be a required transformation from de novo strings
>> to funny-encoded strings, then why not make one that people can actually
>> see and compare and decode from the displayable form, by using
>> displayable characters instead of lone surrogates?
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray:
> On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
>> C. File on disk with the invalid surrogate code, accessed via the str
>> interface, no decoding happens, matches in memory the file on disk with
>> the byte that translates to the same surrogate, accessed via the bytes
>> interface. Ambiguity.
>
> Unless I'm missing something, one of these is type str, and the other is
> type bytes, so no ambiguity.

You are missing that the bytes value would get decoded to a str; thus both are str; so ambiguity is possible.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 13:37, Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities
>>> as the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>> UTF-8 (the PEP actually never proposed this to happen).
>> This introduces an ambiguity: two different files in the same
>> directory may decode to the same string name, if one has the PUA
>> character, and the other has a non-decodable byte that gets decoded
>> to the same PUA character.
>>
>> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
>> The same ambiguity does *NOT* exist. If a file on disk already
>> contains an invalid surrogate code in its file name, then the UTF-8b
>> decoder will recognize this as invalid, and decode it byte-for-byte,
>> into three surrogate codes. Hence, the file names that are different
>> on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.

Is this a Windows example, or (now I think on it) an equivalent POSIX example of using the PEP where the locale encoding is UTF-16?

In either case, I would say one could make an argument for being stricter in reading in OS-native sequences. Grant that NTFS doesn't prevent half-surrogates in filenames, and likewise that POSIX won't because to the OS they're just bytes. On decoding, require well-formed data. When you hit ill-formed data, treat the nasty half surrogate as a PAIR of bytes to be escaped in the resulting decode. Ambiguity avoided.

I'm more concerned with your (yours? someone else's?) mention of shift characters. I'm unfamiliar with these encodings: to translate such a thing into a Latin example, is it the case that there are schemes with valid encodings that look like:

  [SHIFT] a b c

which would produce "ABC" in unicode, which is ambiguous with:

  A B C

which would also produce "ABC"?

Cheers,
-- 
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

Helicopters are considerably more expensive [than fixed wing aircraft], which is only right because they don't actually fly, but just beat the air into submission. - Paul Tomblin
[Python-Dev] Proposed: a new function-based C API for declaring Python types
EXECUTIVE SUMMARY

I've written a patch against py3k trunk creating a new function-based API for creating extension types in C. This allows PyTypeObject to become a (mostly) private structure.

THE PROBLEM

Here's how you create an extension type using the current API.

* First, find some code that already has a working type declaration. Copy and paste their fifty-line PyTypeObject declaration, then hack it up until it looks like what you need.

* Next--hey! There *is* no next, you're done. You can immediately create an object using your type and pass it into the Python interpreter and it would work fine. You are encouraged to call PyType_Ready(), but this isn't required and it's often skipped.

This approach causes two problems.

1) The Python interpreter *must support* and *cannot change* the PyTypeObject structure, forever. Any meaningful change to the structure will break every extension. This has many consequences:

   a) Fields that are no longer used must be left in place, forever, as ignored placeholders if need be. Py3k cleaned up a lot of these, but it's already picked up a new one ("tp_compare" is now "tp_reserved").

   b) Internal implementation details of the type system must be public.

   c) The interpreter can't even use a different structure internally, because extensions are free to pass in objects using PyTypeObjects the interpreter has never seen before.

2) As a programming interface this lacks a certain gentility. It clearly *works*, but it requires programmers to copy and paste with a large structure mostly containing NULLs, which they must pick carefully through to change just a few fields.

THE SOLUTION

My patch creates a new function-based extension type definition API. You create a type by calling PyType_New(), then call various accessor functions on the type (PyType_SetString and the like), and when your type has been completely populated you must call PyType_Activate() to enable it for use.

With this API available, extension authors no longer need to directly see the innards of the PyTypeObject structure. Well, most of the fields anyway. There are a few shortcut macros in CPython that need to continue working for performance reasons, so the "tp_flags" and "tp_dealloc" fields need to remain publicly visible.

One feature worth mentioning is that the API is type-safe. Many such APIs would have had one generic "PyType_SetPointer", taking an identifier for the field and a void * for its value, but this would have lost type safety. Another approach would have been to have one accessor per field ("PyType_SetAddFunction"), but this would have exploded the number of functions in the API. My API splits the difference: each distinct *type* has its own set of accessors ("PyType_GetSSizeT") which takes an identifier specifying which field you wish to get or set.

SIDE-EFFECTS OF THE API

The major change resulting from this API: all PyTypeObjects must now be *pointers* rather than static instances. For example, the external declaration of PyType_Type itself changes from this:

  PyAPI_DATA(PyTypeObject) PyType_Type;

to this:

  PyAPI_DATA(PyTypeObject *) PyType_Type;

This gives rise to the first headache caused by the API: type casts on type objects. It took me a day and a half to realize that this, from Modules/_weakref.c:

  PyModule_AddObject(m, "ref", (PyObject *) &_PyWeakref_RefType);

really needed to be this:

  PyModule_AddObject(m, "ref", (PyObject *) _PyWeakref_RefType);

Hopefully I've already found most of these in CPython itself, but this sort of code surely lurks in extensions yet to be touched.

(Pro-tip: if you're working with this patch, and you see a crash, and gdb shows you something like this at the top of the stack:

  #0 0x081056d8 in visit_decref (op=0x8247aa0, data=0x0)
      at Modules/gcmodule.c:323
  323   if (PyObject_IS_GC(op)) {

your problem is an errant &, likely on a type object you're passing in to the interpreter. Think--what did you touch recently? Or debug it by salting your code with calls to collect(NUM_GENERATIONS-1).)

Another irksome side-effect of the API: because of "tp_flags" and "tp_dealloc", I now have two declarations of PyTypeObject. There's the externally-visible one in Include/object.h, which lets external parties see "tp_dealloc" and "tp_flags". Then there's the internal one in Objects/typeprivate.h which is the real structure. Since declaring a type twice is a no-no, the external one is gated on

  #ifndef PY_TYPEPRIVATE

If you're a normal Python extension programmer, you'd include Python.h as normal:

  #include "Python.h"

Python implementation files that need to see the real PyTypeObject structure now look like this:

  #define PY_TYPEPRIVATE
  #include "Python.h"
  #include "../Objects/typeprivate.h"

Also, since the structure of
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
>> Since the serialization of the Unicode string is likely to use UTF-8,
>> and the string for such a file will include half surrogates, the
>> application may raise an exception when encoding the names for a
>> configuration file. These encoding exceptions will be as rare as the
>> unusual names (which the careful I18N aware developer has probably
>> eradicated from his system), and thus will appear late.
>
> There are trade-offs to any solution; if there was a solution without
> trade-offs, it would be implemented already.
>
> The Python UTF-8 codec will happily encode half-surrogates; people argue
> that it is a bug that it does so, however, it would help in this
> specific case.

Can we use this encoding scheme for writing into files as well? We've turned the filename with undecodable bytes into a string with half surrogates. Putting that string into a file has to turn them into bytes at some level. Can we use the python-escape error handler to achieve that somehow?

-Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 14:37, Thomas Breuel wrote:
| But the biggest problem with the proposal is that it isn't needed: if you
| want to be able to turn arbitrary byte sequences into unicode strings and
| back, just set your encoding to iso8859-15. That already works and it
| doesn't require any changes.

No it doesn't. It does transcode without throwing exceptions. On POSIX. (On Windows? I doubt it - windows isn't using an 8-bit scheme. I believe.) But it utterly destroys any hope of working in any other locale nicely.

The PEP lets you work losslessly in other locales. It _may_ require some app care for particular very weird strings that don't come from the filesystem, but as far as I can see only in circumstances where such care would be needed anyway i.e. you've got to do special stuff for weirdness in the first place. Weird == "ill-formed unicode string" here.

Cheers,
-- 
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

I just kept it wide-open thinking it would correct itself. Then I ran out of talent. - C. Fittipaldi
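[A quick illustration of the trade-off Cameron describes, assuming a UTF-8 file name viewed through an iso8859-15 decoder: decoding never fails and round-trips, but any non-ASCII name turns into mojibake.]

    raw = 'café'.encode('utf-8')     # b'caf\xc3\xa9' as stored on disk

    raw.decode('iso8859-15')         # 'cafÃ©' - lossless, but unreadable
    assert raw.decode('iso8859-15').encode('iso8859-15') == raw  # round-trips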
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.

Unless I'm missing something, one of these is type str, and the other is type bytes, so no ambiguity.

--David
Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces
On 28Apr2009 11:49, Antoine Pitrou wrote:
| Paul Moore gmail.com> writes:
| >
| > I've yet to hear anyone claim that they would have an actual problem
| > with a specific piece of code they have written.
|
| Yep, that's the problem. Lots of theoretical problems noone has ever
| encountered brought up against a PEP which resolves some actual problems
| people encounter on a regular basis.
|
| For the record, I'm +1 on the PEP being accepted and implemented as soon
| as possible (preferably before 3.1).

I am also +1 on this.

I would like utility functions to perform:
  os-bytes->funny-encoded
  funny-encoded->os-bytes
or explicit example code snippets for same in the PEP text.

-- 
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

This person is currently undergoing electric shock therapy at Agnews Developmental Center in San Jose, California. All his opinions are static, please ignore him. Thank you, Nurse Ratched - the sig quote of Bob "Another beer, please" Christ
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Zooko O'Whielacronx wrote:
> On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote:
>> If you switch to iso8859-15 only in the presence of undecodable UTF-8,
>> then you have the same round-trip problem as the PEP: both b'\xff' and
>> b'\xc3\xbf' will be converted to u'\u00ff' without a way to
>> unambiguously recover the original file name.
>
> Why do you say that? It seems to work as I expected here:
>
> >>> '\xff'.decode('iso-8859-15')
> u'\xff'
> >>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'
> >>> '\xff'.decode('cp1252')
> u'\xff'
> >>> '\xc3\xbf'.decode('cp1252')
> u'\xc3\xbf'

You're not showing that this is a fallback path. What won't work is first trying a local encoding (in the following example, utf-8) and then if that doesn't work, trying a one-byte encoding like iso8859-15:

  try:
      file1 = '\xff'.decode('utf-8')
  except UnicodeDecodeError:
      file1 = '\xff'.decode('iso-8859-15')
  print repr(file1)

  try:
      file2 = '\xc3\xbf'.decode('utf-8')
  except UnicodeDecodeError:
      file2 = '\xc3\xbf'.decode('iso-8859-15')
  print repr(file2)

That prints:

  u'\xff'
  u'\xff'

The two encodings can map different bytes to the same unicode code point so you can't do this type of thing without recording what encoding was used in the translation.

-Toshio
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB:
> Glenn Linderman wrote:
>> On approximately 4/28/2009 11:55 AM, came the following characters from
>> the keyboard of MRAB:
>>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>>> decodable.
>>
>> UTF-8 is only mentioned in the sense of having special handling for
>> re-encoding; all the other locales/encodings are implicit. But I also
>> went down that path to some extent.
>>
>>> But if you're talking about using it with other encodings, eg
>>> shift-jisx0213, then I'd suggest the following:
>>>
>>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>>> half surrogates U+DC00 to U+DCFF.
>>
>> This makes 256 different escape codes.
>
> Speaking personally, I won't call them 'escape codes'. I'd use the term
> 'escape code' to mean a character that changes the interpretation of the
> next character(s).

OK, I won't be offended if you don't call them 'escape codes'. :) But what else to call them? My use of that term is a bit backwards, perhaps... what happens is that because these 256 half surrogates are used to decode otherwise undecodable bytes, they themselves must be "escaped" or translated into something different, when they appear in the byte sequence. The process described reserves a set of codepoints for use, and requires that that same set of codepoints be translated using a similar mechanism to avoid their untranslated appearance in the resulting str. Escape codes have the same sort of characteristic... by replacing their normal use for some other use, they must themselves have a replacement. Anyway, I think we are communicating successfully.

>>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>>> are treated as though they are undecodable bytes.
>>
>> This provides escaping for the 256 different escape codes, which is
>> lacking from the PEP.
>>
>>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>>> are encoded to bytes 0x00 to 0xFF.
>>
>> This reverses the escaping.
>>
>>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>>> be produced by decoding raise an exception.
>>
>> This is confusing. Did you mean "excluding" instead of "including"?
>
> Perhaps I should've said "Any codepoint which can't be produced by
> decoding should raise an exception".

Yes, your rephrasing is clearer, regarding your intention.

> For example, decoding with UTF-8b will never produce U+DC00, therefore
> attempting to encode U+DC00 should raise an exception and not produce
> 0x00.

Decoding with UTF-8b might never produce U+DC00, but then again, it won't handle the random byte string, either.

>>> I think I've covered all the possibilities. :-)
>>
>> You might have. Seems like there could be a simpler scheme, though...
>>
>> 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or
>> pretty much any defined Unicode codepoint outside the range U+0100 to
>> U+01FF (see rule 3 for why). Only one escape codepoint is needed, this
>> is easier for humans to comprehend.
>>
>> 2. When the escape codepoint is decoded from the byte stream for a bytes
>> interface or found in a str on the str interface, double it.
>>
>> 3. When an undecodable byte 0xPQ is found, decode to the escape
>> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>>
>> 4. When encoding, a sequence of two escape codepoints would be encoded
>> as one escape codepoint, and a sequence of the escape codepoint followed
>> by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not
>> followed by the escape codepoint, or by a codepoint in the range U+0100
>> to U+01FF would raise an exception.
>>
>> 5. Provide functions that will perform the same decoding and encoding as
>> would be done by the system calls, for both bytes and str interfaces.
>>
>> This differs from my previous proposal in three ways:
>>
>> A. Doesn't put a marker at the beginning of the string (which I said
>> wasn't necessary even then).
>>
>> B. Allows for a choice of escape codepoint, the previous proposal
>> suggested a specific one. But the final solution will only have a single
>> one, not a user choice, but an implementation choice.
>>
>> C. Uses the range U+0100 to U+01FF for the escape codes, rather than
>> U+0000 to U+00FF. This avoids introducing the NULL character and escape
>> characters into the decoded str representation, yet still uses
>> characters for which glyphs are commonly available, are non-combining,
>> and are easily distinguishable one from another.
>>
>> Rationale: The use of codepoints with visible glyphs makes the escaped
>> string friendlier to display systems, and to people. I still recommend
>> using U+003F as the escape codepoint, but certainly one with a typically
>> visible glyph available. This avoids what I consider to be an annoyance
>> with the PEP, that the codepoints used are not ones that are easily
>> displayed, so undecodable names could easily result in long strings of
>> indistinguishable substitution characters.
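[A minimal sketch of the simpler scheme described above, assuming '?' (U+003F) as the escape codepoint and an ASCII-compatible locale encoding such as UTF-8. It leans on the surrogateescape handler - the name under which the PEP's python-escape later shipped - as the intermediate step, so only bytes 0x80-0xFF appear escaped, as U+0180-U+01FF. The function names are hypothetical.]

    ESC = '?'  # U+003F, the suggested escape codepoint (any visible glyph works)

    def glenn_decode(data: bytes, encoding: str = 'utf-8') -> str:
        """bytes -> displayable funny-encoded str (rules 1-3 above)."""
        # surrogateescape maps each undecodable byte 0xPQ to U+DCPQ first.
        s = data.decode(encoding, 'surrogateescape')
        # Rule 2: double every escape codepoint already present in the text.
        s = s.replace(ESC, ESC + ESC)
        # Rule 3: rewrite each escaped byte 0xPQ as ESC followed by U+01PQ.
        return ''.join(
            ESC + chr(0x100 + ord(c) - 0xDC00) if '\udc80' <= c <= '\udcff' else c
            for c in s)

    def glenn_encode(name: str, encoding: str = 'utf-8') -> bytes:
        """displayable funny-encoded str -> bytes (rule 4 above)."""
        out, i = [], 0
        while i < len(name):
            c = name[i]
            if c == ESC:
                i += 1
                if i == len(name):
                    raise ValueError('truncated escape sequence')
                nxt = name[i]
                if nxt == ESC:                      # ESC ESC -> ESC
                    out.append(ESC)
                elif 0x180 <= ord(nxt) <= 0x1FF:    # ESC U+01PQ -> byte 0xPQ
                    # (bytes 0x00-0x7F are always decodable here, so only
                    # U+0180-U+01FF ever occur)
                    out.append(chr(0xDC00 + ord(nxt) - 0x100))
                else:
                    raise ValueError('invalid escape sequence')
            else:
                out.append(c)
            i += 1
        return ''.join(out).encode(encoding, 'surrogateescape')

    # An undecodable byte becomes the visible pair '?\u01ff' instead of a
    # lone surrogate, and a literal '?' in the name survives as '??':
    assert glenn_decode(b'a?b\xff') == 'a??b?\u01ff'
    assert glenn_encode('a??b?\u01ff') == b'a?b\xff'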
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving).

On 27Apr2009 23:52, Glenn Linderman wrote:
> On approximately 4/27/2009 7:11 PM, came the following characters from
> the keyboard of Cameron Simpson:
[...]
>> There may be puns. So what? Use the right strings for the right purpose
>> and all will be well.
>>
>> I think what is missing here, and missing from Martin's PEP, is some
>> utility functions for the os.* namespace.
>>
>> PROPOSAL: add to the PEP the following functions:
>>
>>   os.fsdecode(bytes) -> funny-encoded Unicode
>> This is what os.listdir() does to produce the strings it hands out.
>>   os.fsencode(funny-string) -> bytes
>> This is what open(filename,..) does to turn the filename into bytes
>> for the POSIX open.
>>   os.pathencode(your-string) -> funny-encoded-Unicode
>> This is what you must do to a de novo string to turn it into a
>> string suitable for use by open. Importantly, for most strings not
>> hand crafted to have weird sequences in them, it is a no-op. But it
>> will recode your puns for survival.
[...]
>>> So assume a non-decodable sequence in a name. That puts us into
>>> Martin's funny-decode scheme. His funny-decode scheme produces a
>>> bare string, indistinguishable from a bare string that would be
>>> produced by a str API that happens to contain that same sequence.
>>> Data puns.
>>
>> See my proposal above. Does it address your concerns? A program still
>> must know the provenance of the string, and _if_ you're working with
>> non-decodable sequences in names then you should transmute them into
>> the funny encoding using the os.pathencode() function described above.
>>
>> In this way the punning issue can be avoided.
>> _Lacking_ such a function, your punning concern is valid.
>
> Seems like one would also desire os.pathdecode to do the reverse.

Yes.

> And also versions that take or produce bytes from funny-encoded strings.

Isn't that the first two functions above?

> Then, if programs were re-coded to perform these transformations on what
> you call de novo strings, then the scheme would work.
>
> But I think a large part of the incentive for the PEP is to try to
> invent a scheme that intentionally allows for the puns, so that programs
> do not need to be recoded in this manner, and yet still work. I don't
> think such a scheme exists.

I agree no such scheme exists. I don't think it can, just using strings.

But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such.

The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.

I consider the fact that well-formed Unicode -> funny-encoded is a no-op to be an enormous feature of Martin's scheme. Unless I'm missing something, there _are_no_puns_ between funny-encoded strings and well formed unicode strings.

>> I suppose if your program carefully constructs a unicode string riddled
>> with half-surrogates etc and imagines something specific should happen
>> to them on the way to being POSIX bytes then you might have a problem...
>
> Right. Or someone else's program does that.

I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a coffee, reading section 3.9 (Unicode Encoding Forms). I now do not believe your scenario makes sense.

Someone can construct a Python3 string containing code points that includes surrogates. Granted. However such a string is not meaningful because it is not well-formed (D85). It's ill-formed (D84). It is not sane to expect it to translate into a POSIX byte sequence, be it UTF-8 or anything else, unless it is accompanied by some kind of explicit mapping provided by the programmer. Absent that mapping, it's nonsense in much the same way that a non-decodable UTF-8 byte sequence is nonsense. For example, Martin's funny-encoding is such an explicit mapping.

> I only want to use Unicode file names. But if those other file names
> exist, I want to be able to access them, and not accidentally get a
> different file.

But those other
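[A small check of the no-op property Cameron highlights, assuming UTF-8 as the locale encoding and surrogateescape (the name the PEP's python-escape later shipped under) as the funny-encoding.]

    name = 'abc-café'   # well-formed Unicode: scalar values only

    # For any well-formed string, the funny encoding changes nothing:
    assert name.encode('utf-8', 'surrogateescape') == name.encode('utf-8')

    # Only funny-decoded (ill-formed) strings take the escape path:
    odd = b'abc\xff'.decode('utf-8', 'surrogateescape')   # 'abc\udcff'
    assert odd.encode('utf-8', 'surrogateescape') == b'abc\xff'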
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis:
> Glenn Linderman wrote:
>> On approximately 4/28/2009 1:25 PM, came the following characters from
>> the keyboard of Martin v. Löwis:
>>>> The UTF-8b representation suffers from the same potential ambiguities
>>>> as the PUA characters...
>>>
>>> Not at all the same ambiguities. Here, again, the two choices:
>>>
>>> A. use PUA characters to represent undecodable bytes, in particular for
>>> UTF-8 (the PEP actually never proposed this to happen).
>>> This introduces an ambiguity: two different files in the same
>>> directory may decode to the same string name, if one has the PUA
>>> character, and the other has a non-decodable byte that gets decoded
>>> to the same PUA character.
>>>
>>> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
>>> The same ambiguity does *NOT* exist. If a file on disk already
>>> contains an invalid surrogate code in its file name, then the UTF-8b
>>> decoder will recognize this as invalid, and decode it byte-for-byte,
>>> into three surrogate codes. Hence, the file names that are different
>>> on disk are also different in memory. No ambiguity.
>>
>> C. File on disk with the invalid surrogate code, accessed via the str
>> interface, no decoding happens, matches in memory the file on disk with
>> the byte that translates to the same surrogate, accessed via the bytes
>> interface. Ambiguity.
>
> Is that an alternative to A and B?

I guess it is an adjunct to case B, the current PEP.

It is what happens when using the PEP on a system that provides both bytes and str interfaces, and both get used.

On a Windows system, perhaps the ambiguous case would be the use of the str API and bytes APIs producing different memory names for the same file that contains a (Unicode-illegal) half surrogate. The half-surrogate would seem to get decoded to 3 half surrogates if accessed via the bytes interface, but only one via the str interface. The version with 3 half surrogates could match another name that actually contains 3 half surrogates, that is accessed via the str interface.

I can't actually tell by reading the PEP whether it affects Windows bytes interfaces or is only implemented on POSIX, so that POSIX has a str interface. If it is only implemented on POSIX, then the current scheme (now escaping the hundreds of escape codes) could work, within a single platform... but it would still suffer from displaying garbage (sequences of replacement characters) in file listings displayed or printed. There is no way, once the string is adjusted to contain replacement characters for display, to distinguish one file name from another, if they are identical except for a same-length sequence of different undecodable bytes.

The concept of a function that allows the same decoding and encoding process for 3rd party interfaces is still missing from the PEP; implementation of the PEP would require that all interfaces to 3rd party software that accept file names would have to be transcoded by the interface layer. Or else such software would have to use the bytes interfaces directly, and if they do, there is no need for the PEP.

So I see the PEP as a partial solution to a limited problem, that on the one hand potentially produces indistinguishable sequences of replacement characters in filenames, rather than the mojibake (which is at least distinguishable), and on the other hand, doesn't help software that also uses 3rd party libraries to avoid the use of bytes APIs for accessing file names.

There are other encodings that produce more distinguishable mojibake, and would work in the same situations as the PEP.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383 (again)
Thomas Breuel gmail.com> writes:
>
> And, in fact, Windows Vista happily creates files with malformed UTF-16
> encodings, and os.listdir() happily returns them.

The PEP won't change that, so what's the problem exactly?

> Under your proposal, passing the output from a correctly implemented
> file system or other OS function to a correctly written library using
> unicode strings may crash Python.

That's a very dishonest formulation. It cannot crash Python; it can only crash hypothetical third-party programs or libraries with deficient error checking and unreasonable assumptions about input data.

(and, of course, you haven't even proven those programs or libraries exist)

Antoine.
Re: [Python-Dev] PEP 383 (again)
> On Windows, the Wide APIs are already used throughout the code base,
> e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the
> specific API for a specific functionality, please read the source code.
[...]
> No, I don't assume that. I assume that all functions are strictly
> available in a Wide character version, and have verified that they are.

The wide APIs use UTF-16. UTF-16 suffers from the same problem as UTF-8: not all sequences of words are valid UTF-16 sequences. In particular, sequences containing isolated surrogates are not well-formed according to the Unicode standard. Therefore, the existence of a wide character API function does not guarantee that the wide character strings it returns can be converted into valid unicode strings. And, in fact, Windows Vista happily creates files with malformed UTF-16 encodings, and os.listdir() happily returns them.

> If you can crash Python that way, nothing gets worse by this PEP - you
> can then *already* crash Python in that way.

Yes, but AFAIK, Python does not currently have functions that, as part of correct usage and normal operation, are intended to generate malformed unicode strings. Under your proposal, passing the output from a correctly implemented file system or other OS function to a correctly written library using unicode strings may crash Python. In order to avoid that, every library that's built into Python would have to be checked and updated to deal with both the Unicode standard and your extension to it.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities
>>> as the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>> UTF-8 (the PEP actually never proposed this to happen).
>> This introduces an ambiguity: two different files in the same
>> directory may decode to the same string name, if one has the PUA
>> character, and the other has a non-decodable byte that gets decoded
>> to the same PUA character.
>>
>> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
>> The same ambiguity does *NOT* exist. If a file on disk already
>> contains an invalid surrogate code in its file name, then the UTF-8b
>> decoder will recognize this as invalid, and decode it byte-for-byte,
>> into three surrogate codes. Hence, the file names that are different
>> on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.

Is that an alternative to A and B?

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman wrote:
> On approximately 4/28/2009 11:55 AM, came the following characters from
> the keyboard of MRAB:
>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>> decodable.
>
> UTF-8 is only mentioned in the sense of having special handling for
> re-encoding; all the other locales/encodings are implicit. But I also
> went down that path to some extent.
>
>> But if you're talking about using it with other encodings, eg
>> shift-jisx0213, then I'd suggest the following:
>>
>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>> half surrogates U+DC00 to U+DCFF.
>
> This makes 256 different escape codes.

Speaking personally, I won't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s).

>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>> are treated as though they are undecodable bytes.
>
> This provides escaping for the 256 different escape codes, which is
> lacking from the PEP.
>
>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>> are encoded to bytes 0x00 to 0xFF.
>
> This reverses the escaping.
>
>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>> be produced by decoding raise an exception.
>
> This is confusing. Did you mean "excluding" instead of "including"?

Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception".

For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00.

>> I think I've covered all the possibilities. :-)
>
> You might have. Seems like there could be a simpler scheme, though...
>
> 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or
> pretty much any defined Unicode codepoint outside the range U+0100 to
> U+01FF (see rule 3 for why). Only one escape codepoint is needed, this
> is easier for humans to comprehend.
>
> 2. When the escape codepoint is decoded from the byte stream for a bytes
> interface or found in a str on the str interface, double it.
>
> 3. When an undecodable byte 0xPQ is found, decode to the escape
> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>
> 4. When encoding, a sequence of two escape codepoints would be encoded
> as one escape codepoint, and a sequence of the escape codepoint followed
> by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not
> followed by the escape codepoint, or by a codepoint in the range U+0100
> to U+01FF would raise an exception.
>
> 5. Provide functions that will perform the same decoding and encoding as
> would be done by the system calls, for both bytes and str interfaces.
>
> This differs from my previous proposal in three ways:
>
> A. Doesn't put a marker at the beginning of the string (which I said
> wasn't necessary even then).
>
> B. Allows for a choice of escape codepoint, the previous proposal
> suggested a specific one. But the final solution will only have a single
> one, not a user choice, but an implementation choice.
>
> C. Uses the range U+0100 to U+01FF for the escape codes, rather than
> U+0000 to U+00FF. This avoids introducing the NULL character and escape
> characters into the decoded str representation, yet still uses
> characters for which glyphs are commonly available, are non-combining,
> and are easily distinguishable one from another.
>
> Rationale: The use of codepoints with visible glyphs makes the escaped
> string friendlier to display systems, and to people. I still recommend
> using U+003F as the escape codepoint, but certainly one with a typically
> visible glyph available. This avoids what I consider to be an annoyance
> with the PEP, that the codepoints used are not ones that are easily
> displayed, so undecodable names could easily result in long strings of
> indistinguishable substitution characters.

Perhaps the escape character should be U+005C. ;-)

> It, like MRAB's proposal, also avoids data puns, which is a major
> problem with the PEP. I consider this proposal to be easier to
> understand than MRAB's proposal, or the PEP, because of the single
> escape codepoint and the use of visible characters. This proposal, like
> my initial one, also decodes and encodes (just the escape codes) values
> on the str interfaces. This is necessary to avoid data puns on systems
> that provide both types of interfaces. This proposal could be used for
> programs that use str values, and easily migrates to a solution that
> provides an object that provides an abstraction for system interfaces
> that have two forms.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Others have made this suggestion, and it is helpful to the PEP, but not
> sufficient. As implemented as an error handler, I'm not sure that the
> b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8
> decoder is happy with it. Which, in my testing, it is.

Rest assured that the utf-8b codec will work the way it is specified.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis:
>> The UTF-8b representation suffers from the same potential ambiguities
>> as the PUA characters...
>
> Not at all the same ambiguities. Here, again, the two choices:
>
> A. use PUA characters to represent undecodable bytes, in particular for
> UTF-8 (the PEP actually never proposed this to happen).
> This introduces an ambiguity: two different files in the same
> directory may decode to the same string name, if one has the PUA
> character, and the other has a non-decodable byte that gets decoded
> to the same PUA character.
>
> B. use UTF-8b, representing the bytes with ill-formed surrogate codes.
> The same ambiguity does *NOT* exist. If a file on disk already
> contains an invalid surrogate code in its file name, then the UTF-8b
> decoder will recognize this as invalid, and decode it byte-for-byte,
> into three surrogate codes. Hence, the file names that are different
> on disk are also different in memory. No ambiguity.

C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico:
> 2009/4/28 Glenn Linderman:
>> The switch from PUA to half-surrogates does not resolve the issues with
>> the encoding not being a 1-to-1 mapping, though. The very fact that you
>> think you can get away with use of lone surrogates means that other
>> people might, accidentally or intentionally, also use lone surrogates
>> for some other purpose. Even in file names.
>
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not
> a valid Unicode character (not a character at all, really) and the only
> way you can put this in a POSIX filename is if you use a very lenient
> UTF-8 encoder that gives you b'\xed\xb3\xbf'.

Wrong. An 8859-1 locale allows any byte sequence to be placed into a POSIX filename. And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF.

> Since this byte sequence doesn't represent a valid character when
> decoded with UTF-8, it should simply be considered an invalid UTF-8
> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*
> '\udcff'). Martin: maybe the PEP should say this explicitly?
>
> Note that the round-trip works without ambiguities between '\udcff' in
> the filename:
>
>   b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'
>
> and b'\xff' in the filename, decoded by Python to '\udcff':
>
>   b'\xff' -> '\udcff' -> b'\xff'

Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is.

-- 
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
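[For reference, this is how the codec was eventually specified and implemented: the handler shipped as surrogateescape, and the strict UTF-8 codec was tightened to reject encoded surrogates, so the escape path does trigger. Checkable in any current Python 3:]

    # b'\xed\xb3\xbf' would be the (ill-formed) UTF-8 encoding of U+DCFF.
    # The strict decoder rejects it, so the handler escapes each byte:
    assert b'\xed\xb3\xbf'.decode('utf-8', 'surrogateescape') == '\udced\udcb3\udcbf'
    assert b'\xff'.decode('utf-8', 'surrogateescape') == '\udcff'

    # Both round-trip without ambiguity:
    assert '\udced\udcb3\udcbf'.encode('utf-8', 'surrogateescape') == b'\xed\xb3\xbf'
    assert '\udcff'.encode('utf-8', 'surrogateescape') == b'\xff'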
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent. But if you're talking about using it with other encodings, e.g. shift-jisx0213, then I'd suggest the following:
1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. This makes 256 different escape codes.
2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. This provides escaping for the 256 different escape codes, which is lacking from the PEP.
3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. This reverses the escaping.
4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. This is confusing. Did you mean "excluding" instead of "including"? I think I've covered all the possibilities. :-) You might have. Seems like there could be a simpler scheme, though...
1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, which is easier for humans to comprehend.
2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it.
3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF, would raise an exception.
5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces.
This differs from my previous proposal in three ways:
A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then).
B. Allows for a choice of escape codepoint; the previous proposal suggested a specific one. But the final solution will only have a single one: not a user choice, but an implementation choice.
C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another.
Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. 
I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters. This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
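A rough sketch of Glenn's five rules, restricted to single-byte locales so the decode loop stays simple (ESC, the helper names, and the use of the ascii codec in the check at the end are illustrative assumptions, not part of any proposed API):

ESC = '?'   # U+003F, Glenn's recommended escape codepoint

def decode_name(raw, encoding):
    # Bytes -> str with escaping (rules 1-3).
    out = []
    for b in raw:
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            out.append(ESC + chr(0x0100 + b))        # rule 3: ESC then U+01PQ
            continue
        out.append(ESC + ESC if ch == ESC else ch)   # rule 2: double a real ESC
    return ''.join(out)

def encode_name(s, encoding):
    # str -> bytes, reversing the escaping (rule 4).
    out, i = bytearray(), 0
    while i < len(s):
        ch = s[i]
        if ch != ESC:
            out += ch.encode(encoding)
            i += 1
        elif i + 1 < len(s) and s[i + 1] == ESC:     # doubled ESC -> literal ESC
            out += ESC.encode(encoding)
            i += 2
        elif i + 1 < len(s) and 0x0100 <= ord(s[i + 1]) <= 0x01FF:
            out.append(ord(s[i + 1]) - 0x0100)       # ESC, U+01PQ -> byte 0xPQ
            i += 2
        else:
            raise ValueError('ill-formed escape at index %d' % i)
    return bytes(out)

# Rule 5 in miniature: an undecodable byte and a literal '?' both round-trip.
assert encode_name(decode_name(b'ab\xff?', 'ascii'), 'ascii') == b'ab\xff?'

Note the cost Glenn accepts in exchange for visible glyphs: unlike the PEP's half surrogates, '?' and U+0100-U+01FF are ordinary characters, so every pre-existing '?' in every name has to be escaped as well.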
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> The UTF-8b representation suffers from the same potential ambiguities as > the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen). This introduces an ambiguity: two different files in the same directory may decode to the same string name, if one has the PUA character, and the other has a non-decodable byte that gets decoded to the same PUA character. B. use UTF-8b, representing the bytes with ill-formed surrogate codes. The same ambiguity does *NOT* exist. If a file on disk already contains an invalid surrogate code in its file name, then the UTF-8b decoder will recognize this as invalid, and decode it byte-for-byte into three surrogate codes. Hence, the file names that are different on disk are also different in memory. No ambiguity. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
MRAB wrote: > Martin v. Löwis wrote: >>> Furthermore, I don't believe that PEP 383 works consistently on Windows, >> >> What makes you say that? PEP 383 will have no effect on Windows, >> compared to the status quo, whatsoever. >> > You could argue that if Windows is actually returning UTF-16 with half > surrogates that they should be altered to conform to what UTF-8 would > have returned. Perhaps - but this is not what the PEP specifies (and intentionally so). Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> Your proposal says that utf-8b would be used for file systems, but then > you also say that it might be used for command line arguments and > environment variables. So, which specific APIs will it be used with on > Windows and on POSIX systems? On Windows, the Wide APIs are already used throughout the code base, e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the specific API for a specific functionality, please read the source code. > Or will utf-8b simply not be available > on Windows at all? It will be available, but it won't be used automatically for anything. > What happens if I create a Python version of tar, > utf-8b strings slip in there, and I try to use them on Windows? No need to create it - the tarfile module is already there. By "in there", do you mean on the file system, or in the tarfile? > You also assume that all Windows file system functions strictly conform > to UTF-16 in practice (not just on paper). Have you verified that? No, I don't assume that. I assume that all functions are strictly available in a Wide character version, and have verified that they are. > What's the situation on Windows CE? I can't see how this question is relevant to the PEP. The PEP says this: # On Windows, Python uses the wide character APIs to access # character-oriented APIs, allowing direct conversion of the # environmental data to Python str objects. This is what it already does, and this is what it will continue to do. > Another question on Linux: what happens when I decode a file system path > with utf-8b and then pass the resulting unicode string to Gnome? To > Qt? You probably get moji-bake, or an error, I didn't try. > To windows.forms? To Java? How do you do that, on Linux? > To a unicode regular expression library? You mean, SRE? SRE will match the code points as individual characters, class Cs. You should have been able to find out that for yourself. > To wprintf? Depends on the wprintf implementation. > AFAIK, the behavior of most libraries is > undefined for the kinds of unicode strings you construct, and it may be > undefined in a bad way (crash, buffer overflow, whatever). Indeed so. This is intentional. If you can crash Python that way, nothing gets worse by this PEP - you can then *already* crash Python in that way. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
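Martin's SRE remark is easy to verify: lone surrogates carry Unicode category Cs, and SRE simply treats them as one more code point (session run on a current CPython; 2009-era builds are assumed to behave the same):

>>> import unicodedata, re
>>> unicodedata.category('\udcff')
'Cs'
>>> re.findall('.', 'a\udcffb')
['a', '\udcff', 'b']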
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Apr 28, 2009, at 13:01 PM, Thomas Breuel wrote: (2) Should the default UTF-8 encoder for file system operations be allowed to generate illegal byte sequences? I think that's a definite no; if I set the encoding for a device to UTF-8, I never want Python to try to write illegal UTF-8 strings to my device. ... If people really want the option of (3c), then I think encoders related to the file system should by default reject those strings as illegal because the potential problems from writing them are just too serious. Printing routines and UI routines could display them without error (but some clear indication), of course. For what it is worth, sometimes we have to write bytes to a POSIX filesystem even though those bytes are not the encoding of any string in the filesystem's "alleged encoding". The reason is that it is common for there to be filenames which are not the encodings of anything in the filesystem's alleged encoding, and the user expects my tool (Tahoe-LAFS [1]) to copy that name to a distributed storage grid and then copy it back unchanged. Even though, I re-iterate, that name is *not* a valid encoding of anything in the current encoding. This doesn't argue that this behavior has to be the *default* behavior, but it is sometimes necessary. It's too bad that POSIX is so far behind Mac OS X in this respect. (Also so far behind Windows, but I use Mac as the example to show how it is possible to build a better system on top of POSIX.) Hopefully David Wheeler's proposals to tighten the requirements in Linux filesystems will catch on: [2]. Regards, Zooko [1] http://allmydata.org [2] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 20:45, "Martin v. Löwis" wrote: > > Furthermore, I don't believe that PEP 383 works consistently on Windows, > > What makes you say that? PEP 383 will have no effect on Windows, > compared to the status quo, whatsoever. > That's what you believe, but it's not clear to me that that follows from your proposal. Your proposal says that utf-8b would be used for file systems, but then you also say that it might be used for command line arguments and environment variables. So, which specific APIs will it be used with on Windows and on POSIX systems? Or will utf-8b simply not be available on Windows at all? What happens if I create a Python version of tar, utf-8b strings slip in there, and I try to use them on Windows? You also assume that all Windows file system functions strictly conform to UTF-16 in practice (not just on paper). Have you verified that? It certainly isn't true across all versions of Windows (since NT originally used UCS-2). What's the situation on Windows CE? Another question on Linux: what happens when I decode a file system path with utf-8b and then pass the resulting unicode string to Gnome? To Qt? To windows.forms? To Java? To a unicode regular expression library? To wprintf? AFAIK, the behavior of most libraries is undefined for the kinds of unicode strings you construct, and it may be undefined in a bad way (crash, buffer overflow, whatever). Tom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents). A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. 
On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify. Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to: If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified. iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline).
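A minimal sketch of the handler James describes; the registered name is invented, the encode leg needs a Python where an encode error handler may return bytes (a change PEP 383 itself introduces), and the pass-through of bytes 0x00-0x7F is James's extension, not something the PEP proposes:

import codecs

def _escape(exc):
    if isinstance(exc, UnicodeDecodeError):
        # Decode leg: 0x80-0xFF -> U+DC80-U+DCFF; 0x00-0x7F pass through
        # as U+0000-U+007F (James's rule for overlapping trail bytes).
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(0xDC00 + b) if b >= 0x80 else chr(b)
                       for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encode leg: only U+DC80-U+DCFF may turn back into raw bytes.
        chunk = exc.object[exc.start:exc.end]
        if all(0xDC80 <= ord(c) <= 0xDCFF for c in chunk):
            return bytes(ord(c) - 0xDC00 for c in chunk), exc.end
    raise exc

codecs.register_error('python-escape-sketch', _escape)

assert b'ab\xff'.decode('utf-8', 'python-escape-sketch') == 'ab\udcff'
assert 'ab\udcff'.encode('utf-8', 'python-escape-sketch') == b'ab\xff'

Keeping 0x00-0x7F (in particular 0x2F, '/') out of the surrogate mapping on the encode side is exactly the security point James and Martin settle above.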
Re: [Python-Dev] PEP 383 (again)
Martin v. Löwis wrote: Furthermore, I don't believe that PEP 383 works consistently on Windows, What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever. You could argue that if Windows is actually returning UTF-16 with half surrogates that they should be altered to conform to what UTF-8 would have returned. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodeable bytes are encountered? For what it is worth, what we have previously planned to do for the Tahoe project is the second of these -- decode using some 1-byte encoding such as iso-8859-1, iso-8859-15, or windows-1252 only in the case that attempting to decode the bytes using the local alleged encoding failed. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say that? It seems to work as I expected here:
>>> '\xff'.decode('iso-8859-15')
u'\xff'
>>> '\xc3\xbf'.decode('iso-8859-15')
u'\xc3\xbf'
>>> '\xff'.decode('cp1252')
u'\xff'
>>> '\xc3\xbf'.decode('cp1252')
u'\xc3\xbf'
Regards, Zooko ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
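The collision Hrvoje means only appears when the two names are decoded by *different* codecs, which is what the fall-back strategy does; Zooko's session decodes both names with the same 1-byte codec and so never exercises the fallback. With the fallback in play (Python 2 syntax, to match the session above):

>>> '\xc3\xbf'.decode('utf-8')       # valid UTF-8, fallback never reached
u'\xff'
>>> '\xff'.decode('iso-8859-15')     # undecodable as UTF-8, falls back
u'\xff'

Two different byte strings, one resulting text u'\xff': handing that text back to open() can reach only one of the two files.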
[Python-Dev] a suggestion ... Re: PEP 383 (again)
I think we should break up this problem into several parts: (1) Should the default UTF-8 decoder fail if it gets an illegal byte sequence? It's probably OK for the default decoder to be lenient in some way (see below). (2) Should the default UTF-8 encoder for file system operations be allowed to generate illegal byte sequences? I think that's a definite no; if I set the encoding for a device to UTF-8, I never want Python to try to write illegal UTF-8 strings to my device. (3) What kind of representation should the UTF-8 decoder return for illegal inputs? There are actually several choices: (a) it could guess what the actual encoding is and use that, (b) it could return a valid unicode string that indicates the illegal characters but does not re-encode to the original byte sequence, or (c) it could return some kind of non-standard representation that encodes back into the original byte sequence. PEP 383 violated (2), and I think that's a bad thing. I think the best solution would be to use (3a) and fall back to (3b) if that doesn't work. If people try to write those strings, they will always get written as correctly encoded UTF-8 strings. If people really want the option of (3c), then I think encoders related to the file system should by default reject those strings as illegal because the potential problems from writing them are just too serious. Printing routines and UI routines could display them without error (but some clear indication), of course. There is yet another option, which is arguably the "right" one: make the results of os.listdir() subclasses of string that keep track of where they came from. If you write back to the same device, it just writes the same byte sequence. But if you write to other devices and the byte sequence is illegal according to its encoding, you get an error. Tom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
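Tom's last option can be sketched as a str subclass; every name here (PathString, raw, device, encode_for) is an invented illustration, not an existing API:

class PathString(str):
    # What a hypothetical origin-tracking os.listdir() might return.
    def __new__(cls, text, raw, device):
        self = super().__new__(cls, text)
        self.raw = raw          # exact byte sequence read from the device
        self.device = device    # identifies the filesystem it came from
        return self

    def encode_for(self, device, encoding):
        if device == self.device:
            return self.raw     # same device: write the original bytes back
        # Different device: the name must be legal in the target encoding,
        # so encode strictly and let an illegal name fail loudly here.
        return str(self).encode(encoding)

A copy from an ISO8859-15 disk to a UTF-8 disk would then raise in encode_for() instead of silently writing an illegal byte sequence.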
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. 
In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following: 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF. 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes. 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF. 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception. I think I've covered all the possibilities. :-) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/28/2009 10:00 AM, came the following characters from the keyboard of Martin v. Löwis: An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... perhaps slightly less likely in practice, due to the use of Unicode-illegal characters, but exactly the same theoretical likelihood in the space of Python-acceptable character codes. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> Furthermore, I don't believe that PEP 383 works consistently on Windows, What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> > However, it is "mission creep": Martin didn't volunteer to > write a PEP for it, he volunteered to write a PEP to solve the > "roundtrip the value of os.listdir()" problem. And he succeeded, up > to some minor details. Yes, it solves that problem. But that doesn't come without cost. Most importantly, now Python writes illegal UTF-8 strings even if the user chose a UTF-8 encoding. That means that illegal UTF-8 encodings can propagate anywhere, without warning. Furthermore, I don't believe that PEP 383 works consistently on Windows, and it causes programs to behave differently in unintuitive ways on Windows and Linux. I'll suggest an alternative in a separate message. Tom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. It would be actually U+DC2f that would turn into /. Yes, I meant to say DC2F, sorry for the confusion. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about...which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290 So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them. This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD". The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. 
In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. James ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> If the PEP depends on this being changed, it should be mentioned in the > PEP. The PEP says that the utf-8b codec decodes invalid bytes into low surrogates. I have now clarified that a strict definition of UTF-8 is assumed for utf-8b. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Since the serialization of the Unicode string is likely to use UTF-8, > and the string for such a file will include half surrogates, the > application may raise an exception when encoding the names for a > configuration file. These encoding exceptions will be as rare as the > unusual names (which the careful I18N aware developer has probably > eradicated from his system), and thus will appear late. There are trade-offs to any solution; if there was a solution without trade-offs, it would be implemented already. The Python UTF-8 codec will happily encode half-surrogates; people argue that it is a bug that it does so, however, it would help in this specific case. An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambiguities. So they clearly dislike one risk more than the other. UTF-8b is primarily meant as an in-memory representation. > Or say de/serialization succeeds. Since the resulting Unicode string > differs depending on the encoding (which is a good thing; it is > supposed to make most cases mostly readable), when the filesystem > encoding changes (say from legacy to UTF-8), the "name" changes, and > deserialized references to it become stale. That problem has nothing to do with the PEP. If the encoding changes, LRU entries may get stale even if there were no encoding errors at all. Suppose the old encoding was Latin-1, and the new encoding is KOI8-R, then all file names are decodable before and afterwards, yet the string representation changes. Applications that want to protect themselves against that happening need to store byte representations of the file names, not character representations. Depending on the configuration file format, that may or may not be possible. I find the case pretty artificial, though: if the locale encoding changes, all file names will look incorrect to the user, so he'll quickly switch back, or rename all the files. As an application supporting a LRU list, I would remove/hide all entries that don't correlate to existing files - after all, the user may have as well deleted the file in the LRU list. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
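One way to follow Martin's "store byte representations" advice even in a text-based configuration format; a sketch assuming a UTF-8 locale and the handler the PEP proposes ('surrogateescape' in released Pythons), with invented helper names:

import base64

def mru_entry(filename):
    # Persist the exact on-disk bytes, not the locale-dependent text.
    raw = filename.encode('utf-8', 'surrogateescape')
    return base64.b64encode(raw).decode('ascii')

def mru_filename(entry):
    # Re-decode with whatever codec the current locale dictates; UTF-8 here.
    return base64.b64decode(entry).decode('utf-8', 'surrogateescape')

The base64 leg keeps the configuration file valid text no matter which bytes the name contains, and a locale switch changes only how the recovered bytes are rendered, never which file they refer to.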
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is > not a valid Unicode character (not a character at all, really) and the > only way you can put this in a POSIX filename is if you use a very > lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. > > Since this byte sequence doesn't represent a valid character when > decoded with UTF-8, it should simply be considered an invalid UTF-8 > sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* > '\udcff'). > > Martin: maybe the PEP should say this explicitly? Sure, will do. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
> If we follow your approach, that ISO8859-15 string will get turned into > an escaped unicode string inside Python. If I understand your proposal > correctly, if it's a output file name and gets passed to Python's open > function, Python will then decode that string and end up with an > ISO8859-15 byte sequence, which it will write to disk literally, even if > the encoding for the system is UTF-8. That's the wrong thing to do. I don't think anything can, or should be, done about that. If you had byte-oriented interfaces (as you do in 2.x), exactly the same thing will happen: the name of the file will be the very same byte sequence as the one passed on the command line. Most Unix users here agree that this is the right thing to happen. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
Thomas Breuel writes: > PEP 383 doesn't make it any easier; it just turns one set of > problems into another. That's false. There is an interesting class of problems of the form "get a list of names from the OS and allow the user to select from it, and retrieve corresponding content." People are *very* often able to decode complete gibberish, as long as it's the only gibberish in a list. Ditto partial gibberish. In that case, PEP 383 allows the content retrieval operation to complete. There are probably other problems that this PEP solves. > Actually, it makes it worse, Again, it gives you different problems, which may be better and may be worse according to the user's requirements. Currently, you often get an exception, and running the program again is no help. The user must clean up the list to make progress. This may or may not be within the user's capacity (eg, read-only media). > since any problems that show up now show up far from the source of > the problem, and since it can lead to security problems and/or data > loss. Yes. This is a point I have been at pains to argue elsewhere in this thread. However, it is "mission creep": Martin didn't volunteer to write a PEP for it, he volunteered to write a PEP to solve the "roundtrip the value of os.listdir()" problem. And he succeeded, up to some minor details. > The problem may well be with the program using the wrong encodings or > incorrectly ignoring encoding information. Furthermore, even if it is user > error, the program needs to validate its inputs and put up a meaningful > error message, not mangle the disk. To detect such program bugs, it's > important that when Python detects an incorrect encoding that it doesn't > quietly continue with an incorrect string. I agree. Guido, however, responded that "Practicality beats purity" to a similar point in the PEP 263 discussion. Be aware that you're fighting an uphill battle here. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore writes: > But it seems to me that there is an assumption that problems will > arise when code gets a potentially funny-decoded string and doesn't > know where it came from. > > Is that a real concern? Yes, it's a real concern. I don't think it's possible to show a small piece of code one could point at and say "without a better API I bet you can't write this correctly," though. Rather, my experience with Emacs and various mail packages is that without type information it is impossible to keep track of the myriad bits and pieces of text that are recombining like pig flu, and eventually one breaks out and causes an error. It's usually easy to fix, but so are the next hundred similar regressions, and in the meantime a hundred users have suffered more or less damage or at least annoyance. There's no question that dealing with escapes of funny-decoded strings to uprepared code paths is mission creep compared to Martin's stated purpose for PEP 383, but it is also a real problem. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull wrote: > Nobody said we were at the stage of *saving* the [attachment]! But speaking of saving files, I think that's the biggest hole in this that has been nagging at the back of my mind. This PEP intends to allow easy access to filenames and other environment strings which are not restricted to known encodings. What happens if the detected encoding changes? There may be difficulties de/serializing these names, such as for an MRU list. Since the serialization of the Unicode string is likely to use UTF-8, and the string for such a file will include half surrogates, the application may raise an exception when encoding the names for a configuration file. These encoding exceptions will be as rare as the unusual names (which the careful I18N aware developer has probably eradicated from his system), and thus will appear late. Or say de/serialization succeeds. Since the resulting Unicode string differs depending on the encoding (which is a good thing; it is supposed to make most cases mostly readable), when the filesystem encoding changes (say from legacy to UTF-8), the "name" changes, and deserialized references to it become stale. This can probably be handled through careful use of the same encoding/decoding scheme, if relevant, but that sounds like we've just moved the problem from fs/environment access to serialization. Is that good enough? For other uses the API knew whether it was environmentally aware, but serialization probably will not. Should this PEP make recommendations about how to save filenames in configuration files? -- Michael Urman ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
Hrvoje Niksic wrote: > Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8 > sequence, will be converted to the half-surrogate '\udcff'. However, > a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be > converted to '\udcff'. Those are quite different POSIX pathnames; how > will Python know which one it was when I later pass '\udcff' to > open()? > > > [1] > I'm assuming that it's valid UTF8 because it passes through Python > 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8 > expert. I'm not a UTF-8 expert either, but I got bitten by this yesterday. I was uploading a file to a Google Search Appliance and it was rejected as invalid UTF-8 despite having been encoded into UTF-8 by Python. The cause was a byte sequence which decoded to a half surrogate similar to your example above. Python will happily decode and encode such sequences, but as I found to my cost other systems reject them. Reading wikipedia implies that Python is wrong to accept these sequences and I think (though I'm not a lawyer) that RFC 3629 also implies this: "The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters." and "Implementations of the decoding algorithm above MUST protect against decoding invalid sequences." ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
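For reference, a codec that follows RFC 3629 rejects both directions; current CPython (with the leniency bug mentioned elsewhere in this thread as issue 3672 fixed) behaves just like the Search Appliance:

>>> '\udcff'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed
>>> b'\xed\xb3\xbf'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte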
Re: [Python-Dev] One more proposed formatting change for 3.1
2009/4/28 Mark Dickinson : > Here's one more proposed change, this time for formatting of floats using format() and the empty presentation type. To avoid repeating myself, here's the text from the issue I just opened: http://bugs.python.org/issue5864
> """
> In all versions of Python from 2.6 up, I get the following behaviour:
> >>> format(123.456, '.4')
> '123.5'
> >>> format(1234.56, '.4')
> '1235.0'
> >>> format(12345.6, '.4')
> '1.235e+04'
> The first and third results are as I expect, but the second is somewhat misleading: it gives 5 significant digits when only 4 were requested, and moreover the last digit is incorrect. I propose that Python 2.7 and Python 3.1 be changed so that the output for the second line above is '1.235e+03'.
> """
> This issue seems fairly clear cut to me, and I doubt that there's been enough uptake of 'format' yet for this to risk significant breakage. So unless there are objections I'll plan to make this change before this weekend's beta.
+1 ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Hrvoje Niksic : > Lino Mastrodomenico wrote: >> >> Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid >> character when >> decoded with UTF-8, it should simply be considered an invalid UTF-8 >> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* >> '\udcff'). > > "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder > happily accepts it and returns u'\udcff': > b'\xed\xb3\xbf'.decode('utf-8') > '\udcff' Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). -- Lino Mastrodomenico ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] lone surrogates in utf-8
Hrvoje Niksic writes: > > "Should be considered" or "will be considered"? Python 3.0's UTF-8 > decoder happily accepts it and returns u'\udcff': > > >>> b'\xed\xb3\xbf'.decode('utf-8') > '\udcff' Yes, there is already a bug entry for it: http://bugs.python.org/issue3672 I think we could happily fix it for 3.1 (perhaps leaving 2.7 unchanged for compatibility reasons - I don't know if some people may rely on the current behaviour). ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Lino Mastrodomenico wrote: Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). "Should be considered" or "will be considered"? Python 3.0's UTF-8 decoder happily accepts it and returns u'\udcff': >>> b'\xed\xb3\xbf'.decode('utf-8') '\udcff' If the PEP depends on this being changed, it should be mentioned in the PEP. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel writes: > > How can you bring up practical problems against something that hasn't been implemented? The PEP is simple enough that you can simulate its effect by manually computing the resulting unicode string for a hypothetical broken filename. Several people have already done so in this thread. > The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do. According to some messages, it seems Java and Mono actually use this kind of workaround. Though I haven't checked (I don't use those languages). > But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works. That doesn't work at all. With your proposal, any non-ASCII filename will be unreadable, not only the broken ones. Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
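Antoine's point in one line: under an unconditional iso8859-15 locale, every correctly UTF-8-encoded name turns into mojibake (the filename is illustrative):

>>> 'héllo'.encode('utf-8').decode('iso8859-15')
'hÃ©llo'

The broken names round-trip, but at the price of garbling all the valid ones.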
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman : > The switch from PUA to half-surrogates does not resolve the issues with the > encoding not being a 1-to-1 mapping, though. The very fact that you think > you can get away with use of lone surrogates means that other people might, > accidentally or intentionally, also use lone surrogates for some other > purpose. Even in file names. It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguities between '\udcff' in the filename: b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf' and b'\xff' in the filename, decoded by Python to '\udcff': b'\xff' -> '\udcff' -> b'\xff' -- Lino Mastrodomenico ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
On Tue, 28 Apr 2009 at 09:30, Thomas Breuel wrote: Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error. This is what happens currently, and users are quite unhappy about it. We need to keep "users" and "programmers" distinct here. Programmers may find it inconvenient that they have to spend time figuring out and deal with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard, that's just a fact of life. And most programmers won't do it, because most programmers write for an English speaking audience and have no clue about unicode issues. That is probably slowly changing, but it is still true, I think. End users, however, are going to be quite unhappy if they get a string of gibberish for a file name because you decided to interpret some non-Unicode string as UTF-8-with-extra-bytes. No, end users expect the gibberish, because they get it all the time (at least on Unix) when dealing with international filenames. They expect to be able to manipulate such files _despite_ the gibberish. (I speak here as an end user who does this!!) Or some Python program might copy files from an ISO8859-15 encoded file system to a UTF-8 encoded file system, and instead of getting an error when the encodings are set incorrectly, Python would quietly create ISO8859-15 encoded file names, making the target file system inconsistent. As will almost all unix programs, and the unix OS itself. On Unix, you can't make the file system inconsistent by doing this, because filenames are just byte strings with no NULLs. How _does_ Windows handle this? Would a Windows program complain, or would it happily record the gibberish? I suspect the latter, but I don't use Windows so I don't know. There is a lot of potential for major problems for end users with your proposals. In both cases, what should happen is that the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly. What would actually happen is that the user would abandon the program that didn't work for one (not written in Python) that did. If the programmer was lucky they'd get a bug report, which they wouldn't be able to do anything about since Python wouldn't be providing the tools to let them fix it (ie: there are currently no bytes interfaces for environ or the command line in python3). Yes, users can do that (to a degree), but they are still unhappy about it. The approach actually fails for command line arguments As it should: if I give an ISO8859-15 encoded command line argument to a Python program that expects a UTF-8 encoding, the Python program should tell me that there is something wrong when it notices that. Quietly continuing is the wrong thing to do. Imagine you are on a unix system, and you have gotten from somewhere a file whose name is encoded in something other than UTF-8 (I have a number of those on my system). Now imagine that I want to run a python program against that file, passing the name in on the command line. I type the program name, the first few (non-mangled) characters, and hit tab for completion, and my shell automagically puts the escaped bytes onto the command line. Or perhaps I cut and paste from an 'ls' listing into a quoted string on the command line. 
Python is now getting the mangled filename passed in on the command line, and if the python program can't manipulate that file like any other file on my disk I am going to be mightily pissed. This is the _reality_ of current unix systems, like it or not. The same apparently applies to Windows, though in that case the mangled names may be fewer and you tend to pick them from a GUI interface rather than do cut-and-paste or tab completion. If we follow your approach, that ISO8859-15 string will get turned into an escaped unicode string inside Python. If I understand your proposal correctly, if it's a output file name and gets passed to Python's open function, Python will then decode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do. Right. Like I said, that's what most (almost all) Unix/Linux programs _do_. Now, in some future world where everyone (including Windows) acts like we are hearing OS/X does and rejects the garbled encoding _at the OS level_, then we'd be able to trust the file system encoding (FSDO trust) and there would be no need for this PEP or any similar solution. --David ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote: But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodeable bytes are encountered? If you unconditionally set encoding to iso8859-15, then you are effectively reverting to treating file names as bytes, regardless of the locale. You're also angering a lot of European users who expect iso8859-2, etc. If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383 (again)
Lino Mastrodomenico wrote: Let's suppose that I use Python 2.x or something else to create a file with name b'\xff'. My (Linux) system has a sane configuration and the filesystem encoding is UTF-8, so it's an invalid name but the kernel will blindly accept it anyway. With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'. One question that really bothers me about this proposal is the following: Assume a UTF-8 locale. A file named b'\xff', being an invalid UTF-8 sequence, will be converted to the half-surrogate '\udcff'. However, a file named b'\xed\xb3\xbf', a valid[1] UTF-8 sequence, will also be converted to '\udcff'. Those are quite different POSIX pathnames; how will Python know which one it was when I later pass '\udcff' to open()? A poster hinted at this question, but I haven't seen it answered, yet. [1] I'm assuming that it's valid UTF8 because it passes through Python 2.5's '\xed\xb3\xbf'.decode('utf-8'). I don't claim to be a UTF-8 expert. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.

How can you bring up practical problems against something that hasn't been implemented? The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do.

But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes.

Tom
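The byte-level claim does hold, since iso8859-15, like any full single-byte charmap, assigns a character to every byte value; a quick check:

    # Arbitrary bytes round-trip losslessly through a full single-byte codec --
    # at the cost of reading every file name as Latin text, whatever its real encoding.
    data = bytes(range(256))
    assert data.decode("iso8859-15").encode("iso8859-15") == data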
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
For what it's worth, the OSX APIs seem to behave as follows:

* If you create a file with a non-UTF-8 name on an HFS+ filesystem, the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system.

* If you mount an NFS filesystem from a linux host and that directory contains a file named chr(255):

  - unix-level tools will see a file with the expected name (just like on linux)

  - Cocoa's NSFileManager returns u"?" as the filename; that is, when the filename cannot be decoded using UTF-8, the name returned by the high-level API is mangled. This is regardless of the setting of LANG.

  - I haven't found a way yet to access files whose names are not valid UTF-8 using the high-level Cocoa APIs.

The latter two are interesting because Cocoa has a unicode filesystem API on top of a POSIX C-API, just like Python 3.x. I guess the chosen behaviour works out on OSX (where users are unlikely to run into this issue), but could be more problematic on other POSIX systems.

Ronald

On 28 Apr, 2009, at 14:03, Michael Foord wrote:
> Paul Moore wrote:
>> 2009/4/28 Antoine Pitrou :
>>> Paul Moore gmail.com> writes:
>>>> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
>>> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).
>> In case it's not clear, I am also +1 on the PEP as it stands.
>> Paul.
> Me 2
> Michael
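A small probe along these lines (a sketch, assuming a Python 3 build with the bytes filesystem API; the scratch directory is illustrative) makes the platform difference visible:

    import os, tempfile

    d = tempfile.mkdtemp().encode()               # bytes path, bypasses all decoding
    open(os.path.join(d, b"\xff"), "wb").close()
    print(os.listdir(d))                          # Linux ext3: [b'\xff']; HFS+ reportedly [b'%FF']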
Re: [Python-Dev] PEP 383 (again)
2009/4/28 Thomas Breuel :
> If we follow PEP 383, you will get lots of errors anyway because those strings, when encoded in utf-8b, will result in an error when you try to write them on a Windows file system or any other system that doesn't allow the byte sequences that the utf-8b encodes.

I'm not sure if when you say "write them on a Windows FS" you mean from within Windows itself or a filesystem mounted on another OS, so I'll cover both cases.

Let's suppose that I use Python 2.x or something else to create a file with name b'\xff'. My (Linux) system has a sane configuration and the filesystem encoding is UTF-8, so it's an invalid name but the kernel will blindly accept it anyway. With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'.

Now if this string somehow ends up in a Python 3.1 program running on Windows and it tries to create a file with this name, it will work (no exception will be raised). The Windows GUI will display the standard "invalid character" symbol (an empty box) when listing this file, but this seems reasonable, since the original file was displayed as "?" by the Linux console and with another invalid-character symbol by the GNOME file manager.

OTOH, if I write the same file on a Windows filesystem mounted on another OS, an automatic translation (probably done by the OS kernel) will be in place from the user-visible filesystem encoding (see e.g. the "iocharset" or "utf8" mount options for vfat on Linux) to UTF-16. Which means that the write will fail with something like:

IOError: [Errno 22] invalid filename: b'/media/windows_disk/\xff'

(The "problem" is that a vfat filesystem mounted with the "utf8" option on Linux will only accept byte sequences that are valid UTF-8, or at least reasonably similar: e.g. b'\xed\xb3\xbf' is accepted.)

Again this seems reasonable, since it already happens in Python 2 and with pretty much any other software, including GNU cp. I don't see how Martin can do better than this. Well, ok, I guess he could break into my house and rename the original file to something sane...

-- Lino Mastrodomenico
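The two halves of that failure are visible in isolation (a sketch; the mount point is illustrative):

    name = b"\xff".decode("utf-8", "surrogateescape")      # '\udcff'
    assert name.encode("utf-8", "surrogateescape") == b"\xff"
    # It is then the OS, not Python, that rejects the raw byte on a strictly
    # UTF-8 volume such as vfat mounted with "utf8":
    #     open(b"/media/windows_disk/\xff", "wb")
    #     IOError: [Errno 22] invalid filename: b'/media/windows_disk/\xff'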
Re: [Python-Dev] Can not run under python 2.6
OK, Thanks a lot.

On Tue, Apr 28, 2009 at 8:06 PM, Michael Foord wrote:
> Jianchun Zhou wrote:
>> Hi, there:
>> I am new to Python, and now I have run into trouble:
>> I have an application named canola. It is written under Python 2.5 and runs normally under Python 2.5.
>> But when it runs under Python 2.6, a problem comes up. It says:
>> Traceback (most recent call last):
>>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 151, in _load_plugins
>>     classes = plg.load()
>>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 94, in load
>>     mod = self._ldr.load()
>>   File "/usr/lib/python2.6/site-packages/terra/core/module_loader.py", line 42, in load
>>     mod = __import__(modpath, fromlist=[mod_name])
>> ImportError: Import by filename is not supported.
>> Anybody have any idea what I should do?
> The Python-Dev mailing list is for the development of Python, not for development with Python. You will get a much better response asking on the comp.lang.python (python-list) or python-tutor newsgroups / mailing lists. comp.lang.python has both google groups and gmane gateways and so is easy to post to.
> For the particular problem you mention, it is an intentional change, and so the code in canola will need to be modified in order to run under Python 2.6.
> All the best,
> Michael Foord

-- Best Regards
Re: [Python-Dev] Can not run under python 2.6
Jianchun Zhou wrote:
> Hi, there:
> I am new to Python, and now I have run into trouble:
> I have an application named canola. It is written under Python 2.5 and runs normally under Python 2.5.
> But when it runs under Python 2.6, a problem comes up. It says:
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 151, in _load_plugins
>     classes = plg.load()
>   File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 94, in load
>     mod = self._ldr.load()
>   File "/usr/lib/python2.6/site-packages/terra/core/module_loader.py", line 42, in load
>     mod = __import__(modpath, fromlist=[mod_name])
> ImportError: Import by filename is not supported.
> Anybody have any idea what I should do?

The Python-Dev mailing list is for the development of Python, not for development with Python. You will get a much better response asking on the comp.lang.python (python-list) or python-tutor newsgroups / mailing lists. comp.lang.python has both google groups and gmane gateways and so is easy to post to.

For the particular problem you mention, it is an intentional change, and so the code in canola will need to be modified in order to run under Python 2.6.

All the best,

Michael Foord

-- http://www.ironpythoninaction.com/
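One possible shape for that modification (a hedged sketch, not from the thread; the helper and its names are made up) is to load by explicit path instead of handing a filename to __import__:

    import imp

    def load_module(modpath, mod_name):
        # Python 2.6: __import__ no longer accepts filenames, but imp can
        # still load a module from an explicit path.
        if modpath.endswith(".py"):
            return imp.load_source(mod_name, modpath)    # path-based load
        return __import__(modpath, fromlist=[mod_name])  # dotted-name import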
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore wrote:
> 2009/4/28 Antoine Pitrou :
>> Paul Moore gmail.com> writes:
>>> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
>> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.
>> For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).
> In case it's not clear, I am also +1 on the PEP as it stands.
> Paul.

Me 2

Michael

-- http://www.ironpythoninaction.com/
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Antoine Pitrou :
> Paul Moore gmail.com> writes:
>> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.
> For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).

In case it's not clear, I am also +1 on the PEP as it stands.

Paul.
[Python-Dev] One more proposed formatting change for 3.1
Here's one more proposed change, this time for formatting of floats using format() and the empty presentation type. To avoid repeating myself, here's the text from the issue I just opened: http://bugs.python.org/issue5864

"""
In all versions of Python from 2.6 up, I get the following behaviour:

>>> format(123.456, '.4')
'123.5'
>>> format(1234.56, '.4')
'1235.0'
>>> format(12345.6, '.4')
'1.235e+04'

The first and third results are as I expect, but the second is somewhat misleading: it gives 5 significant digits when only 4 were requested, and moreover the last digit is incorrect. I propose that Python 2.7 and Python 3.1 be changed so that the output for the second line above is '1.235e+03'.
"""

This issue seems fairly clear cut to me, and I doubt that there's been enough uptake of 'format' yet for this to risk significant breakage. So unless there are objections I'll plan to make this change before this weekend's beta.

Mark
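To spell out why '1235.0' misleads: five significant digits of 1234.56 round to '1234.6', so the trailing digit shown is wrong, not merely superfluous. A quick illustration (a sketch, not from the issue):

    x = 1234.56
    print('%.4g' % x)   # '1235'   -- the correct 4-significant-digit rounding
    print('%.5g' % x)   # '1234.6' -- what a correct 5-digit result looks like
    # format(x, '.4') == '1235.0' pads the 4-digit form with '.0', which reads as
    # five significant digits with a wrong last one; '1.235e+03' avoids this.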
[Python-Dev] Can not run under python 2.6
Hi, there:

I am new to Python, and now I have run into trouble:

I have an application named canola. It is written under Python 2.5 and runs normally under Python 2.5.

But when it runs under Python 2.6, a problem comes up. It says:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 151, in _load_plugins
    classes = plg.load()
  File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 94, in load
    mod = self._ldr.load()
  File "/usr/lib/python2.6/site-packages/terra/core/module_loader.py", line 42, in load
    mod = __import__(modpath, fromlist=[mod_name])
ImportError: Import by filename is not supported.

Anybody have any idea what I should do?

-- Best Regards
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore gmail.com> writes:
> I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.

Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.

For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1).

Regards

Antoine.
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 11:32:26AM +0200, Thomas Breuel wrote:
> On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann wrote:
>> I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!
> I don't know what it should do (ftplib needs to worry about that).

There is no ftplib there. The FTP server is ProFTPd, with ftp clients of all sorts - one, e.g., an ftp client built into an automatic web camera. I use Python programs to process files after they have been uploaded. The programs access the FTP directory as a part of the local filesystem.

> I do know what it shouldn't do, however: it should not return a utf-8b string which, when used to create a file, will create a file reproducing the byte sequence of the remote machine; that's wrong.

That is certainly wrong. But at least the approach allows Python programs to list all files in a directory - currently AFAIU os.listdir() silently skips undecodable filenames. And after a program gets all the files, it can process them further - it can clean up filenames (base64-encode them, e.g.), but at least it can do something, where currently it cannot.

PS. It seems I started to argue for the PEP. Well, well...

Oleg.
-- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
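A sketch of that cleanup, using the bytes API that already bypasses decoding (the directory path is illustrative, and base64 stands in for any renaming policy):

    import base64, os

    incoming = b"/srv/ftp/incoming"
    for raw in os.listdir(incoming):              # bytes in, bytes out: nothing skipped
        try:
            raw.decode("utf-8")                   # already sane, leave it alone
        except UnicodeDecodeError:
            safe = base64.urlsafe_b64encode(raw)  # reversible ASCII name
            os.rename(os.path.join(incoming, raw),
                      os.path.join(incoming, safe))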
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann wrote:
> On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
>> Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.
> What is a "correct encoding"?
> I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!

I don't know what it should do (ftplib needs to worry about that). I do know what it shouldn't do, however: it should not return a utf-8b string which, when used to create a file, will create a file reproducing the byte sequence of the remote machine; that's wrong.

> If any program starts to raise errors Python becomes completely unusable for me! But is there anything I can debug here?

If we follow PEP 383, you will get lots of errors anyway because those strings, when encoded in utf-8b, will result in an error when you try to write them on a Windows file system or any other system that doesn't allow the byte sequences that the utf-8b encodes.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/28 Glenn Linderman :
> So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns.
> So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not.

Sorry for picking on Glenn's comment - it's only one of many in this thread. But it seems to me that there is an assumption that problems will arise when code gets a potentially funny-decoded string and doesn't know where it came from.

Is that a real concern? How many programs really don't know where their data came from? Maybe a general-purpose library routine *might* just need to document explicitly how it handles funny-encoded data (I can't actually imagine anything that would, but I'll concede it may be possible) but that's just a matter of documenting your assumptions - no better or worse than many other cases.

This all sounds similar to the idea of "tainted" data in security - if you lose track of untrusted data from the environment, you expose yourself to potential security issues. So the same techniques should be relevant here (including ignoring it if your application isn't such that it's a concern!)

I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. (NB, if such a claim has been made, feel free to point me to it - I admit I've been skimming this thread at times).

Paul.
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
> Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.

What is a "correct encoding"?

I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!

If any program starts to raise errors Python becomes completely unusable for me! But is there anything I can debug here?

Oleg.
-- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] PEP 383 (again)
> As long as it's hard there will be no internationalization. A fact of life, damn it. Programmers are lazy, and have many problems to solve.

PEP 383 doesn't make it any easier; it just turns one set of problems into another. Actually, it makes it worse, since any problems that show up now show up far from the source of the problem, and since it can lead to security problems and/or data loss.

> And the programmer answers "The program expects a correct environment, good filenames, etc." and closes the issue with the resolution "User error, will not fix".

The problem may well be with the program using the wrong encodings or incorrectly ignoring encoding information. Furthermore, even if it is user error, the program needs to validate its inputs and put up a meaningful error message, not mangle the disk. To detect such program bugs, it's important that when Python detects an incorrect encoding it doesn't quietly continue with an incorrect string. Furthermore, if you don't provide clear error messages, it often takes a significant amount of time for each issue to determine that it is user error.

> I am not arguing for or against the PEP in question. Python certainly has to have a way to make portable i18n less hard or else the number of portable internationalized programs will be about zero. What the way should be - I don't know.

Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.

Tom
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 09:30:01AM +0200, Thomas Breuel wrote:
> Programmers may find it inconvenient that they have to spend time figuring out and dealing with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard, that's just a fact of life.

As long as it's hard there will be no internationalization. A fact of life, damn it. Programmers are lazy, and have many problems to solve.

> the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.

And the programmer answers "The program expects a correct environment, good filenames, etc." and closes the issue with the resolution "User error, will not fix".

I am not arguing for or against the PEP in question. Python certainly has to have a way to make portable i18n less hard or else the number of portable internationalized programs will be about zero. What the way should be - I don't know.

Oleg.
-- Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] PEP 383 (again)
>> Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.
> This is what happens currently, and users are quite unhappy about it.

We need to keep "users" and "programmers" distinct here. Programmers may find it inconvenient that they have to spend time figuring out and dealing with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard, that's just a fact of life.

End users, however, are going to be quite unhappy if they get a string of gibberish for a file name because you decided to interpret some non-Unicode string as UTF-8-with-extra-bytes. Or some Python program might copy files from an ISO8859-15 encoded file system to a UTF-8 encoded file system, and instead of getting an error when the encodings are set incorrectly, Python would quietly create ISO8859-15 encoded file names, making the target file system inconsistent. There is a lot of potential for major problems for end users with your proposals. In both cases, what should happen is that the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.

> Yes, users can do that (to a degree), but they are still unhappy about it. The approach actually fails for command line arguments

As it should: if I give an ISO8859-15 encoded command line argument to a Python program that expects a UTF-8 encoding, the Python program should tell me that there is something wrong when it notices that. Quietly continuing is the wrong thing to do.

If we follow your approach, that ISO8859-15 string will get turned into an escaped unicode string inside Python. If I understand your proposal correctly, if it's an output file name and gets passed to Python's open function, Python will then encode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do.

> As is, these interfaces are incomplete - they don't support command line arguments, or environment variables. If you want to complete them, you should write a PEP.

There's no point in scratching when there's no itch.

Tom

PS:
>> Quietly escaping a bad UTF-8 encoding with private Unicode characters is unlikely to be the right thing
> And indeed, the PEP stopped using PUA characters.

Let me rephrase this: "quietly escaping a bad UTF-8 encoding is unlikely to be the right thing"; it doesn't matter how you do it.
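The copy scenario is worth making concrete (a sketch; the file name is illustrative): under the PEP, the mis-decoded source name re-encodes to its original ISO8859-15 bytes, which then land unchanged on the UTF-8 target instead of raising an error.

    src = b"r\xe9sum\xe9.txt"                      # ISO8859-15 for 'résumé.txt'
    s = src.decode("utf-8", "surrogateescape")     # 'r\udce9sum\udce9.txt'
    assert s.encode("utf-8", "surrogateescape") == src
    # Writing these bytes on a UTF-8 volume silently creates a non-UTF-8 name;
    # a strict decode would instead have surfaced the misconfiguration:
    #     src.decode("utf-8")  ->  UnicodeDecodeError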