Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott


On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:


How do I get a printable unicode version of these path strings if they
contain non-unicode data?


Define printable. One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.
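A rough sketch of that suggestion (the helper name is mine, and it assumes the PEP's final scheme in which undecodable bytes become lone surrogates U+DC80..U+DCFF):

```python
import re

def printable(name: str) -> str:
    # Mask the PEP's escape characters (lone surrogates U+DC80..U+DCFF)
    # with '?' so the result is valid, displayable unicode.
    return re.sub("[\udc80-\udcff]", "?", name)
```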


What I mean by printable is that the string must be valid unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.

I think your PEP gives me a string that will not encode to
valid UTF-8, which the world outside of Python expects. Did I get this
point wrong?





I'm guessing that an app has to understand that filenames come in
two forms, unicode and bytes, if it's not utf-8 data. Why not simply
return a string if it's valid utf-8, otherwise return bytes?


That would have been an alternative solution, and the one that 2.x
uses for listdir. People didn't like it.


In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system
the files are in CP-1251(?) and not valid UTF-8.

What we have to do is detect these non-UTF-8 filenames and get the
users to rename them.

Having an algorithm that says "if it's a string, no problem; if it's
bytes, deal with the exceptions" seems simple.

How do I do this detection with the PEP proposal?
Do I end up using the byte interface and doing the utf-8 decode
myself?

Barry

--
http://mail.python.org/mailman/listinfo/python-list


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
 How do I get a printable unicode version of these path strings if they
 contain non-unicode data?

 Define printable. One way would be to use a regular expression,
 replacing all codes in a certain range with a question mark.
 
 What I mean by printable is that the string must be valid unicode
 that I can print to a UTF-8 console or place as text in a UTF-8
 web page.
 
 I think your PEP gives me a string that will not encode to
 valid UTF-8, which the world outside of Python expects. Did I get this
 point wrong?

You are right. However, if your *only* requirement is that it should
be printable, then this is fairly underspecified. One way to get
a printable string would be this function

def printable_string(unprintable):
  return ""

This will always return a printable version of the input string...

 In our application we are running fedora with the assumption that the
 filenames are UTF-8. When Windows systems FTP files to our system
 the files are in CP-1251(?) and not valid UTF-8.

That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.

 Having an algorithm that says "if it's a string, no problem; if it's
 bytes, deal with the exceptions" seems simple.
 
 How do I do this detection with the PEP proposal?
 Do I end up using the byte interface and doing the utf-8 decode
 myself?

No, you should encode using the strict error handler, with the
locale encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.
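That check can be sketched as follows (the helper name and the explicit encoding parameter are mine; in practice you would omit the encoding and let it default to the locale's):

```python
import sys

def filename_is_clean(name: str, encoding: str = None) -> bool:
    # A name that went through the PEP's escaping contains lone surrogates,
    # so a strict encode fails, flagging the broken name.
    try:
        name.encode(encoding or sys.getfilesystemencoding(), "strict")
        return True
    except UnicodeEncodeError:
        return False
```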

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott


On 30 Apr 2009, at 21:06, Martin v. Löwis wrote:

How do I get a printable unicode version of these path strings if they
contain non-unicode data?


Define printable. One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.


What I mean by printable is that the string must be valid unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.

I think your PEP gives me a string that will not encode to
valid UTF-8, which the world outside of Python expects. Did I get this
point wrong?


You are right. However, if your *only* requirement is that it should
be printable, then this is fairly underspecified. One way to get
a printable string would be this function

def printable_string(unprintable):
  return ""


Ha ha! Indeed this works, but I would have to try to turn enough of the
string into a reasonable hint at the name of the file so the user has
some chance of knowing what is being reported.




This will always return a printable version of the input string...


In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system
the files are in CP-1251(?) and not valid UTF-8.


That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.


Not a bug, it's the lack of a feature. We use ProFTPd, which has just
implemented what is required. I forget the exact details - they are at
work - when the ftp client asks for the FEAT of the ftp server, the
server can say to use UTF-8. Supporting that in the server was
apparently non-trivial.






Having an algorithm that says "if it's a string, no problem; if it's
bytes, deal with the exceptions" seems simple.

How do I do this detection with the PEP proposal?
Do I end up using the byte interface and doing the utf-8 decode
myself?


No, you should encode using the strict error handler, with the
locale encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.


O.k. I understand.

Barry



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread norseman

Martin v. Löwis wrote:

How do I get a printable unicode version of these path strings if they
contain non-unicode data?

Define printable. One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.

What I mean by printable is that the string must be valid unicode
that I can print to a UTF-8 console or place as text in a UTF-8
web page.

I think your PEP gives me a string that will not encode to
valid UTF-8, which the world outside of Python expects. Did I get this
point wrong?


You are right. However, if your *only* requirement is that it should
be printable, then this is fairly underspecified. One way to get
a printable string would be this function

def printable_string(unprintable):
  return ""

This will always return a printable version of the input string...


No it will not.
It will return either nothing at all or a '\x00', depending on how a NULL
is treated. Neither prints on paper, screen, or anywhere else. If you
get the cases where all bytes are not translating or printable locally,
then you get nothing out.  Duplicate file names usually abound too.







In our application we are running fedora with the assumption that the
filenames are UTF-8. When Windows systems FTP files to our system
the files are in CP-1251(?) and not valid UTF-8.


That would be a bug in your FTP server, no? If you want all file names
to be UTF-8, then your FTP server should arrange for that.


Which seems to be exactly what he's trying to do.




Having an algorithm that says "if it's a string, no problem; if it's
bytes, deal with the exceptions" seems simple.

How do I do this detection with the PEP proposal?


If no one has an 'elegant' solution, toss the PEP and do what has to be
done.  I find the classroom is seldom related to reality.


Do I end up using the byte interface and doing the utf-8 decode
myself?


No, you should encode using the strict error handler, with the
locale encoding. If the file name encodes successfully, it's correct,
otherwise, it's broken.


Exactly his problem to solve. How does he fix the broken ones?



Regards,
Martin





Barry;
First: See if the sender(s) will use a different font. :)
I would suggest you read raw bytes and handle the problem in
the usual logical way. (Translate what you can, if it looks readable
keep it otherwise send it back if possible.)  If you have to keep a
junked up name, try using a thesaurus or soundex (I know I spelled that
wrong) to help keep the meaning/sound of the file name.  If the name is
one of those computer generated gobbeldigoops - build a translation
table to use for incoming and for getting back to original bit patterns
later. Your name won't be the same but ...  Plug it into that handy
utility you just wrote and you can talk much more effectively with the sender.


If you can get the page-thingy (CP-1251 or whatever) specs you
can be well ahead of the game.  There are programs out there that will
convert (better or lesser) between page specs.  Some work in-line.
Watch out for Python's print function not being completely compatible 
with reality. The high bit bytes in ASCII have been in use for quite 
some time and are (or at least supposed to be) part of the page to page 
spec translations. You probably will need to know (or make a close 
guess) of the 'from' language to get plausible results.  If the files 
are coming across the Pacific it might be a good time to form a 
collaboration. (a case of: we agree that 'that' bit pattern in your 
filename will become 'this' in ours. Reversal required, as in A becomes 
C incoming and C becomes A outgoing.)
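When the source "page spec" is known, the conversion described above is just a decode/encode pair; the bytes below are a hypothetical cp1251 filename:

```python
# Hypothetical example: a filename sent by a Windows client in cp1251,
# transcoded to UTF-8 for storage on the Linux side.
raw = bytes([0xF4, 0xE0, 0xE9, 0xEB])  # "файл" encoded in cp1251
text = raw.decode("cp1251")            # interpret with the source codepage
utf8_name = text.encode("utf-8")       # re-encode for the local convention
```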


Note:  Different machines store things differently. Intel stores the high
byte last, Sun stores it first. It can be handy to know the machinery.
Net transport programs are supposed to send Sun order, not all do.




Steve



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Barry Scott


On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:



If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.



Forgive me if this has been covered. I've been reading this thread for
a long time and still have 100-odd replies to go...

How do I get a printable unicode version of these path strings if they
contain non-unicode data?

I'm guessing that an app has to understand that filenames come in two
forms, unicode and bytes, if it's not utf-8 data. Why not simply return a
string if it's valid utf-8, otherwise return bytes? Then in the app you
check the type of the object, string or bytes, and deal with reporting
errors appropriately.

Barry



Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 23:41, Barry Scott ba...@barrys-emacs.org wrote:
 On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

 Forgive me if this has been covered. I've been reading this thread for a 
 long time and still have 100-odd replies to go...

 How do I get a printable unicode version of these path strings if they  
 contain non-unicode data?

Personally, I'd use repr(). One might ask, what would you expect to see
if you were printing such a string?

 I'm guessing that an app has to understand that filenames come in two
 forms, unicode and bytes, if it's not utf-8 data. Why not simply return a
 string if it's valid utf-8, otherwise return bytes? Then in the app you
 check the type of the object, string or bytes, and deal with reporting
 errors appropriately.

Because it complicates the app enormously, for every app.

It would be _nice_ to just call os.listdir() et al with strings, get
strings, and not worry.

With strings becoming unicode in Python3, on POSIX you have an issue of
deciding how to get its filenames-are-bytes into a string and the
reverse. One could naively map the byte values to the same Unicode code
points, but that results in strings that do not contain the same
characters as the user/app expects for byte values above 127.

Since POSIX does not really have a filesystem level character encoding,
just a user environment setting that says how the current user encodes
characters into bytes (UTF-8 is increasingly common and useful, but
it is not universal), it is more useful to decode filenames on the
assumption that they represent characters in the user's (current) encoding
convention; that way when things are displayed they are meaningful,
and they interoperate well with strings made by the user/app. If all
the filenames were actually encoded that way when made, that works. But
different users may adopt different conventions, and indeed a user may
have used ASCII or an ISO8859-* coding in the past and be transitioning
to something else now, so they will have a bunch of files in different
encodings.

The PEP uses the user's current encoding with a handler for byte
sequences that don't decode to valid Unicode scalar values in
a fashion that is reversible. That is, you get strings out of
listdir() and those strings will go back in (eg to open()) perfectly
robustly.
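That round trip can be demonstrated with the error handler that eventually shipped in Python 3.1 as "surrogateescape" (the PEP's "python-escape"/utf-8b); the byte value here is an arbitrary example:

```python
raw = b"caf\xe9"  # latin-1 bytes, not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")   # 0xE9 -> lone surrogate U+DCE9
back = name.encode("utf-8", "surrogateescape")  # surrogate -> 0xE9 again
```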

Previous approaches would either silently hide non-decodable names in
listdir() results, throw exceptions when the decode failed, or mangle
things non-reversibly. I believe Python3 went with the first option
there.

The PEP at least lets programs naively access all files that exist,
and create a filename from any well-formed unicode string provided that
the filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

  - Glenn points out that strings that came _not_ from listdir, and that are
_not_ well-formed unicode (== have bare surrogates in them) but that
were intended for use as filenames will conflict with the PEP's scheme -
programs must know that these strings came from outside and must be
translated into the PEP's funny-encoding before use in the os.*
functions. Before the PEP they would get used directly; after the PEP
they encode differently, thus producing different POSIX
filenames. Breakage.

  - Glenn would like the encoding to use Unicode scalar values only,
using a rare-in-filenames character.
That would avoid the issue with 'outside' strings that contain
surrogates. To my mind it just moves the punning from rare illegal
strings to merely uncommon but legal characters.

  - Some parties think it would be better to not return strings from
os.listdir but a subclass of string (or at least a duck-type of
string) that knows where it came from and is also handily
recognisable as not-really-a-string for purposes of deciding
whether is it PEP-funny-encoded by direct inspection.

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

The peever can look at the best day in his life and sneer at it.
- Jim Hill, JennyGfest '95


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
 How do I get a printable unicode version of these path strings if they
 contain non-unicode data?

Define printable. One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.

 I'm guessing that an app has to understand that filenames come in two forms,
 unicode and bytes, if it's not utf-8 data. Why not simply return a string if
 it's valid utf-8, otherwise return bytes?

That would have been an alternative solution, and the one that 2.x uses
for listdir. People didn't like it.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-26 Thread Adrian
How about another str-like type, a sequence of char-or-bytes? Could be
called strbytes or stringwithinvalidcharacters. It would support
whatever subset of str functionality makes sense / is easy to
implement plus a to_escaped_str() method (that does the escaping the
PEP talks about) for people who want to use regexes or other str-only
stuff.

Here is a description by example:
os.listdir('.') -> [strbytes('normal_file'), strbytes('bad', 128, 'file')]
strbytes('a')[0] -> strbytes('a')
strbytes('bad', 128, 'file')[3] -> strbytes(128)
strbytes('bad', 128, 'file').to_escaped_str() -> 'bad?128file'
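The description by example above might be realised as a toy class like this (entirely hypothetical; only construction and the escaping method are shown):

```python
class strbytes:
    """Toy sketch of the proposed type: a sequence of decoded str
    fragments plus raw undecodable bytes kept as ints."""
    def __init__(self, *parts):
        self.parts = parts

    def to_escaped_str(self) -> str:
        # Render each raw byte as a '?<value>' placeholder.
        return "".join(p if isinstance(p, str) else "?%d" % p
                       for p in self.parts)
```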

Having a separate type is cleaner than a str that isn't exactly what
it represents. And making the escaping an explicit (but
rarely-needed) step would be less surprising for users. Anyway, I
don't know a whole lot about this issue so there may be an obvious reason
this is a bad idea.

On Wed, Apr 22, 2009 at 6:50 AM, Martin v. Löwis mar...@v.loewis.de wrote:
 I'm proposing the following PEP for inclusion into Python 3.1.
 Please comment.

 Regards,
 Martin

 PEP: 383
 Title: Non-decodable Bytes in System Character Interfaces
 Version: $Revision: 71793 $
 Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
 Author: Martin v. Löwis mar...@v.loewis.de
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 22-Apr-2009
 Python-Version: 3.1
 Post-History:

 Abstract
 

 File names, environment variables, and command line arguments are
 defined as being character data in POSIX; the C APIs however allow
 passing arbitrary bytes - whether these conform to a certain encoding
 or not. This PEP proposes a means of dealing with such irregularities
 by embedding the bytes in character strings in such a way that allows
 recreation of the original byte string.

 Rationale
 =

 The C char type is a data type that is commonly used to represent both
 character data and bytes. Certain POSIX interfaces are specified and
 widely understood as operating on character data, however, the system
 call interfaces make no assumption on the encoding of these data, and
 pass them on as-is. With Python 3, character strings use a
 Unicode-based internal representation, making it difficult to ignore
 the encoding of byte strings in the same way that the C interfaces can
 ignore the encoding.

 On the other hand, Microsoft Windows NT has corrected the original
 design limitation of Unix, and made it explicit in its system
 interfaces that these data (file names, environment variables, command
 line arguments) are indeed character data, by providing a
 Unicode-based API (keeping a C-char-based one for backwards
 compatibility).

 For Python 3, one proposed solution is to provide two sets of APIs: a
 byte-oriented one, and a character-oriented one, where the
 character-oriented one would be limited to not being able to represent
 all data accurately. Unfortunately, for Windows, the situation would
 be exactly the opposite: the byte-oriented interface cannot represent
 all data; only the character-oriented API can. As a consequence,
 libraries and applications that want to support all user data in a
 cross-platform manner have to accept a mish-mash of bytes and characters
 exactly in the way that caused endless troubles for Python 2.x.

 With this PEP, a uniform treatment of these data as characters becomes
 possible. The uniformity is achieved by using specific encoding
 algorithms, meaning that the data can be converted back to bytes on
 POSIX systems only if the same encoding is used.

 Specification
 =

 On Windows, Python uses the wide character APIs to access
 character-oriented APIs, allowing direct conversion of the
 environmental data to Python str objects.

 On POSIX systems, Python currently applies the locale's encoding to
 convert the byte data to Unicode. If the locale's encoding is UTF-8,
 it can represent the full set of Unicode characters, otherwise, only a
 subset is representable. In the latter case, using private-use
 characters to represent these bytes would be an option. For UTF-8,
 doing so would create an ambiguity, as the private-use characters may
 regularly occur in the input also.

 To convert non-decodable bytes, a new error handler "python-escape" is
 introduced, which decodes non-decodable bytes into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.

 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

 Discussion
 ==

 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-26 Thread Martin v. Löwis
 How about another str-like type, a sequence of char-or-bytes?

That would be a different PEP. I personally like my own proposal
more, but feel free to propose something different.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
Cameron Simpson wrote:
 On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote:
 | File names, environment variables, and command line arguments are
 | defined as being character data in POSIX;
 
 Specific citation please? I'd like to check the specifics of this.

For example, on environment variables:

http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).

# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the _
# (underscore) from the characters defined in Portable Character Set .
# Other characters may be permitted by an implementation;

Or, on command line arguments:

http://opengroup.org/onlinepubs/007908799/xsh/execve.html

# The arguments represented by arg0, ... are pointers to null-terminated
# character strings

where a character string is "A contiguous sequence of characters
terminated by and including the first null byte", and a character
is

# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.

 So you're proposing that all POSIX OS interfaces (which use byte strings)
 interpret those byte strings into Python3 str objects, with a codec
 that will accept arbitrary byte sequences losslessly and is totally
 reversible, yes?

Correct.

 And, I hope, that the os.* interfaces silently use it by default.

Correct.

 | Applications that need to process the original byte
 | strings can obtain them by encoding the character strings with the
 | file system encoding, passing "python-escape" as the error handler
 | name.
 
 -1
 
 This last sentence kills the idea for me, unless I'm missing something.
 Which I may be, of course.
 
 POSIX filesystems _do_not_ have a file system encoding.

Why is that a problem for the PEP?

 If I'm writing a general purpose UNIX tool like chmod or find, I expect
 it to work reliably on _any_ UNIX pathname. It must be totally encoding
 blind. If I speak to the os.* interface to open a file, I expect to hand
 it bytes and have it behave.

See the other messages. If you want to do that, you can continue to.

 I'm very much in favour of being able to work in strings for most
 purposes, but if I use the os.* interfaces on a UNIX system it is
 necessary to be _able_ to work in bytes, because UNIX file pathnames
 are bytes.

Please re-read the PEP. It provides a way of being able to access any
POSIX file name correctly, and still pass strings.

 If there isn't a byte-safe os.* facility in Python3, it will simply be
 unsuitable for writing low level UNIX tools.

Why is that? The mechanism in the PEP is precisely defined to allow
writing low level UNIX tools.

 Finally, I have a small python program whose whole purpose in life
 is to transcode UNIX filenames before transfer to a MacOSX HFS
 directory, because of HFS's enforced particular encoding. What approach
 should a Python app take to transcode UNIX pathnames under your scheme?

Compute the corresponding character strings, and use them.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 If the bytes are mapped to single half surrogate codes instead of the
 normal pairs (low+high), then I can see that decoding could never be
 ambiguous and encoding could produce the original bytes.

I was confused by Markus Kuhn's original UTF-8b specification. I have
now changed the PEP to avoid using PUA characters at all.

Regards,
Martin


Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Zooko O'Whielacronx
Thanks for writing this PEP 383, MvL.  I recently ran into this  
problem in Python 2.x in the Tahoe project [1].  The Tahoe project  
should be considered a good use case showing what some people need.   
For example, the assumption that a file will later be written back  
into the same local filesystem (and thus luckily use the same  
encoding) from which it originally came doesn't hold for us, because  
Tahoe is used for file-sharing as well as for backup-and-restore.


One of my first conclusions in pursuing this issue is that we can  
never use the Python 2.x unicode APIs on Linux, just as we can never  
use the Python 2.x str APIs on Windows [2].  (You mentioned this  
ugliness in your PEP.)  My next conclusion was that the Linux way of  
doing encoding of filenames really sucks compared to, for example,  
the Mac OS X way.  I'm heartened to see what David Wheeler is trying  
to persuade the maintainers of Linux filesystems to improve some of  
this: [3].


My final conclusion was that we needed to have two kinds of  
workaround for the Linux suckage: first, if decoding using the  
suggested filesystem encoding fails, then we fall back to mojibake  
[4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not  
sure if it matters and I haven't yet understood if utf-8b offers  
another alternative for this case).  Second, if decoding succeeds  
using the suggested filesystem encoding on Linux, then write down the  
encoding that we used and include that with the filename.  This  
expands the size of our filenames significantly, but it is the only  
way to allow some future programmer to undo the damage of a
falsely-successful decoding.  Here's our whole plan: [5].


Regards,

Zooko

[1] http://allmydata.org
[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html #  
see the footnote of this message

[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47


Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Cameron Simpson
On 25Apr2009 14:07, Martin v. Löwis mar...@v.loewis.de wrote:
| Cameron Simpson wrote:
|  On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote:
|  | File names, environment variables, and command line arguments are
|  | defined as being character data in POSIX;
|  
|  Specific citation please? I'd like to check the specifics of this.
| For example, on environment variables:
| http://opengroup.org/onlinepubs/007908799/xbd/envvar.html
[...]
| http://opengroup.org/onlinepubs/007908799/xsh/execve.html
[...]

Thanks.

|  So you're proposing that all POSIX OS interfaces (which use byte strings)
|  interpret those byte strings into Python3 str objects, with a codec
|  that will accept arbitrary byte sequences losslessly and is totally
|  reversible, yes?
| 
| Correct.
| 
|  And, I hope, that the os.* interfaces silently use it by default.
| 
| Correct.

Ok, then I'm probably good with the PEP. Though I have a quite strong
desire to be able to work in bytes at need without doing multiple
encode/decode steps.

|  | Applications that need to process the original byte
|  | strings can obtain them by encoding the character strings with the
|  | file system encoding, passing "python-escape" as the error handler
|  | name.
|  
|  -1
|  This last sentence kills the idea for me, unless I'm missing something.
|  Which I may be, of course.
|  POSIX filesystems _do_not_ have a file system encoding.
| 
| Why is that a problem for the PEP?

Because you said above by encoding the character strings with the file
system encoding, which is a fiction.

|  If I'm writing a general purpose UNIX tool like chmod or find, I expect
|  it to work reliably on _any_ UNIX pathname. It must be totally encoding
|  blind. If I speak to the os.* interface to open a file, I expect to hand
|  it bytes and have it behave.
| 
| See the other messages. If you want to do that, you can continue to.
| 
|  I'm very much in favour of being able to work in strings for most
|  purposes, but if I use the os.* interfaces on a UNIX system it is
|  necessary to be _able_ to work in bytes, because UNIX file pathnames
|  are bytes.
| 
| Please re-read the PEP. It provides a way of being able to access any
| POSIX file name correctly, and still pass strings.
| 
|  If there isn't a byte-safe os.* facility in Python3, it will simply be
|  unsuitable for writing low level UNIX tools.
| 
| Why is that? The mechanism in the PEP is precisely defined to allow
| writing low level UNIX tools.

Then implicitly it's byte safe. Clearly I'm being unclear; I mean
original OS-level byte strings must be obtainable undamaged, and it must
be possible to create/work on OS objects starting with a byte string as
the pathname.

|  Finally, I have a small python program whose whole purpose in life
|  is to transcode UNIX filenames before transfer to a MacOSX HFS
|  directory, because of HFS's enforced particular encoding. What approach
|  should a Python app take to transcode UNIX pathnames under your scheme?
| 
| Compute the corresponding character strings, and use them.

In Python2 I've been going (ignoring checks for unchanged names):

  - Obtain the old name and interpret it into a str() correctly.
I mean here that I go:
  unicode_name = unicode(name, srcencoding)
in old Python2 speak. name is a bytes string obtained from listdir()
and srcencoding is the encoding known to have been used when the old name
was constructed. Eg iso8859-1.
  - Compute the new name in the desired encoding. For MacOSX HFS,
that's:
  utf8_name = unicodedata.normalize('NFD',unicode_name).encode('utf8')
Still in Python2 speak, that's a byte string.
  - os.rename(name, utf8_name)
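The three steps above translate to Python 3 roughly as follows (hypothetical helper; srcencoding is the encoding the old name is known to have been written in):

```python
import unicodedata

def macify_name(name_bytes: bytes, srcencoding: str) -> bytes:
    text = name_bytes.decode(srcencoding)      # step 1: interpret the old name
    nfd = unicodedata.normalize("NFD", text)   # step 2: HFS stores NFD
    return nfd.encode("utf-8")                 # step 3: bytes for os.rename
```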

Under your scheme I imagine this is amended. I would change your
listdir_b() function as follows:

  def listdir_b(bytestring, fse=None):
      if fse is None:
          fse = sys.getfilesystemencoding()
      string = bytestring.decode(fse, "python-escape")
      for fn in os.listdir(string):
          yield fn.encode(fse, "python-escape")

So, internally, os.listdir() takes a string and encodes it to an
_unspecified_ encoding in bytes, and opens the directory with that
byte string using POSIX opendir(3).

How does listdir() ensure that the byte string it passes to the underlying
opendir(3) is identical to 'bytestring' as passed to listdir_b()?

It seems from the PEP that "On POSIX systems, Python currently applies the
locale's encoding to convert the byte data to Unicode". Your extension
is to augment that by expressing the non-decodable byte sequences in a
non-conflicting way for reversal later, yes?
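(For reference, the reversal mechanism being described here is essentially what later shipped in Python 3 as the "surrogateescape" error handler; a minimal round-trip sketch:)

```python
# Undecodable bytes survive the decode/encode round trip unchanged.
raw = b'ok-\xff\xfe'                            # not valid UTF-8
name = raw.decode('utf-8', 'surrogateescape')   # escapes: '\udcff', '\udcfe'
assert name.encode('utf-8', 'surrogateescape') == raw
```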

That seems to double the complexity of my example application, since
it wants to interpret the original bytes in a caller-specified fashion,
not using the locale defaults.

So I must go:

  def macify(dirname, srcencoding):
      # I need this to reverse your encoding scheme
      fse = sys.getfilesystemencoding()
      # I'll pretend dirname is ready for use
      # it possibly has had to undergo the 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Lino Mastrodomenico
2009/4/22 Martin v. Löwis mar...@v.loewis.de:
 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.

Why not use U+DCxx for non-UTF-8 encodings too?

Overall I like the PEP: I think it's the best proposal so far that
doesn't put a heavy burden on applications that only want to do
simple things with the API.

-- 
Lino Mastrodomenico
--
http://mail.python.org/mailman/listinfo/python-list


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Martin v. Löwis
 Why not use U+DCxx for non-UTF-8 encodings too?

I thought of that, and was tricked into believing that only U+DC8x
is a half surrogate. Now I see that you are right, and have fixed
the PEP accordingly.

Regards,
Martin


Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread Cameron Simpson
On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote:
| File names, environment variables, and command line arguments are
| defined as being character data in POSIX;

Specific citation please? I'd like to check the specifics of this.

| the C APIs however allow
| passing arbitrary bytes - whether these conform to a certain encoding
| or not.

Indeed.

| This PEP proposes a means of dealing with such irregularities
| by embedding the bytes in character strings in such a way that allows
| recreation of the original byte string.
[...]

So you're proposing that all POSIX OS interfaces (which use byte strings)
interpret those byte strings into Python3 str objects, with a codec
that will accept arbitrary byte sequences losslessly and is totally
reversible, yes?

And, I hope, that the os.* interfaces silently use it by default.

| For most applications, we assume that they eventually pass data
| received from a system interface back into the same system
| interfaces. For example, and application invoking os.listdir() will
| likely pass the result strings back into APIs like os.stat() or
| open(), which then encodes them back into their original byte
| representation. Applications that need to process the original byte
| strings can obtain them by encoding the character strings with the
| file system encoding, passing python-escape as the error handler
| name.

-1

This last sentence kills the idea for me, unless I'm missing something.
Which I may be, of course.

POSIX filesystems _do_not_ have a file system encoding.

The user's environment suggests a preferred encoding via the locale
stuff, and apps honouring that will make nice looking byte strings as
filenames for that user. (Some platforms, like MacOSX' HFS filesystems,
_do_ enforce an encoding, and a quite specific variety of UTF-8 it is;
I would say they're not a full UNIX filesystem _precisely_ because they
reject certain byte strings that are valid on other UNIX filesystems.
What will your proposal do here? I can imagine it might cope with
existing names, but what happens when the user creates a new name?)

Further, different users can use different locales and encodings.
If they do it in different work areas they'll be perfectly happy;
if they do it in a shared area doubtless confusion will reign,
but only in the users' minds, not in the filesystem.

If I'm writing a general purpose UNIX tool like chmod or find, I expect
it to work reliably on _any_ UNIX pathname. It must be totally encoding
blind. If I speak to the os.* interface to open a file, I expect to hand
it bytes and have it behave. As an explicit example, I would be just fine
with python's open(filename, "w") to take a string and encode it for use,
but _not_ ok for os.open() to require me to supply a string and cross
my fingers and hope something sane happens when it is turned into bytes
for the UNIX system call.

I'm very much in favour of being able to work in strings for most
purposes, but if I use the os.* interfaces on a UNIX system it is
necessary to be _able_ to work in bytes, because UNIX file pathnames
are bytes.

If there isn't a byte-safe os.* facility in Python3, it will simply be
unsuitable for writing low level UNIX tools. And I very much like using
Python2 for that.

Finally, I have a small python program whose whole purpose in life
is to transcode UNIX filenames before transfer to a MacOSX HFS
directory, because of HFS's enforced particular encoding. What approach
should a Python app take to transcode UNIX pathnames under your scheme?

Cheers,
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

The nice thing about standards is that you have so many to choose from;
furthermore, if you do not like any of them, you can just wait for next
year's model.   - Andrew S. Tanenbaum


Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread Cameron Simpson
On 24Apr2009 09:27, I wrote:
| If I'm writing a general purpose UNIX tool like chmod or find, I expect
| it to work reliably on _any_ UNIX pathname. It must be totally encoding
| blind. If I speak to the os.* interface to open a file, I expect to hand
| it bytes and have it behave. As an explicit example, I would be just fine
| with python's open(filename, "w") to take a string and encode it for use,
| but _not_ ok for os.open() to require me to supply a string and cross
| my fingers and hope something sane happens when it is turned into bytes
| for the UNIX system call.
| 
| I'm very much in favour of being able to work in strings for most
| purposes, but if I use the os.* interfaces on a UNIX system it is
| necessary to be _able_ to work in bytes, because UNIX file pathnames
| are bytes.

Just to follow up to my own words here, I would be ok for all the
pure-byte stuff to be off in the posix module if os.* goes pure
character instead of bytes or bytes+strings.
-- 
Cameron Simpson c...@zip.com.au DoD#743
http://www.cskk.ezoshosting.com/cs/

... that, in a few years, all great physical constants will have been
approximately estimated, and that the only occupation which will be
left to men of science will be to carry these measurements to another
place of decimals.  - James Clerk Maxwell (1813-1879)
  Scientific Papers 2, 244, October 1871


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread James Y Knight


On Apr 22, 2009, at 2:50 AM, Martin v. Löwis wrote:


I'm proposing the following PEP for inclusion into Python 3.1.
Please comment.


+1. Even if some people still want a low-level bytes API, it's
important that the easy case be easy. That is: the majority of Python
applications should *just work, damnit* even with
not-properly-encoded-in-current-LC_CTYPE filenames. It looks like this
proposal accomplishes that, and does so in a relatively nice fashion.


James


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread MRAB

Martin v. Löwis wrote:

MRAB wrote:

Martin v. Löwis wrote:
[snip]

To convert non-decodable bytes, a new error handler python-escape is
introduced, which decodes non-decodable bytes using into a private-use
character U+F01xx, which is believed to not conflict with private-use
characters that currently exist in Python codecs.

The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.

If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.


If the byte stream happens to include a sequence which decodes to
U+F01xx, shouldn't that raise an exception?


I apparently have not expressed it clearly, so please help me improve
the text. What I mean is this:

- if the environment encoding (for lack of better name) is UTF-8,
  Python stops using the utf-8 codec under this PEP, and switches
  to the utf-8b codec.
- otherwise (env encoding is not utf-8), undecodable bytes get decoded
  with the error handler. In this case, U+F01xx will not occur
  in the byte stream, since no other codec ever produces this PUA
  character (this is not fully true - UTF-16 may also produce PUA
  characters, but they can't appear as env encodings).
So the case you are referring to should not happen.


I think what's confusing me is that you talk about mapping non-decodable
bytes to U+F01xx, but you also talk about decoding to half surrogate
codes U+DC80..U+DCFF.

If the bytes are mapped to single half surrogate codes instead of the
normal pairs (low+high), then I can see that decoding could never be
ambiguous and encoding could produce the original bytes.
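(That reading matches the behaviour that eventually shipped, as the "surrogateescape" handler: each invalid byte maps to a single lone low surrogate, never a valid surrogate pair, so decoding is unambiguous and encoding recovers the bytes. A sketch:)

```python
# One half surrogate per undecodable byte: 0x80 -> U+DC80, 0x81 -> U+DC81.
s = b'\x80\x81'.decode('utf-8', 'surrogateescape')
assert s == '\udc80\udc81'
# Re-encoding the lone surrogates regenerates the original bytes.
assert s.encode('utf-8', 'surrogateescape') == b'\x80\x81'
```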


PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
I'm proposing the following PEP for inclusion into Python 3.1.
Please comment.

Regards,
Martin

PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision: 71793 $
Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
Author: Martin v. Löwis mar...@v.loewis.de
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract


File names, environment variables, and command line arguments are
defined as being character data in POSIX; the C APIs however allow
passing arbitrary bytes - whether these conform to a certain encoding
or not. This PEP proposes a means of dealing with such irregularities
by embedding the bytes in character strings in such a way that allows
recreation of the original byte string.

Rationale
=

The C char type is a data type that is commonly used to represent both
character data and bytes. Certain POSIX interfaces are specified and
widely understood as operating on character data, however, the system
call interfaces make no assumption on the encoding of these data, and
pass them on as-is. With Python 3, character strings use a
Unicode-based internal representation, making it difficult to ignore
the encoding of byte strings in the same way that the C interfaces can
ignore the encoding.

On the other hand, Microsoft Windows NT has corrected the original
design limitation of Unix, and made it explicit in its system
interfaces that these data (file names, environment variables, command
line arguments) are indeed character data, by providing a
Unicode-based API (keeping a C-char-based one for backwards
compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a
byte-oriented one, and a character-oriented one, where the
character-oriented one would be limited to not being able to represent
all data accurately. Unfortunately, for Windows, the situation would
be exactly the opposite: the byte-oriented interface cannot represent
all data; only the character-oriented API can. As a consequence,
libraries and applications that want to support all user data in a
cross-platform manner have to accept a mish-mash of bytes and characters
exactly in the way that caused endless troubles for Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes
possible. The uniformity is achieved by using specific encoding
algorithms, meaning that the data can be converted back to bytes on
POSIX systems only if the same encoding is used.

Specification
=

On Windows, Python uses the wide character APIs to access
character-oriented APIs, allowing direct conversion of the
environmental data to Python str objects.

On POSIX systems, Python currently applies the locale's encoding to
convert the byte data to Unicode. If the locale's encoding is UTF-8,
it can represent the full set of Unicode characters, otherwise, only a
subset is representable. In the latter case, using private-use
characters to represent these bytes would be an option. For UTF-8,
doing so would create an ambiguity, as the private-use characters may
regularly occur in the input also.

To convert non-decodable bytes, a new error handler "python-escape" is
introduced, which decodes non-decodable bytes into a private-use
character U+F01xx, which is believed to not conflict with private-use
characters that currently exist in Python codecs.

The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.

If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Discussion
==

While providing a uniform API to non-decodable bytes, this interface
has the limitation that the chosen representation only works if the data
get converted back to bytes with the "python-escape" error handler
also. Encoding the data with the locale's encoding and the (default)
"strict" error handler will raise an exception; encoding them with UTF-8
will produce nonsensical data.

For most applications, we assume that they eventually pass data
received from a system interface back into the same system
interfaces. For example, an application invoking os.listdir() will
likely pass the result strings back into APIs like os.stat() or
open(), which then encode them back into their original byte
representation. Applications that need to process the original byte
strings can obtain them by encoding the character strings with the
file system encoding, passing "python-escape" as the error handler
name.
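(In released Python 3 this decode/encode pair is wrapped by os.fsdecode() and os.fsencode(), which apply the filesystem encoding together with the escape handler; a sketch of the lossless round trip:)

```python
import os

raw = b'caf\xe9'          # undecodable under a UTF-8 locale
name = os.fsdecode(raw)   # str, possibly containing escape characters
assert os.fsencode(name) == raw   # original bytes recovered exactly
```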

Copyright
=

This document has been placed in the public domain.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Nick Coghlan
Martin v. Löwis wrote:
 I'm proposing the following PEP for inclusion into Python 3.1.
 Please comment.

That seems like a much nicer solution than having parallel bytes/Unicode
APIs everywhere.

When the locale encoding is UTF-8, would UTF-8b also be used for the
command line decoding and environment variable encoding/decoding? (the
PEP currently only states that the encoding switch will be done for the
file system encoding - it is silent regarding the other two system
interfaces).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
---


Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread MRAB

Martin v. Löwis wrote:
[snip]

To convert non-decodable bytes, a new error handler python-escape is
introduced, which decodes non-decodable bytes using into a private-use
character U+F01xx, which is believed to not conflict with private-use
characters that currently exist in Python codecs.

The error handler interface is extended to allow the encode error
handler to return byte strings immediately, in addition to returning
Unicode strings which then get encoded again.

If the locale's encoding is UTF-8, the file system encoding is set to
a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
(which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.


If the byte stream happens to include a sequence which decodes to
U+F01xx, shouldn't that raise an exception?


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:

 I'm proposing the following PEP for inclusion into Python 3.1.
 Please comment.
 
 Regards,
 Martin
 
 PEP: 383
 Title: Non-decodable Bytes in System Character Interfaces
 Version: $Revision: 71793 $
 Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
 Author: Martin v. Löwis mar...@v.loewis.de
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 22-Apr-2009
 Python-Version: 3.1
 Post-History:
 
 Abstract
 
 
 File names, environment variables, and command line arguments are
 defined as being character data in POSIX; the C APIs however allow
 passing arbitrary bytes - whether these conform to a certain encoding
 or not. This PEP proposes a means of dealing with such irregularities
 by embedding the bytes in character strings in such a way that allows
 recreation of the original byte string.
 
 Rationale
 =
 
 The C char type is a data type that is commonly used to represent both
 character data and bytes. Certain POSIX interfaces are specified and
 widely understood as operating on character data, however, the system
 call interfaces make no assumption on the encoding of these data, and
 pass them on as-is. With Python 3, character strings use a
 Unicode-based internal representation, making it difficult to ignore
 the encoding of byte strings in the same way that the C interfaces can
 ignore the encoding.
 
 On the other hand, Microsoft Windows NT has correct the original

correct - corrected

 design limitation of Unix, and made it explicit in its system
 interfaces that these data (file names, environment variables, command
 line arguments) are indeed character data, by providing a
 Unicode-based API (keeping a C-char-based one for backwards
 compatibility).
 
 [...]
 
 Specification
 =
 
 On Windows, Python uses the wide character APIs to access
 character-oriented APIs, allowing direct conversion of the
 environmental data to Python str objects.
 
 On POSIX systems, Python currently applies the locale's encoding to
 convert the byte data to Unicode. If the locale's encoding is UTF-8,
 it can represent the full set of Unicode characters, otherwise, only a
 subset is representable. In the latter case, using private-use
 characters to represent these bytes would be an option. For UTF-8,
 doing so would create an ambiguity, as the private-use characters may
 regularly occur in the input also.
 
 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.

Would this mean that real private use characters in the file name would
raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
any error handler.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.

Then the error callback for encoding would become specific to the target
encoding. Would this mean that the handler checks which encoding is used
and behaves like strict if it doesn't recognize the encoding?

 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Is this done by the codec, or the error handler? If it's done by the
codec I don't see a reason for the python-escape error handler.

 Discussion
 ==
 
 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only works if the data
 get converted back to bytes with the python-escape error handler
 also.

I thought the error handler would be used for decoding.

 Encoding the data with the locale's encoding and the (default)
 strict error handler will raise an exception, encoding them with UTF-8
 will produce non-sensical data.
 
 For most applications, we assume that they eventually pass data
 received from a system interface back into the same system
 interfaces. For example, and application invoking os.listdir() will

and - an

 likely pass the result strings back into APIs like os.stat() or
 open(), which then encodes them back into their original byte
 representation. Applications that need to process the original byte
 strings can obtain them by encoding the character strings with the
 file system encoding, passing python-escape as the error handler
 name.

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
 correct - corrected

Thanks, fixed.

 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.
 
 Would this mean that real private use characters in the file name would
 raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
 any error handler.

The python-escape codec is only used/meaningful if the env encoding
is not UTF-8. For any other encoding, it is assumed that no character
actually maps to the private-use characters.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.
 
 Then the error callback for encoding would become specific to the target
 encoding.

Why would it become specific? It can work the same way for any encoding:
take U+F01xx, and generate the byte xx.

 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
 
 Is this done by the codec, or the error handler? If it's done by the
 codec I don't see a reason for the python-escape error handler.

utf-8b is a new codec. However, the utf-8b codec is only used if the
env encoding would otherwise be utf-8. For utf-8b, the error handler
is indeed unnecessary.

 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only works if the data
 get converted back to bytes with the python-escape error handler
 also.
 
 I thought the error handler would be used for decoding.

It's used in both directions: for decoding, it converts \xXX to
U+F01XX. For encoding, U+F01XX will trigger an error, which is then
handled by the handler to produce \xXX.
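(The decode direction Martin describes can be sketched with a custom error handler. The handler name "pua-escape" and the base point are illustrative only, not the PEP's actual implementation:)

```python
import codecs

PUA_BASE = 0xF0100  # assumed base: byte \xXX maps to U+F01XX

def pua_escape(exc):
    # Replace each undecodable byte with its private-use counterpart.
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        return (chr(PUA_BASE + byte), exc.start + 1)
    raise exc

codecs.register_error('pua-escape', pua_escape)

decoded = b'abc\xff'.decode('ascii', 'pua-escape')
assert decoded == 'abc\U000F01FF'   # \xff became U+F01FF
```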

 and - an

Thanks, fixed.

Regards,
Martin



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
MRAB wrote:
 Martin v. Löwis wrote:
 [snip]
 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.

 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

 If the byte stream happens to include a sequence which decodes to
 U+F01xx, shouldn't that raise an exception?

I apparently have not expressed it clearly, so please help me improve
the text. What I mean is this:

- if the environment encoding (for lack of better name) is UTF-8,
  Python stops using the utf-8 codec under this PEP, and switches
  to the utf-8b codec.
- otherwise (env encoding is not utf-8), undecodable bytes get decoded
  with the error handler. In this case, U+F01xx will not occur
  in the byte stream, since no other codec ever produces this PUA
  character (this is not fully true - UTF-16 may also produce PUA
  characters, but they can't appear as env encodings).
So the case you are referring to should not happen.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote:
 correct - corrected
 
 Thanks, fixed.
 
 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.
 Would this mean that real private use characters in the file name would
 raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
 any error handler.
 
 The python-escape codec is only used/meaningful if the env encoding
 is not UTF-8. For any other encoding, it is assumed that no character
 actually maps to the private-use characters.

Which should be true for any encoding from the pre-unicode era, but not
for UTF-16/32 and variants.

 The error handler interface is extended to allow the encode error
 handler to return byte strings immediately, in addition to returning
 Unicode strings which then get encoded again.
 Then the error callback for encoding would become specific to the target
 encoding.
 
 Why would it become specific? It can work the same way for any encoding:
 take U+F01xx, and generate the byte xx.

If any error callback emits bytes these byte sequences must be legal in
the target encoding, which depends on the target encoding itself.

However for the normal use of this error handler this might be
irrelevant, because those filenames that get encoded were constructed in
such a way that reencoding them regenerates the original byte sequence.

 If the locale's encoding is UTF-8, the file system encoding is set to
 a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes
 (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
 Is this done by the codec, or the error handler? If it's done by the
 codec I don't see a reason for the python-escape error handler.
 
 utf-8b is a new codec. However, the utf-8b codec is only used if the
 env encoding would otherwise be utf-8. For utf-8b, the error handler
 is indeed unnecessary.

Wouldn't it make more sense to be consistent how non-decodable bytes get
decoded? I.e. should the utf-8b codec decode those bytes to PUA
characters too (and refuse to encode them, so the error handler outputs
them)?

 While providing a uniform API to non-decodable bytes, this interface
 has the limitation that chosen representation only works if the data
 get converted back to bytes with the python-escape error handler
 also.
 I thought the error handler would be used for decoding.
 
 It's used in both directions: for decoding, it converts \xXX to
 U+F01XX. For encoding, U+F01XX will trigger an error, which is then
 handled by the handler to produce \xXX.

But only for non-UTF8 encodings?

Servus,
   Walter


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread M.-A. Lemburg
On 2009-04-22 22:06, Walter Dörwald wrote:
 Martin v. Löwis wrote:
 correct - corrected
 Thanks, fixed.

 To convert non-decodable bytes, a new error handler python-escape is
 introduced, which decodes non-decodable bytes using into a private-use
 character U+F01xx, which is believed to not conflict with private-use
 characters that currently exist in Python codecs.
 Would this mean that real private use characters in the file name would
 raise an exception? How? The UTF-8 decoder doesn't pass those bytes to
 any error handler.
 The python-escape codec is only used/meaningful if the env encoding
 is not UTF-8. For any other encoding, it is assumed that no character
 actually maps to the private-use characters.
 
 Which should be true for any encoding from the pre-unicode era, but not
 for UTF-16/32 and variants.

Actually it's not even true for the pre-Unicode codecs. It was and is common
for Asian companies to use company specific symbols in private areas
or extended versions of CJK character sets.

Microsoft even published an editor for Asian users to create their
own glyphs as needed:

http://msdn.microsoft.com/en-us/library/cc194861.aspx

Here's an overview for some US companies using such extensions:


http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=VendorUseOfPUA
(it's no surprise that most of these actually defined their own charsets)

SIL even started a registry for the private use areas (PUAs):

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA

This is their current list of assignments:


http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=SILPUAassignments

and here's how to register:


http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA#404a261e

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 22 2009)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
 The python-escape codec is only used/meaningful if the env encoding
 is not UTF-8. For any other encoding, it is assumed that no character
 actually maps to the private-use characters.
 
 Which should be true for any encoding from the pre-unicode era, but not
 for UTF-16/32 and variants.

Right. However, these can't appear as environment/file system encodings,
because they use null bytes.

 Why would it become specific? It can work the same way for any encoding:
 take U+F01xx, and generate the byte xx.
 
 If any error callback emits bytes these byte sequences must be legal in
 the target encoding, which depends on the target encoding itself.

No. The whole process started with data having an *invalid* encoding
in the source encoding (which, after the roundtrip, is now the
target encoding). So the python-escape error handler deliberately
produces byte sequences that are invalid in the environment encoding
(hence the additional permission of having it produce bytes instead
of characters).

 However for the normal use of this error handler this might be
 irrelevant, because those filenames that get encoded were constructed in
 such a way that reencoding them regenerates the original byte sequence.

Exactly so. The error handler is not of much use outside this specific
scenario.

 utf-8b is a new codec. However, the utf-8b codec is only used if the
 env encoding would otherwise be utf-8. For utf-8b, the error handler
 is indeed unnecessary.
 
 Wouldn't it make more sense to be consistent how non-decodable bytes get
 decoded? I.e. should the utf-8b codec decode those bytes to PUA
 characters too (and refuse to encode then, so the error handler outputs
 them)?

Unfortunately, that won't work. If the original encoding is UTF-8, and
uses PUA characters, then, on re-encoding, it's not possible to tell
whether to encode as a PUA character, or as an invalid byte.
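[Editor's note: the ambiguity can be made concrete. U+F01xx lies in the Plane 15 Private Use Area, so a legitimate file name may already contain such a character as valid UTF-8; under a scheme that also decodes the invalid byte 0xFF to U+F01FF, two different byte names collapse to the same string. The byte values below are illustrative:]

```python
# Two distinct byte sequences that a PUA-based scheme would decode to the
# same string '\U000F01FF' -- the information loss Martin describes.
invalid_byte = b'\xff'                        # invalid UTF-8; scheme maps it to U+F01FF
real_pua_char = '\U000F01FF'.encode('utf-8')  # genuine PUA character, valid UTF-8
assert real_pua_char == b'\xf3\xb0\x87\xbf'
assert invalid_byte != real_pua_char
# On re-encoding '\U000F01FF' there is no way to know which original to emit.
```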

This was my original proposal a year ago, and people immediately
suggested that it is not at all acceptable if there is the slightest
chance of information loss. Hence the current PEP.

 I thought the error handler would be used for decoding.
 It's used in both directions: for decoding, it converts \xXX to
 U+F01XX. For encoding, U+F01XX will trigger an error, which is then
 handled by the handler to produce \xXX.
 
 But only for non-UTF8 encodings?

Right. For ease of use, the implementation will specify the error
handler regardless, and the recommended use for applications will
be to use the error handler regardless. For utf-8b, the error
handler will never be invoked, since all input can be converted
always.
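[Editor's note: the claim that all input can always be converted is easy to check with the handler as it finally shipped ('surrogateescape'): decoding is total over arbitrary byte sequences, and the round-trip is exact.]

```python
# Every high byte decodes without error under the escape handler
# (strict UTF-8 would raise), and re-encoding restores the input exactly.
junk = bytes(range(128, 256))
s = junk.decode('utf-8', 'surrogateescape')          # never raises
assert s.encode('utf-8', 'surrogateescape') == junk  # exact round-trip
```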

Regards,
Martin



Re: PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread v+python
On Apr 21, 11:50 pm, Martin v. Löwis mar...@v.loewis.de wrote:
 I'm proposing the following PEP for inclusion into Python 3.1.
 Please comment.


Basically the scheme doesn't work.  Aside from that, it is very close.

There are tons of encoding schemes that could work... they don't have
to include half-surrogates or bytes.  What they have to do, is make
sure that they are uniformly applied to all appropriate strings.

The problem with this, and other preceding schemes that have been
discussed here, is that there is no means of ascertaining whether a
particular file name str was obtained from a str API, or was funny-
decoded from a bytes API... and thus, there is no means of reliably
ascertaining whether a particular filename str should be passed to a
str API, or funny-encoded back to bytes.

The assumption in the 2nd Discussion paragraph may hold for a large
percentage of cases, maybe even including some number of 9s, but it is
not guaranteed, and cannot be enforced, therefore there are cases that
could fail.  Whether those failure cases are a concern or not is an
open question.  Picking a character (I don't find U+F01xx in the
Unicode standard, so I don't know what it is) that is obscure, and
unlikely to be used in real file names, might help the heuristic
nature of the encoding and decoding avoid most conflicts, but provides
no guarantee that data puns will not occur in practice.  Today's
obscure character is tomorrow's commonly used character, perhaps.
Someone not on this list may be happily using that character for their
own nefarious, incompatible purpose.

As I realized in the email-sig, in talking about decoding corrupted
headers, there is only one way to guarantee this... to encode _all_
character sequences, from _all_ interfaces.  Basically it requires
reserving an escape character (I'll use ? in these examples -- yes, an
ASCII question mark -- happens to be illegal in Windows filenames so
all the better on that platform, but the specific character doesn't
matter... avoiding / \ and . is probably good, though).

So the rules would be, when obtaining a file name from the bytes OS
interface, that doesn't properly decode according to UTF-8, decode it
by placing a ? at the beginning, then for each decodable UTF-8
sequence, add a Unicode character -- unless the character is ?, in
which case you add two ??, and for each non-decodable byte sequence,
place a ? and two hex digits, or a ? and a half surrogate code, or a ?
and whatever gibberish you like.  Two hex digits are fine by me, and
will serve for this discussion.

ALSO, when obtaining a file name from the str OS interfaces, encode it
too... if it contains a ? at the front, it must be replaced by ???
and then any other ? in the name doubled.

Then you have a string that can/must be encoded to be used on either
str or bytes OS interfaces... or any other interfaces that want str or
bytes... but whichever they want, you can do a decode, or determine
that you can't, into that form.  The encode and decode functions
should be available for coders to use, that code to external
interfaces, either OS or 3rd party packages, that do not use this
encoding scheme.  This encoding scheme would be used throughout all
Python APIs (most of which would need very little change to
accommodate it).  However, programs would have to keep track of
whether they were dealing with encoded or unencoded strings, if they
use both types in their program (an example, is hard-coded file names
or file name parts).

The initial ? is not strictly necessary for this scheme to work, but I
think it would be a good flag to the user that this name has been
altered.

This scheme does not depend on assumptions about the use of file
names.

This scheme would be enhanced if the file name APIs returned a subtype
of str for the encoded names, but that should be considered only a
hint, not a requirement.

When encoding file name strings to pass to bytes APIs, the ? followed
by two hex digits would be converted to a byte.  Leading ? would be
dropped, and ?? would convert to ?.  I don't believe failures are
possible when encoding to bytes.

When encoding file name strings to pass to str APIs, the discovery
of ? followed by two hex digits would raise an exception, the file
name is not acceptable to a str API.  However, leading ? would be
dropped, and ?? would convert to ?, and if no ? followed by two hex
digits were found, the file name would be successfully converted for
use on the str API.
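[Editor's note: the rules above can be sketched in Python. This is a rough illustration of the proposal, not a tested implementation; the helper names are made up, and the unescape function assumes well-formed input.]

```python
def escape_name(raw: bytes) -> str:
    """Decode a bytes file name per the ?-escape scheme sketched above."""
    body = []
    altered = False
    i = 0
    while i < len(raw):
        for n in (1, 2, 3, 4):          # find the one valid UTF-8 unit at i
            try:                        # (UTF-8 is prefix-free, so at most
                ch = raw[i:i + n].decode('utf-8')   # one length succeeds)
            except UnicodeDecodeError:
                continue
            body.append('??' if ch == '?' else ch)  # double a literal ?
            i += n
            break
        else:                           # undecodable byte -> ? + two hex digits
            body.append('?%02X' % raw[i])
            altered = True
            i += 1
    text = ''.join(body)
    # A leading ? flags a name that was altered (or that began with ?).
    return ('?' + text) if (altered or text.startswith('?')) else text

def unescape_to_bytes(name: str) -> bytes:
    """Encode an escaped name back to bytes; assumes well-formed input."""
    if name.startswith('?'):
        name = name[1:]                 # drop the leading flag
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == '?':
            if name[i + 1] == '?':      # ?? -> literal ?
                out += b'?'
                i += 2
            else:                       # ?XX -> the raw byte
                out.append(int(name[i + 1:i + 3], 16))
                i += 3
        else:
            out += name[i].encode('utf-8')
            i += 1
    return bytes(out)

assert escape_name(b'a\xffb') == '?a?FFb'
assert unescape_to_bytes('?a?FFb') == b'a\xffb'
assert escape_name(b'?x') == '???x'     # leading ? -> flag plus doubled ?
assert unescape_to_bytes('???x') == b'?x'
```

As the post notes, encoding to bytes cannot fail; only the str-side check (rejecting names still containing ?XX escapes) can raise.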

Note that not even on Unix/Posix is it particularly easy nor useful to
place a ? into file names from command lines due to shell escapes,
etc.  The use of ? in file names also interferes with easy ability to
specifically match them in globs, etc.

Anything short of such an encoding of both types of interfaces, such
that it is known that all python-manipulated filenames will be
encoded, will have data puns that provide a potential for failure in
edge cases.

Note that in this scheme, no file names that are fully