Re: [Python-Dev] PEP 383 (again)

2009-04-27 Thread Martin v. Löwis
> PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode > strings in a reversible way. That isn't really true; it is not, inherently, about UTF-8. Instead, it tries to represent non-filesystem-encoding byte sequence in Unicode strings in a reversible way. > Quietly escaping a bad UTF-

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
> Does the PEP take into consideration the normalising behaviour of Mac > OSX ? We've had some ongoing challenges in bzr related to this with bzr. No, that's completely out of scope, AFAICT. I don't even know what the issues are, so I'm not able to propose a solution, at the moment. Regards, Mart

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson: On 27Apr2009 18:15, Glenn Linderman wrote: The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
James Y Knight wrote: > Hopefully it can be assumed that your locale encoding really is a > non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? > I'm a bit scared at the prospect that U+DCAF

[Python-Dev] PEP 383 (again)

2009-04-27 Thread Thomas Breuel
I thought PEP-383 was a fairly neat approach, but after thinking about it, I now think that it is wrong. PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode strings in a reversible way. But how do those non-UTF-8 byte sequences get into those path names in the first place? Most lik

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 8:39 PM, came the following characters from the keyboard of Martin v. Löwis: I'm not suggesting the PEP should solve the problem of mounting foreign file systems, although if it doesn't it should probably point that out. I'm just suggesting that if the people that writ

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Robert Collins
On Mon, 2009-04-27 at 22:25 -0700, Glenn Linderman wrote: > > Indeed, that was the missing piece. I'd forgotten about the > encodings > that use escape sequences, rather than UTF-8, and DBCS. I don't > think > those encodings are permitted by POSIX file systems, but I suppose > they > could s

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 8:35 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/27/2009 12:42 PM, came the following characters from the keyboard of Martin v. Löwis: It's a private use area. It will never carry an official charac

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread James Y Knight
On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote: No. You seem to assume that all bytes < 128 decode successfully always. I believe this assumption is wrong, in general: py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax Traceback (most recent call last): File "", line 1, in Unicode

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Tony Nelson writes: > At 16:09 + 04/27/2009, Antoine Pitrou wrote: > >Stephen J. Turnbull writes: > >> > >> I hate to break it to you, but most stages of mail processing have > >> very little to do with SMTP. In particular, processing MIME > >> attachments often requires dea

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Michael Foord writes: > The problem you don't address, which is still the reality for most > programmers (especially Mac OS X where filesystem encoding is UTF 8), is > that programmers *are* going to treat filenames as strings. > The proposed PEP allows that to work for them - whatever plat

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
>> I don't understand what you're saying. py3k filenames are all >> unicode, even on POSIX systems, > > > How is that possible on POSIX systems where the underlying file system > uses bytes for filenames? > > If I write a piece of Python code: > > filename = 'some path/some name' > > I m

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 21:58, Benjamin Peterson wrote: | 2009/4/27 Cameron Simpson : | > PROPOSAL: add to the PEP the following functions: [...] | > and for me, I would like to see: | >  os.setfilesystemencoding(coding) | > | > Currently os.getfilesystemencoding() returns you the encoding based on | > the c

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
> I'm not suggesting the PEP should solve the problem of mounting foreign > file systems, although if it doesn't it should probably point that out. > I'm just suggesting that if the people that write software to solve the > problem of mounting foreign file systems have already solved the naming >

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
Glenn Linderman wrote: > On approximately 4/27/2009 12:42 PM, came the following characters from > the keyboard of Martin v. Löwis: It's a private use area. It will never carry an official character assignment. >>> >>> I know that U+F - U+F is a private use area. I don't find a >

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Benjamin Peterson
2009/4/27 Cameron Simpson : > > PROPOSAL: add to the PEP the following functions: > >  os.fsdecode(bytes) -> funny-encoded Unicode >    This is what os.listdir() does to produce the strings it hands out. >  os.fsencode(funny-string) -> bytes >    This is what open(filename,..) does to turn the file
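
The decode/encode pair proposed in this message can be sketched with the error handler that PEP 383 describes. This is only a minimal sketch, assuming Python 3.1 or later and the handler name "surrogateescape"; the helper names funny_decode and funny_encode and the fixed ENCODING value are hypothetical, chosen for illustration, and are not the functions proposed in the thread.

```python
# Hypothetical helpers, for illustration only.
# Assumes the "surrogateescape" error handler (Python 3.1+).
ENCODING = "utf-8"  # assumed stand-in for the real filesystem encoding


def funny_decode(raw: bytes) -> str:
    """Decode bytes; each undecodable byte becomes a lone surrogate U+DC80..U+DCFF."""
    return raw.decode(ENCODING, "surrogateescape")


def funny_encode(name: str) -> bytes:
    """Reverse funny_decode, turning escaped surrogates back into the raw bytes."""
    return name.encode(ENCODING, "surrogateescape")


raw = b"caf\xe9.txt"      # 0xE9 followed by "." is not valid UTF-8
name = funny_decode(raw)  # a str that carries the bad byte as U+DCE9
assert funny_encode(name) == raw
```

Fixing the encoding in a constant keeps the sketch independent of the platform's actual filesystem encoding, which would otherwise change what counts as undecodable.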

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 18:15, Glenn Linderman wrote: > The problem with this, and other preceding schemes that have been > discussed here, is that there is no means of ascertaining whether a > particular file name str was obtained from a str API, or was funny- > decoded from a bytes API... a

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 5:42 PM, came the following characters from the keyboard of Cameron Simpson: I think that, almost independent of this PEP, there should be an os.fsencode() function that takes a byte string (as a POSIX OS call will take) and performs the _same_ byte->string encoding tha

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 2:14 PM, came the following characters from the keyboard of Cameron Simpson: On 27Apr2009 00:07, Glenn Linderman wrote: On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis: The problem with this, and other p

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Benjamin Peterson
2009/4/27 Cameron Simpson : > I think that, almost independent of this PEP, there should be an > os.fsencode() function that takes a byte string (as a POSIX OS call > will take) and performs the _same_ byte->string encoding that listdir() > and friends are doing under the hood. And a partner os.fsd

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 23:27, Simon Cross wrote: | On Mon, Apr 27, 2009 at 9:48 PM, "Martin v. Löwis" wrote: | > As Cameron says: it's out of the scope of the PEP. It really depends how | > the operating system deals with them. Most likely, the files are not | > accessible - not only not from Python, but a

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 21:48, Martin v. Löwis wrote: | >>> There are still issues regarding how Windows and POSIX programs that | >>> are sharing cross-mounted file systems might communicate file names | >>> between each other, which is not at all clear from the PEP. If this | >>> is an insoluble or un-

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Steven D'Aprano
On Tue, 28 Apr 2009 04:13:47 am Antoine Pitrou wrote: > Stephen J. Turnbull writes: ... > > So what you'll get here, AFAICS, is a new situation where many > > Windows-centric programmers will produce code that's incapable of > > dealing with non-Unicode input because they don't have to

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:48 PM, came the following characters from the keyboard of Martin v. Löwis: There are still issues regarding how Windows and POSIX programs that are sharing cross-mounted file systems might communicate file names between each other, which is not at all clear from th

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:42 PM, came the following characters from the keyboard of Martin v. Löwis: It's a private use area. It will never carry an official character assignment. I know that U+F - U+F is a private use area. I don't find a definition of U+F01xx to know what the not

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Michael Foord
Stephen J. Turnbull wrote: Antoine Pitrou writes: > > or (better for 2.x, where bytes are strings as far as most > > programmers are concerned) as a new data type, > > I'm -1 on any new string-like type (for file paths or whatever > else) with custom encoding/decoding semantics. It's the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Simon Cross writes: > > $ touch $'\xFF\xAA\xFF' > $ vi $'\xFF\xAA\xFF' > $ egrep foo $'\xFF\xAA\xFF' > > All worked fine from my Bash shell with locale encoding set to UTF-8. The PEP is precisely about making py3k able to better handle these files (right now os.listdir() doesn't retu

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
> $ touch $'\xFF\xAA\xFF' > $ vi $'\xFF\xAA\xFF' > $ egrep foo $'\xFF\xAA\xFF' > > All worked fine from my Bash shell with locale encoding set to UTF-8. > I can also open the created file from the GNOME editor file dialog (it > even tells me the filename is not valid in my locale's encoding). The

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Simon Cross
On Mon, Apr 27, 2009 at 9:48 PM, "Martin v. Löwis" wrote: > As Cameron says: it's out of the scope of the PEP. It really depends how > the operating system deals with them. Most likely, the files are not > accessible - not only not from Python, but also not accessible from > any other Unix program

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 27Apr2009 00:07, Glenn Linderman wrote: > On approximately 4/25/2009 5:22 AM, came the following characters from > the keyboard of Martin v. Löwis: >>> The problem with this, and other preceding schemes that have been >>> discussed here, is that there is no means of ascertaining whether a >>>

Re: [Python-Dev] 2.6.2 Vista installer failure on upgrade from 2.6.1

2009-04-27 Thread Martin v. Löwis
Jim Kleckner wrote: > I went to upgrade a Vista machine from 2.6.1 to 2.6.2 and got error 2755 > with the message "system cannot open the device or file". > > I uninstalled 2.6.1, removing all residual files also, and got the error > message again. > > When I ran msiexec as follows to get a log,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
>>> There are still issues regarding how Windows and POSIX programs that >>> are sharing cross-mounted file systems might communicate file names >>> between each other, which is not at all clear from the PEP. If this >>> is an insoluble or un-addressed issue, it should be stated. (It is >>> pr

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
>> It's a private use area. It will never carry an official character >> assignment. > > > I know that U+F - U+F is a private use area. I don't find a > definition of U+F01xx to know what the notation means. Are you picking > a particular character within the private use area, or a part

Re: [Python-Dev] UTF-8 Decoder

2009-04-27 Thread Antoine Pitrou
Jeroen Ruigrok van der Werven writes: > > So on medium and large datasets the decoder of Bjoern is very interesting, > but the tiny case (just Bjoern's name) is quite a tad bit slower. The other > cases seems more typical of what the average use in Python would be. Keep in mind wh

Re: [Python-Dev] UTF-8 Decoder

2009-04-27 Thread Jeroen Ruigrok van der Werven
-On [20090414 16:43], Antoine Pitrou (solip...@pitrou.net) wrote: >If you have some time on your hands, you could try benchmarking it against >Python 3.1's (py3k) decoder. There are two cases to consider: Bjoern actually did it himself already: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/#perfor

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Stephen J. Turnbull writes: > > Excuse me, but I can't see a scheme that encodes bytes as Unicodes but > only sometimes as a "clean separation". Yet it is. Filenames are all unicode, without exception, and there's no implicit conversion to bytes. That's a clean separation. > So what

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Tony Nelson
At 16:09 + 04/27/2009, Antoine Pitrou wrote: >Stephen J. Turnbull writes: >> >> I hate to break it to you, but most stages of mail processing have >> very little to do with SMTP. In particular, processing MIME >> attachments often requires dealing with file names. > >AFAIK, the fi

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Tony Nelson
At 23:39 -0700 04/26/2009, Glenn Linderman wrote: >On approximately 4/25/2009 5:35 AM, came the following characters from >the keyboard of Martin v. Löwis: >>> Because the encoding is not reliably reversible. >> >> Why do you say that? The encoding is completely reversible >> (unless we disagree on

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Antoine Pitrou writes: > > or (better for 2.x, where bytes are strings as far as most > > programmers are concerned) as a new data type, > > I'm -1 on any new string-like type (for file paths or whatever > else) with custom encoding/decoding semantics. It's the best way to > ruin the clean

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Paul Moore writes: > 2009/4/27 Stephen J. Turnbull : > > I believe there are solutions that don't have that problem. > > Specifically, if the return values were bytes, or (better for 2.x, > > where bytes are strings as far as most programmers are concerned) as a > > new data type, to indicate

[Python-Dev] 2.6.2 Vista installer failure on upgrade from 2.6.1

2009-04-27 Thread Jim Kleckner
I went to upgrade a Vista machine from 2.6.1 to 2.6.2 and got error 2755 with the message "system cannot open the device or file". I uninstalled 2.6.1, removing all residual files also, and got the error message again. When I ran msiexec as follows to get a log, it magically worked: msiexec

Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen
Hi Antoine, Antoine Pitrou writes: > Damien Diederen writes: >> I couldn't figure out a way to get rid of it short of multi-#including >> "templates" and playing with the C preprocessor, however, and have the >> nagging feeling the latter would be frowned upon by the maintainers

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Stephen J. Turnbull writes: > > I hate to break it to you, but most stages of mail processing have > very little to do with SMTP. In particular, processing MIME > attachments often requires dealing with file names. AFAIK, the file name is only there as an indication for the user whe

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Aahz
On Mon, Apr 27, 2009, Antoine Pitrou wrote: > Stephen J. Turnbull writes: >> >> If >> you see a broken encoding once, you're likely to see it a million times >> (spammers have the most broken software) or maybe have it raise an >> unhandled Exception a dozen times (in rate of using bu

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Paul Moore
2009/4/27 Stephen J. Turnbull : > I believe there are solutions that don't have that problem. > Specifically, if the return values were bytes, or (better for 2.x, > where bytes are strings as far as most programmers are concerned) as a > new data type, to indicate that they're not text until the cl

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Stephen J. Turnbull
Antoine Pitrou writes: > I'm not sure how mail being stuck in a pipeline has anything to do > with Martin's proposal (which deals with file paths, not with > SMTP...). I hate to break it to you, but most stages of mail processing have very little to do with SMTP. In particular, processing MIM

Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Antoine Pitrou
Damien Diederen writes: > > I couldn't figure out a way to get rid of it short of multi-#including > "templates" and playing with the C preprocessor, however, and have the > nagging feeling the latter would be frowned upon by the maintainers. > > There is a precedent with xmltok.

Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen
Hi Eric, "Eric Smith" writes: >> I couldn't figure out a way to get rid of it short of multi-#including >> "templates" and playing with the C preprocessor, however, and have the >> nagging feeling the latter would be frowned upon by the maintainers. > > Not sure if this is exactly what you mean,

Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Bob Ippolito
On Mon, Apr 27, 2009 at 7:25 AM, Damien Diederen wrote: > > Antoine Pitrou writes: >> Hello, >> >> We're in the process of forward-porting the recent (massive) json >> updates to 3.1, and we are also thinking of dropping remnants of >> support of the bytes type in the json library (in 3.1, again)

Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Eric Smith
> I couldn't figure out a way to get rid of it short of multi-#including > "templates" and playing with the C preprocessor, however, and have the > nagging feeling the latter would be frowned upon by the maintainers. Not sure if this is exactly what you mean, but look at Objects/stringlib. str.for

[Python-Dev] Windows buildbots failing test_types in trunk

2009-04-27 Thread Eric Smith
Mark Dickinson pointed out to me that the trunk buildbots are failing under Windows. After some analysis, I think this is because of a change I made to use _toupper in integer formatting. The correct solution to this is to implement issue 5793 to come up with a working, cross-platform, locale

Re: [Python-Dev] Dropping bytes "support" in json

2009-04-27 Thread Damien Diederen
Hello, Antoine Pitrou writes: > Hello, > > We're in the process of forward-porting the recent (massive) json > updates to 3.1, and we are also thinking of dropping remnants of > support of the bytes type in the json library (in 3.1, again). This > bytes support almost didn't work at all, but the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Antoine Pitrou
Stephen J. Turnbull writes: > > If > you see a broken encoding once, you're likely to see it a million times > (spammers have the most broken software) or maybe have it raise an > unhandled Exception a dozen times (in rate of using busted software, > the spammers are closely followed

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread R. David Murray
On Mon, 27 Apr 2009 at 01:40, Glenn Linderman wrote: Yes. My suggested use of ? is a visible character that is illegal in Windows file names, thus causing no valid Windows file names to be visually mangled. It is also a character that should be avoided in POSIX names because: 1) it is known t

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:55 AM, came the following characters from the keyboard of Cameron Simpson: On 26Apr2009 23:39, Glenn Linderman wrote: [...snip...] There are still issues regarding how Windows and POSIX programs that are sharing cross-mounted file systems might communicate file

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 26Apr2009 23:39, Glenn Linderman wrote: [...snip...] > There are still issues regarding how Windows and POSIX programs that are > sharing cross-mounted file systems might communicate file names between > each other, which is not at all clear from the PEP. If this is an > insoluble or un-

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis: The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or wa