Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
> However, Martin, I can promise you that I will _never_ ask for any
> convenience functions related to bytes as a result of this decision.

:-)

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
> Sorry, maybe I'm just being thick here, but I don't understand how that 
> is possible. On the physical disk, each Windows file name must be 
> represented by a byte string, yes? So how is it possible that there are 
> Windows files with names that can't be represented as a byte string? 
> What have I missed?

That we are not really free to choose the representation when using
byte strings. Microsoft has defined how char* (i.e. byte strings) is
to be interpreted when it is used as a file name, namely in the ANSI
code page. That code page is not capable of representing all file
names.

We could, for example, use the same representation as is used on disk.
However,
a) there is no API to find out what that representation is, and
b) it is not null-byte free, a property often desired for file names,
   and
c) because it contains null bytes, it won't be easy to display such
   file names on stdout, or in a GUI window.
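
As a rough illustration of points (b) and (c), and of the code-page
limit, here is a minimal Python 3 sketch. The file name is invented for
the example, and UTF-16-LE is only assumed as the on-disk form (as
suggested elsewhere in this thread); it is not something an API reports.

```python
# Illustrative only: the file name below is invented for the example.
name = "r\u00e9sum\u00e9-\u65e5\u672c.txt"  # Latin-1 letters plus two kanji

# UTF-16-LE stores each ASCII character as that byte followed by a zero
# byte, so the encoded name is full of null bytes.
on_disk = name.encode("utf-16-le")
has_nulls = b"\x00" in on_disk

# A C-style char* reader would stop at the first null byte.
c_view = on_disk.split(b"\x00", 1)[0]

# The Western European ANSI code page (cp1252) has no kanji, so the
# name cannot be expressed as an ANSI byte string at all.
try:
    name.encode("cp1252")
    ansi_ok = True
except UnicodeEncodeError:
    ansi_ok = False

print(has_nulls, c_view, ansi_ok)
```

The point of the sketch is only that neither candidate byte form is
usable: the wide form is not null-free, and the ANSI form cannot hold
every name.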

Regards,
Martin


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread glyph

On 03:32 am, [EMAIL PROTECTED] wrote:

On Sep 30, 2008, at 10:06 PM, [EMAIL PROTECTED] wrote:



> Can you clarify what proposal you are supporting for Python:


Sure.  Neither of your descriptions is terribly accurate, but I'll try 
to explain.

> 1) Two sets of APIs, one returning unicode strings, and one returning
> bytestrings. (subpoints: what does the unicode-returning API do when
> it cannot decode the bytestring into unicode? raise exception, pretend
> argument/envvar/file didn't exist/?)


The only API discussed so far which would actually provide two variants 
is 'getcwd', which would have a 'getcwdb' that gives back bytes instead.


Pretty much every other API takes some kind of input.  listdir(bytes) 
would give back bytes, while listdir(text) would give back text. 
listdir(text) would skip undecodable filenames.


Similarly for all the other APIs in os and os.path that take pathnames 
for input.
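
A minimal sketch of the "type in, same type out" rule described above,
assuming a Python 3 release that provides os.getcwdb() and os.fsencode().
The temporary directory and file name are scaffolding for the example.
Note that it does not demonstrate the skipping of undecodable names;
that part was a proposal in this thread, and later releases may treat
such names differently.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name.
    with open(os.path.join(tmp, "hello.txt"), "w") as f:
        f.write("x")

    text_names = os.listdir(tmp)                # str in, str out
    bytes_names = os.listdir(os.fsencode(tmp))  # bytes in, bytes out

cwd_text = os.getcwd()    # str variant
cwd_bytes = os.getcwdb()  # bytes variant, the one true "second API"

print(text_names, bytes_names)
```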

> 2) All APIs return bytestrings only. Converting to unicode is
> considered lossy, and would have to be done by applications for
> display purposes only.


This is a bad way to do things, because on Windows, filenames *really 
are* unicode.  Converting to bytes is what's lossy.  (See previous 
discussion of active codepages and CreateFileA/CreateFileW.)

> I really don't understand the reasoning for (1).


The reasoning is that a lot of software doesn't care if it's wrong for 
edge cases, it's really hard to come up with something that's correct 
with respect to all of those edge cases (absurdly difficult, if you need 
to stay in the straightjacket of string / bytes types, as well as 
provide a useful library interface - which is why we're having this 
discussion).  But, it should be _possible_ to write software that's 
correct in the face of those edge cases.


And - let's not forget this - the worlds of POSIX and Windows really are 
different and really do require subtly different inputs.  Python can try 
to paper over this like Java does and make it impossible to write 
certain classes of application, or it can just provide an ugly, slightly 
inconsistent API that exposes the ugly, slightly inconsistent reality. 
Modulo the issues you've raised which I don't think the proposal totally 
covers yet (abspath with a non-decodable cwd) I think it strikes a nice 
balance; allow people to live in the delusion of unicode-on-POSIX and 
have software that mostly works, most of the time, or allow them to face 
the unpleasantness and spend the effort to get something really solid.


I think the _right_ answer to all of this is to (A) make FilePath work 
completely correctly for every totally insane edge case ever, and (B) 
include it in the stdlib.  One day I think we'll do that.  But nobody 
has the time or energy to do even the first part of that *right now*, 
before 3.0 is released, so I'm just looking for something which it will 
be possible to build FilePath, or something like it, on top of, without 
breaking other people's applications who rely on the os module directly 
too badly.

> It seems to me that most software (probably including all of the
> Python stdlib) would continue to use the unicode string API.


That's true.  And that software wouldn't handle these edge cases 
completely correctly.  As Guido put it, "it's a quality of 
implementation issue".

> Switching all of the Python stdlib to use the bytestring APIs instead
> would certainly be a large undertaking, and would have all sorts of
> ripple-on API changes (e.g. __file__).


I am not quite sure what to do about __file__.  My preference would 
probably be to use unicode filename for consistency so it can always be 
displayed, but provide a second attribute (__open_file__?) that would be 
sometimes unicode, sometimes bytes, which would be guaranteed to work 
with open().  I suspect that most software which interacts with __file__ 
on a deep level would be of the variety which would deal with the edge 
cases.


But where the Python stdlib wants a pathname it should be accepting 
either bytes or unicode, as all of the os.path functions want.  This 
does kind of suck, but the alternatives are to encode crazy extra 
information in unicode path names that cannot be exchanged with other 
programs (or with users: NULL is potentially the worst bogus character 
from a UI perspective), or revert to bytes for everything (which is a 
non-solution, cf. Windows above).

> So I can only imagine that if you're proposing (1), you're doing so
> without the intention of suggesting that Python be converted to use
> it.


Maybe updating the stdlib to be correct in the face of such changes is 
hard, but it doesn't seem intractable.  Taken together, it looks like 
there are only about 100 calls in the stdlib to both getcwd and abspath 
together, and I suspect many of them are for purely aesthetic purposes 
and could just be eliminated, and many of them are redefinitions of the 
functions and don't need any changes.


All the other path manipulation functions would continue to work 

Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Terry Reedy

Guido van Rossum wrote:


> No, that's because bytes is missing from the explicit list of
> allowable types in io.open. Victor has a one-line trivial patch for
> this. Could you try this though?
>
> import _fileio
> _fileio._FileIO(b'tem')


>>> import _fileio
>>> _fileio._FileIO(b'tem')
_fileio._FileIO(3, 'r')
>>>
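
For comparison, a small sketch of the same check made through the
public open() rather than the private _fileio module used above. It
assumes a Python 3 release in which open() accepts bytes paths; the
temporary directory is only scaffolding so the example cleans up after
itself.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Build a path made of bytes only.
    bytes_path = os.path.join(os.fsencode(tmp), b"tem")

    with open(bytes_path, "w") as f:   # bytes filename accepted for writing
        f.write("hello")
    with open(bytes_path) as f:        # and for reading
        content = f.read()

print(type(bytes_path).__name__, content)
```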



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread James Y Knight


On Sep 30, 2008, at 10:06 PM, [EMAIL PROTECTED] wrote:
> However, Martin, I can promise you that I will _never_ ask for any
> convenience functions related to bytes as a result of this
> decision.  I want bytes to come back from filesystem APIs because I
> intend to have a wrapper layer which knows two things about the
> file: the bytes (which are needed to talk to POSIX filesystem APIs)
> and the characters (which are computed from those bytes, can be
> safely renormalized, displayed to users, etc).  On Windows this
> filesystem wrapper will necessarily behave differently, and will not
> use bytes for anything.  Any formatting beyond joining path segments
> together and possibly splitting extensions off will be done on
> character strings, not byte strings.


Can you clarify what proposal you are supporting for Python:

1) Two sets of APIs, one returning unicode strings, and one returning  
bytestrings. (subpoints: what does the unicode-returning API do when  
it cannot decode the bytestring into unicode? raise exception, pretend  
argument/envvar/file didn't exist/?)


or

2) All APIs return bytestrings only. Converting to unicode is  
considered lossy, and would have to be done by applications for  
display purposes only.


I really don't understand the reasoning for (1). It seems to me that  
most software (probably including all of the Python stdlib) would  
continue to use the unicode string API. Switching all of the Python  
stdlib to use the bytestring APIs instead would certainly be a large  
undertaking, and would have all sorts of ripple-on API changes (e.g.  
__file__). So I can only imagine that if you're proposing (1), you're  
doing so without the intention of suggesting that Python be converted  
to use it.


And so, of course, that doesn't really fix things (such as getcwd  
failing if your cwd is a path that is undecodeable in the current  
locale, or well, currently, python refusing to even start).


If you're proposing (2), it's at least as large an undertaking as (1)  
+ converting Python to use the optional bytestring APIs. But at least  
it avoids exposing an API that people ought not use, and does make it  
obvious what still needs to be fixed: the unfixed code simply won't  
run at all.


> The proposal of using U+ seems like it would have been almost
> the same from such a wrapper's perspective, except (A) people using
> the filesystem APIs without the benefit of such a wrapper would have
> been even more screwed


I'm not sure what your "more screwed" is comparing against: current  
py3k behavior? (aka: decoding to Unicode in locale's specified  
encoding)? I don't see how you can really be more screwed than that:  
not only can't you send your filename to display in a Gtk+ button, you  
can't access it at all, even staying within python.


> and (B) there are a few nasty corner-cases when dealing with
> surrogate (i.e. invalid, in UTF-8) code points which I'm not quite
> sure what it would have done with.


The lone-surrogate-pair proposal was a totally different proposal than  
the U+ one.


James


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Adam Olsen
On Tue, Sep 30, 2008 at 8:06 PM,  <[EMAIL PROTECTED]> wrote:
> The proposal of using U+ seems like it would have been almost the same
> from such a wrapper's perspective, except (A) people using the filesystem
> APIs without the benefit of such a wrapper would have been even more
> screwed, and (B) there are a few nasty corner-cases when dealing with
> surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what
> it would have done with.

Surrogates in UTF-8 *should* be treated as errors, but current Python
is far too lax.  That actually leads to another problem: improving
validation will change what gets escaped and what doesn't.

http://bugs.python.org/issue3297
http://bugs.python.org/issue3672
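
A minimal sketch of what strict treatment looks like, assuming a recent
Python 3 release (the interpreter under discussion in this thread was
laxer, as noted above). The three bytes are what a lax encoder would
emit for the lone surrogate U+D800.

```python
# Bytes a lax encoder would produce for the lone high surrogate U+D800.
surrogate_bytes = b"\xed\xa0\x80"

# A strict UTF-8 decoder rejects them.
try:
    surrogate_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# A strict UTF-8 encoder refuses to emit them.
try:
    "\ud800".encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False

print(decoded_ok, encoded_ok)
```

Any escaping scheme that keys off "bytes the decoder rejects" therefore
changes its output when the decoder becomes stricter.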



-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread glyph

On 30 Sep, 09:22 pm, [EMAIL PROTECTED] wrote:
On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> 
wrote:

Guido van Rossum wrote:
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" 
<[EMAIL PROTECTED]> wrote:



> Martin, I don't understand why you are in favor of storing raw bytes
> encoded as Latin-1 in Unicode string objects, which clearly gives rise
> to mojibake.


This is my word of the day, by the way.  Reading this whole thread was 
_totally_ worth it to learn about "mojibake".  Obviously I'm familiar 
with the phenomenon but somehow I'd never heard this awesome term 
before.

> I am also encouraged by Glyph's support for (a). He has a lot of
> practical experience.


Thanks for the vote of confidence.  I hope for all our sakes that you're 
not over-valuing that experience ;-).


For what it's worth, I can see MvL's point in that I think there is some 
danger in generating confusion by adding _too many_ string-like 
functions to the bytes type.  I don't want my suggestion to contribute 
to the confusion between bytes and text.


However, Martin, I can promise you that I will _never_ ask for any 
convenience functions related to bytes as a result of this decision.  I 
want bytes to come back from filesystem APIs because I intend to have a 
wrapper layer which knows two things about the file: the bytes (which 
are needed to talk to POSIX filesystem APIs) and the characters (which 
are computed from those bytes, can be safely renormalized, displayed to 
users, etc).  On Windows this filesystem wrapper will necessarily behave 
differently, and will not use bytes for anything.  Any formatting beyond 
joining path segments together and possibly splitting extensions off 
will be done on character strings, not byte strings.
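
A rough sketch of what such a wrapper layer could look like on POSIX.
Everything here is hypothetical: the class name PathName, its methods,
the default of UTF-8 and the use of U+FFFD for display are assumptions
made for illustration, not FilePath's actual design.

```python
class PathName:
    """Hypothetical wrapper: raw bytes for the OS, text for people."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self.raw = raw  # exact bytes, handed back to filesystem calls
        # Text is computed from the bytes; undecodable bytes become
        # U+FFFD so the result is always safe to display.
        self.text = raw.decode(encoding, errors="replace")

    def child(self, segment: bytes) -> "PathName":
        # Joining path segments is the only operation done on bytes.
        return PathName(self.raw.rstrip(b"/") + b"/" + segment)

    def __str__(self) -> str:
        return self.text


good = PathName(b"/home/user").child(b"caf\xc3\xa9.txt")
bad = PathName(b"/home/user").child(b"caf\xe9.txt")  # Latin-1 byte, not UTF-8
print(good, bad, bad.raw)
```

The display text of the second path is lossy, but its raw bytes are
intact, so the file can still be opened.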


The proposal of using U+ seems like it would have been almost the 
same from such a wrapper's perspective, except (A) people using the 
filesystem APIs without the benefit of such a wrapper would have been 
even more screwed, and (B) there are a few nasty corner-cases when 
dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm 
not quite sure what it would have done with.


Guido already mentioned "libraries" as a hypothetical issue, but here's 
a real-world problem that results from putting NULLs into filenames. 
Consider this program:


   import gtk
   w = gtk.Window()
   b = gtk.Button(u"\u/hello/world")
   w.add(b)
   w.show_all()
   gtk.main()

which emits this message:
   TypeError: OGtkButton.__init__() argument 1 must be string without 
null bytes or None, not unicode


SQLite has a similar problem with NULLs, and I'm definitely sticking 
paths in there, too.


Eventually I'd like to propose such a path type for inclusion in the 
stdlib, but that will have to wait for issues like 
 to be resolved.


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread glyph

On 30 Sep, 09:37 pm, [EMAIL PROTECTED] wrote:

On Tue, Sep 30, 2008 at 11:42 AM,  <[EMAIL PROTECTED]> wrote:
>> There are other ways to glean this knowledge; for example, looking at the
>> 'iocharset' or 'nls' mount options supplied to mount various filesystems.



> I know we could do a better job, but absent anyone who knows what
> they're doing we've chosen a fairly conservative approach. I certainly
> hope that someone will contribute some mean encoding-guessing code to
> the stdlib that users can use. I'm not sure if I'll ever endorse doing
> this automatically in io.open(), though I'd be fine with a convention
> like passing encoding="guess".


I think the conservative approach is actually correct, or rather, as 
close to correct as it is possible to get in this mess.  Inspecting 
these fantastically obscure options is only likely to be helpful in a 
tool which tries to correct filesystem encoding errors on legacy data. 
I wouldn't even know about them if I hadn't written several such tools 
(well, just little scripts, really) in the past.  I was just verifying 
that I wasn't missing some "right way" which would let someone else do 
the guesswork for me.


In reality, you have two options for filesystem encoding on Linux:

 * UTF-8
 * fall in a well and die

The OS will happily let you create a completely nonsensical environment 
where no application can possibly do anything reasonable: set LC_ALL to 
KOI8-R, mount your USB keychain as Shift_JIS and your Windows partition 
as ISO-8859-8.  Of course nobody would actually _do_ this, because they 
want things to work, so everything is gradually evolving to a default of 
UTF-8 everywhere.  In practice, however, there are still problems with 
CIFS/SMB shares where other clients have different ideas about encoding. 
I've experienced this most commonly when sharing with Macs, which have 
very particular and different ideas about normalization, as has already 
been discussed in this thread.



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Greg Ewing

M.-A. Lemburg wrote:

> In the end, I think it's better not to be clever and just return
> the filenames that cannot be decoded as bytes objects in os.listdir().


But since it's a rare occurrence, most applications are
just going to ignore the issue, and then fail unexpectedly
one day on some unsuspecting user that doesn't have the
inclination to go diving into the code to fix it.

--
Greg



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Victor Stinner
Le Wednesday 01 October 2008 00:28:22 Martin v. Löwis, vous avez écrit :
> I don't think we will manage to release Python 3.0 this year if that
> change is to be implemented. And then, I don't think the release manager
> will agree to such a delay.

The minimum change is to disallow bytes/str mix:
 - os.listdir(unicode)->unicode and ignore invalid files
   (current behaviour is to return unicode and bytes)
 - os.readlink(unicode)->unicode or raise an error
   (current behaviour is to return unicode or bytes)
 - remove os.getcwdu() (use its code -which is better- for getcwd) 
   and fix the test_unicode_file.py

listdir() change (ignore invalid filenames) is important to avoid strange bugs 
in os.path.*(), glob.*() or on displaying a filename.
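
A minimal sketch of the "ignore invalid filenames" behaviour, written as
a pure function so it runs anywhere. The helper name decodable_names,
the UTF-8 default and the sample list are assumptions for illustration;
this is not the patch itself.

```python
def decodable_names(raw_names, encoding="utf-8"):
    """Return only the names that decode cleanly; skip the rest."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # undecodable name: leave it out of the text view
    return result


# Stand-in for what a bytes-level listdir might return on POSIX.
sample = [b"ok.txt", b"caf\xc3\xa9.txt", b"caf\xe9.txt"]
names = decodable_names(sample)
print(names)
```

Every element of the result is text, so later calls into os.path or
glob never see a mix of str and bytes.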

I can generate a specific patch for these issues. It's just a subset of my 
last patch.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Michael Urman
On Tue, Sep 30, 2008 at 7:04 PM, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
>> I believe on disk it uses UTF-16.
>
> Which is made up of bytes. There may be byte sequences that are illegal
> UTF-16, but that's not what Martin said. I don't understand how there
> can be UTF-16 sequences which don't correspond to some sequence of
> bytes. How would they be represented in memory? Is this to do with the
> endianness of the UTF-16 sequence?

It has to do with the internal mapping between the ANSI and Unicode
functions. On NT systems, CreateFileA will map the ANSI bytestring to
a Unicode filename via the active code page, and call CreateFileW
accordingly. The active code page cannot be set to something as useful
as UTF-8, so given any actual code page (1252, 932, etc.) there are
Unicode strings that cannot be represented with a bytestring provided
to the ANSI function.
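
A small sketch of that last point, using Python's codecs as stand-ins
for the Windows code pages. The file name is invented; it mixes a
Hebrew letter with a kanji so that no single one of the three code
pages tried here can hold it.

```python
# Invented name mixing a Hebrew letter and a Japanese character.
name = "\u05e9\u65e5.txt"


def fits(code_page: str) -> bool:
    """True if the whole name can be encoded in the given code page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# Western European, Japanese and Hebrew code pages respectively.
results = {cp: fits(cp) for cp in ("cp1252", "cp932", "cp1255")}
print(results)
```
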
-- 
Michael Urman


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Steven D'Aprano
On Wed, 1 Oct 2008 09:21:37 am you wrote:
> On Tue, Sep 30, 2008 at 4:08 PM, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> > On Wed, 1 Oct 2008 07:40:01 am Martin v. Löwis wrote:
> >> >> On Windows, we might reject bytes filenames for all file
> >> >> operations: open(), unlink(), os.path.join(), etc. (raise a
> >> >> TypeError or UnicodeError)
> >> >
> >> > Since I've seen no objections to this yet: please no. If we
> >> > offer a "lower-level" bytes filename API, it should work for all
> >> > platforms.
> >>
> >> Unfortunately, it can't. You cannot represent all possible file
> >> names in a byte string in Windows (just as you can't do so in a
> >> Unicode string on Unix).
> >
> > Sorry, maybe I'm just being thick here, but I don't understand how
> > that is possible. On the physical disk, each Windows file name must
> > be represented by a byte string, yes? So how is it possible that
> > there are Windows files with names that can't be represented as a
> > byte string? What have I missed?
>
> I believe on disk it uses UTF-16.

Which is made up of bytes. There may be byte sequences that are illegal 
UTF-16, but that's not what Martin said. I don't understand how there 
can be UTF-16 sequences which don't correspond to some sequence of 
bytes. How would they be represented in memory? Is this to do with the 
endianness of the UTF-16 sequence?


-- 
Steven


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 4:08 PM, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> On Wed, 1 Oct 2008 07:40:01 am Martin v. Löwis wrote:
>> >> On Windows, we might reject bytes filenames for all file
>> >> operations: open(), unlink(), os.path.join(), etc. (raise a
>> >> TypeError or UnicodeError)
>> >
>> > Since I've seen no objections to this yet: please no. If we offer a
>> > "lower-level" bytes filename API, it should work for all platforms.
>>
>> Unfortunately, it can't. You cannot represent all possible file names
>> in a byte string in Windows (just as you can't do so in a Unicode
>> string on Unix).
>
> Sorry, maybe I'm just being thick here, but I don't understand how that
> is possible. On the physical disk, each Windows file name must be
> represented by a byte string, yes? So how is it possible that there are
> Windows files with names that can't be represented as a byte string?
> What have I missed?

I believe on disk it uses UTF-16.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Steven D'Aprano
On Wed, 1 Oct 2008 07:40:01 am Martin v. Löwis wrote:
> >> On Windows, we might reject bytes filenames for all file
> >> operations: open(), unlink(), os.path.join(), etc. (raise a
> >> TypeError or UnicodeError)
> >
> > Since I've seen no objections to this yet: please no. If we offer a
> > "lower-level" bytes filename API, it should work for all platforms.
>
> Unfortunately, it can't. You cannot represent all possible file names
> in a byte string in Windows (just as you can't do so in a Unicode
> string on Unix).


Sorry, maybe I'm just being thick here, but I don't understand how that 
is possible. On the physical disk, each Windows file name must be 
represented by a byte string, yes? So how is it possible that there are 
Windows files with names that can't be represented as a byte string? 
What have I missed?



-- 
Steven


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Jack Jansen


On  1-Oct-2008, at 00:32 , Martin v. Löwis wrote:



>> How does windows (and Python on windows) handle NFC versus NFD
>> issues?
>
> That's left to the application.
>
>> Can I have two files called "ümlaut.txt", one in NFD and one NFC
>> form?
>
> Yes, you can. It sounds confusing, but only in a theoretical way. You
> never have combining characters on Windows (at least, I don't). The
> keyboard input defaults to NFC, and users normally don't type file
> names, anyways, except when creating the files - later, they just use
> the mouse to indicate what file they want to act on.
>
>> And are both of those representable on the Python side (i.e. can they
>> both be returned from listdir() and passed to open())?
>
> Certainly!
>
>> If I compare
>> these two filenames, do they compare differently?
>
> Certainly!


Actually, that all sounds pretty non-confusing to me:-)

So, normal users will always have the one form, and if by chance they  
get the other form they can still use the file. Also from Python, even  
when doing listdir() and then open(), everything will work just as  
expected. That there are two files that have a similar visual  
representation is not too bad, the same happens with ellipses versus  
dot-dot-dot and many other cases.


Which means the only problem area left is unix filesystems (whether on  
Linux or mounted remotely on MacOS or whatever), where filenames are  
really byte strings with only / and nul illegal.




--
Jack Jansen, <[EMAIL PROTECTED]>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma  
Goldman





Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread James Y Knight

On Sep 30, 2008, at 6:21 PM, Martin v. Löwis wrote:

> IOW, Java hasn't solved the problem in the last 10 years.


Java is already really bad at being a small little language to write  
cooperating tools in. I'd never even attempt to write a little  
pipeline filter in Java -- I've already pretty much learned to expect  
Java applications to be in their own world, so I'd hardly find it  
surprising if a Java app could only read files it wrote itself,  
never mind files in odd encodings.


Python, on the other hand, is an awesome tool for writing small little  
scripts that interact well with the surrounding environment, Just The  
Way It Is, without trying to layer so much abstraction upon it so that  
you lose functionality. Moving away from that would be unfortunate.


James


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 3:21 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>>> My concern still is that it brings the bytes type into the status of
>>> another character string type, which is really bad, and will require
>>> further modifications to Python for the lifetime of 3.x.
>>
>> I'd like to understand why this is "really bad". I thought it was by
>> design that the str and bytes types behave pretty similarly. You can
>> use both as dict keys.
>
> If they have to behave pretty similarly, they have to be supported in
> all APIs that deal with text.

I don't see how you get from "pretty similarly" to "all APIs". :-)

> For example, people will demand that
> printing bytes should just copy them onto the stream (rather than
> invoking repr()), and writing them onto a text stream should work the
> same way. GUI library should support them, the XML libraries, and so
> on.
>
> Where will you stop, and tell people that bytes are just not supposed
> to do this or that?

Printing a bytes object already works, and displays its repr(), which
is guaranteed to be pure ASCII (unlike the repr() of a unicode str
object in Py3k). All the others you mention will cause breakage as
they should -- these errors exist to force the programmer to think
about encodings or conversions. I don't see that as a big burden
because the only way there could be bytes here in the first place is
when the user explicitly requested bytes. A program that only ever
passes text strings to the os module is only ever going to get text
strings back.
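
A minimal sketch of both halves of that paragraph, assuming Python 3:
the repr of a bytes object is plain ASCII, and mixing bytes with text in
a path operation fails instead of guessing an encoding. The file name is
invented for the example.

```python
import os.path

raw = b"caf\xc3\xa9.txt"

# The repr of bytes escapes every non-ASCII byte, so it is pure ASCII.
shown = repr(raw)
is_ascii = all(ord(ch) < 128 for ch in shown)

# Mixing bytes and str in a path operation fails loudly, which forces
# the programmer to choose an encoding or a conversion.
try:
    os.path.join(raw, "subdir")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(shown, is_ascii, mixed_ok)
```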

>>> This is because applications will then regularly use byte strings for
>>> file names on Unix, and regular strings on Windows, and then expect
>>> the program to work the same without further modifications.
>>
>> It seems that bytes arguments actually *do* work on Windows -- somehow
>> they get decoded. (Unless Terry's report was from 2.x.)
>
> To a limited degree - see my other message. Don't try to listdir a
> directory with characters outside CP_ACP (it will give you invalid
> file names).

Understood.

>> Actually something like that may not be a bad idea. Ian Bicking's
>> webob supports similar double APIs for getting the request parameters
>> out of a request object; I believe request.GET['x'] is a text object
>> and request.GET_str['x'] is the corresponding uninterpreted bytes
>> sequence. I would prefer to have os.environb over os.environ[b"PATH"]
>> though.
>
> And would you keep them synchronized?

Yes, the bytes versions would be the canonical version and the str
version would wrap around that -- though updating the str version
would also update the bytes version. Some keys would be missing from
the str version (or perhaps they would raise exceptions or default to
some other error handler, like ignore or replace).
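
A short sketch of the synchronization described here, assuming a Python
3 release that provides os.environb and os.supports_bytes_environ (the
thread itself only proposes the idea). The variable name PY_DEV_EXAMPLE
is made up for the demonstration.

```python
import os

key = "PY_DEV_EXAMPLE"  # made-up variable name for the demo

if os.supports_bytes_environ:           # bytes view exists on this platform
    os.environ[key] = "demo-value"      # update the text view...
    # ...and the bytes view reflects it immediately.
    synced = os.environb[os.fsencode(key)] == b"demo-value"
    del os.environ[key]                 # deletion is mirrored too
    removed = os.fsencode(key) not in os.environb
else:
    synced = removed = None             # no bytes view here (e.g. Windows)

print(synced, removed)
```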

>> I assume at some point we can stop and have sufficiently low-level
>> interfaces that everyone can agree are in bytes only. Bytes aren't
>> going away. How does Java deal with this? Its File class doesn't seem
>> to deal in bytes at all. What would its listFiles() method do with
>> undecodable filenames?
>
> Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte
> sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will
> fail with FileNotFoundException.
>
> IOW, Java hasn't solved the problem in the last 10 years. Marcin
> Kowalczyk did a more thorough analysis about a year ago in
>
> http://mail.python.org/pipermail/python-3000/2007-September/010450.html

I can't say I like the Java solution. I would like to be able to write
a robust backup tool in Python, even if the code needed to make it
work everywhere isn't going to win any prizes (due to the need to use
bytes on Unix, str on Windows).
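
A minimal sketch of why replacement decoding is not enough for a backup
tool. The file name is invented, and UTF-8 is assumed as the locale
encoding; the decoding step is similar in effect to the Java behaviour
quoted above.

```python
raw = b"backup-\xff.dat"  # invented name with a byte that is invalid UTF-8

# Decoding with replacement turns the bad byte into U+FFFD.
text = raw.decode("utf-8", errors="replace")

# Encoding the text again does not give back the original bytes, so
# the original file can no longer be named.
round_trip = text.encode("utf-8")

print(text == "backup-\ufffd.dat", round_trip == raw)
```

Keeping the original bytes around on Unix avoids this loss, at the cost
of the bytes/str split mentioned above.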

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis

> How does windows (and Python on windows) handle NFC versus NFD issues?

That's left to the application.

> Can I have two files called "ümlaut.txt", one in NFD and one NFC form?

Yes, you can. It sounds confusing, but only in a theoretical way. You
never have combining characters on Windows (at least, I don't). The
keyboard input defaults to NFC, and users normally don't type file
names, anyways, except when creating the files - later, they just use
the mouse to indicate what file they want to act on.

> And are both of those representable on the Python side (i.e. can they
> both be returned from listdir() and passed to open())?

Certainly!

> If I compare
> these two filenames, do they compare differently? 

Certainly!
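
To illustrate why the two names compare differently, here is a small sketch using the standard unicodedata module. The helper same_name is a hypothetical convenience for applications that want to treat both forms as equal; nothing in the thread says the OS or Python does this for you.

```python
import unicodedata

# Same visible name, two different code point sequences.
nfc_name = unicodedata.normalize("NFC", "\u00fcmlaut.txt")  # precomposed u-umlaut
nfd_name = unicodedata.normalize("NFD", "\u00fcmlaut.txt")  # "u" + combining diaeresis

print(len(nfc_name), len(nfd_name))
print(nfc_name == nfd_name)


def same_name(a, b):
    """Hypothetical helper: compare names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(nfc_name, nfd_name))
```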

Regards,
Martin


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
> Yes! If there is a byte-string access method for Windows, pretty please
> make it decode from UTF-8 internally and call the Unicode version of the
> Windows APIs. The non-unicode windows APIs are pretty much just broken
> -- Ideally, Python should never be calling those.

I don't think we will manage to release Python 3.0 this year if that
change is to be implemented. And then, I don't think the release manager
will agree to such a delay.

I disagree that the ANSI APIs are broken. For most users (and by that,
I mean much more than 99% of the world population with access to
Windows computers), they work just fine. You have to deliberately try
to break them, or work in an environment where you speak multiple
languages (with conflicting scripts) simultaneously. Practicality
beats purity, and I applaud Microsoft for such a foresighted design
(they are guilty of bad designs in other places, but this one really
gives a good tradeoff of all issues, all things considered).

Regards,
Martin


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 3:18 PM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> That said, I don't think this is something we (or, more to the point,
> Guido) need to make a decision on right now - for 3.0, having
> bytes-level APIs that can see everything, and Unicode APIs that ignore
> badly encoded filenames is worth trying. If it proves inadequate, then
> we can revisit the idea of some kind of implicit escaping mechanism in
> the Unicode APIs for 3.1 when there is more time for a proper PEP.

Right. Given that most syscalls already support both bytes and
(unicode) str, the simplest thing to do is to take this a bit further,
along the lines of Victor's patches, which I'm reviewing in Rietveld
right now:

http://codereview.appspot.com/3055

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
>> My concern still is that it brings the bytes type into the status of
>> another character string type, which is really bad, and will require
>> further modifications to Python for the lifetime of 3.x.
> 
> I'd like to understand why this is "really bad". I thought it was by
> design that the str and bytes types behave pretty similarly. You can
> use both as dict keys.

If they have to behave pretty similarly, they have to be supported in
all APIs that deal with text. For example, people will demand that
printing bytes should just copy them onto the stream (rather than
invoking repr()), and writing them onto a text stream should work the
same way. GUI libraries should support them, the XML libraries, and so
on.

Where will you stop, and tell people that bytes are just not supposed
to do this or that?

>> This is because applications will then regularly use byte strings for
>> file names on Unix, and regular strings on Windows, and then expect
>> the program to work the same without further modifications.
> 
> It seems that bytes arguments actually *do* work on Windows -- somehow
> they get decoded. (Unless Terry's report was from 2.x.)

To a limited degree - see my other message. Don't try to listdir a
directory with characters outside CP_ACP (it will give you invalid
file names).

> Actually something like that may not be a bad idea. Ian Bicking's
> webob supports similar double APIs for getting the request parameters
> out of a request object; I believe request.GET['x'] is a text object
> and request.GET_str['x'] is the corresponding uninterpreted bytes
> sequence. I would prefer to have os.environb over os.environ[b"PATH"]
> though.

And would you keep them synchronized?

> I assume at some point we can stop and have sufficiently low-level
> interfaces that everyone can agree are in bytes only. Bytes aren't
> going away. How does Java deal with this? Its File class doesn't seem
> to deal in bytes at all. What would its listFiles() method do with
> undecodable filenames?

Apparently (JDK 1.5.0_16, on Linux), it decodes undecodable bytes/byte
sequences as U+FFFD (REPLACEMENT CHARACTER). Opening such a file will
fail with FileNotFoundException.

IOW, Java hasn't solved the problem in the last 10 years. Marcin
Kowalczyk did a more thorough analysis about a year ago in

http://mail.python.org/pipermail/python-3000/2007-September/010450.html
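
The information loss behind that FileNotFoundException can be reproduced in Python with the "replace" error handler. This is only an analogy for the observed Java behaviour, assuming UTF-8 as the file system encoding; the byte strings are made-up examples.

```python
# Two distinct on-disk names that are both invalid UTF-8.
raw_a = b"abc\xffz"
raw_b = b"abc\xfez"

# Decoding with "replace" maps each bad byte to U+FFFD.
shown_a = raw_a.decode("utf-8", errors="replace")
shown_b = raw_b.decode("utf-8", errors="replace")

print(shown_a == shown_b)                  # the two names become indistinguishable
print(shown_a.encode("utf-8") == raw_a)    # and the original bytes cannot be recovered
```

Because the round trip fails, a name obtained this way cannot be handed back to the OS to open the same file.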

Regards,
Martin




Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 2:43 PM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> Of the suggestions I've seen so far, I like Marcin's Mono-inspired
> NULL-escape codec idea the best. Since these strings all come from parts
> of the environment where NULLs are not permitted, a simple "'\0' in
> text" check will immediately identify any strings where decoding failed
> (for applications which care about the difference and want to try to do
> better), while applications which don't care will receive perfectly
> valid Python strings that can be passed around and manipulated as if the
> decoding error never happened.

I'm not so sure. While it maintains *internal* consistency, printing
and displaying those filenames isn't likely going to give useful
results. E.g. on the terminal emulator I happen to be using right now
null bytes are simply ignored. Another danger might be that the null
character may be seen as the end of a string by some other library.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Nick Coghlan
Adam Olsen wrote:
> On Tue, Sep 30, 2008 at 3:43 PM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
>> Of the suggestions I've seen so far, I like Marcin's Mono-inspired
>> NULL-escape codec idea the best. Since these strings all come from parts
>> of the environment where NULLs are not permitted, a simple "'\0' in
>> text" check will immediately identify any strings where decoding failed
>> (for applications which care about the difference and want to try to do
>> better), while applications which don't care will receive perfectly
>> valid Python strings that can be passed around and manipulated as if the
>> decoding error never happened.
> 
> It avoids the technical problems, but it's still magical behaviour
> that users have to learn, whereas bytes/unicode polymorphism uses the
> distinctions you should already know about.
> 
> There's also a problem of how to turn it on.  I'm against
> Python automatically changing the filesystem encoding, no matter how
> well intentioned.  Better to let the app do that, which is easy and
> could be done for all apps (not just python!) if someone defined a
> libc encoding of "null-escaped UTF-8".
> 
> On the whole I'm only -0 on it (compared to -1 for UTF-8b).

For the decoding side, you wouldn't need to do it as a codec - you could
do it as a 'nullescape' error handler (since NULLs can't be present in
the byte sequences being decoded, there is no need to worry about
escaping anything when decoding is successful).

Converting those NULL escaped strings back into something the filesystem
can understand would obviously need a custom codec though, but some kind
of application level handling of bad filenames is going to be needed no
matter how we deal with bad encoding on the input side.
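
A rough sketch of how such an error handler and its reverse step might look, using codecs.register_error. The handler name "nullescape", the escape format (NUL followed by the character whose code point equals the bad byte), and the function nullescape_encode are all assumptions made for illustration; no such handler was specified in this thread.

```python
import codecs


def nullescape(exc):
    """Hypothetical error handler: prefix each undecodable byte with NUL."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Each bad byte becomes "\0" + chr(byte); decoding resumes at exc.end.
    return "".join("\0" + chr(b) for b in bad), exc.end


codecs.register_error("nullescape", nullescape)


def nullescape_encode(text, encoding="utf-8"):
    """Reverse the escaping: NUL + char turns back into the raw byte."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == "\0":
            out.append(ord(next(chars)))   # restore the original byte
        else:
            out += ch.encode(encoding)
    return bytes(out)


raw = b"abc\xffz"
text = raw.decode("utf-8", "nullescape")
print("\0" in text)                        # the simple failure check
print(nullescape_encode(text) == raw)      # round trip back to the bytes
```

The decoding side needs no special codec because a successfully decoded name can never contain NUL; only the encoding side needs custom code.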

That said, I don't think this is something we (or, more to the point,
Guido) need to make a decision on right now - for 3.0, having
bytes-level APIs that can see everything, and Unicode APIs that ignore
badly encoded filenames is worth trying. If it proves inadequate, then
we can revisit the idea of some kind of implicit escaping mechanism in
the Unicode APIs for 3.1 when there is more time for a proper PEP.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
http://www.boredomandlaziness.org


Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 12:07 PM, Simon Cross
<[EMAIL PROTECTED]> wrote:
> On Tue, Sep 30, 2008 at 7:56 PM, Guido van Rossum <[EMAIL PROTECTED]> wrote:
>> (since os.getcwdb() is a Unix-only thing).
>
> I would be happier if all the Unix byte functions existed on Windows
> and fell back to something like encoding the filenames to/from UTF-8. Then
> at least it would be possible for programs to support reading all
> files on both Unix and Windows without having to perform some sort of
> explicit check to see whether os.getcwdb() and friends are supported.

Actually on Windows the syscalls use the encoding that Microsoft uses
-- when using bytes we use the Windows bytes API and when using str we
use the Windows wide API. That's the most platform-compatible
approach.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Jack Jansen


On  30-Sep-2008, at 23:42 , Martin v. Löwis wrote:

> It's the other way 'round: On Windows, Unicode file names are the
> natural choice, and byte strings have limitations. In a sense, Windows
> got it right - but then, they started later. Unix missed the opportunity
> of declaring that all file APIs are UTF-8 (except for Plan-9 and OS X,
> neither being "true" Unix).



How does windows (and Python on windows) handle NFC versus NFD issues?
Can I have two files called "ümlaut.txt", one in NFD and one NFC form?
And are both of those representable on the Python side (i.e. can they
both be returned from listdir() and passed to open())? If I compare
these two filenames, do they compare differently?

--
Jack Jansen, <[EMAIL PROTECTED]>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma Goldman





Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Adam Olsen
On Tue, Sep 30, 2008 at 3:43 PM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
>> The callback would either be an extra argument to all
>> system calls (bad, ugly etc., and why not go with the existing unicode
>> encoding and error flags if we're adding extra args?) or would be
>> global, where I'd be worried that it might interfere with the proper
>> operation of library code that is several abstractions away from
>> whoever installed the callback, not under their control, and not
>> expecting the callback.
>>
>> I suppose I may have totally misunderstood your proposal, but in
>> general I find callbacks unwieldy.
>
> Not really - later in the email, I actually pointed out that exposing
> the unicode errors flag for the implicit PyUnicode_Decode invocations
> would be enough to enable a callback mechanism.
>
> However, James's post pointing out that this is a problem that also
> affects environment variables and command line arguments, not just file
> paths completely kills any hope of a purely callback-based approach - that
> processing needs to "just work" without any additional intervention from
> the application.
>
> Of the suggestions I've seen so far, I like Marcin's Mono-inspired
> NULL-escape codec idea the best. Since these strings all come from parts
> of the environment where NULLs are not permitted, a simple "'\0' in
> text" check will immediately identify any strings where decoding failed
> (for applications which care about the difference and want to try to do
> better), while applications which don't care will receive perfectly
> valid Python strings that can be passed around and manipulated as if the
> decoding error never happened.

It avoids the technical problems, but it's still magical behaviour
that users have to learn, whereas bytes/unicode polymorphism uses the
distinctions you should already know about.

There's also a problem of how to turn it on.  I'm against
Python automatically changing the filesystem encoding, no matter how
well intentioned.  Better to let the app do that, which is easy and
could be done for all apps (not just python!) if someone defined a
libc encoding of "null-escaped UTF-8".

On the whole I'm only -0 on it (compared to -1 for UTF-8b).


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 11:47 AM,  <[EMAIL PROTECTED]> wrote:
>
> On 05:56 pm, [EMAIL PROTECTED] wrote:
>>
>> On Tue, Sep 30, 2008 at 10:59 AM,  <[EMAIL PROTECTED]> wrote:
>>>
>>> On 02:32 pm, [EMAIL PROTECTED] wrote:
>
>>> In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the
>>> "benefit of the doubt" case?  It could always be added to 2.7, and the
>>> parity release of 2to3 could have a --2.7 switch that would modify the
>>> behavior of this and other fixers.
>>
>> I'm not sure what you're proposing. *My* proposal is that 2to3 changes
>> os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone
>> -- there's no way to tell whether os.getcwdb() would be a better
>> match, and for portable code, it won't be (since os.getcwdb() is a
>> Unix-only thing).
>
> My proposal is simply to change getcwd to getcwdb, and getcwdu to getcwd.
>  This preserves whatever bytes/text behavior you are expecting from 2.6 into
> 3.0.  Granted, the fact that unicode is really always the right thing to do
> on Windows complicates things.

Plus, even on Linux Unicode is *usually* what you should be doing,
unless you're writing a backup tool.

> I already tend to avoid os.getcwd() though, and this is just one more reason
> to avoid it.  In the rare cases where I really do need it, it looks like
> os.path.abspath(b".") / os.path.abspath(u".") will provide the clarity that
> I want.

Or os.path.expanduser('~') vs. os.path.expanduser(b'~'). :-)
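
For illustration, this is how the type-preserving behaviour looks in the Python 3 releases that eventually shipped, where os.getcwdb() exists and os.path functions return the type they were given. That this matches the patch under discussion at the time is an assumption.

```python
import os

# str in, str out; bytes in, bytes out.
cwd_text = os.getcwd()
cwd_bytes = os.getcwdb()
abs_text = os.path.abspath(".")
abs_bytes = os.path.abspath(b".")
home_text = os.path.expanduser("~")
home_bytes = os.path.expanduser(b"~")

for value in (cwd_text, cwd_bytes, abs_text, abs_bytes, home_text, home_bytes):
    print(type(value).__name__)
```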

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread James Y Knight

On Sep 30, 2008, at 5:40 PM, Martin v. Löwis wrote:
>>> On Windows, we might reject bytes filenames for all file operations: open(),
>>> unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
>>
>> Since I've seen no objections to this yet: please no. If we offer a
>> "lower-level" bytes filename API, it should work for all platforms.
>
> Unfortunately, it can't. You cannot represent all possible file names
> in a byte string in Windows (just as you can't do so in a Unicode
> string on Unix).

As you mention in the parenthetical below, of course it can.

> So using byte strings on Windows would work for some files, but fail
> for others. In particular, listdir might give you a list of file names
> which you then can't open/stat/recurse into.
>
> (of course, you could use UTF-8 as the file system encoding on Windows,
> but then you will have to rewrite a lot of C code first)

Yes! If there is a byte-string access method for Windows, pretty please
make it decode from UTF-8 internally and call the Unicode version of the
Windows APIs. The non-unicode windows APIs are pretty much just broken
-- Ideally, Python should never be calling those.


But, I still don't like the idea of propagating the "sometimes a
string, sometimes bytes" APIs... One or the other, please. Either
always strings (if and only if there is a method for ensuring that
decoding always succeeds), or always bytes.


James


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Marcin 'Qrczak' Kowalczyk
2008/9/30 Glenn Linderman <[EMAIL PROTECTED]>:

> So the problem is that a Unicode file system interface can't deal with
> non-UTF-8 byte streams as file names.
>
> So it seems there are four suggested approaches, all of which have aspects
> that are inconvenient.

Let's not forget what happens when a non-UTF-8 file name is read from
a file or written to a file, under the assumption that the filename is
written to the file directly (which probably breaks for filenames
containing newlines or such).

> 4) Use of bytes APIs on FS interfaces.  This seems to be the "solution"
> adopted by Posix that creates the "problem" encountered by Unicode-native
> applications.  It is cumbersome to deal with within applications that
> attempt to display the names.  What do Posix-style "open file" dialog boxes
> do in this case?

http://library.gnome.org/devel/glib/stable/glib-Character-Set-Conversion.html#g-filename-display-name

I used to observe three different ways to display such filenames
within gedit (including %xx and \xx escapes), but now it is
consistent, probably because it switched to using the above function
everywhere:
$ touch $'abc\xffz'
$ gedit
The Open dialog shows:
   abc�z (invalid encoding)
When the file is open, the window title and the tab title show:
   abc�z
and the same is in recent file list.

It has a bug: it appends " (invalid encoding)" even if the filename
contains a correctly encoded U+FFFD character. Nautilus has the same
behavior and the same bug because this is a design bug of that
function which does not allow one to tell whether the conversion was
successful.
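
One way to avoid that design bug is to return the success flag alongside the display text instead of having the caller infer it from the presence of U+FFFD. The following is a hypothetical Python sketch, not the GLib API; display_name and its UTF-8 default are assumptions.

```python
def display_name(raw, encoding="utf-8"):
    """Hypothetical helper: return (text for display, decoded_cleanly)."""
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        # Fall back to U+FFFD substitution, but report that it happened.
        return raw.decode(encoding, errors="replace"), False


broken = b"abc\xffz"                   # invalid UTF-8
genuine = "abc\ufffdz".encode("utf-8")  # a correctly encoded U+FFFD

for raw in (broken, genuine):
    text, ok = display_name(raw)
    print(text if ok else text + " (invalid encoding)")
```

Both names render identically, but only the first one gets the " (invalid encoding)" suffix.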

A filename containing a newline is sometimes displayed in two lines,
and sometimes with a U+000A character from a fallback font (hex
character number in a box).

-- 
Marcin Kowalczyk
[EMAIL PROTECTED]
http://qrnik.knm.org.pl/~qrczak/


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Nick Coghlan
Guido van Rossum wrote:
> The callback would either be an extra argument to all
> system calls (bad, ugly etc., and why not go with the existing unicode
> encoding and error flags if we're adding extra args?) or would be
> global, where I'd be worried that it might interfere with the proper
> operation of library code that is several abstractions away from
> whoever installed the callback, not under their control, and not
> expecting the callback.
> 
> I suppose I may have totally misunderstood your proposal, but in
> general I find callbacks unwieldy.

Not really - later in the email, I actually pointed out that exposing
the unicode errors flag for the implicit PyUnicode_Decode invocations
would be enough to enable a callback mechanism.

However, James's post pointing out that this is a problem that also
affects environment variables and command line arguments, not just file
paths completely kills any hope of a purely callback-based approach - that
processing needs to "just work" without any additional intervention from
the application.

Of the suggestions I've seen so far, I like Marcin's Mono-inspired
NULL-escape codec idea the best. Since these strings all come from parts
of the environment where NULLs are not permitted, a simple "'\0' in
text" check will immediately identify any strings where decoding failed
(for applications which care about the difference and want to try to do
better), while applications which don't care will receive perfectly
valid Python strings that can be passed around and manipulated as if the
decoding error never happened.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
http://www.boredomandlaziness.org


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
> Oh, ok. I had assumed Windows just uses a fixed encoding without the problem
> of misencoded filenames.

It's the other way 'round: On Windows, Unicode file names are the
natural choice, and byte strings have limitations. In a sense, Windows
got it right - but then, they started later. Unix missed the opportunity
of declaring that all file APIs are UTF-8 (except for Plan-9 and OS X,
neither being "true" Unix).

Regards,
Martin


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
>> On Windows, we might reject bytes filenames for all file operations: open(), 
>> unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
> 
> Since I've seen no objections to this yet: please no. If we offer a
> "lower-level" bytes filename API, it should work for all platforms.

Unfortunately, it can't. You cannot represent all possible file names
in a byte string in Windows (just as you can't do so in a Unicode
string on Unix).

So using byte strings on Windows would work for some files, but fail
for others. In particular, listdir might give you a list of file names
which you then can't open/stat/recurse into.

(of course, you could use UTF-8 as the file system encoding on Windows,
but then you will have to rewrite a lot of C code first)
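
The limitation can be illustrated without Windows by encoding file names into a single-byte code page. In this sketch cp1252 merely stands in for whatever the ANSI code page happens to be on a given machine; the helper representable and the sample names are made up for the example.

```python
def representable(name, codepage="cp1252"):
    """Hypothetical check: can this name be expressed in the code page?"""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


names = ["r\u00e9sum\u00e9.txt", "\u0391\u03b8\u03ae\u03bd\u03b1.txt"]
for name in names:
    print(ascii(name), representable(name))

# A lossy conversion yields a name that no longer identifies the file.
lossy = names[1].encode("cp1252", errors="replace")
print(lossy)
```

A directory listing that returns such lossy names is exactly the case where the subsequent open/stat call fails.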

Regards,
Martin


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 11:42 AM,  <[EMAIL PROTECTED]> wrote:
> There are other ways to glean this knowledge; for example, looking at the
> 'iocharset' or 'nls' mount options supplied to mount various filesystems.  I
> thought maybe Python (or some C library call) might be invoking some logic
> that did something with data like that; if not, great, one day when I have
> some free time (meaning: never) I can implement that logic myself without
> duplicating a bunch of work.

I know we could do a better job, but absent anyone who knows what
they're doing we've chosen a fairly conservative approach. I certainly
hope that someone will contribute some mean encoding-guessing code to
the stdlib that users can use. I'm not sure if I'll ever endorse doing
this automatically in io.open(), though I'd be fine with a convention
like passing encoding="guess".

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 1:29 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
>> However
>> the *proposed* behavior (returns bytes if the arg was bytes, and
>> returns str when the arg was str) is IMO sane, and no different than
>> the polymorphism found in len() or many builtin operations.
>
> My concern still is that it brings the bytes type into the status of
> another character string type, which is really bad, and will require
> further modifications to Python for the lifetime of 3.x.

I'd like to understand why this is "really bad". I thought it was by
design that the str and bytes types behave pretty similarly. You can
use both as dict keys.

> This is because applications will then regularly use byte strings for
> file names on Unix, and regular strings on Windows, and then expect
> the program to work the same without further modifications.

It seems that bytes arguments actually *do* work on Windows -- somehow
they get decoded. (Unless Terry's report was from 2.x.)

> The next
> question then will be environment variables and command line arguments,
> for which we then should provide two versions (e.g. sys.argv and
> sys.argvb; for os.environ, os.environ["PATH"] could mean something
> different from os.environ[b"PATH"]).

Actually something like that may not be a bad idea. Ian Bicking's
webob supports similar double APIs for getting the request parameters
out of a request object; I believe request.GET['x'] is a text object
and request.GET_str['x'] is the corresponding uninterpreted bytes
sequence. I would prefer to have os.environb over os.environ[b"PATH"]
though.

> And so on (passwd/group file, Tkinter, ...)

I assume at some point we can stop and have sufficiently low-level
interfaces that everyone can agree are in bytes only. Bytes aren't
going away. How does Java deal with this? Its File class doesn't seem
to deal in bytes at all. What would its listFiles() method do with
undecodable filenames?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 1:12 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Terry Reedy wrote:
>>
>> Guido van Rossum wrote:
>
>>> I'm not sure either way. I've heard it claimed that Windows filesystem
>>> APIs use Unicode natively. Does Python 3.0 on Windows currently
>>> support filenames expressed as bytes? Are they encoded first before
>>> passing to the Unicode APIs? Using what encoding?
>
>> [os.listdir(bytes) returns list of bytes, open(bytes) fails]
>
> More:
>
> The path functions also seem not to work:
>
> >>> op.abspath(b'tem')
> ...
>     path = path.replace("/", "\\")
> TypeError: expected an object with the buffer interface
>
> The error message is a bit cryptic given that the problem is that the
> arguments to replace should be bytes instead of strings for a bytes path.
>
> .basename fails with
> ...
>     while i and p[i-1] not in '/\\':
> TypeError: 'in <string>' requires string as left operand, not int
>
> os.rename, os.stat, os.mkdir, os.rmdir work.  I presume same is true for
> others that normally work on windows.

It looks roughly like the system calls do support bytes (using what
encoding?) but the Python code in os.path doesn't. This is the same as
the status quo on Linux.
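
The failure Terry reports comes from mixing str literals into operations on a bytes path. A minimal sketch of the kind of change the pure-Python path code would need, assuming a hypothetical helper to_backslashes (this is not the actual ntpath code):

```python
def to_backslashes(path):
    """Hypothetical helper: pick separators matching the argument type."""
    if isinstance(path, bytes):
        return path.replace(b"/", b"\\")
    return path.replace("/", "\\")


print(to_backslashes("a/tem"))
print(to_backslashes(b"a/tem"))

# Mixing the two types is what raises the TypeError.
mixed_error = None
try:
    b"a/tem".replace("/", "\\")
except TypeError as exc:
    mixed_error = exc
print(type(mixed_error).__name__)
```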

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
>> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> 
>> wrote:
>>>> Change the default file system encoding to store bytes in Unicode is like
>>>> introducing a new Python type: .
>>> Exactly. Seems like the best solution to me, despite your polemics.
>>
>> Martin, I don't understand why you are in favor of storing raw bytes
>> encoded as Latin-1 in Unicode string objects, which clearly gives rise
>> to mojibake. In the past you have always been staunchly opposed to API
>> changes or practices that could lead to mojibake (and you had me quite
>> convinced).
>
> True. I try to weigh the need for simplicity in the API against the
> need to support all cases. So I see two solutions:
>
> a) support bytes as file names. Supports all cases, but complicates
>   the API very much, by pervasively bringing bytes into the status
>   of a character data type. IMO, this must be prevented at all costs.

That's a matter of opinion. I would also like to point out that it is
in fact already supported by the system calls. io.open() doesn't, but
that's a wrapper around _fileio._FileIO which does support bytes. All
other syscalls already do the right thing (even readlink()!) except
os.listdir(), which returns a mixture of bytes and str values (which
is horrible) and os.getcwd() which needs a bytes equivalent. Victor's
patch addresses all these issues.

Victor's patch also tries to fix glob.py, fnmatch.py, and
posixpath.py. That is more debatable, because this might be the start
of a never-ending project. OTOH we have precedents, e.g. the re module
similarly supports both bytes and unicode (and makes an effort to
avoid mixing them).
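
The re precedent can be seen in a few lines: the same function accepts either type, as long as the pattern and the subject agree. The file name used here is just an example.

```python
import re

# str pattern with str subject, bytes pattern with bytes subject.
text_match = re.match(r"(\w+)\.txt", "notes.txt")
bytes_match = re.match(rb"(\w+)\.txt", b"notes.txt")
print(text_match.group(1), bytes_match.group(1))

# Mixing a str pattern with a bytes subject is rejected.
try:
    re.match(r"(\w+)\.txt", b"notes.txt")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
print(mixing_allowed)
```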

> b) make character (Unicode) strings the only string type. Does not
>   immediately support all cases, so some hacks are needed. However,
>   even with the hacks, it preserves the simplicity of the API; the
>   hacks then should ideally be limited to the applications that need
>   it. On this side, I see the following approaches:
>   1. try to automatically embed non-representable characters into
>  the Unicode strings, e.g. by using PUA characters. Reduces
>  the amount of moji-bake, but produces a lot of difficult issues.
>   2. let applications that desire so access all file names in a
>  uniform manner, at the cost of producing tons of moji-bake
>
> In this case, I think moji-bake is unavoidable: it is just a plain
> flaw in the POSIX implementations (not the API or specification) that
> you can run into file names where you can't come up with the right
> rendering. Even for solution a), the resulting data cannot
> be displayed "correctly" in all cases.

But I still like the ultimate solution to displaying names for (a)
better: if it's not decodable, display it as the repr() of a bytes
object. (Which happens to be its str() as well.)
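
As a rough illustration of that display rule, the following sketch decodes a bytes filename when it can and otherwise shows the repr() of the bytes object. The function name display_name and the default encoding are assumptions made for the example.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a filename for display; fall back to the repr of the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Not decodable: show something unambiguous instead of mojibake.
        return repr(raw)

print(display_name("têste".encode("utf-8")))
print(display_name(b"dossi\xe9"))
```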

> Currently, I favor b2, but haven't given up on b1, and they don't
> exclude each other. b2 is simple to implement, and delegates the
> choice between legible file names and universal access to all files
> to the application. Given the way Unix works, this is the most sensible
> choice, IMO: by default, Python should try to make file names legible,
> but stuff like backup applications should be implementable also -
> and they don't need legible file names.

I don't believe that an application-wide choice is safe. For example
the tempfile module manipulates filenames (at least for
NamedTemporaryFile) and I think it would be wrong if it were affected
by such a global setting. (E.g. the user could pass a suffix argument
containing Unicode characters outside Latin-1.)

> I think option a) will haunt us forever. People will ask for more and
> more features in the bytes type, eventually asking "give us Python
> 2.x strings back". It already starts: see #3982, where Benjamin
> asks to have .format added to bytes (for a reason unrelated to file
> names).

I'm not so worried about feature requests for the bytes type unrelated
to filesystems; we can either grant them or not, and I am actually in
many cases in favor of granting them -- just like we support bytes in
the re module as I already mentioned above. The bytes and str types
have intentionally similar APIs, because they have similar structure,
and even somewhat similar semantics (b'ABC' and 'ABC' have related
meanings even if there are subtle differences).

I am also encouraged by Glyph's support for (a). He has a lot of
practical experience.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 12:42 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
>>
>> On Tue, Sep 30, 2008 at 11:13 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>>>
>>> Victor Stinner schrieb:

>>>> On Windows, we might reject bytes filenames for all file operations:
>>>> open(), unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
>>>
>>> Since I've seen no objections to this yet: please no. If we offer a
>>> "lower-level" bytes filename API, it should work for all platforms.
>>
>> I'm not sure either way. I've heard it claimed that Windows filesystem
>> APIs use Unicode natively. Does Python 3.0 on Windows currently
>> support filenames expressed as bytes? Are they encoded first before
>> passing to the Unicode APIs? Using what encoding?
>
> In 3.0rc1, the listdir doc needs updating:
> "os.listdir(path)
> Return a list containing the names of the entries in the directory. The list
> is in arbitrary order. It does not include the special entries '.' and '..'
> even if they are present in the directory. Availability: Unix, Windows.
>
> On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will
> be a list of Unicode objects."
>
> s/Unicode/bytes/ at least for Windows.
>
> >>> os.listdir(b'.')
> [b'countries.txt', b'multeetest.py', b't1.py', b't1.pyc', b't2.py', b'tem',
> b'temp.py', b'temp.pyc', b'temp2.py', b'temp3.py', b'temp4.py', b'test.py',
> b'z', b'z.txt']
>
> The bytes names do not work however:
>
> >>> t=open(b'tem')
> Traceback (most recent call last):
>  File "", line 1, in 
>t=open(b'tem')
>  File "C:\Programs\Python30\lib\io.py", line 284, in __new__
>return open(*args, **kwargs)
>  File "C:\Programs\Python30\lib\io.py", line 184, in open
>raise TypeError("invalid file: %r" % file)
> TypeError: invalid file: b'tem'
>
> Is this what you were asking?

No, that's because bytes is missing from the explicit list of
allowable types in io.open. Victor has a one-line trivial patch for
this. Could you try this though?

>>> import _fileio
>>> _fileio._FileIO(b'tem')

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
> I'm not sure either way. I've heard it claimed that Windows filesystem
> APIs use Unicode natively. Does Python 3.0 on Windows currently
> support filenames expressed as bytes?

Yes, it does (at least, os.open, os.stat support them, builtin open
doesn't).

> Are they encoded first before
> passing to the Unicode APIs? Using what encoding?

They aren't passed to the Unicode (W) APIs (by Python). Instead, they
are passed to the "ANSI" (A) APIs (i.e. CP_ACP APIs). On Windows NT+,
that API then converts it to Unicode through the CP_ACP (aka "mbcs")
encoding; this is inside the system DLLs.

CP_ACP is a lossy encoding (from Unicode to bytes): Microsoft uses
replacement characters if they can, starting with similarly-looking
characters, and falling back to question marks.
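
The loss can be sketched in pure Python. This is only an approximation: cp1252 stands in for a Windows ANSI code page, and Python's "replace" error handler substitutes question marks only, without the similar-looking-character step described above.

```python
# cp1252 is an assumed stand-in for the ANSI code page (CP_ACP).
name = "r\u00e9sum\u00e9_\u65e5\u672c.txt"   # "résumé_日本.txt"

# Characters outside the code page are replaced, so information is lost.
encoded = name.encode("cp1252", errors="replace")
round_tripped = encoded.decode("cp1252")

print(encoded)
print(round_tripped == name)
```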

Regards,
Martin



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
Guido van Rossum wrote:
> However
> the *proposed* behavior (returns bytes if the arg was bytes, and
> returns str when the arg was str) is IMO sane, and no different than
> the polymorphism found in len() or many builtin operations.

My concern still is that it brings the bytes type into the status of
another character string type, which is really bad, and will require
further modifications to Python for the lifetime of 3.x.

This is because applications will then regularly use byte strings for
file names on Unix, and regular strings on Windows, and then expect
the program to work the same without further modifications. The next
question then will be environment variables and command line arguments,
for which we then should provide two versions (e.g. sys.argv and
sys.argvb; for os.environ, os.environ["PATH"] could mean something
different from os.environ[b"PATH"]). And so on (passwd/group file,
Tkinter, ...)

Regards,
Martin




Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
> I didn't get an answer to my question: what is the result  characters) stored in unicode> + ? I guess that the result is 
>  instead of raising an error 
> (invalid types). So again: why introducing a new type instead of reusing 
> existing Python types?

I didn't mean to introduce a new data type in the strict sense - merely
to pass through undecodable bytes through the regular Unicode type.
So the result of adding them is a regular Unicode string.
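
One concrete mechanism in this spirit, available in later Python 3 releases, is the "surrogateescape" error handler. It is shown here as an illustration of the idea, not as the scheme proposed in this message: undecodable bytes are carried inside an ordinary str and can be recovered when encoding back.

```python
raw = b"dossi\xe9"                                    # not valid UTF-8

# The undecodable byte becomes a lone surrogate inside a normal str.
text = raw.decode("utf-8", errors="surrogateescape")
combined = "dir/" + text                              # concatenation yields a plain str

# Encoding with the same handler restores the original bytes.
restored = combined.encode("utf-8", errors="surrogateescape")

print(type(combined).__name__, restored)
```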

Regards,
Martin



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Martin v. Löwis
Guido van Rossum wrote:
> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>>> Change the default file system encoding to store bytes in Unicode is like
>>> introducing a new Python type: .
>> Exactly. Seems like the best solution to me, despite your polemics.
> 
> Martin, I don't understand why you are in favor of storing raw bytes
> encoded as Latin-1 in Unicode string objects, which clearly gives rise
> to mojibake. In the past you have always been staunchly opposed to API
> changes or practices that could lead to mojibake (and you had me quite
> convinced).

True. I try to outweigh the need for simplicity in the API against the
need to support all cases. So I see two solutions:

a) support bytes as file names. Supports all cases, but complicates
   the API very much, by pervasively bringing bytes into the status
   of a character data type. IMO, this must be prevented at all costs.

b) make character (Unicode) strings the only string type. Does not
   immediately support all cases, so some hacks are needed. However,
   even with the hacks, it preserves the simplicity of the API; the
   hacks then should ideally be limited to the applications that need
   it. On this side, I see the following approaches:
   1. try to automatically embed non-representable characters into
  the Unicode strings, e.g. by using PUA characters. Reduces
  the amount of moji-bake, but produces a lot of difficult issues.
   2. let applications that desire so access all file names in a
  uniform manner, at the cost of producing tons of moji-bake

In this case, I think moji-bake is unavoidable: it is just a plain
flaw in the POSIX implementations (not the API or specification) that
you can run into file names where you can't come up with the right
rendering. Even for solution a), the resulting data cannot
be displayed "correctly" in all cases.

Currently, I favor b2, but haven't given up on b1, and they don't
exclude each other. b2 is simple to implement, and delegates the
choice between legible file names and universal access to all files
to the application. Given the way Unix works, this is the most sensible
choice, IMO: by default, Python should try to make file names legible,
but stuff like backup applications should be implementable also -
and they don't need legible file names.
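
A small sketch of the trade-off in b2, assuming the "uniform" access is done by decoding every file name as Latin-1 (as Guido's question above describes it): decoding never fails and the bytes survive a round trip, but the text shown to the user is mojibake.

```python
raw = "têste".encode("utf-8")       # bytes as stored on a UTF-8 system

# Latin-1 maps every byte to a code point, so this cannot fail.
as_text = raw.decode("latin-1")
print(as_text)                      # illegible, but lossless

# The original bytes can always be recovered for the system call.
recovered = as_text.encode("latin-1")
print(recovered == raw)
```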

I think option a) will haunt us forever. People will ask for more and
more features in the bytes type, eventually asking "give us Python
2.x strings back". It already starts: see #3982, where Benjamin
asks to have .format added to bytes (for a reason unrelated to file
names).

Regards,
Martin




Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Simon Cross
On Tue, Sep 30, 2008 at 7:56 PM, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> (since os.getcwdb() is a Unix-only thing).

I would be happier if all the Unix byte functions existed on Windows and
fell back to something like encoding the filenames to/from UTF-8. Then
at least it would be possible for programs to support reading all
files on both Unix and Windows without having to perform some sort of
explicit check to see whether os.getcwdb() and friends are supported.


Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread glyph


On 05:56 pm, [EMAIL PROTECTED] wrote:
> On Tue, Sep 30, 2008 at 10:59 AM,  <[EMAIL PROTECTED]> wrote:
>> On 02:32 pm, [EMAIL PROTECTED] wrote:
>>
>> In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the
>> "benefit of the doubt" case?  It could always be added to 2.7, and the
>> parity release of 2to3 could have a --2.7 switch that would modify the
>> behavior of this and other fixers.
>
> I'm not sure what you're proposing. *My* proposal is that 2to3 changes
> os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone
> -- there's no way to tell whether os.getcwdb() would be a better
> match, and for portable code, it won't be (since os.getcwdb() is a
> Unix-only thing).


My proposal is simply to change getcwd to getcwdb, and getcwdu to 
getcwd.  This preserves whatever bytes/text behavior you are expecting 
from 2.6 into 3.0.  Granted, the fact that unicode is really always the 
right thing to do on Windows complicates things.


I already tend to avoid os.getcwd() though, and this is just one more 
reason to avoid it.  In the rare cases where I really do need it, it 
looks like os.path.abspath(b".") / os.path.abspath(u".") will provide 
the clarity that I want.
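
For what it is worth, the idiom can be sketched as follows, assuming a Python 3 in which os.path.abspath() returns the type of its argument; the variable names are only illustrative.

```python
import os

# The argument type states explicitly which kind of path is wanted.
cwd_text = os.path.abspath(".")     # str
cwd_bytes = os.path.abspath(b".")   # bytes

print(type(cwd_text).__name__, type(cwd_bytes).__name__)
```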



Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread glyph

On 06:16 pm, [EMAIL PROTECTED] wrote:
> On Tue, Sep 30, 2008 at 11:12 AM,  <[EMAIL PROTECTED]> wrote:
>> The one thing it doesn't do is expose the decoding rules for the higher-
>> level applications to deal with.  I am pretty sure I don't understand how
>> the interaction between filesystem encoding and user locale works in that
>> case, though, so I can't immediately recommend a way to do it.
>
> You can ask what the filesystem encoding is with
> sys.getfilesystemencoding(). On my Linux box I can make this return
> anything I like by setting LC_CTYPE=en_US.<encoding> (as long as
> <encoding> is a recognized encoding). There are probably 5 other
> environment variables to influence this. :-(

Only 5?  Great! :-)

> Of course that doesn't help for undecodable filenames, and in that
> case I don't think *anything* can help you unless you have a lot of
> additional knowledge about what the user might be doing, e.g. you know
> a few other encodings to try that make sense for their environment.


There are other ways to glean this knowledge; for example, looking at 
the 'iocharset' or 'nls' mount options supplied to mount various 
filesystems.  I thought maybe Python (or some C library call) might be 
invoking some logic that did something with data like that; if not, 
great, one day when I have some free time (meaning: never) I can 
implement that logic myself without duplicating a bunch of work.



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 11:13 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
> Victor Stinner schrieb:
>> On Windows, we might reject bytes filenames for all file operations: open(),
>> unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)
>
> Since I've seen no objections to this yet: please no. If we offer a
> "lower-level" bytes filename API, it should work for all platforms.

I'm not sure either way. I've heard it claimed that Windows filesystem
APIs use Unicode natively. Does Python 3.0 on Windows currently
support filenames expressed as bytes? Are they encoded first before
passing to the Unicode APIs? Using what encoding?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 11:12 AM,  <[EMAIL PROTECTED]> wrote:

> The one thing it doesn't do is expose the decoding rules for the higher-
> level applications to deal with.  I am pretty sure I don't understand how
> the interaction between filesystem encoding and user locale works in that
> case, though, so I can't immediately recommend a way to do it.

You can ask what the filesystem encoding is with
sys.getfilesystemencoding(). On my Linux box I can make this return
anything I like by setting LC_CTYPE=en_US.<encoding> (as long as
<encoding> is a recognized encoding). There are probably 5 other
environment variables to influence this. :-(

Of course that doesn't help for undecodable filenames, and in that
case I don't think *anything* can help you unless you have a lot of
additional knowledge about what the user might be doing, e.g. you know
a few other encodings to try that make sense for their environment.
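
A rough sketch of that fallback strategy, under the assumption that the application knows a list of plausible encodings. The function name decode_filename, its parameters, and the cp1252 default are hypothetical choices for the example.

```python
import sys

def decode_filename(raw, primary=None, fallbacks=("cp1252",)):
    """Try the filesystem encoding first, then caller-supplied fallbacks.

    Returns (text, encoding_used), or (None, None) if nothing fits.
    """
    primary = primary or sys.getfilesystemencoding()
    for enc in (primary,) + tuple(fallbacks):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None

print(sys.getfilesystemencoding())
print(decode_filename(b"dossi\xe9", primary="utf-8"))
```

A successful decode with a fallback is still only a guess; the bytes may have been written in yet another encoding.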

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread glyph

On 02:39 pm, [EMAIL PROTECTED] wrote:
> For example, implementing os.listdir to return the file names as Unicode
> subclasses with ability to access the underlying bytes (automatically
> recognized by open and friends) sounds like a good compromise that
> allows the word processor to both have the cake and eat it.


It really seems like the strategy of the current patch (which I believe 
Guido proposed) makes the most sense.  Programs pass different arguments 
for different things:


listdir(text) -> I am thinking in unicode and I do not know about 
encodings, please give me only things that are proper unicode, because I 
don't want to deal with that.


listdir(bytes) -> I am thinking about bytes, I know about encodings. 
Just give me filenames as bytes and I will decode them myself or do 
other fancy things.


You can argue about whether this should really be 'listdiru' or 'globu' 
for explicitness, but when such a simple strategy with unambiguous types 
works, there's no reason to introduce some weird hybrid bytes/text type 
that will inevitably be a bug attractor.


Python's path abstractions have never been particularly high level, nor 
do I think they necessarily should be - at least, not until there's some 
community consensus about what a "high level path abstraction" really 
looks like.  We're still wrestling with it in Twisted, and I can think 
of at least three ways that ours is wrong.  And ours is the one that's 
doing the best, as far as I can tell :).


This proposal gives higher level software the information that it needs 
to construct appropriate paths.


The one thing it doesn't do is expose the decoding rules for the higher- 
level applications to deal with.  I am pretty sure I don't understand 
how the interaction between filesystem encoding and user locale works in 
that case, though, so I can't immediately recommend a way to do it.



Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 10:59 AM,  <[EMAIL PROTECTED]> wrote:
> On 02:32 pm, [EMAIL PROTECTED] wrote:
>> If 2.6 weren't pretty much released already I'd ask to add
>> os.getcwdb() there, as an alias for os.getcwd(), and add a 2to3 fixer
>> that converts os.getcwdu() to os.getcwd(), leaves os.getcwd() alone
>> (benefit of the doubt) and leaves os.getcwdb() alone as well (a strong
>> indication the user meant to get bytes in the 3.x version of their
>> code). (Similar to using bytes instead of str in 2.6 even though they
>> mean the same thing there -- they will be properly separated in 3.x.)
>
> In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the
> "benefit of the doubt" case?  It could always be added to 2.7, and the
> parity release of 2to3 could have a --2.7 switch that would modify the
> behavior of this and other fixers.

I'm not sure what you're proposing. *My* proposal is that 2to3 changes
os.getcwdu() calls to os.getcwd() and leaves os.getcwd() calls alone
-- there's no way to tell whether os.getcwdb() would be a better
match, and for portable code, it won't be (since os.getcwdb() is a
Unix-only thing).
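
The rewrite rule itself is simple enough to sketch. The following is a toy regex-based stand-in, not how a real 2to3 fixer is written (those operate on the parse tree); the function name rewrite_getcwd_calls is hypothetical.

```python
import re

# Match the attribute access os.getcwdu as a whole word.
_GETCWDU = re.compile(r"\bos\.getcwdu\b")

def rewrite_getcwd_calls(source: str) -> str:
    """Rename os.getcwdu to os.getcwd; plain os.getcwd is left untouched."""
    return _GETCWDU.sub("os.getcwd", source)

print(rewrite_getcwd_calls("a = os.getcwdu()\nb = os.getcwd()\n"))
```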

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread glyph

On 02:32 pm, [EMAIL PROTECTED] wrote:
> On Tue, Sep 30, 2008 at 6:21 AM,  <[EMAIL PROTECTED]> wrote:
>> On 12:47 am, [EMAIL PROTECTED] wrote:
>>
>> It sounds like maybe there should be some 2to3 fixers in here somewhere,
>> too?  Not necessarily as part of this patch, but somewhere related?  I don't
>> know what they would do, but it does seem quite likely that code which was
>> previously correct under 2.6 (using bytes) would suddenly be mixing bytes
>> and unicode with these APIs.
>
> Doesn't seem easy for 2to3 to recognize such cases.


Actually I think I'm wrong.  As far as dealing with glob(), listdir() 
and friends, I suppose that other bytes/text fixers will already have 
had their opportunity to deal with getting the type to be the 
appropriate thing, and if you have glob(should be bytes>) it will work as expected in 3.0.  (I am really just 
confirming that I have nothing useful to say here, using too many words 
to do it: at least, I hope that nobody will waste further time thinking 
about it as a result.)

> If 2.6 weren't pretty much released already I'd ask to add
> os.getcwdb() there, as an alias for os.getcwd(), and add a 2to3 fixer
> that converts os.getcwdu() to os.getcwd(), leaves os.getcwd() alone
> (benefit of the doubt) and leaves os.getcwdb() alone as well (a strong
> indication the user meant to get bytes in the 3.x version of their
> code). (Similar to using bytes instead of str in 2.6 even though they
> mean the same thing there -- they will be properly separated in 3.x.)


In the absence of a 2.6 getcwdb, perhaps the fixer could just drop the 
"benefit of the doubt" case?  It could always be added to 2.7, and the 
parity release of 2to3 could have a --2.7 switch that would modify the 
behavior of this and other fixers.



Re: [Python-Dev] [Python-3000] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 10:41 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> Guido van Rossum <[EMAIL PROTECTED]> wrote:
>> On Tue, Sep 30, 2008 at 8:47 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
>> > Victor Stinner <[EMAIL PROTECTED]> wrote:
>> >
>> >>  - listdir(unicode) -> only unicode, *skip* invalid filenames
>> >>(as asked by Guido)
>> >
>> > Is there an option listdir(bytes) which will return *all* filenames (as
>> > byte sequences)?  Otherwise, this seems troubling to me; *something*
>> > should be returned for filenames which can't be represented, even if
>> > it's only None.
>>
>> Yes, os.listdir() becomes polymorphic -- if you pass it a pathname in
>> bytes the output is in bytes and it will return everything exactly as
>> the underlying syscall returns it to you.
>
> What about everything else?  For instance, if I call
> os.path.join(<bytes>, <bytes>), I presume I get back a <bytes> which can
> be passed to os.listdir() to retrieve the contents of that directory.

Yeah, Victor's code at http://bugs.python.org/issue3187 (file
python3_bytes_filename.patch) does this. More needs to be done but
it's a start.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Georg Brandl
Guido van Rossum schrieb:
> On Tue, Sep 30, 2008 at 10:28 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>>> How can it *regularly* drive you crazy when "the majority of file names
>>> [...] encoded correctly" (as you assert above)?
>>
>> Because Office files are a) often named with long, seemingly descriptive
>> filenames, which invariably means umlauts in German, and b) often sent around
>> between systems, creating encoding problems.
> 
> Gotcha.

Which means?

>> Having seen how much controversy returning an invalid Unicode string sparks,
>> and given that it really isn't obvious to the newbie either, I think I now
>> agree that dropping filenames when calling a listdir() that returns Unicode
>> filenames is the best solution. I'm a little uneasy with having one function
>> for both bytes and Unicode return, because that kind of str/unicode mixing I
>> thought we had left behind in 2.x, but of course can live with it.
> 
> Well, the *current* Py3k behavior where it may return a mix of bytes
> and str instances is really messy, and likely to trip up most code
> that doesn't expect it in a way that makes it hard to debug. However
> the *proposed* behavior (returns bytes if the arg was bytes, and
> returns str when the arg was str) is IMO sane, and no different than
> the polymorphism found in len() or many builtin operations.

I agree that everything is better than the current behavior.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread M.-A. Lemburg
On 2008-09-30 18:46, Guido van Rossum wrote:
> On Tue, Sep 30, 2008 at 8:20 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> In the end, I think it's better not to be clever and just return
>> the filenames that cannot be decoded as bytes objects in os.listdir().
> 
> Unfortunately that's going to break most code that is using
> os.listdir(), so it's hardly an improved experience.

Right, but this also signals a problem to the application and the
application is in the best position to determine a proper
work-around.

>> Passing those to open() will then open the files as expected, in most
>> other cases the application will have to provide explicit conversions
>> in whatever way best fits the application.
> 
> In most cases the app will try to concatenate a pathname given as a
> string and then it will fail.

True, and that's the right thing to do in those cases.
The application will have to deal with the problem, e.g. convert
the path to bytes and retry the joining, or convert the bytes string
to Latin-1 and then convert the result back to bytes (using Latin-1)
for passing it to open() (which will of course only work if there are
no non-Latin-1 characters in the path dir), or apply a different
filename encoding based on the path and then retry to convert the
bytes filename into Unicode, or ask the user what to do, etc.
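
The first of those work-arounds (convert the path to bytes and retry the join) might look like the sketch below. The helper name join_any is hypothetical, and os.fsencode() is assumed to be available; it encodes with the filesystem encoding and was added to Python later than this thread.

```python
import os

def join_any(directory, name):
    """Join path parts, falling back to bytes when the types are mixed."""
    if isinstance(directory, str) and isinstance(name, bytes):
        # An undecodable name came back as bytes: move the directory to bytes too.
        directory = os.fsencode(directory)
    elif isinstance(directory, bytes) and isinstance(name, str):
        name = os.fsencode(name)
    return os.path.join(directory, name)

print(join_any("tmp", b"caf\xe9"))
print(join_any("tmp", "plain.txt"))
```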

There are many possibilities to solve the problem, apply a work-around,
or inform the user of ways to correct it.

>> Also note that os.listdir() isn't the only source of filenames. You
>> often read them from a file, a database, some socket, etc, so letting
>> the application decide what to do is not asking too much, IMHO.
> 
> In all those cases, the code that reads them is responsible for
> picking an encoding or relying on a default encoding, and the
> resulting filenames are always expressed as text, not bytes. I don't
> think it's the same at all.

What I was trying to say is that you run into the same problem
in other places as well. Trying to have os.listdir() implement
some strategy is not going to solve the problem at large.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 10:28 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>> How can it *regularly* drive you crazy when "the majority of file names
>> [...] encoded correctly" (as you assert above)?
>
> Because Office files are a) often named with long, seemingly descriptive
> filenames, which invariably means umlauts in German, and b) often sent around
> between systems, creating encoding problems.

Gotcha.

> Having seen how much controversy returning an invalid Unicode string sparks,
> and given that it really isn't obvious to the newbie either, I think I now
> agree that dropping filenames when calling a listdir() that returns Unicode
> filenames is the best solution. I'm a little uneasy with having one function
> for both bytes and Unicode return, because that kind of str/unicode mixing I
> thought we had left behind in 2.x, but of course can live with it.

Well, the *current* Py3k behavior where it may return a mix of bytes
and str instances is really messy, and likely to trip up most code
that doesn't expect it in a way that makes it hard to debug. However
the *proposed* behavior (returns bytes if the arg was bytes, and
returns str when the arg was str) is IMO sane, and no different than
the polymorphism found in len() or many builtin operations.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Bill Janssen
Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On Tue, Sep 30, 2008 at 8:47 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> > Victor Stinner <[EMAIL PROTECTED]> wrote:
> >
> >>  - listdir(unicode) -> only unicode, *skip* invalid filenames
> >>(as asked by Guido)
> >
> > Is there an option listdir(bytes) which will return *all* filenames (as
> > byte sequences)?  Otherwise, this seems troubling to me; *something*
> > should be returned for filenames which can't be represented, even if
> > it's only None.
> 
> Yes, os.listdir() becomes polymorphic -- if you pass it a pathname in
> bytes the output is in bytes and it will return everything exactly as
> the underlying syscall returns it to you.

What about everything else?  For instance, if I call
os.path.join(<bytes>, <bytes>), I presume I get back a <bytes> which can
be passed to os.listdir() to retrieve the contents of that directory.

Bill


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Georg Brandl
Steven D'Aprano schrieb:
> On Tue, 30 Sep 2008 11:50:10 pm Guido van Rossum wrote:
> 
>> > To avoid silent skipping, is it possible to drop 'unreadable'
>> > names, issue a warning (instead of exception), and continue to
>> > completion? "Warning: unreadable filename skipped; see
>> > PyWiki/UnreadableFilenames"
>>
>> That would be annoying as hell in most cases.
> 
> Doesn't the warning module default to only displaying the warning once 
> per session? I don't see that it would be annoying as hell to be 
> notified once per session that an error has occurred.
> 
> What I'd find annoying as hell would be something like this:
> 
> $ ls . | wc -l
> 25
> $ python
> ...
> >>> import os
> >>> len(os.listdir('.'))
> 24
> 
> 
> Give me a nice clear error, or even a warning. Don't let the error pass 
> silently, unless I explicitly silence it.

Just another data point: I've just looked at Qt, which provides a filesystem
API and whose strings are Unicode, and it seems to drop undecodable filenames
as well.

Georg


-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Georg Brandl
Guido van Rossum wrote:

>> With the filenames decoded by UTF-8, your files named têste, ô, dossié will
>> be displayed and handled correctly. The others are *invalid* in the
>> filesystem encoding UTF-8 and therefore would be represented by something
>> like
>>
>> u'dir\uXXffname' where XX is some private use Unicode namespace. It won't
>> look pretty when printed, but then, what do other applications do? They e.g.
>> display a question mark as you show above, which is not better in terms of
>> readability.
>>
>> But it will work when given to a filename-handling function. Valid filenames
>> can be compared to Unicode strings.
>>
>> A real-world example: OpenOffice can't open files with invalid bytes in their
>> name. They are displayed in the "Open file" dialog, but trying to open fails.
>> This regularly drives me crazy. Let's not make Python not work this way too,
>> or, even worse, not even display those filenames.
> 
> How can it *regularly* drive you crazy when "the majority of file names
> [...] encoded correctly" (as you assert above)?

Because Office files are a) often named with long, seemingly descriptive
filenames, which invariably means umlauts in German, and b) often sent around
between systems, creating encoding problems.

Having seen how much controversy returning an invalid Unicode string sparks,
and given that it really isn't obvious to the newbie either, I think I now agree
that dropping filenames when calling a listdir() that returns Unicode filenames
is the best solution. I'm a little uneasy with having one function for both
bytes and Unicode return, because that kind of str/unicode mixing I thought we
had left behind in 2.x, but of course can live with it.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] [Python-3000] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 8:47 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> Victor Stinner <[EMAIL PROTECTED]> wrote:
>
>>  - listdir(unicode) -> only unicode, *skip* invalid filenames
>>(as asked by Guido)
>
> Is there an option listdir(bytes) which will return *all* filenames (as
> byte sequences)?  Otherwise, this seems troubling to me; *something*
> should be returned for filenames which can't be represented, even if
> it's only None.

Yes, os.listdir() becomes polymorphic -- if you pass it a pathname in
bytes the output is in bytes and it will return everything exactly as
the underlying syscall returns it to you.
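
A minimal sketch of the polymorphism described above, assuming a Python 3 interpreter in which os.listdir() accepts both str and bytes. The scratch directory and the file name "report.txt" are hypothetical, chosen only for illustration.

```python
import os
import tempfile

# Create a scratch directory holding one plainly named file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "report.txt"), "w").close()

    # A str argument yields the names as str.
    text_names = os.listdir(tmp)

    # A bytes argument yields the same names as bytes.
    byte_names = os.listdir(os.fsencode(tmp))

print(text_names)
print(byte_names)
```

The return type follows the argument type, so a caller that wants every name, decodable or not, can opt in by passing bytes.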

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 8:20 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> In the end, I think it's better not to be clever and just return
> the filenames that cannot be decoded as bytes objects in os.listdir().

Unfortunately that's going to break most code that is using
os.listdir(), so it's hardly an improved experience.

> Passing those to open() will then open the files as expected, in most
> other cases the application will have to provide explicit conversions
> in whatever way best fits the application.

In most cases the app will try to concatenate a pathname given as a
string and then it will fail.
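
The failure mode mentioned here can be sketched as follows, assuming Python 3 semantics for os.path.join(). Both names are hypothetical: "some_dir" stands for a pathname the application holds as text, and the bytes value stands for an undecodable entry that a mixed-type listdir() would have returned.

```python
import os

text_dir = "some_dir"        # hypothetical directory name held as text
raw_name = b"caf\xe9.txt"    # hypothetical name that is not valid UTF-8

# Mixing str and bytes components is rejected.
try:
    os.path.join(text_dir, raw_name)
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

# Joining works once both components have the same type.
joined = os.path.join(os.fsencode(text_dir), raw_name)

print(type(mixed_error).__name__)
print(joined)
```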

> Also note that os.listdir() isn't the only source of filenames. You
> often read them from a file, a database, some socket, etc, so letting
> the application decide what to do is not asking too much, IMHO.

In all those cases, the code that reads them is responsible for
picking an encoding or relying on a default encoding, and the
resulting filenames are always expressed as text, not bytes. I don't
think it's the same at all.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 7:53 AM, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> On Tue, 30 Sep 2008 11:50:10 pm Guido van Rossum wrote:
>
>> > To avoid silent skipping, is it possible to drop 'unreadable'
>> > names, issue a warning (instead of exception), and continue to
>> > completion? "Warning: unreadable filename skipped; see
>> > PyWiki/UnreadableFilenames"
>>
>> That would be annoying as hell in most cases.
>
> Doesn't the warning module default to only displaying the warning once
> per session? I don't see that it would be annoying as hell to be
> notified once per session that an error has occurred.
>
> What I'd find annoying as hell would be something like this:
>
> $ ls . | wc -l
> 25
> $ python
> ...
> >>> import os
> >>> len(os.listdir('.'))
> 24

And yet similar discrepancies happen all the time -- ls suppresses
filenames starting with '.', while os.listdir() shows them (except '.'
and '..' themselves). The Mac Finder and its Windows equivalent hide
lots of files from you. And have you considered mount points (on
Unix)?
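
The dotfile discrepancy is easy to reproduce. This is only a sketch, assuming Python 3; the two file names are hypothetical, and the list comprehension merely imitates what a plain `ls` without `-a` would display.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    for name in (".hidden", "visible"):
        open(os.path.join(tmp, name), "w").close()

    # os.listdir() reports dotfiles (but never '.' and '..').
    everything = sorted(os.listdir(tmp))

    # Imitate plain `ls`, which omits names starting with a dot.
    like_ls = [n for n in everything if not n.startswith(".")]

print(len(everything), len(like_ls))
```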

Face it. Filesystems are black boxes. They have roughly specified
behavior, but calls into them can fail or seem inconsistent for many
reasons -- concurrent changes by other processes, hidden files
(Windows), files that exist but can't be opened due to kernel-level
locking, etc. It's best not to worry too much about this.

Here's another anomaly:

>>> import os
>>> '.snapshot' in os.listdir('.')
False
>>> os.chdir('.snapshot')
>>> os.getcwd()
'/home/guido/bin/.snapshot'
>>>

IOW there's a hidden .snapshot directory that os.listdir() doesn't
return -- but it exists! This is a standard feature on NetApp filers.
(The reason this file is extra hidden is that it gives access to an
infinite set of backups that you don't want to be found by find(1),
os.walk() and their kin.)

> Give me a nice clear error, or even a warning. Don't let the error pass
> silently, unless I explicitly silence it.

Depends on your use case. We're talking here of a family of APIs where
different programs have different needs. I assert that most programs
are best served by an API that doesn't give them surprising and
irrelevant errors, as long as there's also an API for the few that
want to get to the bottom of things (or as close as they can get --
see above '.snapshot' example).

>> I consider the dropping of unreadable names similar to the
>> suppression of "hidden" files by various operating systems.
>
> With the exception of '.' and '..', I consider "hidden" files to be a
> serious design mistake, but at least most operating systems give the
> user a way to easily see all such hidden files if you ask.
>
> (Almost all. Windows has "superhidden" files that remain hidden even
> when the user asks to see hidden files, all the better to hide malware.
> But that's a rant for another list.)

Rant all you want, it won't go away.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Bill Janssen
Victor Stinner <[EMAIL PROTECTED]> wrote:

>  - listdir(unicode) -> only unicode, *skip* invalid filenames 
>(as asked by Guido)

Is there an option listdir(bytes) which will return *all* filenames (as
byte sequences)?  Otherwise, this seems troubling to me; *something*
should be returned for filenames which can't be represented, even if
it's only None.

Bill


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread M.-A. Lemburg
On 2008-09-30 16:05, Guido van Rossum wrote:
> On Tue, Sep 30, 2008 at 3:31 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> On 2008-09-30 08:00, Martin v. Löwis wrote:
>>>> Change the default file system encoding to store bytes in Unicode is like
>>>> introducing a new Python type: .
>>> Exactly. Seems like the best solution to me, despite your polemics.
>> Not a bad idea... have os.listdir() return Unicode subclasses that work
>> like file handles, ie. they have an extra buffer that holds the original
>> bytes value received from the underlying C API.
>>
>> Passing these handles to open() would then do the right thing by using
>> whatever os.listdir() got back from the file system to open the file,
>> while still providing a sane way to display the filename, e.g. using
>> question marks for the invalid characters.
>>
>> The only problem with this approach is concatenation of such handles
>> to form pathnames, but then perhaps those concatenations could just
>> work on the bytes value as well (I don't know of any OS that uses non-
>> ASCII path separators).
> 
> While this seems to work superficially I expect an infinite number of
> problems caused by code that doesn't understand this subclass. You are
> hinting at this in your last paragraph.

Well, to some extent Unicode objects themselves already implement
such a strategy: the default encoded bytes object basically provides
the low-level interfacing value.

But I agree, the approach is not foolproof.

In the end, I think it's better not to be clever and just return
the filenames that cannot be decoded as bytes objects in os.listdir().

Passing those to open() will then open the files as expected, in most
other cases the application will have to provide explicit conversions
in whatever way best fits the application.

Also note that os.listdir() isn't the only source of filenames. You
often read them from a file, a database, some socket, etc, so letting
the application decide what to do is not asking too much, IMHO.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Sep 30 2008)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Steven D'Aprano
On Tue, 30 Sep 2008 11:50:10 pm Guido van Rossum wrote:

> > To avoid silent skipping, is it possible to drop 'unreadable'
> > names, issue a warning (instead of exception), and continue to
> > completion? "Warning: unreadable filename skipped; see
> > PyWiki/UnreadableFilenames"
>
> That would be annoying as hell in most cases.

Doesn't the warning module default to only displaying the warning once 
per session? I don't see that it would be annoying as hell to be 
notified once per session that an error has occurred.
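
As a point of reference, the "default" action of the warnings module shows a given warning once per source location (module and line number) rather than strictly once per session. A minimal sketch, in which skip_unreadable() is a hypothetical stand-in for a listdir() that skips a name and warns:

```python
import warnings

def skip_unreadable():
    # Hypothetical stand-in: pretend a filename was skipped and say so.
    warnings.warn("unreadable filename skipped", UnicodeWarning)

with warnings.catch_warnings(record=True) as caught:
    # Use the "default" action: report each location only the first time.
    warnings.simplefilter("default")
    for _ in range(3):
        skip_unreadable()

shown = len(caught)
print(shown)
```

Three calls from the same line are collapsed into one report; calls from different lines would each be reported once.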

What I'd find annoying as hell would be something like this:

$ ls . | wc -l
25
$ python
...
>>> import os
>>> len(os.listdir('.'))
24


Give me a nice clear error, or even a warning. Don't let the error pass 
silently, unless I explicitly silence it.


> I consider the dropping of unreadable names similar to the
> suppression of "hidden" files by various operating systems.

With the exception of '.' and '..', I consider "hidden" files to be a 
serious design mistake, but at least most operating systems give the 
user a way to easily see all such hidden files if you ask.

(Almost all. Windows has "superhidden" files that remain hidden even 
when the user asks to see hidden files, all the better to hide malware. 
But that's a rant for another list.)

 

-- 
Steven


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Hrvoje Nikšić
On Tue, 2008-09-30 at 07:26 -0700, Guido van Rossum wrote:
> > I am not convinced that a word processor can just ignore files with
> > (what it thinks are) undecodable file names.  In countries with a
> > history of incompatible national encodings, such file names crop up very
> > often, sometimes as a natural consequence of data migrating from older
> > systems to newer ones.  You can and do encounter "invalid" file names in
> > the filesystems of mainstream users even without them using buggy or
> > obsolete software.
> 
> This is a quality of implementation issue. Either the word processor
> is written to support "undecodable" files, or it isn't. If it isn't,
> there's nothing that can be done about it (short of buying another
> wordprocessor)

I agree with this.  I just believe the underlying python APIs shouldn't
make it impossible (or unnecessarily hard) for the word processor to
implement showing of files with undecodable names.

For example, implementing os.listdir to return the file names as Unicode
subclasses with ability to access the underlying bytes (automatically
recognized by open and friends) sounds like a good compromise that
allows the word processor to both have the cake and eat it.



Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 6:21 AM,  <[EMAIL PROTECTED]> wrote:
> On 12:47 am, [EMAIL PROTECTED] wrote:
>
> This is the most sane contribution I've seen so far :).

Thanks. I'll review it later today (after coffee+breakfast :) and will
apply it assuming the code is reasonably sane, otherwise I'll go
around with Victor until it is to my satisfaction.

>> See attached patch: python3_bytes_filename.patch
>>
>> Using the patch, you will get:
>> - open() support bytes
>> - listdir(unicode) -> only unicode, *skip* invalid filenames
>>  (as asked by Guido)
>
> Forgive me for being a bit dense, but I couldn't find this hunk in the
> patch.  Do I understand properly that (listdir(bytes) -> bytes)?
>
> If so, this seems basically sane to me, since it provides text behavior
> where possible and allows more sophisticated filesystem wrappers (i.e.
> Twisted's FilePath, Will McGugan's "FS") to do more tricky things,
> separating filenames for display to the user and filenames for exchange with
> the FS.
>>
>> - remove os.getcwdu()
>> - create os.getcwdb() -> bytes
>> - glob.glob() support bytes
>> - fnmatch.filter() support bytes
>> - posixpath.join() and posixpath.split() support bytes
>
> It sounds like maybe there should be some 2to3 fixers in here somewhere,
> too?  Not necessarily as part of this patch, but somewhere related?  I don't
> know what they would do, but it does seem quite likely that code which was
> previously correct under 2.6 (using bytes) would suddenly be mixing bytes
> and unicode with these APIs.

Doesn't seem easy for 2to3 to recognize such cases.

If 2.6 weren't pretty much released already I'd ask to add
os.getcwdb() there, as an alias for os.getcwd(), and add a 2to3 fixer
that converts os.getcwdu() to os.getcwd(), leaves os.getcwd() alone
(benefit of the doubt) and leaves os.getcwdb() alone as well (a strong
indication the user meant to get bytes in the 3.x version of their
code). (Similar to using bytes instead of str in 2.6 even though they
mean the same thing there -- they will be properly separated in 3.x.)
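
For comparison, a minimal sketch of how the two calls differ, assuming a Python 3 interpreter that provides os.getcwdb() as proposed in the patch:

```python
import os

cwd_text = os.getcwd()    # text (str)
cwd_raw = os.getcwdb()    # raw bytes

print(type(cwd_text).__name__, type(cwd_raw).__name__)
```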

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] when is path==NULL?

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 5:48 AM, Christian Heimes <[EMAIL PROTECTED]> wrote:
> Ulrich Eckhardt wrote:
>>
>> Hi!
>>
>> I'm looking at trunk/Python/sysmodule.c, function PySys_SetArgv(). In that
>> function, there is code like this:
>>
>>  PyObject* path = PySys_GetObject("path");
>>  ...
>>  if (path != NULL) {
>>...
>>  }
>>
>> My intuition says that if path==NULL, something is very wrong. At least I
>> would expect to get 'None', but never NULL, except when out of memory. So,
>> for the case that path==NULL, I would simply invoke Py_FatalError("no mem
>> for sys.path"), similarly to the other call there.
>
> PySys_GetObject may return NULL after the user has removed sys.path with
> delattr(sys, 'path'). There are valid applications for removing sys.path.

Or before sys.path is initialized using PySys_SetPath(). Trust me,
this code is as it should be.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 3:52 AM, Hrvoje Nikšić <[EMAIL PROTECTED]> wrote:
> On Tue, 2008-09-30 at 19:45 +1000, Nick Coghlan wrote:
>> To my mind, there are two kinds of app in the world when it comes to
>> file paths:
>> 1) "Normal" apps (e.g. a word processor), that are only interested in
>> files with sane, well-formed file names that can be properly decoded to
>> Unicode with the filesystem encoding identified by Python. If there is
>> invalid data on the filesystem, they don't care and don't want to see it
>> or have to deal with it.
>
> I am not convinced that a word processor can just ignore files with
> (what it thinks are) undecodable file names.  In countries with a
> history of incompatible national encodings, such file names crop up very
> often, sometimes as a natural consequence of data migrating from older
> systems to newer ones.  You can and do encounter "invalid" file names in
> the filesystems of mainstream users even without them using buggy or
> obsolete software.

This is a quality of implementation issue. Either the word processor
is written to support "undecodable" files, or it isn't. If it isn't,
there's nothing that can be done about it (short of buying another
wordprocessor) and it shouldn't be crippled by the mere *presence* of
an undecodable file in a directory. I can think of lots of apps that
have a sufficiently small or homogeneous audience (e.g. lots of
in-house apps) that they don't need to care about such files, and
these shouldn't break when they are used in the vicinity of an
undecodable filename -- it's enough if they just ignore it.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 2:45 AM, Nick Coghlan <[EMAIL PROTECTED]> wrote:
> Adam Olsen wrote:
>> Lossy conversion just moves around what gets treated as garbage.  As
>> all valid unicode scalars can be round tripped, there's no way to
>> create a valid unicode file name without being lossy.  The alternative
>> is to not be valid unicode, but since we can't use such objects with
>> external libs, can't even print them, we might as well call them
>> something else.  We already have a name for that: bytes.
>
> To my mind, there are two kinds of app in the world when it comes to
> file paths:
> 1) "Normal" apps (e.g. a word processor), that are only interested in
> files with sane, well-formed file names that can be properly decoded to
> Unicode with the filesystem encoding identified by Python. If there is
> invalid data on the filesystem, they don't care and don't want to see it
> or have to deal with it.
> 2) "Filesystem" apps (e.g. a filesystem explorer), that need to be able
> to deal with malformed filenames that may not decode properly using the
> identified filesystem encoding.
>
> For the former category of apps, the presence of a malformed filename
> should NOT disrupt the processing of well-formed files and directories.
> Those applications should "just work", even if the underlying filesystem
> has a few broken filenames.

Right. Totally agreed.

> The latter category of applications need some way of defining their own
> application-specific handling of malformed names.

Agreed again.

> That screams "callback" to me - and one mechanism to achieve that would
> be to expose the unicode "errors" argument for filesystem operations
> that return file paths (e.g. os.getcwd(), os.listdir(), os.readlink(),
> os.walk()).

Hm. This doesn't scream callback to me at all. I would never have
thought of callbacks for this use case -- and I don't think it's a
good idea. The callback would either be an extra argument to all
system calls (bad, ugly etc., and why not go with the existing unicode
encoding and error flags if we're adding extra args?) or would be
global, where I'd be worried that it might interfere with the proper
operation of library code that is several abstractions away from
whoever installed the callback, not under their control, and not
expecting the callback.

I suppose I may have totally misunderstood your proposal, but in
general I find callbacks unwieldy.

> Once that was exposed, the existing error handling machinery in the
> codecs module could be used to allow applications to define their own
> custom error handling for Unicode decode errors in these operations.
> (e.g. set "codecs.register_error('bad_filepath',
> handle_filepath_error)", then use "errors='bad_filepath'" in the
> relevant os API calls)
>
> The default handling could be left at "strict", with os.listdir() and
> os.walk() specifically ignoring path entries that trigger
> UnicodeDecodeError.
>
> getcwd() and readlink() could just propagate the exception, since they
> have no other information to return.
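
The codecs machinery referred to in the quoted proposal can be sketched as below. The os functions do not take an "errors" argument here; that part is the proposal itself, so this sketch applies the handler to bytes.decode() directly. The handler policy (one "?" per undecodable byte) and the file name are hypothetical.

```python
import codecs

def handle_filepath_error(exc):
    # Hypothetical policy: show each undecodable byte as "?".
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return ("?" * len(bad), exc.end)

codecs.register_error("bad_filepath", handle_filepath_error)

raw_name = b"dossi\xe9.txt"   # Latin-1 bytes, invalid as UTF-8
shown_name = raw_name.decode("utf-8", errors="bad_filepath")

print(shown_name)
```

Because register_error() installs the handler under a process-wide name, a library several layers away would only be affected if it explicitly asked for "bad_filepath".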



-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Victor Stinner
On Tuesday 30 September 2008 15:53:09, Guido van Rossum wrote:
> On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> >> Change the default file system encoding to store bytes in Unicode is
> >> like introducing a new Python type: .
> >
> > Exactly. Seems like the best solution to me, despite your polemics.
>
> Martin, I don't understand why you are in favor of storing raw bytes
> encoded as Latin-1 in Unicode string objects, which clearly gives rise
> to mojibake. In the past you have always been staunchly opposed to API
> changes or practices that could lead to mojibake (and you had me quite
> convinced).

If I understood correctly, the goal of Python3 is the clear *separation* of 
bytes and characters. Storing bytes in Unicode is practical because it doesn't 
require changing the existing code, but it doesn't fix the problem; it just 
moves the problems, which will be raised later.

I didn't get an answer to my question: what is the result  + ? I guess that the result is 
 instead of raising an error 
(invalid types). So again: why introduce a new type instead of reusing 
existing Python types?

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Tue, Sep 30, 2008 at 3:31 AM, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> On 2008-09-30 08:00, Martin v. Löwis wrote:
>>> Change the default file system encoding to store bytes in Unicode is like
>>> introducing a new Python type: .
>>
>> Exactly. Seems like the best solution to me, despite your polemics.
>
> Not a bad idea... have os.listdir() return Unicode subclasses that work
> like file handles, ie. they have an extra buffer that holds the original
> bytes value received from the underlying C API.
>
> Passing these handles to open() would then do the right thing by using
> whatever os.listdir() got back from the file system to open the file,
> while still providing a sane way to display the filename, e.g. using
> question marks for the invalid characters.
>
> The only problem with this approach is concatenation of such handles
> to form pathnames, but then perhaps those concatenations could just
> work on the bytes value as well (I don't know of any OS that uses non-
> ASCII path separators).

While this seems to work superficially I expect an infinite number of
problems caused by code that doesn't understand this subclass. You are
hinting at this in your last paragraph.
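
One concrete instance of the concern, as a sketch only: RawFilename is a hypothetical str subclass written for this illustration, not an API from the thread, and it shows how ordinary string operations silently discard the extra bytes buffer.

```python
class RawFilename(str):
    """Hypothetical str subclass that remembers the original bytes."""

    def __new__(cls, raw, encoding="utf-8"):
        # Decode for display; undecodable bytes become U+FFFD.
        self = super().__new__(cls, raw.decode(encoding, errors="replace"))
        self.raw = raw
        return self

name = RawFilename(b"t\xeaste")

# Code unaware of the subclass concatenates as usual ...
joined = "dir/" + name

# ... and gets back a plain str with no bytes attached.
print(type(joined).__name__, hasattr(joined, "raw"))
```

Every str method that returns a new string behaves the same way unless the subclass overrides it, which is where the open-ended list of problems comes from.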

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Mon, Sep 29, 2008 at 11:22 PM, Georg Brandl <[EMAIL PROTECTED]> wrote:
> No, that was not what I meant (although it is another possibility). As I
> wrote, Martin's proposal that I support here is using the modified UTF-8
> codec that successfully roundtrips otherwise invalid UTF-8 data.

I thought that the "successful roundtripping" pretty much stopped as
soon as the unicode data is exported to somewhere else -- doesn't it
contain invalid surrogate sequences?

In general, I'm very reluctant to use utf-8b given that it doesn't
seem to be well documented as a standard anywhere. Providing some
minimal APIs that can process raw-bytes filenames still makes more
sense -- it is mostly analogous of our treatment of text files, where
the underlying binary data is also accessible.
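
The roundtripping question can be tried out with the "surrogateescape" error handler, which later Python 3 releases ship (PEP 383) and which follows the same idea as utf-8b. That handler postdates this thread, so treat the sketch as an assumption about the reader's interpreter; the file name is hypothetical.

```python
raw_name = b"dossi\xe9"   # hypothetical file name that is not valid UTF-8

# Undecodable bytes are smuggled into the str as lone surrogates.
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

# Exporting the text with a strict codec fails on the lone surrogate.
try:
    as_text.encode("utf-8")
except UnicodeEncodeError as exc:
    export_error = exc
else:
    export_error = None

print(round_tripped == raw_name, type(export_error).__name__)
```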

> You seem to forget that (disregarding OSX here, since it already enforces
> UTF-8) the majority of file names on Posix systems will be encoded correctly.

Apparently under certain circumstances (external FS mounted) OSX can
also have non-UTF-8 filenames.

[...]

> With the filenames decoded by UTF-8, your files named têste, ô, dossié will
> be displayed and handled correctly. The others are *invalid* in the filesystem
> encoding UTF-8 and therefore would be represented by something like
>
> u'dir\uXXffname' where XX is some private use Unicode namespace. It won't
> look pretty when printed, but then, what do other applications do? They e.g.
> display a question mark as you show above, which is not better in terms of
> readability.
>
> But it will work when given to a filename-handling function. Valid filenames
> can be compared to Unicode strings.
>
> A real-world example: OpenOffice can't open files with invalid bytes in their
> name. They are displayed in the "Open file" dialog, but trying to open fails.
> This regularly drives me crazy. Let's not make Python not work this way too,
> or, even worse, not even display those filenames.

How can it *regularly* drive you crazy when "the majority of file names
[...] encoded correctly" (as you assert above)?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread Victor Stinner
Hi,

> This is the most sane contribution I've seen so far :).

Oh thanks.

> Do I understand properly that (listdir(bytes) -> bytes)?

Yes, os.listdir(bytes)->bytes. It's already the current behaviour.

But with Python3 trunk, os.listdir(str) -> str ... or bytes (if unicode 
conversion fails).

> If so, this seems basically sane to me, since it provides text behavior
> where possible and allows more sophisticated filesystem wrappers (i.e.
> Twisted's FilePath, Will McGugan's "FS") to do more tricky things,
> separating filenames for display to the user and filenames for exchange
> with the FS.

It's the goal of my patch. Let people do what they want with bytes: rename the 
file, try the best charset to display the filename, etc.

> >- remove os.getcwdu()
> >- create os.getcwdb() -> bytes
> >- glob.glob() support bytes
> >- fnmatch.filter() support bytes
> >- posixpath.join() and posixpath.split() support bytes
>
> It sounds like maybe there should be some 2to3 fixers in here somewhere,
> too?

IMHO a programmer should not use bytes for filenames. Only specific programs 
used to fix a broken system (eg. convmv program), a backup program, etc. 
should use bytes. So the "default" type (type and not charset) for filenames 
should be str in Python3.

If my patch is applied, 2to3 will have to replace getcwdu() with getcwd(). 
That's all.

> Not necessarily as part of this patch, but somewhere related?  I 
> don't know what they would do, but it does seem quite likely that code
> which was previously correct under 2.6 (using bytes) would suddenly be
> mixing bytes and unicode with these APIs.

It looks like 2to3 converts all text literals, '...' or u'...', to unicode 
(str). So converted programs will use str for filenames.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>> Change the default file system encoding to store bytes in Unicode is like
>> introducing a new Python type: .
>
> Exactly. Seems like the best solution to me, despite your polemics.

Martin, I don't understand why you are in favor of storing raw bytes
encoded as Latin-1 in Unicode string objects, which clearly gives rise
to mojibake. In the past you have always been staunchly opposed to API
changes or practices that could lead to mojibake (and you had me quite
convinced).
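
To make the mojibake concern concrete, here is a small sketch (the file name is invented for the example): decoding arbitrary bytes as Latin-1 never fails and always round-trips, but the resulting text is not what the user typed.

```python
# A file name whose bytes on disk are UTF-8.
raw = "café.txt".encode("utf-8")

# Latin-1 maps every byte to a code point, so decoding never fails...
as_latin1 = raw.decode("latin-1")

# ...and the original bytes can always be recovered,
round_tripped = as_latin1.encode("latin-1")

# but the text shown to a user is mojibake, not the intended name.
looks_right = (as_latin1 == "café.txt")
```

Here as_latin1 displays as "cafÃ©.txt": the bytes survive, the meaning does not.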

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Guido van Rossum
On Mon, Sep 29, 2008 at 8:55 PM, Terry Reedy <[EMAIL PROTECTED]> wrote:
>
>> On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
>
>>> I know I keep flipflopping on this one, but the more I think about it
>>> the more I believe it is better to drop those names than to raise an
>>> exception. Otherwise a "naive" program that happens to use
>>> os.listdir() can be rendered completely useless by a single non-UTF-8
>>> filename. Consider the use of os.listdir() by the glob module. If I am
>>> globbing for *.py, why should the presence of a file named b'\xff'
>>> cause it to fail?
>
> To avoid silent skipping, is it possible to drop 'unreadable' names, issue a
> warning (instead of exception), and continue to completion?
> "Warning: unreadable filename skipped; see PyWiki/UnreadableFilenames"

That would be annoying as hell in most cases.

I consider the dropping of unreadable names similar to the suppression
of "hidden" files by various operating systems.
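
A minimal sketch of what "dropping" means for a naive caller such as glob. The helper decodable_names() and the sample listing are hypothetical, not part of any patch under discussion, and UTF-8 is assumed as the filesystem encoding.

```python
import fnmatch


def decodable_names(raw_names, encoding="utf-8"):
    """Return the names that decode cleanly, silently dropping the rest."""
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            continue  # treated like a hidden file: not reported, not fatal
    return result


# Hypothetical directory contents as raw bytes; b'\xff' is not valid UTF-8.
raw_listing = [b"setup.py", b"\xff", b"README", b"test_glob.py"]
py_files = fnmatch.filter(decodable_names(raw_listing), "*.py")
```

The glob for *.py completes and simply never sees the b'\xff' entry.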

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-30 Thread glyph

On 12:47 am, [EMAIL PROTECTED] wrote:

This is the most sane contribution I've seen so far :).

> See attached patch: python3_bytes_filename.patch
>
> Using the patch, you will get:
> - open() support bytes
> - listdir(unicode) -> only unicode, *skip* invalid filenames
>   (as asked by Guido)


Forgive me for being a bit dense, but I couldn't find this hunk in the 
patch.  Do I understand properly that (listdir(bytes) -> bytes)?


If so, this seems basically sane to me, since it provides text behavior 
where possible and allows more sophisticated filesystem wrappers (i.e. 
Twisted's FilePath, Will McGugan's "FS") to do more tricky things, 
separating filenames for display to the user and filenames for exchange 
with the FS.

> - remove os.getcwdu()
> - create os.getcwdb() -> bytes
> - glob.glob() support bytes
> - fnmatch.filter() support bytes
> - posixpath.join() and posixpath.split() support bytes


It sounds like maybe there should be some 2to3 fixers in here somewhere, 
too?  Not necessarily as part of this patch, but somewhere related?  I 
don't know what they would do, but it does seem quite likely that code 
which was previously correct under 2.6 (using bytes) would suddenly be 
mixing bytes and unicode with these APIs.



Re: [Python-Dev] when is path==NULL?

2008-09-30 Thread Thomas Lee

Ulrich Eckhardt wrote:
> Hi!
>
> I'm looking at trunk/Python/sysmodule.c, function PySys_SetArgv(). In that 
> function, there is code like this:
>
>   PyObject* path = PySys_GetObject("path");
>   ...
>   if (path != NULL) {
>       ...
>   }
>
> My intuition says that if path==NULL, something is very wrong. At least I 
> would expect to get 'None', but never NULL, except when out of memory. So, 
> for the case that path==NULL, I would simply invoke Py_FatalError("no mem 
> for sys.path"), similarly to the other call there.
>
> Sounds reasonable?
>
> Uli

  
I also meant to mention that there might be a reason why we want the out 
of memory error to bubble up to the caller should that happen while 
attempting to allocate the PyString in PyDict_GetItemString, rather than 
just bailing out with a generic FatalError.


Cheers,
T


Re: [Python-Dev] when is path==NULL?

2008-09-30 Thread Thomas Lee

Ulrich Eckhardt wrote:
> Hi!
>
> I'm looking at trunk/Python/sysmodule.c, function PySys_SetArgv(). In that 
> function, there is code like this:
>
>   PyObject* path = PySys_GetObject("path");
>   ...
>   if (path != NULL) {
>       ...
>   }
>
> My intuition says that if path==NULL, something is very wrong. At least I 
> would expect to get 'None', but never NULL, except when out of memory. So, 
> for the case that path==NULL, I would simply invoke Py_FatalError("no mem 
> for sys.path"), similarly to the other call there.
>
> Sounds reasonable?
>
> Uli

  

Maybe it's just being safe?

From Python/sysmodule.c:

   PyThreadState *tstate = PyThreadState_GET();
   PyObject *sd = tstate->interp->sysdict;
   if (sd == NULL)
       return NULL;
   return PyDict_GetItemString(sd, name);


So if tstate->interp->sysdict is NULL, we return NULL. That's probably a 
bit unlikely.


However, PyDict_GetItemString attempts to allocate a new PyString from 
the given char* key. If that fails, PySys_GetObject will also return 
NULL -- just like most functions in the code base that hit an out of 
memory error:


PyObject *
PyDict_GetItemString(PyObject *v, const char *key)
{
    PyObject *kv, *rv;
    kv = PyString_FromString(key);
    if (kv == NULL)
        return NULL;
    rv = PyDict_GetItem(v, kv);
    Py_DECREF(kv);
    return rv;
}

Seems perfectly reasonable for it to return NULL in this situation.

Cheers,
T



Re: [Python-Dev] when is path==NULL?

2008-09-30 Thread Christian Heimes

Ulrich Eckhardt wrote:
> Hi!
>
> I'm looking at trunk/Python/sysmodule.c, function PySys_SetArgv(). In that 
> function, there is code like this:
>
>   PyObject* path = PySys_GetObject("path");
>   ...
>   if (path != NULL) {
>       ...
>   }
>
> My intuition says that if path==NULL, something is very wrong. At least I 
> would expect to get 'None', but never NULL, except when out of memory. So, 
> for the case that path==NULL, I would simply invoke Py_FatalError("no mem 
> for sys.path"), similarly to the other call there.


PySys_GetObject may return NULL after the user has removed sys.path with 
delattr(sys, 'path'). There are valid applications for removing sys.path.
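
The Python-level view of this can be sketched as follows (illustration only; the variable names are invented, and the path is restored immediately so that later imports keep working):

```python
import sys

saved_path = sys.path      # keep a reference so it can be restored
del sys.path               # same effect as delattr(sys, 'path')
try:
    path_is_missing = not hasattr(sys, "path")
    # Defensive lookup, the Python analogue of the C-level NULL check.
    fallback = getattr(sys, "path", None)
finally:
    sys.path = saved_path  # put it back before anything tries to import
```

While the attribute is gone, a lookup finds nothing at all rather than None, which is the case the C code guards against.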


Christian



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Adam Olsen
On Tue, Sep 30, 2008 at 5:24 AM, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
> Adam Olsen writes:
>
>  > [1] You could argue that Unicode should add new scalars to handle all
>  > currently invalid UTF-8 sequences.
>
> AFAIK there are about 2^31 of these, though!

They've promised to never allocate above U+10FFFF (0 to 1114111).  Not
sure that makes new additions easier or harder. ;)
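
For reference, the limit can be checked from Python itself. This is a small sketch that assumes Python 3.3 or later, where sys.maxunicode no longer depends on the build:

```python
import sys

highest = 0x10FFFF   # last code point in the Unicode code space

# On Python 3.3 and later, sys.maxunicode is always this value.
same_as_runtime = (sys.maxunicode == highest)

try:
    chr(highest + 1)             # one past the end of the code space
    beyond_is_rejected = False
except ValueError:
    beyond_is_rejected = True
```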

-- 
Adam Olsen, aka Rhamphoryncus


[Python-Dev] when is path==NULL?

2008-09-30 Thread Ulrich Eckhardt
Hi!

I'm looking at trunk/Python/sysmodule.c, function PySys_SetArgv(). In that 
function, there is code like this:

  PyObject* path = PySys_GetObject("path");
  ...
  if (path != NULL) {
      ...
  }

My intuition says that if path==NULL, something is very wrong. At least I 
would expect to get 'None', but never NULL, except when out of memory. So, 
for the case that path==NULL, I would simply invoke Py_FatalError("no mem 
for sys.path"), similarly to the other call there.

Sounds reasonable?

Uli

-- 
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

This e-mail, including all attachments, is intended only for the addressee 
and may contain confidential information. Please notify the sender 
immediately if you are not the intended recipient. In that case the e-mail 
must be deleted and may not be read, forwarded, published or otherwise used.
E-mails can be read by third parties and may contain viruses and unauthorised 
changes. Sator Laser GmbH is not responsible for these consequences.



Re: [Python-Dev] Python security team

2008-09-30 Thread Steve Holden
Jan Matejek wrote:
> Guido van Rossum wrote:
[...]
>> know you personally -- but perhaps other current members of the PSRT
>> do and that could be enough to secure an invitation.
> 
> No, i don't think that i'm known well enough to earn the invitation
> (yet), this was more of a "so how the hell does it really work" question.
> 
I haven't yet heard anyone make a convincing case that it does. It is a
great idea, and we *do* need to take security seriously, but at present
all we have is a bunch of well-intentioned and over-committed volunteers.

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



Re: [Python-Dev] Python security team

2008-09-30 Thread jek <[EMAIL PROTECTED]>
Guido van Rossum wrote:
> I think we may have to expand our selection criteria, since the
> existing approach has led to a small PSRT whose members are all too
> busy to do the necessary legwork. At the same time we need to remain
> selective -- I don't think having a crowd of hundreds would be
> productive, and we need to be sure that every single member can
> absolutely be trusted to take security seriously.

of course

> 
> To answer your question directly, I don't think that just being the
> Python maintainer for some Linux distribution is enough to qualify --
> if our process worked well enough, you'd be getting the patches from
> us via some downstream-flowing distribution mechanism that reaches
> only trusted people within each vendor organization. I don't happen to

Thanks for your answer. I guess the process is the real problem then.
From what i could observe, the connection between vendor-sec and PSRT is
not really working as it should.
(And then of course you need some kind of upstream flow too, because not
everyone reports to PSRT.)

> know you personally -- but perhaps other current members of the PSRT
> do and that could be enough to secure an invitation.
> 
No, i don't think that i'm known well enough to earn the invitation
(yet), this was more of a "so how the hell does it really work" question.


regards,
jan matejek


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Hrvoje Nikšić
On Tue, 2008-09-30 at 19:45 +1000, Nick Coghlan wrote:
> To my mind, there are two kinds of app in the world when it comes to
> file paths:
> 1) "Normal" apps (e.g. a word processor), that are only interested in
> files with sane, well-formed file names that can be properly decoded to
> Unicode with the filesystem encoding identified by Python. If there is
> invalid data on the filesystem, they don't care and don't want to see it
> or have to deal with it.

I am not convinced that a word processor can just ignore files with
(what it thinks are) undecodable file names.  In countries with a
history of incompatible national encodings, such file names crop up very
often, sometimes as a natural consequence of data migrating from older
systems to newer ones.  You can and do encounter "invalid" file names in
the filesystems of mainstream users even without them using buggy or
obsolete software.
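
A small sketch of how such a name comes about. The file name is invented, ISO-8859-2 stands in for "an older national encoding", and UTF-8 for the newer system's filesystem encoding:

```python
# The name as written by an older system that used ISO-8859-2 (Latin-2).
legacy_bytes = "Nikšić.txt".encode("iso-8859-2")

try:
    legacy_bytes.decode("utf-8")   # what a UTF-8 system would attempt
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# With the right legacy codec the name is perfectly readable.
recovered = legacy_bytes.decode("iso-8859-2")
```

Nothing here is buggy or malicious; the bytes are simply correct in a different encoding from the one the current system assumes.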




Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread Stephen J. Turnbull
Adam Olsen writes:

 > [1] You could argue that Unicode should add new scalars to handle all
 > currently invalid UTF-8 sequences.

AFAIK there are about 2^31 of these, though!



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-30 Thread M.-A. Lemburg
On 2008-09-30 08:00, Martin v. Löwis wrote:
>> Change the default file system encoding to store bytes in Unicode is like 
>> introducing a new Python type: .
> 
> Exactly. Seems like the best solution to me, despite your polemics.

Not a bad idea... have os.listdir() return Unicode subclasses that work
like file handles, ie. they have an extra buffer that holds the original
bytes value received from the underlying C API.

Passing these handles to open() would then do the right thing by using
whatever os.listdir() got back from the file system to open the file,
while still providing a sane way to display the filename, e.g. using
question marks for the invalid characters.

The only problem with this approach is concatenation of such handles
to form pathnames, but then perhaps those concatenations could just
work on the bytes value as well (I don't know of any OS that uses non-
ASCII path separators).
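
A very rough sketch of what such a handle could look like. The class FileName, its attribute raw and its method joined() are all hypothetical names invented here; U+FFFD is used for display instead of a literal question mark, and "/" is assumed as the separator.

```python
class FileName(str):
    """Hypothetical str subclass that remembers the bytes it came from."""

    def __new__(cls, raw, encoding="utf-8"):
        # Undecodable bytes become U+FFFD so the name can still be shown.
        self = super().__new__(cls, raw.decode(encoding, "replace"))
        self.raw = raw
        self.encoding = encoding
        return self

    def joined(self, other):
        """Concatenate on the bytes values, using an ASCII separator."""
        return FileName(self.raw + b"/" + other.raw, self.encoding)


directory = FileName(b"docs")
broken = FileName(b"report-\xff.txt")
full = directory.joined(broken)
```

Code that wants to display the name uses the object as ordinary text, while code that talks to the file system would pass full.raw, for example to open().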

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Sep 30 2008)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-30 Thread Nick Coghlan
Adam Olsen wrote:
> Lossy conversion just moves around what gets treated as garbage.  As
> all valid unicode scalars can be round tripped, there's no way to
> create a valid unicode file name without being lossy.  The alternative
> is to not be valid unicode, but since we can't use such objects with
> external libs, can't even print them, we might as well call them
> something else.  We already have a name for that: bytes.

To my mind, there are two kinds of app in the world when it comes to
file paths:
1) "Normal" apps (e.g. a word processor), that are only interested in
files with sane, well-formed file names that can be properly decoded to
Unicode with the filesystem encoding identified by Python. If there is
invalid data on the filesystem, they don't care and don't want to see it
or have to deal with it.
2) "Filesystem" apps (e.g. a filesystem explorer), that need to be able
to deal with malformed filenames that may not decode properly using the
identified filesystem encoding.

For the former category of apps, the presence of a malformed filename
should NOT disrupt the processing of well-formed files and directories.
Those applications should "just work", even if the underlying filesystem
has a few broken filenames.

The latter category of applications need some way of defining their own
application-specific handling of malformed names.

That screams "callback" to me - and one mechanism to achieve that would
be to expose the unicode "errors" argument for filesystem operations
that return file paths (e.g. os.getcwd(), os.listdir(), os.readlink(),
os.walk()).

Once that was exposed, the existing error handling machinery in the
codecs module could be used to allow applications to define their own
custom error handling for Unicode decode errors in these operations.
(e.g. set "codecs.register_error('bad_filepath',
handle_filepath_error)", then use "errors='bad_filepath'" in the
relevant os API calls)
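
The codecs half of this idea can already be sketched with existing machinery. Note the assumptions: the os functions do not take an errors argument, so the decode is done by hand on a made-up name, and the handler body is just one possible policy.

```python
import codecs


def handle_filepath_error(exc):
    """Error callback: render each undecodable byte as a visible \\xNN escape."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("\\x%02x" % byte for byte in bad)
    return replacement, exc.end   # resume decoding after the bad bytes


codecs.register_error("bad_filepath", handle_filepath_error)

# Stand-in for a name returned by the OS, decoded with the custom handler.
shown = b"caf\xe9.txt".decode("utf-8", errors="bad_filepath")
```

A filesystem explorer could register a handler like this once and get displayable names everywhere, while the default "strict" behaviour stays unchanged for everyone else.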

The default handling could be left at "strict", with os.listdir() and
os.walk() specifically ignoring path entries that trigger
UnicodeDecodeError.

getcwd() and readlink() could just propagate the exception, since they
have no other information to return.

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
http://www.boredomandlaziness.org


Re: [Python-Dev] Status of MS Windows CE port

2008-09-30 Thread Ulrich Eckhardt
On Tuesday 30 September 2008, Martin v. Löwis wrote:
> Ulrich Eckhardt wrote:
> >>> Well, currently it does make a difference. Simple example:
> >>> CreateFile().
> >>
> >> It's not so simple: Python doesn't actually call CreateFile
> >
> > Martin, CreateFile() was just used as an example. You can substitute it
> > with LoadString() or CreateProcess() if you like, the problem remains the
> > same.
>
> However, the solution should be different from the one you propose. I
> don't know what call of CreateProcess you are referring to specifically,
> but I think they should all be changed to call CreateProcessW.
>
> Again, whether or not _UNICODE is defined should have no effect. If it
> does, it's a bug, and the solution is not to sprinkle TCHAR all over the
> place.

I think we're misunderstanding each other, because that is exactly the 
solution I'm targeting. I'm aware that TCHAR is just a hack to ease the 
transition between obsolete MS Windows versions and NT and later. However, 
the current state is that Python uses CreateProcessA(), and changing that is 
not always trivial.

Therefore, my first step in porting is also to remove the dependency on TCHAR, 
i.e. replace things like CreateProcess() with preferably CreateProcessW() or 
alternatively CreateProcessA(). Just #defining _UNICODE in the build already 
turns up around 50 places in pythoncore that need work. I'll send patches 
soon.

> > [about using SCons for building]
> >
> >> And you *can* provide an SCons file that supports all the SDKs?
> >
> > No, but I can provide one that allows parametrisation. ;)
>
> And, with proper parametrization, then supports all SDKs?

Hopefully, yes, but I'm not going to make any claims which I'm not sure about. 
SCons is just convenient because the PythonCE project already uses it, but I'm 
not adamant on that matter. The approach of Jack Jansen, generating the 
project files (which are XML, btw) from a Python script, also sounds good.

cheers

Uli

-- 
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
