Re: [Python-Dev] Status of MS Windows CE port

2008-09-29 Thread Ulrich Eckhardt
On Friday 26 September 2008, Martin v. Löwis wrote:
> >> Please don't. Whether or not _UNICODE is defined should have no effect.
> >
> > Well, currently it does make a difference. Simple example: CreateFile().
>
> It's not so simple: Python doesn't actually call CreateFile, AFAICT
> (well, it does, in _msi.c, but I hope you aren't worried about that
> module on CE).

Martin, CreateFile() was just used as an example. You can substitute it with 
LoadString() or CreateProcess() if you like, the problem remains the same.


[about using SCons for building]
> And you *can* provide an SCons file that supports all the SDKs?

No, but I can provide one that allows parametrisation. ;)

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

**
   Visit our website at 
**
This e-mail, including all attachments, is intended only for the addressee 
and may contain confidential information. Please notify the sender 
immediately if you are not the intended recipient. In that case the e-mail 
must be deleted and may not be read, forwarded, published or otherwise used.
E-mails can be read by third parties and may contain viruses and unauthorised 
changes. Sator Laser GmbH is not responsible for these consequences.

**

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Tristan Seligmann
* Gregory P. Smith <[EMAIL PROTECTED]> [2008-09-28 13:34:50 -0700]:

> since any given path (not just fs) can have its own encoding it makes
> the most sense to me to let the OS deal with the errors and not try to
> enforce bytes vs string encoding type at the python lib. level.

But the underlying APIs differ; Linux uses bytestrings for filenames,
whereas I believe the native Windows APIs take "wide" (ie. Unicode)
strings.
-- 
mithrandi, i Ainil en-Balandor, a faer Ambar




Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Ulrich Eckhardt
On Sunday 28 September 2008, Gregory P. Smith wrote:
> "broken" systems will always exist.  Code to deal with them must be
> possible to write in python 3.0.
>
> since any given path (not just fs) can have its own encoding it makes
> the most sense to me to let the OS deal with the errors and not try to
> enforce bytes vs string encoding type at the python lib. level.

Actually I'm afraid that that isn't really useful. I, too, would like to kick 
people's backs in order to get them to fix their systems or use the proper 
codepage while mounting etc, etc, but that is not going to happen soon. Just 
ignoring those broken systems is tempting, but alienating a large group of 
users isn't IMHO worth it.

Instead, I'd like to present a different approach:

1. For POSIX platforms (using a byte string for the path):
Here, the first approach is to convert the path to Unicode, according to the 
locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages 
should work. If there is a segment (a byte sequence between two path 
separators) where it doesn't work, it uses an ASCII mapping where possible 
and codepoints from the "Private Use Area" (PUA) of Unicode for the 
non-decodable bytes.
In order to pass this path to fopen(), each segment would be converted to a 
byte string again, using the locale's CTYPE category except for segments 
which use the PUA where it simply encodes the original bytes.

2. For win32 platforms, the path is already Unicode (UTF-16) and the whole 
problem is solved or not solved by the OS.

In the end, both approaches yield a path represented by a Unicode string for 
intermediate use, which provides maximum flexibility. Further, it 
preserves "broken" encodings by simply mapping their byte-values to the PUA 
of Unicode. Maybe not using a string to represent a path would be a good 
idea, too. At least it would make it very clear that the string is not 
completely free-form.

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread M.-A. Lemburg
On 2008-09-29 12:50, Ulrich Eckhardt wrote:
> On Sunday 28 September 2008, Gregory P. Smith wrote:
>> "broken" systems will always exist.  Code to deal with them must be
>> possible to write in python 3.0.
>>
>> since any given path (not just fs) can have its own encoding it makes
>> the most sense to me to let the OS deal with the errors and not try to
>> enforce bytes vs string encoding type at the python lib. level.
> 
> Actually I'm afraid that that isn't really useful. I, too, would like to kick 
> people's backs in order to get them to fix their systems or use the proper 
> codepage while mounting etc, etc, but that is not going to happen soon. Just 
> ignoring those broken systems is tempting, but alienating a large group of 
> users isn't IMHO worth it.
> 
> Instead, I'd like to present a different approach:
> 
> 1. For POSIX platforms (using a byte string for the path):
> Here, the first approach is to convert the path to Unicode, according to the 
> locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages 
> should work. If there is a segment (a byte sequence between two path 
> separators) where it doesn't work, it uses an ASCII mapping where possible 
> and codepoints from the "Private Use Area" (PUA) of Unicode for the 
> non-decodable bytes.
> In order to pass this path to fopen(), each segment would be converted to a 
> byte string again, using the locale's CTYPE category except for segments 
> which use the PUA where it simply encodes the original bytes.

I'm not sure how this would work. How would you map the private use
code points back to bytes ? Using a special codec that knows about
these code points ? How would the fopen() know to use that special
codec instead of e.g. the UTF-8 codec ?

BTW: Private use areas in Unicode are meant for e.g. company specific
code points. Using them for escaping purposes is likely to cause problems
due to assignment clashes.

Regarding the subject of file names:

On Unix, it's well possible to have to deal with 2-3 different file
systems mounted on a machine. Each of those may use a different file name
encoding or not support file name encoding at all.

If the OS doesn't guarantee a consistent file name encoding, then
why should Python try to emulate this on top of the OS ?

I think it's more important to be able to open a file, than to have
a readable file name when printing it to stdout, e.g. I wouldn't be able
to tell whether some Chinese file name makes sense or not, but if I know
that all files in a directory are meant for processing I should be able
to iterate over them regardless of whether they make sense or not.

> 2. For win32 platforms, the path is already Unicode (UTF-16) and the whole 
> problem is solved or not solved by the OS.
> 
> In the end, both approaches yield a path represented by a Unicode string for 
> intermediate use, which provides maximum flexibility. Further, it 
> preserves "broken" encodings by simply mapping their byte-values to the PUA 
> of Unicode. Maybe not using a string to represent a path would be a good 
> idea, too. At least it would make it very clear that the string is not 
> completely free-form.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread glyph

On 10:50 am, [EMAIL PROTECTED] wrote:
> On Sunday 28 September 2008, Gregory P. Smith wrote:
>> "broken" systems will always exist.  Code to deal with them must be
>> possible to write in python 3.0.
>>
>> since any given path (not just fs) can have its own encoding it makes
>> the most sense to me to let the OS deal with the errors and not try to
>> enforce bytes vs string encoding type at the python lib. level.
>
> Actually I'm afraid that that isn't really useful. I, too, would like to kick
> people's backs in order to get them to fix their systems or use the proper
> codepage while mounting etc, etc, but that is not going to happen soon. Just
> ignoring those broken systems is tempting, but alienating a large group of
> users isn't IMHO worth it.
>
> Instead, I'd like to present a different approach:
>
> 1. For POSIX platforms (using a byte string for the path):
> Here, the first approach is to convert the path to Unicode, according to the
> locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages
> should work. If there is a segment (a byte sequence between two path
> separators) where it doesn't work, it uses an ASCII mapping where possible
> and codepoints from the "Private Use Area" (PUA) of Unicode for the
> non-decodable bytes.
> In order to pass this path to fopen(), each segment would be converted to a
> byte string again, using the locale's CTYPE category except for segments
> which use the PUA where it simply encodes the original bytes.


That's a cool idea, but this encoding hack would need to be clearly 
documented and exposed for when you need to talk to another piece of 
software about pathnames.  Consider a Python implementation of "xargs". 
Right now this can be implemented as a pretty simple for loop which 
eventually invokes 'subprocess.call' or similar.


http://docs.python.org/dev/3.0/library/os.html#process-management 
doesn't say what the type of the arguments to the various 'exec' 
variants are - one presumes they'd have to be bytes.  Not all arguments 
to subprocesses need to be filenames, but when they are they need to be 
encoded appropriately.
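
As a rough illustration of the xargs point, here is a minimal sketch that keeps pathnames as bytes all the way from the input stream to the subprocess. The names `xargs`, `batch_size` and `demo_status` are made up for this example, and it assumes a POSIX system, where `subprocess.call` accepts bytes arguments; it is not a replacement for the real tool.

```python
import io
import subprocess
import sys


def xargs(command, stream, batch_size=100):
    """Run `command` with newline-separated names read from a binary stream.

    Names stay as bytes, so no filename is ever decoded.
    """
    names = [name for name in stream.read().split(b"\n") if name]
    status = 0
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        returncode = subprocess.call(list(command) + batch)
        # Remember the first non-zero exit status, but keep processing.
        status = status or returncode
    return status


# Demo: a name that is not valid UTF-8 is passed through untouched.
demo_status = xargs(
    [sys.executable, "-c", "pass"],
    io.BytesIO(b"ok.txt\nbad\xff.txt\n"),
)
```

Because nothing is decoded, the loop never has to guess an encoding; the cost is that the names cannot be printed or compared with str paths without an explicit conversion.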


Also, consider the following nightmare scenario: a system which has two 
users with incompatible locales.  One wishes to write a "text" (ha ha) 
file with a list of pathnames in it to share with the other.  What 
encoding should that file be in?  How should the other user know how to 
interpret it?  (And of course: what if that user is going to be piping 
that file to "xargs", or the original file came out of "find"?)  I don't 
think that you can do encoding a segment at a time here, at least not at 
the API level; however, the whole file could be written in the
py-posix-paths encoding which does exactly what you propose.

> 2. For win32 platforms, the path is already Unicode (UTF-16) and the whole
> problem is solved or not solved by the OS.


If the "or not solved" part of that is true then this probably bears 
further investigation.  I suspect that the OS *always* provides some 
solution, even if it's the wrong solution, though.


Also, what about MacOS X?

> In the end, both approaches yield a path represented by a Unicode string for
> intermediate use, which provides maximum flexibility. Further, it
> preserves "broken" encodings by simply mapping their byte-values to the PUA
> of Unicode. Maybe not using a string to represent a path would be a good
> idea, too. At least it would make it very clear that the string is not
> completely free-form.


Personally, I plan to use this:

http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePath.html

for all of my file I/O in the future.

For what it's worth, this object _doesn't_ handle unicode properly and 
it's been a thorn in our side for quite a while.  We have plans to 
implement some kind of unicode-friendly API which is compatible with 
2.6; if we have any brilliant ideas I'll let you know, but I doubt 
they'll be in time.


The general idea right now is that we'll keep around the original bytes 
returned from filesystem inspection and provide some context-sensitive 
encoding/decoding APIs for different applications.  The PUA approach 
would allow us to maintain an API compatible with that.  I would not 
actually mind if there were a POSIX-specific module we had to use to get 
every arcane nuance of brokenness of writing pathnames into text files 
to be correct, since Windows needs to come up with _some_ valid unicode 
filename for every file in the system (even if it's improperly decoded).



Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Victor Stinner
On Monday 29 September 2008 12:50:03, Ulrich Eckhardt wrote:
> (...) uses an ASCII mapping where possible and codepoints from the 
> "Private Use Area" (PUA) of Unicode for the non-decodable bytes.

That sounds to me like a very *ugly* hack.

It reminds me of my proposal to create an object that has the API of both the 
bytes and str types: str() = human representation of the filename, 
bytes() = original bytes filename. As I wrote in the first email of 
this thread, it's not a good idea to mix bytes and characters.

Why try to convert bytes to characters when the operating system expects 
bytes? To get the best compatibility, we have to use the native types, at 
least when str(filename, fs_encoding) fails and/or str(filename, 
fs_encoding).encode(fs_encoding) != filename.
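
The two conditions above can be written down as a small check. This is only a sketch: the helper name `needs_bytes` is invented here, and falling back to `sys.getfilesystemencoding()` is an assumption about where `fs_encoding` would come from.

```python
import sys


def needs_bytes(filename, fs_encoding=None):
    """Return True if `filename` (bytes) cannot be safely handled as str."""
    encoding = fs_encoding or sys.getfilesystemencoding()
    try:
        text = str(filename, encoding)
    except UnicodeDecodeError:
        # Decoding fails outright: keep the native bytes.
        return True
    # Decoding worked, but the name must also survive the trip back.
    return text.encode(encoding) != filename
```

For example, `needs_bytes(b"caf\xe9", "utf-8")` is true because the byte 0xe9 is not valid UTF-8 on its own, while the same name is fine under "latin-1".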

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Ulrich Eckhardt
On Monday 29 September 2008, M.-A. Lemburg wrote:
> On 2008-09-29 12:50, Ulrich Eckhardt wrote:
> > 1. For POSIX platforms (using a byte string for the path):
> > Here, the first approach is to convert the path to Unicode, according to
> > the locale's CTYPE category. Hopefully, it will be UTF-8, but also
> > codepages should work. If there is a segment (a byte sequence between two
> > path separators) where it doesn't work, it uses an ASCII mapping where
> > possible and codepoints from the "Private Use Area" (PUA) of Unicode for
> > the non-decodable bytes.
> > In order to pass this path to fopen(), each segment would be converted to
> > a byte string again, using the locale's CTYPE category except for
> > segments which use the PUA where it simply encodes the original bytes.
>
> I'm not sure how this would work. How would you map the private use
> code points back to bytes ? Using a special codec that knows about
> these code points ? How would the fopen() know to use that special
> codec instead of e.g. the UTF-8 codec ?

Sorry, I wasn't clear enough. I'll try to explain further...

Let's assume we have a filename like this:

  0xc2 0xa9 0x2f 0x7f

The first two bytes are the copyright sign encoded in UTF-8, followed by a 
slash (0x2f, path separator) and a character encoded in an unknown codepage 
(0x7f is not printable ASCII!). The first thing when receiving that path from the 
system would be to split it into segments, here we would get two of them, one 
with 0xc2 0xa9 and the other with 0x7f. This uses the fact that the separator 
(slash/0x2f) is rather universal (Note: I'm not sure about encodings like 
BIG5, i.e. ones that are neither UTF-8 nor derived from ASCII).

For each segment, we would apply the locale's CTYPE facet and get the Unicode 
codepoint 0xa9 for the first segment, while the second one fails to convert. 
So, for the second one, we simply check for each byte if it is valid and 
printable ASCII (0x7f isn't). If it is, we emit the byte as Unicode 
codepoint. Otherwise, we map to the PUA.

The PUA reserves 0xe000 to 0xf8ff for private uses. I would simply encode the 
byte 0x7f as 0xe07f, i.e. map it to the beginning of that range. Eventually, 
we would end up with the following Unicode codepoints:

  0xa9, 0x2f, 0xe07f

When converting to a byte string for use with fopen(), we simply inspect the 
supplied string again. If a segment contains elements of the PUA, we simply 
reverse the mapping for those and leave the others in that segment as-is. For 
all other segments, we apply the CTYPE conversion.


Notes:
* This effectively converts the current path representation (a string) into a 
sequence of segments where each segment can either be a fully Unicode-capable 
string or a raw byte string without any known interpretation. However, 
instead of using an array for that, it uses a string, which is what most 
people's code expects anyway.
* You could also work on a byte-base instead of splitting the path in segments 
first. I just assumed that a single segment will not contain valid UTF-8 
sequences mixed with invalid ones. A path however can contain both correctly 
and incorrectly encoded segments.
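
The scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a proposed implementation: all function names are hypothetical, the locale's CTYPE conversion is stood in for by an `encoding` parameter defaulting to UTF-8, and the sample below uses the byte 0xff rather than 0x7f, because 0x7f is the ASCII DEL control character and a UTF-8 decoder accepts it.

```python
PUA_BASE = 0xE000  # first code point of the Private Use Area


def decode_segment(segment, encoding="utf-8"):
    """Decode one path segment, escaping bytes if the codec rejects it."""
    try:
        return segment.decode(encoding)
    except UnicodeDecodeError:
        # Printable ASCII passes through; every other byte goes to the PUA.
        return "".join(
            chr(b) if 0x20 <= b < 0x7F else chr(PUA_BASE + b)
            for b in segment
        )


def is_escaped(char):
    return PUA_BASE <= ord(char) <= PUA_BASE + 0xFF


def encode_segment(segment, encoding="utf-8"):
    """Reverse decode_segment(): PUA segments give back their raw bytes."""
    if any(is_escaped(c) for c in segment):
        return bytes(
            ord(c) - PUA_BASE if is_escaped(c) else ord(c) for c in segment
        )
    return segment.encode(encoding)


def decode_path(path, encoding="utf-8"):
    return "/".join(decode_segment(s, encoding) for s in path.split(b"/"))


def encode_path(path, encoding="utf-8"):
    return b"/".join(encode_segment(s, encoding) for s in path.split("/"))
```

With these definitions, `decode_path(b"\xc2\xa9/\xff")` yields the code points 0xa9, 0x2f, 0xe0ff, and `encode_path()` turns that string back into the original four bytes.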


> BTW: Private use areas in Unicode are meant for e.g. company specific
> code points. Using them for escaping purposes is likely to cause problems
> due to assignment clashes.

I'm not sure if the use I proposed is correct according to the intended use of 
the PUA. I know that ideally no such string would escape from Python, i.e. it 
should only be visible internally. I would guess that that is something the 
PUA was intended for.

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



[Python-Dev] New proposition for Python3 bytes filename issue

2008-09-29 Thread Victor Stinner
Hi,

After reading the previous discussion, here is a new proposal.

Python 2.x and Windows are not affected by this issue. Only Python3 on POSIX 
(eg. Linux or *BSD) is affected.

Some systems are broken, but Python has to be able to open/copy/move/remove 
files with an "invalid filename".

The issue can wait for Python 3.0.1 / 3.1.

Windows
-------

On Windows, we might reject bytes filenames for all file operations: open(), 
unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError)

POSIX OS
--------

The default behaviour should be to use unicode and raise an error if 
conversion to unicode fails. It should also be possible to use bytes using 
bytes arguments and optional arguments (for getcwd).

 - listdir(unicode) -> unicode and raise an error on invalid filename
 - listdir(bytes) -> bytes
 - getcwd() -> unicode
 - getcwd(bytes=True) -> bytes
 - open(): accept bytes or unicode

os.path.*() should accept operations on bytes filenames, but maybe not on 
bytes+unicode arguments. os.path.join('directory', b'filename'): raise an 
error (or use *implicit* conversion to bytes)?

When the user wants to display a filename on the screen, they can use:
   text = str(filename, fs_encoding, "replace")
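
For comparison, here is a short sketch of the bytes-in/bytes-out rule and the display conversion as they can be tried in a current Python 3 release. It is not the patch under discussion: the variable names are made up, `os.fsencode()` was added after this thread, and current releases do not raise on undecodable names in the str case, so that part of the proposal is not shown.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    open(os.path.join(directory, "a.txt"), "w").close()
    str_names = os.listdir(directory)                 # str in, str out
    bytes_names = os.listdir(os.fsencode(directory))  # bytes in, bytes out

# For display only: an undecodable byte becomes U+FFFD instead of raising.
display = str(b"caf\xe9.txt", "utf-8", "replace")
```

The "replace" result is lossy, so it is suitable for showing a name to the user but not for passing back to open().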

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] [Python-3000] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Victor Stinner
On Monday 29 September 2008 06:43:55, you wrote:
> It will make users happy, and it's simple enough to implement for
> python 3.0.

I dislike your argument. A "quick and dirty hack" is always faster to 
implement than a real solution, but we may hit new issues later if we don't 
choose the right solution.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] Python security team

2008-09-29 Thread Jan Matejek
Brett Cannon wrote:
> On Sat, Sep 27, 2008 at 8:54 AM, Victor Stinner
> <[EMAIL PROTECTED]> wrote:
>> First, I would like to have access to this information. Not only this issue, but
>> all security-related issues. I have some knowledge of security and I can
>> help to resolve issues and/or estimate the severity of an issue.
>>
> 
> That would require commit privileges first. Don't know if the group
> requires that a person have a decent amount of time committing to the
> core first (I just joined the list in late July).

commit privileges?
I would be interested in joining the PSRT list too - as a python
maintainer for openSUSE, i think that it would be beneficial for both my
and your work. And i can imagine that maintainers from other
distributions have similar opinion on this ;)
And that does not necessarily mean commit privileges, right?

Or is this an issue of trust, where "we trust you enough to make changes
to the core" equals "we also trust you enough to see the security issues" ?

regards
jan matejek


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Jack Jansen
I'm a bit late to join in this discussion, but if unicode filenames  
are going to be the normal mode, how about this whole normalized/ 
canonical business?


This is a headache that has shown up on the Mac a couple of times,  
because MacOS prefers filenames to be NFC, whereas Python prefers its  
Unicode to be NFD (or the other way around, I keep forgetting the  
details).


To make the problem worse, even though MacOS prefers its filenames in  
the one form, it will allow filenames in the other form (which can  
happen if you mount a foreign filesystem, for example over the net).  
The fact that "incorrect" filenames can exist means that the simple  
solution of converting NFC<->NFD in Python's open() and friends won't  
work (or, at least, it'll make some filenames inaccessible, and  
listdir() may return filenames that don't exist).
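
To make the NFC/NFD distinction concrete, here is a small sketch using the standard `unicodedata` module. The variable names are chosen for this example only, and it says nothing about which form a given filesystem actually stores.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "caf\u00e9")  # e-acute as one code point
nfd = unicodedata.normalize("NFD", nfc)          # "e" plus a combining accent

# The two strings are the same text after normalisation...
same_text = unicodedata.normalize("NFC", nfd) == nfc
# ...but they encode to different byte sequences, i.e. different filenames
# on any filesystem that compares names byte for byte.
same_bytes = nfc.encode("utf-8") == nfd.encode("utf-8")
```

This is why silently normalising in open() is risky: a file stored under the other form would no longer match the name Python passes to the OS.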



--
Jack Jansen, <[EMAIL PROTECTED]>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma  
Goldman





Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Ulrich Eckhardt
On Monday 29 September 2008, [EMAIL PROTECTED] wrote:
> Also, what about MacOS X?

AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS also provides 
Unicode filenames and how it deals with broken or legacy media is left up to 
the OS.

Uli


-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Jean-Paul Calderone

On Mon, 29 Sep 2008 14:34:07 +0200, Ulrich Eckhardt <[EMAIL PROTECTED]> wrote:
> On Monday 29 September 2008, [EMAIL PROTECTED] wrote:
>> Also, what about MacOS X?
>
> AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS also provides
> Unicode filenames and how it deals with broken or legacy media is left up to
> the OS.


Read Jack Jansen's recent email about NFC vs NFD.

Jean-Paul


Re: [Python-Dev] New proposition for Python3 bytes filename issue

2008-09-29 Thread Victor Stinner
Patches are already available in issue #3187 (os.listdir):

On Monday 29 September 2008 14:07:55, Victor Stinner wrote:
>  - listdir(unicode) -> unicode and raise an error on invalid filename

Need raise_decoding_errors.patch (don't clear Unicode error)

>  - listdir(bytes) -> bytes

Always working.

>  - getcwd() -> unicode
>  - getcwd(bytes=True) -> bytes

Need merge_os_getcwd_getcwdu.patch

Note that the current implementation of getcwd() uses PyUnicode_FromString() to 
decode the directory, whereas getcwdu() uses the correct code (PyUnicode_Decode). 
So I merged both functions to keep only the correct version: getcwdu() => 
getcwd().

>  - open(): accept bytes or unicode

Need io_byte_filename.patch (just remove a check)

> os.path.*() should accept operations on bytes filenames, but maybe not on
> bytes+unicode arguments. os.path.join('directory', b'filename'): raise an
> error (or use *implicit* conversion to bytes)?

os.path.join() already rejects mixing bytes + str.
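
The rejection of mixed arguments is easy to observe. The sketch below uses a hypothetical helper, `try_join`, and reflects what a current Python 3 release does; the all-bytes case working is an observation about current releases, not about the interpreter as it stood when this was written.

```python
import os.path


def try_join(*parts):
    """Return the joined path, or the TypeError raised for mixed types."""
    try:
        return os.path.join(*parts)
    except TypeError as exc:
        return exc


mixed = try_join("directory", b"filename")       # str + bytes
all_bytes = try_join(b"directory", b"filename")  # bytes only
```

Here `mixed` ends up holding a TypeError instance, while `all_bytes` is a bytes path built with the platform separator.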

But os.path.join(), glob.glob(), fnmatch.*(), etc. don't support bytes. I 
wrote some patches like:
 - glob1_bytes.patch: Fix glob.glob() to accept invalid directory name
 - fnmatch_bytes.patch: Patch fnmatch.filter() to accept bytes filenames

But I dislike both patches since they mix bytes and str. So this part still 
needs some work.

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread glyph


On 11:59 am, [EMAIL PROTECTED] wrote:

> Sorry, I wasn't clear enough. I'll try to explain further...
>
> Let's assume we have a filename like this:
>
>  0xc2 0xa9 0x2f 0x7f
>
> The first two bytes are the copyright sign encoded in UTF-8, followed by a
> slash (0x2f, path separator) and a character encoded in an unknown codepage
> (0x7f is not printable ASCII!).


Originally I thought that this was a valid idea, but then it became 
clear that this could be a problem.  Consider a filename which includes 
a UTF-8 encoding of a PUA code point.

> I'm not sure if the use I proposed is correct according to the intended use of
> the PUA. I know that ideally no such string would escape from Python, i.e. it
> should only be visible internally. I would guess that that is something the
> PUA was intended for.


Viewing the PUA with GNOME charmap, I can see that many code points 
there have character renderings on my Ubuntu system.  I have to assume, 
therefore, that there are other (and potentially conflicting) uses for 
this unicode feature.
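
The clash can be reproduced in a few lines. This sketch assumes the mapping described earlier in the thread (byte value plus 0xE000); the helper name `unescape` and the sample code point U+E041 are chosen only for illustration.

```python
PUA_BASE = 0xE000  # escaped bytes are assumed to live at PUA_BASE + byte


def unescape(segment):
    """Reverse the proposed mapping: PUA code points become raw bytes."""
    return bytes(
        ord(c) - PUA_BASE if PUA_BASE <= ord(c) <= PUA_BASE + 0xFF else ord(c)
        for c in segment
    )


# A perfectly valid UTF-8 filename that happens to contain U+E041.
original = "\ue041".encode("utf-8")
decoded = original.decode("utf-8")  # decodes cleanly, nothing was escaped
restored = unescape(decoded)        # but the reverse step treats it as escaped
```

The three bytes of `original` come back as the single byte for "A", so the round trip silently names a different file.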



Re: [Python-Dev] ',' precedence in documentation

2008-09-29 Thread Michele Simionato
I like Martin's proposals (use a function, remove -O) very much. Actually I
wanted to propose the same months ago. Here is my take on the assert function,
which I would like to be able to raise exceptions other than AssertionError too:

import inspect

def assert_(cond, exc, *args):
    """Raise an exception if cond is not satisfied. exc can be a template
    string (then args are the interpolation arguments) or an exception
    class (then args are passed to the constructor). Here are a few
    examples:

    >>> assert_(False, 'ahia!')
    Traceback (most recent call last):
       ...
    AssertionError: ahia!

    >>> assert_(False, ValueError)
    Traceback (most recent call last):
      ...
    ValueError

    >>> a = 1
    >>> assert_(isinstance(a, str), TypeError, '%r is not a string' % a)
    Traceback (most recent call last):
      ...
    TypeError: 1 is not a string

    """
    if cond:
        return  # the condition holds, so there is nothing to raise
    if isinstance(exc, str):  # basestring on Python 2
        raise AssertionError(exc % args)
    elif inspect.isclass(exc) and issubclass(exc, Exception):
        raise exc(*args)
    else:
        raise TypeError('The second argument of assert_ must be a string '
                        'or an exception class, not %r' % exc)

 M. Simionato


[Python-Dev] Broken link to NASM download location

2008-09-29 Thread Ulrich Eckhardt
Hi!

In trunk/PCbuild/readme.txt it says NASM is available from kernel.org. The 
project has moved to Sourceforge though, please replace the link with

  http://nasm.sf.net

Thanks!

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Steven Bethard
On Mon, Sep 29, 2008 at 6:07 AM, Victor Stinner
<[EMAIL PROTECTED]> wrote:
> The default behaviour should be to use unicode and raise an error if
> conversion to unicode fails. It should also be possible to use bytes using
> bytes arguments and optional arguments (for getcwd).
>
>  - listdir(unicode) -> unicode and raise an error on invalid filename
>  - listdir(bytes) -> bytes
>  - getcwd() -> unicode
>  - getcwd(bytes=True) -> bytes

Please let's not introduce boolean flags like this. How about
``getcwdb`` in parallel with the old ``getcwdu``?

Steve
-- 
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a
tiny blip on the distant coast of sanity.
--- Bucky Katt, Get Fuzzy


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Victor Stinner
On Monday 29 September 2008 17:16:47, Steven Bethard wrote:
> >  - getcwd() -> unicode
> >  - getcwd(bytes=True) -> bytes
>
> Please let's not introduce boolean flags like this. How about
> ``getcwdb`` in parallel with the old ``getcwdu``?

Yeah, you're right. So I wrote a new patch: os_getcwdb.patch

With my patch we get (Python3):
 * os.getcwd() -> unicode
 * os.getcwdb() -> bytes

Previously in Python2 it was:
 * os.getcwd() -> str (bytes)
 * os.getcwdu() -> unicode
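
For readers trying this on a current Python 3, where both os.getcwd() and os.getcwdb() exist, here is a minimal sketch of how the two results relate. os.fsencode() is a later addition to the os module and is used here only to compare the two forms; it is not part of the patch discussed above.

```python
import os

cwd_text = os.getcwd()    # str (unicode)
cwd_bytes = os.getcwdb()  # bytes, as reported by the OS

# Encoding the str form with the filesystem encoding should give the bytes form.
print(type(cwd_text).__name__, type(cwd_bytes).__name__)
print(os.fsencode(cwd_text) == cwd_bytes)
```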

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] Python security team

2008-09-29 Thread David Stanek
On Sat, Sep 27, 2008 at 8:45 PM, Brett Cannon <[EMAIL PROTECTED]> wrote:
> On Sat, Sep 27, 2008 at 8:54 AM, Victor Stinner
> <[EMAIL PROTECTED]> wrote:
>>
>> I would like to know if a Python security team does exist. I sent an email
>> about an imageop issue, and I didn't get any answer. Later I learned that a
>> security ticket was created, I don't have access to it.
>>
>
> Yes, the PSRT (Python Security Response Team) does exist. We did get
> your email; sorry we didn't respond. There are very few members on
> that list and most of them are extremely busy. Responding to your
> email just slipped through the cracks. I believe Benjamin was the last
> person to work on your submitted patch.
>

I would be interested in participating. Is there any documentation
about the team or the processes? My Google search just turned up a
bunch of mailing list posts looking for team members.

-- 
David
http://www.traceback.org


Re: [Python-Dev] Broken link to NASM download location

2008-09-29 Thread Georg Brandl
Ulrich Eckhardt wrote:
> Hi!
> 
> In trunk/PCbuild/readme.txt it says NASM is available from kernel.org. The 
> project has moved to Sourceforge though, please replace the link with
> 
>   http://nasm.sf.net

Fixed in r66681. Thanks! (In the future, it might be better to open an issue
at bugs.python.org for this kind of report, since it can't get lost so easily
there ;)

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Guido van Rossum
> Victor Stinner wrote:

(Thanks Victor for moving this to the list. Having a discussion in the
tracker is really painful, I find.)

>> POSIX OS
>> 
>>
>> The default behaviour should be to use unicode and raise an error if
>> conversion to unicode fails. It should also be possible to use bytes using
>> bytes arguments and optional arguments (for getcwd).
>>
>>  - listdir(unicode) -> unicode and raise an error on invalid filename

I know I keep flipflopping on this one, but the more I think about it
the more I believe it is better to drop those names than to raise an
exception. Otherwise a "naive" program that happens to use
os.listdir() can be rendered completely useless by a single non-UTF-8
filename. Consider the use of os.listdir() by the glob module. If I am
globbing for *.py, why should the presence of a file named b'\xff'
cause it to fail?

Robust programs using os.listdir() should use the bytes->bytes version.
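
A minimal sketch of the bytes->bytes style, assuming a Python 3 where os.listdir() returns bytes names when given a bytes path. The helper name glob_bytes is hypothetical and only serves this illustration.

```python
import fnmatch
import os

def glob_bytes(directory, pattern):
    """Return the names in `directory` (bytes) that match `pattern` (bytes)."""
    names = os.listdir(directory)  # bytes in -> bytes out, nothing is decoded
    return sorted(fnmatch.filter(names, pattern))
```

With this style a name such as b'\xff' simply fails to match b"*.py"; it never has to be decoded, so it cannot make the listing fail.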

>>  - listdir(bytes) -> bytes
>>  - getcwd() -> unicode
>>  - getcwd(bytes=True) -> bytes
>>  - open(): accept bytes or unicode
>>
>> os.path.*() should accept operations on bytes filenames, but maybe not on
>> bytes+unicode arguments. os.path.join('directory', b'filename'): raise an
>> error (or use *implicit* conversion to bytes)?

(Yeah, it should be all bytes or all strings.)
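
As a sketch of the "all bytes or all strings" rule, this is how current Python 3 releases behave for join (posixpath is used so that the separator is the same on every platform):

```python
import posixpath

print(posixpath.join(b"directory", b"filename"))  # all bytes
print(posixpath.join("directory", "filename"))    # all str

try:
    posixpath.join("directory", b"filename")      # mixed arguments
except TypeError as exc:
    print("mixed arguments rejected:", exc)
```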

On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:

> This approach (changing all path-handling functions to accept either bytes
> or string, but not both) is doomed in my eyes. First, there are lots of them,
> second, they are not only in os.path but in many modules and also in user
> code, and third, I see no clean way of implementing them in the specified way.
> (Just try to do it with os.path.join as an example; I couldn't find the
> good way to write it, only the bad and the ugly...)

It doesn't have to be supported for all operations -- just enough to
be able to access all the system calls, and do the most basic pathname
manipulations (split and join -- almost everything else can be built
out of those).

> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
> encoding (if it were UTF-8 otherwise), despite possible surprises when a
> such-encoded filename escapes from Python.

I'm having a hard time finding info about UTF-8b. Does anyone have a
decent link?

I noticed that OSX has a different approach yet. I believe it insists
on valid UTF-8 filenames. It may even require some normalization but I
don't know if the kernel enforces this. I tried to create a file named
b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it
may be replacing all bad UTF8 sequences with their % encoding.
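
To make the guess above concrete, here is a sketch of what "replace every bad UTF-8 sequence with its % encoding" could look like. It only illustrates the idea and makes no claim about what OSX really does; the function name is hypothetical, and the "surrogateescape" error handler it relies on only exists in later Python 3 releases.

```python
def percent_escape_invalid_utf8(raw):
    """Decode `raw` as UTF-8, writing each undecodable byte as %xx."""
    # surrogateescape maps an undecodable byte 0xNN to the code point U+DCNN.
    text = raw.decode("utf-8", "surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            out.append("%%%02x" % (code - 0xDC00))  # back to the byte value
        else:
            out.append(ch)
    return "".join(out)

print(percent_escape_invalid_utf8(b"\xff"))
```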

The "set filesystem encoding to be Latin-1" approach has a certain
charm as well, but clearly would be a mistake on OSX, and probably on
other systems too (whenever the user doesn't think in Latin-1).

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Python security team

2008-09-29 Thread Guido van Rossum
On Mon, Sep 29, 2008 at 5:11 AM, Jan Matejek <[EMAIL PROTECTED]> wrote:
> Brett Cannon wrote:
>> On Sat, Sep 27, 2008 at 8:54 AM, Victor Stinner
>> <[EMAIL PROTECTED]> wrote:
>>> First, I would like to access to these informations. Not only this issue, 
>>> but
>>> all security related issues. I have some knowledges about security and I can
>>> help to resolve issues and/or estimate the criticity of an issue.
>>>
>>
>> That would require commit privileges first. Don't know if the group
>> requires that a person have a decent amount of time committing to the
>> core first (I just joined the list in late July).
>
> commit privileges?
> I would be interested in joining the PSRT list too - as a python
> maintainer for openSUSE, i think that it would be beneficial for both my
> and your work. And i can imagine that maintainers from other
> distributions have similar opinion on this ;)
> And that does not necessarily mean commit privileges, right?
>
> Or is this an issue of trust, where "we trust you enough to make changes
> to the core" equals "we also trust you enough to see the security issues" ?

Traditionally we have been extremely careful in selecting people to
join the PSRT -- basically people that have many years of reputation
*within the Python community*.

I think we may have to expand our selection criteria, since the
existing approach has led to a small PSRT whose members are all too
busy to do the necessary legwork. At the same time we need to remain
selective -- I don't think having a crowd of hundreds would be
productive, and we need to be sure that every single member can
absolutely be trusted to take security seriously.

To answer your question directly, I don't think that just being the
Python maintainer for some Linux distribution is enough to qualify --
if our process worked well enough, you'd be getting the patches from
us via some downstream-flowing distribution mechanism that reaches
only trusted people within each vendor organization. I don't happen to
know you personally -- but perhaps other current members of the PSRT
do and that could be enough to secure an invitation.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] ',' precedence in documentation

2008-09-29 Thread Guido van Rossum
On Mon, Sep 29, 2008 at 7:12 AM, Michele Simionato
<[EMAIL PROTECTED]> wrote:
> I like Martin's proposals (use a function, remove -O) very much.

That's too bad, since I don't like it at all :-). You can write your
own function trivially that does this; however IMO the *language*
should support something that can be disabled to the point where no
code is generated for it. Most languages have this (in fact Java
*added* it). I'll concede that -O is not necessarily the right flag to
pass, but I do want to be able to write asserts that can be turned
into nothing completely, and unless we were to add macros for just
this purpose, it'll have to be something recognized by the compiler,
like a statement. Most other details are negotiable (like the exact
syntax, or the flag used to disable it). However I don't like changing
the syntax so that it resembles a function call -- that's not going to
resolve the existing confusion and will add more confusion ("why can't
I write my own assert() function?").
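
A small sketch of the behaviour being defended here: the assert statement is compiled away entirely under -O, which a plain function call cannot offer. The snippet text and the helper name run_snippet are made up for this illustration.

```python
import subprocess
import sys

SNIPPET = "assert False, 'checked'; print('assert was compiled away')"

def run_snippet(*flags):
    """Run SNIPPET in a child interpreter; return (exit code, stdout)."""
    proc = subprocess.run([sys.executable, *flags, "-c", SNIPPET],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()

print(run_snippet())      # the assert fires, so the exit code is non-zero
print(run_snippet("-O"))  # no code is generated for the assert
```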

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] Python security team

2008-09-29 Thread Giampaolo Rodola'


On 27 Sep, 20:04, "Josiah Carlson" <[EMAIL PROTECTED]> wrote:
> On Sat, Sep 27, 2008 at 8:54 AM, Victor Stinner
>
> <[EMAIL PROTECTED]> wrote:
> > Second, I would like to help to fix all Python security issues. It looks 
> > like
> > Python community isn't very reactive (proactive?) about security. Eg. a DoS
> > was reported in smtpd server (integrated to Python)... 15 months ago. A 
> > patch
> > is available but it's not applied in Python trunk.
>
> The smtpd module is not meant to be used without modification.  It is
> the responsibility of the application writer to decide the limitations
> of the emails they want to allow sending, and subsequently handle the
> case where emails overrun that limit.  

The issue does not concern the emails but the buffer used internally
to store the received raw data sent by client.
The user who wants to fix the issue (#1745035) should override the
collect_incoming_data method which is usually not meant to be
modified.
Moreover, there are two RFCs which state that extremely long lines
must be truncated and an error reply must be returned.
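
A sketch of the kind of bounded buffer this calls for. It is not the patch attached to the issue: the class is hypothetical and standalone, the method names merely mirror the asynchat callbacks, and the default limit of 1000 bytes is illustrative.

```python
class BoundedLineCollector:
    """Accumulate incoming chunks for one line, up to a fixed size."""

    def __init__(self, max_line_length=1000):
        self.max_line_length = max_line_length  # illustrative default
        self._chunks = []
        self._size = 0
        self.overflowed = False

    def collect_incoming_data(self, data):
        """Store `data`, dropping whatever exceeds the limit."""
        room = self.max_line_length - self._size
        if len(data) > room:
            self.overflowed = True
            data = data[:room]
        if data:
            self._chunks.append(data)
            self._size += len(data)

    def found_terminator(self):
        """Return (line, overflowed) and reset for the next line."""
        line = b"".join(self._chunks)
        overflowed = self.overflowed
        self._chunks, self._size, self.overflowed = [], 0, False
        return line, overflowed
```

A server built on this would send an error reply whenever overflowed is true, instead of letting a client grow the buffer without bound.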

--- Giampaolo
http://code.google.com/p/pyftpdlib/


Re: [Python-Dev] Python security team

2008-09-29 Thread Josiah Carlson
On Mon, Sep 29, 2008 at 12:02 PM, Giampaolo Rodola' <[EMAIL PROTECTED]> wrote:
> On 27 Sep, 20:04, "Josiah Carlson" <[EMAIL PROTECTED]> wrote:
>> On Sat, Sep 27, 2008 at 8:54 AM, Victor Stinner
>>
>> <[EMAIL PROTECTED]> wrote:
>> > Second, I would like to help to fix all Python security issues. It looks 
>> > like
>> > Python community isn't very reactive (proactive?) about security. Eg. a DoS
>> > was reported in smtpd server (integrated to Python)... 15 months ago. A 
>> > patch
>> > is available but it's not applied in Python trunk.
>>
>> The smtpd module is not meant to be used without modification.  It is
>> the responsibility of the application writer to decide the limitations
>> of the emails they want to allow sending, and subsequently handle the
>> case where emails overrun that limit.
>
> The issue does not concern the emails but the buffer used internally
> to store the received raw data sent by client.
> The user who wants to fix the issue (#1745035) should override the
> collect_incoming_data method which is usually not meant to be
> modified.
> Moreover, there are two RFCs which state that extremely long lines
> must be truncated and an error reply must be returned.

We can and should discuss the specifics of this item in the bug report
itself.  I should have replied there instead.

 - Josiah


Re: [Python-Dev] Python security team

2008-09-29 Thread Giampaolo Rodola'
Yeah, right. Let's continue there.

--- Giampaolo
http://code.google.com/p/pyftpdlib



On 29 Sep, 22:44, "Josiah Carlson" <[EMAIL PROTECTED]> wrote:
> On Mon, Sep 29, 2008 at 12:02 PM, Giampaolo Rodola' <[EMAIL PROTECTED]> wrote:
> > On 27 Sep, 20:04, "Josiah Carlson" <[EMAIL PROTECTED]> wrote:
> >> On Sat, Sep 27, 2008 at 8:54 AM, Victor Stinner
>
> >> <[EMAIL PROTECTED]> wrote:
> >> > Second, I would like to help to fix all Python security issues. It looks 
> >> > like
> >> > Python community isn't very reactive (proactive?) about security. Eg. a 
> >> > DoS
> >> > was reported in smtpd server (integrated to Python)... 15 months ago. A 
> >> > patch
> >> > is available but it's not applied in Python trunk.
>
> >> The smtpd module is not meant to be used without modification.  It is
> >> the responsibility of the application writer to decide the limitations
> >> of the emails they want to allow sending, and subsequently handle the
> >> case where emails overrun that limit.
>
> > The issue does not concern the emails but the buffer used internally
> > to store the received raw data sent by client.
> > The user who wants to fix the issue (#1745035) should override the
> > collect_incoming_data method which is usually not meant to be
> > modified.
> > Moreover, there are two RFCs which state that extremely long lines
> > must be truncated and an error reply must be returned.
>
> We can and should discuss the specifics of this item in the bug report
> itself.  I should have replied there instead.
>
>  - Josiah


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Adam Olsen
On Mon, Sep 29, 2008 at 10:00 AM, Victor Stinner
<[EMAIL PROTECTED]> wrote:
> On Monday 29 September 2008 17:16:47, Steven Bethard wrote:
>> >  - getcwd() -> unicode
>> >  - getcwd(bytes=True) -> bytes
>>
>> Please let's not introduce boolean flags like this. How about
>> ``getcwdb`` in parallel with the old ``getcwdu``?
>
> Yeah, you're right. So i wrote a new patch: os_getcwdb.patch
>
> With my patch we get (Python3):
>  * os.getcwd() -> unicode
>  * os.getcwdb() -> bytes
>
> Previously in Python2 it was:
>  * os.getcwd() -> str (bytes)
>  * os.getcwdu() -> unicode

Why not do:
 * os.getcwd() -> unicode
 * posix.getcwdb() -> bytes

os gets the standard version and posix has an (unambiguously named)
platform-specific version.


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Adam Olsen
On Mon, Sep 29, 2008 at 11:06 AM, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> On Mon, Sep 29, 2008 at 9:45 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
>
>> This approach (changing all path-handling functions to accept either bytes
>> or string, but not both) is doomed in my eyes. First, there are lots of them,
>> second, they are not only in os.path but in many modules and also in user
>> code, and third, I see no clean way of implementing them in the specified 
>> way.
>> (Just try to do it with os.path.join as an example; I couldn't find the
>> good way to write it, only the bad and the ugly...)
>
> It doesn't have to be supported for all operations -- just enough to
> be able to access all the system calls. and do the most basic pathname
> manipulations (split and join -- almost everything else can be built
> out of those).
>
>> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
>> encoding (if it were UTF-8 otherwise), despite possible surprises when a
>> such-encoded filename escapes from Python.
>
> I'm having a hard time finding info about UTF-8b. Does anyone have a
> decent link?

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Scroll down to item D, near the bottom.

It turns malformed bytes into lone (therefore malformed) surrogates.
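
The "surrogateescape" error handler that later Python 3 releases provide works in this spirit, so it can serve as a sketch of the idea (the file name below is made up):

```python
raw = b"report-\xff.txt"

# The undecodable byte 0xFF becomes the lone surrogate U+DCFF ...
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# ... and encoding with the same handler restores the original bytes.
print(text.encode("utf-8", "surrogateescape") == raw)
```

A strict encode of such a string raises UnicodeEncodeError, which is the kind of surprise mentioned earlier in the thread for when such a name escapes from Python.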


> I noticed that OSX has a different approach yet. I believe it insists
> on valid UTF-8 filenames. It may even require some normalization but I
> don't know if the kernel enforces this. I tried to create a file named
> b'\xff' and it came out as %ff. Then "rm %ff" worked. So I think it
> may be replacing all bad UTF8 sequences with their % encoding.

I suspect linux will eventually take this route as well.  If ext3 had
an option for UTF-8 validation I know I'd want it on.  That'd move the
error to the program creating bogus file names, rather than those
trying to read, display, and manage them.


> The "set filesystem encoding to be Latin-1" approach has a certain
> charm as well, but clearly would be a mistake on OSX, and probably on
> other systems too (whenever the user doesn't think in Latin-1).

Aye, it's a better hack than UTF-8b, but adding byte functions is even better.


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread James Y Knight

On Sep 29, 2008, at 6:17 PM, Adam Olsen wrote:

> I suspect linux will eventually take this route as well.  If ext3 had
> an option for UTF-8 validation I know I'd want it on.  That'd move the
> error to the program creating bogus file names, rather than those
> trying to read, display, and manage them.


Of course, even on Mac OS X, or a theoretical UTF-8-enforcing ext3,  
random byte strings are still possible in your program's argv, in  
environment variables, and as arguments to subprocesses.
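
A sketch of how a program on a current Python 3 can get back the raw bytes of its arguments, assuming a POSIX system where undecodable bytes are smuggled through str with surrogateescape. The helper name is hypothetical.

```python
import os
import sys

def argv_as_bytes(argv=None):
    """Return command-line arguments as the byte strings the OS passed in."""
    if argv is None:
        argv = sys.argv
    return [os.fsencode(arg) for arg in argv]

print(argv_as_bytes())
```

On POSIX systems the environment has a similar bytes view in os.environb.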


So python still needs to do something...

James


Re: [Python-Dev] Status of MS Windows CE port

2008-09-29 Thread Martin v. Löwis
Ulrich Eckhardt wrote:
>>> Well, currently it does make a difference. Simple example: CreateFile().
>> It's not so simple: Python doesn't actually call CreateFile
> 
> Martin, CreateFile() was just used as an example. You can substitute it with 
> LoadString() or CreateProcess() if you like, the problem remains the same.

However, the solution should be different from the one you propose. I
don't know what call of CreateProcess you are referring to specifically,
but I think they should all be changed to call CreateProcessW.

Again, whether or not _UNICODE is defined should have no effect. If it
does, it's a bug, and the solution is not to sprinkle TCHAR all over the
place.

> [about using SCons for building]
>> And you *can* provide an SCons file that supports all the SDKs?
> 
> No, but I can provide one that allows parametrisation. ;)

And, with proper parametrization, then supports all SDKs?

Regards,
Martin


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Martin v. Löwis
> Originally I thought that this was a valid idea, but then it became
> clear that this could be a problem.  Consider a filename which includes
> a UTF-8 encoding of a PUA code point.

I still think it's a valid idea. For non-UTF-8 file system encodings,
use PUA characters, and generate them through an error handler.

If the file system encoding is UTF-8, use UTF-8b instead as the
file system encoding.

> Viewing the PUA with GNOME charmap, I can see that many code points
> there have character renderings on my Ubuntu system.  I have to assume,
> therefore, that there are other (and potentially conflicting) uses for
> this unicode feature.

Depends on how you use it. If you use the PUA block 1 (i.e.
U+E000..U+F8FF), there is a realistic chance of collision.

If you use the Plane 15 or Plane 16 PUA blocks, there is currently
zero chance of collision (AFAIK). PUA has a wide use for additional
characters in TrueType, but I don't think many tools even support
plane 15 and 16 for generating fonts, or rendering them (it may even be
that the TrueType/OpenType format doesn't support them in the first
place). However, Python can make use of these planes fairly easily,
even in 2-byte mode (through UTF-16).
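
A sketch of such an error handler for a non-UTF-8 encoding (ASCII here), written against the codecs API of current Python 3. The handler name "pua15escape" and the choice of U+F0000 as the base are assumptions made for this illustration, not something specified in this thread.

```python
import codecs

PUA_BASE = 0xF0000  # start of the Plane 15 private use area

def pua15_escape(err):
    """Map undecodable bytes to Plane 15 PUA code points, and back."""
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return "".join(chr(PUA_BASE + b) for b in bad), err.end
    if isinstance(err, UnicodeEncodeError):
        out = bytearray()
        for ch in err.object[err.start:err.end]:
            byte = ord(ch) - PUA_BASE
            if not 0x80 <= byte <= 0xFF:
                raise err  # not one of our escapes: a genuine error
            out.append(byte)
        return bytes(out), err.end
    raise err

codecs.register_error("pua15escape", pua15_escape)

name = b"caf\xe9".decode("ascii", "pua15escape")
print(ascii(name))
print(name.encode("ascii", "pua15escape"))
print(len(name[-1].encode("utf-16-le")))  # 4 bytes: a surrogate pair in UTF-16
```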

Regards,
Martin


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Martin v. Löwis
Jack Jansen wrote:
> I'm a bit late to join in this discussion, but if unicode filenames are
> going to be the normal mode, how about this whole normalized/canonical
> business?

I don't think there is a change in the current implementation. Users
interested in this issue should contribute code that normalizes file
names appropriately on systems that require such normalization.

Regards,
Martin


Re: [Python-Dev] New proposition for Python3 bytes filename issue

2008-09-29 Thread Martin v. Löwis

> The default behaviour should be to use unicode and raise an error if 
> conversion to unicode fails. It should also be possible to use bytes using 
> bytes arguments and optional arguments (for getcwd).

I'm still opposed to allowing bytes as file names at all in 3k. Python
should really strive for providing a uniform datatype, and that should
be the character string type.

For applications that cannot trust that the conversion works always
correctly on POSIX systems, sys.setfilesystemencoding should be
provided.

In the long run, need for explicit calls to this function should be
reduced, by
a) systems getting more consistent in their file name encoding, and
b) Python providing better defaults for detecting the file name
   encoding, and better round-trip support for non-encodable bytes.
Part b) is probably out-of-scope for 3.0 now, but should be reconsidered
for 3.1

Regards,
Martin


[Python-Dev] Real segmentation fault handler

2008-09-29 Thread Victor Stinner
Hi,

I would like to be able to catch SIGSEGV in my Python code! So I started to 
hack Python trunk to support this feature. The idea is to use a signal 
handler which calls longjmp(), and to add a setjmp() on entry to 
PyEval_EvalFrameEx().

See attached ("small") patch: segfault.patch

Example read.py, which uses the *evil* ctypes module to do an invalid memory read:
--- 8< --
from ctypes import string_at

def fault():
    text = string_at(1, 10)
    print("text = {0!r}".format(text))

def test():
    print("test: 1")
    try:
        fault()
    except MemoryError, err:
        print "ooops!"
        print err

    print("test: 2")
    try:
        fault()
    except MemoryError, err:
        print "ooops!"
        print err

    print("test: end")

def main():
    test()

if __name__ == "__main__":
    main()
--- 8< --

Result:
--- 8< --
$ python read.py
test: 1
sizeof()=160
ooops!
segmentation fault
test: 2
sizeof()=160
ooops!
segmentation fault
test: end
--- 8< --

Example bug1.py of a stack overflow:
--
loop = None,
for i in xrange(10**5):
    loop = loop, None
--

Result:
--
$ python -i bug1.py
>>> print loop
(...Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError: segmentation fault
>>>
--

Python is able to restore a valid state (stack/heap) after a segmentation 
fault and raise a classical Python exception (I chose MemoryError, but it 
could be a specific exception).

On my computer (Ubuntu Gutsy/i386), each segfault_frame takes 
sizeof(sigjmp_buf) + sizeof(void*) = 160 bytes, allocated on the stack. I 
don't know if it's huge or not, but that will limit the number of recursive 
calls. The feature can be optional if we add a configure option and some 
#ifdef/#endif. A dedicated stack is needed to be able to call the signal 
handler on a stack overflow error. I chose 4 KB, but since I only call 
longjmp(), a smaller stack might also work.

Do other VMs support such a feature? JVM, Mono, .NET, etc.?

I had the idea of catching SIGSEGV after reading the issue 1069092 (stack 
overflow because of too many recursive calls).
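
For comparison, later Python 3 releases ship a faulthandler module. It does not attempt the recovery proposed here: it only dumps the Python traceback before the process dies. A sketch, run in a child process because the interpreter does not survive the fault:

```python
import subprocess
import sys

CHILD = (
    "import faulthandler, ctypes\n"
    "faulthandler.enable()\n"
    "ctypes.string_at(1, 10)\n"  # same invalid read as in read.py
)

proc = subprocess.run([sys.executable, "-c", CHILD],
                      capture_output=True, text=True)
print("exit status:", proc.returncode)
print(proc.stderr)
```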

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/
Index: Include/segfault.h
===
--- Include/segfault.h	(révision 0)
+++ Include/segfault.h	(révision 0)
@@ -0,0 +1,37 @@
+
+/* Interface to execute compiled code */
+
+#ifndef Py_SEGFAULT_H
+#define Py_SEGFAULT_H
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include 
+#include 
+
+/* Use a custom stack for signal handlers, especially the segfault handler */
+#define SEGFAULT_STACK
+
+typedef struct _segfault_frame_t {
+sigjmp_buf env;
+struct _segfault_frame_t *previous;
+} segfault_frame_t;
+
+typedef struct {
+int init;
+PyObject *text;
+segfault_frame_t *current_frame;
+#ifdef SEGFAULT_STACK
+char stack[4096];
+#endif
+} segfault_t;
+
+void segfault_enter(segfault_frame_t *frame);
+void segfault_exit(segfault_frame_t *frame);
+void segfault_set_error(void);
+
+#ifdef __cplusplus
+}
+#endif
+#endif /* !Py_SEGFAULT_H */

Modification de propriétés sur Include/segfault.h
___
Nom : svn:eol-style
   + native

Index: Python/segfault.c
===
--- Python/segfault.c	(révision 0)
+++ Python/segfault.c	(révision 0)
@@ -0,0 +1,81 @@
+/**
+ * Python segmentation fault handler
+ */
+
+#include "Python.h"
+#include "segfault.h"
+
+static segfault_t segfault;
+
+void
+segfault_set_error(void)
+{
+PyErr_SetObject(PyExc_MemoryError, segfault.text);
+}
+
+static int segfault_install(void);
+
+static void
+segfault_handler(int sig_num)
+{
+	(void)segfault_install();
+	siglongjmp(segfault.current_frame->env, 1);
+}
+
+static int
+segfault_install()
+{
+	struct sigaction context, ocontext;
+	context.sa_handler = segfault_handler;
+	sigemptyset(&context.sa_mask);
+#ifdef SEGFAULT_STACK
+	context.sa_flags = SA_RESETHAND | SA_RESTART | SA_ONSTACK;
+#else
+	context.sa_flags = SA_RESETHAND | SA_RESTART;
+#endif
+	if (sigaction(SIGSEGV, &context, &ocontext) == -1)
+		return 1;
+	else
+		return 0;
+}
+
+void
+segfault_enter(segfault_frame_t *frame)
+{
+	frame->previous = segfault.current_frame;
+	segfault.current_frame = frame;
+}
+
+void
+segfault_exit(segfault_frame_t *frame)
+{
+	if (segfault.current_frame)
+		segfault.current_frame = segfault.current_frame->previous;
+}
+
+void
+PyOS_InitSegfault()
+{
+#ifdef SEGFAULT_STACK
+	stack_t ss;
+	ss.ss_sp = segfault.stack;
+	ss.ss_size = sizeof(segfault.stack);
+	ss.ss_flags = 0;
+	if( sigaltstack(&ss, NULL)) {
+	/* FIXME: catch this error */
+	}
+#endif
+
+	segfault.text = PyString_FromString("segmentation fault");
+/* FIXME
+	if (!segfault.text) ???;
+*/
+	(void)segfault_install();
+}
+
+void
+PyOS_FiniSegfault()
+{
+	Py_DECREF(s

Re: [Python-Dev] I would like an Python account

2008-09-29 Thread Brett Cannon
On Sat, Sep 27, 2008 at 8:32 AM, Victor Stinner
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> Would it possible to get more permissions on Python bugtracker, especially to
> add keywords, close a duplicate bug, etc.?
>

Let's start off with giving you Developer permissions on the tracker
to start and then you can work up to commit privileges, Victor. What
is your username on the tracker?

-Brett


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Victor Stinner
On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
> >>  - listdir(unicode) -> unicode and raise an error on invalid filename
>
> I know I keep flipflopping on this one, but the more I think about it
> the more I believe it is better to drop those names than to raise an
> exception. Otherwise a "naive" program that happens to use
> os.listdir() can be rendered completely useless by a single non-UTF-8
> filename. Consider the use of os.listdir() by the glob module. If I am
> globbing for *.py, why should the presence of a file named b'\xff'
> cause it to fail?

It would be hard for a newbie programmer to understand why he's unable to find 
his very important file ("important r?port.doc") using os.listdir(). And yes, 
if your file system is broken, glob() will fail.

If we choose to support bytes on Linux, a robust and portable program has to 
use only bytes filenames on Linux to always be able to list and open files.

A full example to list files and display filenames:

  import os
  import os.path
  import sys
  if os.path.supports_unicode_filenames:
      cwd = os.getcwd()
  else:
      cwd = os.getcwdb()
      encoding = sys.getfilesystemencoding()
  for filename in os.listdir(cwd):
      if os.path.supports_unicode_filenames:
          # the name is already a character string
          text = filename
      else:
          # bytes name: decode it for display only
          text = str(filename, encoding, "replace")
      print("=== File {0} ===".format(text))
      for line in open(filename):
          ...

We need an "if" to choose the directory. The second "if" is only needed to 
display the filename. Using bytes, it would be possible to write better code 
that detects the real charset (e.g. ISO-8859-1 in a UTF-8 file system) and so 
displays the filename correctly and/or proposes renaming the file. Would it 
be possible using UTF-8b / PUA hacks?

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Adam Olsen
On Mon, Sep 29, 2008 at 4:49 PM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
>> Originally I thought that this was a valid idea, but then it became
>> clear that this could be a problem.  Consider a filename which includes
>> a UTF-8 encoding of a PUA code point.
>
> I still think it's a valid idea. For non-UTF-8 file system encodings,
> use PUA characters, and generate them through an error handler.
>
> If the file system encoding is UTF-8, use UTF-8b instead as the
> file system encoding.
>
>> Viewing the PUA with GNOME charmap, I can see that many code points
>> there have character renderings on my Ubuntu system.  I have to assume,
>> therefore, that there are other (and potentially conflicting) uses for
>> this unicode feature.
>
> Depends on how you use it. If you use the PUA block 1 (i.e.
> U+E000..U+F8FF), there is a realistic chance of collision.
>
> If you use the Plane 15 or Plane 16 PUA blocks, there is currently
> zero chance of collision (AFAIK). PUA has a wide use for additional
> characters in TrueType, but I don't think many tools even support
> plane 15 and 16 for generating fonts, or rendering them (it may even
> that the TrueType/OpenType format doesn't support them in the first
> place). However, Python can make use of these planes fairly easily,
> even in 2-byte mode (through UTF-16).

An example where lossy conversion fails:

1) create file using UTF-8 app with PUA (or ambiguous scalar of
choice) filename.
2) list dir in python.  file name is now a unicode object with PUA.
3) attempt to open.  file name gets converted to malformed UTF-8
sequence.  Doesn't match the name on disk, so opening fails

Lossy conversion just moves around what gets treated as garbage.  As
all valid unicode scalars can be round tripped, there's no way to
create a valid unicode file name without being lossy.  The alternative
is to not be valid unicode, but since we can't use such objects with
external libs, and can't even print them, we might as well call them
something else.  We already have a name for that: bytes.
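
A minimal sketch of the three steps above, assuming a hypothetical escape
scheme in which each undecodable byte 0xNN is mapped to the PUA code point
U+E0NN. The handler name "pua-escape", PUA_BASE, decode_name() and
encode_name() are illustrative only and are not part of any proposal in
this thread.

```python
import codecs

PUA_BASE = 0xE000  # hypothetical choice: undecodable byte 0xNN -> U+E0NN


def _pua_escape(exc):
    """Error handler: replace each undecodable byte with a PUA code point."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + byte) for byte in bad), exc.end
    raise exc


codecs.register_error("pua-escape", _pua_escape)


def decode_name(raw):
    """bytes filename -> str, escaping invalid UTF-8 into the PUA."""
    return raw.decode("utf-8", "pua-escape")


def encode_name(name):
    """str filename -> bytes, turning escape code points back into raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)   # assumed to be an escaped byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# A broken name round-trips as intended ...
broken = b"a\xffb"
broken_roundtrip = encode_name(decode_name(broken))

# ... but a valid UTF-8 name that really contains U+E0FF does not:
# it decodes to the same str and is re-encoded as the broken name.
on_disk = "a\ue0ffb".encode("utf-8")
on_disk_roundtrip = encode_name(decode_name(on_disk))
```

The second file can be listed but not opened, because the name handed back
to the OS no longer matches the bytes on disk.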


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Greg Ewing

Ulrich Eckhardt wrote:

> AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS also provides
> Unicode filenames and how it deals with broken or legacy media is left up to
> the OS.


Does this mean that the OS always returns valid utf-8 strings
from filesystem calls, even if the media is broken or legacy?

--
Greg


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Guido van Rossum
On Mon, Sep 29, 2008 at 4:29 PM, Victor Stinner
<[EMAIL PROTECTED]> wrote:
> On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
>> >>  - listdir(unicode) -> unicode and raise an error on invalid filename
>>
>> I know I keep flipflopping on this one, but the more I think about it
>> the more I believe it is better to drop those names than to raise an
>> exception. Otherwise a "naive" program that happens to use
>> os.listdir() can be rendered completely useless by a single non-UTF-8
>> filename. Consider the use of os.listdir() by the glob module. If I am
>> globbing for *.py, why should the presence of a file named b'\xff'
>> cause it to fail?
>
> It would be hard for a newbie programmer to understand why he's unable to find
> his very important file ("important r?port.doc") using os.listdir().

*Every* failure in this scenario will be hard to understand for a
newbie programmer. We can just document the fact.

> And yes,
> if your file system is broken, glob() will fail.

Why should it?

> If we choose to support bytes on Linux, a robust and portable program have to
> use only bytes filenames on Linux to always be able to list and open files.

Right. But such robustness is only needed to support certain odd cases
and we cannot demand that most people bother to write robust code all
the time.

> A full example to list files and display filenames:
>
>  import os
>  import os.path
>  import sys
>  if os.path.supports_unicode_filenames:

This is backwards -- the Unicode API is always supported, the bytes
API only on Linux (and perhaps some other Unixes).

>      cwd = os.getcwd()
>  else:
>      cwd = os.getcwdb()
>      encoding = sys.getfilesystemencoding()
>  for filename in os.listdir(cwd):
>      if os.path.supports_unicode_filenames:
>          text = str(filename, encoding, "replace")
>      else:
>          text = filename
>      print("=== File {0} ===".format(text))
>      for line in open(filename):
>          ...
>
> We need an "if" to choose the directory. The second "if" is only needed to
> display the filename. Using bytes, it would be possible to write better code
> detect the real charset (eg. ISO-8859-1 in a UTF-8 file system) and so
> display correctly the filename and/or propose to rename the file. Would it
> possible using UTF-8b / PUA hacks?

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Victor Stinner
On Monday 29 September 2008 18:45:28, Georg Brandl wrote:
> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
> encoding (if it were UTF-8 otherwise), despite possible surprises when a
> such-encoded filename escapes from Python.

If I understand this solution correctly, the idea is to change the default 
file system encoding, right? E.g. if your filesystem is UTF-8, use ISO-8859-1 
to make sure that UTF-8 conversion will never fail.

Let's try with an ugly directory on my UTF-8 file system:
$ find
.
./têste
./ô
./a?b
./dossié
./dossié/abc
./dir?name
./dir?name/xyz

Python3 using encoding=ISO-8859-1:
>>> import os; os.listdir(b'.')
[b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
>>> files=os.listdir('.'); files
['tÃªste', 'Ã´', 'aÿb', 'dossiÃ©', 'dirÿname']
>>> open(files[0]).close()
>>> os.listdir(files[-1])
['xyz']

Ok, I have unicode filenames and I'm able to open a file and list a directory. 
The problem now is to display the filenames correctly.

For me "unicode" sounds like "text (characters) encoded in the correct 
charset". In this case, unicode is just a storage for *bytes* in a custom 
charset.

How can we mix  with ? Eg. os.path.join('dossiÃ©', "fichié") : first argument is encoded 
in ISO-8859-1 whereas the second argument is encoded in Unicode. It's 
something like this:
   str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'

Whereas the correct (unicode) result should be: 
   'dossié/fichié'
as bytes in ISO-8859-1:
   b'dossi\xe9/fichi\xe9'
as bytes in UTF-8:
   b'dossi\xc3\xa9/fichi\xc3\xa9'

Changing the default file system encoding to store bytes in Unicode is like 
introducing a new Python type: .

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Martin v. Löwis
>   import os
>   import os.path
>   import sys
>   if os.path.supports_unicode_filenames:
>       cwd = os.getcwd()
>   else:
>       cwd = os.getcwdb()
>       encoding = sys.getfilesystemencoding()
>   for filename in os.listdir(cwd):
>       if os.path.supports_unicode_filenames:
>           text = str(filename, encoding, "replace")
>       else:
>           text = filename
>       print("=== File {0} ===".format(text))
>       for line in open(filename):
>           ...
> 
> We need an "if" to choose the directory. The second "if" is only needed to 
> display the filename. Using bytes, it would be possible to write better code 
> detect the real charset (eg. ISO-8859-1 in a UTF-8 file system) and so 
> display correctly the filename and/or propose to rename the file. Would it 
> possible using UTF-8b / PUA hacks?

Not sure what "it" is: to write the code above using the PUA hack:

for filename in os.listdir(os.getcwd()):
    text = repr(filename)
    print("=== File {0} ===".format(text))
    for line in open(filename):
        ...

If "it" is "display the filename": sure, see above. If "it" is "detect
the real charset": sure, why not?

Regards,
Martin


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Adam Olsen
On Mon, Sep 29, 2008 at 5:29 PM, Victor Stinner
<[EMAIL PROTECTED]> wrote:
> On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:
>> >>  - listdir(unicode) -> unicode and raise an error on invalid filename
>>
>> I know I keep flipflopping on this one, but the more I think about it
>> the more I believe it is better to drop those names than to raise an
>> exception. Otherwise a "naive" program that happens to use
>> os.listdir() can be rendered completely useless by a single non-UTF-8
>> filename. Consider the use of os.listdir() by the glob module. If I am
>> globbing for *.py, why should the presence of a file named b'\xff'
>> cause it to fail?
>
> It would be hard for a newbie programmer to understand why he's unable to find
> his very important file ("important r?port.doc") using os.listdir(). And yes,
> if your file system is broken, glob() will fail.

Imagine a program that lists all files in a dir, as well as their file
size.  If we return bytes we'll print the name wrong.  If we return
lossy unicode we'll be unable to get the size of some files.  If we
return a malformed unicode we'll be unable to print at all (and what
if this is a GUI app?)

The common use cases need unicode, so the best options for them are to
fail outright or skip bad filenames.

The uncommon use cases need bytes, and they could do an explicit lossy
decode for printing, while still keeping the internal file name as
bytes.
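
A rough sketch of that last option, assuming a platform where os.listdir()
accepts a bytes path and returns bytes names. The helper name list_sizes()
is made up for illustration. The bytes name is used for every system call,
and the lossy decode is used only for the label that gets printed.

```python
import os
import sys


def list_sizes(directory):
    """Return (printable_name, size) pairs for the entries of a bytes path."""
    encoding = sys.getfilesystemencoding()
    rows = []
    for name in os.listdir(directory):            # bytes in, bytes out
        path = os.path.join(directory, name)      # still bytes; used for syscalls
        size = os.stat(path).st_size
        label = name.decode(encoding, "replace")  # lossy, for display only
        rows.append((label, size))
    return rows
```

Called as list_sizes(os.getcwdb()), every file gets a size, and a name that
cannot be decoded shows up with replacement characters instead of being
dropped.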


Failing outright does have the advantage that the resulting exception
should have a half-decent approximation of the bad filename.  (Thanks
to the recent choices on unicode repr() and having stderr do escapes.)


-- 
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Victor Stinner
On Tuesday 30 September 2008 01:31:45, Adam Olsen wrote:
> The alternative is not be valid unicode, but since we can't use such 
> objects with external libs, can't even print them, we might as well 
> call them something else.  We already have a name for that: bytes.

:-)


[Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-29 Thread Victor Stinner
Hi,

See attached patch: python3_bytes_filename.patch

Using the patch, you will get:
 - open() supports bytes
 - listdir(unicode) -> only unicode, *skip* invalid filenames 
   (as asked by Guido)
 - remove os.getcwdu()
 - create os.getcwdb() -> bytes
 - glob.glob() supports bytes
 - fnmatch.filter() supports bytes
 - posixpath.join() and posixpath.split() support bytes

Mixing bytes and str is invalid. Examples raising a TypeError:
 - posixpath.join(b'x', 'y')
 - fnmatch.filter([b'x', 'y'], '*')
 - fnmatch.filter([b'x', b'y'], '*')
 - glob.glob1('.', b'*')
 - glob.glob1(b'.', '*')
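
For illustration, a standalone sketch of a join() that picks its separator
from the type of the first argument, in the spirit of the posixpath change
listed above. It is a simplified stand-in written for this note, not the
patched library function itself.

```python
def join(a, *p):
    """Join path components; all of them must be str, or all bytes."""
    sep = b'/' if isinstance(a, bytes) else '/'
    path = a
    for b in p:
        if b.startswith(sep):        # raises TypeError if b has the other type
            path = b
        elif not path or path.endswith(sep):
            path += b
        else:
            path += sep + b
    return path
```

No explicit type check is needed for the mixed case: str.startswith() and
bytes.startswith() already refuse an argument of the other type, so
join(b'x', 'y') fails with a TypeError on the first component.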

$ diffstat ~/python3_bytes_filename.patch
 Lib/fnmatch.py|7 +++-
 Lib/glob.py   |   15 ++---
 Lib/io.py |2 -
 Lib/posixpath.py  |   20 
 Modules/posixmodule.c |   83 ++
 5 files changed, 62 insertions(+), 65 deletions(-)

TODO:
 - review this patch :-)
 - support non-ASCII bytes in fnmatch.filter()
 - fix other functions, eg. posixpath.isabs() and fnmatch.fnmatchcase()
 - fix functions written in C: grep FileSystemDefaultEncoding
 - make sure that mixing bytes and str is rejected

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/
Index: Lib/posixpath.py
===
--- Lib/posixpath.py	(révision 66687)
+++ Lib/posixpath.py	(copie de travail)
@@ -59,14 +59,18 @@
     """Join two or more pathname components, inserting '/' as needed.
     If any component is an absolute path, all previous path components
     will be discarded."""
+    if isinstance(a, bytes):
+        sep = b'/'
+    else:
+        sep = '/'
     path = a
     for b in p:
-        if b.startswith('/'):
+        if b.startswith(sep):
             path = b
-        elif path == '' or path.endswith('/'):
+        elif not path or path.endswith(sep):
             path +=  b
         else:
-            path += '/' + b
+            path += sep + b
     return path
 
 
@@ -78,10 +82,14 @@
 def split(p):
     """Split a pathname.  Returns tuple "(head, tail)" where "tail" is
     everything after the final slash.  Either part may be empty."""
-    i = p.rfind('/') + 1
+    if isinstance(p, bytes):
+        sep = b'/'
+    else:
+        sep = '/'
+    i = p.rfind(sep) + 1
     head, tail = p[:i], p[i:]
-    if head and head != '/'*len(head):
-        head = head.rstrip('/')
+    if head and head != sep*len(head):
+        head = head.rstrip(sep)
     return head, tail
 
 
Index: Lib/glob.py
===
--- Lib/glob.py	(révision 66687)
+++ Lib/glob.py	(copie de travail)
@@ -27,7 +27,7 @@
         return
     dirname, basename = os.path.split(pathname)
     if not dirname:
-        for name in glob1(os.curdir, basename):
+        for name in glob1(None, basename):
             yield name
         return
     if has_magic(dirname):
@@ -49,9 +49,8 @@
 def glob1(dirname, pattern):
     if not dirname:
         dirname = os.curdir
-    if isinstance(pattern, str) and not isinstance(dirname, str):
-        dirname = str(dirname, sys.getfilesystemencoding() or
-                                   sys.getdefaultencoding())
+        if isinstance(pattern, bytes):
+            dirname = dirname.encode("ASCII")
     try:
         names = os.listdir(dirname)
     except os.error:
@@ -73,6 +72,12 @@
 
 
 magic_check = re.compile('[*?[]')
+magic_check_bytes = re.compile(b'[*?[]')
 
 def has_magic(s):
-    return magic_check.search(s) is not None
+    if isinstance(s, bytes):
+        match = magic_check_bytes.search(s)
+    else:
+        match = magic_check.search(s)
+    return match is not None
+
Index: Lib/fnmatch.py
===
--- Lib/fnmatch.py	(révision 66687)
+++ Lib/fnmatch.py	(copie de travail)
@@ -43,7 +43,12 @@
     result=[]
     pat=os.path.normcase(pat)
     if not pat in _cache:
-        res = translate(pat)
+        if isinstance(pat, bytes):
+            pat_str = str(pat, "ASCII")
+            res_str = translate(pat_str)
+            res = res_str.encode("ASCII")
+        else:
+            res = translate(pat)
         _cache[pat] = re.compile(res)
     match=_cache[pat].match
     if os.path is posixpath:
Index: Lib/io.py
===
--- Lib/io.py	(révision 66687)
+++ Lib/io.py	(copie de travail)
@@ -180,7 +180,7 @@
     opened in a text mode, and for bytes a BytesIO can be used like a file
     opened in a binary mode.
     """
-    if not isinstance(file, (str, int)):
+    if not isinstance(file, (str, bytes, int)):
         raise TypeError("invalid file: %r" % file)
     if not isinstance(mode, str):
         raise TypeError("invalid mode: %r" % mode)
Index: Modules/posixmodule.c
===
--- Modules/posixmodule.c	(révision 66687)
+++ Modules/posixmodule.c	(copie de tr

Re: [Python-Dev] Filename as byte string in python 2.6 or 3.0?

2008-09-29 Thread Stephen J. Turnbull
Greg Ewing writes:
 > Ulrich Eckhardt wrote:
 > 
 > > AFAIK, OS X guarantees UTF-8 for filesystem encodings. So the OS
 > > also provides Unicode filenames and how it deals with broken or
 > > legacy media is left up to the OS.
 > 
 > Does this mean that the OS always returns valid utf-8 strings
 > from filesystem calls, even if the media is broken or legacy?

No, this means Ulrich is wrong.  NFD-normalized UTF-8 is more or less
enforced by the default filesystem, but Mac OS X up to 10.4 at least
also supports the FreeBSD filesystems, and some of those can have any
encoding you like or none at all (ie, KOI8-R and Shift JIS in the same
directory is possible).

If you have a Mac it's easy enough to test by creating a disk image
with a non-default file system.


Re: [Python-Dev] Real segmentation fault handler

2008-09-29 Thread Ralf W. Grosse-Kunstleve
FWIW: I didn't have much luck translating segfaults into exceptions. It
seemed to work on some platforms, but not others; this was in the context
of C++. In my experience, it is more useful to generate Python and C stack
traces and bail out. I also do this for floating-point exceptions. The
handlers are installed at runtime from a low-level extension module:

http://cctbx.svn.sourceforge.net/viewvc/cctbx/trunk/boost_adaptbx/meta_ext.cpp?view=markup

Example output is below. It works under Linux and partially under Mac OS X.

Ralf


% boost_adaptbx.segmentation_fault
Now dereferencing null-pointer ...
show_stack(1): /net/chevy/raid1/rwgk/dist/boost_adaptbx/command_line/segmentation_fault.py(10) run
show_stack(2): /net/chevy/raid1/rwgk/dist/boost_adaptbx/command_line/segmentation_fault.py(14)
libc backtrace (18 frames, most recent call last):
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python [0x4118e9]
  /lib64/libc.so.6(__libc_start_main+0xf4) [0x363241e074]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(Py_Main+0x935) [0x4123c5]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyRun_SimpleFileExFlags+0x1a0) [0x4a8860]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyRun_FileExFlags+0x10e) [0x4a85ce]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalCode+0x32) [0x487402]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalCodeEx+0x81f) [0x4873bf]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalFrameEx+0x6bc1) [0x486541]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalFrameEx+0x2bb9) [0x482539]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyObject_Call+0x13) [0x415ae3]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so [0x2ba7c6f7]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so(boost::python::handle_exception_impl(boost::function0)+0x28) [0x2ba87148]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so(boost::function0::operator()() const+0x19e) [0x2ba8816e]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so [0x2ba7fef8]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so(boost::python::objects::function::call(_object*, _object*) const+0x7d) [0x2ba7fb5d]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/boost_python_meta_ext.so(boost::python::objects::caller_py_function_impl > >::operator()(_object*, _object*)+0x29) [0x2b8470a9]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/boost_python_meta_ext.so [0x2b843790]
  /lib64/libc.so.6 [0x3632430f30]
Segmentation fault (Python and libc call stacks above)


% boost_adaptbx.divide_by_zero
Now dividing by zero (in C++) ...
show_stack(1): /net/chevy/raid1/rwgk/dist/boost_adaptbx/command_line/divide_by_zero.py(10) run
show_stack(2): /net/chevy/raid1/rwgk/dist/boost_adaptbx/command_line/divide_by_zero.py(14)
libc backtrace (18 frames, most recent call last):
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python [0x4118e9]
  /lib64/libc.so.6(__libc_start_main+0xf4) [0x363241e074]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(Py_Main+0x935) [0x4123c5]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyRun_SimpleFileExFlags+0x1a0) [0x4a8860]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyRun_FileExFlags+0x10e) [0x4a85ce]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalCode+0x32) [0x487402]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalCodeEx+0x81f) [0x4873bf]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalFrameEx+0x6bc1) [0x486541]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyEval_EvalFrameEx+0x2bb9) [0x482539]
  /net/chevy/raid1/rwgk/bintbx_py252/base/bin/python(PyObject_Call+0x13) [0x415ae3]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so [0x2ba7c6f7]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so(boost::python::handle_exception_impl(boost::function0)+0x28) [0x2ba87148]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so(boost::function0::operator()() const+0x19e) [0x2ba8816e]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so [0x2ba7fef8]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/libboost_python.so(boost::python::objects::function::call(_object*, _object*) const+0x7d) [0x2ba7fb5d]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/boost_python_meta_ext.so(boost::python::objects::caller_py_function_impl > >::operator()(_object*, _object*)+0x12a) [0x2b84759a]
  /net/chevy/raid1/rwgk/bintbx_py252/lib/boost_python_meta_ext.so [0x2b8437a4]
  /lib64/libc.so.6 [0x3632430f30]
Floating-point error (Python and libc call stacks above)
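
For comparison, a small sketch of the same "dump a traceback and bail out"
idea using the faulthandler module from later Python 3 standard libraries.
This is not the cctbx handler shown above: faulthandler writes the
Python-level traceback on fatal signals such as SIGSEGV and SIGFPE, and the
helper name install_crash_reporter() is made up for this example.

```python
import faulthandler


def install_crash_reporter(log_path):
    """Write a Python traceback to log_path if the process gets a fatal signal."""
    log = open(log_path, "w")
    # all_threads=True dumps the traceback of every thread, not only the
    # one that received the signal.
    faulthandler.enable(file=log, all_threads=True)
    # The file must stay open for as long as the handler is installed.
    return log
```

The process still terminates after the dump; nothing is converted into a
Python exception.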




- Original Message 
From: Victor Stinner <[EMAIL PROTECTED]>
To: python-dev@python.org
Sent: Monday, September 29, 2008 4:05:53 PM
Subject: [Python-Dev] Real segmentation fault handler

Hi,

I would like to be able to catch SIGSEGV in my Python code! So I started to 
hack Python trunk to support this feature. Th

Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Stephen J. Turnbull
Guido van Rossum writes:
 > On Mon, Sep 29, 2008 at 4:29 PM, Victor Stinner
 > <[EMAIL PROTECTED]> wrote:

 > > It would be hard for a newbie programmer to understand why he's
 > > unable to find his very important file ("important r?port.doc")
 > > using os.listdir().

 > *Every* failure in this scenario will be hard to understand for a
 > newbie programmer. We can just document the fact.

Guido is absolutely right.  The Emacs/Mule people have been trying to
solve this kind of problem for 20 years, and the best they've come up
with is Martin's strategy: if you need really robust decoding, force
ISO 8859/1 (which for historical reasons uses all 256 octets) to get a
lossless internal text representation, and decode from that and *track
the encoding used* at the application level.  The email-sig/Mailman
people will testify how hard this is to do well, even when you have a
handful of RFCs that specify how it is to be done!
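
A minimal sketch of that strategy: decode through ISO 8859-1 so that no
byte is lost, and carry the guessed encoding alongside the text. The
TrackedName record and the helper names are hypothetical, chosen only for
this illustration.

```python
from collections import namedtuple

# Hypothetical application-level record: Latin-1 "carrier" text plus the
# encoding the application believes the original bytes were written in.
TrackedName = namedtuple("TrackedName", ["carrier", "encoding"])


def load(raw, guessed_encoding):
    """Wrap raw bytes losslessly; ISO 8859-1 maps all 256 byte values."""
    return TrackedName(raw.decode("latin-1"), guessed_encoding)


def to_bytes(name):
    """Recover exactly the bytes that were loaded."""
    return name.carrier.encode("latin-1")


def to_display(name):
    """Decode with the tracked encoding, for showing to a user."""
    return to_bytes(name).decode(name.encoding, "replace")


name = load(b"dossi\xc3\xa9", "utf-8")
```

Here name.carrier is the mojibake string 'dossiÃ©', to_bytes(name) gives
back the original bytes, and to_display(name) gives 'dossié'. The carrier
string by itself says nothing about which encoding is right, which is why
the encoding has to be tracked separately.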

On the other hand, this kind of robustness is almost never needed in
"general newbie programming", except when you are writing a program to
be used to clean up after an undisciplined administration, or some
other system disaster.  Under normal circumstances the system encoding
is well-known and conformance is universal.

The best you can do for a general programming system is to
heuristically determine a single system encoding and raise an error if
the decoding fails.



Re: [Python-Dev] Patch for an initial support of bytes filename in Python3

2008-09-29 Thread Brett Cannon
On Mon, Sep 29, 2008 at 5:47 PM, Victor Stinner
<[EMAIL PROTECTED]> wrote:
> Hi,
>
> See attached patch: python3_bytes_filename.patch
>

Patches should go on the tracker, not the mailing list. Otherwise it
will just get lost.

-Brett


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Terry Reedy



On Monday 29 September 2008 19:06:01, Guido van Rossum wrote:

> I know I keep flipflopping on this one, but the more I think about it
> the more I believe it is better to drop those names than to raise an
> exception. Otherwise a "naive" program that happens to use
> os.listdir() can be rendered completely useless by a single non-UTF-8
> filename. Consider the use of os.listdir() by the glob module. If I am
> globbing for *.py, why should the presence of a file named b'\xff'
> cause it to fail?


To avoid silent skipping, is it possible to drop 'unreadable' names, 
issue a warning (instead of exception), and continue to completion?

"Warning: unreadable filename skipped; see PyWiki/UnreadableFilenames"

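A sketch of what such a warn-and-skip policy could look like, written as a
pure function over raw names so that it does not depend on any particular
file system. The function name decode_or_skip() and the warning category
are assumptions made for this example.

```python
import warnings


def decode_or_skip(raw_names, encoding="utf-8"):
    """Decode bytes filenames, warning about and dropping undecodable ones."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding))
        except UnicodeDecodeError:
            warnings.warn("unreadable filename skipped: %r" % (raw,),
                          RuntimeWarning)
    return names
```

With the default warning filters, a repeated warning from the same location
is shown only once, so a directory full of bad names would not flood the
output.
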


Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Martin v. Löwis
> Change the default file system encoding to store bytes in Unicode is like 
> introducing a new Python type: .

Exactly. Seems like the best solution to me, despite your polemics.

Regards,
Martin



Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue

2008-09-29 Thread Adam Olsen
On Tue, Sep 30, 2008 at 12:22 AM, Georg Brandl <[EMAIL PROTECTED]> wrote:
> Victor Stinner wrote:
>> On Monday 29 September 2008 18:45:28, Georg Brandl wrote:
>>> If I had to choose, I'd still argue for the modified UTF-8 as filesystem
>>> encoding (if it were UTF-8 otherwise), despite possible surprises when a
>>> such-encoded filename escapes from Python.
>>
>> If I understand correctly this solution. The idea is to change the default
>> file system encoding, right? Eg. if your filesystem is UTF-8, use ISO-8859-1
>> to make sure that UTF-8 conversion will never fail.
>
> No, that was not what I meant (although it is another possibility). As I 
> wrote,
> Martin's proposal that I support here is using the modified UTF-8 codec that
> successfully roundtrips otherwise invalid UTF-8 data.
>
> You seem to forget that (disregarding OSX here, since it already enforces
> UTF-8) the majority of file names on Posix systems will be encoded correctly.
>
>> Let's try with an ugly directory on my UTF-8 file system:
>> $ find
>> .
>> ./têste
>> ./ô
>> ./a?b
>> ./dossié
>> ./dossié/abc
>> ./dir?name
>> ./dir?name/xyz
>>
>> Python3 using encoding=ISO-8859-1:
>> >>> import os; os.listdir(b'.')
>> [b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
>> >>> files=os.listdir('.'); files
>> ['tÃªste', 'Ã´', 'aÿb', 'dossiÃ©', 'dirÿname']
>> >>> open(files[0]).close()
>> >>> os.listdir(files[-1])
>> ['xyz']
>>
>> Ok, I have unicode filenames and I'm able to open a file and list a 
>> directory.
>> The problem is now to display correctly the filenames.
>>
>> For me "unicode" sounds like "text (characters) encoded in the correct
>> charset". In this case, unicode is just a storage for *bytes* in a custom
>> charset.
>
>> How can we mix  with > unicode>? Eg. os.path.join('dossiÃ(c)', "fichié") : first argument is encoded
>> in ISO-8859-1 whereas the second argument is encoding in Unicode. It's
>> something like that:
>>    str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'
>>
>> Whereas the correct (unicode) result should be:
>>    'dossié/fichié'
>> as bytes in ISO-8859-1:
>>    b'dossi\xe9/fichi\xe9'
>> as bytes in UTF-8:
>>    b'dossi\xc3\xa9/fichi\xc3\xa9'
>
> With the filenames decoded by UTF-8, your files named têste, ô, dossié will
> be displayed and handled correctly. The others are *invalid* in the filesystem
> encoding UTF-8 and therefore would be represented by something like
>
> u'dir\uXXffname' where XX is some private use Unicode namespace. It won't look
> pretty when printed, but then, what do other applications do? They e.g. 
> display
> a question mark as you show above, which is not better in terms of 
> readability.
>
> But it will work when given to a filename-handling function. Valid filenames
> can be compared to Unicode strings.
>
> A real-world example: OpenOffice can't open files with invalid bytes in their
> name. They are displayed in the "Open file" dialog, but trying to open fails.
> This regularly drives me crazy. Let's not make Python not work this way too,
> or, even worse, not even display those filenames.

The only way to display that file would be to transform it into some
other valid unicode string.  However, as that string is already valid,
you've just made any files named after it impossible to open.  If you
extend unicode then you're unable to display that extended name[1].
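
The "extended unicode" case can be tried out in later Python 3 releases,
which ship a codec error handler named surrogateescape that behaves much
like the UTF-8b idea discussed in this thread. The sketch below assumes
that handler is available; it was not part of Python at the time of this
discussion.

```python
raw = b"dir\xffname"

# The invalid byte 0xFF becomes the lone surrogate U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# Handing the name back to the OS works: the original bytes are recovered.
roundtrip = name.encode("utf-8", "surrogateescape")

# Displaying it does not: a strict UTF-8 encode rejects the lone surrogate.
try:
    name.encode("utf-8")
    displayable = True
except UnicodeEncodeError:
    displayable = False
```

So the name can be opened again, but it cannot be written to a strict
UTF-8 stream without a further lossy step.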

I think Guido's right on this one.  If I have to choose between
openoffice crashing or skipping my file, I'd vastly prefer it skip it.
 A warning would be a nice bonus (from python or from openoffice),
telling me there's a buggered file I should go fix.  Renaming the file
is the end solution.


[1] You could argue that Unicode should add new scalars to handle all
currently invalid UTF-8 sequences.  They could then output to their
original forms if in UTF-8, or a mundane form in UTF-16 and UTF-32.
However, I suspect "we don't want to add validation to linux" will not
be a very persuasive argument.

-- 
Adam Olsen, aka Rhamphoryncus