Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> The name "utf8b" suggested in the PEP is not in line with the codec
> design

Where is that design documented, and how exactly does the name violate
the design (chapter and verse, please)?

> Error handlers and codecs are two different things, so the namespaces
> need to be clearly separate.

They *are* separate namespaces; that's guaranteed by the implementation.

Regards,
Martin
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
>  > > It occurs to me that the PEP maybe should say that it is an error
>  > > to have your POSIX locale set to UTF-16 or something like that.
>  > 
>  > No. It is *impossible* to have UTF-16 as the locale character set,
>  > not an error. Your statement is like saying "it is an error to
>  > breathe in the vacuum".
> 
> I realize this is not useful, so maybe you don't need to mention it.
> However, it certainly is possible to set LANG with an absurd, or
> merely dangerous, encoding.

How so? The C library will filter it out.

>  > In any case, the discussion says
>  > 
>  > # Encodings that are not compatible with ASCII are not supported by
>  > # this specification; bytes in the ASCII range that fail to decode
>  > # will cause an exception. It is widely agreed that such encodings
>  > # should not be used as locale charsets.
> 
> Which is your excuse for not supporting Shift JIS fully.  It doesn't
> stop people from setting LC_ALL=ja_JP.shift_jis, 

Well, it *does* stop them from doing so if their systems don't support
the locale setting.

In any case, if they do this, PEP 383 will not support them.

> or using Shift JIS as the default encoding for certain media.

I fail to see how this could ever matter. If, by "media", you mean
things like removable disks, and the file name encoding used on them,
it's fairly irrelevant for the PEP, since Python won't start using
Shift JIS as its file system encoding just because that's the encoding
used on the disk.

Regards,
Martin



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
>  > > Second, I suggest "surrogate-replace" as the name of the error handler
>  > > rather than "utf8b".
>  > 
>  > I think this is bike-shedding.
> 
> I don't personally care (I already was aware of UTF-8B), but there are
> plenty of others who do. 

I think it is a fairly bad name, because it is easy to confuse it with
the "surrogates" error handler (unless you suggest to rename that also).

> You have to fix the existing uses of
> the obsolete "python-escape", anyway.

Indeed - but only in the PEP. In the implementation, it's already utf8b
throughout. Now it is also in the PEP; thanks for pointing that out.

>  > It's a security risk. If U+DCXX would map to \xXX, then somebody could
>  > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
>  > sanitized, nobody would expect that this will actually access ../
> 
> The odds that anybody will actually take notice of U+002E U+002E
> U+002F in a string are sufficiently small that any number of exploits
> have already been based on it.  I agree that there is some additional
> risk from this if people make the check for "../" before they prepend
> "\udc2e\udc2e\udc2f", but I think that risk is very small compared to
> the pain of having an error handler whose raison d'etre is to not raise
> exceptions go ahead and raise them anyway.

The problem is that functions like normpath will recognize ../, and
that applications rely on them for file name sanitation. If they could
be tricked into writing outside of their target folders, this would
be a huge security risk.
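
For illustration, the restriction argued for above can be sketched in
Python. This is only a sketch under one assumption: a Python 3 release in
which the handler discussed in this thread as "utf8b" is exposed under the
name "surrogateescape". Only undecodable bytes 0x80-0xFF are mapped to
U+DC80-U+DCFF, so a lone surrogate standing in for an ASCII byte such as
"." or "/" is refused on encoding.

```python
import os.path

# Assumption: the handler called "utf8b" in this thread is available
# as "surrogateescape" in the interpreter running this sketch.
HANDLER = "surrogateescape"

# An undecodable byte >= 0x80 becomes a lone surrogate; ASCII is untouched,
# so the "../" in the result is a real "../" that normpath can recognize.
name = b"\xff../x".decode("utf-8", HANDLER)

# normpath collapses the genuine ".." component as usual.
normalized = os.path.normpath("a/b/../c")

# U+DC2E U+DC2E U+DC2F would smuggle "../" past a sanitizer if it were
# mapped back to bytes 0x2E 0x2E 0x2F, so encoding it raises instead.
try:
    "\udc2e\udc2e\udc2f".encode("utf-8", HANDLER)
    smuggled = True
except UnicodeEncodeError:
    smuggled = False
```

The design choice is that the exception is raised at the encoding step,
after any file name sanitation has already run on the string.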

OTOH, I don't care about breaking applications on misconfigured systems.
People using SJIS as their locale encodings have bigger problems
than Python raising exceptions.

> See also my reply to Lino Mastrodomenico.

URL?

> But you're writing the PEP, so this battle will have to be deferred.
> Eventually Python will have to take a stand on Unicode conformance,
> but it's not urgent yet.

I think it's always applications that are conforming or not, rather
than libraries. Libraries should allow writing conforming applications.
They may refuse to write certain non-conforming applications (although
users then replace the library with one that does allow them to do
what they want). Libraries can never enforce that applications conform
to some standard.

> Sorry!  I suggest substituting the paragraph above for the paragraph
> which begins "The encode error handler interface presently requires..."
> at line 129.

Ah, ok. This was Glenn Linderman's text before - now it's yours :-)

> I think I forgot to do this before:  "I hereby dedicate all text
> I suggest for inclusion in the PEP to the public domain."

:-)

Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Yeah, yeah, this is the same old same old from PEP 3131.  Anything
> that handles the various attacks based on ASCII-alike characters
> should at least rule out invalid Unicode, too!
> 
> And where is this U+DC2F supposed to be coming from, anyway?  The
> user's *local* environment or the user's *local* filesystem! 

Why is that not a threat? Suppose you have a setuid application, and
you pass some string on the command line that decodes to /../. Then
the setuid application will be tricked into modifying files it didn't
mean to modify.

Likewise, it might come from a relational database. Use a relational
database that supports unicode code units, or lone surrogates through
utf-8, and fill in some bogus data. Then have the Python application
(running as root) read it.

> Of course I can't prove that there's no vector for an exploit here (in
> fact, I'm sure there is one with sufficiently careless handling of
> input), but I think "consenting adults" covers the Shift JIS use case.
> Make it an option, but it should be explicitly part of the PEP.

Nothing is lost at the moment. If users complain, we can still think
of ways to enhance the experience.

In any case, Python 3.1b1 may get released today, so it's way too late
for new features in the PEP. They can wait for Python 3.2.

Regards,
Martin


[Python-Dev] Help on issue 5941

2009-05-06 Thread Tarek Ziadé
Hello,

I need some help on http://bugs.python.org/issue5941

The bug is quite simple: the Distutils unixcompiler used to set the
archiver command to "ar -rc".

This behavior was changed quite a while ago so that the compiler can be
customized from the environment. That introduced a regression, because the
mechanism in Distutils that looks for the AR variable in the environment
also looks in Python's Makefile (in the Makefile, then in os.environ).

And as a matter of fact, AR is set to "ar" in there, so the -cr option
is not set anymore.

So my question is: should I change the Makefile by adding, for example,
a variable called AR_OPTIONS, and then build the ar command from
AR + AR_OPTIONS

*or*

does that not make sense, so that I just need to change the behavior so it
doesn't look for AR in the Makefile (just in os.environ)?
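
To make the first option concrete, here is a rough sketch of building the
command from two variables. It is an illustration only: archiver_command is
a made-up helper, not a Distutils function, and AR_OPTIONS is the
hypothetical variable name proposed above. The only real API used is
sysconfig.get_config_var, which reads values recorded from Python's
Makefile.

```python
import os
import shlex
import sysconfig


def archiver_command(environ=None):
    """Build the archiver command as a list of arguments.

    Hypothetical helper: AR_OPTIONS is the variable proposed in this
    message, and "-cr" is the historical default flag set.
    """
    environ = os.environ if environ is None else environ
    # The environment wins; otherwise fall back to the Makefile value.
    ar = environ.get("AR") or sysconfig.get_config_var("AR") or "ar"
    # Keeping the flags separate means overriding AR no longer drops them.
    options = environ.get("AR_OPTIONS", "-cr")
    return shlex.split(ar) + shlex.split(options)


default_cmd = archiver_command({"AR": "ar"})
custom_cmd = archiver_command({"AR": "gar", "AR_OPTIONS": "-rcs"})
```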

Thanks
Tarek

-- 
Tarek Ziadé | http://ziade.org


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Martin v. Löwis writes:
> 
> > I don't personally care (I already was aware of UTF-8B), but there are
> > plenty of others who do. 
> 
> I think it is a fairly bad name, because it is easy to confuse it with
> the "surrogates" error handler (unless you suggest to rename that also).

I didn't bother to say it at the time, but I think "surrogates" is a pretty bad
name. It should be more indicative of what it does, e.g. "surrogates-pass", or
"surrogates-accept".

> >  > It's a security risk. If U+DCXX would map to \xXX, then somebody could
> >  > embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets
> >  > sanitized, nobody would expect that this will actually access ../

Agreed this is an annoying security breach. The whole point of the PEP is that
application developers do not have to care about filename encoding issues,
which is defeated if they have to check for strange (illegal) combinations of
characters.

By the way, what are the ASCII characters that are not supported by Shift-JIS?
Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
backslash and the tilde).

Regards

Antoine.




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > I fail to see how this could ever matter. If, by "media", you mean
 > things like removable disks, and the file name encoding used on them,
 > it's fairly irrelevant for the PEP, since Python won't start using
 > Shift JIS as its file system encoding just because that's the encoding
 > used on the disk.

I'm sorry for the lack of clarity of my posts, but somehow you're
completely missing the point.  The point is precisely that Python
*won't* use Shift JIS as the file system encoding (if it did there
would be no problem with reading Shift JIS), but the people who
created the media *did*.

Now, with Python's file system encoding == UTF-8 or any packed EUC,
and more than a handful of Shift JIS or Big5 characters in file names,
one is *almost certain* to encounter ASCII as the second byte of a
multibyte sequence.  PEP 383 can't handle this, but it is sure to be
the most common use case for PEP 383 in East Asia.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread M.-A. Lemburg
Martin v. Löwis wrote:
>> The name "utf8b" suggested in the PEP is not in line with the codec
>> design
> 
> Where is that design documented, and how exactly does the name violate
> the design (chapter and verse, please)?

Martin, I designed the whole Python codec machinery, so even if
this is not explicitly written down somewhere, you can take my
word for it.

I don't want users to be confused by such an error handler
name, so please change it !

Here's a list of the currently available error handlers (taken from
codecs.py):

The .encode()/.decode() methods may use different error
handling schemes by providing the errors argument. These
string values are predefined:

 'strict' - raise a ValueError error (or a subclass)
 'ignore' - ignore the character and continue with the next
 'replace' - replace with a suitable replacement character;
             Python will use the official U+FFFD REPLACEMENT
             CHARACTER for the builtin Unicode codecs on
             decoding and '?' on encoding.
 'xmlcharrefreplace' - Replace with the appropriate XML
                       character reference (only for encoding).
 'backslashreplace'  - Replace with backslashed escape sequences
                       (only for encoding).

The set of allowed values can be extended via register_error.
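
As a small illustration of that last sentence, a handler can be registered
with codecs.register_error and then named in the errors argument. This is a
sketch only: the handler name "demo-hexreplace" and the function hex_replace
are invented for this example and are not part of Python or of the PEP.

```python
import codecs


def hex_replace(exc):
    """Replace each undecodable byte with a visible <xx> marker."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        replacement = "".join("<%02x>" % byte for byte in bad)
        # Return the replacement text and the position to resume decoding.
        return (replacement, exc.end)
    # This sketch only handles decoding errors.
    raise exc


# Error handler names live in their own registry, separate from codec names.
codecs.register_error("demo-hexreplace", hex_replace)

decoded = b"abc\xffdef".decode("utf-8", "demo-hexreplace")
```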

>> Error handlers and codecs are two different things, so the namespaces
>> need to be clearly separate.
> 
> They *are* separate namespaces; that's guaranteed by the implementation.

In the implementation, yes, but not in the head of a typical user:
the 'utf8b' looks more like a codec name than an error handler
name.

I want to avoid any such confusion with Python codecs and don't
understand why you are making a problem out of this.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread MRAB

M.-A. Lemburg wrote:
> Martin v. Löwis wrote:
>>> The name "utf8b" suggested in the PEP is not in line with the codec
>>> design
>>
>> Where is that design documented, and how exactly does the name violate
>> the design (chapter and verse, please)?
>
> Martin, I designed the whole Python codec machinery, so even if
> this is not explicitly written down somewhere, you can take my
> word for it.
>
> I don't want users to be confused by such an error handler
> name, so please change it !
>
> Here's a list of the currently available error handlers (taken from
> codecs.py):
>
> The .encode()/.decode() methods may use different error
> handling schemes by providing the errors argument. These
> string values are predefined:
>
>  'strict' - raise a ValueError error (or a subclass)
>  'ignore' - ignore the character and continue with the next
>  'replace' - replace with a suitable replacement character;
>              Python will use the official U+FFFD REPLACEMENT
>              CHARACTER for the builtin Unicode codecs on
>              decoding and '?' on encoding.
>  'xmlcharrefreplace' - Replace with the appropriate XML
>                        character reference (only for encoding).
>  'backslashreplace'  - Replace with backslashed escape sequences
>                        (only for encoding).
>
> The set of allowed values can be extended via register_error.
>
>>> Error handlers and codecs are two different things, so the namespaces
>>> need to be clearly separate.
>>
>> They *are* separate namespaces; that's guaranteed by the implementation.
>
> In the implementation, yes, but not in the head of a typical user:
> the 'utf8b' looks more like a codec name than an error handler
> name.

Judging by the existing names, I think that 'surrogate' would be
reasonable. It already contains the meaning of substitute, it's not too
long, and the codes which act as replacements are already called
surrogates.

> I want to avoid any such confusion with Python codecs and don't
> understand why you are making a problem out of this.





Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
MRAB writes:
> 
> Judging by the existing names, I think that 'surrogate' would be
> reasonable. It already contains the meaning of substitute,

Only if you are a native English-speaker I suppose... For me it's just a
technical term denoting a certain class of unicode code points (I'm not sure of
the latter terminology ;-)).

Regards

Antoine.




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Lino Mastrodomenico
2009/5/6 Antoine Pitrou:
> By the way, what are the ASCII characters that are not supported by
> Shift-JIS?
> Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
> backslash and the tilde).

The biggest problem with Shift-JIS is that a perfectly valid unicode
character above 127 can be encoded to a byte sequence that includes
bytes in range(128).

E.g. the character 掛 (a.k.a. '\u639b') when encoded with Shift-JIS
becomes the two bytes sequence b'\x8a|'. Notice that the second byte
is 124, which on POSIX is usually interpreted as the pipe character
and can have security implications.
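
The example above can be checked with a few lines of Python. This is a
small sketch using the standard "shift_jis" codec; the variable names are
only illustrative.

```python
ch = "\u639b"  # the character from the example above

# Shift-JIS: the trail byte of the two-byte sequence falls in the ASCII range.
sjis = ch.encode("shift_jis")
trail = sjis[1]
trail_is_ascii = trail < 128
trail_as_char = chr(trail)

# UTF-8: every byte of a non-ASCII character has its high bit set, so no
# byte of the sequence can be mistaken for "|" or any other ASCII character.
utf8 = ch.encode("utf-8")
utf8_all_high = all(byte >= 0x80 for byte in utf8)
```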

It's a known problem with Shift-JIS and was fixed in UTF-8.

-- 
Lino Mastrodomenico


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Lennart Regebro
On Wed, May 6, 2009 at 09:31, "Martin v. Löwis" wrote:
> They *are* separate namespaces; that's guaranteed by the implementation.

Yes. But utf8b *sounds like* an encoding. When it isn't. I sure
thought it was when it was first mentioned. I agree that it would be
better to find another name.

'utf8-binary-replace'?

Is it only usable with utf8 as an encoding?
-- 
Lennart Regebro: Python, Zope, Plone, Grok
http://regebro.wordpress.com/
+33 661 58 14 64


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
Lino Mastrodomenico writes:

 > It's a known problem with Shift-JIS and was fixed in UTF-8.

It was fixed in EUC before Shift-JIS was invented by Microsoft or Big5
was invented by the Taiwanese clone makers.  Guido's not the only
language designer with a time machine




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > > Yeah, yeah, this is the same old same old from PEP 3131.  Anything
 > > that handles the various attacks based on ASCII-alike characters
 > > should at least rule out invalid Unicode, too!
 > > 
 > > And where is this U+DC2F supposed to be coming from, anyway?  The
 > > user's *local* environment or the user's *local* filesystem! 
 > 
 > Why is that not a threat? Suppose you have a setuid application, and
 > you pass some string on the command line that decodes to /../. Then
 > the setuid application will be tricked into modifying files it didn't
 > mean to modify.

Of course this is a threat, assuming that the application takes no
precautions.  But first, it should be stopped by any of several
standard precautions.  For example, applying os.path.realpath (come to
think of it, PEP 383 should say something about realpath, shouldn't
it?) and os.path.normpath (PEP 383 should definitely say something
about this function; maybe PEP 3131 should, too) before checking
access restrictions.  If you're not running your paths through those,
you're already vulnerable to symlink attacks, and maybe other forms of
spoofing.
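
A minimal sketch of that kind of precaution follows. The helper name
is_within and the base directory are hypothetical; the only real APIs used
are os.path.realpath, os.path.join and os.path.commonpath (the latter
exists only in Python versions newer than the ones discussed in this
thread).

```python
import os.path

BASE = "/srv/app/data"  # hypothetical directory the application may touch


def is_within(base, candidate):
    """Return True if candidate, taken relative to base, stays inside base.

    Both paths go through realpath first, so ".." components and symlinks
    are resolved before the containment check is made.
    """
    real_base = os.path.realpath(base)
    real_candidate = os.path.realpath(os.path.join(base, candidate))
    return os.path.commonpath([real_base, real_candidate]) == real_base


inside = is_within(BASE, "reports/2009.txt")
escaped = is_within(BASE, "../secrets")
sneaky = is_within(BASE, "reports/../../secrets")
```

The point of the ordering is that the access check runs on the resolved
path, not on the string the user supplied.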

Second, it's a threat already enabled by your restricted version of
PEP 383.  Access control applies to subdirectories as well as to
parent directories.  Since you can insert arbitrary non-ASCII bytes
into the path using the current definition of 'utf8b', name-based
access restrictions can be bypassed in exactly the same way for any
directory whose name is not 100.00% ASCII, and the setuid application
will be tricked into modifying files it didn't mean to modify.

Also, on Mac OS X, system directories, including directories
containing system libraries, frameworks, and executables, may be
accessible via locale-specific names (I don't have a Japanese-
localized Mac at hand to check, but I'm pretty sure in my old Mac the
Japanese names appeared in ls in Terminal.app, which means it may be
possible to access system directories containing libraries,
frameworks, and executables this way).  Those can be spoofed in
exactly the same way.

 > Nothing is lost at the moment.

Nothing is lost compared to 'strict', true, but under the PEP as it is
a large fraction of Shift JIS and Big5 filenames cannot be read under
ASCII-compatible file system encodings using 'utf8b'.  Yet it is those
users who are placed at risk by PEP 383.

 > In any case, Python 3.1b1 may get released today, so it's way too late
 > for new features in the PEP. They can wait for Python 3.2.

You have convinced me that the PEP should wait as well.

In its current form it is incomplete and dangerous.



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Stephen J. Turnbull writes:
> 
> Nothing is lost compared to 'strict', true, but under the PEP as it is
> a large fraction of Shift JIS and Big5 filenames cannot be read under
> ASCII-compatible file system encodings using 'utf8b'.

You should really be more specific. I'm not sure about others, but I don't
understand what filenames you are talking about.




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread R. David Murray

On Wed, 6 May 2009 at 13:40, Antoine Pitrou wrote:
> Stephen J. Turnbull writes:
>> Nothing is lost compared to 'strict', true, but under the PEP as it is
>> a large fraction of Shift JIS and Big5 filenames cannot be read under
>> ASCII-compatible file system encodings using 'utf8b'.
>
> You should really be more specific. I'm not sure about others, but I don't
> understand what filenames you are talking about.


Seems to me that the best thing to do would be to file a bug report with
test cases that demonstrate the problems when run against the current
py3k trunk.

Especially the security issues you cite (which I don't understand).

--David


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Zooko Wilcox-O'Hearn

On May 6, 2009, at 7:33 AM, Stephen J. Turnbull wrote:

> You have convinced me that the PEP should wait as well.
>
> In its current form it is incomplete and dangerous.


+1 on delaying PEP 383

I think PEP 383 is a good idea in principle, but I'm still struggling  
to understand it myself, and it seems to offer new hazards for the  
unwary programmer.


On the other hand, maybe the wary programmers are waiting for Python
3.2 anyway.


On the gripping hand, if PEP 383 is released in Python 3.1, will that
obligate python-dev to support it indefinitely, at least in
backwards-compatibility mode?  I'm not thinking of API compatibility as
much as data compatibility -- someone used Python 3.1 to write down some
filenames, and now a few years later they are trying to use the
latest and greatest Python release to read those filenames...


Regards,

Zooko


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread James Y Knight

On May 6, 2009, at 5:39 AM, Stephen J. Turnbull wrote:
> Now, with Python's file system encoding == UTF-8 or any packed EUC,
> and more than a handful of Shift JIS or Big5 characters in file names,
> one is *almost certain* to encounter ASCII as the second byte of a
> multibyte sequence.  PEP 383 can't handle this


Hm, I haven't tried the implementation, but I thought that what would  
happen is:
'\x85a'.decode('utf-8', 'utf8b/surrogate-replace/whateveritscalled')
    -> u'\uDC85a'


If that indeed doesn't happen, that's certainly a defect and should be  
remedied.
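
For what it's worth, the expected behaviour can be written down as a short
check. This is a sketch under one assumption: a Python 3 interpreter in
which the handler is exposed as "surrogateescape" rather than under any of
the names used in this thread.

```python
# Assumption: the PEP 383 handler is available as "surrogateescape".
HANDLER = "surrogateescape"

raw = b"\x85a"  # a stray high byte followed by an ASCII trail byte

# The high byte is escaped as U+DC85; the ASCII byte decodes normally.
text = raw.decode("utf-8", HANDLER)

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", HANDLER)

# Without the handler, the same input is rejected.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```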



> , but it is sure to be
> the most common use case for PEP 383 in East Asia.


Yes.

James


Re: [Python-Dev] Undocumented change / bug in Python3's PyMapping_Check

2009-05-06 Thread Nick Coghlan
John Millikin wrote:
> In Python 2, PyMapping_Check will return 0 for list objects. In Python
> 3, it returns 1. Obviously, this makes it rather difficult to
> differentiate between mappings and other sized iterables. In addition,
> it differs from the behavior of the ``collections.Mapping`` ABC --
> isinstance([], collections.Mapping) returns False.
> 
> I believe the new behavior is erroneous, but would like to confirm
> that before filing a bug.

It's not a bug.

PyMapping_Check just tells you if a type has an entry in the
tp_as_mapping->mp_subscript slot. In 2.x, it used to have an additional
condition that the tp_as_sequence->sq_slice slot be empty, but that has
gone away in Py3k because the sq_slice slot has been removed.

Even in 2.x that test wasn't a reliable way of telling if something was
a mapping or a sequence - it happened to get it right for lists and
tuples (since they define __getslice__ and __setslice__), but this is
not the case for new-style user defined sequences:

>>> from operator import isMappingType
>>> class MySeq(object):
...     def __getitem__(self, idx):
...         # Is this a mapping or an unsliceable sequence?
...         return idx*2
...
>>> isMappingType(MySeq())
True

Use the new collections module ABCs to check for sequences and
mappings. That's what they're for, and they will give you a much more
reliable answer than the C level checks (which are really just an
implementation detail).
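
A short sketch of the ABC-based check, assuming a current Python 3 where
the ABCs are imported from collections.abc (at the time of this message
they were reachable as collections.Mapping and collections.Sequence):

```python
from collections.abc import Mapping, Sequence


class MySeq:
    """Defines only __getitem__, like the class in the session above."""

    def __getitem__(self, idx):
        return idx * 2


# Built-in containers are classified the way one would expect.
list_is_mapping = isinstance([], Mapping)
list_is_sequence = isinstance([], Sequence)
dict_is_mapping = isinstance({}, Mapping)

# A class with only __getitem__ is claimed by neither ABC ...
before = (isinstance(MySeq(), Mapping), isinstance(MySeq(), Sequence))

# ... until its author states the intent explicitly by registering it.
Sequence.register(MySeq)
after = (isinstance(MySeq(), Mapping), isinstance(MySeq(), Sequence))
```

The ambiguity from the C-level check goes away because the class author,
not the slot layout, decides which ABC applies.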

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia
---


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Zooko Wilcox-O'Hearn writes:
> 
> I'm not thinking of API compatibility as much as  
> data compatibility -- someone used Python 3.1 to write down some  
> filenames, and now a few years later they are trying to use the  
> latest and greatest Python release to read those filenames...

Well, if the filenames are generated by Python (as opposed to read from an
existing directory on disk), they should be regular unicode objects without any
lone surrogates, so I don't see the compatibility problem.

Regards

Antoine.




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 6:33 AM, came the following characters from
the keyboard of Stephen J. Turnbull:
> "Martin v. Löwis" writes:
>  > In any case, Python 3.1b1 may get released today, so it's way too late
>  > for new features in the PEP. They can wait for Python 3.2.
>
> You have convinced me that the PEP should wait as well.
>
> In its current form it is incomplete and dangerous.



I see nothing in this thread that suggests that the PEP is dangerous in 
its current form.


While I (still) think that more readable transcodings could have been 
used, and while I had difficulty fully understanding the PEP at first, 
now that I think I do understand the PEP, and it has been somewhat 
clarified and amended, I cannot see how it could be dangerous.  A 
specific case of danger should be included with such a statement.


Regarding incomplete, I agree it won't brush my teeth for me, but I 
think it does solve the problem it sets out to solve.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 3:08 AM, came the following characters from
the keyboard of MRAB:
> M.-A. Lemburg wrote:
>> Martin v. Löwis wrote:
>
> Judging by the existing names, I think that 'surrogate' would be
> reasonable. It already contains the meaning of substitute, it's not too
> long, and the codes which act as replacements are already called
> surrogates.
>
>> I want to avoid any such confusion with Python codecs and don't
>> understand why you are making a problem out of this.



+1 for "surrogate" as the name for the error handler.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 12:53 AM, came the following characters from
the keyboard of Martin v. Löwis:
>> Sorry!  I suggest substituting the paragraph above for the paragraph
>> which begins "The encode error handler interface presently requires..."
>> at line 129.
>
> Ah, ok. This was Glenn Linderman's text before - now it's yours :-)



Which is fine by me.  Stephen's is more explanatory than mine, but says 
the same thing.


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy

Glenn Linderman wrote:
> On approximately 5/6/2009 3:08 AM, came the following characters from
> the keyboard of MRAB:
>> M.-A. Lemburg wrote:
>>> Martin v. Löwis wrote:
>>
>> Judging by the existing names, I think that 'surrogate' would be
>> reasonable. It already contains the meaning of substitute, it's not too
>> long, and the codes which act as replacements are already called
>> surrogates.
>>
>>> I want to avoid any such confusion with Python codecs and don't
>>> understand why you are making a problem out of this.
>
> +1 for "surrogate" as the name for the error handler.



+1 from me also



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Zooko Wilcox-O'Hearn

On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:


Zooko Wilcox-O'Hearn writes:


I'm not thinking of API compatibility as much as data  
compatibility -- someone used Python 3.1 to write down some  
filenames, and now a few years later they are trying to use the  
latest and greatest Python release to read those filenames...


Well, if the filenames are generated by Python (as opposed to read  
from an existing directory on disk), they should be regular unicode  
objects without any lone surrogates, so I don't see the  
compatibility problem.


I meant that the application reads filenames from an existing  
directory on disk, saves those filenames, and then later, using a  
future version of Python, wants to read them and use them.


I'm not saying that I know this would be a problem.  I'm saying that  
I personally can't tell whether it would be a problem or not, and the  
extensive discussions so far have not convinced me that there is  
anyone who both understands PEP 383 and considers this use case.


Many people who apparently understand encoding issues well have said  
something to the effect that there is no problem, but those people  
haven't yet managed to get through my thick skull how I would use PEP  
383 safely for this sort of use case -- the one where data generated  
by os.listdir() travels forward in time or the one where that data
travels sideways to other systems, including Windows or other systems  
that validate incoming unicode.


That's why I am a bit uncomfortable about PEP 383 being quickly  
implemented and deployed in Python 3.1.


By the way, much of the detailed discussion about what Tahoe requires  
and how that may or may not benefit from PEP 383 has now moved to the  
tahoe-dev mailing list:
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .


Regards,

Zooko



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 12:18 PM, came the following characters from 
the keyboard of Zooko Wilcox-O'Hearn:

On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:


Zooko Wilcox-O'Hearn writes:


I'm not thinking of API compatibility as much as data compatibility 
-- someone used Python 3.1 to write down some filenames, and now a 
few years later they are trying to use the latest and greatest Python 
release to read those filenames...


Well, if the filenames are generated by Python (as opposed to read 
from an existing directory on disk), they should be regular unicode 
objects without any lone surrogates, so I don't see the compatibility 
problem.


I meant that the application reads filenames from an existing directory 
on disk, saves those filenames, and then later, using a future version 
of Python, wants to read them and use them.



Regarding future versions of Python.  In the worst case, even if 
Python's default behavior changes, the transcoding done by PEP 383 can 
be done in other software too... it is a straightforward, fully 
specified, 1-to-1, reversible transcoding process, affecting and 
generating only invalid byte encodings on one side, and invalid Unicode 
sequences on the other.


So if Python's default behavior should change, the transcoding 
implemented by PEP 383 could be easily reimplemented to enable a future 
version of a Python application to manipulate the transcoded, saved, 
filenames.


By easily, I mean that I could code it in a couple hours, max.
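
As a rough illustration of how small such a reimplementation is, here is a
sketch in Python 3 that performs the transcoding by hand, without relying on
any built-in error handler. The names escape_decode and escape_encode are
made up for this example. The sketch assumes an ASCII-compatible codec such
as UTF-8, where only bytes 0x80..0xFF can fail to decode; each such byte b
becomes the lone surrogate U+DC00 + b, and the reverse step undoes exactly
that mapping.

```python
def escape_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw; map each undecodable byte 0x80..0xFF to U+DC80..U+DCFF."""
    out = []
    pos = 0
    while pos < len(raw):
        try:
            out.append(raw[pos:].decode(encoding))
            break
        except UnicodeDecodeError as exc:
            # Keep the valid prefix, escape one bad byte, then resume after it.
            out.append(raw[pos:pos + exc.start].decode(encoding))
            bad = raw[pos + exc.start]
            if bad < 0x80:
                raise  # ASCII bytes are never escaped
            out.append(chr(0xDC00 + bad))
            pos += exc.start + 1
    return "".join(out)


def escape_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Reverse of escape_decode: U+DC80..U+DCFF turn back into single bytes."""
    out = bytearray()
    run = []  # ordinary characters waiting to be encoded normally
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out += "".join(run).encode(encoding)
            run.clear()
            out.append(cp - 0xDC00)
        else:
            run.append(ch)
    out += "".join(run).encode(encoding)
    return bytes(out)


raw = b"abc\xff\xe3\x81def\xc3\xa9"
text = escape_decode(raw)
roundtrip = escape_encode(text)
```

The decode loop is quadratic in the worst case, which is acceptable for file
names but would need a streaming approach for large inputs.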


I'm not saying that I know this would be a problem.  I'm saying that I 
personally can't tell whether it would be a problem or not, and the 
extensive discussions so far have not convinced me that there is anyone 
who both understands PEP 383 and considers this use case.



Does the above help?


Many people who apparently understand encoding issues well have said 
something to the effect that there is no problem, but those people 
haven't yet managed to get through my thick skull how I would use PEP 
383 safely for this sort of use case -- the one where data generated by 
os.listdir() travels forward in time or the one where that data travels 
sideways to other systems, including Windows or other systems that 
validate incoming unicode.



Regarding data traveling sideways, some comments:

1) PEP 383's effect could be recoded in other languages as easily as it 
is in Python (or the C in which Python is implemented).  So that could be 
a solution.


2) You mention "Windows" and "other systems that validate incoming 
unicode" in the same phrase, as if you think that "Windows" qualifies as 
one of the "other systems that validate incoming unicode", but it does not (at 
least not universally).
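
For data that does travel sideways, one possible precaution is to check a
name before handing it to a system that validates Unicode. The sketch below
assumes Python 3 as eventually released, where the handler discussed in this
thread is spelled "surrogateescape"; the helper names has_escaped_bytes and
is_valid_unicode are hypothetical and exist only for this example.

```python
def has_escaped_bytes(name: str) -> bool:
    """True if name carries bytes escaped as lone surrogates U+DC80..U+DCFF."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def is_valid_unicode(name: str) -> bool:
    """True if name can be written as strict UTF-8 (no lone surrogates)."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A file name containing a byte that is not valid UTF-8.
decoded = b"report-\xff.txt".decode("utf-8", "surrogateescape")
clean = "report.txt"
```

An application could refuse, rename, or separately flag names for which
is_valid_unicode returns False before sending them to a validating system.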



That's why I am a bit uncomfortable about PEP 383 being quickly 
implemented and deployed in Python 3.1.



Does the above help?


By the way, much of the detailed discussion about what Tahoe requires 
and how that may or may not benefit from PEP 383 has now moved to the 
tahoe-dev mailing list: 
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .



I have no background with Tahoe, nor particular interest, although it 
sounds like a useful project... so I won't be joining that list.  I have 
no idea if there is an installed base of existing Tahoe file systems, my 
suggestions below assume that there is not, and that you are presently 
inventing them.  Therefore, I provide no migration path, although I 
could invent one, but it would take longer to describe.


However, since I'm responding here, and have read what you have posted 
here, it seems like the following could be true.


Assumptions from your emails:

A) Tahoe wants to provide a UTF-8 file name system
B) Tahoe wants to interface to POSIX systems that use (and do not 
validate) byte interfaces.
C) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
name interfaces, with no validation.
D) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
name interfaces, with validation.


Uncertainties: I'm not clear on what your goals are for Tahoe filenames. 
 There seem to be 2 possibilities:


1) you want to reject attempts to use non-validating Unicode, be it from 
a 16-bit interface, or a bytes interface.
2) you don't want to reject non-validating Unicode, but you want to 
convert it to valid Unicode for (D) systems.


3) Orthogonally, you might want to store only Valid Unicode in the 
names, or you might not care, if you can meet the other goals.


Truisms:

If you want to support (D), and (2), then you must transform names at 
some point, using some scheme, because not all names supplied by (B) 
systems will be acceptable to (D) systems.  You can choose to do this 
transformation when a (B) system provides an invalid (per Unicode) name, 
or you can choose to do the transformation when a (D) system accesses a 
file with an invalid (per Unicode) name.


If the (B) and (D) systems talk to each other outside of Tahoe, they 
will have to do similar transf

Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
>>> The name "utf8b" suggested in the PEP is not in line with the codec
>>> design
>> Where is that design documented, and how exactly violates the name
>> the design (chapter and verse, please).
> 
> Martin, I designed the whole Python codec machinery

Not true. PEP 293 was written and designed by Walter Dörwald.

> so even if
> this is not explicitly written down somewhere, you can take my
> word for it.

If the design was specified in writing somewhere, I would probably
challenge it as obsolete. If it isn't described anywhere, I'll have
to ignore it.

> I want to avoid any such confusion with Python codecs and don't
> understand why you are making a problem out of this.

Because utf8b (or, perhaps "UTF-8b") is the official name for this
algorithm:

http://hyperreal.org/~est/utf-8b/

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> I'm sorry for the lack of clarity of my posts, but somehow you're
> completely missing the point.  The point is precisely that Python
> *won't* use Shift JIS as the file system encoding (if it did there
> would be no problem with reading Shift JIS), but the people who
> created the media *did*.
> 
> Now, with Python's file system encoding == UTF-8 or any packed EUC,
> and more than a handful of Shift JIS or Big5 characters in file names,
> one is *almost certain* to encounter ASCII as the second byte of a
> multibyte sequence.  PEP 383 can't handle this

Not true. PEP 383 handles this very example just fine, with no problems
that I can see. Can you propose a specific example that you think might
cause problems? By "specific", I mean: what file names (exact bytes,
please), what locale charset, what API calls.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Judging by the existing names, I think that 'surrogate' would be
> reasonable

MAL's list of existing names is incomplete. "surrogates" is already
an existing name, also, and it means something different (similar,
but different).

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Terry Reedy wrote:
> Glenn Linderman wrote:
>> On approximately 5/6/2009 3:08 AM, came the following characters from
>> the keyboard of MRAB:
>>> M.-A. Lemburg wrote:
>>>> Martin v. Löwis wrote:
>>
>>> Judging by the existing names, I think that 'surrogate' would be
>>> reasonable. It already contains the meaning of substitute, it's not too
>>> long, and the codes which act as replacements are already called
>>> surrogates.
>>>
>>>> I want to avoid any such confusion with Python codecs and don't
>>>> understand why you are making a problem out of this.
>>
>>
>> +1 for "surrogate" as the name for the error handler.
>>
>>
> +1 from me also

Despite there being also an error handler called "surrogates".

Are you serious?

Regards,
Martin



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Is it only usable with utf8 as an encoding?

No, it applies to any codec which potentially cannot decode
all bytes >127.
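
A short sketch of that codec independence, assuming Python 3 as eventually
released, where the handler is spelled "surrogateescape" rather than "utf8b".
The byte 0xA5 is chosen because it is undecodable in all three codecs used
here (it is unassigned in ISO 8859-3).

```python
raw = b"foo\xa5bar"

results = {}
for codec in ("ascii", "utf-8", "iso-8859-3"):
    # The undecodable byte 0xA5 becomes the lone surrogate U+DCA5.
    text = raw.decode(codec, "surrogateescape")
    # Encoding with the same codec and handler restores the original bytes.
    results[codec] = (text, text.encode(codec, "surrogateescape"))
```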

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Martin v. Löwis writes:
> 
> Despite there being also an error handler called "surrogates".

People, perhaps we could end all the bikeshedding and call one of those handlers
"surrogates-pass" and the other "surrogates-escape", which sounds quite faithful
to what they actually /do/?

Regards

Antoine.




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> But first, it should be stopped by any of several
> standard precautions.  For example, applying os.path.realpath (come to
> think of it, PEP 383 should say something about realpath, shouldn't
> it?)

Why do you think so? I think the existing documentation of realpath
is correct and complete.

> and os.path.normpath (PEP 383 should definitely say something
> about this function

Precisely what?

> maybe PEP 3131 should, too)

How can this be of relevance?

>  > Nothing is lost at the moment.
> 
> Nothing is lost compared to 'strict', true, but under the PEP as it is
> a large fraction of Shift JIS and Big5 filenames cannot be read under
> ASCII-compatible file system encodings using 'utf8b'.  Yet it is those
> users who are placed at risk by PEP 383.

I think this statement is incorrect. Those filenames *can* be read just
fine.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Antoine Pitrou wrote:
> Martin v. Löwis writes:
>> Despite there being also an error handler called "surrogates".
> 
> People, perhaps we could end all the bikeshedding and call one of those handlers
> "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful
> to what they actually /do/?

The problem with these bike-shedding discussions is that you cannot stop
them with a proposal. People will counter-propose.

I would be willing to accept a ruling from someone who a) is a native
speaker of English, and b) has demonstrated to fully understand what
these do, and c) has understood why I insist on calling it utf8b.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy

Martin v. Löwis wrote:


+1 for "surrogate" as the name for the error handler.



+1 from me also


Despite there being also an error handler called "surrogates".


Given that additional information which MAL apparently omitted, I would 
revise.



Are you serious?


Are you? ;-?  You are the one naming a codec-agnostic error handler (if 
I understand correctly, and correct me if I do not) after a particular 
codec, and denying that that could cause confusion.  See other message.


Terry Jan Reedy





Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy

Martin v. Löwis wrote:


Because utf8b (or, perhaps "UTF-8b") is the official name for this
algorithm:
http://hyperreal.org/~est/utf-8b/


Thank you for the link.  It starts:
"This directory contains a C implementation of a UTF-8b codec.
A Python codec based on it is provided as well."

'UTF-8b' consists, obviously, of 'UTF-8' plus 'b', with the 'b' signifying 
a variation of or addition to UTF-8.  The 'b', and only the 'b', refers 
to the innovative error-handler that was added to the existing 'UTF-8' 
codec/algorithm.  The name of the combined whole is not the name of the 
part.


If you were incorporating the Python-wrapped utf-8b *codec* as a codec, 
which is what I once thought *because you used that name*, then calling 
it 'utf-8b' would be fine.  But you apparently instead proposed and 
implemented an *error-handler*, which seems to me to be something else, 
and which will not be specific to utf-8 but usable with any codec. 
Hence some of us think it should have a different name.


I gather that you lifted the error-handler part of the algorithm and 
propose to use it with *any* ascii-respecting codec.  I could claim that 
the 'official name' of that part is 'b', but I think we can find a 
better name.


Terry Jan Reedy




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Paul Moore
2009/5/6 Antoine Pitrou:
> Martin v. Löwis writes:
>>
>> Despite there being also an error handler called "surrogates".
>
> People, perhaps we could end all the bikeshedding and call one of those handlers
> "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful
> to what they actually /do/?

We could also stop the bikeshedding by sticking with the name utf8b.
Martin's comment that it is the official name for this algorithm seems
compelling to me (even if it is confusing because of its similarity
with utf-8).

Paul.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy

Martin v. Löwis wrote:

Antoine Pitrou wrote:

Martin v. Löwis writes:

Despite there being also an error handler called "surrogates".

People, perhaps we could end all the bikeshedding and call one of those handlers
"surrogates-pass" and the other "surrogates-escape", which sounds quite faithful
to what they actually /do/?


The problem with these bike-shedding discussions is that you cannot stop
them with a proposal. People will counter-propose.

I would be willing to accept a ruling from someone who a) is a native
speaker of English, and b) has demonstrated to fully understand what
these do, and c) has understood why I insist on calling it utf8b.


I qualify with a). I believe I understand c) but, as explained in my 
other post, I do not think your reason applies.  In fact, I think 
concern for naming rights might suggest that you *not* reuse the name 
for something different.  I would have to learn more about the existing 
'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'. 
'Surrogates-escape' is pretty good for the new handler since, to my 
understanding, it 'escapes' 'bad bytes' by prefixing them with bits that 
push them to the surrogates plane.


I have been supportive of the idea and, as well as I understood them, 
the particulars of your proposal, from the beginning.  Reusing the name 
of a codec as the name of an error-handler confused me and I believe it 
will confuse others, even though, but also because, the error handler 
was extracted and generalized from the codec.


Terry Jan Reedy




Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
>> Are you serious?
> 
> Are you? ;-?  You are the one naming a codec-agnostic error handler (if
> I understand correctly, and correct me if I do not) after a particular
> codec, and denying that that could cause confusion.  See other message.

I can only repeat what I said before: I call it utf8b because that's
the established name for the algorithm it implements.

That algorithm was originally designed with UTF-8 in mind (and only
meant to be applied for UTF-8), however, it remains the same algorithm
even though PEP 383 widens its application.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread MRAB

Antoine Pitrou wrote:

Martin v. Löwis writes:

Despite there being also an error handler called "surrogates".


People, perhaps we could end all the bikeshedding and call one of those handlers
"surrogates-pass" and the other "surrogates-escape", which sounds quite faithful
to what they actually /do/?


After having read about the existing error handler called "surrogates"
and having thought about it, I've decided that calling one just
"surrogates" isn't very helpful to the user; it has something to do with
surrogates, but what?

So +1 for Antoine's suggestion from me.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> I qualify with a). I believe I understand c) but, as explained in my
> other post, I do not think your reason applies.  In fact, I think
> concern for naming rights might suggest that you *not* reuse the name
> for something different.  I would have to learn more about the existing
> 'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'.
> 'Surrogates-escape' is pretty good for the new handler since, to my
> understanding, it 'escapes' 'bad bytes' by prefixing them with bits that
> push them to the surrogates plane.

See issue 3672. In essence, in python 2.5:

py> u"\ud800".encode("utf-8")
'\xed\xa0\x80'
py> '\xed\xa0\x80'.decode("utf-8")
u'\ud800'

In 3.1,

py> "\ud800".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
py> "\ud800".encode("utf-8","surrogates")
b'\xed\xa0\x80'
py> b'\xed\xa0\x80'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
illegal encoding
py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
'\ud800'
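
For readers trying this on a released Python 3: the handler shown above as
"surrogates" shipped under the name "surrogatepass", and the handler called
"utf8b" in this thread shipped as "surrogateescape". A minimal sketch of how
the two differ on the same three bytes, assuming those released names:

```python
lone = "\ud800"

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass: the surrogate code point itself is written as three bytes.
passed = lone.encode("utf-8", "surrogatepass")
passed_back = passed.decode("utf-8", "surrogatepass")

# surrogateescape: the same three bytes are treated as three undecodable
# bytes, each escaped to its own lone surrogate in U+DC80..U+DCFF.
escaped = passed.decode("utf-8", "surrogateescape")
escaped_back = escaped.encode("utf-8", "surrogateescape")
```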

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Antoine Pitrou
Martin v. Löwis writes:
> py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
> '\ud800'

The point is, "surrogates" does not mean anything intuitive for an /error
handler/. You seem to be the only one who finds this name explicit enough,
perhaps because you chose it.
Most other handlers' names have verbs in them ("ignore", "replace",
"xmlcharrefreplace", etc.).

Regards

Antoine.




Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath

2009-05-06 Thread Mark Hammond

Eric Smith wrote:

Mark: I've reviewed this and it looks okay to me.


Thanks Eric - I've now applied that patch.  As you mentioned in a 
followup to the bug:


| Thanks for looking at this, Mark. If we could only assign issues to
| Python 3.2 and 3.3 to change the pending deprecation warning to a real
| one, and to remove the function entirely, we'd be all set! I'm always
| worried we'll forget these things.

(for reference; the patch introduces a PendingDeprecationWarning for 
ntpath.uncpath)


The bug tracker doesn't have these future versions available yet - is 
there some other way these things should be tracked?  I fear simply 
opening a new bug without a reasonable 'trigger' will linger way beyond 
the next few versions...


Thanks,

Mark



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Michael Urman
On Wed, May 6, 2009 at 15:42, "Martin v. Löwis" wrote:
> Despite there being also an error handler called "surrogates".

Not that I have to be, but I'm not sold on the previous UTF-8 codec
behavior becoming an error handler of the name "surrogates" for two
reasons (I do respect the obvious PBP argument for the implementation,
and have no better name - "lenient"?).

First, unless there's a way to stack error handlers, there's no way to
access the old behavior combined with the "replace" handler. Second,
errors="surrogates" reads like surrogates should be an error, not an
additionally allowed pattern. Neither of these are deal breakers or
hard to learn, but they are non-obvious. I think the utf8b behavior
makes a lot more sense with the name "surrogates", through the
mnemonic that errors become surrogates.

The stacking argument also applies to the new utf8b behavior on encode
(only, as it handles all errors on decode). This may be a YAGNI, but
for a non-UTF-8 encode, it may be useful to allow "xmlcharrefreplace"
handling for unavailable non-surrogate-escaped characters. But without
stacking that's unmaintainable, as we clearly don't want ${codec}b for
all current codecs.

I'd be perfectly happy with utf8b or UTF-8b, as either a codec or an
error handler (do we want both? YAGNI?). So what if it smells a little
inaccurate as a handler when used with codecs other than UTF-8, no big
deal. I could also see something like errors="roundtrip" which
explains the intention of the handler rather than the algorithm, but
is awkward on encode when it encounters unavailable Unicode
characters.
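
Although handlers cannot be stacked by name, the combination described above
can be approximated with a custom handler registered through
codecs.register_error. The following is only a sketch: the handler name
"escape-then-xmlref" and the function escape_then_xmlref are invented for
this example, and it assumes a released Python 3 where the utf8b behavior is
available as "surrogateescape".

```python
import codecs

_escape = codecs.lookup_error("surrogateescape")


def escape_then_xmlref(exc):
    """Restore escaped bytes; fall back to &#N; for other unencodable chars."""
    if not isinstance(exc, UnicodeEncodeError):
        # On decode, behave exactly like surrogateescape.
        return _escape(exc)
    pieces = []
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            pieces.append(bytes([cp - 0xDC00]))  # original byte comes back
        else:
            pieces.append(("&#%d;" % cp).encode("ascii"))
    # Return the replacement bytes and the position at which to resume.
    return b"".join(pieces), exc.end


codecs.register_error("escape-then-xmlref", escape_then_xmlref)

encoded = "caf\udce9 \u20ac".encode("latin-1", "escape-then-xmlref")
decoded = b"caf\xe9".decode("ascii", "escape-then-xmlref")
```

Handling the failing run character by character matters here: an encoder may
report an escaped byte and an ordinary unencodable character in one error
range, and delegating the whole range to either built-in handler would then
mistreat one of them.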

-- 
Michael Urman


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread M.-A. Lemburg
Martin v. Löwis wrote:
>>>> The name "utf8b" suggested in the PEP is not in line with the codec
>>>> design
>>> Where is that design documented, and how exactly violates the name
>>> the design (chapter and verse, please).
>> Martin, I designed the whole Python codec machinery
> 
> Not true. PEP 293 was written and designed by Walter Dörwald.

Walter added the generic error handler callback mechanism and
we both worked on their design.

I designed and wrote the codec implementation back in 2000,
which included the whole idea of having codec error handlers in the
first place.

The original implementation only allowed per-codec
error handlers. Walter extended this to build general-purpose
handlers that could be used by many codecs. His original
motivation was to be able to do XML character reference
escaping.

If you don't believe me, go look this up in the repository, the
mailing list archives and the trackers.

>> so even if
>> this is not explicitly written down somewhere, you can take my
>> word for it.
> 
> If the design was specified in writing somewhere, I would probably
> challenge it as obsolete. If it isn't described anywhere, I'll have
> to ignore it.

Ah, lovely attitude.

>> I want to avoid any such confusion with Python codecs and don't
>> understand why you are making a problem out of this.
> 
> Because utf8b (or, perhaps "UTF-8b") is the official name for this
> algorithm:
> 
> http://hyperreal.org/~est/utf-8b/

That's a codec implementing the escaping idea proposed by Markus
Kuhn, not an official reference. AFAIK, the term "UTF-8B" originated
from a "UTF-8 + binary" codec written for iconv:

http://mail.nl.linux.org/linux-utf8/2006-04/msg2.html

If it were the official name of an escape algorithm, as you are
suggesting, the inventor Markus Kuhn would probably have chosen
it, but he hasn't... the only reference to it is an email where it
is described as option D for ways of dealing with malformed
UTF-8 data in a decoder:

http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Note that this escape method is not applicable for data that
you decode from UTF-8 and then e.g. encode as Latin-1. It only
works as general purpose method if you are decoding and encoding
using the same codec, since it is specifically designed to
assure round-trip safety.
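
A small sketch of this limitation, assuming a released Python 3 where the
escape handler is spelled "surrogateescape": text that carries an escaped
byte cannot be encoded by a different codec in strict mode, while encoding
with the original codec and the same handler restores the input.

```python
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8

text = raw.decode("utf-8", "surrogateescape")

# Strict encoding with another codec fails on the escaped byte.
try:
    text.encode("latin-1")
    cross_codec_ok = True
except UnicodeEncodeError:
    cross_codec_ok = False

# The same codec with the same handler round-trips.
same_codec = text.encode("utf-8", "surrogateescape")
```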

Martin, please stop being silly and just change the name.

Or drop the idea of using an error handler altogether and just let
people use the utf-8b codec you referenced above to solve their
problems wherever and if needed.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



[Python-Dev] test - please ignore

2009-05-06 Thread Benjamin Peterson
Some of my messages appear not to have gotten through.
-- 
Regards,
Benjamin


[Python-Dev] [RELEASED] Python 3.1 beta 1

2009-05-06 Thread Benjamin Peterson
On behalf of the Python development team, I'm thrilled to announce the first and
only beta release of Python 3.1.

Python 3.1 focuses on the stabilization and optimization of features and changes
Python 3.0 introduced.  For example, the new I/O system has been rewritten in C
for speed.  File system APIs that use unicode strings now handle paths with
undecodable bytes in them. [1] Other features include an ordered dictionary
implementation and support for ttk Tile in Tkinter.  For a more extensive list
of changes in 3.1, see http://doc.python.org/dev/py3k/whatsnew/3.1.html or
Misc/NEWS in the Python distribution.

Please note that this is a beta release, and as such is not suitable for
production environments.  We continue to strive for a high degree of quality,
but there are still some known problems and the feature sets have not been
finalized.  This beta is being released to solicit feedback and hopefully
discover bugs, as well as allowing you to determine how changes in 3.1 might
impact you.  If you find things broken or incorrect, please submit a bug report
at

http://bugs.python.org

For more information and downloadable distributions, see the Python 3.1 website:

http://www.python.org/download/releases/3.1/

See PEP 375 for release schedule details:

http://www.python.org/dev/peps/pep-0375/



Enjoy,
-- Benjamin

Benjamin Peterson
benjamin at python.org
Release Manager
(on behalf of the entire python-dev team and 3.1's contributors)


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > > Now, with Python's file system encoding == UTF-8 or any packed EUC,
 > > and more than a handful of Shift JIS or Big5 characters in file names,
 > > one is *almost certain* to encounter ASCII as the second byte of a
 > > multibyte sequence.  PEP 383 can't handle this

Ah, I see.  Of course, the algorithm not only has to handle the ASCII
octet which is erroneous because it can't be a trailing byte, but
*also the leading byte that signalled to expect a trailing byte >127*.
So the algorithm backs up to the character boundary (which is
well-defined for all the "sane" encodings), encodes the high byte(s) in
the character with lone surrogates, and encodes the ASCII as itself
(promoted to a Unicode code point).
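
As a rough illustration of the back-up-and-escape behaviour described above, here is a minimal sketch. It assumes a released Python 3 (3.1 or later), where the PEP 383 handler discussed in this thread as "utf8b" is available under the name "surrogateescape". It uses UTF-8 rather than Shift JIS or Big5 only to keep the example small, and the `decode_lossless` helper name is made up for this sketch.

```python
def decode_lossless(raw, encoding="utf-8"):
    """Decode raw bytes, escaping undecodable bytes as lone surrogates."""
    # "surrogateescape" is the released name of the PEP 383 handler (assumption
    # stated in the lead-in: the thread still calls it "utf8b").
    return raw.decode(encoding, "surrogateescape")


# 0xE3 announces a three-byte UTF-8 sequence, but an ASCII slash follows.
# The lead byte is escaped as a lone surrogate; the slash decodes as itself.
text = decode_lossless(b"\xe3/")
print(ascii(text))

# Both bytes of an incomplete sequence are escaped; the ASCII byte is not.
text2 = decode_lossless(b"\xe3\x81/")
print(ascii(text2))

# Encoding with the same handler restores the original bytes.
restored = text2.encode("utf-8", "surrogateescape")
```

The point of the sketch is only that the ASCII byte survives as an ordinary character while every byte above 127 in the broken sequence is preserved as a lone surrogate.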

Sorry, you're right, I was just confused.  I withdraw the objection as
completely mistaken, and apologize for not thinking more carefully in
the first place.


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Terry Reedy

Martin v. Löwis wrote:

> Are you serious?

Are you? ;-?  You are the one naming a codec-agnostic error handler (if
I understand correctly, and correct me if I do not) after a particular
codec, and denying that that could cause confusion.  See other message.

> I can only repeat what I said before: I call it

What, specifically, is 'it'?

> utf8b because that's
> the established name for the algorithm

Which algorithm?

> it implements.

Again, what is 'it'?

As *I* read the sentence above, it is not true.

I went to the site you referred to as the source of your reasoning and 
specifically

http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/utf_8b.c

The algorithm called utf-8b *IS* utf-8 with the addition or replacement 
(of an error return) of essentially one line in each direction:


# encode
if 0xDC00 <= codepoint <= 0xDCFF:
    byte = codepoint - 0xDC00  # encode

Note: for security reasons, you are raising the lower limit to 
0xDC80. The comment at the top of utf_8b.c suggests that that is 
what it should be, and should have been in the file, with the other half 
of that surrogate area being an error along with the other surrogate area.


# decode
if (0x80 <= byte <= 0xFF) and utf_8_invalid(byte):
    codepoint = byte + 0xDC00  # decode
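
The two one-line rules above can be sketched as a Python error handler. This is only an illustration under stated assumptions: the handler name "demo_escape" and the function are invented here, the lower limit is taken as 0xDC80 as in the note above, and it is neither the code from utf_8b.c nor the PEP 383 implementation.

```python
import codecs


def demo_escape(exc):
    """Hypothetical handler: map bad bytes to U+DC80..U+DCFF and back."""
    if isinstance(exc, UnicodeDecodeError):
        escaped = []
        for byte in exc.object[exc.start:exc.end]:
            if byte < 0x80:
                break  # never escape ASCII bytes
            escaped.append(chr(0xDC00 + byte))  # decode rule
        if not escaped:
            raise exc
        # Resume decoding right after the bytes that were escaped.
        return "".join(escaped), exc.start + len(escaped)
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for char in exc.object[exc.start:exc.end]:
            codepoint = ord(char)
            if not 0xDC80 <= codepoint <= 0xDCFF:
                raise exc  # anything else stays an error
            out.append(codepoint - 0xDC00)  # encode rule
        return bytes(out), exc.end
    raise exc


codecs.register_error("demo_escape", demo_escape)

decoded = b"a\xffb".decode("utf-8", "demo_escape")
encoded = decoded.encode("utf-8", "demo_escape")
```

Because the handler is registered separately from any codec, the same name can be passed to `decode` and `encode` for codecs other than UTF-8, which is the property the naming dispute in this thread turns on.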


> That algorithm was originally designed with UTF-8 in mind (and only
> meant to be applied for UTF-8), however, it remains the same algorithm
> even though PEP 383 widens its application.


The error handler designed with utf-8 in mind has no name in the encode 
direction and is called "utf_8b_decoder_invalid_bytes" in the decode 
direction.  By your reasoning, *that* should be its name in Python.  The 
encoding error handler would then be named analogously 
"utf_8b_encoder_invalid_codepoints".  Even these, to me, would be better 
than confusingly giving them the same name as the codec.


Terry Jan Reedy



Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 6:06 PM, came the following characters from 
the keyboard of M.-A. Lemburg:



> Martin, please stop being silly and just change the name.

Yes, please.  If indeed Marc-Andre invented the codec business as he 
claims, he would be an appropriate person to give a fiat name to the 
error handler.

> Or drop the idea of using an error handler altogether and just let
> people use the utf-8b codec you referenced above to solve their
> problems wherever and if needed.

The design as an error handler is clever in leveraging the same error 
handler for multiple codecs, which cannot be done by using utf-8b alone, 
if I understand correctly.
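
To make that point concrete, here is a small sketch. It assumes a released Python 3 in which the PEP 383 handler goes by the name "surrogateescape" (it is called "utf8b" in this thread). The `roundtrip` helper is hypothetical; the only claim is that one handler name is passed unchanged to different codecs.

```python
def roundtrip(raw, encoding):
    """Decode and re-encode raw bytes with the same error handler."""
    text = raw.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape")


# Bytes that are valid in neither ASCII nor UTF-8.
raw = b"file-\xff\xfe.txt"

# The same handler name works with each codec and preserves the bytes.
results = {encoding: roundtrip(raw, encoding) for encoding in ("ascii", "utf-8")}
```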



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
Michael Urman wrote:
> On Wed, May 6, 2009 at 15:42, "Martin v. Löwis"  wrote:
>> Despite there being also an error handler called "surrogates".
> 
> Not that I have to be, but I'm not sold on the previous UTF-8 codec
> behavior becoming an error handler of the name "surrogates" for two
> reasons (I do respect the obvious PBP argument for the implementation,
> and have no better name - "lenient"?).

PBP?

> First, unless there's a way to stack error handlers, there's no way to
> access the old behavior combined with the "replace" handler.

Well, there is a way to stack error handlers, although it's not pretty:

import codecs

_surrogates = codecs.lookup_error("surrogates")
_replace = codecs.lookup_error("replace")
def surrogates_then_replace(exc):
    try:
        return _surrogates(exc)
    except UnicodeError:
        return _replace(exc)
codecs.register_error("surrogates_then_replace",
                      surrogates_then_replace)
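
For readers on a released Python 3, here is a runnable variant of the same stacking idea. It assumes that the handler this thread calls "surrogates" is the one that shipped as "surrogatepass"; the combined handler name "surrogatepass_then_replace" is made up for the example.

```python
import codecs

# Look up the two built-in handlers once.
_surrogatepass = codecs.lookup_error("surrogatepass")
_replace = codecs.lookup_error("replace")


def surrogatepass_then_replace(exc):
    """Try surrogatepass first; fall back to replace if it gives up."""
    try:
        return _surrogatepass(exc)
    except UnicodeError:
        return _replace(exc)


codecs.register_error("surrogatepass_then_replace", surrogatepass_then_replace)

# A lone surrogate is passed through by the first handler.
encoded = "\ud800".encode("utf-8", "surrogatepass_then_replace")

# The encoded surrogate is decoded by the first handler, while the stray
# 0xFF byte is something only the fallback handler deals with.
decoded = b"\xed\xa0\x80\xff".decode("utf-8", "surrogatepass_then_replace")
```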

> The stacking argument also applies to the new utf8b behavior on encode
> (only, as it handles all errors on decode). This may be a YAGNI

Indeed - in particular because, in the primary application of this error
handler (i.e. file IO operations), there is no way of specifying
an additional error handler anyway.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> By the way, what are the ASCII characters that are not supported by 
> Shift-JIS?
> Not many I suppose? (if I read the Wikipedia entry correctly, it's only the
> backslash and the tilde).

The problem with this encoding is that bytes below 128 appear as second
bytes of a two-byte encoding:

py> "\x81@".decode("shift-jis")
u'\u3000'
py> "\x81A".decode("shift-jis")
u'\u3001'

So on decoding, it may be the second byte (i.e. the ASCII byte) that
causes a problem:

py> "\x81/".decode("shift-jis")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position
0-1: illegal multibyte sequence

For the shift-jis codec, that's actually not a problem, though:

py> b"\x81/".decode("shift-jis","utf8b")
'\udc81/'

so the utf8b error handler will escape the first of the two bytes,
and then pass the second byte to the codec again, which then decodes
as ASCII.
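
The same experiment can be written as a small script for a released Python 3. This is a sketch under one assumption: the handler called "utf8b" above is the one available as "surrogateescape" in Python 3.1 and later. The variable names are illustrative only.

```python
# 0x81 is a Shift JIS lead byte; "/" is not a valid trail byte for it.
raw = b"\x81/"

# With the default strict handling the sequence is rejected.
try:
    raw.decode("shift_jis")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the PEP 383 handler the lead byte is escaped as a lone surrogate
# and the slash is handed back to the codec, which decodes it as ASCII.
escaped = raw.decode("shift_jis", "surrogateescape")
print(ascii(escaped))

# A valid two-byte sequence with an ASCII-range trail byte still decodes.
ideographic_space = b"\x81@".decode("shift_jis")
```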

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
>> So are you proposing that I should rename the PEP 383 handler
>> to "utf_8b_encoder_invalid_codepoints"?
> 
> 
> No, he's saying that your algorithm for choosing the PEP 383 handler
> should have come up with that name, rather than utf8b.  But since PEP
> 383 applies to other codecs besides UTF-8, it should have a different
> name.  And one that is less cumbersome than
> "utf_8b_encoder_invalid_codepoints"

I'm still at a loss what name to give it, though. I understand that
I have to rename both error handlers, but I'm uncertain what I should
rename them to. So proposals that rename only one of them aren't
that helpful. It would be helpful if people would indicate support
for Antoine's proposal.

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Glenn Linderman
On approximately 5/6/2009 10:53 PM, came the following characters from 
the keyboard of Martin v. Löwis:

>> The error handler designed with utf-8 in mind has no name in the encode
>> direction and is called "utf_8b_decoder_invalid_bytes" in the decode
>> direction.  By your reasoning, *that* should be its name in Python.  The
>> encoding error handler would then be named analogously
>> "utf_8b_encoder_invalid_codepoints".  Even these, to me, would be better
>> than confusingly giving them the same name as the codec.
>
> So are you proposing that I should rename the PEP 383 handler
> to "utf_8b_encoder_invalid_codepoints"?



No, he's saying that your algorithm for choosing the PEP 383 handler 
should have come up with that name, rather than utf8b.  But since PEP 
383 applies to other codecs besides UTF-8, it should have a different 
name.  And one that is less cumbersome than 
"utf_8b_encoder_invalid_codepoints"


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> Wouldn't renaming the existing "surrogates" handler be an incompatible
> change, and thus inappropriate?

No - it's new in Python 3.1.

So what do you think about Antoine's proposal?

Regards,
Martin


Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

2009-05-06 Thread Martin v. Löwis
> The error handler designed with utf-8 in mind has no name in the encode
> direction and is called "utf_8b_decoder_invalid_bytes" in the decode
> direction.  By your reasoning, *that* should be its name in Python.  The
> encoding error handler would then be named analogously
> "utf_8b_encoder_invalid_codepoints".  Even these, to me, would be better
> than confusingly giving them the same name as the codec.

So are you proposing that I should rename the PEP 383 handler
to "utf_8b_encoder_invalid_codepoints"?

Regards,
Martin