[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2015-05-31 Thread Roundup Robot

Roundup Robot added the comment:

New changeset cf6e782a7f94 by Serhiy Storchaka in branch '2.7':
Issue #19543: Emit deprecation warning for known non-text encodings.
https://hg.python.org/cpython/rev/cf6e782a7f94

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue19619
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-03-02 Thread Georg Brandl

Georg Brandl added the comment:

This is fixed now, right?

--
resolution:  -> fixed
status: open -> closed




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-03-02 Thread Georg Brandl

Georg Brandl added the comment:

Both this backport and #20404 will make it into 3.3.5rc2.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-25 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 9975f827eefd by Serhiy Storchaka in branch '3.3':
Fix typo (issue #19619).
http://hg.python.org/cpython/rev/9975f827eefd

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-24 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +georg.brandl
priority: high -> release blocker
stage: needs patch -> patch review
versions:  -Python 3.4




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Sorry, I missed the patch.

--
Added file: 
http://bugs.python.org/file34213/issue19619_blacklist_transforms_py33.patch




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-24 Thread Georg Brandl

Georg Brandl added the comment:

That looks ok to me.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-24 Thread Nick Coghlan

Nick Coghlan added the comment:

Backporting just the encode/decode changes sounds reasonable to me, since 
open() isn't likely to be a DoS vector in this particular case.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-24 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 95386bbf9471 by Serhiy Storchaka in branch '3.3':
Issue #19619: Blacklist non-text codecs in method API
http://hg.python.org/cpython/rev/95386bbf9471

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I think that issue20404 should be backported too. It is common to wrap a socket 
stream with io.TextIOWrapper, and socket.makefile() returns a text stream by 
default.
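
A minimal sketch of the io.TextIOWrapper case, assuming an interpreter that already includes the issue20404 change (Python 3.4 or later). The in-memory BytesIO object is only a stand-in for a socket stream, and the helper name try_wrap is hypothetical.

```python
import io


def try_wrap(encoding):
    """Wrap an in-memory byte stream as text; report 'ok' or the error name."""
    raw = io.BytesIO(b"68656c6c6f")  # stand-in for a socket stream
    try:
        io.TextIOWrapper(raw, encoding=encoding)
    except LookupError:
        # Raised for non-text codecs before any data is read.
        return "LookupError"
    return "ok"


utf8_result = try_wrap("utf-8")
hex_result = try_wrap("hex")
print(utf8_result, hex_result)
```

The point of the sketch is that the codec name is rejected when the wrapper is constructed, so no transform runs on the stream contents.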

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-17 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I don't think that adding an underscored parameter to a public API is the best 
solution, but we need the fix for 3.3. So here is a patch for backporting 
d68df99d7a57 to 3.3.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Is a bug that makes DDoS attacks easy important enough? Every Python 
HTTP server, mail or news client is affected.
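
A rough illustration of the concern, assuming the codec name reaches bytes.decode() from untrusted input such as a charset header (that scenario is an assumption taken from the linked python-dev discussion, not spelled out in this message). The payload size is arbitrary.

```python
import codecs
import zlib

# A small payload that expands enormously: 10 MiB of zeros compresses to a few KiB.
original_size = 10 * 1024 * 1024
payload = zlib.compress(b"\x00" * original_size)

# The type-neutral function still performs the transform (explicit opt-in).
expanded = codecs.decode(payload, "zlib")
ratio = len(expanded) // len(payload)

# On fixed versions the method API refuses the non-text codec outright.
try:
    payload.decode("zlib")
    method_result = "decoded"
except LookupError:
    method_result = "LookupError"

print(len(payload), len(expanded), ratio, method_result)
```

With the blacklist in place, only code that deliberately calls codecs.decode() pays the decompression cost.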

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-11 Thread STINNER Victor

STINNER Victor added the comment:

Is there still something to do? The initial issue looks to be fixed.

You could open a new issue if you see more things to do.

--
nosy: +haypo




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-11 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

It isn't fixed in 3.3 yet.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-11 Thread Nick Coghlan

Nick Coghlan added the comment:

As Serhiy noted, 3.3 is still affected. On the other hand, the approach I
used for 3.4 is a pretty invasive fix, so I'm not sure it's a good idea to
implement something like that in a maintenance release.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-02-11 Thread STINNER Victor

STINNER Victor added the comment:

> It isn't fixed in 3.3 yet.

I'm not sure that the issue should be fixed in 3.3 because the patch is quite 
large. Do you consider that the bug is important enough?

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-01-10 Thread Martin Panter

Changes by Martin Panter vadmium...@gmail.com:


--
nosy: +vadmium




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2014-01-02 Thread Jakub Wilk

Changes by Jakub Wilk jw...@jwilk.net:


--
nosy: +jwilk




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-12-30 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 0e10367c88ce by Zachary Ware in branch 'default':
Issue19619: skip zlib error test when zlib not available
http://hg.python.org/cpython/rev/0e10367c88ce

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-12-18 Thread Nick Coghlan

Nick Coghlan added the comment:

Unassigning this one - I don't think the solution we used for 3.4 is 
appropriate in a maintenance release, but I'm not sure how else to resolve it.

--
assignee: ncoghlan -> 
priority: critical -> high




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-12-18 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

What about my comments in msg203841?

See http://comments.gmane.org/gmane.comp.python.devel/143943 for an example of 
how hard it is to get rid of private arguments in public functions.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-12-18 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
versions: +Python 3.4




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-12-18 Thread Nick Coghlan

Nick Coghlan added the comment:

I was planning to fix pydoc to not show private keyword only arguments.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-12-18 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

If people don't pay attention to an explicit warning not to use certain 
parameters, will the lack of documentation stop them?

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-12-12 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


--
nosy:  -haypo




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

Victor, you can propose whatever you like for 3.5, but I'm not adding new
interpreter state two days before feature freeze when we don't have to.
Looking up the private CodecInfo attribute is still O(1) anyway.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

- switched to LookupError for the dedicated exception
- default value moved to a CodecInfo class attribute
- new private APIs guarded by PY_LIMITED_API
- used repr formatting where appropriate in the tests
- cleaned up the tests a bit by using encodings.normalize_encoding
- new test to ensure the codec output type checks are still exercised
- backwards compatibility tweaks for raw tuples returned from the codec 
registry lookup (uncovered by the full test suite run)

I'll be committing this version after a final local run of make test and a 
refleak check on test_codecs, test_charmapcodec and test_unicode (the latter 
two are the ones that found the backwards compatibility issue with the 
attribute lookup).
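
As a sketch of what the switch to LookupError means in practice (assuming Python 3.4 or later): a non-text codec passed to the method API fails the same way an unknown codec name does, while the type-neutral codecs functions still work. The helper name method_decode and the made-up codec name are illustrative only.

```python
import codecs


def method_decode(data, encoding):
    """Call bytes.decode() and report the exception type name on LookupError."""
    try:
        return data.decode(encoding)
    except LookupError as exc:
        return type(exc).__name__


blocked = method_decode(b"aGVsbG8=", "base64")          # non-text codec
unknown = method_decode(b"x", "no_such_codec_for_demo")  # unknown codec name
allowed = codecs.decode(b"aGVsbG8=", "base64")          # explicit transform

try:
    "hello".encode("base64")
    encode_result = "encoded"
except LookupError:
    encode_result = "LookupError"

print(blocked, unknown, allowed, encode_result)
```

Code that already handles LookupError for unknown encodings therefore covers the blacklisted codecs without changes.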

--
Added file: 
http://bugs.python.org/file32772/issue19619_blacklist_transforms_py34_postreview.diff




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Nick: I've had a look at your second patch. A couple of notes:

 * I think you should add the flag to the constructor of the CodecInfo
   tuple subclass and then set this in the resp. codecs. The extra
   constructor class method looks too much like a hack and is
   not needed.

 * The comment in codecs.h should read:

   
   Checks the encoding against a list of codecs which do not
   implement a str<->bytes encoding before attempting the
   operation.

   Please note that these APIs are internal and should not
   be used in Python C extensions.
   

Regarding Victor's suggestion to use a separate registry dict
for this: I'm definitely -1 on this.

The black listing is a very special case only used for the
.encode()/.decode() methods and otherwise doesn't have anything to do
with the codecs sub-system. It doesn't make sense to change the design
of the registry just to implement this one special case.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 22.11.2013 12:24, Marc-Andre Lemburg wrote:
 
> Nick: I've had a look at your second patch. A couple of notes:
> 
>  * I think you should add the flag to the constructor of the CodecInfo
>    tuple subclass and then set this in the resp. codecs. The extra
>    constructor class method looks too much like a hack and is
>    not needed.

Like this:

    _is_text_encoding = True # Assume codecs are text encodings by default

    def __new__(cls, encode, decode, streamreader=None, streamwriter=None,
                incrementalencoder=None, incrementaldecoder=None, name=None,
                _is_text_encoding=None):
        self = tuple.__new__(cls, (encode, decode, streamreader, streamwriter))
        self.name = name
        self.encode = encode
        self.decode = decode
        self.incrementalencoder = incrementalencoder
        self.incrementaldecoder = incrementaldecoder
        self.streamwriter = streamwriter
        self.streamreader = streamreader
        if _is_text_encoding is not None:
            self._is_text_encoding = _is_text_encoding
        return self
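
For readers who want to see the flag on a current interpreter, here is a small sketch that reads it from the objects returned by codecs.lookup(). It assumes CPython 3.4 or later; _is_text_encoding is a private attribute and may change, which is why the hypothetical helper falls back to True when it is missing.

```python
import codecs


def is_text_encoding(name):
    """Report the private text-encoding flag for a codec (defaults to True)."""
    info = codecs.lookup(name)
    # Private attribute: not part of the documented CodecInfo API.
    return getattr(info, "_is_text_encoding", True)


flags = {
    name: is_text_encoding(name)
    for name in ("utf-8", "latin-1", "hex", "base64", "rot_13")
}
print(flags)
```

This is a plain attribute read on the looked-up CodecInfo, with no extra registry involved.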

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

- switched the private flag from being set in a class method to using a keyword 
only parameter to __init__
- updated the codecs.h comment as MAL suggested

--
Added file: 
http://bugs.python.org/file32773/issue19619_blacklist_transforms_py34_keyword_only_param.diff




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

If _is_text_encoding may change in Python 3.5, you should add a comment to warn 
users to not use it and explain its purpose, maybe with a reference to this 
issue.

--

We are talking about very few codecs:

* base64: bytes => bytes
* bz2: bytes => bytes
* hex: bytes => bytes; decode also supports ASCII string (str) => bytes
* quopri: bytes => bytes
* rot_13: str => str
* uu: bytes => bytes
* zlib: bytes => bytes

I suppose that supporting ASCII string input to the hex decoder is a side 
effect of its implementation. I don't know if it is expected *for the codec*.

If we simplify the hex decoder to reject str types, all these codecs would 
simply have one type: the same input and output type. Anyway, if you want 
something based on types, the special case for the hex decoder cannot be 
expressed with a type or an ABC. ASCII string is not a type.

So instead of _is_text_encoding=False, it could be transform=bytes or 
transform=str. (I don't care about the name: transform_type, type, codec_type, 
data_type, etc.)

I know that bytes is not exact: bytearray, memoryview and any bytes-like object 
are accepted, but it is probably enough for now.
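
The type behaviour described in the list can be checked with the type-neutral functions. This sketch covers only a subset of the codecs (bz2 and uu are left out to avoid depending on optional modules), and the helper name round_trip is hypothetical.

```python
import codecs

BYTES_TRANSFORMS = ("base64", "hex", "quopri", "zlib")


def round_trip(codec, data):
    """Encode then decode with codecs.encode/decode; return (encoded type, result)."""
    encoded = codecs.encode(data, codec)
    decoded = codecs.decode(encoded, codec)
    return type(encoded), decoded


results = {codec: round_trip(codec, b"hello world") for codec in BYTES_TRANSFORMS}

# rot_13 is the one str-to-str transform in the list.
rot13 = codecs.encode("hello", "rot_13")

# The hex decoder special case: an ASCII-only str is accepted as input.
hex_from_str = codecs.decode("6869", "hex")

print(results, rot13, hex_from_str)
```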

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 22.11.2013 12:43, STINNER Victor wrote:
 
> STINNER Victor added the comment:
> 
> If _is_text_encoding may change in Python 3.5, you should add a comment to 
> warn users to not use it and explain its purpose, maybe with a reference to 
> this issue.

+1

> --
> 
> We are talking about very few codecs:
> 
> * base64: bytes => bytes
> * bz2: bytes => bytes
> * hex: bytes => bytes; decode also supports ASCII string (str) => bytes
> * quopri: bytes => bytes
> * rot_13: str => str
> * uu: bytes => bytes
> * zlib: bytes => bytes
> 
> I suppose that supporting ASCII string input to the hex decoder is a side 
> effect of its implementation. I don't know if it is expected *for the codec*.
> 
> If we simplify the hex decoder to reject str types, all these codecs would 
> simply have one type: the same input and output type. Anyway, if you want 
> something based on types, the special case for the hex decoder cannot be 
> expressed with a type or an ABC. ASCII string is not a type.
> 
> So instead of _is_text_encoding=False, it could be transform=bytes or 
> transform=str. (I don't care about the name: transform_type, type, codec_type, 
> data_type, etc.)
> 
> I know that bytes is not exact: bytearray, memoryview and any bytes-like 
> object are accepted, but it is probably enough for now.

I think it's better to go with something that's explicitly internal
now than to fix a public API in form of a constructor parameter
this late in the release process.

For 3.5 it may make sense to declare a few codec feature flags which
would then make lookups such as the one done for the blacklist easier
to implement and faster to check as well.

Such flags could provide introspection at a higher level than what
would be possible with type mappings (even though I still like the
idea of adding those to CodecInfo at some point).

One possible use for such flags would be to declare whether a
codec is reversible or not - in other words, whether .decode(.encode(x))
works for all possible inputs x. This flag could then be used to
quickly check whether a codec would fail on a Unicode str which
has non-Latin-1 code points or to create a list of valid encodings
for certain applications, e.g. a list which only contains reversible
Unicode encodings such as the UTF ones.
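
No such flag exists in this thread's patch. As a rough sketch of the property it would declare, the check below tests one sample string per codec rather than all possible inputs, so it can only show that a codec is not reversible. The function name round_trips is hypothetical.

```python
def round_trips(text, encoding):
    """Return True if text survives encode followed by decode with this codec."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The codec cannot represent some code point in the sample.
        return False


# Sample with a Latin-1 character and two code points outside Latin-1.
sample = "caf\u00e9 \u4e2d\u6587"
report = {
    enc: round_trips(sample, enc)
    for enc in ("utf-8", "utf-16", "latin-1", "ascii")
}
print(report)
```

A declared flag would let callers skip this kind of probing and filter codecs up front.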

Anyway: Thanks to Nick for implementing this, to Serhiy for the black
list idea and Victor for the attribute idea :-)

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

2013/11/22 Marc-Andre Lemburg rep...@bugs.python.org:
> Anyway: Thanks to Nick for implementing this, to Serhiy for the black
> list idea and Victor for the attribute idea :-)

In fact, the attribute idea comes from you :-)
http://bugs.python.org/issue7475#msg96374

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

It turns out the codec cache and the refleak hunting mechanism in regrtest 
*really* don't like each other if you use closures to implement your test 
codecs :)

Anyway, the attached patch tweaks the tests to handle refleak hunting (and 
running the refleak hunter indicates there aren't any leaks). I'll be 
committing this one immediately after posting this.

And like MAL, I'm keen to stick with a fully private API at this point, as I 
believe we need to get the status quo clearly documented before we can start 
talking about making changes to the public APIs.

I think both this issue and issue 7475 show that until we have a common 
understanding amongst the core development team of what's possible now, it's 
hard to discuss:

- what's currently possible that we would like to deprecate? (opinions differ 
strongly on this one - MAL and I think the status quo is mostly OK, but could 
use some better type inference support, Victor would like to see support for 
anything beyond text encodings, binary transforms and text transforms 
deprecated)
- what's not possible that we would like to make possible? (e.g. removing 
cached codecs, removing search functions, reordering search functions, type 
inference through the codec machinery)
- what's already possible that we would like to make easier? (e.g. registering 
new codecs)

--
Added file: 
http://bugs.python.org/file32774/issue19619_blacklist_transforms_py34_refleak_hunting.diff




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

Well, that patch minus the debugging print that was still in there :)

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Roundup Robot

Roundup Robot added the comment:

New changeset d68df99d7a57 by Nick Coghlan in branch 'default':
Issue #19619: Blacklist non-text codecs in method API
http://hg.python.org/cpython/rev/d68df99d7a57

--
nosy: +python-dev




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

3.4 has been dealt with. A similar functional patch should work in 3.3, but the 
test changes will be different (since those test cases don't exist yet in 3.3, 
and the wrapping ones aren't relevant at all).

--
versions:  -Python 3.4




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Roundup Robot

Roundup Robot added the comment:

New changeset 1e3b8601b098 by Nick Coghlan in branch 'default':
Issue #19619: Update What's New for codec blacklist
http://hg.python.org/cpython/rev/1e3b8601b098

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

The initial concern was a denial of service. Nick, can you backport your 
changeset to Python 3.3?

> This issue should avoid the denial of service attack when a compression codec 
> is used, see:
> https://mail.python.org/pipermail/python-dev/2013-November/130188.html

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

Can we now remove wrap_codec_error()? And maybe more of the changes that were 
made to work around issues with non-Unicode codecs. bytes.decode/str.encode 
should no longer raise a TypeError with codecs of the Python standard library.

Related changesets:


changeset:   87267:04e1f701aeaa
user:        Nick Coghlan ncogh...@gmail.com
date:        Tue Nov 19 22:33:10 2013 +1000
files:       Lib/test/test_codecs.py Objects/exceptions.c
description:
Also chain codec exceptions that allow weakrefs

The zlib and hex codecs throw custom exception types with
weakref support if the input type is valid, but the data
fails validation. Make sure the exception chaining in the
codec infrastructure can wrap those as well.


changeset:   87109:4ea622c085ca
user:        Nick Coghlan ncogh...@gmail.com
date:        Fri Nov 15 21:47:37 2013 +1000
files:       Lib/test/test_codecs.py Python/codecs.c
description:
Close 19609: narrow scope of codec exc chaining


changeset:   87084:854a2cea31b9
user:        Nick Coghlan ncogh...@gmail.com
date:        Wed Nov 13 23:49:21 2013 +1000
files:       Doc/whatsnew/3.4.rst Include/pyerrors.h Lib/test/test_codecs.py Misc/NEWS Objects/exceptions.c Objects/unicodeobject.c Python/codecs.c
description:
Close #17828: better handling of codec errors

- output type errors now redirect users to the type-neutral
  convenience functions in the codecs module
- stateless errors that occur during encoding and decoding
  will now be automatically wrapped in exceptions that give
  the name of the codec involved

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

No, we can't remove wrap_codec_error, as it is still needed in order
to cover direct use of the codec machinery and to handle non-text
codecs registered by third parties (the tests cover both these cases).
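
A sketch of the third-party case, assuming Python 3.4 or later. The codec name revbytesdemo and the helper functions are made up for illustration. Because this codec does not set the private flag, the method API only notices the problem after the decoder has run and returned a non-str result.

```python
import codecs


def _reverse(data, errors="strict"):
    """bytes -> bytes transform; returns (output, number of input bytes consumed)."""
    raw = bytes(data)  # the method API may pass a memoryview
    return raw[::-1], len(raw)


def _search(name):
    """Codec search function for the hypothetical 'revbytesdemo' codec."""
    if name == "revbytesdemo":
        return codecs.CodecInfo(encode=_reverse, decode=_reverse, name="revbytesdemo")
    return None


codecs.register(_search)

# Direct use of the codec machinery works for arbitrary types.
direct = codecs.decode(b"abc", "revbytesdemo")

# The method API runs the decoder, then rejects the non-str output.
try:
    b"abc".decode("revbytesdemo")
    method_error = None
except TypeError as exc:
    method_error = type(exc).__name__

print(direct, method_error)
```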

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue19619
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

> No, we can't remove wrap_codec_error, as it is still needed in order
> to cover direct use of the codec machinery and to handle non-text
> codecs registered by third parties (the tests cover both these cases)

I searched the web for third-party codecs and only found Unicode encodings 
(str <=> bytes).

I don't think that we need all these tricks to provide more information on 
transform codecs, and transform codecs are not used.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

Victor, the general purpose codec infrastructure is more than a decade
old, and supported in both Python 2 and Python 3, so you're not going
to get it deprecated in the last few days before the 3.4 feature
freeze. You've already succeeded in inconveniencing affected users
migrating from Python 2 for another release by blocking the
restoration of the transform codec aliases, but I'm definitely not
going to revert any of the other already implemented codec handling
improvements without a direct request from Larry as release manager or
Guido as BDFL.

If you propose a new codec architecture as a PEP for Python 3.5 and
get it accepted, then *that* would be the appropriate time to remove
these improvements to the existing architecture. Until such a PEP is
put forward and accepted, I will continue to work on documenting the
status quo as clearly as I can (especially since the only thing I see
wrong with it is the challenges it poses for type inference, and
that's a pretty minor gripe in a language as resistant to static
analysis as Python).

I've tried to persuade you that lowering the barriers to adoption for
Python 3 is a more significant concern than a mythical nirvana of
conceptual purity that *runs directly counter to the stated intent of
the creator of the current codec architecture*, but if you wish to
exercise your core developer veto and deliberately inconvenience
users, even though the original problems cited in issue 7475 have all
been addressed, that's your choice. Just don't expect me to try to
defend that decision to any users that complain, because I think it's
completely the wrong thing to do.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Victor, please accept the fact that the codec sub-system in Python
doesn't only have the Unicode implementation as target. It's
true that most codecs were added for the Unicode implementation,
but I deliberately designed the system to be open for other
purposes such as encoding/decoding between different representations
of data as well.

The hex/base64 and compression codecs are examples of such
representations, but there are many other uses as well, e.g.
escaping of data in various ways, serialization of objects,
direct conversions between encoded data (à la recode), etc.

Python's history is full of cases where we've opened up its
functionality to new concepts and designs. If you want to propose
to remove the openness in the codec system for some perceived idea
of purity, then you will need to come up with very good arguments -
not only to convince me, but also to convince the Python users
at large :-)

I would much rather like to see the openness of the system used
more in the stdlib and have it developed further to make it easier
to use.
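
For illustration, a minimal sketch of the non-Unicode uses described above, going through the general-purpose codecs.encode()/codecs.decode() functions. The long codec names (hex_codec, base64_codec, zlib_codec, rot_13) are used on the assumption that the short aliases may not be available on every 3.x release; the payload is made up for the example.

```python
import codecs

payload = b"hello world"

# Binary transforms: bytes-like object in, bytes out.
as_hex = codecs.encode(payload, "hex_codec")
as_base64 = codecs.encode(payload, "base64_codec")
as_zlib = codecs.encode(payload, "zlib_codec")

# Text transform: str in, str out.
as_rot13 = codecs.encode("hello world", "rot_13")

# Every transform round-trips through codecs.decode().
round_trips = [
    codecs.decode(as_hex, "hex_codec"),
    codecs.decode(as_base64, "base64_codec"),
    codecs.decode(as_zlib, "zlib_codec"),
]
```

None of these codecs maps str to bytes, which is why they do not fit str.encode() and bytes.decode().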

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> * I think you should add the flag to the constructor of the CodecInfo
>   tuple subclass and then set this in the resp. codecs. The extra
>   constructor class method looks too much like a hack and is
>   not needed.

I think that adding a new keyword argument to a public function is much worse
than adding a new private class method or private module-level function.
Arguments are always visible, while private methods and functions are usually
hidden. A new argument will confuse users.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

> Victor, please accept the fact that the codec sub-system in Python
> doesn't only have the Unicode implementation as target. It's
> true that most codecs were added for the Unicode implementation,
> but I deliberately designed the system to be open for other
> purposes such as encoding/decoding between different representations
> of data as well.

I was completely rejecting transform codecs, but I changed my mind. I'm trying
to accept that codecs.encode/decode functions were present from the beginning
and that they should be functions :-)

My request to remove the extra code in the exception handling was specific to
pure Unicode encodings (like UTF-8). The code can be kept for
codecs.encode/decode.

My request should only affect misused third-party transform codecs. You would
just get as much info as you get with Python 3.3.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

With blacklisted transform codecs, I'm fine with the idea of restoring codecs 
aliases for transform codecs in Python 3.4. Go ahead Nick.

--

For Python 3.5, a better solution should be found to declare transform codecs.

And I would also like to add transform()/untransform() methods on the bytes
and str types. So you would have 4 APIs:

* Unicode text codecs: str.encode/bytes.decode, str<=>bytes
* bytes transform codecs: bytes.transform/untransform, bytes-like object=>bytes
* Unicode transform codecs: str.transform/untransform, str=>str
* all codecs: codecs.encode/codecs.decode, something=>something else

But only a few developers (only me?) are interested in transform/untransform,
so codecs.encode/codecs.decode might be enough.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

Yay, thanks Victor!

Regarding UTF-8 et al, the existing shortcuts in unicodeobject.c already
bypass the full codec machinery, and that includes the exception wrapping
on failures.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread STINNER Victor

STINNER Victor added the comment:

> Regarding UTF-8 et al, the existing shortcuts in unicodeobject.c already
> bypass the full codec machinery, and that includes the exception wrapping
> on failures.

There are still platforms using locale encodings different than ascii,
latin1 or utf8. For example, Windows never uses these encodings for
ANSI or OEM code page. ANSI code page is used as the Python filesystem
encoding which is used in a lot of places. OEM code page is used for
the stdio streams (stdin, stdout, stderr). There are some users using
locales with the latin9 encoding.

I proposed to remove the new code for exception handling to simplify
the code (because the error should not occur anymore).

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

Just noting the exact list of codecs that currently bypass the full codec 
machinery and go direct to the C implementation by normalising the codec name 
(which includes forcing to lowercase) and then using strcmp to check against a 
specific set of known encodings.

In PyUnicode_Decode (and hence bytes.decode and bytearray.decode):

utf-8
utf8
latin-1
latin1
iso-8859-1
iso8859-1
mbcs (Windows only)
ascii
utf-16
utf-32

In PyUnicode_AsEncodedString (and hence str.encode), the list is mostly the 
same, but utf-16 and utf-32 are not accelerated (i.e. they're currently still 
looked up through the codec machinery).

It may be worth opening a separate issue to restore the consistency between the 
lists by adding utf-16 and utf-32 to the fast path for encoding as well.
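
The fast-path check described above can be pictured roughly as follows. This is only a Python sketch of the idea, not the C code in unicodeobject.c: the helper names are hypothetical, the normalisation is simplified to lowercasing plus mapping underscores to hyphens, and the name lists are copied from this comment, so they may differ in other CPython versions.

```python
# Names listed in this comment as handled directly in PyUnicode_Decode.
FAST_DECODE_NAMES = frozenset([
    "utf-8", "utf8", "latin-1", "latin1", "iso-8859-1", "iso8859-1",
    "mbcs",  # Windows only
    "ascii", "utf-16", "utf-32",
])

# PyUnicode_AsEncodedString: same list, minus utf-16 and utf-32.
FAST_ENCODE_NAMES = FAST_DECODE_NAMES - frozenset(["utf-16", "utf-32"])


def normalize(name):
    """Simplified stand-in for the C-level normalisation step."""
    return name.lower().replace("_", "-")


def takes_fast_path(name, decoding=True):
    """Return True if the name would skip the general codec machinery."""
    names = FAST_DECODE_NAMES if decoding else FAST_ENCODE_NAMES
    return normalize(name) in names
```

With these lists, takes_fast_path("UTF_16") is true for decoding and false for encoding, which is the inconsistency mentioned above.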

As far as the wrapping mechanism from issue #17828 itself goes:

- it only triggers if PyEval_CallObject on the encoder or decoder returns NULL
- stateful exceptions (which includes UnicodeEncodeError and 
UnicodeDecodeError) and those with custom __init__ or __new__ implementations 
don't get wrapped
- the actual wrapping process is just the C equivalent of raise 
type(exc)(new_msg) from exc, plus the initial checks to determine if the 
current exception can be wrapped safely
- it applies to the *general purpose* codec machinery, not just to the text 
model related convenience methods
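
A pure-Python approximation of the wrapping steps listed above. The helper names are hypothetical and the safety checks are deliberately simplified: the C code inspects the exception type's __init__/__new__ slots, while this sketch only skips UnicodeError subclasses, exceptions with more than one argument, and exceptions carrying extra instance attributes. It calls the decoder function directly so that the interpreter's own wrapping does not interfere.

```python
import codecs


def wrap_codec_error(exc, operation, encoding):
    """Return a wrapped copy of exc, or None if wrapping looks unsafe."""
    if isinstance(exc, UnicodeError):
        # Stateful: carries encoding, object, start, end and reason.
        return None
    if len(exc.args) > 1 or vars(exc):
        # Extra state could be lost when the exception is rebuilt.
        return None
    new_msg = "{} with '{}' codec failed ({}: {})".format(
        operation, encoding, type(exc).__name__, exc)
    try:
        return type(exc)(new_msg)
    except Exception:
        return None


def checked_decode(data, encoding):
    """Decode via the codec's decoder function, wrapping failures."""
    decoder = codecs.getdecoder(encoding)
    try:
        return decoder(data)[0]
    except Exception as exc:
        wrapped = wrap_codec_error(exc, "decoding", encoding)
        if wrapped is None:
            raise
        # Python spelling of "raise type(exc)(new_msg) from exc".
        raise wrapped from exc
```

A failing hex decode then surfaces as the same exception type with the codec name in the message and the original error as __cause__, while a UnicodeDecodeError passes through untouched.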

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-22 Thread Nick Coghlan

Nick Coghlan added the comment:

The examples from the 3.4 What's New may make it clearer why the exception
wrapping in the codec machinery is much broader in scope than just detecting
output type errors (even though handling that case was the original motivation
for the idea):

>>> import codecs

>>> codecs.decode(b"abcdefgh", "hex")
binascii.Error: Non-hexadecimal digit found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: decoding with 'hex' codec failed (Error: Non-hexadecimal digit found)

>>> codecs.encode("hello", "bz2")
TypeError: 'str' does not support the buffer interface

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encoding with 'bz2' codec failed (TypeError: 'str' does not support the buffer interface)

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-21 Thread Nick Coghlan

Nick Coghlan added the comment:

I used the private class method to minimise the per-codec impact (1
modified/added line per codec rather than 3).

Your other suggestions look good, so I'll apply those before committing.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-21 Thread STINNER Victor

STINNER Victor added the comment:

> return codecs.CodecInfo._declare_transform()

I also prefer the private attribute option.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-21 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Why

    return codecs.CodecInfo._declare_transform(
        name='base64',
        encode=base64_encode,
        decode=base64_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )

instead of

    codec = codecs.CodecInfo(
        name='base64',
        encode=base64_encode,
        decode=base64_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )
    codec._is_text_encoding = False
    return codec

?

I have added other minor comments on Rietveld.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-21 Thread Nick Coghlan

Nick Coghlan added the comment:

Victor pointed out this should now raise LookupError rather than TypeError.

However, I'm not going to duplicate the manipulation of the private
attribute across seven different codecs when a private alternate
constructor solves that problem far more cleanly.
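
A sketch of what such a private alternate constructor could look like. It is written as a subclass so it runs on its own; SketchCodecInfo and the demo codec functions are hypothetical names, and the actual patch under discussion may differ in detail.

```python
import base64
import codecs


class SketchCodecInfo(codecs.CodecInfo):
    # Default for ordinary codecs: usable from str.encode()/bytes.decode().
    _is_text_encoding = True

    @classmethod
    def _declare_transform(cls, **kwargs):
        """Build the CodecInfo as usual, then flag it as a non-text codec."""
        info = cls(**kwargs)
        info._is_text_encoding = False
        return info


def demo_encode(data, errors="strict"):
    return base64.encodebytes(data), len(data)


def demo_decode(data, errors="strict"):
    return base64.decodebytes(data), len(data)


# One call per codec module, instead of construct + set attribute + return.
transform_info = SketchCodecInfo._declare_transform(
    name="demo-base64", encode=demo_encode, decode=demo_decode)
text_info = SketchCodecInfo(demo_encode, demo_decode, name="demo-text")
```

The flag stays out of the public constructor signature, which is the trade-off being argued over in the surrounding comments.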

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-21 Thread STINNER Victor

STINNER Victor added the comment:

> There is no codec registry - there is only the default codec search
> function, the encodings import namespace, the normalisation algorithm
> and the alias dictionary.

interp->codec_search_cache can be seen as the registry. If you store codecs
in two different registries depending on a property, attribute, whatever, you
keep O(1) complexity (no extra strcmp or attribute access at each lookup). The
overhead is only when you load a codec for the first time.

It should not be so hard to add a second dictionary.

You don't need to touch all parts of the codecs machinery, only
interp->codec_search_cache.

It would not be possible to have the same name in the two registries. So
codecs.lookup() would still return any kind of codec; it would just look up in
two dictionaries instead of one. So codecs.encode/decode would be unchanged too
(if you want to keep these functions ;-)).

Only bytes.decode/str.encode would be modified to look up only in the
text-codecs-only registry.


Yet another option: add a new dictionary, but leave interp->codec_search_cache
unchanged. Text codecs would also be registered twice: once in
interp->codec_search_cache, once in the second dictionary. So
bytes.decode/str.encode would only look up in the text codecs dictionary,
instead of interp->codec_search_cache. That's all ;-)


> Victor pointed out this should now raise LookupError rather than TypeError.

If you accept raising a LookupError, the two registries option becomes
more obvious, doesn't it?
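
The "yet another option" variant can be sketched in Python with two dictionaries. This only illustrates the lookup behaviour; the real cache is a C-level dictionary on the interpreter state, and every name below is hypothetical.

```python
import codecs

all_codecs = {}    # stand-in for interp->codec_search_cache
text_codecs = {}   # second dictionary, holding text encodings only


def register(info, is_text):
    """Store the codec; text encodings are registered twice."""
    all_codecs[info.name] = info
    if is_text:
        text_codecs[info.name] = info


def lookup_any(name):
    """What codecs.lookup() and codecs.encode/decode would consult."""
    try:
        return all_codecs[name]
    except KeyError:
        raise LookupError("unknown encoding: " + name)


def lookup_text(name):
    """What str.encode() and bytes.decode() would consult."""
    try:
        return text_codecs[name]
    except KeyError:
        raise LookupError("unknown text encoding: " + name)


register(codecs.lookup("utf-8"), is_text=True)
register(codecs.lookup("hex_codec"), is_text=False)
```

Both lookups remain a single dictionary access, and a transform codec requested through the text-only path fails with LookupError.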

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-21 Thread Nick Coghlan

Nick Coghlan added the comment:

New patch for 3.4 that uses a private attribute on CodecInfo and a private 
class method to set it appropriately (as I believe that is a better approach 
than changing the signature of CodecInfo.__init__ at this point, especially if 
we end up pursuing the codec type map idea in 3.5)

This version also updates the tests to check for the appropriate error messages.

The integration into the text model related methods is the same as in the
proof of concept: a parallel private text-encoding-only C API that is used in
preference to the general purpose codec machinery where appropriate.

If there aren't any objections to this approach, I'll commit this one tomorrow.

--
Added file: 
http://bugs.python.org/file32755/issue19619_blacklist_transforms_py34.diff




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-21 Thread Nick Coghlan

Changes by Nick Coghlan ncogh...@gmail.com:


--
assignee:  -> ncoghlan




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-19 Thread Nick Coghlan

Nick Coghlan added the comment:

Given the time frame, how about we just go with Serhiy's suggestion of a known 
non-Unicode codec internal blacklist for both 3.3 and 3.4?

I still like the idea of exposing codec type maps for introspection, but 
designing a decent API for that which also handles type preserving codecs is 
going to take some work, and can't realistically be included in 3.4.

By my count, if we delay the blacklisting until after we do the codec lookup, 
there's only seven names we need to block:

>>> from codecs import lookup
>>> blacklist = "base64 uu quopri hex bz2 zlib rot13".split()
>>> for name in blacklist:
...     print(lookup(name).name)
...
base64
uu
quopri
hex
bz2
zlib
rot-13

--
versions: +Python 3.3




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-19 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 19.11.2013 12:38, Nick Coghlan wrote:
 
> Given the time frame, how about we just go with Serhiy's suggestion of a
> known non-Unicode codec internal blacklist for both 3.3 and 3.4?

+1

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-19 Thread Nick Coghlan

Nick Coghlan added the comment:

Attached is a proof of concept for the blacklist approach (for 3.4, but without 
the fixes needed for the transform codec handling tests in test_codecs)

This does have the potential to add a reasonable amount of additional overhead
to encoding and decoding for short strings. Since it isn't obvious where to
store a set for faster checking against the blacklist, it may be worth
benchmarking this naive approach before doing something more complicated.

Regardless, I don't plan to take this further myself any time soon - I just 
wanted to give it a firm nudge in the direction of the blacklist approach by 
providing a proof of concept.

--
keywords: +patch
Added file: 
http://bugs.python.org/file32702/issue19619_blacklist_proof_of_concept.diff




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-19 Thread STINNER Victor

STINNER Victor added the comment:

+    /* A set would be faster, but when to build it, where to store it? */
+    if (_PyUnicode_CompareWithId(codec_name, &PyId_base64) == 0 ||
+        _PyUnicode_CompareWithId(codec_name, &PyId_uu) == 0 ||
+        _PyUnicode_CompareWithId(codec_name, &PyId_quopri) == 0 ||
+        _PyUnicode_CompareWithId(codec_name, &PyId_hex) == 0 ||
+        _PyUnicode_CompareWithId(codec_name, &PyId_bz2) == 0 ||
+        _PyUnicode_CompareWithId(codec_name, &PyId_zlib) == 0 ||
+        PyUnicode_CompareWithASCIIString(codec_name, "rot-13") == 0
+       ) {
+        is_text_codec = 0;
+    }

This is slow and not future proof. It would be faster and simpler to have two
registries: a registry only for bytes.decode()/str.encode() and another for
custom codecs for codecs.encode/decode (or
(bytes|str).transform()/untransform()).

So "abc".encode("rot13") would simply fail with a LookupError.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-19 Thread Nick Coghlan

Nick Coghlan added the comment:

Future proofing is irrelevant at this point - this is just about what
can realistically be implemented in 3.4, not what can be implemented
with the luxury of several months to rearchitect the codec system (and
if we were going to do that, we'd just fix the type mapping
introspection problem).

There is no codec registry - there is only the default codec search
function, the encodings import namespace, the normalisation algorithm
and the alias dictionary.

It sounds to me like you still believe it is possible to stick the
genie back in the bottle and limit the codec system to what *you*
think is a good idea. It doesn't work like that - the codecs module
already provides a fully general data transformation system backed by
lazy imports, and that isn't going to change due to backwards
compatibility constraints. The only option we have is whether or not
we file off the rough edges and try to ease the transition for users
migrating from Python 2, where all of the standard library codecs fit
within the limits of the text model, so the general purpose codec
infrastructure almost never came into play. Getting rid of it is no
longer a realistic option - documenting it, improving the failure
modes and potentially adding some features (in Python 3.5+) are the
only improvements that are genuinely feasible.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Blacklisting by name is slow, and it prevents a user from defining a codec with
a blacklisted name.

What if we just add a private attribute (_not_text?) to unsafe codecs? If a
codec has this attribute, then it should not be used in text encoding/decoding.
Checking an attribute is much faster than comparing with several strings.

Another possibility is inheriting all unsafe codecs from a special class.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-19 Thread Nick Coghlan

Nick Coghlan added the comment:

Yes, a private attribute on CodecInfo is probably better - the rest of the
patch would stay the same, it would just check for that attribute instead
of particular names.
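
As an illustration of the attribute check, a hedged Python sketch of what a text-only encode helper might do after the lookup. It assumes the private _is_text_encoding attribute that CPython 3.4 and later set on CodecInfo (the name _not_text proposed above was not the final spelling); text_encode itself is a hypothetical helper, not an existing API.

```python
import codecs


def text_encode(text, encoding):
    """Encode str to bytes, refusing codecs flagged as non-text."""
    info = codecs.lookup(encoding)
    # Codecs without the flag are treated as text encodings, so third-party
    # codecs that predate the attribute keep working.
    if not getattr(info, "_is_text_encoding", True):
        raise LookupError(
            "'%s' is not a text encoding; use codecs.encode() to handle "
            "arbitrary codecs" % info.name)
    return info.encode(text)[0]
```

The check is a single attribute access after the normal lookup, so it does not depend on the codec's name and leaves users free to register their own codec under any name.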

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Nick Coghlan

Nick Coghlan added the comment:

The full input/output type specifications can't be implemented sensibly without 
also defining at least a ByteSequence ABC. While I think it's a good idea in 
the long run, there's no feasible way to design such a system in the time 
remaining before the Python 3.4 feature freeze.

However, we could do something much simpler as a blacklist API:

    def is_unicode_codec(name):
        """Returns true if this is the name of a known Unicode text encoding"""

    def set_as_non_unicode(name):
        """Indicates that the named codec is not a Unicode codec"""

And then the codecs module would just maintain a set internally of all the 
names explicitly flagged as non-unicode.

Such an API remains useful even if the input/output type support is added in 
Python 3.5 (since codecs.is_unicode_codec(name) is a bit simpler thing to 
explain than the exact type restrictions).
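
A minimal sketch of how that pair of functions might be backed by a set. The module-level set and the choice to key it on the canonical name returned by codecs.lookup() are assumptions made for the example, not part of the proposal.

```python
import codecs

# Canonical names of codecs explicitly flagged as non-Unicode.
_non_unicode_codecs = set()


def set_as_non_unicode(name):
    """Indicates that the named codec is not a Unicode codec"""
    _non_unicode_codecs.add(codecs.lookup(name).name)


def is_unicode_codec(name):
    """Returns true if this is the name of a known Unicode text encoding"""
    return codecs.lookup(name).name not in _non_unicode_codecs


for codec_name in ("hex_codec", "zlib_codec", "rot_13"):
    set_as_non_unicode(codec_name)
```

Keying on the canonical name means aliases and spelling variants of the same codec share one entry.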

Alternatively, implementing just the encodes_to and decodes_to attributes 
would be enough for str.encode, bytes.decode and bytearray.decode to reject 
known bad encodings early, leaving the input type checks to the codecs for now 
(since it is correctly defining encode_from and decode_from for many stdlib 
codecs that would need the ByteSequence ABC).

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I think an internal blacklist for all standard non-Unicode codecs will be
enough to prevent the denial of service attack in maintenance releases.

--
components: +Unicode
nosy: +doerwalter, ezio.melotti
priority: normal -> critical
stage:  -> needs patch
type:  -> behavior




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 16.11.2013 10:16, Nick Coghlan wrote:
 
> Nick Coghlan added the comment:
>
> The full input/output type specifications can't be implemented sensibly
> without also defining at least a ByteSequence ABC. While I think it's a good
> idea in the long run, there's no feasible way to design such a system in the
> time remaining before the Python 3.4 feature freeze.
>
> However, we could do something much simpler as a blacklist API:
>
>     def is_unicode_codec(name):
>         """Returns true if this is the name of a known Unicode text encoding"""
>
>     def set_as_non_unicode(name):
>         """Indicates that the named codec is not a Unicode codec"""
>
> And then the codecs module would just maintain a set internally of all the
> names explicitly flagged as non-unicode.

That doesn't look flexible enough to cover the various different
input/output types.

> Such an API remains useful even if the input/output type support is added in
> Python 3.5 (since codecs.is_unicode_codec(name) is a bit simpler thing to
> explain than the exact type restrictions).
>
> Alternatively, implementing just the encodes_to and decodes_to attributes
> would be enough for str.encode, bytes.decode and bytearray.decode to reject
> known bad encodings early, leaving the input type checks to the codecs for
> now (since it is correctly defining encode_from and decode_from for many
> stdlib codecs that would need the ByteSequence ABC).

The original idea we discussed some time ago was to add a mapping
or list attribute to CodecInfo which lists all supported type
combinations.

The codecs module could then make this information available through
a simple type check API (which also caches the lookups for performance
reasons), e.g.

    codecs.types_supported(encoding, input_type, output_type) -> boolean.

        Returns True/False depending on whether the codec for
        encoding supports the given input and output types.

Usage:

    if not codecs.types_supported(encoding, str, bytes):
        # not a Unicode -> 8-bit codec
        ...
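
To make the idea concrete, a small sketch of such a check. No such attribute or function exists in the codecs module; the SUPPORTED_TYPES table stands in for the proposed per-codec attribute, its contents are illustrative only, and only the encoding direction is modelled.

```python
import codecs
from functools import lru_cache

# Hypothetical per-codec data: (input_type, output_type) pairs for encoding,
# keyed by the canonical codec name.
SUPPORTED_TYPES = {
    "utf-8": [(str, bytes)],
    "hex": [(bytes, bytes)],
    "rot-13": [(str, str)],
}


@lru_cache(maxsize=None)
def types_supported(encoding, input_type, output_type):
    """Return True if the codec declares this input/output combination."""
    name = codecs.lookup(encoding).name
    return (input_type, output_type) in SUPPORTED_TYPES.get(name, ())
```

Because each codec carries a list of pairs rather than a single type, a codec that accepts several input types can declare all of them.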

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Nick Coghlan

Nick Coghlan added the comment:

Note that users can completely blacklist any codec that hasn't been imported 
yet by preventing imports of that codec definition:

>>> import sys, encodings
>>> blocked_codecs = "bz2_codec", "zlib_codec"
>>> for name in blocked_codecs:
...     sys.modules["encodings." + name] = None
...     setattr(encodings, name, None)
...
>>> b"payload".decode("bz2_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: bz2_codec
>>> b"payload".decode("zlib_codec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: zlib_codec

Add in an encodings._cache.clear() and you can also block the use of 
previously used codecs.

Regardless of what else we do, we should document this so that users know how 
to do it.

This means the case we're handling in this issue is just the one where we want 
to block a codec from the builtin method APIs, while still allowing it in the 
codecs module APIs.

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Nick Coghlan

Nick Coghlan added the comment:

Now that I understand Victor's proposal better, I actually agree with it, I 
just think the attribute names need to be encodes_to and decodes_to.

With Victor's proposal, *input* validity checks (including type checks) would
remain the responsibility of the codec itself. What the new attributes would
enable is *output* type checks *without having to perform the encoding or
decoding operation first*. Codecs will be free to leave these as None to retain
the current behaviour of "try it and see".

The specific field names "input_type" and "output_type" aren't accurate, since
the acceptable input types for encoding or decoding are likely to be more
permissive than the specific output type for the other operation. Most of the
binary codecs, for example, accept any bytes-like object as input, but produce
bytes objects as output for both encoding and decoding. For Unicode encodings,
encoding is strictly str -> bytes, but decoding is generally the more permissive
bytes-like object -> str.

I would still suggest providing the following helper function in the codecs 
module (the name has changed from my earlier suggestion and I now suggest 
implementing it in terms of Victor's suggestion with more appropriate field 
names):

    def is_text_encoding(name):
        """Returns true if the named encoding is a Unicode text encoding"""
        info = codecs.lookup(name)
        return info.encodes_to is bytes and info.decodes_to is str

This approach covers all the current stdlib codecs:

- the text encodings encode to bytes and decode to str
- the binary transforms encode to bytes and also decode to bytes
- the lone text transform (rot_13) encodes and decodes to str

This approach also makes it possible for a type inference engine (like mypy) to 
potentially analyse codec use, and could be expanded in 3.5 to offer type 
checked binary and text transform APIs that filtered codecs appropriately 
according to their output types.
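
A runnable sketch of the helper above. Real CodecInfo objects have no encodes_to or decodes_to attributes, so a hypothetical side table (DECLARED_OUTPUTS, with illustrative entries) stands in for them here; a codec missing from the table corresponds to leaving the attributes as None.

```python
import codecs
from collections import namedtuple

OutputTypes = namedtuple("OutputTypes", "encodes_to decodes_to")

# Hypothetical declarations, keyed by canonical codec name.
DECLARED_OUTPUTS = {
    "utf-8": OutputTypes(bytes, str),    # text encoding
    "hex": OutputTypes(bytes, bytes),    # binary transform
    "rot-13": OutputTypes(str, str),     # text transform
}


def is_text_encoding(name):
    """Returns true if the named encoding is a Unicode text encoding"""
    declared = DECLARED_OUTPUTS.get(codecs.lookup(name).name)
    if declared is None:
        # Output types left unspecified: no early answer is possible.
        return False
    return declared.encodes_to is bytes and declared.decodes_to is str
```

Only the output types are declared, so input validation stays with each codec, as described above.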

--




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 16.11.2013 13:44, Nick Coghlan wrote:
 
> Nick Coghlan added the comment:
>
> Now that I understand Victor's proposal better, I actually agree with it, I
> just think the attribute names need to be encodes_to and decodes_to.
>
> With Victor's proposal, *input* validity checks (including type checks) would
> remain the responsibility of the codec itself. What the new attributes would
> enable is *output* type checks *without having to perform the encoding or
> decoding operation first*. Codecs will be free to leave these as None to
> retain the current behaviour of "try it and see".
>
> The specific field names "input_type" and "output_type" aren't accurate,
> since the acceptable input types for encoding or decoding are likely to be
> more permissive than the specific output type for the other operation. Most
> of the binary codecs, for example, accept any bytes-like object as input, but
> produce bytes objects as output for both encoding and decoding. For Unicode
> encodings, encoding is strictly str -> bytes, but decoding is generally the
> more permissive bytes-like object -> str.
>
> I would still suggest providing the following helper function in the codecs
> module (the name has changed from my earlier suggestion and I now suggest
> implementing it in terms of Victor's suggestion with more appropriate field
> names):
>
>     def is_text_encoding(name):
>         """Returns true if the named encoding is a Unicode text encoding"""
>         info = codecs.lookup(name)
>         return info.encodes_to is bytes and info.decodes_to is str
>
> This approach covers all the current stdlib codecs:
>
> - the text encodings encode to bytes and decode to str
> - the binary transforms encode to bytes and also decode to bytes
> - the lone text transform (rot_13) encodes and decodes to str
>
> This approach also makes it possible for a type inference engine (like mypy)
> to potentially analyse codec use, and could be expanded in 3.5 to offer type
> checked binary and text transform APIs that filtered codecs appropriately
> according to their output types.

Nick, you are missing an important point: codecs can have any
number of input/output type combinations, e.g. they may
convert bytes -> str and str -> str (output type depends on
input type).

For this reason the simplistic approach with just one type
conversion will not work. Codecs will have to provide a
*mapping* of input to output types for each direction
(encoding and decoding) - either as Python mapping or
as list of mapping tuples.
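
A per-direction mapping of input types to output types, as suggested above, might be sketched as follows. Everything named here (ENCODE_TYPES, encode_output_type, usable_for_str_encode, and the "mixed_example" codec) is hypothetical and only illustrates the idea; none of it exists in the codecs module.

```python
# Hypothetical table: codec name -> {input type: output type} for encoding.
# A real proposal would carry one such mapping per direction on CodecInfo.
ENCODE_TYPES = {
    "utf-8": {str: bytes},
    "base64": {bytes: bytes, bytearray: bytes},
    "rot_13": {str: str},
    # Made-up codec whose output type depends on the input type.
    "mixed_example": {bytes: str, str: str},
}


def encode_output_type(encoding, obj):
    """Return the declared output type for encoding obj, or None if undeclared."""
    for input_type, output_type in ENCODE_TYPES.get(encoding, {}).items():
        if isinstance(obj, input_type):
            return output_type
    return None


def usable_for_str_encode(encoding):
    """True if the codec declares that str input encodes to bytes."""
    return ENCODE_TYPES.get(encoding, {}).get(str) is bytes


for name in ENCODE_TYPES:
    print(name, usable_for_str_encode(name))
```

With such a table, str.encode() could reject a codec from its declared types alone, without running the codec first.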




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Nick Coghlan

Nick Coghlan added the comment:

Such codecs can be represented (for 3.4) by simply not setting the attribute 
and leaving the output types unspecified. We don't need that complexity for the 
standard library, and the "not specified" escape hatch means complex codecs 
will be no worse off than they are now.

The elegance of Victor's proposal is that it doesn't lock us out of solving the 
more complex cases later (where the codec's output type depends on the input 
type) by permitting a tuple or dict mapping input types to output types for 
encodes_to and decodes_to, while still solving all of our immediate 
problems.

This is especially relevant since we can't effectively represent codec input 
types until we have a ByteSequence ABC to cover the bytes-like object case, 
so demanding that the general case be handled immediately is the same as 
requesting that the feature be postponed completely to Python 3.5.




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 16.11.2013 14:37, Nick Coghlan wrote:
 
> Such codecs can be represented (for 3.4) by simply not setting the attribute
> and leaving the output types unspecified. We don't need that complexity for
> the standard library, and the "not specified" escape hatch means complex
> codecs will be no worse off than they are now.
>
> The elegance of Victor's proposal is that it doesn't lock us out of solving
> the more complex cases later (where the codec's output type depends on the
> input type) by permitting a tuple or dict mapping input types to output types
> for encodes_to and decodes_to, while still solving all of our immediate
> problems.
>
> This is especially relevant since we can't effectively represent codec input
> types until we have a ByteSequence ABC to cover the "bytes-like object" case,
> so demanding that the general case be handled immediately is the same as
> requesting that the feature be postponed completely to Python 3.5.

I don't agree.

The mapping API is not much more complex than
the single type combination proposal and it could well handle the
case for which you'd have to add a ByteSequence ABC now to be
able to define this single type combination using one ABC.

Rather than adding the ABC now, you could simply add all
relevant types to the mappings and then replace those mappings
with an ABC in 3.5.

BTW: I don't see a need to rush any of this. If more discussion
is needed, then it's better to have a more complete proposal
implemented in 3.5 than to try to do patchwork this late in the
3.4 release process.




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Nick Coghlan

Nick Coghlan added the comment:

The only reasonable way to accurately represent "anything that exposes a buffer 
memoryview can read" as a type check is to write an appropriately duck-typed 
ABC. You can't enumerate all the types that the binary codecs accept as input, 
because that list of types isn't finite (unlike the output types, which are far 
more tightly constrained).

I'd also be fine with Serhiy's suggestion of a private "non Unicode codec" set 
that is maintained by hand and checked *before* the codec operations in the 
codec methods - that then just becomes an internal implementation detail to 
improve the efficiency of the output type checks where we have the additional 
info needed to save the interpreter some work.
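
A hand-maintained set checked before the codec runs could look roughly like the sketch below. The names NON_TEXT_CODECS and checked_str_encode are hypothetical, and the set's membership is only illustrative; it is not claimed to be the list CPython uses.

```python
import codecs

# Hypothetical hand-maintained set of codecs that are not Unicode text
# encodings. Names are normalised through codecs.lookup() so that aliases
# compare equal.
NON_TEXT_CODECS = frozenset(
    codecs.lookup(name).name for name in ("base64", "hex", "rot_13", "zlib")
)


def checked_str_encode(text, encoding="utf-8", errors="strict"):
    """Encode text, rejecting known non-text codecs before doing any work."""
    info = codecs.lookup(encoding)
    if info.name in NON_TEXT_CODECS:
        raise LookupError(f"{encoding!r} is not a text encoding")
    # CodecInfo.encode is the stateless encoder; it returns (output, consumed).
    return info.encode(text, errors)[0]


print(checked_str_encode("abc", "utf-8"))
try:
    checked_str_encode("abc", "zlib")
except LookupError as exc:
    print("rejected:", exc)
```

Because the check happens before the codec is invoked, a compression codec never gets the chance to process the input.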




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-16 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 16.11.2013 15:52, Nick Coghlan wrote:
 
> The only reasonable way to accurately represent "anything that exposes a
> buffer memoryview can read" as a type check is to write an appropriately
> duck-typed ABC. You can't enumerate all the types that the binary codecs
> accept as input, because that list of types isn't finite (unlike the output
> types, which are far more tightly constrained).

Theoretically, yes. However, in practice, you'd only be interested
in a few type combinations (until the ABC is available).

> I'd also be fine with Serhiy's suggestion of a private "non Unicode codec"
> set that is maintained by hand and checked *before* the codec operations in
> the codec methods - that then just becomes an internal implementation detail
> to improve the efficiency of the output type checks where we have the
> additional info needed to save the interpreter some work.

For 3.4 that would also do fine :-)




[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

2013-11-15 Thread STINNER Victor

New submission from STINNER Victor:

I propose to add new input_type and output_type attributes to CodecInfo. These 
attributes would only be defined for base64, hex, ... codecs which are not the 
classic encode: str=>bytes, decode: bytes=>str codecs.

I also propose to modify str.encode() and bytes.decode() to only accept codecs 
using the right types. If the type doesn't match, the codec raises a 
LookupError.

This issue should avoid the denial of service attack when a compression codec 
is used, see:
https://mail.python.org/pipermail/python-dev/2013-November/130188.html
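
To make the distinction between the classic text encodings and the other codecs concrete, the short sketch below inspects the result types of a few stdlib codecs through codecs.encode() and codecs.decode(), which accept arbitrary codecs. The helper name type_signature is hypothetical.

```python
import codecs


def type_signature(encoding, value):
    """Return (type of encoded value, type of the decoded round trip)."""
    encoded = codecs.encode(value, encoding)
    decoded = codecs.decode(encoded, encoding)
    return type(encoded), type(decoded)


samples = [
    ("utf-8", "abc"),    # text encoding
    ("hex", b"abc"),     # binary transform
    ("rot_13", "abc"),   # text transform
]

for encoding, value in samples:
    enc_type, dec_type = type_signature(encoding, value)
    print(f"{encoding}: encodes to {enc_type.__name__}, "
          f"decodes to {dec_type.__name__}")
```

Only the first of these has the str-to-bytes and bytes-to-str shape that str.encode() and bytes.decode() are meant for.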

--
messages: 202996
nosy: haypo, lemburg, ncoghlan, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()
versions: Python 3.4
