[issue23505] Urlparse insufficient validation leads to open redirect

2015-04-07 Thread Paul McMillan

Paul McMillan added the comment:

As Martin said, backporting a change for this wouldn't be appropriate
for Python 2.7. The 2.7 docs could certainly be updated to make this
clearer, but we can't introduce a breaking change like that to the
stable release.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23505
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23505] Urlparse insufficient validation leads to open redirect

2015-03-03 Thread Paul McMillan

Paul McMillan added the comment:

While some websites may use urlunparse(urlparse(url)) to validate a url, this 
is far from standard practice. Django, for instance, does not use this method. 
While I agree we should clean this behavior up, this is not a vulnerability in 
core python, and we need to invalidate the assigned CVE.

--
nosy: +PaulMcMillan

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23505
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14297] Custom string formatter doesn't work like builtin str.format

2012-03-13 Thread Paul McMillan

New submission from Paul McMillan p...@mcmillan.ws:

In Python 2.7, I can use an empty format string to indicate automatically 
numbered positional arguments when formatting the builtin string object.

http://docs.python.org/release/2.7/library/string.html#format-string-syntax

If I subclass string.Formatter, this expected functionality doesn't work.
http://docs.python.org/release/2.7/library/string.html#string-formatting


Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type help, copyright, credits or license for more information.
 from string import Formatter
 class MyFormatter(Formatter):
... pass
... 
 fs_a = 'a{0}b{1}c'
 fs_b = 'a{}b{}c'
 fs_a.format(1, 2)
'a1b2c'
 fs_b.format(1, 2)
'a1b2c'
 fmt = MyFormatter()
 fmt.format(fs_a, 1, 2)
'a1b2c'
 fmt.format(fs_b, 1, 2)
Traceback (most recent call last):
  File stdin, line 1, in module
  File /usr/lib/python2.7/string.py, line 545, in format
return self.vformat(format_string, args, kwargs)
  File /usr/lib/python2.7/string.py, line 549, in vformat
result = self._vformat(format_string, args, kwargs, used_args, 2)
  File /usr/lib/python2.7/string.py, line 571, in _vformat
obj, arg_used = self.get_field(field_name, args, kwargs)
  File /usr/lib/python2.7/string.py, line 632, in get_field
obj = self.get_value(first, args, kwargs)
  File /usr/lib/python2.7/string.py, line 591, in get_value
return kwargs[key]
KeyError: ''

--
messages: 155708
nosy: PaulMcMillan
priority: normal
severity: normal
status: open
title: Custom string formatter doesn't work like builtin str.format
versions: Python 2.7

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14297
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-23 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

 I think you're asking a bit much here :-) A broken app is a broken
 app, no matter how nice Python tries to work around it. If an
 app puts too much trust into user data, it will be vulnerable
 one way or another and regardless of how the user data enters
 the app.

I notice your patch doesn't include fixes for the entire standard
library to work around this problem. Were you planning on writing
those, or leaving that for others?

As a developer, I honestly don't know how I can state with certainty
that input data is clean or not, until I actually see the error you
propose. I can't check validity before the fact, the way I can check
for invalid unicode before storing it in my database. Once I see the
error (probably only after my application is attacked, certainly not
during development), it's too late. My application can't know which
particular data triggered the error, so it can't delete it. I'm
reduced to trial and error to remove the offending data, or to writing
code that never stores more than 1000 things in a dictionary. And I
have to accept that the standard library may not work on any
particular data I want to process, and must write code that detects
the error state and somehow magically removes the offending data.

The alternative, randomization, simply means that my dictionary
ordering is not stable, something that is already the case.

While I appreciate that the counting approach feels cleaner;
randomization is the only solution that makes practical sense.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-21 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

On Sat, Jan 21, 2012 at 3:47 PM, Alex Gaynor rep...@bugs.python.org wrote:
 I'm able to put N pieces of data into the database on successive requests,
 but then *rendering* that data puts it in a dictionary, which renders that
 page unviewable by anyone.

This and the problems Frank mentions are my primary concerns about the
counting approach. Without the original suggestion of modifying the
hash and continuing without an exception (which has its own set of
problems), the valid data python can't process problem is a pretty
big one. Allowing attackers to poison interactions for other users is
unacceptable.

The other thing I haven't seen mentioned yet is that while it is true
that most web applications do have robust error handling to produce
proper 500s, an unexpected error will usually result in restarting the
server process - something that can carry significant weight by
itself. I would consider it a serious problem if every attack request
required a complete application restart, a la original cgi.

I'm strongly in favor of randomization. While there are many broken
applications in the wild that depend on dictionary ordering, if we
ship with this feature disabled by default for security and bugfix
branches, and enable it for 3.3, users can opt-in to protection as
they need it and as they fix their applications. Users who have broken
applications can still safely apply the security fix (without even
reading the release notes) because it won't change the default
behavior. Distro managers can make an appropriate choice for their
user base. Most importantly, it negates the entire compute once,
attack everywhere class of collision problems, even if we haven't
explicitly discovered them.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-08 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

 Christian Heimes added the comment:
 Ouch, the startup impact is large! Have we reached a point where one size 
 fits all doesn't work any longer? It's getting harder to have just one 
 executable for 500ms scripts and server processes that last for weeks.

This concerns me too, and is one reason I think the collision counting
code might be the winning solution. Randomness is hard to do correctly
and is expensive. If we can avoid it, we should try very hard to do
so...

 Christian Heimes said to Marc-Andre:
 Have you profiled your suggestion? I'm interested in the speed implications. 
 My gut feeling is that your idea could be slower, since you have added more 
 instructions to a tight loop, that is execute on every lookup, insert, update 
 and deletion of a dict key. The hash modification could have a smaller 
 impact, since the hash is cached. I'm merely speculating here until we have 
 some numbers to compare.

Interesting point, though I think we might be able to work it out so
that we're only adding instructions when there's actually a detected
collision. I'll be interested to see what the benchmarks (and real
world) have to say about the impacts of randomization as compared to
the existing black-magic optimization of the hash function.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-07 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

 Alex, I agree the issue has to do with the origin of the data, but the 
 modules listed are the ones that deal with the data supplied by this 
 particular attack.

They deal directly with the data. Do any of them pass the data
further, or does the data stop with them? A short and very incomplete
list of vulnerable standard lib modules includes: every single parsing
library (json, xml, html, plus all the third party libraries that do
that), all of numpy (because it processes data which probably came
from a user [yes, integers can trigger the vulnerability]), difflib,
the math module, most database adaptors, anything that parses metadata
(including commonly used third party libs like PIL), the tarfile lib
along with other compressed format handlers, the csv module,
robotparser, plistlib, argparse, pretty much everything under the
heading of 18. Internet Data Handling (email, mailbox, mimetypes,
etc.), 19. Structured Markup Processing Tools, 20. Internet
Protocols and Support, 21. Multimedia Services, 22.
Internationalization, TKinter, and all the os calls that handle
filenames. The list is impossibly large, even if we completely ignore
user code. This MUST be fixed at a language level.

I challenge you to find me 15 standard lib components that are certain
to never handle user-controlled input.

 Note that changing the hash algorithm for a persistent process, even though 
 each process may have a different seed or randomized source, allows attacks 
 for the life of that process, if an attack vector can be created during its 
 lifetime. This is not a problem for systems where each request is handled by 
 a different process, but is a problem for systems where processes are 
 long-running and handle many requests.

This point has been made many times now. I urge you to read the entire
thread on the mailing list. Your implementation is impractical because
your safe implementation completely ignores all hash caching (each
entry must be re-hashed for that dict). Your implementation is still
vulnerable in exactly the way you mentioned if you ever have any kind
of long-lived dict in your program thread.

 You have entered the class of people that claim lots of vulnerabilities, 
 without enumeration.

I have enumerated. Stop making this argument.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-06 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

 Those who use or advocate a simple randomized starting hash (Perl, Ruby, 
 perhaps MS, and the CCC presenters) are presuming that the randomized hash 
 values are kept private. Indeed, they should be (and the docs could note 
 this) unless an attacker has direct access to the interpreter.

Except that this is patently untrue. Anytime any programmer iterates
over a dictionary and returns the ordered result to the user in some
form, they're leaking information about the hash value. I hope you're
not suggesting that any programmer who is concerned about security
will make sure to sort the results of every iteration before making it
public in some fashion.

 I do not think we, as Python developers, should be concerned about esoteric 
 timing attacks.

Timing attacks are less esoteric than you think they are. This issue
gets worse, not better, as the internet moves (for better or worse)
towards virtualized computing.

 And if hashing were randomized per process, and probes were randomly 
 distributed among processes, and processes were periodically killed and 
 restarted with new seeds, could such an attack get anywhere...

You're suggesting that in order for a Python application to be secure,
it's a requirement that we randomly kill and restart processes from
time to time? I thought we were going for a good solution here, not a
hacky workaround.

 We could also consider, for 3.3, making the output of hash() be different 
 from the internal values used for dicts, perhaps by switching random seeds in 
 hash(). So even if someone does return hash(x) values to potential attackers, 
 they are not the values used in dicts. (This would require a slight change in 
 the doc.)

This isn't a bad idea, but I'd be fine documenting that the output of
hash() shouldn't be made public.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-06 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

 An attack can be based on trying to find many objects with the same
 hash value, or trying to find many objects that, as they get inserted
 into a dictionary, very often cause collisions due to the collision
 resolution algorithm not finding a free slot.

Yep. Allowing an attacker to produce very large dictionaries is also bad.

 if the application
 puts too much trust into large blobs of input data - which is
 the actual security issues we're trying to work around here...

To be very clear the issue is ANY large blob of data anywhere in the
application, not just on input. An attack could happen after whatever
transform your application runs on the data before returning it.

 I'll upload a patch that demonstrates the collisions counting
 strategy to show that detecting the problem is easy. Whether
 just raising an exception is a good idea, is another issue.

I'm in cautious agreement that collision counting is a better
strategy. The dict implementation performance would suffer from
randomization.

 The dict implementation could then alter the hash parameter
 and recreate the dict table in case the number of collisions
 exceeds a certain limit, thereby actively taking action
 instead of just relying on randomness solving the issue in
 most cases.

This is clever. You basically neuter the attack as you notice it but
everything else is business as usual. I'm concerned that this may end
up being costly in some edge cases (e.g. look up how many collisions
it takes to force the recreation, and then aim for just that many
collisions many times). Unfortunately, each dict object has to
discover for itself that it's full of offending hashes. Another
approach would be to neuter the offending object by changing its hash,
but this would require either returning multiple values, or fixing up
existing dictionaries, neither of which seems feasible.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-05 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

Marc-Andre: Victor already pasted the relevant part of my code:
http://bugs.python.org/issue13703#msg150568
The link to the fuller version, with revision history and a copy of the code 
before I modified it is here:
https://gist.github.com/0a91e52efa74f61858b5

Why? The attack doesn't work with short strings? What do you call a short 
string?

Well, the demonstrated collision is for 16 character ascii strings. Worst case 
UTF-8, we're looking at 3 manipulable bytes per character, but they may be 
harder to collide since some of those bytes are fixed.

 only be making it harder for script kiddies, but as soon as someone
 crypt-analysis the used hash algorithm, you're lost again.

Not true. What I propose is to make the amount of information necessary to 
analyze and generate collisions impractically large. My proposed hash function 
is certainly broken if you brute force the lookup table. There are undoubtedly 
other problems with it too. The point is that it's hard enough. We aren't going 
for perfect security - we're going for enough to make this attack impractical.

What are the downsides to counting collisions? For one thing, it's something 
that needs to be kept track of on a per-dict basis, and can't be cached the way 
the hash results are. How do you choose a good value for the limit? If you set 
it to something conservative, you still pay the collision price every time a 
dict  is created to discover that the keys collide. This means that it's 
possible to feed to bad data up to exactly the limit, and suddenly the python 
app is inexplicably slow. If you set the limit too aggressively, then sometimes 
valid data gets caught, and python randomly dies in hard to debug ways with an 
error the programmer has never seen in testing and cannot reproduce.

It adds a new way to kill most python applications, and so programs are going 
to have to be re-written to cope with it. It also introduces a new place to 
cause errors - if the WSGI server dies, it's hard for my application to catch 
that and recover gracefully.

... not in Python itself, but if you consider all the types in Python
 extensions and classes implementing __hash__ in user code, the number
 of hash functions to fix quickly becomes unmanageable.

When we looked at the Django project, we wouldn't have anything to fix since 
ours end up relying on the python internal values eventually. I suspect a lot 
of other code is similar.

Mark said:
What is the mechanism by which the attacker can determine the seeds?

The biggest information leak is probably the ordering in which dict entries are 
returned. This can be used to deduce the underlying hash values. This is much 
easier than trying to do it via timing.

 But that's not the issue we are supposed to be dealing with.
 A single (genuinely random) seed will deal with the attack described in 
 the talk and it is (almost) as fast as using 0 as a seed.

This is not true. A single random seed shifts the hash table, but does not 
actually prevent an attacker from generating collisions. Please see my other 
posts on the topic here and on the mailing list.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-05 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

As Alex said, Java has refused to fix the issue.

I believe that Ruby 1.9 (at least the master branch code that I looked
at) is using murmurhash2 with a random seed.

In either case, yes, these functions are vulnerable to a number of
attacks. We're solving the problem more completely than they did.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-04 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

This is not something that can be fixed by limiting the size of POST/GET. 

Parsing documents (even offline) can generate these problems. I can create 
books that calibre (a Python-based ebook format shifting tool) can't convert, 
but are otherwise perfectly valid for non-python devices. If I'm allowed to 
insert usernames into a database and you ever retrieve those in a dict, you're 
vulnerable. If I can post things one at a time that eventually get parsed into 
a dict (like the tag example), you're vulnerable. I can generate web traffic 
that creates log files that are unparsable (even offline) in Python if dicts 
are used anywhere. Any application that accepts data from users needs to be 
considered.

Even if the web framework has a dictionary implementation that randomizes the 
hashes so it's not vulnerable, the entire python standard library uses dicts 
all over the place. If this is a problem which must be fixed by the framework, 
they must reinvent every standard library function they hope to use.

Any non-trivial python application which parses data needs the fix. The entire 
standard library needs the fix if is to be relied upon by applications which 
accept data. It makes sense to fix Python.

Of course we must fix all the basic hashing functions in python, not just the 
string hash. There aren't that many. 

Marc-Andre:
If you look at my proposed code, you'll notice that we do more than simply 
shift the period of the hash. It's not trivial for an attacker to create 
colliding hash functions without knowing the key.

Since speed is a concern, I think that the proposal to avoid using the random 
hash for short strings is a good idea. Additionally, randomizing only some of 
the characters in longer strings will allow us to improve security without 
compromising speed significantly.

I suggest that we don't randomize strings shorter than 6 characters. For longer 
strings, we randomize the first and last 5 characters. This means we're only 
adding additional work to a max of 10 rounds of the hash, and only for longer 
strings. Collisions with the hash from short strings should be minimal.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-04 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

 My proposition only adds two XOR to hash(str) (outside the loop on Unicode 
 characters), so I expect a ridiculous overhead. I don't know yet how hard it 
 is to guess the secret from hash(str) output.

It doesn't work much better than a single random seed. Calculating the
hash of a null byte gives you the xor of your two seeds. An attacker
can still cause collisions inside the vulnerable hash function, your
change doesn't negate those internal collisions. Also, strings of all
null bytes collide trivially.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-04 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

 - for small strings we could use a different seed than for larger strings

Or just leave them unseeded with our existing algorithm. Shifting them
into a different part of the hash space doesn't really gain us much.

 - for larger strings we could use Paul's algorithm but limit the XOR op to 
 the first and last 16 elements instead of all elements.

Agreed. It does have to be both the first and the last though. We
can't just do one or the other.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-03 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

I agree that we should enable randomness by default, and provide an easy way 
for users to disable it if necessary (unit test suites that explicitly depend 
on order being an obvious candidate).

I'll link my proposed algorithm change here, for the record:
https://gist.github.com/0a91e52efa74f61858b5

I've gotten confirmation from several other sources that the fix recommended by 
the presenters (just a random initialization seed) only prevents the most basic 
form of the attack.

--
nosy: +PaulMcMillan

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13703] Hash collision security issue

2012-01-03 Thread Paul McMillan

Paul McMillan p...@mcmillan.ws added the comment:

A couple of things here:

First, my proposed change is not cryptographically secure. There simply aren't 
any cryptographic hashing algorithms available that are in the performance 
class we need. My proposal does make the collision attack quite difficult to 
carry out, even if the raw output values from the hash are available to the 
attacker (they should not usually be).

I favor using the same algorithm between 32 and 64 bit builds for consistency 
of behavior. Developers would be startled to find that ordering stays 
consistent on a 64 bit build but varies on 32 bit builds. Additionally, the 
impracticality of attacking of 64 bit builds rests on the fact that these 
particular researchers didn't devise a way to do it. I'd hate to make this 
change and then have a clever mathematician publish some elegant point 
requiring us to go fix the problem all over again. 

I could be convinced either way on small strings. I like that they can't be 
used to attack the secret. At the same time, I worry that combining 2 different 
hashing routines into the same output space may introduce unexpected collisions 
and other difficult to debug edge-case conditions. It also means that the order 
of the hashes of long strings will vary while the order of short strings will 
not - another inconsistency which will encourage bugs.

Thank you Victor for the improvements to the python demonstration code. As 
Antoine said, it's only demo code, but better demo code is always good.

Antoine: That section of the manpage is referring to the overall security of a 
key generated using urandom. 256 bits is overkill for this application. We 
could take 256 bits and use them to generate a key using a cryptographically 
appropriate algorithm, but it's simpler to read more bits and use them directly 
as the key.

Additionally, that verbiage has been in the man page for urandom for quite some 
time (probably since the earliest version in the mid 90's). The PRNG has been 
improved since then.

Minimum length of r is a hard question. The shorter it is, the higher the 
correlation of the output. In my tests, 16kb was the amount necessary to 
generally do reasonably well on my test suite for randomness even with 
problematic input. Obviously our existing output isn't random, so it doesn't 
pass those tests at all. Using a fairly small value (4k) should not make the 
results much worse from a security perspective, but might be problematic from a 
collision/distribution standpoint. It's clear that we don't need 
cryptographically good randomness here, but passing the test suite is not a bad 
thing when considering the distribution.

When we settle on a C implementation, I'd like to run it through the smhasher 
set of tests to make sure we aren't making distribution worse, especially for 
very small values of r.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13703
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com