[issue14621] Hash function is not randomized properly

2016-04-22 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
Removed message: http://bugs.python.org/msg263978

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2016-04-22 Thread SilentGhost

Changes by SilentGhost :


--
nosy: +Arfrever, Bob.Ziuchkovski, Giovanni.Bajo, PaulMcMillan, ReneSac, 
Vlado.Boza, alex, arigo, benjamin.peterson, bkabrda, camara, christian.heimes, 
cvrebert, dmalcolm, dstufft, gregory.p.smith, haypo, iElectric, isoschiz, 
konk, lemburg, mark.dickinson, ncoghlan, pconnell, sbermeister, 
serhiy.storchaka, Łukasz.Rekucki

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2016-04-22 Thread lissacoffey

lissacoffey added the comment:

How about Python grows a additional btree implementation in its collections 
module? I know that it's not going to fix existing code. However in the long 
run it's the best safeguard against hash collision attacks. I'm thinking about 
a simple, self balancing btree like red-black-tree. A quick search on Wikipedia 
also revealed Scapegoat and Splay tree with interesting properties.

Thanks
Lisa - python expert
mysite: http://www.fixithere.net/asos

--
nosy: +editor-buzzfeed -Arfrever, Bob.Ziuchkovski, Giovanni.Bajo, PaulMcMillan, 
ReneSac, Vlado.Boza, alex, arigo, benjamin.peterson, bkabrda, camara, 
christian.heimes, cvrebert, dmalcolm, dstufft, gregory.p.smith, haypo, 
iElectric, isoschiz, konk, lemburg, mark.dickinson, ncoghlan, pconnell, 
sbermeister, serhiy.storchaka, Łukasz.Rekucki

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-12-09 Thread Nick Coghlan

Nick Coghlan added the comment:

This issue has belatedly had a CVE assigned: CVE-2013-7040 (CPython hash 
secret can be recovered remotely)

3.4 is not affected (due to PEP 456), but 3.3 and 2.7 are still affected.

--
nosy: +bkabrda
status: pending - open
versions: +Python 2.7, Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-12-09 Thread Benjamin Peterson

Benjamin Peterson added the comment:

I think that's just WONTFIX at this point.

--
resolution:  - fixed
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-11-20 Thread Christian Heimes

Christian Heimes added the comment:

The issue has been solved for Python 3.4 with the integration of PEP 456.

--
stage:  - committed/rejected
status: open - pending

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-06-01 Thread Donald Stufft

Changes by Donald Stufft donald.stu...@gmail.com:


--
nosy: +dstufft

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-04-20 Thread Martin Morrison

Changes by Martin Morrison m...@ensoft.co.uk:


--
nosy: +isoschiz, pconnell

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-02 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Il giorno 02/gen/2013, alle ore 00:20, Domen Kožar rep...@bugs.python.org ha 
scritto:

 
 Domen Kožar added the comment:
 
 According to talk at 29c3: 
 http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html
 
 Quote: We also describe a vulnerability of Python's new randomized hash, 
 allowing an attacker to easily recover the 128-bit secret seed. As a reliable 
 fix to hash-flooding, we introduce SipHash, a family of cryptographically 
 strong keyed hash function competitive in performance with the weak hashes, 
 and already adopted in OpenDNS, Perl 5, Ruby, and in the Rust language.

That is exactly the vulnerability that was previously mentioned in the context 
of this bug. SipHash is currently the only solution for a collision-resistant 
fast-enough hash. 
-- 
Giovanni Bajo

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-02 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Il giorno 02/gen/2013, alle ore 06:52, Christian Heimes 
rep...@bugs.python.org ha scritto:

 
 Christian Heimes added the comment:
 
 Thanks for the information! I'm working on a PEP for the issue at hand.

Since you're collecting ideas on this, I would like to stress that, in the 
Python 3 transition, it was deemed acceptable to switch all objects to use 
unicode strings for attribute names, making the hash computation of such 
attributes (in the context of the instance dictionary) at least twice as slow 
than it used to be (the 'at least' refers to the fact that longer strings might 
have even worse effects because of a higher number of cache misses). SipHash 
isn't twice as slow as the current hash function, not even for short strings.

So there is a precedent in slowing down the hash computation time in a very 
important use case, and it doesn't look like hell froze over.
-- 
Giovanni Bajo

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-02 Thread Benjamin Peterson

Benjamin Peterson added the comment:

2013/1/2 Giovanni Bajo rep...@bugs.python.org:

 Giovanni Bajo added the comment:

 Il giorno 02/gen/2013, alle ore 06:52, Christian Heimes 
 rep...@bugs.python.org ha scritto:


 Christian Heimes added the comment:

 Thanks for the information! I'm working on a PEP for the issue at hand.

 Since you're collecting ideas on this, I would like to stress that, in the 
 Python 3 transition, it was deemed acceptable to switch all objects to use 
 unicode strings for attribute names, making the hash computation of such 
 attributes (in the context of the instance dictionary) at least twice as slow 
 than it used to be (the 'at least' refers to the fact that longer strings 
 might have even worse effects because of a higher number of cache misses). 
 SipHash isn't twice as slow as the current hash function, not even for short 
 strings.

 So there is a precedent in slowing down the hash computation time in a very 
 important use case, and it doesn't look like hell froze over.

It's probably not to bad for attribute names because a) they're short
b) they're interned c) the hash is cached.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-02 Thread Christian Heimes

Christian Heimes added the comment:

Giovanni, why do you think that hashing of unicode strings is slower than byte 
strings? 

First of all ASCII only unicode strings are packed and use just one byte per 
char. CPython's FNV implementation processes one element in each cycle, that is 
one byte for bytes and ASCII unicode, two bytes for UCS-2 and four bytes for 
UCS-4. Bytes and UCS-4 strings require the same amount of CPU instructions.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-02 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Il giorno 02/gen/2013, alle ore 19:51, Christian Heimes 
rep...@bugs.python.org ha scritto:

 
 Christian Heimes added the comment:
 
 Giovanni, why do you think that hashing of unicode strings is slower than 
 byte strings? 
 
 First of all ASCII only unicode strings are packed and use just one byte per 
 char. CPython's FNV implementation processes one element in each cycle, that 
 is one byte for bytes and ASCII unicode, two bytes for UCS-2 and four bytes 
 for UCS-4. Bytes and UCS-4 strings require the same amount of CPU 
 instructions.

Ah sorry, I stand corrected (though packing wasn't there in 3.0, was it? I was 
specifically referred to the 2.x - 3.0 transition).
-- 
Giovanni Bajo

My Blog: http://giovanni.bajo.it

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-02 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

See microbenchmark results in issue16427. Really hashing of ASCII/UCS1 strings 
more than 2x slower than bytes hashing.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-01 Thread Domen Kožar

Domen Kožar added the comment:

According to talk at 29c3: 
http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html

Quote: We also describe a vulnerability of Python's new randomized hash, 
allowing an attacker to easily recover the 128-bit secret seed. As a reliable 
fix to hash-flooding, we introduce SipHash, a family of cryptographically 
strong keyed hash function competitive in performance with the weak hashes, and 
already adopted in OpenDNS, Perl 5, Ruby, and in the Rust language.

--
nosy: +iElectric

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-01 Thread Christian Heimes

Christian Heimes added the comment:

Thanks for the information! I'm working on a PEP for the issue at hand.

--
assignee:  - christian.heimes

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2013-01-01 Thread Nick Coghlan

Nick Coghlan added the comment:

Bob, the hash invariant isn't a mere implementation detail, it is critical to 
making hash based data structures work properly - if two equal objects (say the 
integer zero and the float zero) ever end up in different hash bins, then the 
uniqueness property of dictionary keys and sets breaks down.

The three proposed mitigation strategies (using SipHash for string hashing, a 
tunable collision counting hash map and providing a non-hash based mapping 
container in the standard library) are all reasonable approaches to the problem 
and, most importantly, they're *orthogonal* approaches to the problem. There's 
nothing stopping us doing all three if someone is willing and able to provide 
the code.

--
nosy: +ncoghlan

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-12-02 Thread Bob Ziuchkovski

Bob Ziuchkovski added the comment:

Why not redefine -R to mean use secure hashing algorithms for built-in types?

When specified, use hashing algorithms that are secure against 
denial-of-service and other known attacks, at the possible expense of 
performance.  When not specified, use whatever hashing algorithms provide the 
most sensible defaults for every-day use (basically hash the way python 
currently hashes).

Secure hashing would apply not just to strings but to numeric and other types 
as well.  This would break the invariant of `x == y implies hash(x) == hash(y)` 
for numeric types that Mark mentioned.  However, that seems like an 
implementation detail that python users shouldn't rely upon.

--
nosy: +Bob.Ziuchkovski

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread René

René added the comment:

Christian Heimes: It has always been trivial to artificially generate 
collisions for fast hashes designed for hash tables, like MurmurHash. I 
wouldn't call Murmurhash3 busted because of that, as this was never a design 
goal. It is a known propriety of this type of hash: you do that basically 
running them backwards. You can't even call that cryptanalysis. 

Of course, you need the seed to search those collisions, but from this thread 
it seems very difficult, if not impossible, not to leak the random seed to the 
attacker. 

I see the various collision counting alternatives proposed as the less 
intrusive and definitive solution for this problem. It also has the benefit 
that it can work for any type of key. Some pseudo code:

hash as always, with a fast hash.
if reprobes  initial_threshold:
if the table has only one key type AND it has a robust comparison method:
store the collisions in a O(log n) worst case structure (tree).
elif the object has a secondary slow secure hash:
try searching/inserting the key again with the new secure hash.
It works like a double hashing reprobing hash table.
elif collisions  max_threshold:
raise Exception(Under attack or the object is using a crappy hash, 
with no fallback.)

The first option, the O(log n) structure, can be ignored as unnecessary 
complication (even though there is already a path implementing that), but I 
suspect it may be faster than a secure hash. If not, then there is really no 
point in this option, except if the secure hash proves to be not so secure.

--
nosy: +ReneSac

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Christian Heimes

Christian Heimes added the comment:

No, Murmur3 *is* busted. Some clever people have found a way to perform a 
universal multicollision attack, that's a key independent attack. An attacker 
doesn't need to know the seed for an attack.

Collision counting as not a solution for the issue, just a workaround. It has 
been proofed for decades that a tree data structure is not vulnerable to this 
kind of collision attacks. A hash function with crypto properties is the second 
best solution.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

However a tree data structure is vulnerable to sorted data.

May be worth it to have the two hash functions (switchable by interpreter 
option or environment variable), strong for applications which can be attacked, 
and fast for applications which run in safe environment?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread René

René added the comment:

Serhiy Storchaka: I said a O(log n) data structure, so I was referring to 
balanced trees, like AVL, RBT or B+-tree. They are not vulnerable to sorted 
data. The downside is that they need the keys to provide robust comparison 
methods (like, if z  y  x, then z  x). 

Christian Heimes: Right, the seed indeed doesn't provides protection...

But the collision counting is compatible with your two suggestions, and solves 
the problem. The only difference is that you don't get the performance hit if 
not under attack.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

René, a balanced tree requires rebalancing on every (or almost every) item for 
some special (sorted) data sequences.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Michal Petrucha

Michal Petrucha added the comment:

On Fri, Nov 30, 2012 at 08:06:32PM +, Serhiy Storchaka wrote:
 René, a balanced tree requires rebalancing on every (or almost
 every) item for some special (sorted) data sequences.

That's perfectly true and it holds for most unsorted sequences as
well -- that's why they are balanced. The fact that the tree is always
balanced guarantees that each rebalance takes at most O(log n)
operations.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 30.11.2012 21:06, Serhiy Storchaka wrote:
 
 Serhiy Storchaka added the comment:
 
 René, a balanced tree requires rebalancing on every (or almost every) item 
 for some special (sorted) data sequences.

Sure, but that's still O(N*log N) for an attack scenario, not O(N^2).

I think the main point here is that using hash tables or dictionaries
for these things without any size limit is simply a wrong development
approach.

Developers need to be made aware of the problem and given data
structures that they can use more safely to store the data and
instead of trying to hide away the problem using some crypto
approach, it's better to offer methods that allow developers to
gain control over the problem (e.g. via an exception) rather than
hoping for few hash collisions.

If a developer has to build a lookup table from untrusted data,
she should be able to say:

try:
mapping = insert_untrusted_data(source)
except UnderAttackError:
return 'no thank you'

instead of:

# Hmm, let's hope this doesn't bomb...
mapping = insert_untrusted_data(source)

At the moment, the best thing you can do is insert the data
in chunks and measure the time it takes for each chunk. If it
takes too long, you raise the UnderAttackError.

If we make the collision counting limit a per-dictionary adjustable
limit with some global default limit, the above could be written
as:

try:
mapping = {}
mapping.max_collisions = 100
mapping.update(source)
except CollisionLimitError:
return 'no thank you'

Aside: It's better to know you worst case and program for it, rather
than to ignore the problem and hope an attacker won't find your secret
key. In the above case, if you know what the worst-case timing is for
a 100-item dictionary, you can safely deal with it, possibly adjusting
the limit to more suitable value for your application.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 30 2012)
 Python Projects, Consulting and Support ...   http://www.egenix.com/
 mxODBC.Zope/Plone.Database.Adapter ...   http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2012-11-28: Released eGenix mx Base 3.2.5 ... http://egenix.com/go36
2013-01-22: Python Meeting Duesseldorf ... 53 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! 

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread René

René added the comment:

Serhiy Storchaka: Yes, but it is still O(log n) worst case. Even in the worst 
case rebalancing, you only need to walk up/down rotating/spliting every node in 
your path. As the tree height is guaranteed to be x * log n (x from 1 to 2, 
depending on the algorithm), the rebalancing operation is aways limited by 
O(log n).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 Serhiy Storchaka: Yes, but it is still O(log n) worst case. Even in the
 worst case rebalancing, you only need to walk up/down rotating/spliting
 every node in your path. As the tree height is guaranteed to be x * log n
 (x from 1 to 2, depending on the algorithm), the rebalancing operation is
 aways limited by O(log n).

Agree. However I think that for large enough data a balanced tree is slower 
than a hashtable with any slow hash.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 try:
 mapping = {}
 mapping.max_collisions = 100
 mapping.update(source)
 except CollisionLimitError:
 return 'no thank you'

May be use a more general solution?

try:
with run_with_timeout(timeout=100, timer=collisions_count):
mapping = insert_untrusted_data(source)
except TimeoutError:
return 'no thank you'

(You can can use different measurement for timeout: user time, real time, ticks 
count, collisions count, or even a user defined timer).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-30 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 30.11.2012 22:27, Serhiy Storchaka wrote:
 
 Serhiy Storchaka added the comment:
 
 try:
 mapping = {}
 mapping.max_collisions = 100
 mapping.update(source)
 except CollisionLimitError:
 return 'no thank you'
 
 May be use a more general solution?
 
 try:
 with run_with_timeout(timeout=100, timer=collisions_count):
 mapping = insert_untrusted_data(source)
 except TimeoutError:
 return 'no thank you'
 
 (You can can use different measurement for timeout: user time, real time, 
 ticks 
 count, collisions count, or even a user defined timer).

Sure, as long as there's a way to break into the execution,
any such method would do.

The problem is that you'd have to check for the timeout at some
point and the .update() call is running completely in C, so
the only way to break into execution is either by explicitly
adding a runtime check to the code, or use a signal (which is
a can of worms better avoided :-)).

Collision counting is one such method of detecting that
something is likely not working according to plan, but it's
really only another way of implementing the explicit runtime
check. Using other counters or timers would work just as well.

As long as you know that there are places in your code that can
produce CPU time overloads, you can address those issues.

The dictionary implementation is one such place, where you
can run into the problem, but there are usually many other
such areas in more complex applications as well, e.g. calculations
that enter endless loops for some parameters, deadlocks,
I/O operations that take unusually long (e.g. due to disk
errors), poorly written drivers of all sorts, etc. etc.

If there's a way to solve all these things in general
and without explicit runtime checks, I'd like to know :-)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-29 Thread Łukasz Rekucki

Łukasz Rekucki added the comment:

Note that a few weeks ago, Ruby has switched from Murmur to SipHash for this 
exact reason:

http://www.ruby-lang.org/en/news/2012/11/09/ruby19-hashdos-cve-2012-5371/

--
nosy: +Łukasz.Rekucki

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-11 Thread STINNER Victor

STINNER Victor added the comment:

Did qomeone start to write a PEP?
Le 11 nov. 2012 05:56, Chris Rebert rep...@bugs.python.org a écrit :


 Chris Rebert added the comment:

 What about CityHash? (http://code.google.com/p/cityhash/ ; unofficial C
 port: http://code.google.com/p/cityhash-c/ )
 It's good enough for Google...

 --
 nosy: +cvrebert

 ___
 Python tracker rep...@bugs.python.org
 http://bugs.python.org/issue14621
 ___


--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-11 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Il giorno 11/nov/2012, alle ore 05:56, Chris Rebert rep...@bugs.python.org ha 
scritto:

 
 Chris Rebert added the comment:
 
 What about CityHash? (http://code.google.com/p/cityhash/ ; unofficial C port: 
 http://code.google.com/p/cityhash-c/ )
 It's good enough for Google...

It's good enough for Google in a context that does not require protection 
against collision attacks. If you have a look at SipHash' page, you will find a 
program to generate collisions to CityHash.
-- 
Giovanni Bajo

My Blog: http://giovanni.bajo.it

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-10 Thread Gregory P. Smith

Gregory P. Smith added the comment:

People have been posting micro-benchmarks (often run wrong) rather than actual 
useful benchmarks.  Running our real world benchmarks would be more interesting.

--
nosy: +gregory.p.smith

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-10 Thread Chris Rebert

Chris Rebert added the comment:

What about CityHash? (http://code.google.com/p/cityhash/ ; unofficial C port: 
http://code.google.com/p/cityhash-c/ )
It's good enough for Google...

--
nosy: +cvrebert

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-08 Thread Sasha B

Sasha B added the comment:

Ruby uses the Murmur hash for some types (string  integer at least): 
http://murmurhash.googlepages.com/
src: http://stackoverflow.com/a/3270836/1332819

The Perl hash implementation: 
http://cpansearch.perl.org/src/NWCLARK/perl-5.8.8/hv.c

PHP5 hash implementation: http://lxr.php.net/xref/PHP_5_4/ext/hash/hash.c

The Probe() function for V8's Javascript implementation is HW-specific:
Hash functions: 
http://code.google.com/searchframe#W9JxUuHYyMg/trunk/src/hashmap.hq=Probe%20package:v8%5C.googlecode%5C.coml=134
Probe() function: 
http://code.google.com/searchframe#search%26q%3DProbe%20package%3Av8%5C.googlecode%5C.com

--
nosy: +sbermeister

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-08 Thread Christian Heimes

Christian Heimes added the comment:

I considered MurMur a year ago, too. Nowadays I don't think it's an option 
anymore. JPA and DJB have released a C++ program that is able to generate lots 
of collisions:

https://www.131002.net/siphash/
C++ program to find universal (key-independent) multicollisions for MurmurHash3

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-08 Thread Christian Heimes

Christian Heimes added the comment:

From the header of murmurcollisions.cc:

 * multicollisions for MurmurHash3
 *
 * MurmurHash3 C++ implementation is available at 
 * http://code.google.com/p/smhasher/wiki/MurmurHash3
 *
 * the function Murmur3Multicollisions finds many different inputs
 * hashing to the same 32-bit value (multicollision)
 * 
 * example output:
 * 32-bit seed 7a0e823a
 * 4-multicollision
 * 16-byte inputs
 * MurmurHash3_x86_32( bdd0c04b5c3995827482773b12acab35 ) = 94d7cf1b
 * MurmurHash3_x86_32( 652fa0565c3946be7482773b12acab35 ) = 94d7cf1b
 * MurmurHash3_x86_32( bdd0c04b5c399582cc23983012ac5c71 ) = 94d7cf1b
 * MurmurHash3_x86_32( 652fa0565c3946becc23983012ac5c71 ) = 94d7cf1b
 *
 * the multicollisions found are universal: they work for any seed/key
 *
 * authors:
 * Jean-Philippe Aumasson, Daniel J. Bernstein

I consider MurMur3 busted and unsuitable for our purpose.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Here's a demo patch (against Python 2.7) which counts hash value collisions and 
slot collisions. I had posted that in the original ticket where we discussed 
the hash problem (http://bugs.python.org/issue14621).

This avoids issues like attack 1 mentioned in 
http://mail.python.org/pipermail/python-dev/2012-January/115726.html

Attack 2 in that email can easily be worked around by reducing the collision 
limit to a smaller number.

Even better: An application could even dynamically adjust the maximum collision 
counts by catching the exception and setting a new upper limit depending on its 
knowledge of the field of application - warning the sysadmin of a potential 
problem and allowing her to take action. That way the application could start 
with a low safe maximum collision number of say 100 and then raise the limit in 
a controlled way.

BTW: When trying out new hash functions, you need to look not only at the 
performance of the hash function, but also (and more importantly) at the effect 
on dictionaries.

Just as reminder: The integer key problem is still open. Using the demo script 
http://bugs.python.org/file24300/integercollision.py, it's easy to keep Python 
going for minutes without any major effort.

I don't understand why we are only trying to fix the string problem and 
completely ignore other key types. Strings are easy to send to a web server, 
yes, but there are other applications out there which take input data from 
other sources/formats as well (e.g. csv files). And it's not unusual to convert 
input strings to integers to use them as dictionary keys, say item IDs or 
counts. So while the string keys may not cause a problem, the integer keys 
still might.

--
keywords: +patch
Added file: http://bugs.python.org/file27917/hash-attack-3.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 07.11.2012 09:34, Marc-Andre Lemburg wrote:
 
 Here's a demo patch (against Python 2.7) which counts hash value collisions 
 and slot collisions. I had posted that in the original ticket where we 
 discussed the hash problem (http://bugs.python.org/issue14621).

Sorry, wrong URL. The correct one is http://bugs.python.org/issue13703

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Nov 07 2012)
 Python Projects, Consulting and Support ...   http://www.egenix.com/
 mxODBC.Zope/Plone.Database.Adapter ...   http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Until it's broken with a yet-unknown attack, SipHash is a pseudo-random 
function and as such it does uniformly distribute values across the output 
space, and never leak any information on the key (the randomized seed). Being 
designed by cryptographers, it is likely that it doesn't turn out to be a 
fail like the solution that was just released (no offense intended, but it's 
been a large-scale PR failure).

As long as we don't introduce bias while reducing SipHash's output to fit the 
hash table size (so for instance, usage of modulus is not appropriate), then 
the hash function should behave very well.

Any data type can be supplied to SipHash, including numbers; you just need to 
take their (platform-dependent) memory representation and feed it to SipHash. 
Obviously it will be much much slower than the current function which used to 
be hash(x) = x (before randomization), but that's the price to pay to avoid 
security issues.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Armin Rigo

Armin Rigo added the comment:

Marc-André: estimating the risks of giving up on a valid query for a truly 
random hash, at an overestimated one billion queries per second, in a 2/3 full 
dictionary:

* for 1000: 4E159 years between mistakes

* for 100: 12.9 years between mistakes

* for 150: 8E9 years between mistakes

* for 200: 5E18 years between mistakes

So while it seems that 100 might be a bit too small, using 150 to 200 is 
perfectly safe (and that's perfect in the sense that a computer will 
encounter random hardware errors at a higher rate than that).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 07.11.2012 12:06, Armin Rigo wrote:
 
 Armin Rigo added the comment:
 
 Marc-André: estimating the risks of giving up on a valid query for a truly 
 random hash, at an overestimated one billion queries per second, in a 2/3 
 full dictionary:
 
 * for 1000: 4E159 years between mistakes
 
 * for 100: 12.9 years between mistakes
 
 * for 150: 8E9 years between mistakes
 
 * for 200: 5E18 years between mistakes
 
 So while it seems that 100 might be a bit too small, using 150 to 200 is 
 perfectly safe (and that's perfect in the sense that a computer will 
 encounter random hardware errors at a higher rate than that).

I used the 1000 limit only as example. In tests Victor and I ran (see the
original ticket from a few months ago), 200 turned out to be a reasonable
number for the default maximum hash collision value.

I'm not sure about the slot collision limit. We'd have to run more tests
on those.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Il giorno 07/nov/2012, alle ore 08:40, Serhiy Storchaka 
rep...@bugs.python.org ha scritto:

 Serhiy Storchaka added the comment:
 
 I tested different kind of strings.
 
 $ ./python -m timeit -n 1 -s t = b'a' * 10**8  hash(t)
 $ ./python -m timeit -n 1 -s t = 'a' * 10**8  hash(t)
 $ ./python -m timeit -n 1 -s t = '\u0100' * 10**8  hash(t)
 $ ./python -m timeit -n 1 -s t = '\U0001' * 10**8  hash(t)
 
   current   SipHash
 
 bytes  181 msec  453 msec  2.5x
 UCS1   429 msec  453 msec  1.06x
 UCS2   179 msec  897 msec  5x
 UCS4   183 msec  1.79 sec  9.8x

Hi Serhiy,

can you please attach the generated assembly code for the siphash function with 
your compiler and your optimization flags (that is, the one that produces the 
above results)?

Thanks!
-- 
Giovanni Bajo

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 can you please attach the generated assembly code for the siphash function 
 with your compiler and your optimization flags (that is, the one that 
 produces the above results)?

GCC (Ubuntu 4.4.3-4ubuntu5.1) options:

-pthread -c -Wno-unused-result -DNDEBUG -g -fwrapv -O3 -Wall 
-Wstrict-prototypes   -I. -IInclude -I./Include-DPy_BUILD_CORE

32-bit Linux on AMD Athlon(tm) 64 X2 Dual Core Processor 4600+.

--
Added file: http://bugs.python.org/file27919/_Py_Hash_Sip24.S

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Christian Heimes

Christian Heimes added the comment:

Serhiy, the performance of hash() for long strings isn't very relevant for the 
general performance of a Python program. Short strings dominate. I've modified 
the timeit to create a new string object every time.

for I in 5 10 15 20 30 40 50 60; do echo -ne $I\t; ./python -m timeit 
-n10 -r30 -s h = hash; x = 'ä' * $I -- h(x + 'a') | awk '{print $6}' ; 
done

ASCII:
#   SIPFNV
5   0.112  0.0979
10  0.115  0.103
15  0.12   0.107
20  0.124  0.112
30  0.126  0.127
40  0.136  0.142
50  0.142  0.147
60  0.146  0.159

UCS-2:
#   SIPFNV
5   0.114  0.0977
10  0.117  0.0988
15  0.12   0.11
20  0.126  0.109
30  0.13   0.122
40  0.14   0.132
50  0.144  0.147
60  0.152  0.157

For short strings the additional round and setup costs make hash() about 10% 
slower. For long strings SIP is faster as it processes 8 bytes at once instead 
of 1 to 4 bytes.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Mark Dickinson

Mark Dickinson added the comment:

[MAL]
 I don't understand why we are only trying to fix the string problem
 and completely ignore other key types.

[Armin]
 estimating the risks of giving up on a valid query for a truly random
 hash, at an overestimated one billion queries per second ...

That's fine in principle, but if this gets extended to integers, note that our 
current integer hash is about as far from 'truly random' as you can get:

Python 3.4.0a0 (default:f02555353544, Nov  4 2012, 11:50:12) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type help, copyright, credits or license for more information.
 [hash(i) for i in range(20)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Moreover, it's going to be *very* hard to change the int hash while preserving 
the `x == y implies hash(x) == hash(y)` invariant across all the numeric types 
(int, float, complex, Decimal, Fraction, 3rd-party types that need to remain 
compatible).

--
nosy: +mark.dickinson

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 07.11.2012 12:55, Mark Dickinson wrote:
 
 Mark Dickinson added the comment:
 
 [MAL]
 I don't understand why we are only trying to fix the string problem
 and completely ignore other key types.
 
 [Armin]
 estimating the risks of giving up on a valid query for a truly random
 hash, at an overestimated one billion queries per second ...
 
 That's fine in principle, but if this gets extended to integers, note that 
 our current integer hash is about as far from 'truly random' as you can get:
 
 Python 3.4.0a0 (default:f02555353544, Nov  4 2012, 11:50:12) 
 [GCC 4.2.1 (Apple Inc. build 5664)] on darwin
 Type help, copyright, credits or license for more information.
  [hash(i) for i in range(20)]
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
 
 Moreover, it's going to be *very* hard to change the int hash while 
 preserving the `x == y implies hash(x) == hash(y)` invariant across all the 
 numeric types (int, float, complex, Decimal, Fraction, 3rd-party types that 
 need to remain compatible).

Exactly. And that's why trying to find secure hash functions isn't
going to solve the problem. Together with randomization they may
make things better for strings, but they are no solution for numeric
types, and they also don't allow detecting possible attacks on your
systems.

But yeah, I'm repeating myself :-)

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Mark Dickinson

Mark Dickinson added the comment:

And I'm probably repeating myself too, but: the predictability of (and 
difficulty of changing of) hashing for numeric types is why I'm strongly 
opposed to hash collision / slot collision limits:  they'd end up disallowing 
reasonably natural looking Python numeric sets (e.g. {2**k for k in range(n)} 
for smallish n).  I don't think core Python should be solving this issue at 
all---I think that's a job for the web frameworks.  Christian's idea of 
providing more suitable types in the std. lib. sounds like the right direction 
to me.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 07.11.2012 13:06, Mark Dickinson wrote:
 
 Mark Dickinson added the comment:
 
 And I'm probably repeating myself too, but: the predictability of (and 
 difficulty of changing of) hashing for numeric types is why I'm strongly 
 opposed to hash collision / slot collision limits:  they'd end up disallowing 
 reasonably natural looking Python numeric sets (e.g. {2**k for k in range(n)} 
 for smallish n).  I don't think core Python should be solving this issue at 
 all---I think that's a job for the web frameworks.  Christian's idea of 
 providing more suitable types in the std. lib. sounds like the right 
 direction to me.

I definitely agree on that last sentence. Having more suitable data
types in Python (like e.g. tries, b-trees or red-black-trees) would certainly
be a better solution than trying to build everything into dictionaries.

Nice comparison:
http://en.wikipedia.org/wiki/Trie

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

See issue16427.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Armin Rigo

Armin Rigo added the comment:

I won't try to influence the outcome of this discussion, but I'd like to 
correct myself: in the measures I posted, true randomness is not needed at 
all.  The exact criterion might be hard to pin down, but as a first 
approximation, we get the same answers as long as most keys have different 
hashes, as all the bits of the hash are used by the dict lookup in only a few 
iterations.  No two small ints have the same hash, by construction.  You can 
build a sequence of (long) integers that have all exactly the same hash, but 
doing that is not as easy as 2**k.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 Serhiy, the performance of hash() for long strings isn't very relevant for 
 the general performance of a Python program.

It exposes the raw speed of hashing algorithm.  It is good as a first estimate, 
because more real cases require more sophisticated measurements.

 Short strings dominate. I've modified the timeit to create a new string 
 object every time.

timeit is absolutely not suitable for this.  Need to write a C program that 
uses the Python C API.

 for I in 5 10 15 20 30 40 50 60; do echo -ne $I\t; ./python -m timeit 
 -n10 -r30 -s h = hash; x = 'ä' * $I -- h(x + 'a') | awk '{print $6}' 
 ; done

Please, do not be fooled by the wrong measurements. You measure the height of 
the building together with the hill, on which it stands. Use -n1 and you will 
see a 
completely different numbers.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Armin Rigo

Armin Rigo added the comment:

Wrong, sorry.  On a 32-bit Python 2.7, (2**32-1)*n has the same hash -2, for 
any value of n.

Of course if you build a dict containing thousands of such integers as keys, 
then right now you get unexpectedly bad performance.  I wonder if I should open 
another bug report about that --- the hash of longs should be slightly more 
random-looking...

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Mark Dickinson

Mark Dickinson added the comment:

[Armin]
 You can build a sequence of (long) integers that have all exactly the
 same hash, but doing that is not as easy as 2**k.

Sure it is.  The hash for integers is (by design) repeated modulo a number of 
the form 2**n - 1:  we use 2**61 - 1 on 64-bit systems and 2**31 - 1 on 32-bit. 
 So in {2**k for k in range(n)} you get at most 61 distinct hash values.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-07 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Il giorno 07/nov/2012, alle ore 12:59, Marc-Andre Lemburg 
rep...@bugs.python.org ha scritto:

 
 Marc-Andre Lemburg added the comment:
 
 On 07.11.2012 12:55, Mark Dickinson wrote:
 
 Mark Dickinson added the comment:
 
 [MAL]
 I don't understand why we are only trying to fix the string problem
 and completely ignore other key types.
 
 [Armin]
 estimating the risks of giving up on a valid query for a truly random
 hash, at an overestimated one billion queries per second ...
 
 That's fine in principle, but if this gets extended to integers, note that 
 our current integer hash is about as far from 'truly random' as you can get:
 
Python 3.4.0a0 (default:f02555353544, Nov  4 2012, 11:50:12) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type help, copyright, credits or license for more information.
 [hash(i) for i in range(20)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
 
 Moreover, it's going to be *very* hard to change the int hash while 
 preserving the `x == y implies hash(x) == hash(y)` invariant across all the 
 numeric types (int, float, complex, Decimal, Fraction, 3rd-party types that 
 need to remain compatible).
 
 Exactly. And that's why trying to find secure hash functions isn't
 going to solve the problem. Together with randomization they may
 make things better for strings, but they are no solution for numeric
 types, and they also don't allow detecting possible attacks on your
 systems.
 
 But yeah, I'm repeating myself :-)
 

I don't see how it follows. Python has several hash functions in its core, one 
of which is the string hash function; it is currently severely broken from a 
security standpoint; it also happens to be probably the most common case for 
dictionaries in Python, and the ones that it is more easily exploited in web 
frameworks. 

If we can manage to fix the string hash function (eg: through SipHash) we will 
be one step further in mitigating the possible attacks.

Solving collisions and mitigating attacks on numeric types is a totally 
different problem because it is a totally different function. I suggest we keep 
different discussions and different bugs for it. For instance, I'm only 
personally interested in mitigating attacks on the string hash function.
-- 
Giovanni Bajo

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Benjamin Peterson

Benjamin Peterson added the comment:

Here's the message that helped convince us to go against collision counting 
originally: http://mail.python.org/pipermail/python-dev/2012-January/115726.html

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread John M Camara

John M Camara added the comment:

How about using a secure hash algorithm that's implemented in HW when 
available.  It doesn't eliminate the issue on systems that lack this support 
but at least it limits the scope of the problem.

Of course some testing would need to be done to make sure the hardware hashing 
doesn't have a significant impact on performance.

--
nosy: +camara

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Alex Gaynor

Changes by Alex Gaynor alex.gay...@gmail.com:


--
nosy: +alex

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Christian Heimes

Christian Heimes added the comment:

Our hash randomization will always leak some information about the 
randomization keys. The only way to properly secure our secrets is a 
cryptographic secure algorithms, for example a crypto hashing function in 
combination with a message authentication code like HMAC. I don't have to 
explain how that is going to hurt performance ...

We can try to make it harder to guess the secret parts with a slightly modified 
algorithm like e.g. V8's hash but that's never going to be 100% secure. But 
might be secure enough to make an attack too hard.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Christian Heimes

Christian Heimes added the comment:

I deem hash randomization and collision counting as a poor man's workaround for 
the actual issue. Perhaps we shouldn't try too hard to fix an unsuitable data 
type. Hash maps have a known worst case complexity of O(n). A O(log n) 
algorithm should be used to parses and malicious key/value pairs.

How about Python grows a additional btree implementation in its collections 
module? I know that it's not going to fix existing code. However in the long 
run it's the best safeguard against hash collision attacks. I'm thinking about 
a simple, self balancing btree like red-black-tree. A quick search on Wikipedia 
also revealed Scapegoat and Splay tree with interesting properties.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Giovanni Bajo

Giovanni Bajo added the comment:

Christian, there are good semi-crypto hash functions that don't leak as bad as 
Python's own modified FNV hash, without going all the way to HMAC.

SipHash has very good collision resistance and doesn't leak anything:
https://www.131002.net/siphash/
(notice: they distribute a python program to recover python's seed)

It's obviously slower than Python's FNV, but it's hard to beat a 
sum+multiplication per character.

--
nosy: +Giovanni.Bajo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Christian Heimes

Christian Heimes added the comment:

Thanks!

SipHash looks interesting. It's using a XOR + ROT approach with a seed. And 
it's written by DJB which is usually a good sign. He writes secure code with 
good quality. Just his coding style tends to be ... unique. :)

I'm going to try the algorithm.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Christian Heimes

Christian Heimes added the comment:

I modified crypto_auth() a bit:

Py_uhash_t crypto_auth(const unsigned char *in, unsigned long long inlen)
  ...
  u64 k0 = _Py_HashSecret.prefix;
  u64 k1 = _Py_HashSecret.suffix;
  ...
  return (Py_uhash_t)b;

and replaced the loop in _Py_HashBytes() with a call to crypto_auth(). For 
large strings SipHash is as faster as our current algorithm on my 64bit box. 
That was to be expected as SipHash works on blocks of 8 bytes while the default 
algorithm can't be optimized with SIMD instructions.

Current hashing algorithm:
$ ./python -m timeit -s x = b'a' * int(1E7) hash(x)
100 loops, best of 3: 0.39 usec per loop

SipHash:
$ ./python -m timeit -s x = b'a' * int(1E7) hash(x)
100 loops, best of 3: 0.381 usec per loop

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Giovanni Bajo

Giovanni Bajo added the comment:

For short strings, you might want to have a look at the way you fetch the final 
partial word from memory.

If the string is = 8 bytes, you can fetch the last partial word as an 
unaligned memory fetch followed by a shift, instead of using a switch like in 
the reference code.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Christian Heimes

Christian Heimes added the comment:

We can explore the various optimization options later. Also unaligned memory 
address is not allowed on some architectures like SPARC.

If somebody likes to play with the algorithm:
http://hg.python.org/sandbox/cheimes/shortlog/2cb7e97ca8d0

--
hgrepos: +159

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

$ ./python -m timeit -s x = b'a' * int(1E7) hash(x)

Note that hash calculated only once.  Add -n 1 option and use a larger data.

 If somebody likes to play with the algorithm:

$ ./python -m timeit -n 1 -s t = 'abcdefgh' * 10**8  hash(t)
1 loops, best of 3: 4.86 sec per loop

Current hash algorithm runs 3.43 sec, V8's algorithm runs 2.44 sec.

With simple optimization I got 3.62 sec, only 6% slower than the current.

  #define U8TO64_LE(p) ((u64)((u32 *)(p))[0] | ((u64)((u32 *)(p))[1]  32))

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Christian Heimes

Christian Heimes added the comment:

Thanks to Snakebit I was able to tests the code on a 32bit BSD installation 
with GCC 4.2. The ASCII unicode and bytes performance is about 8% slower, UCS2 
unicode is about 37% slower. There might be room for improvements, though.

% ./python -m timeit -r20 -n100 -s h = hash; x = 'a' * 10**7 -- h(x)
Current:
100 loops, best of 20: 0.109 usec per loop
SipHash:
100 loops, best of 20: 0.118 usec per loop

% ./python -m timeit -r20 -n100 -s h = hash; x = 'ä' * 10**7 -- h(x)
Current:
100 loops, best of 20: 0.119 usec per loop
SipHash:
100 loops, best of 20: 0.163 usec per loop

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Christian Heimes

Christian Heimes added the comment:

Serhiy's trick

#define U8TO64_LE(p) ((u64)((u32 *)(p))[0] | ((u64)((u32 *)(p))[1]  32))

gives a nice speedup. Now UCS2 is down to 0.133 usec (12% slower than the 
current algorithm) and ASCII down to 0.105 usec (3% faster).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

% ./python -m timeit -r20 -n100 -s h = hash; x = 'a' * 10**7 -- h(x)

Here is only one hash calculation and 99 cached calls.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-11-06 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I tested different kind of strings.

$ ./python -m timeit -n 1 -s t = b'a' * 10**8  hash(t)
$ ./python -m timeit -n 1 -s t = 'a' * 10**8  hash(t)
$ ./python -m timeit -n 1 -s t = '\u0100' * 10**8  hash(t)
$ ./python -m timeit -n 1 -s t = '\U0001' * 10**8  hash(t)

   current   SipHash

bytes  181 msec  453 msec  2.5x
UCS1   429 msec  453 msec  1.06x
UCS2   179 msec  897 msec  5x
UCS4   183 msec  1.79 sec  9.8x

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-22 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 21.10.2012 23:42, STINNER Victor wrote:
 
 STINNER Victor added the comment:
 
 It's interesting to note how this whole -R discussion made very long
 threads on python-dev, and python-dev has subsequently ignored (for the
 past 6 months!) the fact that their fix can be worked around in a matter
 of minutes.
 
 No, this issue has no been ignored. Nobody proposed anything to fix this
 issue, but we are still working on it (sometimes in private).
 
 In my opinion, we cannot solve this issue without slowing down python. Or I
 don't know yet.a.fast and secure hash algorithm. I don't really want to
 slow down Python for one specific issue whereas there are so many other
 ways to DoS a (web) server.

Well, I did propose a different approach to the whole problem to
count collisions. That would have avoided the usability issues you
have with the randomization approach, made it possible for the
application to detect the attack and not have introduced any significant
runtime overhead for applications not being attacked.

The proposal was shot down with the argument that it wouldn't
fix the problem.

It should also be noted that the randomization only applies to
strings/bytes, dictionaries with other colliding keys are not protected
at all.

Perhaps it's time to revisit the collision counting idea ?

It would work in much the same way as the stack recursion limit
we have in Python.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 22 2012)
 Python Projects, Consulting and Support ...   http://www.egenix.com/
 mxODBC.Zope/Plone.Database.Adapter ...   http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

2012-09-27: Released eGenix PyRun 1.1.0 ...   http://egenix.com/go35
2012-09-26: Released mxODBC.Connect 2.0.1 ... http://egenix.com/go34
2012-09-25: Released mxODBC 3.2.1 ... http://egenix.com/go33
2012-10-23: Python Meeting Duesseldorf ...  tomorrow

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/

--
nosy: +lemburg

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-21 Thread Armin Rigo

Armin Rigo added the comment:

Just to make it extra clear: Vlado showed that the -R switch of Python can 
easily be made fully pointless, with only a bit of extra work.  Here is how:

* Assume you have an algo that gives you as many strings with colliding hashes 
as you want, provided you know the last 8 bits of the secret prefix.

* Say you want to attack a web server.  You send it 256 requests, each with 100 
strings that have identical hash for one of the 256 possible values.  You 
measure which one is significantly slower than the others.

* From there you are back in the original situation: you know which of the 256 
values to pick, so you can make the web server crawl by sending it a large 
number of strings that have identical hashes for this particular value.

It's interesting to note how this whole -R discussion made very long threads on 
python-dev, and python-dev has subsequently ignored (for the past 6 months!) 
the fact that their fix can be worked around in a matter of minutes.

(For information, I'm sure that if the algorithm is improved to depend on all 
32 or 64 bits of the prefix, it would still be easy to crack it.  You don't 
actually need to send 2**32 or 2**64 requests to the web server: with careful 
design you can send only 32 or 64 requests that each leak one bit of 
information.  Doing that requires a bit more research, but once the recipe is 
known, it can be freely reused, which seems to defeat the original point.)

--
nosy: +arigo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-21 Thread Armin Rigo

Armin Rigo added the comment:

For reference, the above means that we can implement -R support for PyPy as a 
dummy ignored flag, and get security that is very close to CPython's.  :-)

--
keywords: +easy

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-21 Thread Benjamin Peterson

Benjamin Peterson added the comment:

That doesn't make it any easy CPython issue. :)

--
keywords:  -easy

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-21 Thread Christian Heimes

Christian Heimes added the comment:

As far as my understanding goes the issue can't be solved with our current hash 
algorithm. We'd have to use a crypto hash function or at least a hash algorithm 
that has an increased avalanche effect on the outcome. The current hash 
algorithm is designed and optimized for speed and not for security. Any other 
algorithm is going to slow down hashing.

Small strings and strings with lots of NUL bytes may leak too many information, 
too.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-21 Thread STINNER Victor

STINNER Victor added the comment:

 It's interesting to note how this whole -R discussion made very long
threads on python-dev, and python-dev has subsequently ignored (for the
past 6 months!) the fact that their fix can be worked around in a matter
of minutes.

No, this issue has no been ignored. Nobody proposed anything to fix this
issue, but we are still working on it (sometimes in private).

In my opinion, we cannot solve this issue without slowing down python. Or I
don't know yet.a.fast and secure hash algorithm. I don't really want to
slow down Python for one specific issue whereas there are so many other
ways to DoS a (web) server.

How do other languages (say Perl, Ruby, PHP, Javascript) handle this issue?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-21 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

 $ ./python -m timeit -s t = 'abcdefgh' * int(1E8) hash(t)

I got another numbers (32-bit Linux, AMD Athlon 64 X2 4600+).

Python's current hash algorithm:
10 loops, best of 3: 343 msec per loop

V8's algorithm:
10 loops, best of 3: 244 msec per loop

--
nosy: +serhiy.storchaka

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-17 Thread Christian Heimes

Christian Heimes added the comment:

I've modified unicodeobject's unicode_hash() function. V8's algorithm is about 
55% slower for a 800 MB ASCII string on my box.

Python's current hash algorithm for bytes and unicode:

   while (--len = 0)
x = (_PyHASH_MULTIPLIER * x) ^ (Py_uhash_t) *P++;

$ ./python -m timeit -s t = 'abcdefgh' * int(1E8) hash(t)
10 loops, best of 3: 94.1 msec per loop


V8's algorithm:

while (--len = 0) {
x += (Py_uhash_t) *P++;
x += ((x + (Py_uhash_t)len)  10);
x ^= (x  6);
}

$ ./python -m timeit -s t = 'abcdefgh' * int(1E8) hash(t)
10 loops, best of 3: 164 msec per loop

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-10-11 Thread Christian Heimes

Changes by Christian Heimes li...@cheimes.de:


--
nosy: +christian.heimes

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-26 Thread STINNER Victor

STINNER Victor victor.stin...@gmail.com added the comment:

 One possible fix: ...
 Main loop looks like this: ..

Well, it was decided to not impact performances to workaround one specific 
class of attack, whereas there are many other ways to DoS Python. So we chose 
to keep the main loop unchanged. Randomizing the hash is not a fix for the hash 
DoS, it only makes the attacker harder.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-26 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@gmail.com:


Removed file: http://bugs.python.org/file25282/hash.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-26 Thread STINNER Victor

STINNER Victor victor.stin...@gmail.com added the comment:

Oops, I attached the wrong hash.py file.

--
Added file: http://bugs.python.org/file25375/hash.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-26 Thread Benjamin Peterson

Benjamin Peterson benja...@python.org added the comment:

We should see if more security can be gained without sacrificing speed.

--
nosy: +benjamin.peterson

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-26 Thread STINNER Victor

STINNER Victor victor.stin...@gmail.com added the comment:

 Problem with current randomization of hash function
 is following: Suffix does not influence whether two keys
 have some hash or not (it is xor-ed after everything).

Yes, the suffix is used to protect the secret. Without the suffix, it would 
be too simple to compute the prefix: getting a single hash value of a known 
string would leak the prefix.

 Suffix does not influence whether two keys have some hash
 or not (...). Everything except last 8 bits in prefix does
 not influence it also.

I don't know if we can do better and/or if it is a critical issue.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-20 Thread Vlado Boza

Vlado Boza us...@ksp.sk added the comment:

One possible fix:
Look for StringHasher in google v8 code 
(http://code.google.com/p/v8/source/search?q=stringhasherorigq=stringhasherbtnG=Search+Trunk).
 Main loop looks like this:
raw_running_hash_ += c; 
  
raw_running_hash_ += (raw_running_hash_  10); 
  
raw_running_hash_ ^= (raw_running_hash_  6);  

It seems not to have same collisions with many different hash seeds.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Vlado Boza

New submission from Vlado Boza us...@ksp.sk:

Fix of this http://bugs.python.org/issue13703 is broken.

tl;dr: There only 256 different hash functions (compare it to size of 
_Py_HashSecret prefix and suffix). And whether keys collide or not depends only 
on the last 8 bits of prefix. 

Problem with current randomization of hash function is following:
Suffix does not influence whether two keys have some hash or not (it is xor-ed 
after everything). 
Everything except last 8 bits in prefix does not influence it also. Try adding 
0x474200 to prefix and see what happens (it will add 0x474200 to resulting 
hash). 

To make a DoS attack, attacker must do the following:
Generate sets of colliding keys for every 256 possible combinations of last 8 
bits. Try each one of these sets - one will have significantly bigger response 
time, and then repeat this one.

--
messages: 158736
nosy: Vlado.Boza
priority: normal
severity: normal
status: open
title: Hash function is not randomized properly
type: security

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Vlado Boza

Vlado Boza us...@ksp.sk added the comment:

E.g this strings collide for every prefix ending on 0xcd:
0x27fd5a18, 0x26fe78fa

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Dave Malcolm

Changes by Dave Malcolm dmalc...@redhat.com:


--
nosy: +dmalcolm

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Michal Petrucha

Changes by Michal Petrucha michal.petru...@ksp.sk:


--
nosy: +konk

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +haypo

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Dave Malcolm

Dave Malcolm dmalc...@redhat.com added the comment:

Thanks for filing this bug report.

I'm not seeing the equal hashes you describe.

I'm using this recipe to hardcode a specific prefix and print the hashes using 
it:
$ gdb --eval-command=break _PyRandom_Init --eval-command=run 
--eval-command=print _Py_HashSecret --eval-command=set 
_Py_HashSecret.prefix=0xcdcdcdcd --eval-command=print _Py_HashSecret 
--eval-command=continue -eval-command=continue --args python -c 
a='\x27\xfd\x5a\x18'; b='\x26\xfe\x78\xfa'; print(hash(a)); print(hash(b))


On a 32-bit build of Python 2.7.3 (i686), if I set 
_Py_HashSecret.prefix=0xcdcdcdcd, I get non-equal hashes for the data you 
specify (output trimmed somewhat for conciseness):

  $1 = {prefix = 0, suffix = 0}
  $2 = {prefix = -842150451, suffix = 0}
  Continuing.
  -121255142
  -1199906326

Similarly, on a 64-bit build of Python 2.7.3 (x86_64), I get non-equal hashes:
  $1 = {prefix = 0, suffix = 0}
  $2 = {prefix = 3452816845, suffix = 0}
  -3992804574342296806
  -8147489705433570838

Did I misunderstand the report?  Thanks.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +PaulMcMillan

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread STINNER Victor

STINNER Victor victor.stin...@gmail.com added the comment:

I don't understand this issue: can you write a short script to test a 
collision? 

E.g this strings collide for every prefix ending on 0xcd

Do you mean that prefix  0xff == 0xcd?

0x27fd5a18, 0x26fe78fa

Is it a byte string or an Unicode string? b'\x27\xfd\x5a\x18' and 
b'\x26\xfe\x78\xfa'?

--

Using PYTHONHASHSEED environment variable, it's easy to find two values 
generating the same _Py_HashSecret. Just one example:

PYTHONHASHSEED=3035016679:
* _Py_HashSecret = {0xcd5192eff3fd4d58, 0x3926b1431b200720}
PYTHONHASHSEED=4108758503:
*  _Py_HashSecret = {0xcd5192eff3fd4d58, 0x3926b1431b200720}

--

I wrote find_hash_collision.py to try to compute a collision, but the programs 
fail with:
---
Fail to generate a new seed!
# seeds = 65298
---
So it fails to generate a new random seed after testing 65298 different seeds. 
I ran the script with a function generating a seed, a seed generate a prefix 
ending with 0xDC.

See attached program: it generates a random seed. Uncomment seed = 
generate_seed_0xCD() if the prefix must ends with 0xCD byte.

--
Added file: http://bugs.python.org/file25281/find_hash_collision.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Vlado Boza

Vlado Boza us...@ksp.sk added the comment:

My bad (I checked only function in C++, not result in python).
This should work on 32bit:
Prefix: anything ending on 0x00
Suffix: anything
Strings: \x00\xcf\x0b\x96\x19, \x00\x6d\x29\x45\x18

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Vlado Boza

Vlado Boza us...@ksp.sk added the comment:

For example take this script (on 32bit):
ha = hash(\x00\xcf\x0b\x96\x19)
hb = hash(\x00\x6d\x29\x45\x18)
if ha == hb:
  print collision

And run following:
for i in `seq 0 25`; do echo $i; for j in `seq 0 100`; do ./python -R x.py; 
done; done;

It gives collison too many times (around 9 out of 2500).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Dave Malcolm

Dave Malcolm dmalc...@redhat.com added the comment:

$ gdb --eval-command=break _PyRandom_Init --eval-command=run 
--eval-command=print _Py_HashSecret --eval-command=set 
_Py_HashSecret.prefix=0xcdcdcd00 --eval-command=print _Py_HashSecret 
--eval-command=continue -eval-command=continue --args python -c 
'a=\x00\xcf\x0b\x96\x19; b=\x00\x6d\x29\x45\x18; print(hash(a)); 
print(hash(b))'

On 32-bit:
$1 = {prefix = 0, suffix = 0}
$2 = {prefix = -842150656, suffix = 0}
1220138288
1220138288

On 64-bit:
$1 = {prefix = 0, suffix = 0}
$2 = {prefix = 3452816640, suffix = 0}
Continuing.
4087671194599937328
-167939011306192

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread STINNER Victor

STINNER Victor victor.stin...@gmail.com added the comment:

 For example take this script (on 32bit): (...)
 It gives collison too many times (around 9 out of 2500).

I tried this script on Linux 32 bits and Linux 64 bits: I didn't see any 
collision. What is your operating system and the version of your operating 
system please?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Michal Petrucha

Michal Petrucha michal.petru...@ksp.sk added the comment:

@dmalcolm:
As for the gdb example, you need to add --eval-command=set 
_Py_HashSecret_Initialized=1, otherwise _Py_HashSecret will get overwritten 
immediately after it is set by gdb, either to 0 if run without the -R switch, 
or to a random value.

With the fresh pair of values Vlado provided, I managed to reproduce the 
collisions on Python 2.7.3, i686 (output trimmed like you did):

for __PREFIX in 0x0 0x4700 0xdead00 0xcafe00; do gdb --eval-command=break 
_PyRandom_Init --eval-command=run --eval-command=print _Py_HashSecret 
--eval-command=set _Py_HashSecret.prefix=${__PREFIX} --eval-command=set 
_Py_HashSecret_Initialized=1 --eval-command=print _Py_HashSecret 
--eval-command=continue -eval-command=continue --args ./python -c 
a='\x00\xcf\x0b\x96\x19'; b='\x00\x6d\x29\x45\x18'; print(hash(a)); 
print(hash(b)); done

$1 = {prefix = 0, suffix = 0}
$2 = {prefix = 0, suffix = 0}
Continuing.
1220138288
1220138288


$1 = {prefix = 0, suffix = 0}
$2 = {prefix = 18176, suffix = 0}
Continuing.
-1483212240
-1483212240


$1 = {prefix = 0, suffix = 0}
$2 = {prefix = 14593280, suffix = 0}
Continuing.
-972665808
-972665808


$1 = {prefix = 0, suffix = 0}
$2 = {prefix = 13303296, suffix = 0}
Continuing.
1003122480
1003122480

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread Vlado Boza

Vlado Boza us...@ksp.sk added the comment:

I tried this script on Linux 32 bits and Linux 64 bits: I didn't see any 
collision. What is your operating system and the version of your operating 
system please?

uname -a
Linux 3.0.0-17-generic #30-Ubuntu SMP Thu Mar 8 20:45:39 UTC 2012 x86_64 x86_64 
x86_64 GNU/Linux


Anyway you should be able to do following (in 32bits):
- generate two colliding keys (with some random seed)
- try these keys with different random seeds and they will collide ~1 out of 
256 times

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue14621] Hash function is not randomized properly

2012-04-19 Thread STINNER Victor

STINNER Victor victor.stin...@gmail.com added the comment:

hash.py: Python implementation of the 32-bit version of hash(bytes). Ok, I now 
see that only the lower 8 bits are really mixed with the input string.

--
Added file: http://bugs.python.org/file25282/hash.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue14621
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



  1   2   >