viktorsomogyi edited a comment on pull request #9519:
URL: https://github.com/apache/kafka/pull/9519#issuecomment-736027069


   @lbradstreet it is really hard to give an exact answer to this as collision 
rate is hard to calculate mathematically as it is very dependant on the size 
and values of the testset. For non-cryptographic hashes it is possible to 
generate DDoS attacks where everything gets placed into the same bucket and 
thus slows down lookups. On the theoretical side though Murmur3 passes the most 
often cited Chi Square test, it has a very good avalanche effect and thus 
generates a hashes that are very close to the uniform distribution.
   Because of the lack of available mathematical articles on this topic (murmur 
vs MD5) I started brute-force tests where I generated a few billion unique keys 
and inserted them into a Bloom Filter (which had a 1% false positive 
probability). That showed that Murmur3 is actually on the same level as MD5, it 
generates roughly the same amount of collisions. I have to types of datasets: 
the first can use any UTF8 characters and the second works only from the 
printable ASCII characters. In fact both MD5, murmur3 128bit, murmur3 64bit and 
xxhash64 bit generated around the same amount of collisions which was 0.016% 
out of 200 million unique keys. I added Murmur3 32bit for a baseline but it was 
significantly worse, around 2% of collisions. Maybe to show the difference we 
need a much larger keyset, I'll try to do what I can.
   I'll publish my code in the following days I just have to work on something 
else too so it's a bit slow, sorry :).
   
   On the other hand if we want to make sure that there will be no collisions, 
I don't think it's possible with either of these solutions, there is always a 
chance. To completely cut this off we either have to store the user key-hash 
maps similarly to the offset indexes and reject new, colliding keys or use 
perfect hashes (but that couldn't work well as it requires the knowledge of the 
full keyset or have to rebuild the cache in each insert or at least often).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to