viktorsomogyi commented on pull request #9519:
URL: https://github.com/apache/kafka/pull/9519#issuecomment-736624841


   I ran another test with 2,000,000,000 keys being inserted into a Bloom 
Filter with 1% false positive rate (and it ran for 7 hours and 21 minutes). The 
test set was created by generating random printable ASCII characters and adding 
an incremental counter in a random position within a string. The counter was 
formatted to display 10 digits, so 16 would look like 0000000016 in the string, 
therefore preserving the length of the strings (which was 110 characters with 
the counter included).
   
   MD5 hash collisions: 3,325,038 (0.1662%)
   Murmur3 128bit hash collisions: 3,330,883 (0.1665%)
   Murmur3 64bit hash collisions: 3,327,266 (0.1663%)
   Murmur3 32bit hash collisions: 401,833,502 (20.0916%)
   XXHash 64bit hash collisions: 3,327,266 (0.1663%)
   
   It shows that interestingly there were slightly more collisions in Murmur3 
than in MD5 (MD5 was 0.0002% better, which is 5845 collisions) and 
interestingly the 64bit hashes performed the best (and I won't mention the 
32bit hash :D). Given that there were 2 billion unique keys, that 5845 
statistically well within any range of error (it's 0,1% of 3,330,883). I don't 
expect that with more test runs we'll have significantly bigger differences. As 
I observed the values were growing linearly with values close together 
(sometimes MD5 falling behind, sometimes getting ahead) so I think this 0.0002% 
doesn't mean too much. I didn't save this progress data out as I thought about 
this 4 hours into the test but I'll do it for the next one.
   
   I'd also like to try out xxHash 128bit but I haven't found any Java JNIs for 
it yet but I'll keep searching. I expect though that it would perform similarly 
to the other 128bit hashes but overall I think this could be a good foundation 
for using traditional hashing algorithms instead of the slow MD5.
   Another experiment could be to try out SHA1. As I saw it should be somewhat 
faster and more collision resistant than MD5, however I never did any kind of 
experiments, so we'll see if its performance gets meaningfully better (meaning 
it exceeds at least the 1% error rate).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to