I've done similar things in the past. Here is how I did it:

Generate and sort the first file.
Generate and sort the second file.
join first second
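For illustration, here is a minimal sketch of those three steps. It makes several assumptions that are not from the thread: md5sum stands in for whichever unsalted hash is being tested, the file names words.txt, random.txt, candidates and collisions are hypothetical, the sample strings are made up, and no string contains whitespace (the field handling breaks otherwise).

```shell
#!/bin/sh
# Sketch only. md5sum is an assumed stand-in for the hash under test;
# all file names and sample strings are hypothetical.
export LC_ALL=C                 # sort and join must agree on collation

tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Tiny sample inputs, one string per line. "banana" appears in both on
# purpose, to show a join hit that is not a real collision.
printf '%s\n' apple banana cherry > words.txt
printf '%s\n' xk2q banana zz91 > random.txt

hash_lines() {
    # Print "hash string" for every input line.
    while IFS= read -r s; do
        h=$(printf '%s' "$s" | md5sum | cut -d' ' -f1)
        printf '%s %s\n' "$h" "$s"
    done
}

# Generate and sort both files on the hash (field 1).
hash_lines < words.txt  | sort -k1,1 > first
hash_lines < random.txt | sort -k1,1 > second

# Lines whose hash is in both files, printed as "hash word random".
join first second > candidates

# A candidate is only a real collision if the two source strings differ.
awk '$2 != $3' candidates > collisions

n_candidates=$(wc -l < candidates | tr -d ' ')
n_collisions=$(wc -l < collisions | tr -d ' ')
echo "candidates: $n_candidates, collisions: $n_collisions"
```

With the sample data, banana shows up in candidates because both files contain it, and the awk filter then drops it because the two source strings are identical. For 24 million hashes the shell loop would be far too slow, so you'd want to generate the files with something faster; the sort and join steps stay the same.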
The join command will only print lines whose join field (the first field, by default) appears in both the first and second file. This would give you a list of possible collisions. You'd just need to verify that each collision is valid (i.e. the random string was not the actual string from the first file).

If you just wanted a count, you could run:

join first second | wc -l

On Fri, Dec 27, 2013 at 10:55 AM, S. Dale Morrey <sdalemor...@gmail.com> wrote:

> I would love for you to tell me that, but still I'm trying to verify a
> particular implementation of the algorithm,
>
> Your CDB file idea is a good one. I'm going to investigate it further.
>
> On Fri, Dec 27, 2013 at 10:50 AM, Steve Meyers <st...@plug.org> wrote:
>
> > On 12/27/13 10:43 AM, S. Dale Morrey wrote:
> >
> >> Yes, that is exactly what I'm doing. I'm checking the propensity for a
> >> random string of characters to have a hash collision with an existing
> >> known
> >> set of words given an unsalted hashing algorithm.
> >>
> >
> > Can I save you the trouble by telling you how unlikely it is to happen?
> >
> > Since your 24 million existing hashes are static, I'd look into how big a
> > CDB file would be, and then use that to check for collisions.
> >
> > Steve

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/