Re: Different key with the same digest in log compaction
i just noticed my link didnt parse correctly. the formula for probability of collision is explained by googling "birthday paradox". if anything, we could further optimize this code to use a trivial hash map for cases where sizeOf(hash) > sizeOf(key) On Fri, Jan 6, 2017 at 10:00 AM, Colin McCabewrote: > That's a fair point. The calls to Arrays.equals are comparing just the > hashes, not the keys. > > Colin > > On Tue, Jan 3, 2017, at 17:15, radai wrote: > > looking at the code (SkimpyOffsetMap.get/put) they both start with > > hashInto(key, hash1) and then ignore key from that point on - so we're > > not > > using the key unless im missing something? > > > > as for the probability of collision - it depends on the hash algo and the > > number of keys. if you configure it to use something like sha-512 the > > probability is truly negligible. > > > > for example, the probability of collision on a topic with 4 billion > > entries > > using MD5 is ~ 10^-20 (math - > > http://www.wolframalpha.com/input/?i=n+%3D+2 > > ^32,+d+%3D+2^128,+1+-+%28%28d-1%29%2Fd%29+^+%28n%28n-1%29%2F2%29) > > > > On Tue, Jan 3, 2017 at 4:37 PM, Colin McCabe wrote: > > > > > Can you be a little bit clearer on why you think that different keys > > > with the same digest value will be treated as the same key? > > > SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just > > > the hash digest of the key. > > > > > > best, > > > Colin > > > > > > > > > On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote: > > > > Hi,all: > > > > I am just learning the kafka codebase, as what I saw in > > > > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247 > > > a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43 > > > > > > > > if different log keys have the same digest value, they will be > treated as > > > > the same key in log compaction.Though the risk of such things > happens is > > > > very small, I still want it to be avoided.If what I thought is wrong > > > > please > > > > let me know, and I hope to know the thoughts of who created or > > > > is maintaining the code. > > > >
Re: Different key with the same digest in log compaction
That's a fair point. The calls to Arrays.equals are comparing just the hashes, not the keys. Colin On Tue, Jan 3, 2017, at 17:15, radai wrote: > looking at the code (SkimpyOffsetMap.get/put) they both start with > hashInto(key, hash1) and then ignore key from that point on - so we're > not > using the key unless im missing something? > > as for the probability of collision - it depends on the hash algo and the > number of keys. if you configure it to use something like sha-512 the > probability is truly negligible. > > for example, the probability of collision on a topic with 4 billion > entries > using MD5 is ~ 10^-20 (math - > http://www.wolframalpha.com/input/?i=n+%3D+2 > ^32,+d+%3D+2^128,+1+-+%28%28d-1%29%2Fd%29+^+%28n%28n-1%29%2F2%29) > > On Tue, Jan 3, 2017 at 4:37 PM, Colin McCabewrote: > > > Can you be a little bit clearer on why you think that different keys > > with the same digest value will be treated as the same key? > > SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just > > the hash digest of the key. > > > > best, > > Colin > > > > > > On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote: > > > Hi,all: > > > I am just learning the kafka codebase, as what I saw in > > > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247 > > a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43 > > > > > > if different log keys have the same digest value, they will be treated as > > > the same key in log compaction.Though the risk of such things happens is > > > very small, I still want it to be avoided.If what I thought is wrong > > > please > > > let me know, and I hope to know the thoughts of who created or > > > is maintaining the code. > >
Re: Different key with the same digest in log compaction
looking at the code (SkimpyOffsetMap.get/put) they both start with hashInto(key, hash1) and then ignore key from that point on - so we're not using the key unless im missing something? as for the probability of collision - it depends on the hash algo and the number of keys. if you configure it to use something like sha-512 the probability is truly negligible. for example, the probability of collision on a topic with 4 billion entries using MD5 is ~ 10^-20 (math - http://www.wolframalpha.com/input/?i=n+%3D+2 ^32,+d+%3D+2^128,+1+-+%28%28d-1%29%2Fd%29+^+%28n%28n-1%29%2F2%29) On Tue, Jan 3, 2017 at 4:37 PM, Colin McCabewrote: > Can you be a little bit clearer on why you think that different keys > with the same digest value will be treated as the same key? > SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just > the hash digest of the key. > > best, > Colin > > > On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote: > > Hi,all: > > I am just learning the kafka codebase, as what I saw in > > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247 > a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43 > > > > if different log keys have the same digest value, they will be treated as > > the same key in log compaction.Though the risk of such things happens is > > very small, I still want it to be avoided.If what I thought is wrong > > please > > let me know, and I hope to know the thoughts of who created or > > is maintaining the code. >
Re: Different key with the same digest in log compaction
Can you be a little bit clearer on why you think that different keys with the same digest value will be treated as the same key? SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just the hash digest of the key. best, Colin On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote: > Hi,all: > I am just learning the kafka codebase, as what I saw in > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43 > > if different log keys have the same digest value, they will be treated as > the same key in log compaction.Though the risk of such things happens is > very small, I still want it to be avoided.If what I thought is wrong > please > let me know, and I hope to know the thoughts of who created or > is maintaining the code.
Re: Different key with the same digest in log compaction
i think the code assumes that with a "good enough" hash function (and maybe few enough keys) the chance of such a collision is acceptably small to justify the savings of not keeping the keys in memory. On Wed, Dec 21, 2016 at 11:50 PM, Renkaiwrote: > Hi, all: > > I am just learning the Kafka codebase, as what I saw in > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247 > a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43 > > if different log keys have the same digest value, they will be treated as > the same key in log compaction. Though the risk of such things happens is > very small, I still want it to be avoided. If what I thought is wrong > please let me know, and I hope to know the thoughts of who created or is > maintaining the code. > > > >