Re: Different key with the same digest in log compaction

2017-01-06 Thread radai
i just noticed my link didnt parse correctly. the formula for probability
of collision is explained by googling "birthday paradox".
if anything, we could further optimize this code to use a trivial hash map
for cases where sizeOf(hash) > sizeOf(key)

On Fri, Jan 6, 2017 at 10:00 AM, Colin McCabe  wrote:

> That's a fair point.  The calls to Arrays.equals are comparing just the
> hashes, not the keys.
>
> Colin
>
> On Tue, Jan 3, 2017, at 17:15, radai wrote:
> > looking at the code (SkimpyOffsetMap.get/put) they both start with
> > hashInto(key, hash1) and then ignore key from that point on - so we're
> > not
> > using the key unless im missing something?
> >
> > as for the probability of collision - it depends on the hash algo and the
> > number of keys. if you configure it to use something like sha-512 the
> > probability is truly negligible.
> >
> > for example, the probability of collision on a topic with 4 billion
> > entries
> > using MD5 is ~ 10^-20 (math -
> > http://www.wolframalpha.com/input/?i=n+%3D+2
> > ^32,+d+%3D+2^128,+1+-+%28%28d-1%29%2Fd%29+^+%28n%28n-1%29%2F2%29)
> >
> > On Tue, Jan 3, 2017 at 4:37 PM, Colin McCabe  wrote:
> >
> > > Can you be a little bit clearer on why you think that different keys
> > > with the same digest value will be treated as the same key?
> > > SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just
> > > the hash digest of the key.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote:
> > > > Hi,all:
> > > >  I am just learning the kafka codebase, as what I saw in
> > > > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247
> > > a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43
> > > >
> > > > if different log keys have the same digest value, they will be
> treated as
> > > > the same key in log compaction.Though the risk of such things
> happens is
> > > > very small, I still want it to be avoided.If what I thought is wrong
> > > > please
> > > > let me know, and I hope to know the thoughts of who created or
> > > > is maintaining the code.
> > >
>


Re: Different key with the same digest in log compaction

2017-01-06 Thread Colin McCabe
That's a fair point.  The calls to Arrays.equals are comparing just the
hashes, not the keys.

Colin

On Tue, Jan 3, 2017, at 17:15, radai wrote:
> looking at the code (SkimpyOffsetMap.get/put) they both start with
> hashInto(key, hash1) and then ignore key from that point on - so we're
> not
> using the key unless im missing something?
> 
> as for the probability of collision - it depends on the hash algo and the
> number of keys. if you configure it to use something like sha-512 the
> probability is truly negligible.
> 
> for example, the probability of collision on a topic with 4 billion
> entries
> using MD5 is ~ 10^-20 (math -
> http://www.wolframalpha.com/input/?i=n+%3D+2
> ^32,+d+%3D+2^128,+1+-+%28%28d-1%29%2Fd%29+^+%28n%28n-1%29%2F2%29)
> 
> On Tue, Jan 3, 2017 at 4:37 PM, Colin McCabe  wrote:
> 
> > Can you be a little bit clearer on why you think that different keys
> > with the same digest value will be treated as the same key?
> > SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just
> > the hash digest of the key.
> >
> > best,
> > Colin
> >
> >
> > On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote:
> > > Hi,all:
> > >  I am just learning the kafka codebase, as what I saw in
> > > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247
> > a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43
> > >
> > > if different log keys have the same digest value, they will be treated as
> > > the same key in log compaction.Though the risk of such things happens is
> > > very small, I still want it to be avoided.If what I thought is wrong
> > > please
> > > let me know, and I hope to know the thoughts of who created or
> > > is maintaining the code.
> >


Re: Different key with the same digest in log compaction

2017-01-03 Thread radai
looking at the code (SkimpyOffsetMap.get/put) they both start with
hashInto(key, hash1) and then ignore key from that point on - so we're not
using the key unless im missing something?

as for the probability of collision - it depends on the hash algo and the
number of keys. if you configure it to use something like sha-512 the
probability is truly negligible.

for example, the probability of collision on a topic with 4 billion entries
using MD5 is ~ 10^-20 (math - http://www.wolframalpha.com/input/?i=n+%3D+2
^32,+d+%3D+2^128,+1+-+%28%28d-1%29%2Fd%29+^+%28n%28n-1%29%2F2%29)

On Tue, Jan 3, 2017 at 4:37 PM, Colin McCabe  wrote:

> Can you be a little bit clearer on why you think that different keys
> with the same digest value will be treated as the same key?
> SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just
> the hash digest of the key.
>
> best,
> Colin
>
>
> On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote:
> > Hi,all:
> >  I am just learning the kafka codebase, as what I saw in
> > https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247
> a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43
> >
> > if different log keys have the same digest value, they will be treated as
> > the same key in log compaction.Though the risk of such things happens is
> > very small, I still want it to be avoided.If what I thought is wrong
> > please
> > let me know, and I hope to know the thoughts of who created or
> > is maintaining the code.
>


Re: Different key with the same digest in log compaction

2017-01-03 Thread Colin McCabe
Can you be a little bit clearer on why you think that different keys
with the same digest value will be treated as the same key? 
SkimpyOffsetMap#get and SkimpyOffsetMap#put compare the key, not just
the hash digest of the key.

best,
Colin


On Wed, Dec 21, 2016, at 23:27, Renkai Ge wrote:
> Hi,all:
>  I am just learning the kafka codebase, as what I saw in
> https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43
> 
> if different log keys have the same digest value, they will be treated as
> the same key in log compaction.Though the risk of such things happens is
> very small, I still want it to be avoided.If what I thought is wrong
> please
> let me know, and I hope to know the thoughts of who created or
> is maintaining the code.


Re: Different key with the same digest in log compaction

2016-12-22 Thread radai
i think the code assumes that with a "good enough" hash function (and maybe
few enough keys) the chance of such a collision is acceptably small to
justify the savings of not keeping the keys in memory.


On Wed, Dec 21, 2016 at 11:50 PM, Renkai  wrote:

> Hi, all:
>
>  I am just learning the Kafka codebase, as what I saw in
> https://github.com/apache/kafka/blob/6ed3e6b1cb8a73b1f5f78926ccb247
> a8953a554c/core/src/main/scala/kafka/log/OffsetMap.scala#L43-L43
>
> if different log keys have the same digest value, they will be treated as
> the same key in log compaction. Though the risk of such things happens is
> very small, I still want it to be avoided. If what I thought is wrong
> please let me know, and I hope to know the thoughts of who created or is
> maintaining the code.
>
>
>
>