Thanks very much for the advice, everybody.

It's clearer to me now that the probability of collision in each approach is
based on different factors.  In the purely random approach it's based on the
number of keys in the system.  In the timestamp+random approach it's based
on the overall "key bandwidth" of the system - how many keys can be
generated within a single timestamp interval.

For instance, /dev/urandom on my server maxes out at about 16MB/sec.  This
works out to ~1 million 128-bit keys per second, or 1000 nanoseconds per
key.  Plugging these numbers into the birthday problem approximation yields
a probability of collision of 2.71E-26, which is roughly the same as the
first approach with 16M keys in the system.  In the first approach, as more
keys are created, the risk of collision grows.  In the second case, as more
nodes join the system, the likelihood of generating multiple keys in the
same nanosecond grows.
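
Here's the birthday approximation as a small Python sketch, in case anyone
wants to poke at the numbers (the key counts passed in below are just
illustrative assumptions, not measurements):

    import math

    def collision_probability(num_keys, key_bits):
        # Birthday approximation: p ~ 1 - exp(-k(k-1)/(2N)).
        # Using expm1 keeps the result accurate when the probability is tiny.
        space = 2 ** key_bits
        return -math.expm1(-num_keys * (num_keys - 1) / (2.0 * space))

    # Approach 1: purely random 128-bit keys, e.g. 16M keys in the system.
    print(collision_probability(16000000, 128))

    # Approach 2: only keys sharing the same nanosecond timestamp can
    # collide, and then only on the 64 random bits; say 10 keys happen to
    # land in one nanosecond.
    print(collision_probability(10, 64))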

Oscar, your point is well-taken re. the monotonic increase of the
timestamp.  I guess the timestamp+random data key would have to be run
through e.g. MD5 to make it suitable for consistent hashing, which adds
another layer of complexity to the collision analysis.
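
To make that concrete, here's a minimal sketch of what I have in mind
(Python; make_key and the exact field layout are just my own illustration,
and MD5 is only one convenient choice of hash):

    import hashlib
    import os
    import struct
    import time

    def make_key():
        ts = struct.pack(">Q", time.time_ns())  # 64-bit timestamp, big-endian
        rnd = os.urandom(8)                      # 64 random bits
        # MD5 yields a 128-bit digest, so the timestamp's monotonic ordering
        # does not survive into the final key.
        return hashlib.md5(ts + rnd).digest()    # 16 bytes, usable as a DHT key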

Thanks again, everybody.

Russ

On Fri, Apr 24, 2009 at 10:05 AM, P. Oscar Boykin <boy...@pobox.com> wrote:

> First, just review the birthday problem.  A collision-free set of k
> uniformly selected objects from a total set of size N occurs with
> probability:
>
> (1)(1-1/N)(1-2/N)...(1-(k-1)/N)
> If k << N, you can approximate the above by taking the ln, and noting
> that ln(1-x) ~ -x for small x:
>
> (1)(1-1/N)(1-2/N)...(1-(k-1)/N) ~ exp(-k(k-1)/(2N))
>
> If you generate keys at rate R, we can ask when you expect to have a
> collision:
> k = Rt, and solve for exp(-k(k-1)/(2N)) ~ 1/2.  Just to get the order,
> you get the square-root result: t = (1/R) \sqrt{N}, or the maximum
> rate: R = (1/t)\sqrt{N}
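>
> (Plugging in numbers just for a sense of scale: with N = 2^128 and
> R = 10^6 keys per second, \sqrt{N} = 2^64 ~ 1.8e19, so
> t = (1/R)\sqrt{N} ~ 1.8e13 seconds, i.e. several hundred thousand years
> before a collision is expected.)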
>
> In your second case, if we assume one timestamp per time P (or P units
> of time per timestamp), a maximum time T, and a smaller set of names: n.
> The maximum number of timestamps is then T/P.  To compare, the total
> names are N=nT/P.  A collision happens if you choose the same name in
> the set n, at the same timestamp.  If we assume key generation rate R
> per unit time, then you have RP keys per timestamp.  So, use the above
> formula with k = RP and maximum number of names n = NP/T:
>
> approximately: (RP)^2 T/(PN) = 1, so T = N/(PR^2).  The maximum key rate
> this can support: R = \sqrt{N/PT}.  You can recover the original
> result if you have one big timestamp: P = T, then R = (1/T) \sqrt{N}.  To
> support a bigger R, we can control the parameter P.  Clearly P <= T,
> so the timestamp approach is always better when compared this way
> (because \sqrt{N/PT} >= \sqrt{N/T^2} = (1/T)\sqrt{N}).  So, choose P as
> small as is practical: i.e., the unit of time of the least precise
> timer you want to invoke.  I guess you want to make sure T is
> ridiculously large to avoid Y2K issues.
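>
> (Plugging numbers into these formulas: with N = 2^128, P = 1 nanosecond,
> and T = 100 years ~ 3.2e9 seconds, PT ~ 3.2, so R = \sqrt{N/PT} ~ 1e19
> keys per second, versus (1/T)\sqrt{N} ~ 6e9 keys per second for the
> purely random case over the same lifetime.)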
>
> So, it seems like a good idea.  I would add one additional issue, however:
> timestamps increase monotonically.  You might want to use some
> one-to-one mapping for the bits of the timestamp that destroys the
> monotonicity property before using them as parts of keys in a
> hashtable.
>
> I guess this is well known, or I made a mistake in the analysis.
>
> Best,
>
> On Fri, Apr 24, 2009 at 2:54 AM, Russ Weeks <rwe...@gmail.com> wrote:
> > Hi, All.
> > I'm playing around with building some data structures backed by a
> > conventional DHT like Chord or Kademlia.  It seems like many of the
> > elements in these structures don't need a semantically useful key.
> > Take a linked list, for instance.  I need to be able to do a "get" on
> > the head or tail of the list, but the keys of the interior elements
> > don't need to be exposed to the caller.  This suggests that I can use
> > random data for these keys.  My question is, what's the best way to
> > generate these keys?
> > The obvious approach is just to fill the key (say 128 bits) with
> > random data.  I think I understand the birthday problem and how to
> > estimate the probability that a collision will occur.
> > An alternative approach that was proposed to me was to use 64 bits of
> > random data and a 64-bit timestamp.  Here, we hope that the random
> > number generator is not seeded with the system time.  The theory is,
> > the odds of generating 2 identical 64-bit numbers in the same
> > nanosecond are less than the odds of generating 2 128-bit numbers
> > over the lifetime of the system.  Whether this is true is not clear
> > to me.
> > Can anybody suggest a way to quantify the likelihood of collision in
> > the second approach?
> > Thanks,
> > Russ
> >
> >
>
>
>
> --
> P. Oscar Boykin                            http://boykin.acis.ufl.edu
> Assistant Professor, Department of Electrical and Computer Engineering
> University of Florida
>
_______________________________________________
p2p-hackers mailing list
p2p-hackers@lists.zooko.com
http://lists.zooko.com/mailman/listinfo/p2p-hackers
