Keep a couple of points in mind.  First of all, since you are using MD5 as
what amounts to a random number generator rather than for a cryptographic
purpose, you may be undertaking more processing than would be required to
do this in a more straightforward manner.  This is because crypto hashes
are designed to be infeasible to replicate without lots of iterative
processing in order to make it hard to attack them by brute force; they
are therefore computationally intensive.  Most decent psuedo-random number
generators can produce equivalent results in terms of distribution with
orders of magnitude less processing, although with more predictability.

Second, in terms of randomness, any arbitrary selection from a random set
of bits is just as random.  That is, a simple method for getting a random
64-bit number given a random 128-bit number is to use only the most- or
least-significant half, or XOR'ing both halves together.

Third, crypto hashes may be slightly less random than you think when used
to digest short messages.  In general, the distribution of crypto hashes
is only expected to be sufficiently random for "large enough" messsages,
although no one currently knows how to estimate a lower bound.  I doubt
this has any practical significance for you, but you ought to try to hash
a reasonably large amount of data, say 1024 bytes, relative to the size of
the hash itself when using a 128-bit hash such as MD5.  Above all, avoid
using initializers which are actually shorter than the hash, such as a
32-bit time stamp.  Note that the information content of data may be far
smaller than it seems; for example, the Unix 'date' command outputs 29
bytes of ASCII ("Tue Jun 26 16:10:53 EDT 2001") which we know has only 4
bytes (32 bits) of information content, so "date | md5sum" is a very bad
idea to do.  You can get around this by using the output of each hash in
the input to the next hash, but then you may as well use a simple sequence
generator class.

Fourth, the probability of a collision between your next randomly chosen
number and any prior randomly chosen number is going to follow a Poisson
distribution, which may not be obvious at first.  This means that the
likelihood of collison is by no means neglible for databases of practical
size even when using fairly big arbitrary numbers.  Exactly what does
constitute "negligible" is a judgment call, not a technical decision.  
Although a 128-bit hash seems unlikely to be a problem, it is certainly
conceivable that a 64-bit hash could be.  Optimal behavior also depends
upon the hash being properly seeded, as discussed.

In general, I think that these kinds of subtleties make using a random
hash approach for primary key generation more trouble than it is worth.  
If you absolutely need to eliminate a singleton constraint and you can
afford to check for collisions, it may be worth the effort.

See http://www.ietf.org/rfc/rfc2795.txt for details. :-)

-- Mike


On 2001-06-26 at 15:28 +0200, [EMAIL PROTECTED] wrote:

> Well... I use similar technique for generating unique IDs, but I utilize the
> MD5 hash function. It is believed that it provides enough uniqueness. The
> only problem is that you cannot use the long as primary key field (in my
> case I use String). But you may convert 128-bit digest into 2x64-bit longs.
> 
> It has similar pros and cons, however the max value is now 2^128 which is
> quite large number :).
> 
> Regards,
> Roman.
> 
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > Sent: Dienstag, 26. Juni 2001 14:54
> > To: jboss
> > Subject: [JBoss-user] Key generation by random numbers
> > 
> > 
> > As a follow-up to the debate on how to get auto-increment primary
> > keys:
> > 
> > Is it feasible to use a random number generator to generate primary
> > keys? I don't really need my records to have steadily increasing keys
> > and my number of records will presumably be much smaller than the size
> > of my value space (4 billion? depending on data type for prim-key). So
> > if I do something along the following when making a new record;
> > 
> > boolean created = false;
> > do 
> > {
> >    Long key = generateRandomLong();
> >    created = ejb.create(key, contents);
> > } 
> > while (!created);
> > 
> > 
> > (here, ejb.create() takes the primary key of the new record as its
> > first param and returns true if success, false if not success)
> > 
> > My theory is that most of the time, the creation will succeed on the
> > first attempt, based on the assumption that number of records is
> > insignificant compared to the value space of Long. 
> > 
> > Pros:
> > I don't need to do manual synchronisation with a central
> > key-generating bean.
> > I don't need to do DB-specific calls to get it to generate for me.
> > 
> > Cons:
> > Unpredictable time use for creation.
> > Unusable if number of records becomes significant compared to value
> > space of Long.
> > 
> > I am assuming also that the creation of a random Long is a fast
> > process.
> > 
> > Cheers
> >     Bent D



_______________________________________________
JBoss-user mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/jboss-user

Reply via email to