A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the MD5 cryptographic checksumming algorithm. This takes X bytes of data and creates a 128-bit "random" number, i.e. 128 "random" bits. Deliberately constructed MD5 collisions have been published, but the chance of two different real-world records accidentally giving the same checksum is negligible.
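For anyone who wants to try this, here is a minimal sketch of generating such an id in Python. The helper name make_doc_id, the choice of fields, and the separator character are my own assumptions for illustration; use whatever fields define uniqueness in your data.

```python
import hashlib

def make_doc_id(*fields: str) -> str:
    """Hypothetical helper: hash the fields that define uniqueness into one id."""
    # Join with a separator that should not occur inside the fields, so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    key = "\x1f".join(fields)
    # hexdigest() returns the 128-bit MD5 value as 32 lowercase hex characters.
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Example fields are made up; any stable combination of values would do.
doc_id = make_doc_id("example.com", "/some/path", "2008-01-01")
print(doc_id, len(doc_id))
```

Changing the uniqueness formula later only means changing which fields are passed in; the id field itself keeps the same size and type.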
This gives some handy things:
a) A fixed-size unique ID field, giving fixed space requirements. The standard representation of this is 32 hex characters, e.g. 'deadbeefdeadbeefdeadbeefdeadbeef'. You could make a special 128-bit Lucene data type for this.
b) The ability to change your mind about the uniqueness formula for your data.
c) A handy primary key for cross-correlating with other databases. Think of external DBs which supply data for some records: the primary key is the MD5 signature.
d) The ability to randomly pick subsets of your data. The record 'id:deadbeefdeadbeefdeadbeefdeadbeef' will match the wildcard string 'deadbeef*'. And 'd*'. 'd*' selects an effectively random subset of your data, about 1/16 of the total size. A two-character prefix such as 'de*' gives about 1/256 of your data. (A second '*' adds nothing, since '*' already matches any number of characters.) The subset is effectively random because MD5 output is, for practical purposes, uniformly distributed over the hex digits.

This should go on a wiki page 'SchemaDesignTips'.

Cheers,
Lance Norskog
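
P.S. A rough way to check the subset claim in (d) without touching an index: hash a batch of made-up keys and count how many ids start with a one-character and a two-character hex prefix. This is only a sketch in plain Python; the md5_id helper and the synthetic "record-N" keys are assumptions for illustration, and the prefix test stands in for a wildcard query.

```python
import hashlib

def md5_id(text: str) -> str:
    """Hypothetical helper: 32-character hex MD5 id for a record key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Synthetic corpus of 100,000 made-up record keys.
ids = [md5_id(f"record-{i}") for i in range(100_000)]

# startswith() plays the role of a prefix wildcard query on the id field.
one_char = sum(1 for doc_id in ids if doc_id.startswith("d"))
two_char = sum(1 for doc_id in ids if doc_id.startswith("de"))

print(one_char / len(ids))  # should land near 1/16  (0.0625)
print(two_char / len(ids))  # should land near 1/256 (about 0.0039)
```

The fractions will not be exact, but with this many records they should sit close to the expected values.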