RE: Solr feasibility with terabyte-scale data

Lance Norskog Fri, 09 May 2008 13:55:23 -0700

A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the
MD5 cryptographic checksumming algorithm. This takes X bytes of data and
creates a 128-bit "random" number, i.e. 128 "random" bits. Accidental
collisions between two different datasets are astronomically unlikely at this
size. Deliberately constructed MD5 collisions have been published since 2004,
though, so use SHA-1 or better if someone hostile controls the input.
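
A minimal sketch of the ID generation in Python, not our actual code. The
function name make_id, the choice of key fields, and the separator character
are all assumptions made up for illustration:

```python
import hashlib

def make_id(*fields: str) -> str:
    """Return the 32-character hex MD5 digest of the chosen key fields."""
    # Assumption: "\x1f" (ASCII unit separator) never occurs inside a field,
    # so ("ab", "c") and ("a", "bc") cannot produce the same key string.
    key = "\x1f".join(fields)
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Hypothetical uniqueness formula: source URL plus fetch date.
doc_id = make_id("http://example.com/page", "2008-05-09")
print(doc_id, len(doc_id))         # 32 hex characters
print(len(bytes.fromhex(doc_id)))  # the same value as 16 raw bytes (128 bits)
```

Swapping hashlib.md5 for hashlib.sha1 gives a 40-character, 160-bit id with
the same properties.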


This gives some handy things: 
a) a fixed-size unique ID field, giving fixed space requirements,
        The standard representation of this is 32 hex characters, i.e.
'deadbeefdeadbeefdeadbeefdeadbeef'. You could make a special 128-bit Lucene
data type for this.

b) the ability to change your mind about the uniqueness formula for your
data,

c) a handy primary key for cross-correlating in other databases,
        Think external DBs which supply data for some records. The primary
key is the MD5 signature.

d) the ability to randomly pick subsets of your data.
        The record 'id:deadbeefdeadbeefdeadbeefdeadbeef' will match the
wildcard string 'deadbeef*'. And 'd*'.
        'd*' selects an effectively random subset of your data, about 1/16
of the total size. A two-character prefix such as 'de*' gives about 1/256
of your data.
        This works because MD5 output is, for practical purposes, uniformly
distributed, so every hex prefix is equally likely.
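
To see the subset sizes for yourself, here is a rough Python simulation of
point d). It runs outside Solr: the record names are made up, and
str.startswith stands in for a prefix query such as 'id:d*'. With a one-digit
hex prefix the selected fraction should land near 1/16, and with a two-digit
prefix near 1/256:

```python
import hashlib

def md5_id(text: str) -> str:
    """Hex MD5 digest used as the record id."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def sample_by_prefix(ids, prefix):
    """Keep only the ids starting with the given hex prefix."""
    return [i for i in ids if i.startswith(prefix)]

# Hypothetical corpus: 100,000 records with made-up keys.
ids = [md5_id(f"record-{n}") for n in range(100_000)]

one_digit = sample_by_prefix(ids, "d")   # expect roughly 1/16 of the ids
two_digit = sample_by_prefix(ids, "de")  # expect roughly 1/256 of the ids

print(len(one_digit) / len(ids))
print(len(two_digit) / len(ids))
```

Any other hex prefix of the same length should give a similarly sized,
non-overlapping subset.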

This should go on a wiki page 'SchemaDesignTips'.

Cheers,

Lance Norskog
