I am building a schema for storing a large amount of data about proteins, which at a certain level can be modelled with quite simple tables. The problem is that the database needs to house a lot of it (>10TB and growing), with one table in particular threatening to top 1TB. Both that table and the database as a whole can be expected to grow quickly, and most of the data can never be deleted.

In the past, with smaller tables, I have had success partitioning on a 64-bit CRC hash that takes a more-or-less uniform distribution of input data and pumps out a more-or-less uniform distribution of partitioned data with a very small probability of collision. The hash itself is implemented as a C add-on library, returns a BIGINT and serves as a candidate key for what, for our purposes, we can call a protein record.

Now back to the big table, which relates two of these records (in a theoretically symmetric way). Assuming I set the table up as something like:

    CREATE TABLE big_protein_relation_partition_dimA_dimB (
        protein_id_a BIGINT NOT NULL CHECK ( bin_num(protein_id_a) = dimA ),  -- key (hash) from some table
        protein_id_b BIGINT NOT NULL CHECK ( bin_num(protein_id_b) = dimB ),  -- key (hash) from some table
        ...
    );

(where dimA and dimB stand for the bin numbers of a given partition) and do a little C bit-twiddling to define some binning mechanism on the BIGINTs, then, as near as I can tell, binning along the A and B dimensions into 256 bins each shouldn't put me in any danger of running out of OIDs or anything like that (despite having to deal with 2^16 tables).

Theoretically, at least, I should be able to do UNIONs along each axis (to avoid causing the analyzer too much overhead) and use constraint exclusion to make my queries zip along with proper indexing.

Aside from running into a known bug with "too many triggers" when creating gratuitous indices on these tables, I feel as if it may be possible to do what I want without breaking everything. But then again, am I taking too many liberties with technology that maybe didn't have use cases like this one in mind?

Jason

-- 
========================================================
Jason Nerothin
Programmer/Analyst IV - Database Administration
UCLA-DOE Institute for Genomics & Proteomics
Howard Hughes Medical Institute
========================================================
611 C.E. Young Drive East   | Tel: (310) 206-3907
105 Boyer Hall, Box 951570  | Fax: (310) 206-3914
Los Angeles, CA 90095. USA  | Mail: [EMAIL PROTECTED]
========================================================
http://www.mbi.ucla.edu/~jason
========================================================
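
P.S. To make the binning idea concrete, here is a minimal C sketch. It is not the actual add-on library: it assumes bin_num() simply takes the top 8 bits of the 64-bit hash (giving 256 bins per dimension), and partition_name() is a hypothetical helper that only illustrates how a pair of keys would map onto one of the 2^16 tables.

```c
#include <stdint.h>
#include <stdio.h>

enum { BIN_BITS = 8 };  /* 2^8 = 256 bins per dimension, 2^16 tables in all */

/* Hypothetical bin_num(): the top BIN_BITS bits of the 64-bit hash.
 * Casting to uint64_t first keeps the shift well defined for negative
 * BIGINT values. Using the high bits is an assumption; any fixed 8-bit
 * slice of a uniform hash should spread rows about as evenly. */
static int bin_num(int64_t protein_id)
{
    return (int)((uint64_t)protein_id >> (64 - BIN_BITS));
}

/* Hypothetical helper: write the name of the partition that would hold
 * the pair (a, b) into buf. Returns whatever snprintf returns. */
static int partition_name(char *buf, size_t len, int64_t a, int64_t b)
{
    return snprintf(buf, len, "big_protein_relation_partition_%d_%d",
                    bin_num(a), bin_num(b));
}
```

Under these assumptions, a pair whose hashes begin with the bytes 0x00 and 0xFF would land in big_protein_relation_partition_0_255.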