We (Centers for Disease Control and Prevention) want to mount relatively large read-only tables that rarely change on modest ($10K ??) hardware and get < 10 second response times. For instance, 10 years of detailed natality records for the United States, in which each record has some 200 fields and which, as a MySQL MyISAM table, is 14G in size. Queries will typically hit 6 to 10 fields from the table, and while some queries will have a WHERE clause that can take advantage of a YEAR or GEOGRAPHY index, most won't.
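The bus-bandwidth arithmetic that drives the rest of this note can be sketched in a few lines (Python used purely for illustration; the 100 and 300 MB/sec figures are the bus limits discussed below, and sizes use decimal megabytes):

```python
def scan_seconds(table_mb, bus_mb_per_sec):
    """Time to stream table_mb megabytes over a bus limited to
    bus_mb_per_sec MB/sec (an IO-bound sequential table scan)."""
    return table_mb / bus_mb_per_sec

def max_table_mb(seconds, bus_mb_per_sec):
    """Largest table that can be streamed within the given time budget."""
    return seconds * bus_mb_per_sec

print(max_table_mb(10, 300))       # 3000 MB -> the 3G table limit
print(max_table_mb(30, 300))       # 9000 MB -> the 9G table limit
print(scan_seconds(14000, 300))    # ~47 sec to stream the full 14G table
print(scan_seconds(14000 / 20, 100))  # 7 sec for 1/20 of it at 100 MB/sec
```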
On the hardware side, even an Ultra320 LVD SCSI RAID sitting on a 64-bit/66MHz PCI bus is limited to some 300 MB/sec, and a more common Ultra160 SCSI RAID is limited to the 32-bit/33MHz PCI bus speed of approx 100+ MB/sec. So the problem becomes one of moving a whole table over the bus in X time. Assuming 300 MB/sec and a 10 second query, we're limited to a 3G table. Allowing the query response time to bump up to 30 seconds, we hit a 9G limit. But our detailed natality table is 14G.

Some design options:

- do the typical data mining thing and pre-aggregate or subset the main table into a thousand or more subtables, then write an intelligent filter that sits in front of these tables to decide which to reroute each query to

- use what Census calls a "transverse server" design and write out each field in the table to its own file

- use MySQL compression on the MyISAM table -- I've tried this, and the process becomes horribly CPU bound when scanning through the entire table (14G uncompressed, 2.2G compressed), with little if any time advantage (and typically a loss)

While the file-per-field layout is not optimal for the general case, in our situation it dramatically reduces the physical bus limitation, because it is rare for more than a small handful of fields to be involved in a SELECT. And our data is strictly read only, updated once a year. The actual numbers work out to:

- 10 fields max in a query is 1/20th the overall record size
- 1/20 * 14G = approx 700M
- 700M can be scanned in 7 seconds at 100 MB/sec, or < 3 sec at 300 MB/sec

If we add a lightweight compression -- light enough to avoid becoming CPU bound, keeping the bottleneck on the IO side -- the figures improve even more.

Because MySQL already has several back ends, I'm assuming that writing this back end for MySQL is possible. The one key is knowing, at the beginning of the rip through the table, which fields are needed by the query engine, and then only opening those files (to avoid hitting the 200-open-files problem).

Thoughts? Suggestions?
Anyone else out there willing to work with us on this one? Of course the back end would be contributed back, etc.

[[ other notes: the data is denormalized, but a typical field is a one- or two-byte code, i.e. '0', '1' for gender, so normalizing and joining back doesn't make sense at all ]]

--
Heitzso <[EMAIL PROTECTED]>
MetaMedia, Inc.