Re: proposal: new back end tailored to data mining very large tables

2003-02-03 Thread Steven Roussey
First of all, I'd try optimizing your app before writing a whole new
back end, so I'll keep this on the normal MySQL list.

For example, even if the indexes are big, try to index all the columns that
might be searched. Heck, start by indexing all of them. If the data is
read-only, try myisampack.
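For instance, something like this (the table and column names here are
just placeholders for whatever your schema actually uses):

  ALTER TABLE natality
      ADD INDEX (birth_year),
      ADD INDEX (state),
      ADD INDEX (gender),
      ADD INDEX (birth_weight);

myisampack itself is a separate command-line step, run against the
table while the server isn't using it.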

Or, do the index thing above and use InnoDB for this app and be sure to
select only those columns that you need. InnoDB does not read the whole
record if it does not need to, even in a table scan (which is the worst case
scenario you are calculating).
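Roughly like this (hypothetical names again; TYPE= is how you pick the
table handler):

  ALTER TABLE natality TYPE=InnoDB;

  -- select only the columns you need, never SELECT *
  SELECT birth_year, state, gender, birth_weight
  FROM natality
  WHERE birth_year BETWEEN 1990 AND 1999;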

All your calculations assume a full table scan, which can be avoided by
a good choice of indexes and by using InnoDB to avoid whole-record
retrieval. Am I missing something? Pulling data from a small 14GB table
should not be a problem. My $10K machine deals with 100GB of data and
handles some 5000 queries per second.

Also, your reference to denormalization didn't make any sense to me. What
level of normal form are you expecting?

Sincerely,
Steven Roussey
http://Network54.com/ 








proposal: new back end tailored to data mining very large tables

2003-02-02 Thread Heitzso
We (Centers for Disease Control and Prevention) want to mount
relatively large read-only tables that rarely change on
modest ($10K ??) hardware and get < 10 second response time.
For instance, 10 years of detailed natality records for the United
States, in which each record has some 200 fields and which, as a
MySQL MyISAM table, is 14G in size.  Queries will typically hit 6 to
10 fields from the table, and while some queries will have a WHERE
clause that can take advantage of a YEAR or GEOGRAPHY index, most won't.
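
For concreteness, a typical query would look something like this (the
field names are invented for illustration):

  SELECT state, gender, birth_weight, gestation_weeks, mother_age,
         COUNT(*)
  FROM natality
  WHERE birth_year = 1998        -- only sometimes indexable like this
  GROUP BY state, gender, birth_weight, gestation_weeks, mother_age;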

On the hardware side, even a 320 LVD SCSI RAID sitting on a 64-bit/66MHz
PCI bus is limited to some 300 MB/sec, and a more common 160 SCSI RAID
is limited to the 32-bit/33MHz PCI bus speed of approximately 100+ MB/sec.

So the problem becomes one of moving a whole table over the bus
in X time.  Assuming 300 MB/sec and a 10 second query, we're limited
to a 3G table.  Allowing the query response time to bump up to
30 seconds, we hit a 9G table limit.  But our detailed natality
table is 14G.

Some design options:
 - do the typical data mining thing and preaggregate or subset
 the main table into a thousand or more subtables, and write an
 intelligent filter that sits in front of these tables to decide
 which one to reroute each query to (see the sketch after this
 list) ...

 - use what the Census Bureau calls a transverse server design and
 write out each field in the table to its own file

 - use MySQL compression (myisampack) on the MyISAM table -- I've tried
 this and the process becomes horribly CPU bound when scanning through
 the entire table (14G uncompressed, 2.2G compressed), with little if
 any time advantage, and typically a loss
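
To make the first option concrete, a pre-aggregated subtable might look
like this (names invented; the hard part is the filter that knows which
aggregate can answer which query):

  CREATE TABLE natality_by_year_state
      SELECT birth_year, state, COUNT(*) AS births
      FROM natality
      GROUP BY birth_year, state;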

While the file-per-field layout is not optimal for the general case,
in our situation it dramatically reduces the physical bus limitation,
because it is rare for more than a small handful of fields to be
involved in a SELECT request.  ... and our data is strictly read-only
and is updated only once a year.
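
A crude way to prototype this layout on stock MySQL (no new back end)
is to give each field its own narrow two-column table keyed by a record
id, so a query only scans the handful of narrow tables it actually
references.  The names below are invented, and the join overhead may
well give back much of the gain, but it shows how little data the bus
has to move:

  CREATE TABLE natality_gender (
      rec_id  INT UNSIGNED NOT NULL,
      gender  CHAR(1)      NOT NULL,
      PRIMARY KEY (rec_id)
  ) TYPE=MyISAM;

  CREATE TABLE natality_birth_year (
      rec_id      INT UNSIGNED NOT NULL,
      birth_year  SMALLINT     NOT NULL,
      PRIMARY KEY (rec_id)
  ) TYPE=MyISAM;

  -- a query touching two fields reads only these two narrow tables
  SELECT y.birth_year, g.gender, COUNT(*)
  FROM natality_birth_year y
  JOIN natality_gender g ON g.rec_id = y.rec_id
  GROUP BY y.birth_year, g.gender;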

The actual numbers work out to:
   10 fields max in a query is 1/20th of the overall record size
   1/20 * 14G = approx 700M
   700M can be scanned in 7 seconds at 100MB/s or < 3 sec at 300MB/s

If we add a lightweight compression -- light enough to avoid becoming
CPU bound, so the bottleneck stays on the I/O side -- the figures
improve even more.

Because MySQL already has several back ends, I'm assuming that
writing this back end for MySQL is possible.  The one key is
knowing, at the beginning of the rip through the table, which
fields are needed by the query engine, and then opening only
those files (to avoid hitting the 200-open-files problem).

Thoughts?  Suggestions?

Anyone else out there willing to work with us on this one?
Of course the back end would be contributed back, etc.


[[ other notes: the data is denormalized, but a typical field is
one or two bytes of some code, e.g. '0', '1' for gender, so
normalizing and joining back doesn't make sense at all ]]


-- 
Heitzso [EMAIL PROTECTED]
MetaMedia, Inc.

