>We are performing some rather extensive data collection (measurements) and
>are attempting to come up with the sanest storage mechanism to
>facilitate both offline (Igor, MATLAB, custom apps) and online (less complex
>web-based solutions) analysis tools. The problem resides in the amount of
>data collected; by rough estimate we anticipate collecting 3-5 billion
>rows of data per collection cycle (presently this can be represented in
>2D plots, i.e. x,y data; eventually data will be collected in more complex
>fashion, but the above example makes our storage problem a bit easier to
>digest). Of course, that number could fluctuate up or
>down. The key here is that it is a lot of data.
>
>The question resides in the best way to store this data. The data itself
>is associated with objects (i.e. there aren't 3 billion random measurements;
>1k, 2k, 10k, etc. of them could be associated with a single object, with
>tens of hundreds of thousands of objects being measured).
That would be millions, right?
>The result is a graphable
>curve to represent the various measurements of an object.
>
>I've outlined 4 possible options, which, I'd love to get feedback on:
>
>1) Use MySQL and try to store the data in the form above. Pray it can
>handle it. However, I've noticed in the past that MySQL begins to bog down
>somewhere between 50-500M rows, depending on your table architecture and
>the machine it's running on. The fact that the data isn't static (i.e. we
>keep adding to it) complicates the issue, since MySQL seems to slow down
>(understandably) updating indexes on inserts past a certain row count.
>Thus, this doesn't really seem like an option.
>
>2) Reference flat files with rows in the database. I.e., store
>measurements for objects, or chunks thereof, in flat files on the
>filesystem, and then reference those files from rows in a MySQL table,
>similar to how I store and track binary data. I don't like this option
>simply because it requires file handles to open up and compare data for
>objects; it is also a bear to organize and maintain all of those flat
>files, and it slows down the application layer.
>
>3) Use something other than MySQL (cringe).. recommendations welcome
>(Oracle, DB2, etc., or even something that's not a database, or the
>standard filesystem).
>
>4) Use data compression, trend analysis, or curve-reduction algorithms to
>trim down the data without losing the significance of the raw data
>trends. If this is the best option, recommended table and data structures,
>along with pointers on where to learn more about data-reduction
>techniques (i.e. how to effectively store the data within the context
>of rows and columns), would be helpful.
>
>Of course, the best solution is probably not among those 4. Basically,
>any insight into how large (enterprise-level) data collection and
>analysis is conducted and implemented, whether from personal experience
>or textual reference, would be much appreciated!
>
>I just can't imagine that people out there DON'T use databases to handle
>large quantities of raw data like this.
>
>Thanks for any help!
>Seth
Sir, my 2 cents' worth:
Option #2 is the right church, but the wrong pew. The objects and
their measurements should be stored separately, but not quite the way
you suggest.
If I understand your data correctly (and I have no reason to assume
that I do), you need a master table containing an auto_increment
column as primary key, and a second column containing a unique
identifier for each of your ~3 million objects. (In theory, you don't
need the auto_increment column, but in practice, surrogate keys tend
to avoid certain types of problems.) You'll want a UNIQUE INDEX on
the object identifier column, to ensure that each identifier is unique.
Your detail table will contain the measurements. The first column
will contain the primary key value of the object the measurement is
associated with (the auto_increment value from the object table). The
other columns will contain the data from the measurement. I'm
assuming that each measurement consists of an x value and a y value,
so there will be an x column and a y column. The first column will be
indexed and NOT NULL, but not UNIQUE.
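A minimal sketch of that layout (the table and column names here are
mine, not Seth's; adjust the types to match your identifiers and the
precision of your measurements):

```sql
-- Master table: one row per measured object (~3 million rows).
CREATE TABLE object (
    object_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    object_name  VARCHAR(64)  NOT NULL,   -- your external identifier
    PRIMARY KEY (object_id),
    UNIQUE INDEX (object_name)            -- each identifier appears once
);

-- Detail table: one row per measurement (~3 billion rows).
CREATE TABLE measurement (
    object_id  INT UNSIGNED NOT NULL,     -- object.object_id of the owner
    x          DOUBLE NOT NULL,
    y          DOUBLE NOT NULL,
    INDEX (object_id)                     -- indexed, NOT NULL, not unique
);
```

A lookup then hits the small table first and reaches the big table
only through the object_id index:

```sql
SELECT m.x, m.y
FROM object o
INNER JOIN measurement m ON m.object_id = o.object_id
WHERE o.object_name = 'sample-42';       -- hypothetical identifier
```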
When you search for an object, you'll be searching 3 million rows,
not 3 billion. Your measurement table still has 3 billion rows, but
you'll be searching it using an index when you do inner joins in your
select statements. I'm fairly certain that this will be quicker than
having your file system search for one of 3 million files. You still
have the problem of slow inserts as MySQL updates the index. Since
it's only one index, it might not be so bad. Also, I'm assuming that
you can do the inserts in batches; I think MySQL will have to update
the index only once per batch. If users have to be able to enter
data one measurement at a time, you can set up a holding table for the
measurements, and then load them into the measurement table at night
(or some other period of minimum use). If your server is running some
form of Unix, you can set up a cron job to do this automatically.
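The holding-table arrangement might be sketched like this (again,
hypothetical names; the point is that the holding table carries no
index, so single-row inserts into it stay cheap):

```sql
-- Unindexed holding table for measurements entered one at a time.
CREATE TABLE measurement_holding (
    object_id  INT UNSIGNED NOT NULL,
    x          DOUBLE NOT NULL,
    y          DOUBLE NOT NULL
);

-- Nightly batch, run from cron: move the day's rows into the
-- indexed measurement table in one statement, then empty the
-- holding table.
INSERT INTO measurement
SELECT object_id, x, y FROM measurement_holding;

DELETE FROM measurement_holding;
```

For bulk loads from instrument output files, LOAD DATA INFILE into
the measurement table is generally faster still than row-by-row
INSERTs.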
Consider ways of splitting up your tables. If most of the searches
will be made on certain groups of objects, separate those objects and
their measurements out and put them in their own tables. When you
need to search across all objects, you can run several queries and
combine the results in temp tables. Or if 10% of your objects are the
subject of 90% of your SELECT statements, you could copy those 10%
and their measurements to separate tables and make them available for
quick searches, with the full database still available for slower
searches.
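The hot-subset copy could look something like this, assuming a flag
column (is_hot below is my invention) marks the frequently queried
10% of objects:

```sql
-- Build a smaller table holding only the measurements of the
-- busiest objects; searches against it touch ~10% of the rows.
CREATE TABLE hot_measurement (INDEX (object_id))
SELECT m.object_id, m.x, m.y
FROM measurement m
INNER JOIN object o ON o.object_id = m.object_id
WHERE o.is_hot = 1;    -- hypothetical flag on the object table
```

The copy goes stale as new measurements arrive, so it would need to
be rebuilt or appended to as part of the same nightly batch.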
If you haven't done so already, put as much memory into your server
box as it will hold.
That will be $0.02, please. If my advice is no good, I'll refund the money.
Bob Hall
Know thyself? Absurd direction!
Bubbles bear no introspection. -Khushhal Khan Khatak
MySQL list magic words: sql query database