>We are performing some rather extensive data collection (measurements) and
>are attempting to come up with the most sane storage mechanism to
>facilitate offline (Igor, MATLAB, custom apps) and online (less complex
>web-based solutions) analysis tools.  The problem resides in the amount of
>data collected; by rough estimates we anticipate collecting 3-5 billion
>rows of data per data collection cycle.  (Presently, this can be
>represented in 2D plots, i.e., x,y data.  Eventually data will be
>collected in more complex fashion, but the above example makes our
>storage problem a bit easier to digest.)  Of course, that number could
>fluctuate up or down.  The key here is that it is a lot of data.
>
>The question resides in the best way to store this data.  The data itself
>is associated with objects (i.e., there aren't 3 billion random
>measurements; 1k, 2k, 10k, etc. of them could be associated with a single
>object, with tens of hundreds of thousands of objects being measured).

That would be millions, right?

>The result is a graphable
>curve to represent the various measurements of an object.
>
>I've outlined 4 possible options, which, I'd love to get feedback on:
>
>1)  Use MySQL and try to store the data in its above form.  Pray it can
>handle it.  However, I've noticed in the past that MySQL begins to bog
>down between 50-500M rows, depending on your table architecture and the
>machine it's running on.  The fact that the data isn't static (i.e., we
>add to it) complicates the issue, since MySQL seems to slow down
>(understandably) trying to maintain indexes on inserts over a certain
>row count.  Thus, this doesn't really seem like an option.
>
>2)  Reference flat files with rows in the database.  I.e., store
>measurements for objects, or chunks thereof, within flat files on the
>filesystem, and then reference those files with rows in a MySQL table,
>similar to how I store and track binary data.  I don't like this option
>simply because it requires opening file handles to compare data between
>objects.  It is also a bear to organize and maintain all of those flat
>files, and it slows down the application layer.
>
>3)  Use something other than MySQL (cringe); recommendations welcome
>(Oracle, DB2, etc., something that isn't a database, or the standard
>filesystem).
>
>4)  Use data compression, trend analysis, or curve-reduction algorithms
>to trim down the data without losing the significance of the raw data
>trends.  If this is the best option, recommended table and data
>structures, along with pointers on where to learn more about data
>reduction techniques (i.e., how to effectively store the data within the
>context of rows and columns), would be helpful.
>
>Of course, the best solution is probably not among those 4.  Basically,
>any insight into how large (enterprise-level) data collection and
>analysis is conducted and implemented, whether from personal experience
>or textual references, would be much appreciated!
>
>I just can't imagine that people out there DON'T use databases to handle
>large quantities of raw data like this.
>
>Thanks for any help!
>Seth

Sir, my 2 cents worth:

Option #2 is the right church, but the wrong pew. The objects and 
their measurements should be stored separately, but not quite the way 
you suggest.

If I understand your data correctly (and I have no reason to assume 
that I do), you need a master table containing an auto_increment 
column as primary key, and a second column containing a unique 
identifier for each of your ~3 million objects. (In theory, you don't 
need the auto_increment field, but in practice, surrogate keys tend 
to avoid certain types of problems.) You'll want a UNIQUE INDEX on 
the object identifier field, to ensure that each identifier is unique.

Your detail table will contain the measurements. The first column 
will contain the primary key value of the object the measurement is 
associated with (the auto_increment value from the object table). The 
other columns will contain the data from the measurement. I'm 
assuming that each measurement consists of an x value and a y value, 
so there will be an x column and a y column. The first column will be 
indexed and NOT NULL, but not UNIQUE.
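
A minimal sketch of that two-table layout (table and column names are my
own invention, and I'm assuming x and y are floating-point):

```sql
-- Master table: one row per measured object.
CREATE TABLE object (
    object_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    identifier  VARCHAR(64)  NOT NULL,
    PRIMARY KEY (object_id),
    UNIQUE INDEX idx_identifier (identifier)
);

-- Detail table: one row per measurement; many rows per object.
CREATE TABLE measurement (
    object_id   INT UNSIGNED NOT NULL,   -- refers to object.object_id
    x           DOUBLE NOT NULL,
    y           DOUBLE NOT NULL,
    INDEX idx_object (object_id)         -- indexed, NOT NULL, not UNIQUE
);

-- Example lookup ('SN-00042' is a made-up identifier): the WHERE clause
-- searches the ~3M object rows, and the join then reaches into the
-- 3-billion-row table through idx_object rather than scanning it.
SELECT m.x, m.y
FROM object AS o
INNER JOIN measurement AS m ON m.object_id = o.object_id
WHERE o.identifier = 'SN-00042';
```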

When you search for an object, you'll be searching 3 million rows, 
not 3 billion. Your measurement table still has 3 billion rows, but 
you'll be searching it using an index when you do inner joins in your 
select statements. I'm fairly certain that this will be quicker than 
having your file system search for one of 3 million files. You still 
have the problem of slow inserts as MySQL rebuilds the index. Since 
it's only one index, it might not be so bad. Also, I'm assuming that 
you can do the inserts in batches. I think MySQL will have to rebuild 
the indices only once per batch. If users have to be able to enter 
data one measurement at a time, you can set up a table to hold the 
measurements, and then load them into the measurements table at night 
(or some other period of minimum use). If your server is using some 
form of Unix, you can set up a cron job to do this automatically.
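
The staging arrangement might look like this (names invented; assumes
the measurement table described above):

```sql
-- Staging table with no indexes, so single-row inserts stay cheap.
CREATE TABLE measurement_staging (
    object_id   INT UNSIGNED NOT NULL,
    x           DOUBLE NOT NULL,
    y           DOUBLE NOT NULL
);

-- Nightly batch move.  Swapping in an empty twin table first means
-- rows that arrive during the copy aren't lost:
RENAME TABLE measurement_staging TO measurement_draining,
             measurement_staging_empty TO measurement_staging;

INSERT INTO measurement (object_id, x, y)
SELECT object_id, x, y FROM measurement_draining;
```

The batch statements could live in a .sql file driven by a cron entry
along the lines of `0 3 * * * mysql mydb < /path/to/load.sql` (path,
database name, and time of day are placeholders).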

Consider ways of splitting up your tables. If most of the searches 
will be made on certain groups of objects, separate those objects and 
their measurements out and put them in their own tables. When you 
need to search across all objects, you can run several queries and 
combine the results in temp tables. Or if 10% of your objects are the 
subject of 90% of your SELECT statements, you could copy those 10% 
and their measurements to separate tables and make them available for 
quick searches, with the full database still available for slower 
searches.
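
For the 10%/90% case, the hot copy could be built something like this
(all names invented; the WHERE clause stands in for whatever criterion
identifies your frequently queried objects):

```sql
-- Copy the hot objects and their measurements into small side tables.
CREATE TABLE hot_object
SELECT * FROM object
WHERE identifier IN ('SN-00042', 'SN-00043');

CREATE TABLE hot_measurement
SELECT m.*
FROM measurement AS m
INNER JOIN hot_object AS o ON o.object_id = m.object_id;

-- CREATE TABLE ... SELECT copies data but not indexes, so re-add one:
ALTER TABLE hot_measurement ADD INDEX idx_object (object_id);
```

The side tables would need to be refreshed on the same schedule as the
batch loads, or they will drift out of date.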

If you haven't done so already, put as much memory into your server 
box as it will hold.

That will be $0.02, please. If my advice is no good, I'll refund the money.

Bob Hall

Know thyself? Absurd direction!
Bubbles bear no introspection.     -Khushhal Khan Khatak
MySQL list magic words: sql query database

---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
