We are performing some rather extensive data collection (measurements)
and are trying to come up with the sanest storage mechanism to support
both offline analysis tools (Igor, MATLAB, custom apps) and online
ones (simpler web-based solutions).  The problem is the sheer amount
of data collected: by rough estimate we anticipate 3-5 billion rows of
data per collection cycle.  At present each measurement can be
represented as 2D (x,y) data; eventually the data will be collected in
more complex forms, but the 2D case makes the storage problem easier
to digest.  Of course, that number could fluctuate up or down.  The
key point is that it is a lot of data.

The question is how best to store this data.  The data itself is
associated with objects (i.e., there aren't 3 billion random
measurements; 1k, 2k, 10k, etc. of them belong to a given object, with
on the order of hundreds of thousands of objects being measured).  The
result for each object is a graphable curve representing its various
measurements.

I've outlined four possible options which I'd love to get feedback on:

1)  Use MySQL and try to store the data in the form above.  Pray it
can handle it.  However, I've noticed in the past that MySQL begins to
bog down somewhere between 50 and 500 million rows, depending on the
table architecture and the machine it's running on.  The fact that the
data isn't static (i.e., we keep adding to it) complicates the issue,
since MySQL understandably slows down maintaining indexes on inserts
once the table passes a certain row count.  So this doesn't really
seem like an option.  (A rough sketch of the schema I have in mind
follows this list.)

2)  Reference flat files with rows in the database.  That is, store
the measurements for objects (or chunks of them) in flat files on the
filesystem, and reference those files from rows in a MySQL table,
similar to how I store and track binary data (see the second sketch
after this list).  I don't like this option because it requires
opening file handles to compare data across objects, it's a bear to
organize and maintain all of those flat files, and it slows down the
application layer.

3)  Use something other than MySQL (cringe); recommendations welcome
(Oracle, DB2, etc., something that isn't a database at all, or even
the standard filesystem).

4)  Use data compression, trend analysis, or curve-reduction
algorithms to trim the data down without losing the significance of
the underlying trends.  If this is the best option, recommended table
and data structures, along with pointers on where to learn more about
data-reduction techniques (i.e., how to store the reduced data
effectively within rows and columns), would be helpful.  A crude
example of what I mean is sketched after this list.
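
For what it's worth, here's roughly the kind of table I had in mind
for option 1.  This is just a sketch; the names (measurement,
object_id, etc.) are invented, and the real table would carry more
metadata per row:

    CREATE TABLE measurement (
        object_id  INT UNSIGNED NOT NULL,  -- which object the point belongs to
        x          DOUBLE NOT NULL,        -- independent variable
        y          DOUBLE NOT NULL,        -- measured value
        INDEX (object_id)                  -- to pull back one object's curve
    );

My worry is what maintaining that index looks like by the time we're
inserting the three-billionth row.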
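
For option 2, the MySQL side is just bookkeeping; something like the
following (again, the names are invented):

    CREATE TABLE object_datafile (
        object_id   INT UNSIGNED NOT NULL,
        cycle_id    INT UNSIGNED NOT NULL,   -- which collection cycle
        path        VARCHAR(255) NOT NULL,   -- flat file with the raw x,y pairs
        num_points  INT UNSIGNED NOT NULL,   -- how many rows live in that file
        PRIMARY KEY (object_id, cycle_id)
    );

The raw curves live on disk and the table only tells the application
layer where to find them, which is exactly the file-handling overhead
I'd rather avoid.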
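
And the crudest reduction I can picture for option 4 is bucketed
averaging: collapse fixed-width runs of each object's curve into
single averaged points before (or instead of) storing every raw row.
Assuming the measurement table sketched above:

    -- Average each object's curve over fixed-width buckets of x.
    -- The bucket width (10 here) is arbitrary; it would have to be
    -- tuned per data set.
    SELECT object_id,
           FLOOR(x / 10) AS bucket,
           AVG(x) AS x_avg,
           AVG(y) AS y_avg
    FROM measurement
    GROUP BY object_id, bucket;

Obviously a fixed-width average throws detail away, which is why I'm
asking whether there are smarter reduction schemes that keep the shape
of a curve without keeping every raw point.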

Of course, the best solution is probably not among those four.
Basically, any insight into how large-scale (enterprise) data
collection and analysis is conducted and implemented, whether from
personal experience or from a textual reference, would be much
appreciated!

I just can't imagine that people out there DON'T use databases to handle
large quantities of raw data like this. 

Thanks for any help!
Seth


---
Seth Northrop
Manager of Information Technology
Reflectivity, Inc.
3910 Freedom Circle, Suite 103
Santa Clara, CA 95054
http://www.reflectivity.com/

