Keith Ching wrote:
> unfortunately, we have to store and make queryable all the raw data..
> 
> so it would be like 14 million rows x 300 / month = 4.2 billion rows / month
> = 50 billion rows per year..


I wouldn't do this with BASE. It is a question of performance versus 
flexibility, not only in the database but also in the software used to 
handle the data. BASE has a certain degree of flexibility, i.e. you can 
configure different raw data types with their own specific data columns, 
and it is designed to work with different databases.

The price we pay is loss of performance. If you make a decision about 
the kind of data and which database to use, you could create code that is 
optimized for that, which would probably be a lot faster. To give you an 
example: on my development machine data import runs at about 2,000 rows 
per second. My machine is not a very fancy or new one, so this figure 
will probably be higher on a dedicated server. Let's say 10,000 rows per 
second. For a data set that contains 14 million rows the import would 
still take about 23 minutes. The import could be optimized by using tools 
provided by the database vendors. For example, MySQL can import directly 
from a tab-separated file with the LOAD DATA INFILE command.
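
The import times above can be checked with a few lines of Python. This 
is only a back-of-the-envelope sketch: the two rates are the figures 
mentioned in this mail, the 300 scans per month comes from the quoted 
message, and the function names are made up for illustration.

```python
# Rough import-time estimates for the row counts discussed in this thread.
# The rates are the figures quoted in the mail, not server measurements.

ROWS_PER_SCAN = 14_000_000   # probes in one whole genome scan
SCANS_PER_MONTH = 300        # figure from the quoted message


def import_seconds(rows: int, rows_per_second: float) -> float:
    """Time in seconds to import `rows` at a constant rate."""
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return rows / rows_per_second


def format_duration(seconds: float) -> str:
    """Render a duration as hours (if an hour or more) or minutes."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} h"
    return f"{seconds / 60:.1f} min"


if __name__ == "__main__":
    for rate in (2_000, 10_000):
        one_scan = import_seconds(ROWS_PER_SCAN, rate)
        one_month = import_seconds(ROWS_PER_SCAN * SCANS_PER_MONTH, rate)
        print(f"{rate:>6} rows/s: one scan {format_duration(one_scan)}, "
              f"one month {format_duration(one_month)}")
```

At 10,000 rows per second a month of scans (4.2 billion rows) works out 
to roughly 117 hours of pure import time, which is why the row count 
matters more than the exact rate.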

The Affymetrix platform has about one tenth of the data per array and we 
have already decided that this is too much to go into the database. 
Instead the raw data is stored as files only. The first step of analysis 
is to merge all probes in a probeset into a single row, which takes the 
number of rows down by a factor of 10-20, i.e. into the same range as 
regular microarray data. This is stored in the database. If the grouping 
you are talking about could take it down to 100,000 rows, BASE could be 
useful.
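
To make the grouping idea concrete, here is a minimal Python sketch of 
how evenly spaced probes could be packed into fixed-size groups, with 
each probe's position recomputed from the group start and the spacing. 
Everything in it is an assumption for illustration: the names 
ProbeGroup and group_probes are hypothetical, the probes are assumed to 
lie on one contiguous region with constant spacing, and the mean is 
only a placeholder for whatever summary would make sense for your data. 
It is not how BASE or the Affymetrix plugin merges probesets. With 14 
million probes, a group size of 140 would give 100,000 rows.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class ProbeGroup:
    """One stored row: a run of evenly spaced probes (hypothetical layout)."""
    start: int            # position of the first probe in the group
    spacing: int          # distance between neighbouring probes
    values: List[float]   # one measurement per probe, in probe order

    def position(self, index: int) -> int:
        """Position of the probe at `index`, derived instead of stored."""
        if not 0 <= index < len(self.values):
            raise IndexError(index)
        return self.start + index * self.spacing

    def summary(self) -> float:
        """Placeholder summary value; the mean is an assumption."""
        return mean(self.values)


def group_probes(start: int, spacing: int, values: Sequence[float],
                 group_size: int) -> List[ProbeGroup]:
    """Split the measurements into consecutive groups of `group_size`."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    groups = []
    for offset in range(0, len(values), group_size):
        chunk = list(values[offset:offset + group_size])
        groups.append(ProbeGroup(start + offset * spacing, spacing, chunk))
    return groups
```

Only the start, the spacing and the values need to be stored per group; 
for example, group_probes(1000, 100, [1.0, 2.0, 3.0, 4.0, 5.0], 2) 
gives three groups, and the second one starts at position 1200.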

I really don't have any suggestions for you. You know your data better 
than I do. Maybe you need to create new software, maybe you can find 
some other existing software, maybe you can get the size of your data 
down to fit into BASE.

/Nicklas


> i guess mysql would bog..
> 
> however, since these are tiling arrays with evenly spaced probes, one 
> can calculate the position
> of each probe given the starting point and the number of probes from the 
> start.
> 
> could information be stored more efficiently if the probes were 
> compacted into groups of
> 10k or 100k? then we're talking about millions of rows instead of billions.
> 
> i've heard that oracle can handle billions of rows of data, but i can't 
> imagine that it's very fast
> even if indexed properly..
> 
> -keith
> 
> Nicklas Nordborg wrote:
>> Keith Ching wrote:
>>   
>>> Hi,
>>>
>>> I am looking into using BASE2 to store ChIP-chip data from the NimbleGen 
>>> platform.
>>> Each whole genome scan has 14 million probes, divided up into 38 arrays 
>>> of 370k probes each.
>>>
>>> What is the feasibility of storing this information in BASE2?  Say we 
>>> had 100+ whole genome scans.
>>> Would it even be practical?  Should I just store the raw datafiles as 
>>> file attachments?  It would be nice
>>> to have some compression built into the file attachments as this could 
>>> save 75% on the disk space as each
>>> expt is 3 gigs or so.
>>>     
>>
>> Wow, that is really a lot of data. I wouldn't store that in the 
>> database. It would suck the performance out of the entire application. 
>> You could compress the files before you upload them to Base 2. Or, you 
>> could let the operating system automatically compress the folder where 
>> the file uploads are stored.
>>
>> Note however, if you store the data in files, you will not be able to 
>> use any of the existing plugins to analyze the data. If you want to do 
>> that you will need to create a plugin that generates a more manageable 
>> data set from the files. We have created such a plugin for Affymetrix 
>> files. See http://lev.thep.lu.se/trac/baseplugins/wiki/thep.lu.se.RMAExpress
>> for more information about it.
>>
>> /Nicklas
>>
>>
>> -------------------------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services, security?
>> Get stuff done quickly with pre-integrated technology to make your job easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>> _______________________________________________
>> The BASE general discussion mailing list
>> basedb-users@lists.sourceforge.net
>> unsubscribe: send a mail with subject "unsubscribe" to
>> [EMAIL PROTECTED]
>>
>>   
> 

