I work at a non-profit biological research lab (Cold Spring Harbor Laboratory,
http://www.cshl.org). Our lab recently became part of a rather large
international collaboration, the Haplotype Map project, which will produce
enormous amounts of biological data (genotypes). Our role will be to
coordinate data handling and build the database to hold it all (see
http://www.genome.gov/page.cfm?pageID=10001688 if you're interested in
this), plus related tasks.
We'd originally planned to use a fairly XML-centric approach from the
ground up, for a multitude of reasons. One very strong reason was that we
could use the allegedly powerful XML capabilities of Oracle 9i XML DB to
generate an XML-relational schema from our own data-handling/exchange XML
Schema definitions.
To cut a long story short, our funding was cut down quite a bit and we
will now not be able to afford the big-bucks Oracle licenses we'd need
(3 x $15,000). Open source is now pretty much the only option for us
database-wise, which is in fact a blessing in disguise, because it will
make the endeavour entirely open source (Oracle would have been the only
proprietary component otherwise).
As of now, we are investigating open source alternatives in either the XML-on-top-of-RDBMS arena or the native XML database arena. There are some commercial offerings (Tamino, Ipedo), but as I said above, the preference is always open source. Our lab is very fond of the Apache/mod_perl world (I work for Dr. Lincoln Stein, a longtime guru in the Perl world), and we'll likely use some of the Apache XML project components (Xalan, Xerces) for XML processing in this project.
As part of this investigation, Xindice appeared on the horizon. After looking at some simple command-line examples of how one goes about handling XML documents and collections in Xindice, it looks like we just might be able to use it. There is only one thing we have concerns about: scalability. The Xindice website says the database is designed for many small documents. The XML dataset we will be handling does consist of fairly small documents, but VERY many of them; up to 400 million instances of the most populous record class.
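For reference, the basic operations we tried look roughly like this (the collection name, document name and query are just placeholders, and the syntax is from my reading of the Xindice docs, so apologies if it's slightly off):

    xindice ac -c /db -n genotypes
    xindice ad -c /db/genotypes -f genotype-000001.xml -n genotype-000001
    xindice xpath -c /db/genotypes -q "//genotype[@snp-id='rs12345']"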
My question is therefore this: has anyone used or tested Xindice with datasets of this size (hundreds of millions of documents) and still seen decent performance? Ours will be mainly an import + query workload, with hardly any heavy updating, if that makes a difference as far as performance goes.
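To make that workload concrete, here is a rough sketch of the access pattern we have in mind, written against the XML:DB API that Xindice implements (the collection name, document content and query below are made up for illustration, and error handling is omitted):

    import org.xmldb.api.DatabaseManager;
    import org.xmldb.api.base.Collection;
    import org.xmldb.api.base.Database;
    import org.xmldb.api.base.ResourceIterator;
    import org.xmldb.api.base.ResourceSet;
    import org.xmldb.api.modules.XMLResource;
    import org.xmldb.api.modules.XPathQueryService;

    public class HapmapLoadSketch {
        public static void main(String[] args) throws Exception {
            // Register the Xindice driver with the XML:DB DatabaseManager
            Database db = (Database) Class.forName(
                "org.apache.xindice.client.xmldb.DatabaseImpl").newInstance();
            DatabaseManager.registerDatabase(db);

            // The "genotypes" collection is hypothetical; this URI form is
            // for an embedded instance (a server URI would name host/port)
            Collection col = DatabaseManager.getCollection(
                "xmldb:xindice:///db/genotypes");

            // Import: one small document per genotype record
            // (this is the bulk of our load, repeated up to ~400M times)
            XMLResource doc = (XMLResource) col.createResource(
                "genotype-000001", XMLResource.RESOURCE_TYPE);
            doc.setContent(
                "<genotype snp-id=\"rs12345\" sample=\"NA12878\">AG</genotype>");
            col.storeResource(doc);

            // Query: XPath over the whole collection
            XPathQueryService svc = (XPathQueryService)
                col.getService("XPathQueryService", "1.0");
            ResourceSet results = svc.query("//genotype[@snp-id='rs12345']");
            ResourceIterator it = results.getIterator();
            while (it.hasMoreResources()) {
                System.out.println(it.nextResource().getContent());
            }
            col.close();
        }
    }

The worry, of course, is how that final XPath behaves once the collection holds hundreds of millions of such documents rather than a handful.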
Thanks in advance for your reply. Regards,
Mummi, Cold Spring Harbor Laboratory
PS: I have attached key XML Schema draft component files for the portion of our total schema that has been mostly nailed down so far, plus one (hapmap.xsd) that ties them all together. There will be maybe 2 or 3 times this many types of objects in the total schema. The file genotype.xsd defines the <400M record class.
Attachments: batch_submission.xsd, genotype.xsd, hapmap.xsd, snp.xsd, element_groups.xsd, simple_types.xsd
--
----------------------------------------------------
Gudmundur Arni Thorisson, bioinformatics researcher, B.Sc.
Steinlab, Cold Spring Harbor Laboratory
w-phone#: 516-367-6904
w-fax#: 516-367-8389
1 Bungtown Road
Cold Spring Harbor, NY 11724
USA
