Thanks for the info, Devrim. Seems like Tamino is unfortunately out of our
league $$$-wise, same as with Oracle.
Mummi
On Monday, December 23, 2002, at 01:17 PM, Devrim Ergel wrote:
Hello,
Software AG's Tamino XML Database might be the right answer. It is offered
as a robust native XML database for mission-critical applications. The only
problem is its price, which was $45,000 last year!
(In our case, we are a Software AG partner software company in Turkey, so
we can bundle it at reasonable prices.)
Devrim
Parsera IT
----- Original Message -----
From: "Gudmundur Arni Thorisson" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, December 20, 2002 11:07 PM
Subject: Re: Xindice scalability: using in a large bio
Kimbro, I have not estimated the total storage requirements for the
project yet, as we have so far only finalized part of the schema, including
the 400M-instance class. But it is certain that the size of each record in
that class will be quite small, in an XML sense, something like this:
<genotype lsid="urn:LSID:genome.wi.mit.edu:HapMap/Genotype:883423434:1">
  <snp_assay lsid="urn:LSID:genome.wi.mit.edu:HapMap/SNPAssay:300004343:1"/>
  <sample lsid="urn:LSID:hapmap.org:HapMap/Sample:1004:1"/>
  <genotyping_protocol
      lsid="urn:LSID:genome.wi.mit.edu:HapMap/Protocol:0034:1:1"/>
  <alleles>
    <allele base="G"/>
    <allele base="T"/>
  </alleles>
</genotype>
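As an aside, storing one of these records through the XML:DB API would look
roughly like the sketch below. The Xindice driver class name and the xmldb:
URI form are the ones I've seen in the Xindice docs, and /db/genotype is just
the collection layout discussed further down the thread, so treat this as an
untested illustration rather than working code:

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.XMLResource;

public class StoreGenotype {
    public static void main(String[] args) throws Exception {
        // Register the Xindice XML:DB driver (class name and URI form as
        // given in the Xindice docs; adjust host/port for your install).
        Database driver = (Database) Class
            .forName("org.apache.xindice.client.xmldb.DatabaseImpl")
            .newInstance();
        DatabaseManager.registerDatabase(driver);

        Collection col =
            DatabaseManager.getCollection("xmldb:xindice:///db/genotype");

        // Use the LSID as the document key so that a later lookup by
        // identifier can be a direct fetch rather than an XPath scan.
        String lsid = "urn:LSID:genome.wi.mit.edu:HapMap/Genotype:883423434:1";
        String doc  = "<genotype lsid=\"" + lsid + "\">"
                    + "<alleles><allele base=\"G\"/><allele base=\"T\"/></alleles>"
                    + "</genotype>";

        XMLResource res =
            (XMLResource) col.createResource(lsid, XMLResource.RESOURCE_TYPE);
        res.setContent(doc);
        col.storeResource(res);
        col.close();
    }
}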
Let's see: less than 1/2 KB for a single file, times 400 million records,
equals a whole bunch of space, you're right! I suppose I'd better look into
the file-size limit issue and make sure that candidate DBs and OS platform(s)
(Linux preferable) can in fact handle files of this size. Thanks for the tip,
Kimbro.
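(Just to put a rough number on that: 400,000,000 records x ~0.5 KB each is
about 200 GB of raw XML, and that's before any indexes or per-document
storage overhead.)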
Mummi
On Friday, December 20, 2002, at 06:19 PM, Kimbro Staken wrote:
On Friday, December 20, 2002, at 09:06 AM, Gudmundur Arni Thorisson
wrote:
Thanks for the rapid reply, Murray. There could be some division there,
artificial if need be (divide by e.g. the laboratory that produced the
genotypes). But after looking briefly at the Xindice quick tutorial, it
seemed to me that it would be natural to put each document type in its own
collection:
/db/genotype/
/db/snp/
/db/haplotype/
/db/sample/
/db/individual/
/db/pedigree/
..and so on.
Where the genotype collection would be by far the biggest one (one genotype
per sample per SNP, where the number of samples will be in the range 180-270
and SNPs from 500 thousand up to 1.5 million). So, yes, unless someone can
suggest otherwise, I'd think that a single collection would need to contain
those 400M records.
Also, since if one wants to retrieve a genotype by its unique (within that
type class) identifier, it would go something like this, using LSIDs (Life
Science Identifiers):
/db/genotype/@lsid='urn:LSID:washu.edu:HapMap/Genotype:23423432434:1'
(I'm no good at XPath, I know!)
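For the record, the XPath for that lookup would presumably be closer to
/genotype[@lsid='...'] run against the /db/genotype collection, and through
the XML:DB API the whole thing might look roughly like the sketch below. The
driver class and xmldb: URI are the ones from the Xindice docs, and the
washu.edu LSID is just the example above, so this is an illustration only:

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.base.ResourceIterator;
import org.xmldb.api.base.ResourceSet;
import org.xmldb.api.modules.XPathQueryService;

public class FindGenotypeByLsid {
    public static void main(String[] args) throws Exception {
        // Register the Xindice XML:DB driver as in the previous sketch.
        Database driver = (Database) Class
            .forName("org.apache.xindice.client.xmldb.DatabaseImpl")
            .newInstance();
        DatabaseManager.registerDatabase(driver);

        Collection col =
            DatabaseManager.getCollection("xmldb:xindice:///db/genotype");
        XPathQueryService xpath =
            (XPathQueryService) col.getService("XPathQueryService", "1.0");

        // Match the genotype element whose lsid attribute equals the identifier.
        ResourceSet hits = xpath.query(
            "/genotype[@lsid='urn:LSID:washu.edu:HapMap/Genotype:23423432434:1']");

        ResourceIterator it = hits.getIterator();
        while (it.hasMoreResources()) {
            System.out.println(it.nextResource().getContent());
        }
        col.close();
    }
}

An element-type index on genotype/@lsid would presumably be needed for this
kind of query to stay fast at the collection sizes being discussed.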
But if there is a per-laboratory division, one would actually have to know
which lab the genotype came from, in addition to its identifier. Not a Good
Thing. This would probably also affect other, more complex queries, I don't
know.
Is there a hard limit on the number of documents per Xindice
collection?
There is no hard limit.
Max number of files per directory or whatever, something outside Xindice's
control?
The first external limit you'll run into will be file size. Xindice can't
span a collection across files yet, so if your file system limits file size
to 4GB or something, that will be all you can store. This of course varies
by platform.
Really though 400 million is a pretty big number. The most I've ever
tested with was a little over 1 million. The server could handle more,
but it was really pushing the limits of the current system. So until
Xindice matures quite a bit more, I'd have to recommend against it as a
solution.
It will be tough to find an open source solution that can easily handle
that many documents with acceptable performance. As far as I know eXist
won't be any better in this area. Honestly, I'd really have to question
whether Oracle can even handle that much XML. Obviously, for relational
data it's up to the task, but XML is quite a bit different and there are
still some pretty inefficient aspects to what they're doing. Of course I do
think Oracle is better than anything else currently available.
Besides the number of documents, have you estimated document size and
storage space required? Even if you're looking at only 1k per document, I
believe once you throw in indexes and overhead, you're pushing 1TB in data
size. That's a pretty big chunk of data; it's not going to be easy to manage
no matter which route you take.
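(Roughly: 400,000,000 documents x ~1 KB is about 400 GB of raw XML; if
indexes and per-document overhead double or triple that, you are indeed in
1TB territory.)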
Mummi, CSHL
On Friday, December 20, 2002, at 03:43 PM, Murray Altheim wrote:
Gudmundur Arni Thorisson wrote:
[...]
It says on the Xindice website that the db is designed for many small
documents. The XML dataset that we will be handling will contain fairly
small documents but VERY many of them; up to 400 million instances of the
most populous record class.
My question is therefore this: has anyone used/tested Xindice with
datasets of this size (hundreds of millions) with decent performance
as well? This will be mainly import + query work, hardly any heavy
updating load, if that would make a difference as far as performance
goes.
One question that may help answer this: would 400 million records
be in *one* Xindice Collection, or could these be organized according
to some hierarchy, such that there would be a smaller limit at the
Collection level?
Murray
......................................................................
Murray Altheim                   <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK
If you're the first person in a new territory,
you're likely to get shot at.
-- ma
Kimbro Staken
Java and XML Software, Consulting and Writing
http://www.xmldatabases.org/
Apache Xindice native XML database http://xml.apache.org/xindice
XML:DB Initiative http://www.xmldb.org