Re: Xindice scalability: using in a large bio

Kimbro Staken 20 Dec 2002 18:17:32 -0000

On Friday, December 20, 2002, at 09:06 AM, Gudmundur Arni Thorisson wrote:

Thanks for the rapid reply, Murray. There could be some division there, artificial if need be (divide by e.g. the laboratory that produced the genotypes). But after looking briefly at the Xindice quick tutorial, it seemed to me that it would be natural to put each document type in its own collection:
/db/genotype/
/db/snp/
/db/haplotype/
/db/sample/
/db/individual/
/db/pedigree/
..and so on.
Where the genotype collection would be by far the biggest one (one genotype per sample per SNP, where the number of samples will be in the range 180-270 and SNPs from 500 thousand up to 1.5 million). So, yes, unless someone can suggest otherwise, I'd think that a single collection would need to contain those 400M records. Also, if one wants to retrieve a genotype by its unique (within that type class) identifier, it would go something like this, using LSIDs (Life Science Identifiers):
/db/genotype/@lsid='urn:LSID:washu.edu:HapMap/Genotype:23423432434:1'
(I'm no good at XPath, I know!) But if there is a per-laboratory division, one would actually have to know which lab the genotype came from, in addition to its identifier. Not a Good Thing. This would probably also affect other, more complex queries, I don't know.

Is there a hard limit on the number of documents per Xindice collection?


There is no hard limit.

Max number of files per directory or whatever, something outside Xindice's control?

The first external limit you'll run into will be file size. Xindice can't span a collection across files yet, so if your file system limits file size to 4GB or so, that will be all you can store. This of course varies by platform.

Really though 400 million is a pretty big number. The most I've ever tested with was a little over 1 million. The server could handle more, but it was really pushing the limits of the current system. So until Xindice matures quite a bit more, I'd have to recommend against it as a solution.

It will be tough to find an open source solution that can easily handle that many documents with acceptable performance. As far as I know eXist won't be any better in this area. Honestly, I'd really have to question whether Oracle can even handle that much XML. Obviously, for relational data it's up to the task, but XML is quite a bit different and there are still some pretty inefficient aspects to what they're doing. Of course I do think Oracle is better than anything else currently available.

Besides the number of documents, have you estimated document size and the storage space required? Even if you're looking at only 1k per document, I believe once you throw in indexes and overhead, you're pushing 1TB in data size. That's a pretty big chunk of data, and it's not going to be easy to manage no matter which route you take.
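
As a rough check of the figures in this thread, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: the sample and SNP ranges, the 1k document size, and the 4GB file limit come from the messages above, while the overhead multiplier of 2.5 is a purely hypothetical value chosen for illustration, not a measured property of Xindice.

```python
# Back-of-the-envelope sizing for the genotype collection.
# Figures marked "thread" are quoted in the discussion; others are assumptions.
SAMPLES = (180, 270)             # thread: range of samples
SNPS = (500_000, 1_500_000)      # thread: range of SNPs
BYTES_PER_DOC = 1_000            # thread: "1k per document", taken as 1000 bytes
OVERHEAD_FACTOR = 2.5            # ASSUMPTION: hypothetical index/overhead multiplier
FILE_LIMIT_BYTES = 4 * 2**30     # thread: example 4GB file-size limit


def genotype_count(samples: int, snps: int) -> int:
    """One genotype document per sample per SNP."""
    return samples * snps


def raw_bytes(docs: int, bytes_per_doc: int = BYTES_PER_DOC) -> int:
    """Raw document payload, before indexes and storage overhead."""
    return docs * bytes_per_doc


low = genotype_count(SAMPLES[0], SNPS[0])    # 90,000,000 documents
high = genotype_count(SAMPLES[1], SNPS[1])   # 405,000,000 documents

raw = raw_bytes(high)                        # payload only
estimated_total = raw * OVERHEAD_FACTOR      # payload plus assumed overhead

# Documents that fit in one file under the example limit, ignoring overhead.
docs_per_file = FILE_LIMIT_BYTES // BYTES_PER_DOC

print(f"documents: {low:,} to {high:,}")
print(f"raw payload at upper bound: {raw / 1e9:.0f} GB")
print(f"with assumed overhead: {estimated_total / 1e12:.2f} TB")
print(f"documents per 4GB file (no overhead): {docs_per_file:,}")
```

Under these assumptions the upper bound is about 405 million documents, and a single 4GB file would hold only a few million of them even before indexes are counted, which is why the file-size limit is the first one to matter.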

                  Mummi, CSHL
On Friday, December 20, 2002, at 03:43 PM, Murray Altheim wrote:
Gudmundur Arni Thorisson wrote:
[...]
It says on the Xindice website that the db is designed for many, small documents. The XML dataset that we will be handling will contain fairly small documents but VERY many of them; up to 400 million instances of the most populous record class. My question is therefore this: has anyone used/tested Xindice with datasets of this size (hundreds of millions) with decent performance as well? This will be mainly import + query work, hardly any heavy updating load, if that would make a difference as far as performance goes.
One question that may help answer this: would 400 million records
be in *one* Xindice Collection, or could these be organized according
to some hierarchy, such that there would be a smaller limit at the
Collection level?
Murray
......................................................................
Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
Knowledge Media Institute
The Open University, Milton Keynes, Bucks, MK7 6AA, UK
           If you're the first person in a new territory,
           you're likely to get shot at.
                                                    -- ma

Kimbro Staken
Java and XML Software, Consulting and Writing http://www.xmldatabases.org/
Apache Xindice native XML database http://xml.apache.org/xindice
XML:DB Initiative http://www.xmldb.org
