Re: Xindice scalability: using in a large bio

Devrim Ergel 23 Dec 2002 13:14:13 -0000

Hello,

Software AG's Tamino XML Database might be the right answer. They offer it
as a robust native XML database for mission-critical applications. The only
problem is its price, which was $45,000 last year!
(In our case, we are a Software AG partner software company in Turkey, so we
can bundle it at reasonable prices.)


Devrim
Parsera IT

----- Original Message -----
From: "Gudmundur Arni Thorisson" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Friday, December 20, 2002 11:07 PM
Subject: Re: Xindice scalability: using in a large bio


>    Kimbro, I have not estimated the total storage requirements for the
> project yet, as we have not yet finalized more than a part of the schema,
> including the 400M instance class. But it is certain that the size of
> each record in that class will be quite small, in an XML-sense, something
> like this:
>
>      <genotype lsid="urn:LSID:genome.wi.mit.edu:HapMap/Genotype:883423434:1">
>        <snp_assay lsid="urn:LSID:genome.wi.mit.edu:HapMap/SNPAssay:300004343:1"/>
>        <sample lsid="urn:LSID:hapmap.org:HapMap/Sample:1004:1"/>
>        <genotyping_protocol lsid="urn:LSID:genome.wi.mit.edu:HapMap/Protocol:0034:1:1"/>
>        <alleles>
>          <allele base="G"/>
>          <allele base="T"/>
>        </alleles>
>      </genotype>
>
>    Let's see, this is less than 1/2 KB in size for a single file, times 400
> million records equals a whole bunch of space, you're right! I suppose I'd
> better look into the file size limit thing and make sure that the candidate
> dbs and OS platform(s) (Linux preferable) can in fact handle files of this
> size. Thanks for the tip, Kimbro.
>
>
>            Mummi
>
>
> On Friday, December 20, 2002, at 06:19 PM, Kimbro Staken wrote:
>
> >
> > On Friday, December 20, 2002, at 09:06  AM, Gudmundur Arni Thorisson
> > wrote:
> >
> >>   Thanks for the rapid reply, Murray. There could be some division there,
> >> artificial if need be (divide by e.g. the laboratory that produced the
> >> genotypes). But after looking briefly at the Xindice quick tutorial, it
> >> seemed to me that it would be natural to put each document type in its
> >> own collection:
> >>
> >> /db/genotype/
> >> /db/snp/
> >> /db/haplotype/
> >> /db/sample/
> >> /db/individual/
> >> /db/pedigree/
> >> ..and so on.
> >>
> >>    Where the genotype collection would be by far the biggest one (one
> >> genotype per sample per SNP, where the number of samples will be in the
> >> range 180-270 and SNPs from 500 thousand up to 1.5 million). So, yes,
> >> unless someone can suggest otherwise, I'd think that a single collection
> >> would need to contain those 400M records.
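
The 400M figure follows from "one genotype per sample per SNP". A small sketch of that multiplication, using the ranges given above (the function name is just for illustration):

```python
def genotype_count(samples: int, snps: int) -> int:
    """One genotype record per sample per SNP."""
    return samples * snps

# Lower and upper ends of the ranges mentioned in the thread.
low = genotype_count(180, 500_000)
high = genotype_count(270, 1_500_000)

print(f"{low:,} to {high:,} genotype records")
```

So the collection would hold somewhere between 90 million and 405 million records, and the 400M number is essentially the upper end.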
> >>   Also, if one wants to retrieve a genotype by its unique (within
> >> that type class) identifier, it would go something like this, using
> >> LSIDs (Life Science Identifiers):
> >>
> >> /db/genotype/@lsid='urn:LSID:washu.edu:HapMap/Genotype:23423432434:1'
> >>
> >> (I'm no good at XPath, I know!)
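
In XPath, a lookup by attribute value is normally written as a predicate on the element, e.g. genotype[@lsid='...']. The sketch below shows the idea with Python's standard xml.etree.ElementTree, which supports a limited XPath subset. It is not Xindice's own query API, and the <collection> wrapper and the second LSID are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the /db/genotype collection: a wrapper element
# holding two small genotype documents (the second LSID is invented).
COLLECTION = """<collection>
  <genotype lsid="urn:LSID:washu.edu:HapMap/Genotype:23423432434:1">
    <alleles><allele base="G"/><allele base="T"/></alleles>
  </genotype>
  <genotype lsid="urn:LSID:washu.edu:HapMap/Genotype:23423432435:1">
    <alleles><allele base="A"/><allele base="A"/></alleles>
  </genotype>
</collection>"""

def find_genotype(root, lsid):
    """Return the genotype element with the given lsid, or None.

    The predicate [@lsid='...'] selects child elements by attribute value.
    Assumes the lsid itself contains no single quote.
    """
    matches = root.findall(f"genotype[@lsid='{lsid}']")
    return matches[0] if matches else None

root = ET.fromstring(COLLECTION)
hit = find_genotype(root, "urn:LSID:washu.edu:HapMap/Genotype:23423432434:1")
bases = [allele.get("base") for allele in hit.iter("allele")]
print(bases)  # prints ['G', 'T']
```

With a single genotype collection the identifier alone is enough for the lookup, which is the point made above about not needing to know the originating lab.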
> >>  But if there is a per-laboratory division, one would actually have to
> >> know which lab the genotype came from, in addition to its identifier.
> >> Not a Good Thing. This would probably also affect other, more complex
> >> queries, I don't know.
> >>
> >>   Is there a hard limit on the number of documents per Xindice
> >> collection?
> >
> > There is no hard limit.
> >
> >>  Max number of files per directory or whatever, something outside
> >> Xindice's control?
> >
> > The first external limit you'll run into will be file size. Xindice can't
> > span a collection across files yet, so if your file system limits file
> > size to 4GB or something, that will be all you can store. This of course
> > varies by platform.
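
To put the file size limit in perspective, here is a back-of-the-envelope sketch. It assumes a 4 GiB limit, 512 bytes per document, and no index or storage overhead at all, so real capacity would be lower.

```python
def max_documents(file_limit_bytes: int, bytes_per_document: int) -> int:
    """How many documents of a given size fit under one file size limit."""
    return file_limit_bytes // bytes_per_document

def files_needed(total_documents: int, file_limit_bytes: int,
                 bytes_per_document: int) -> int:
    """Files of the given limit needed to hold all documents (ceiling division)."""
    total_bytes = total_documents * bytes_per_document
    return -(-total_bytes // file_limit_bytes)

FOUR_GIB = 4 * 2**30   # assumed file size limit
DOC_BYTES = 512        # assumed size per document, overhead ignored

capacity = max_documents(FOUR_GIB, DOC_BYTES)
needed = files_needed(400_000_000, FOUR_GIB, DOC_BYTES)

print(f"documents per 4 GiB file: {capacity:,}")
print(f"4 GiB files needed for 400M documents: {needed}")
```

Under these assumptions a single 4 GiB file holds about 8.4 million documents, far short of 400 million in one collection.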
> >
> > Really though 400 million is a pretty big number. The most I've ever
> > tested with was a little over 1 million. The server could handle more,
> > but it was really pushing the limits of the current system. So until
> > Xindice matures quite a bit more, I'd have to recommend against it as a
> > solution.
> >
> > It will be tough to find an open source solution that can easily handle
> > that many documents with acceptable performance. As far as I know eXist
> > won't be any better in this area. Honestly, I'd really have to question
> > whether Oracle can even handle that much XML. Obviously, for relational
> > data it's up to the task, but XML is quite a bit different and there are
> > still some pretty inefficient aspects to what they're doing. Of course I
> > do think Oracle is better than anything else currently available.
> >
> > Besides the number of documents, have you estimated document size and
> > storage space required? Even if you're looking at only 1k per document, I
> > believe once you throw in indexes and overhead, you're pushing 1TB in data
> > size. That's a pretty big chunk of data; it's not going to be easy to
> > manage no matter which route you take.
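
The 1TB estimate can be unpacked into raw data plus an overhead multiplier. A minimal sketch, assuming 1k means 1024 bytes and 1TB means 10**12 bytes; the multiplier is simply what those two numbers imply, not a measured Xindice figure.

```python
DOCUMENTS = 400_000_000
BYTES_PER_DOCUMENT = 1024   # "1k per document", assumed to mean 1024 bytes
TARGET_BYTES = 10**12       # "1TB", assumed to mean 10**12 bytes

# Raw document data without indexes or storage overhead.
raw_bytes = DOCUMENTS * BYTES_PER_DOCUMENT

# Multiplier that indexes and overhead would need to add to reach 1TB.
implied_overhead_factor = TARGET_BYTES / raw_bytes

print(f"raw data: {raw_bytes / 1e9:.1f} GB")
print(f"implied overhead factor: {implied_overhead_factor:.2f}x")
```

So the raw documents come to roughly 410 GB, and reaching 1TB corresponds to indexes and overhead adding somewhat more than the raw data again.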
> >
> >>
> >>
> >>                   Mummi, CSHL
> >>
> >> On Friday, December 20, 2002, at 03:43 PM, Murray Altheim wrote:
> >>
> >>> Gudmundur Arni Thorisson wrote:
> >>>
> >>> [...]
> >>>
> >>>> It says on the Xindice website that the db is designed for many,
> >>>> small documents. The XML dataset that we will be handling will
> >>>> contain fairly small documents but VERY many of them; up to 400
> >>>> million instances of the most populous record class.
> >>>> My question is therefore this: has anyone used/tested Xindice with
> >>>> datasets of this size (hundreds of millions) with decent performance
> >>>> as well? This will be mainly import + query work, hardly any heavy
> >>>> updating load, if that would make a difference as far as performance
> >>>> goes.
> >>>
> >>>
> >>> One question that may help answer this: would 400 million records
> >>> be in *one* Xindice Collection, or could these be organized according
> >>> to some hierarchy, such that there would be a smaller limit at the
> >>> Collection level?
> >>>
> >>>
> >>> Murray
> >>>
> >>> ......................................................................
> >>> Murray Altheim                  <http://kmi.open.ac.uk/people/murray/>
> >>> Knowledge Media Institute
> >>> The Open University, Milton Keynes, Bucks, MK7 6AA, UK
> >>>
> >>>            If you're the first person in a new territory,
> >>>            you're likely to get shot at.
> >>>                                                     -- ma
> >>>
> >>>
> >>
> >>
> > Kimbro Staken
> > Java and XML Software, Consulting and Writing http://www.xmldatabases.org/
> > Apache Xindice native XML database http://xml.apache.org/xindice
> > XML:DB Initiative http://www.xmldb.org
> >
> >
>
>
>
