On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe <scott.marl...@gmail.com> wrote:
>
> On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins <st...@blighty.com> wrote:
> > A database isn't really the right way to do full text search for single
> > files that big. Even if they'd fit in the database, they're way bigger than
> > the underlying index types tsquery uses are designed for.
> >
> > Are you sure that the documents are that big? A single document of that 
> > size would be 400 times the size of the bible. That's a ridiculously large 
> > amount of text, most of a small library.
> >
> > If the answer is "yes, it's really that big and it's really text" then look 
> > at clucene or, better, hiring a specialist.
>
> I'm betting it's something like gene sequences or geological samples,
> or something other than straight text.  But even those bear breaking
> down into some kind of simple normalization scheme, don't they?
>

A single genome is ~ 1.3GB as chars, or roughly half that if you pack
it at 4 bits per nucleotide (which should cover at least 90% of the
use cases).  The simplest design is to store a single reference
sequence and then, for everything else, store deltas from it.  On
average that should require about 3-5% of your reference sequence per
comparative sample (not counting FKs and indexes).

As I mentioned on the list a couple of months ago, we are in the
middle of stuffing a bunch of molecular data (including entire
genomes) into Postgres.  If anyone else is doing this, I would
welcome the opportunity to discuss the issues off list...

--
Peter Hunsberger
