Re: [GENERAL] more than 2GB data string save
On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe scott.marl...@gmail.com wrote:
> On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins st...@blighty.com wrote:
>> A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for. Are you sure that the documents are that big? A single document of that size would be 400 times the size of the bible. That's a ridiculously large amount of text, most of a small library. If the answer is yes, it's really that big and it's really text, then look at clucene or, better, hire a specialist.
>
> I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

A single genome is ~1.3 GB as chars, half that size if you use 4 bits per nucleotide (which should work for at least 90% of the use cases). The simplest design is to store a single reference and then, for everything else, store deltas from it. On average that should require about 3-5% of your reference sequence per comparative sample (not counting FKs and indexes).

As I mentioned on the list a couple of months ago, we are in the middle of stuffing a bunch of molecular data (including entire genomes) into Postgres. If anyone else is doing this I would welcome the opportunity to discuss the issues off list...

--
Peter Hunsberger
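[Editor's note: a minimal sketch of the reference-plus-deltas layout Peter describes; the table and column names are invented for illustration, and the 4-bit packing keeps a human genome (~650 MB) under the 1 GB varlena limit discussed later in the thread.]

    -- Hypothetical schema: one packed reference, per-sample deltas against it.
    CREATE TABLE reference_sequence (
        ref_id serial PRIMARY KEY,
        name   text NOT NULL,
        packed bytea NOT NULL        -- 4 bits per nucleotide, ~650 MB for a human genome
    );

    CREATE TABLE sample (
        sample_id serial PRIMARY KEY,
        ref_id    int NOT NULL REFERENCES reference_sequence
    );

    -- Each comparative sample stores only its differences from the reference,
    -- roughly 3-5% of the reference size on average per Peter's estimate.
    CREATE TABLE sample_delta (
        sample_id int     NOT NULL REFERENCES sample,
        position  bigint  NOT NULL,   -- offset into the reference sequence
        new_base  char(1) NOT NULL,   -- substituted base at that position
        PRIMARY KEY (sample_id, position)
    );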
Re: [GENERAL] more than 2GB data string save
> As I mentioned on the list a couple of months ago, we are in the middle of stuffing a bunch of molecular data (including entire genomes) into Postgres. If anyone else is doing this I would welcome the opportunity to discuss the issues off list...

I do not stuff molecules or genomes or genome data into PostgreSQL, but I sure would love to read a case study about it, or listen to a talk about this stuff at pgday.eu.

Best wishes,

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality
[GENERAL] more than 2GB data string save
How can I save a text string of 2 GB or more in PostgreSQL? Which data type should I use? Any suggestion, please.
Re: [GENERAL] more than 2GB data string save
On Tue, Feb 9, 2010 at 9:38 PM, AI Rumman rumman...@gmail.com wrote:
> How can I save a text string of 2 GB or more in PostgreSQL? Which data type should I use?

If you have to, you can use either the lo interface or bytea. Large Object (i.e. lo) allows for access much like fopen / fseek etc. in C, but the actual data are not stored in a row with other data; they live alone in the lo space. Bytea is a legit type that you can have as one of many in a row, but you retrieve the whole thing at once when you get the row.

The preferred way to store 2 GB of data is to put it into a file and put the name of the file into the database.
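[Editor's note: a minimal sketch of the two in-database options Scott names, using a hypothetical documents table; lo_import here is the server-side variant, which reads a path visible to the server process and has historically been restricted to superusers.]

    -- bytea: the value lives in the row (TOASTed) and is fetched whole.
    CREATE TABLE documents (
        doc_id  serial PRIMARY KEY,
        name    text NOT NULL,
        payload bytea                -- the hard 1 GB varlena limit applies
    );

    -- Large object: the row holds only an OID; the bytes live in
    -- pg_largeobject and can be read and written piecewise.
    CREATE TABLE documents_lo (
        doc_id serial PRIMARY KEY,
        name   text NOT NULL,
        blob   oid
    );

    INSERT INTO documents_lo (name, blob)
    VALUES ('bigfile', lo_import('/tmp/bigfile.dat'));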
Re: [GENERAL] more than 2GB data string save
On Feb 9, 2010, at 9:52 PM, Scott Marlowe wrote:
> If you have to, you can use either the lo interface or bytea. Large Object (i.e. lo) allows for access much like fopen / fseek etc. in C, but the actual data are not stored in a row with other data; they live alone in the lo space. Bytea is a legit type that you can have as one of many in a row, but you retrieve the whole thing at once when you get the row.

Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.

> The preferred way to store 2 GB of data is to put it into a file and put the name of the file into the database.

This.

Cheers,
Steve
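[Editor's note: a minimal sketch of the file-on-disk approach both posters prefer; names are invented for illustration. The usual caveat is that the filesystem and the database can drift apart, so a checksum column for periodic reconciliation is worth having.]

    -- Only metadata lives in the database; the bytes stay on disk.
    CREATE TABLE external_file (
        file_id    serial PRIMARY KEY,
        path       text NOT NULL,      -- location on the file server
        size_bytes bigint,
        md5sum     text,               -- detect drift between DB row and file
        added_at   timestamptz NOT NULL DEFAULT now()
    );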
Re: [GENERAL] more than 2GB data string save
On Tue, Feb 9, 2010 at 11:26 PM, Steve Atkins st...@blighty.com wrote:
> Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.

That really depends on how compressible it is, doesn't it?
Re: [GENERAL] more than 2GB data string save
Thanks for your quick answers. But if I use a file and then store the name in the database, is it possible to use text search (tsvector and tsquery) indexing on these external files?

On Wed, Feb 10, 2010 at 12:26 PM, Steve Atkins st...@blighty.com wrote:
> Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.
>
>> The preferred way to store 2 GB of data is to put it into a file and put the name of the file into the database.
>
> This.
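[Editor's note: for contrast, this is roughly what tsvector/tsquery indexing looks like when a document does fit in a text column (a sketch with invented names); it operates only on data stored in the database, never on external files.]

    CREATE TABLE doc (
        doc_id serial PRIMARY KEY,
        body   text NOT NULL         -- must stay under the 1 GB varlena limit
    );

    -- Expression index so queries don't recompute the tsvector per row.
    CREATE INDEX doc_fts_idx ON doc USING gin (to_tsvector('english', body));

    SELECT doc_id
    FROM doc
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'genome & assembly');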
Re: [GENERAL] more than 2GB data string save
2010/2/10 Scott Marlowe scott.marl...@gmail.com:
> On Tue, Feb 9, 2010 at 11:26 PM, Steve Atkins st...@blighty.com wrote:
>> Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.
>
> That really depends on how compressible it is, doesn't it?

No, it is the maximum length for a varlena value. TOAST is the next possible step.

Regards,
Pavel Stehule

p.s. Processing very large SQL values - bytea or text longer than tens of megabytes - is very expensive in memory. When you process a 100 MB bytea, you need about 300 MB of RAM. Using a bytea over 100 MB is not a good idea. The LO interface is better and much faster.
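[Editor's note: a sketch of the piecewise access Pavel recommends, using the server-side large-object functions; the OID (16397) and the returned descriptor (assumed 0 here) are placeholders, and descriptors are only valid inside the enclosing transaction.]

    BEGIN;

    -- Open for reading: 262144 = 0x40000 = INV_READ.
    SELECT lo_open(16397, 262144);    -- returns a descriptor; assume 0 below

    -- Read 8 kB at a time; memory use stays at chunk size instead of the
    -- ~3x blow-up of materializing a whole multi-hundred-MB bytea.
    SELECT loread(0, 8192);           -- also advances the read position
    SELECT lo_lseek(0, 1048576, 0);   -- random access: whence 0 = SEEK_SET
    SELECT loread(0, 8192);

    SELECT lo_close(0);
    COMMIT;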
Re: [GENERAL] more than 2GB data string save
On Tue, Feb 9, 2010 at 11:51 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
> p.s. Processing very large SQL values - bytea or text longer than tens of megabytes - is very expensive in memory. When you process a 100 MB bytea, you need about 300 MB of RAM. Using a bytea over 100 MB is not a good idea. The LO interface is better and much faster.

Agreed. I wonder what kind of strings the OP is storing, and if there isn't a more efficient way to do what he wants to do. But we're learning by pieces: first the size requirements, then the need for FTS. So, OP, what are you trying to do? (i.e. the big answer, not the small one.) It could be that there are basic inefficiencies in some part of your method that are more easily recognised once we know what you're trying to do.
Re: [GENERAL] more than 2GB data string save
On Feb 9, 2010, at 10:38 PM, AI Rumman wrote:
> Thanks for your quick answers. But if I use a file and then store the name in the database, is it possible to use text search (tsvector and tsquery) indexing on these external files?

No, no way at all.

A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for.

Are you sure that the documents are that big? A single document of that size would be 400 times the size of the bible. That's a ridiculously large amount of text, most of a small library.

If the answer is yes, it's really that big and it's really text, then look at clucene or, better, hire a specialist.

Cheers,
Steve
Re: [GENERAL] more than 2GB data string save
On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins st...@blighty.com wrote:
> A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for. Are you sure that the documents are that big? A single document of that size would be 400 times the size of the bible. That's a ridiculously large amount of text, most of a small library. If the answer is yes, it's really that big and it's really text, then look at clucene or, better, hire a specialist.

I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

But if that's what they are, then I'd think you'd need to be willing to step up and design a new type of pg object that would hold these long strings and be able to run hand-written C that does cool things to your data without killing your machine. 2 gigabytes is a lot, but it's not as big on a machine with 128 GB of RAM as it is on a machine with 4 GB. If both 2 GB+ objects can fit in memory and be compared or operated on each other in odd ways, that could prove useful. But PostgreSQL doesn't really have anything built in to do that.

I'd think it would be cheaper to write a simple program that reads two text files and does the same thing. With kernel file caching it should load quickly after the first access, and on RAID arrays that read at 400 to 500 MB/sec it's only 4 seconds of load time on the first access. If there's some part of doing this that needs to be transactionally sane, then write a simple control program that uses the database to keep track of completed jobs, and do it all outside the database in some other language if it's better suited to this.
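[Editor's note: a minimal sketch of the control-program idea that closes this post, with invented names: the database only records job state, while the 2 GB+ comparison itself runs in an external program.]

    -- The database tracks jobs; the heavy lifting happens outside it.
    CREATE TABLE comparison_job (
        job_id      serial PRIMARY KEY,
        input_a     text NOT NULL,     -- paths to the big files on disk
        input_b     text NOT NULL,
        status      text NOT NULL DEFAULT 'pending',  -- pending/running/done/failed
        started_at  timestamptz,
        finished_at timestamptz,
        result_path text               -- where the external program wrote its output
    );

    -- A worker claims the oldest pending job; the status recheck makes this
    -- safe under READ COMMITTED (zero rows updated means another worker won).
    UPDATE comparison_job
    SET status = 'running', started_at = now()
    WHERE job_id = (SELECT job_id FROM comparison_job
                    WHERE status = 'pending'
                    ORDER BY job_id LIMIT 1)
      AND status = 'pending';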
Re: [GENERAL] more than 2GB data string save
On Feb 9, 2010, at 11:21 PM, Scott Marlowe wrote:
> I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

An entire human is a shade over 3 billion base pairs, with an information content of well under a gigabyte. The earth is about 4 billion years old, so if you were sampling every couple of years and you have the perfect core sample... maybe.

I'm not sure that any form of full text search that includes stemming would be terribly useful for either.

Cheers,
Steve
Re: [GENERAL] more than 2GB data string save
On Wed, Feb 10, 2010 at 10:11 AM, Steve Atkins st...@blighty.com wrote:
> A database isn't really the right way to do full text search for single files that big.

Some simple (and possibly wrong) solutions are presented below.

a) Store only the file name in the DB and leave the file on the file system, where you can access it after reading the file name from the DB to determine its location. This solution may offer the possibility of using many of the very fast and efficient file content search and manipulation tools that are part of most OSes (for instance grep, sed, ...).

b) If you really need to store the file contents in the DB, you may write a small method which reads the file contents in batches of n bytes and writes each batch to a new table row. One thing to remember is to capture and store the batch number (which should be sequential) in another field of the respective batch, and finally to indicate the last batch, maybe by setting another field in the row to true. Doing so gives you the opportunity to parallelize the batch persistence to the DB while still having a way to correctly put the sections of the file contents (the batches) back together when reconstructing the file. It also keeps the size of each transaction as small as a single batch.

Allan.
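[Editor's note: a minimal sketch of the chunked layout described in option (b); the table and column names are invented for illustration.]

    -- One row per fixed-size chunk; batch_no orders the chunks and
    -- is_last marks the tail, so reconstruction is an ordered scan.
    CREATE TABLE file_chunk (
        file_id  int     NOT NULL,
        batch_no int     NOT NULL,   -- sequential chunk index, starting at 0
        is_last  boolean NOT NULL DEFAULT false,
        data     bytea   NOT NULL,   -- n bytes per chunk, e.g. 1 MB
        PRIMARY KEY (file_id, batch_no)
    );

    -- Reassemble the file in order; each chunk was its own small transaction
    -- on the way in, and chunks can be written by parallel workers.
    SELECT data
    FROM file_chunk
    WHERE file_id = 42
    ORDER BY batch_no;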