Re: [GENERAL] more than 2GB data string save
On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe scott.marl...@gmail.com wrote:
> On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins st...@blighty.com wrote:
>> A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for. Are you sure that the documents are that big? A single document of that size would be 400 times the size of the bible. That's a ridiculously large amount of text, most of a small library. If the answer is yes, it's really that big and it's really text, then look at clucene or, better, hire a specialist.
>
> I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

A single genome is ~1.3 GB as chars, half that size if you use 4 bits per nucleotide (which should work for at least 90% of the use cases). The simplest design is to store a single reference and then, for everything else, store deltas from it. On average that should require about 3-5% of your reference sequence per comparative sample (not counting FKs and indexes).

As I mentioned on the list a couple of months ago, we are in the middle of stuffing a bunch of molecular data (including entire genomes) into Postgres. If anyone else is doing this I would welcome the opportunity to discuss the issues off list...

--
Peter Hunsberger
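[Editor's note: a minimal sketch of the reference-plus-deltas layout Peter describes; the table and column names are invented for illustration, and the 4-bit packing keeps a human genome (~650 MB) under the 1 GB varlena limit discussed later in the thread.]

    -- Hypothetical schema: one packed reference, per-sample deltas against it.
    CREATE TABLE reference_sequence (
        ref_id serial PRIMARY KEY,
        name   text NOT NULL,
        packed bytea NOT NULL        -- 4 bits per nucleotide, ~650 MB for a human genome
    );

    CREATE TABLE sample (
        sample_id serial PRIMARY KEY,
        ref_id    int NOT NULL REFERENCES reference_sequence
    );

    -- Each comparative sample stores only its differences from the reference,
    -- roughly 3-5% of the reference size on average per Peter's estimate.
    CREATE TABLE sample_delta (
        sample_id int     NOT NULL REFERENCES sample,
        position  bigint  NOT NULL,   -- offset into the reference sequence
        new_base  char(1) NOT NULL,   -- substituted base at that position
        PRIMARY KEY (sample_id, position)
    );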
Re: [GENERAL] more than 2GB data string save
> As I mentioned on the list a couple of months ago, we are in the middle of stuffing a bunch of molecular data (including entire genomes) into Postgres. If anyone else is doing this I would welcome the opportunity to discuss the issues off list...

I do not stuff molecules or genomes or genome data into PostgreSQL, but I sure would love to read a case study about it, or listen to a talk about this stuff at pgday.eu.

Best wishes,

Harald

--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
no fx, no carrier pigeon
-
%s is too gigantic of an industry to bend to the whims of reality
[GENERAL] more than 2GB data string save
How can I save a text string of 2 GB or more in PostgreSQL? Which data type should I use? Any suggestion, please.
Re: [GENERAL] more than 2GB data string save
On Tue, Feb 9, 2010 at 9:38 PM, AI Rumman rumman...@gmail.com wrote:
> How can I save a text string of 2 GB or more in PostgreSQL? Which data type should I use?

If you have to, you can use either the lo interface or bytea. Large Object (i.e. lo) allows for access much like fopen / fseek etc. in C, but the actual data are not stored in a row with other data; they live alone in the lo space. Bytea is a legit type that you can have as one of many in a row, but you retrieve the whole thing at once when you get the row.

The preferred way to store 2 GB of data is to put it into a file and put the name of the file into the database.
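[Editor's note: a minimal sketch of the two in-database options Scott names, using a hypothetical documents table; lo_import here is the server-side variant, which reads a path visible to the server process and has historically been restricted to superusers.]

    -- bytea: the value lives in the row (TOASTed) and is fetched whole.
    CREATE TABLE documents (
        doc_id  serial PRIMARY KEY,
        name    text NOT NULL,
        payload bytea                -- the hard 1 GB varlena limit applies
    );

    -- Large object: the row holds only an OID; the bytes live in
    -- pg_largeobject and can be read and written piecewise.
    CREATE TABLE documents_lo (
        doc_id serial PRIMARY KEY,
        name   text NOT NULL,
        blob   oid
    );

    INSERT INTO documents_lo (name, blob)
    VALUES ('bigfile', lo_import('/tmp/bigfile.dat'));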
Re: [GENERAL] more than 2GB data string save
On Feb 9, 2010, at 9:52 PM, Scott Marlowe wrote:
> If you have to, you can use either the lo interface or bytea. Large Object (i.e. lo) allows for access much like fopen / fseek etc. in C, but the actual data are not stored in a row with other data; they live alone in the lo space. Bytea is a legit type that you can have as one of many in a row, but you retrieve the whole thing at once when you get the row.

Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.

> The preferred way to store 2 GB of data is to put it into a file and put the name of the file into the database.

This.

Cheers,
Steve
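[Editor's note: a minimal sketch of the file-on-disk approach both posters prefer; names are invented for illustration. The usual caveat is that the filesystem and the database can drift apart, so a checksum column for periodic reconciliation is worth having.]

    -- Only metadata lives in the database; the bytes stay on disk.
    CREATE TABLE external_file (
        file_id    serial PRIMARY KEY,
        path       text NOT NULL,      -- location on the file server
        size_bytes bigint,
        md5sum     text,               -- detect drift between DB row and file
        added_at   timestamptz NOT NULL DEFAULT now()
    );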
Re: [GENERAL] more than 2GB data string save
On Tue, Feb 9, 2010 at 11:26 PM, Steve Atkins st...@blighty.com wrote:
> Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.

That really depends on how compressible it is, doesn't it?
Re: [GENERAL] more than 2GB data string save
Thanks for your quick answers. But if I use a file and then store the name in the database, is it possible to use text search (tsvector and tsquery) indexing on these external files?

On Wed, Feb 10, 2010 at 12:26 PM, Steve Atkins st...@blighty.com wrote:
> Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.
>
>> The preferred way to store 2 GB of data is to put it into a file and put the name of the file into the database.
>
> This.
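[Editor's note: for contrast, this is roughly what tsvector/tsquery indexing looks like when a document does fit in a text column (a sketch with invented names); it operates only on data stored in the database, never on external files.]

    CREATE TABLE doc (
        doc_id serial PRIMARY KEY,
        body   text NOT NULL         -- must stay under the 1 GB varlena limit
    );

    -- Expression index so queries don't recompute the tsvector per row.
    CREATE INDEX doc_fts_idx ON doc USING gin (to_tsvector('english', body));

    SELECT doc_id
    FROM doc
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'genome & assembly');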
Re: [GENERAL] more than 2GB data string save
2010/2/10 Scott Marlowe scott.marl...@gmail.com:
> On Tue, Feb 9, 2010 at 11:26 PM, Steve Atkins st...@blighty.com wrote:
>> Bytea definitely won't handle more than 1 GB. I don't think the lo interface will handle more than 2 GB.
>
> That really depends on how compressible it is, doesn't it?

No, it is the maximum length for a varlena value. TOAST is the next possible step.

Regards,
Pavel Stehule

p.s. Processing very large SQL values - bytea or text longer than tens of megabytes - is very expensive in memory. When you process a 100 MB bytea, you need about 300 MB of RAM. Using a bytea over 100 MB is not a good idea. The LO interface is better and much faster.
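[Editor's note: a sketch of the piecewise access Pavel recommends, using the server-side large-object functions; the OID (16397) and the returned descriptor (assumed 0 here) are placeholders, and descriptors are only valid inside the enclosing transaction.]

    BEGIN;

    -- Open for reading: 262144 = 0x40000 = INV_READ.
    SELECT lo_open(16397, 262144);    -- returns a descriptor; assume 0 below

    -- Read 8 kB at a time; memory use stays at chunk size instead of the
    -- ~3x blow-up of materializing a whole multi-hundred-MB bytea.
    SELECT loread(0, 8192);           -- also advances the read position
    SELECT lo_lseek(0, 1048576, 0);   -- random access: whence 0 = SEEK_SET
    SELECT loread(0, 8192);

    SELECT lo_close(0);
    COMMIT;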
Re: [GENERAL] more than 2GB data string save
On Tue, Feb 9, 2010 at 11:51 PM, Pavel Stehule pavel.steh...@gmail.com wrote:
> p.s. Processing very large SQL values - bytea or text longer than tens of megabytes - is very expensive in memory. When you process a 100 MB bytea, you need about 300 MB of RAM. Using a bytea over 100 MB is not a good idea. The LO interface is better and much faster.

Agreed. I wonder what kind of strings the OP is storing, and if there isn't a more efficient way to do what he wants to do. But we're learning by pieces: first the size requirements, then the need for FTS. So, OP, what are you trying to do? (i.e. the big answer, not the small one.) It could be that there are basic inefficiencies in some part of your method that are more easily recognised once we know what you're trying to do.
Re: [GENERAL] more than 2GB data string save
On Feb 9, 2010, at 10:38 PM, AI Rumman wrote:
> Thanks for your quick answers. But if I use a file and then store the name in the database, is it possible to use text search (tsvector and tsquery) indexing on these external files?

No, no way at all.

A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for.

Are you sure that the documents are that big? A single document of that size would be 400 times the size of the bible. That's a ridiculously large amount of text, most of a small library.

If the answer is yes, it's really that big and it's really text, then look at clucene or, better, hire a specialist.

Cheers,
Steve
Re: [GENERAL] more than 2GB data string save
On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins st...@blighty.com wrote:
> A database isn't really the right way to do full text search for single files that big. Even if they'd fit in the database, they're way bigger than the underlying index types tsquery uses are designed for. Are you sure that the documents are that big? A single document of that size would be 400 times the size of the bible. That's a ridiculously large amount of text, most of a small library. If the answer is yes, it's really that big and it's really text, then look at clucene or, better, hire a specialist.

I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

But if that's what they are, then I'd think you'd need to be willing to step up and design a new type of pg object that would hold these long strings and be able to run hand-written C that does cool things to your data without killing your machine. 2 gigabytes is a lot, but it's not as big on a machine with 128 GB of RAM as it is on a machine with 4 GB. If both 2 GB+ objects can fit in memory and be compared or operated on each other in odd ways, that could prove useful. But PostgreSQL doesn't really have anything built in to do that.

I'd think it would be cheaper to write a simple program that reads two text files and does the same thing. With kernel file caching it should load quickly after the first access, and on RAID arrays that read at 400 to 500 MB/sec it's only 4 seconds of load time on the first access. If there's some part of doing this that needs to be transactionally sane, then write a simple control program that uses the database to keep track of completed jobs, and do it all outside the database in some other language if it's better suited to this.
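[Editor's note: a minimal sketch of the control-program idea that closes this post, with invented names: the database only records job state, while the 2 GB+ comparison itself runs in an external program.]

    -- The database tracks jobs; the heavy lifting happens outside it.
    CREATE TABLE comparison_job (
        job_id      serial PRIMARY KEY,
        input_a     text NOT NULL,     -- paths to the big files on disk
        input_b     text NOT NULL,
        status      text NOT NULL DEFAULT 'pending',  -- pending/running/done/failed
        started_at  timestamptz,
        finished_at timestamptz,
        result_path text               -- where the external program wrote its output
    );

    -- A worker claims the oldest pending job; the status recheck makes this
    -- safe under READ COMMITTED (zero rows updated means another worker won).
    UPDATE comparison_job
    SET status = 'running', started_at = now()
    WHERE job_id = (SELECT job_id FROM comparison_job
                    WHERE status = 'pending'
                    ORDER BY job_id LIMIT 1)
      AND status = 'pending';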
Re: [GENERAL] more than 2GB data string save
On Feb 9, 2010, at 11:21 PM, Scott Marlowe wrote:
> I'm betting it's something like gene sequences or geological samples, or something other than straight text. But even those bear breaking down into some kind of simple normalization scheme, don't they?

An entire human is a shade over 3 billion base pairs, with an information content of well under a gigabyte. The earth is about 4 billion years old, so if you were sampling every couple of years and you have the perfect core sample... maybe.

I'm not sure that any form of full text search that includes stemming would be terribly useful for either.

Cheers,
Steve
Re: [GENERAL] more than 2GB data string save
On Wed, Feb 10, 2010 at 10:11 AM, Steve Atkins st...@blighty.com wrote:
> A database isn't really the right way to do full text search for single files that big.

Some simple (and possibly wrong) solutions are presented below.

a) Store only the file name in the DB and leave the file on the file system, where you can access it after reading the file name from the DB to determine its location. This solution may offer the possibility of using many of the very fast and efficient file content search and manipulation tools that are part of most OSes (for instance grep, sed, ...).

b) If you really need to store the file contents in the DB, you may write a small method which reads the file contents in batches of n bytes and writes each batch to a new table row. One thing to remember is to capture and store the batch number (which should be sequential) in another field of the respective batch, and finally to indicate the last batch, maybe by setting another field in the row to true. Doing so gives you the opportunity to parallelize the batch persistence to the DB while still having a way to correctly put the sections of the file contents (the batches) back together when reconstructing the file. It also keeps the size of each transaction as small as a single batch.

Allan.
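[Editor's note: a minimal sketch of the chunked layout described in option (b); the table and column names are invented for illustration.]

    -- One row per fixed-size chunk; batch_no orders the chunks and
    -- is_last marks the tail, so reconstruction is an ordered scan.
    CREATE TABLE file_chunk (
        file_id  int     NOT NULL,
        batch_no int     NOT NULL,   -- sequential chunk index, starting at 0
        is_last  boolean NOT NULL DEFAULT false,
        data     bytea   NOT NULL,   -- n bytes per chunk, e.g. 1 MB
        PRIMARY KEY (file_id, batch_no)
    );

    -- Reassemble the file in order; each chunk was its own small transaction
    -- on the way in, and chunks can be written by parallel workers.
    SELECT data
    FROM file_chunk
    WHERE file_id = 42
    ORDER BY batch_no;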