Re: [GENERAL] to BLOB or not to BLOB
> I am going to write an application that writes large amounts of plain text into a database.

What kind of data is it? Articles? News/mail messages? Something else?

> I thought of using the text type for this, but I don't know if it has a maximum length, and given that these will be very long texts I started wondering whether they would have to be BLOBs... but they aren't binary.

Well, plain text is a `subset' of `binary,' so storing it as a BLOB shouldn't be an issue. My home-brewed email archive stores message bodies as BLOBs with absolutely no problem. As for the length limit of the text type: in older versions of PGSQL a value had to be less than the page size (8 KB by default, configurable at compile time). In TOASTed PGSQL there is probably no limit set by the DBMS itself.

> So, does the text data type have a maximum length? Do BLOBs? What should I use? For the time being, at least, I won't be searching these texts with a search engine or anything... but if I were to, what considerations should I take into account when designing the tables?

As you already figured, the answer to your question depends on what you want from your system. Maybe you'll be OK with BLOBs, but you have to keep in mind that pg_dump can't handle them, and you need to perform somewhat tricky things to back up and restore a PGSQL DB with BLOBs.

Another thing to consider is searchability and indexes. If you really don't want to (and are never going to) search your big fields, your best bet is to store just file names and keep the long chunks of text in external files. Sure, you can store them in TOASTed fields if you need indexes, but IMHO it is yet to be proven that the simplicity of design and implementation is worth the _possible_ performance degradation. I believe we have yet to see a good, fast index built on a 100 KB text field. Maybe what you need is FTS, because IMNSHO there's not much use in an attribute that can't be indexed and searched (or updated, joined, grouped) on swiftly.
Everything relatively small and fast to process goes into the DB; huge portions of not-easily-indexable data should stay outside. Actually, FTS is something of a last resort. You may wish to split your text into smaller fragments with a similar semantic (logical, whatever) load.
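The page-size limit mentioned above was commonly worked around by splitting a long text into numbered, row-sized fragments and reassembling them on read. A minimal sketch of that idea; the chunk size and function names are my own illustrative choices, not anything PostgreSQL provides:

```python
# Hypothetical sketch: split a long text into fragments small enough to fit
# one per row on pre-TOAST PostgreSQL (page size 8 KB by default), and
# reassemble them in sequence order on the way back out.

CHUNK_SIZE = 7000  # leave headroom below the 8 KB page size (assumption)

def split_into_chunks(text, chunk_size=CHUNK_SIZE):
    """Return (seq, fragment) pairs, ready to insert as numbered rows."""
    return [(i, text[start:start + chunk_size])
            for i, start in enumerate(range(0, len(text), chunk_size))]

def reassemble(chunks):
    """Concatenate fragments back together in sequence order."""
    return "".join(frag for _, frag in sorted(chunks))

article = "x" * 20000          # stand-in for a long plain-text document
rows = split_into_chunks(article)
```

Note this chunks by characters; a real implementation would chunk by encoded bytes, since the page limit is a byte limit.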
Re: [GENERAL] To BLOB Or Not To BLOB
How do I start this? There are so many different ways to look at the question. Do I save files inside the database, or do I simply use the database as a tool to query information about a given file, its properties, etc.?

Instead of making general statements, let me pick a specific example. Let's say we want to create an MP3 repository. Our goals are to be able to query the information saved in ID3 tags and to locate, organize, manipulate, or play the MP3s. To narrow the scope of discussion, let's say I will be accessing these files locally, on my filesystem or an NFS-mounted directory.

What would be the most efficient way to query the information found in the ID3 tags? We could parse each file in a number of directories and their subdirectories for the information we require each time we need it, or we could parse it once and save the information in a database structured around ID3 information. I would say the latter is definitely more desirable. Inside the database, the ID3 tags can be indexed and our searches optimized for the information we're requesting.

But should we store the MP3s inside the database? I say no. A few observations:

* Filesystems are generally better at allocating space for files than databases are.
* Storing files inside the database makes shell access, or any direct access to the files, impossible. If the database goes south, so does access to the files.

Now, our MP3 decoders are shell programs that read MP3 input by opening a file, a URL connection, or standard input (that is, if we have a good MP3 player). Yet in order for these decoders to play an MP3 stored inside the database, we are forced to use a specialized database interface. It may return the MP3 to the decoder via standard input, or it may export the file to a temp directory and feed the player the location of that temp file. Why should we go through all of this trouble?
And why should we tie the operation of two clearly separate applications into such an inflexible relationship? We add overhead to the database to output the file, and we add another level of logic to get the file to the decoder -- a level of logic that could prove the downfall of a would-be mobile MP3 disc jockey, who has worked so hard to please the Wicked Witch of the Wedding. He'd be forced to pull out the CDs and tapes he so dreads -- if he remembered to bring them along.

If we keep file access decoupled from gleaning categorizable, quantifiable data about/from the file, we allow two separate and efficient tools to do their jobs independently. If file access is your concern, then use a file system that gives you the level of control you need to implement your security policies. If grabbing information from a data file is important, such as parsing text files for content, then build indexes to the data inside the file and store those indexes in the database. Internet directories are excellent examples of this! We don't neuter local file access by the applications that need the data files, and we still provide access to searchable information.

I would argue that access to a data file and querying sets of data files for information are two separate layers of logic. Logical layers are best served when they are given tools to interact, but are not crippled by tight integration. Saving data files inside a database is, IMHO, an example of tight integration, and something to be avoided if possible -- at least in this case.

-- Chad "^chewie, gunnarr" Walstrom [EMAIL PROTECTED] http://wookimus.net/chewie
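The "parse once, store the metadata" approach above can be sketched concretely. This reads an ID3v1 tag (the last 128 bytes of an MP3 file) into a dict that could then be inserted into a metadata table; the field offsets follow the standard ID3v1 layout, while the function and variable names are my own:

```python
# Illustrative sketch: extract ID3v1 metadata from an MP3's trailing 128
# bytes, so only the metadata (not the file) goes into the database.
import struct

def parse_id3v1(data):
    """Return a dict of ID3v1 fields, or None if no tag is present."""
    tag = data[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return None
    # ID3v1 layout: "TAG", then 30s title, 30s artist, 30s album,
    # 4s year, 30s comment, 1-byte genre index.
    title, artist, album, year, comment, genre = struct.unpack(
        "3x30s30s30s4s30sB", tag)
    clean = lambda b: b.split(b"\x00", 1)[0].decode("latin-1").strip()
    return {
        "title": clean(title),
        "artist": clean(artist),
        "album": clean(album),
        "year": clean(year),
        "comment": clean(comment),
        "genre": genre,
    }

# Build a fake tagged file for demonstration (audio bytes, then the tag).
fake = (b"\xff" * 100 + b"TAG"
        + b"My Song".ljust(30, b"\x00")
        + b"Some Artist".ljust(30, b"\x00")
        + b"An Album".ljust(30, b"\x00")
        + b"2000"
        + b"".ljust(30, b"\x00")
        + bytes([17]))
info = parse_id3v1(fake)
```

The returned dict maps directly onto a table structured around ID3 information, which is exactly the indexable part of the file.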
Re: [GENERAL] To BLOB Or Not To BLOB
G'day. I do a lot of work with the BASIS textual/multimedia RDBMS package and run into this question all the time. There is one pretty basic answer: if you leave BLOBs lying around in the file system -- particularly if it is a Novell etc. file system -- people move them and the links get broken. Just today I had an example where several thousand links broke because some dumb Novell administrator moved a directory holding files which were linked to from a database rather than stored in it. (Not my design decision.) A database that can hold the BLOBs internally avoids this sort of snafu. (BASIS also has the advantage that it can automagically index almost any form of BLOB to the word/phrase/sentence level if it is stored in the database.)

Cheers, Stephen Davies

"John Henderson" [EMAIL PROTECTED] wrote:

> Hi, It seems that the issue with large objects is "Why do you want the info in a database?" It seems to me that the point of a database is its ability to order and relate data. If you want to retrieve the "large-ish text files" based on their content then I think you need to have the files in the database so they can be searched. However, if you are going to retrieve the files based on something about them that is stored as a separate attribute, such as their title, then I think that storing the files as large objects is unnecessary. If you have the option to "[do] as pointers to those files" then I think you are not intending to use any of Postgres's features beyond its ability to store and retrieve data. Surely the file system itself is 'better' at this regardless of how stable Postgres LOs are. I would sure like to hear someone else's opinion on this because it seems that quite a few folks are using large objects and I am often curious about why. John Henderson
> A while ago it was being held that the Postgres large object data type was too new and not sufficiently tested and mature to be used in a production environment. I am about to deploy a little database that involves storing large-ish text files (20-500k) which could either be done as large objects or as pointers to those files, which would then be stored as ordinary files in the OS's filesystem. I am undecided on this question. What are the pros and cons? What is the performance in either case vis-a-vis the other? It is a web app; the interface is done in PHP. Beginning from which version is the large object interface (if at all) to be considered stable and ready for production?
>
> cheers frank
> -- frank joerdens joerdens new media heinrich-roller str. 16/17 10405 berlin germany e: [EMAIL PROTECTED] t: +49 30 44055471 f: +49 30 44055475 h: http://www.joerdens.de pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc

Stephen Davies Consulting [EMAIL PROTECTED] Adelaide, South Australia. Voice: 08-8177 1595 Computing Network solutions. Fax: 08-8177 0133
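The broken-link snafu described above is exactly what a pointer-based design has to guard against: once the database stores only paths, something has to check periodically that those paths still resolve. A minimal sketch of such a consistency check, with a temporary directory standing in for the file store and all names being illustrative assumptions:

```python
# Hedged sketch: given the paths a database claims to reference, report the
# ones that no longer point at a real file (e.g. after a directory move).
import os
import tempfile

def find_broken_links(paths):
    """Return the stored paths that no longer resolve to a file."""
    return [p for p in paths if not os.path.isfile(p)]

with tempfile.TemporaryDirectory() as store:
    good = os.path.join(store, "doc1.txt")
    with open(good, "w") as f:
        f.write("hello")
    # Simulate an administrator having moved doc2.txt into another directory.
    moved = os.path.join(store, "old_dir", "doc2.txt")
    broken = find_broken_links([good, moved])
```

A database holding the BLOBs internally makes this whole class of check unnecessary, which is the point of the post above.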
Re: [GENERAL] To BLOB Or Not To BLOB
On Sun, 16 Apr 2000, Frank Joerdens wrote:

> A while ago it was being held that the Postgres large object data type was too new and not sufficiently tested and mature to be used in a production environment. I am about to deploy a little database that involves storing large-ish text files (20-500k) which could either be done as large objects or as pointers to those files, which would then be stored as ordinary files in the OS's filesystem. I am undecided on this question. What are the pros and cons? What is the performance in either case vis-a-vis the other? It is a web app; the interface is done in PHP. Beginning from which version is the large object interface (if at all) to be considered stable and ready for production?

How are you going to dump and restore the large objects? The pg_dump and pg_dumpall programs will only handle the tables, it seems. There are some programs out there to dump large objects, and I've been playing with one. It's worked well so far. You can get it at ftp://ftp2.zf.jcu.cz/zakkr/pg/ I'd like to know what other ways there are to handle dumping/restoring large objects. Isn't there any standard way that comes with PostgreSQL?

Nevertheless, I've been using PHP and large objects with no problems so far. I've been able to fully dump the database with all large objects, upgrade PostgreSQL to a new version, run initdb, and restore everything. Reading large objects seems fairly fast too. On my website, all data is stored in the database, including pictures. The version that is up and running reads the data out of the database on every request! Since I use query-string URLs like ?image=2345, web browsers don't cache it much. It's not so slow if you don't have a lot of traffic, but if you are planning for a lot of traffic, try designing for on-the-fly content generation for a speed-up. That is what I am doing on the devel version of my site.
This way the data is read out of the database only on the first request; then HTML and image files are generated on the filesystem to serve the same request next time, unless a query string is passed which overrides this and forces content regeneration. It's a little tricky, but cool.

I'd say go for storing everything in the database, so long as you master how to dump and restore large objects as well as the databases/tables using pg_dump/pg_dumpall. This way it is possible to really separate your site code from its data: all data is backed up in the database and its dumps, and all code (PHP etc.) is backed up in a CVS repository. HTML pages and images generated on the filesystem as an ordinary-looking website can be seen as just temporary cached data that can be sent out quickly when it matches a request. Look into mod_rewrite on Apache. If you want a simpler site that serves fast, then just use the file-pointer idea and keep the files outside the database.

Robert Easter [EMAIL PROTECTED]
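The generate-on-first-request pattern described above can be sketched in a few lines: content is produced (e.g. read out of the database and rendered) only on a cache miss, written to the filesystem, and served from the file on later requests, with a force flag playing the role of the overriding query string. The function names and cache layout are assumptions of this sketch, not anything from the original posts:

```python
# Minimal sketch of filesystem caching for database-generated content.
import os
import tempfile

def serve(cache_dir, key, generate, force=False):
    """Return content for key, regenerating on a miss or when forced."""
    path = os.path.join(cache_dir, key)
    if force or not os.path.exists(path):
        content = generate()          # the expensive step: DB read, rendering
        with open(path, "w") as f:    # cache it on the filesystem
            f.write(content)
    with open(path) as f:             # serve from the cached file
        return f.read()

calls = []
def generate():
    calls.append(1)                   # count how often we hit the "database"
    return "<html>page</html>"

with tempfile.TemporaryDirectory() as d:
    first = serve(d, "page1.html", generate)   # miss: generates and caches
    second = serve(d, "page1.html", generate)  # hit: served from the file
```

In the real setup described above, the cached files would simply be served by the web server directly, with mod_rewrite routing misses to the generator.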
Re: [reaster@comptechnews.com: Re: [GENERAL] To BLOB Or Not To BLOB]
On Sun, 16 Apr 2000, you wrote:

> There are some programs out there to dump large objects and I've been playing with one. It's worked well so far. You can get it at ftp://ftp2.zf.jcu.cz/zakkr/pg/

--- file not found... Correction: ftp://ftp2.zf.jcu.cz/users/zakkr/pg/
Re: [GENERAL] To BLOB Or Not To BLOB
Hi,

It seems that the issue with large objects is "Why do you want the info in a database?" It seems to me that the point of a database is its ability to order and relate data. If you want to retrieve the "large-ish text files" based on their content, then I think you need to have the files in the database so they can be searched. However, if you are going to retrieve the files based on something about them that is stored as a separate attribute, such as their title, then I think that storing the files as large objects is unnecessary.

If you have the option to "[do] as pointers to those files", then I think you are not intending to use any of Postgres's features beyond its ability to store and retrieve data. Surely the file system itself is 'better' at this, regardless of how stable Postgres LOs are. I would sure like to hear someone else's opinion on this, because it seems that quite a few folks are using large objects and I am often curious about why.

John Henderson

> A while ago it was being held that the Postgres large object data type was too new and not sufficiently tested and mature to be used in a production environment. I am about to deploy a little database that involves storing large-ish text files (20-500k) which could either be done as large objects or as pointers to those files, which would then be stored as ordinary files in the OS's filesystem. I am undecided on this question. What are the pros and cons? What is the performance in either case vis-a-vis the other? It is a web app; the interface is done in PHP. Beginning from which version is the large object interface (if at all) to be considered stable and ready for production?
>
> cheers frank
> -- frank joerdens joerdens new media heinrich-roller str. 16/17 10405 berlin germany e: [EMAIL PROTECTED] t: +49 30 44055471 f: +49 30 44055475 h: http://www.joerdens.de pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc
Re: [GENERAL] To BLOB Or Not To BLOB
Well, I'm currently using the file system for large files. However, because of that I can see a few reasons why people might want to use PostgreSQL to handle them; others can probably mention more. Using Pg to handle the large stuff makes things more consistent overall, and it's easier for you to handle exceptions -- e.g. if things fail, the whole thing is rolled back and you theoretically don't get large files dangling around. Well, OK, you probably do until the next vacuum, but at least you don't have to write your own vacuum to handle that ;). Basically you shift the problem to Pg (and the very fine developers :) ).

The reasons I decided to go with the file system were: 1) I'm using Linux, and ext2fs has a 2 GB limit on files, and it seems that 6.5.3 tables are stored as single files, so better not to go down that path :). 2) I'm using Perl, DBI, etc., and a brief look at BLOB handling put me off. Maybe it was unwarranted, but given 1) I decided to call the whole thing off.

Cheerio, Link.
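The "write your own vacuum" chore alluded to above is real with filesystem storage: a failed transaction can leave a written file with no referencing row, so something has to periodically reconcile the directory against the table. A minimal sketch, where a plain set of names stands in for the result of a database query and every name is illustrative:

```python
# Hedged sketch of an orphan-file sweep: delete files in the store that no
# database row references any longer (e.g. left behind by a rolled-back
# transaction that had already written the file).
import os
import tempfile

def sweep_orphans(directory, referenced):
    """Remove unreferenced files from directory; return their names."""
    orphans = [n for n in os.listdir(directory) if n not in referenced]
    for name in orphans:
        os.remove(os.path.join(directory, name))
    return sorted(orphans)

with tempfile.TemporaryDirectory() as d:
    for name in ("a.dat", "b.dat", "c.dat"):
        open(os.path.join(d, name), "w").close()
    # Pretend the database only has rows for a.dat and c.dat.
    removed = sweep_orphans(d, referenced={"a.dat", "c.dat"})
    remaining = sorted(os.listdir(d))
```

With large objects inside the database, the rollback itself (plus the database's own vacuum) takes care of this, which is the trade-off the post describes.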
Re: [GENERAL] To BLOB Or Not To BLOB (fwd)
Hello: An opinion on the use of images in BLOBs. Does anyone have experience with this? Regards, Roberto Andrade Fonseca [EMAIL PROTECTED]

-- Forwarded message --
Date: Mon, 17 Apr 2000 09:40:47 +0800
From: Lincoln Yeoh [EMAIL PROTECTED]
To: John Henderson [EMAIL PROTECTED], Frank Joerdens [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: [GENERAL] To BLOB Or Not To BLOB

Well, I'm currently using the file system for large files. However, because of that I can see a few reasons why people might want to use PostgreSQL to handle them; others can probably mention more. Using Pg to handle the large stuff makes things more consistent overall, and it's easier for you to handle exceptions -- e.g. if things fail, the whole thing is rolled back and you theoretically don't get large files dangling around. Well, OK, you probably do until the next vacuum, but at least you don't have to write your own vacuum to handle that ;). Basically you shift the problem to Pg (and the very fine developers :) ).

The reasons I decided to go with the file system were: 1) I'm using Linux, and ext2fs has a 2 GB limit on files, and it seems that 6.5.3 tables are stored as single files, so better not to go down that path :). 2) I'm using Perl, DBI, etc., and a brief look at BLOB handling put me off. Maybe it was unwarranted, but given 1) I decided to call the whole thing off.

Cheerio, Link.
Re: [GENERAL] To BLOB Or Not To BLOB
> A while ago it was being held that the Postgres large object data type was too new and not sufficiently tested and mature to be used in a production environment. I am about to deploy a little database that involves storing large-ish text files (20-500k) which could either be done as large objects or as pointers to those files, which would then be stored as ordinary files in the OS's filesystem. I am undecided on this question. What are the pros and cons? What is the performance in either case vis-a-vis the other? It is a web app; the interface is done in PHP. Beginning from which version is the large object interface (if at all) to be considered stable and ready for production?

I would always use a reference solution. It's database-independent, and if the content you store is a URL, then you can easily fetch the referenced text file.

Marten
Re: [GENERAL] To BLOB Or Not To BLOB
- It seems that the issue with large objects is "Why do you want the info in a database?"

To organize them, of course.

- It seems to me that the point of a database is its ability to order and relate data. If you want to retrieve the "large-ish text files" based on their content then I think you need to have the files in the database so they can be searched. However, if you are going to retrieve the files based on something about them that is stored as a separate attribute, such as their title, then I think that storing the files as large objects is unnecessary.

Partly I would like to retrieve them based on their content, which is why I wrote (badly) a small extension that searches large objects for text strings.

- If you have the option to "[do] as pointers to those files" then I think you are not intending to use any of Postgres's features beyond its ability to store and retrieve data. Surely the file system itself is 'better' at this regardless of how stable Postgres LOs are.

Yes, but the file system is external to the database. A view that I have come to appreciate the more I use databases like PG and Oracle is that all of the relations within a database should be internal. Want to reference a title? Reference the table of titles. Want to reference a bunch of data? Point to the bunch of data. If I have to use the large object interface to do this, then I will. This way, you let the database worry about organizing and storing the data, and all you need to do is worry about backing that database up -- not about making sure that the permissions on files are right, or that the directory structure remains the same, or that the DB remains synced to the contents of the files... This is also why the FOREIGN KEY functionality that e.g. MySQL completely lacks, and that PostgreSQL is going to have as of 7.0, is so important: maintaining referential integrity.
Here are two specific applications for which I am using large objects, one in Oracle and one in PG, both involving online databases, both using the ArsDigita Community System.

First, I managed the online paper submission and review process for a conference using Oracle. For this, it was very convenient to be able to grab submissions and stuff them into the database, from which they could be referenced by reviews etc. Moreover, because of the geographical distribution of the review committee, the ability to grab the papers without giving access to the machine by shell or by FTP (which is generally a security problem) was excellent. Finally, separate revisions of the papers could be kept separate using the obvious internal mechanisms of the database.

The second application is one I'm working on right now, a genome database project for the sea urchin. For this, we need to be able to archive arbitrarily large amounts of file-oriented data by (for example) sequence name, and we want to offer the option to search it as well. The only really convenient way to do this is to store it all in the database. (Note that it's not terribly *fast* to do it this way ;). I would be happy to discuss this more if anyone has suggestions for alternatives -- I'm not wedded to the approach, but it seemed to be the obvious one.

cheers, --titus
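Searching a large stored object for a text string, in the spirit of the small extension mentioned above, does not require loading the whole object: read it in fixed-size chunks and keep a small overlap so matches spanning a chunk boundary are not missed. This is a hedged sketch of that technique, not the actual extension; the chunk size and function name are my own:

```python
# Minimal sketch: chunked substring search over a byte stream, with an
# overlap of len(needle) - 1 bytes carried between chunks so a match that
# straddles a chunk boundary is still found.
import io

def stream_contains(stream, needle, chunk_size=8192):
    """Return True if needle occurs anywhere in the byte stream."""
    keep = len(needle) - 1
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False
        window = tail + chunk
        if needle in window:
            return True
        tail = window[-keep:] if keep else b""

# The needle is placed so it straddles the first 4096-byte chunk boundary.
big = b"x" * 4090 + b"sea urchin" + b"y" * 4090
found = stream_contains(io.BytesIO(big), b"sea urchin", chunk_size=4096)
missing = stream_contains(io.BytesIO(big), b"genome", chunk_size=4096)
```

Against a real large object, the same loop would read chunks through the large-object interface instead of an in-memory stream.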