Re: [GENERAL] to BLOB or not to BLOB

2001-05-14 Thread KuroiNeko

 I am going to write an application that writes big amounts of plain text
into a database.

 What kind of data are those? Articles? News/mail messages? Other?

 I thought of using the text type for this, but I don't know if it has a
maximum length, and given that these will be very long texts I started
wondering if these would have to be blobs... but they aren't binary.

 Well, plain text is a `subset' of `binary,' so storing it as a BLOB
shouldn't be an issue. My home-brewed email archive stores message bodies as
BLOBs with absolutely no problem.
 On the length limit of the text type, for older versions of PGSQL it should
be less than the page size (default 8 KB, configurable at compile time). In
TOASTed PGSQL there's probably no limit set by the DBMS itself.
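 Purely as an illustrative sketch of the plain `text' route (the table and
column names below are mine, not anything standard):

    -- Hypothetical schema: long plain text kept in an ordinary text column.
    -- On a TOASTed PGSQL the column is not bound by the page size.
    CREATE TABLE article (
        id    serial PRIMARY KEY,
        title text,
        body  text    -- arbitrarily long plain text
    );

    INSERT INTO article (title, body)
    VALUES ('example', 'a very long plain-text body ...');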

 So, does the text data type have a maximum length? Do BLOBs? What should I
use? For the time being, at least, I won't be searching these texts with a
search engine or anything... but if I were to, what considerations
should I take into account when designing the tables?

 As you already figured, the answer to your question depends on what you
want from your system. Maybe you'll be OK with BLOBs, but you have to
keep in mind that pg_dump can't handle them, and you need to perform
somewhat tricky things to back up and restore a PGSQL DB with BLOBs.
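 For moving individual BLOBs in and out by hand, the server-side
lo_export()/lo_import() functions are one option; a hypothetical example
(the OID and the path are made up, and the file ends up on the server's
filesystem):

    -- Write the large object with OID 12345 to a file, and read it back later.
    SELECT lo_export(12345, '/tmp/blob_12345.dump');
    SELECT lo_import('/tmp/blob_12345.dump');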
 Another thing to consider is searchability and indexes. If you really
don't want (and are never going to) search your big fields, your best bet
would be to store just file names and to keep the long chunks of text in
external files. Sure, you can store them in TOASTed fields if you need
indexes, but _IMHO_ it is yet to be proven that the simplicity of design and
implementation is worth the _possible_ performance degradation. I believe we
have yet to see a good and fast index built on a 100 KB text field.
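 A hypothetical sketch of that `pointer' layout (all names are invented):

    -- Only metadata and a path live in the database; the big text stays on disk.
    CREATE TABLE document (
        id       serial PRIMARY KEY,
        title    text,
        filename text NOT NULL    -- e.g. '/var/archive/docs/1234.txt'
    );
    CREATE INDEX document_title_idx ON document (title);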
 Maybe what you need is FTS, because IMNSHO there's not much use for an
attribute that can't be indexed and searched (updated, joined, grouped)
swiftly. Everything relatively small and fast to process goes into the DB;
huge portions of not-easily-indexable data should stay outside.
 Actually, FTS is a last resort, kind of. You may wish to split your text
into smaller fragments with a similar semantic (logical, whatever) load.





Re: [GENERAL] To BLOB Or Not To BLOB

2000-04-17 Thread chewie

How do I start this?  There are so many different ways to look at this
question.  Do I save files inside the database, or do I simply use the database
as a tool to query information about a given file, its properties, etc.?

Instead of making general statements, let me pick specific examples.  Let's say
we want to create an MP3 repository.  Our goals are to be able to query the
information saved in ID3 tags, to locate, organize, manipulate, or play the
MP3.  To narrow the scope of discussion, let's say I will be accessing these
files locally, on my filesystem or an NFS-mounted directory.

What would be the most efficient way for us to query the information found in
the ID3 tags?  We could parse each file in a number of directories and their
subdirectories for the information we require each time we need it, or we could
parse it once and save the information in a database structured around ID3
information.  I would say the latter is definitely more desirable and
acceptable.  Inside the database, the ID3 tags can be indexed and our searches
optimized by the information we're requesting.
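
To make that concrete, here is one way such an ID3 table might look (the
column set is my own guess, not any standard):

    -- ID3 metadata parsed once and indexed; the MP3 itself stays on the filesystem.
    CREATE TABLE id3_tag (
        id     serial PRIMARY KEY,
        path   text NOT NULL,    -- where the .mp3 file lives
        artist text,
        album  text,
        title  text,
        year   integer,
        genre  text
    );
    CREATE INDEX id3_tag_artist_idx ON id3_tag (artist);
    CREATE INDEX id3_tag_album_idx  ON id3_tag (album);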

But should we store the MP3s inside the database?  I say no.  A few
observations:

* Filesystems are generally better at allocating space for files than
databases are.  
* Storing files inside databases renders shell access or direct access to
the files impossible.  If the database goes south, so does access to the
files.

Now, our mp3 decoders are shell programs that read mp3 input by opening a file
or a URL connection, or by reading from standard input (that is, if we have a
good mp3 player).  Yet, in order for these mp3 decoders to be able to play an mp3
found inside the database, we are forced to use a specialized database
interface.  It may return the mp3 file to the decoder via standard input,
or it may export the file to a temp directory and feed the mp3 player the
location of that temp file.

Yet, why should we go through all of this trouble?  And why should we tie the
operation of two clearly separate applications into such inflexible
relationships?  We add overhead to the database to output this file, and we add
another level of logic to get the file to the decoder -- a level of logic that
could prove the downfall of a would-be MP3 mobile disc jockey, who's worked so
hard to please the Wicked Witch of the Wedding.  He'd be forced to pull out the
CDs and tapes he so dreads -- if he remembered to bring them along.

If we keep file access decoupled from gleaning categorizable, quantifiable data
about/from the file, we allow two separate and efficient tools to do their jobs
independently.  If file access is your concern, then use a file system that
will give you the level of control you desire to develop your security
policies.  

If grabbing information from a data file is important, such as parsing text
files for content, then build indexes to the data inside the file and store
those indexes in the database.  Internet directories are excellent examples of
this!  We don't neuter local file access by the applications that need the data
files, and we still provide access to searchable information.
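
As a rough sketch of what I mean (the table and column names are hypothetical):

    -- Keywords extracted from the files, pointing back at the files themselves.
    CREATE TABLE file_keyword (
        keyword text NOT NULL,
        path    text NOT NULL,    -- the file the keyword was found in
        PRIMARY KEY (keyword, path)
    );

    -- Find candidate files without ever pulling their contents into the database.
    SELECT path FROM file_keyword WHERE keyword = 'postgres';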

I would argue that access to a data file and querying sets of data files for
information are two separate layers of logic.  Logical layers are best served
when they are given tools to interact, but are not crippled by tight
integration.  Saving data files inside a database, IMHO, is an example of tight
integration, and something to be avoided if possible -- at least in this case.

-- 
  Chad "^chewie, gunnarr" Walstrom [EMAIL PROTECTED]
  http://wookimus.net/chewie



Re: [GENERAL] To BLOB Or Not To BLOB

2000-04-17 Thread Stephen Davies

G'day.

I do a lot of work with the BASIS textual/multi-media RDBMS package and 
run into this question all the time.

There is one pretty basic answer:

If you leave BLOBS lying around in the file system - particularly if 
it is a Novell or similar file system - people move them and the links get 
broken.

Just today I had an example where several thousand links broke because 
some dumb Novell administrator moved a directory holding files which 
were linked to from a database rather than stored in it. (Not my design 
decision.)

A database that can hold the BLOBS internally avoids this sort of snafu.

(BASIS also has the advantage that it can automagically index almost 
any form of BLOB to the word/phrase/sentence level if it is stored in 
the database.)

Cheers,
Stephen Davies
"John Henderson" [EMAIL PROTECTED]  wrote:
 Hi,
 It seems that the issue with large objects is "Why do you want the info in a
 database?"
 
 It seems to me that the point of a database is its ability to order and
 relate data. If you want to retrieve the "large-ish text files" based on
 their content then I think you need to have the files in the database so
 they can be searched. However, if you are going to retrieve the files based
 on something about them that is stored as a separate attribute, such as
 their title, then I think that storing the files as large objects is
 unnecessary.
 
 If you have the option to "[do] as pointers to those files" then I think you
 are not intending to use any of Postgres's features beyond its ability to
 store and retrieve data. Surely the file system itself is 'better' at this
 regardless of how stable Postgres LO's are.
 
 I would sure like to hear someone else's opinion on this because it seems
 that quite a few folks are using large objects and I am often curious about
 why.
 
 John Henderson
 
 A while ago it was being held that the Postgres large object data type
 was too new and not sufficiently tested and mature to be used in a
 production environment. I am about to deploy a little database that
 involves storing large-ish text files (20-500k) which could be either done
 as large objects or as pointers to those files that would be then stored
 as ordinary files in the OS's filesystem. I am undecided as to this
 question. What are the pros and cons? What is the performance in either
 case vis-a-vis the other? It is a web app; the interface is done in PHP.
 Beginning from which version is the large object interface (if at all)
 to be considered stable and ready for production?
 
 cheers frank
 
 --
 frank joerdens
 
 joerdens new media
 heinrich-roller str. 16/17
 10405 berlin
 germany
 
 e: [EMAIL PROTECTED]
 t: +49 30 44055471
 f: +49 30 44055475
 h: http://www.joerdens.de
 
 pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc
 





Stephen Davies Consulting [EMAIL PROTECTED]
Adelaide, South Australia.        Voice: 08-8177 1595
Computing & Network solutions.    Fax: 08-8177 0133





Re: [GENERAL] To BLOB Or Not To BLOB

2000-04-16 Thread Robert B. Easter

On Sun, 16 Apr 2000, Frank Joerdens wrote:
 A while ago it was being held that the Postgres large object data type
 was too new and not sufficiently tested and mature to be used in a
 production environment. I am about to deploy a little database that
 involves storing large-ish text files (20-500k) which could be either done
 as large objects or as pointers to those files that would be then stored
 as ordinary files in the OS's filesystem. I am undecided as to this
 question. What are the pros and cons? What is the performance in either
 case vis-a-vis the other? It is a web app; the interface is done in PHP.
 Beginning from which version is the large object interface (if at all)
 to be considered stable and ready for production?
 

How are you going to dump and restore the large objects?  The pg_dump and
pg_dumpall programs will only handle the tables, it seems.

There are some programs out there to dump large objects and I've been playing
with one.  It's worked well so far.  You can get it at
ftp://ftp2.zf.jcu.cz/zakkr/pg/

I'd like to know what other ways there are to handle dumping/restoring large
objects.  Isn't there any standard way that comes with PostgreSQL?

Nevertheless, I've been using php and large objects with no problems so far. 
I've been able to fully dump the database with all large objects, upgrade
PostgreSQL to a new version, initdb, and restore everything.

Reading large objects seems fairly fast too.  On my website, all data is
stored in the database, including pictures.  The version that is up and running
reads the data out of the database on every request!  Since I use query
string URLs like ?image=2345, web browsers don't cache it much.  It's not so slow
if you don't have a lot of traffic, but if you are planning for a lot of
traffic, try designing for on-the-fly content generation for a speed up.  That
is what I am doing on the devel version of my site.  This way the data is only
read out of the database on the first request then html and image files are
generated on the filesystem to serve the same request next time unless a query
string is passed, which overrides and forces content regeneration.  It's a little
tricky but cool.

I'd say go for storing everything in the database so long as you master how to
dump and restore large objects as well as the databases/tables using
pg_dump/pg_dumpall.  This way, it is possible to really separate your site code
from its data.  All data is backed up in the database and its dumps, all code
(PHP etc.) backed up in a CVS repository.  HTML pages and images generated on the
filesystem as an ordinary-looking website can be seen as just temporary cached
data that can be sent out quickly when it matches a request.  Look into
mod_rewrite on Apache.  If you want a simpler site that serves fast,
then just use the file pointer idea and keep the files outside the database.

Robert Easter
[EMAIL PROTECTED]



Re: [reaster@comptechnews.com: Re: [GENERAL] To BLOB Or Not To BLOB]

2000-04-16 Thread Robert B. Easter

On Sun, 16 Apr 2000, you wrote:
 There are some programs out there to dump large objects and I've been playing
 with one.  It's worked well so far.  You can get it at
 ftp://ftp2.zf.jcu.cz/zakkr/pg/
 
 ---
 
 file not found...

Correction:

ftp://ftp2.zf.jcu.cz/users/zakkr/pg/



Re: [GENERAL] To BLOB Or Not To BLOB

2000-04-16 Thread John Henderson

Hi,
It seems that the issue with large objects is "Why do you want the info in a
database?"

It seems to me that the point of a database is its ability to order and
relate data. If you want to retrieve the "large-ish text files" based on
their content then I think you need to have the files in the database so
they can be searched. However, if you are going to retrieve the files based
on something about them that is stored as a separate attribute, such as
their title, then I think that storing the files as large objects is
unnecessary.

If you have the option to "[do] as pointers to those files" then I think you
are not intending to use any of Postgres's features beyond its ability to
store and retrieve data. Surely the file system itself is 'better' at this
regardless of how stable Postgres LO's are.

I would sure like to hear someone else's opinion on this because it seems
that quite a few folks are using large objects and I am often curious about
why.

John Henderson

A while ago it was being held that the Postgres large object data type
was too new and not sufficiently tested and mature to be used in a
production environment. I am about to deploy a little database that
involves storing large-ish text files (20-500k) which could be either done
as large objects or as pointers to those files that would be then stored
as ordinary files in the OS's filesystem. I am undecided as to this
question. What are the pros and cons? What is the performance in either
case vis-a-vis the other? It is a web app; the interface is done in PHP.
Beginning from which version is the large object interface (if at all)
to be considered stable and ready for production?

cheers frank

--
frank joerdens

joerdens new media
heinrich-roller str. 16/17
10405 berlin
germany

e: [EMAIL PROTECTED]
t: +49 30 44055471
f: +49 30 44055475
h: http://www.joerdens.de

pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc





Re: [GENERAL] To BLOB Or Not To BLOB

2000-04-16 Thread Lincoln Yeoh

Well I'm currently using the file system for large files. However because
of that I can see a few reasons why people might want to use Postgresql to
handle them. Others can probably mention more.

Using Pg to handle large stuff makes things more consistent overall, and it's
easier for you to handle exceptions - e.g. if things fail the whole thing
is rolled back, and you theoretically don't get the large files dangling
around. Well ok you probably do until the next vacuum, but at least you
don't have to write your own vacuum to handle that ;). Basically you shift
the problem to Pg (and the very fine developers :) ).

The reasons I decided to go file system were:
1) I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
6.5.3 tables are stored as single files, so better not go down that path :). 
2) I'm using Perl, DBI etc and a brief look at BLOB handling put me off.
Maybe it was unwarranted, but given 1) I decided to call the whole thing off.

Cheerio,

Link.





Re: [GENERAL] To BLOB Or Not To BLOB (fwd)

2000-04-16 Thread Ing. Roberto Andrade Fonseca

Hello:

An opinion on the use of images in blobs.

Does anyone have experience with this?

Regards,

Roberto Andrade Fonseca
[EMAIL PROTECTED]

-- Forwarded message --
Date: Mon, 17 Apr 2000 09:40:47 +0800
From: Lincoln Yeoh [EMAIL PROTECTED]
To: John Henderson [EMAIL PROTECTED], Frank Joerdens [EMAIL PROTECTED],
 [EMAIL PROTECTED]
Subject: Re: [GENERAL] To BLOB Or Not To BLOB

Well I'm currently using the file system for large files. However because
of that I can see a few reasons why people might want to use Postgresql to
handle them. Others can probably mention more.

Using Pg to handle large stuff makes things more consistent overall, and it's
easier for you to handle exceptions - e.g. if things fail the whole thing
is rolled back, and you theoretically don't get the large files dangling
around. Well ok you probably do until the next vacuum, but at least you
don't have to write your own vacuum to handle that ;). Basically you shift
the problem to Pg (and the very fine developers :) ).

The reasons I decided to go file system were:
1) I'm using Linux and ext2fs has a 2GB limit on files, and it seems like
6.5.3 tables are stored as single files, so better not go down that path :). 
2) I'm using Perl, DBI etc and a brief look at BLOB handling put me off.
Maybe it was unwarranted, but given 1) I decided to call the whole thing off.

Cheerio,

Link.





Re: [GENERAL] To BLOB Or Not To BLOB

2000-04-16 Thread Marten Feldtmann

 A while ago it was being held that the Postgres large object data type
 was too new and not sufficiently tested and mature to be used in a
 production environment. I am about to deploy a little database that
 involves storing large-ish text files (20-500k) which could be either done
 as large objects or as pointers to those files that would be then stored
 as ordinary files in the OS's filesystem. I am undecided as to this
 question. What are the pros and cons? What is the performance in either
 case vis-a-vis the other? It is a web app; the interface is done in PHP.
 Beginning from which version is the large object interface (if at all)
 to be considered stable and ready for production?
 

 I would always use a reference solution. It's database-independent,
and if the content you store is a URL, then you can easily get the
referenced text file.

 Marten





Re: [GENERAL] To BLOB Or Not To BLOB

2000-04-16 Thread Titus Brown

- It seems that the issue with large objects is "Why do you want the info in a
- database?"

To organize them, of course.

- It seems to me that the point of a database is its ability to order and
- relate data. If you want to retrieve the "large-ish text files" based on
- their content then I think you need to have the files in the database so
- they can be searched. However, if you are going to retrieve the files based
- on something about them that is stored as a separate attribute, such as
- their title, then I think that storing the files as large objects is
- unnecessary.

Partly I would like to retrieve them based on their content, which is why
I wrote (badly) a small extension that searches large objects for
text strings.

- If you have the option to "[do] as pointers to those files" then I think you
- are not intending to use any of Postgres's features beyond its ability to
- store and retrieve data. Surely the file system itself is 'better' at this
- regardless of how stable Postgres LO's are.

Yes, but the file system is external to the database.

A view that I am beginning to appreciate the more that I use databases
like PG and Oracle is that all of the relations within a database should
be internal.  Want to reference a title?  Reference the table of titles.
Want to reference a bunch of data?  Point to the bunch of data.  If I
have to use the large object interface to do this, then I will.

This way, you can let the database worry about organizing and storing the
data, and all you need to do is worry about backing that database up --
not about making sure that the permissions on files are right, or that
the directory structure remains the same, or that the DB remains synced to
the contents of the files...

This is also why the FOREIGN KEY functionality that e.g. MySQL completely
lacks, and that PostgreSQL is going to have as of 7.0, is so important:
maintaining referential integrity.
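
As a trivial, made-up illustration of the sort of integrity I mean:

    -- A paper row cannot point at a title row that does not exist.
    CREATE TABLE title (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );
    CREATE TABLE paper (
        id       serial PRIMARY KEY,
        title_id integer NOT NULL REFERENCES title (id),
        body     oid    -- large object holding the submitted file
    );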

Here are two specific applications for which I am using large objects,
one in Oracle and one in PG, both involving online databases, both
using the ArsDigita Community System.

First of all, I managed the online paper submission and review process
for a conference using Oracle.  For this, it was very convenient to
be able to grab submissions and stuff them into the database, from which
they could be referenced by review etc.  Moreover, because of the
geographical distribution of the review committee, the ability to
grab the papers without giving access to the machine by shell or by
FTP (which is generally a security problem) was excellent.  Finally,
separate revisions of the papers could be kept separate using the
obvious internal mechanisms of the database.

The second application is one I'm working on right now, a genome database
project for the sea urchin.  For this, we need to be able to archive
arbitrarily large amounts of file-oriented data by (for example) sequence
name, and we want to give the option to search it as well.  The
only really convenient way to do this is to store it all in the
database.  (Note that it's not terribly *fast* to do it this way ;).
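
A minimal sketch of the kind of layout I mean (the names are invented):

    -- Sequence data keyed by name and kept inside the database as large objects.
    CREATE TABLE sequence_file (
        name text PRIMARY KEY,    -- e.g. a sequence or clone name
        data oid                  -- large object holding the raw file contents
    );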

I would be happy to discuss this more if anyone has suggestions for
alternatives  -- I'm not wedded to the approach, but it seemed to be
the obvious one.

cheers,
--titus