[GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-08 Thread Arash pajoohande
hello
i have a lot of files in microsoft word format. each file consists of text
and images (other text formatting like font is not important).
i want to store this documents in Postgresql, and documents must display on
web page when corresponding user requests occurs.

it seems theres 2 way to do that:
1. save .doc documents in bytea columns. and show them with a word reader in
web page (disadvantage: it needs a proper .doc reader installed on user
computer)
2. convert document to html format and store html code as string
(disadvantage: images will remain in file system)

what is your offer? any help highly would be appreciated

thanks in advance
arash


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-08 Thread Craig Ringer

On 06/08/2011 06:13 PM, Arash pajoohande wrote:


1. save .doc documents in bytea columns. and show them with a word
reader in web page (disadvantage: it needs a proper .doc reader
installed on user computer)


1a: Convert the .doc files to a standard format like PDF that most 
browsers can display. That's what I'd do.



2. convert document to html format and store html code as string
(disadvantage: images will remain in file system)


You could always store images as 'bytea' fields on a subtable.

CREATE TABLE doc (
  id serial primary key,
  doc_html text,
  ...
);

CREATE TABLE doc_images (
  id serial primary key,
  doc_id integer references doc(id),
  image_data bytea not null
);

There are advantages and disadvantages to storing files in the database 
vs in the file system. It's been discussed to death recently, so see the 
mailing list archives.


--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-08 Thread John R Pierce

On 06/08/11 6:06 AM, Craig Ringer wrote:

1. save .doc documents in bytea columns. and show them with a word
reader in web page (disadvantage: it needs a proper .doc reader
installed on user computer)


1a: Convert the .doc files to a standard format like PDF that most 
browsers can display. That's what I'd do. 


thats harder to integrate with a website in the sense that the PDF 
documents are hard page formatted, and can at best be displayed in an 
 within your site, and half the time, only displayed in an 
external PDF viewer since browser-pdf integration remains flakey and 
bugridden after all these years.   PDF text won't flow to fit your page 
layout, etc etc.


one approach to conversion might be to save the documents as an RTF type 
format, and run that through a preprocessor that reencodes them as a 
clean HTML or similar metalanguage that you can deal with intelligently. 
   MS Word's own HTML converter creates wretched HTML with tons of 
extra bizarro-world tags which likely will trip up your page formatting 
if you display these in context in your pages.


as Craig suggested, images can be stored as bytea objects, and the image 
links converted to point to a CGI that fetches them from the database 
and presents them to the client browser.


--
john r pierceN 37, W 122
santa cruz ca mid-left coast


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-08 Thread Tomas Vondra
Dne 8.6.2011 21:37, John R Pierce napsal(a):
> On 06/08/11 6:06 AM, Craig Ringer wrote:
>>> 1. save .doc documents in bytea columns. and show them with a word
>>> reader in web page (disadvantage: it needs a proper .doc reader
>>> installed on user computer)
>>
>> 1a: Convert the .doc files to a standard format like PDF that most
>> browsers can display. That's what I'd do. 
> 
> thats harder to integrate with a website in the sense that the PDF
> documents are hard page formatted, and can at best be displayed in an
>  within your site, and half the time, only displayed in an
> external PDF viewer since browser-pdf integration remains flakey and
> bugridden after all these years.   PDF text won't flow to fit your page
> layout, etc etc.

OTOH, the probability that the visitor has a PDF reader is much higher.
Plus the PDF is usually much easier to index etc. But yes, using this to
build a website is PITA.

> one approach to conversion might be to save the documents as an RTF type
> format, and run that through a preprocessor that reencodes them as a
> clean HTML or similar metalanguage that you can deal with intelligently.
>MS Word's own HTML converter creates wretched HTML with tons of extra
> bizarro-world tags which likely will trip up your page formatting if you
> display these in context in your pages.

You could as well run htmltidy or something like that on the HTML. But
in both cases, this will seriously damage the formatting. So if the OP
wants to preserve it, the only viable solution is to keep the .doc
format or convert it to a .pdf and pray there's a working viewer.

Or you can convert the PDF into images ("convert" from imagemagick can
do that quite easily), display those images on the web and offer the PDF
for download.

regards
Tomas

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-09 Thread Radosław Smogura

On Wed, 8 Jun 2011 14:43:16 +0430, Arash pajoohande wrote:

hello
i have a lot of files in microsoft word format. each file consists of
text and images (other text formatting like font is not important).
i want to store this documents in Postgresql, and documents must
display on web page when corresponding user requests occurs.

it seems theres 2 way to do that:
1. save .doc documents in bytea columns. and show them with a word
reader in web page (disadvantage: it needs a proper .doc reader
installed on user computer)
2. convert document to html format and store html code as string
(disadvantage: images will remain in file system)

what is your offer? any help highly would be appreciated

thanks in advance
arash


If you want to do 2. there is no problem your application should 
display documents like this

/browswer//.html

Then after conversion put all generated files in database, converted 
document (start.html) should be relatively linked, if some browser will 
call to load image big/my_logo.png, then it will do request

/browswer//html/big/my_logo.png

Your CGI/Servlet will capture every request after /browser, you have id 
of document and requested file name passed in URL then do

SELECT content where doc_id=... and filename=''

Caching, speed-uping, on demand generation, securing is on your side.

Regards,
Radek

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-09 Thread Vincent Veyron
Le mercredi 08 juin 2011 à 14:43 +0430, Arash pajoohande a écrit :

> and documents must display on web page when corresponding user
> requests occurs.

I wonder why that is? is there any reason you can't just store and serve
the binary content as is, simply letting the proper application (Word in
that case) open it?

-- 
Vincent Veyron
http://marica.fr/
Logiciel de gestion des sinistres et des contentieux pour le service juridique


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-10 Thread Arash pajoohande
hello all and thank you for your advices
On Thu, Jun 9, 2011 at 7:05 PM, Vincent Veyron  wrote:

> Le mercredi 08 juin 2011 à 14:43 +0430, Arash pajoohande a écrit :
>
> > and documents must display on web page when corresponding user
> > requests occurs.
>
> I wonder why that is? is there any reason you can't just store and serve
> the binary content as is, simply letting the proper application (Word in
> that case) open it?
>
that is because MS Word may hasn't been installed in user machine to handle
the page.

Actually, my word files consists of many exam questions. for each user, a
random subset of questions must selected and displayed.
it seem's that it would be nice if I have each question as an easy to handle
document section (e.g. html div) and display them without need of any other
application.
for now, i think the best solution for me is to convert each question as an
RTF type format, and run that through a preprocessor that reencodes them as
a clean HTML (as john r pierce suggested).

thanks in advance
Arash


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-10 Thread John R Pierce

On 06/10/11 12:04 AM, Arash pajoohande wrote:
Actually, my word files consists of many exam questions. for each 
user, a random subset of questions must selected and displayed.
it seem's that it would be nice if I have each question as an easy to 
handle document section (e.g. html div) and display them without need 
of any other application.
for now, i think the best solution for me is to convert each question 
as an RTF type format, and run that through a preprocessor that 
reencodes them as a clean HTML (as john r pierce suggested).



actually, for THAT requirement, I think I'd store the questions in the 
database as plain text and wrap them in HTML formatting in your View or 
Presentation layer as its being sent to the user



--
john r pierceN 37, W 122
santa cruz ca mid-left coast


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-10 Thread Craig Ringer

On 06/10/2011 03:26 PM, John R Pierce wrote:


actually, for THAT requirement, I think I'd store the questions in the
database as plain text and wrap them in HTML formatting in your View or
Presentation layer as its being sent to the user


I agree - sending a PDF, Word doc, RTF, etc for that is just horrid.

Consider saveing the Word docs as plain text, then parsing them with a 
Perl script or something to split them into questions and answers you 
can store in the database.


Even if the script only does a rough job it might do enough of the job 
that you could quickly go through and fix them up / check them against 
the original documents.


There are almost certainly existing questionnaire / exam scripts or 
applications you can re-use for the presentation side.


--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-10 Thread Leif Biberg Kristensen
On Friday 10. June 2011 09.04.18 Arash pajoohande wrote:

> Actually, my word files consists of many exam questions. for each user, a
> random subset of questions must selected and displayed.
> it seem's that it would be nice if I have each question as an easy to
> handle document section (e.g. html div) and display them without need of
> any other application.

That is a very trivial task in eg. PHP. Like John and Craig has said, just 
store the questions as plain text in the database. (Watch out for special MS 
characters.) Let the PHP script pick a random subset of predefined size and 
display it in the browser. Click on File -> Print.

Sounds like less than a day's work to me.

regards, Leif

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what is the best way of storing text+image documents in postgresql

2011-06-10 Thread Vincent Veyron
Le vendredi 10 juin 2011 à 11:34 +0430, Arash pajoohande a écrit :

> 
> > and documents must display on web page when corresponding
> user
> > requests occurs.
> 
> 
> I wonder why that is? is there any reason you can't just store
> and serve
> the binary content as is, simply letting the proper
> application (Word in
> that case) open it?
> that is because MS Word may hasn't been installed in user machine to
> handle the page.
> 

Provide a link to LibreOffice?


> for now, i think the best solution for me is to convert each question
> as an RTF type format, and run that through a preprocessor that
> reencodes them as a clean HTML (as john r pierce suggested).

That depends on how clean your documents are : it could be hard to
convert them. But I agree on the principle.


-- 
Vincent Veyron
http://marica.fr/
Logiciel de gestion des sinistres et des contentieux pour le service juridique


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general