Hi List.
I am looking for some general ideas or recommendations regarding a
development that I am about to begin, without any firm deadline. My
hope is to get some pointers that will help me choose well among a
range of possibilities, based on the wisdom and experience of other
users on this list.
I have written what is essentially a web-based document management
system, mostly in Perl. It works fine at a dozen sites so far.
The retrieval side is based on a text-retrieval system, which lets the
user search the full text of documents and display results on a web
page. Along with each result summary is a link to the corresponding
original document stored in the system.
This link, at the moment, triggers a CGI script on the same web server
(the links are something like "/cgi-bin/getdoc.pl?doc=id...").
This CGI script in turn establishes a TCP connection with an "original
document server", asks it for the original document, reads it from the
TCP connection, and returns it to the user's browser.
The "original document server" is a separate daemon written in Perl,
single-process and single-threaded, so it is sometimes a bottleneck,
because it can only answer one document request at a time, and documents
can be large (1 MB on average, but sometimes several MB).
I am thus considering rewriting it as a multi-process (forking) or
multi-threaded server, among other changes (such as compressing the
documents).
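A minimal sketch of the forking variant I have in mind (the port, the
request format, and the lookup are placeholders, not my real code):

    use strict;
    use warnings;
    use IO::Socket::INET;
    use POSIX ':sys_wait_h';

    # Reap finished children so they don't accumulate as zombies.
    $SIG{CHLD} = sub { 1 while waitpid(-1, WNOHANG) > 0 };

    my $listener = IO::Socket::INET->new(
        LocalPort => 8081,      # hypothetical port
        Listen    => 10,
        ReuseAddr => 1,
    ) or die "listen: $!";

    while (1) {
        my $client = $listener->accept or next;   # retry if interrupted
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                # child: one request, then exit
            close $listener;
            chomp(my $doc_id = <$client> || '');
            # placeholder: decode $doc_id to a path, stream the file back
            print {$client} "would send document $doc_id here\n";
            close $client;
            exit 0;
        }
        close $client;                  # parent: go back to accepting
    }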
But, having recently perused a lot of Apache2 and mod_perl2
documentation (including the recently published mod_perl2 User's Guide -
thanks Stas & Jim), I now wonder whether a better idea would be to use a
dedicated Apache2/mp2 server for the task, leaving all that complex
multi-process management to Apache. My document-retrieval code could
simply go into a PerlResponseHandler.
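Something like this minimal handler is what I imagine (the module name,
URI scheme, and id-to-path lookup are placeholders, not my real code):

    package My::GetDoc;
    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::RequestIO ();
    use Apache2::Const -compile => qw(OK NOT_FOUND);

    sub handler {
        my $r = shift;

        # Assumed URI scheme: /getdoc/<doc-id>
        my ($doc_id) = $r->uri =~ m{/getdoc/([\w.-]+)$}
            or return Apache2::Const::NOT_FOUND;

        my $path = decode_doc_id($doc_id);
        return Apache2::Const::NOT_FOUND unless defined $path && -r $path;

        $r->content_type('application/octet-stream');
        $r->sendfile($path);            # let Apache stream the file
        return Apache2::Const::OK;
    }

    sub decode_doc_id {
        my ($id) = @_;
        # placeholder for my real id-to-path decoding (index lookup)
        return "/var/docs/$id";         # hypothetical layout
    }

    1;

with, in the dedicated server's configuration, something like:

    <Location /getdoc>
        SetHandler perl-script
        PerlResponseHandler My::GetDoc
    </Location>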
Any pearls of wisdom available so far?
Of course, if this document server becomes a separate Apache instance,
all the application's document links would have to be rewritten as
http://this_other_server/getdoc/doc-id
and that is quite some work. (I also rather dislike the idea of
end-user browsers accessing the document server directly.)
So, having also followed some other threads on this list, I am wondering
what other solutions might be available, such as mod_rewrite or
mod_proxy in the "front" server, with the "document server" sitting
behind it.
Any ideas or recommendations around this subject?
(Maybe also about relative performance.)
A third concern on the same subject:
One of the things the document server must do, in order to decode a
"document-id" into a real path on disk, is read a couple of relatively
large index files, parse them, and keep the result in memory for later
reference (at the moment, in a Perl hash).
I would of course like to avoid doing that for each request.
Ideally, I would like these files to be read and parsed once into some
shared table accessible to all Apache/mp2 children, usable read-only by
all concurrent request handlers. But these files do change from time to
time (as new documents are added), so they must be re-read and re-parsed
whenever their last modification time changes.
This is of course easy in a single-threaded server, but I don't quite
see how best to do it in an Apache/mp2 context.
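One pattern I am considering (module name, file path, and file format
are placeholders): load the index at server startup, before Apache
forks, so all children inherit it copy-on-write; then have each child
stat() the file on every request and reload its own copy when the
mtime changes.

    package My::DocIndex;
    use strict;
    use warnings;

    my %index;                  # doc-id => path on disk
    my $loaded_mtime = 0;
    my $index_file = '/path/to/index.dat';    # hypothetical

    sub load {
        my $mtime = (stat $index_file)[9]
            or die "cannot stat $index_file: $!";
        return if $mtime == $loaded_mtime;    # still current
        open my $fh, '<', $index_file
            or die "cannot open $index_file: $!";
        my %new;
        while (my $line = <$fh>) {
            chomp $line;
            my ($id, $path) = split /\t/, $line, 2;   # hypothetical format
            $new{$id} = $path;
        }
        close $fh;
        %index = %new;          # swap in the fresh table
        $loaded_mtime = $mtime;
    }

    sub path_for {
        my ($id) = @_;
        load();     # one stat() per request; reload only on change
        return $index{$id};
    }

    load();    # run once at startup (e.g. from startup.pl), before the fork

    1;

Each child would end up with its own copy after a reload, which costs
memory; I suppose something like a tied DBM file could be used if that
ever becomes a problem.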
Any suggestions in that area?
Many thanks in advance for your help,
aw
P.S.
Advertisement: the curious can see a sample application here:
http://mira.mira-consulting.net