Hi List.
I am looking for some general ideas or recommendations regarding a
development that I am about to begin, without any firm deadline. My
hope is to get some pointers that will help me choose well among a
range of possibilities, based on the wisdom and experience of other
users on this list.
I have written what is essentially a web-based document management
system, mostly in Perl. It works fine at a dozen sites so far.
The retrieval side is based on a text-retrieval system, which lets the
user search the full text of documents and display results on a web
page. Along with each result summary is a link to the corresponding
original document stored in the system.
This link, at the moment, triggers a CGI script on the same web server
(the links are something like "/cgi-bin/getdoc.pl?doc=id...").
This CGI script in turn establishes a TCP connection with an "original
document server", asks it for the original document, reads it from the
TCP connection, and returns it to the user's browser.
The "original document server" is a separate daemon written in Perl,
single-process and single-threaded, so it is sometimes a bottleneck,
because it can only answer one document request at a time, and documents
can be large (1 MB on average, but sometimes several MB).
I am thus considering rewriting it as a multi-process (forking) or
multi-threaded server, among other changes (such as compressing the
documents).
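A minimal sketch of the forking variant I have in mind (the port, the
request format, and the lookup are placeholders, not my real code):

    use strict;
    use warnings;
    use IO::Socket::INET;
    use POSIX ':sys_wait_h';

    # Reap finished children so they don't accumulate as zombies.
    $SIG{CHLD} = sub { 1 while waitpid(-1, WNOHANG) > 0 };

    my $listener = IO::Socket::INET->new(
        LocalPort => 8081,      # hypothetical port
        Listen    => 10,
        ReuseAddr => 1,
    ) or die "listen: $!";

    while (1) {
        my $client = $listener->accept or next;   # retry if interrupted
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                # child: one request, then exit
            close $listener;
            chomp(my $doc_id = <$client> || '');
            # placeholder: decode $doc_id to a path, stream the file back
            print {$client} "would send document $doc_id here\n";
            close $client;
            exit 0;
        }
        close $client;                  # parent: go back to accepting
    }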
But, having recently perused a lot of Apache2 and mod_perl2
documentation (including the recently published mod_perl2 User's Guide -
thanks Stas & Jim), I now wonder whether a better idea would be to use a
dedicated Apache2/mp2 server for the task, leaving all that complex
multi-process management to Apache. My document-retrieval code could
simply go into a PerlResponseHandler.
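Something like this minimal handler is what I imagine (the module name,
URI scheme, and id-to-path lookup are placeholders, not my real code):

    package My::GetDoc;
    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::RequestIO ();
    use Apache2::Const -compile => qw(OK NOT_FOUND);

    sub handler {
        my $r = shift;

        # Assumed URI scheme: /getdoc/<doc-id>
        my ($doc_id) = $r->uri =~ m{/getdoc/([\w.-]+)$}
            or return Apache2::Const::NOT_FOUND;

        my $path = decode_doc_id($doc_id);
        return Apache2::Const::NOT_FOUND unless defined $path && -r $path;

        $r->content_type('application/octet-stream');
        $r->sendfile($path);            # let Apache stream the file
        return Apache2::Const::OK;
    }

    sub decode_doc_id {
        my ($id) = @_;
        # placeholder for my real id-to-path decoding (index lookup)
        return "/var/docs/$id";         # hypothetical layout
    }

    1;

with, in the dedicated server's configuration, something like:

    <Location /getdoc>
        SetHandler perl-script
        PerlResponseHandler My::GetDoc
    </Location>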
Any pearls of wisdom available so far?
Of course, if this document server becomes a separate Apache instance,
all the application's document links would have to be rewritten as
http://this_other_server/getdoc/doc-id
and that is quite some work. (I also rather dislike the idea of
end-user browsers accessing the document server directly.)
So, having also followed some other threads on this list, I am wondering
what other solutions might be available, such as mod_rewrite or
mod_proxy in the "front" server, with the "document server" sitting
behind it.
Any ideas or recommendations around this subject?
(Maybe also about relative performance.)
A third concern on the same subject:
One of the things the document server must do, in order to decode a
"document-id" into a real path on disk, is read a couple of relatively
large index files, parse them, and keep the result in memory for later
reference (at the moment, in a Perl hash).
I would of course like to avoid doing that for each request.
Ideally, I would like these files to be read and parsed once into some
shared table accessible to all Apache/mp2 children, usable read-only by
all concurrent request handlers. But these files do change from time to
time (as new documents are added), so they must be re-read and re-parsed
whenever their last modification time changes.
This is of course easy in a single-threaded server, but I don't quite
see how best to do it in an Apache/mp2 context.
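One pattern I am considering (module name, file path, and file format
are placeholders): load the index at server startup, before Apache
forks, so all children inherit it copy-on-write; then have each child
stat() the file on every request and reload its own copy when the
mtime changes.

    package My::DocIndex;
    use strict;
    use warnings;

    my %index;                  # doc-id => path on disk
    my $loaded_mtime = 0;
    my $index_file = '/path/to/index.dat';    # hypothetical

    sub load {
        my $mtime = (stat $index_file)[9]
            or die "cannot stat $index_file: $!";
        return if $mtime == $loaded_mtime;    # still current
        open my $fh, '<', $index_file
            or die "cannot open $index_file: $!";
        my %new;
        while (my $line = <$fh>) {
            chomp $line;
            my ($id, $path) = split /\t/, $line, 2;   # hypothetical format
            $new{$id} = $path;
        }
        close $fh;
        %index = %new;          # swap in the fresh table
        $loaded_mtime = $mtime;
    }

    sub path_for {
        my ($id) = @_;
        load();     # one stat() per request; reload only on change
        return $index{$id};
    }

    load();    # run once at startup (e.g. from startup.pl), before the fork

    1;

Each child would end up with its own copy after a reload, which costs
memory; I suppose something like a tied DBM file could be used if that
ever becomes a problem.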
Any suggestions in that area?
Many thanks in advance for your help,
aw
P.S.
Advertisement: the curious can see a sample application here:
http://mira.mira-consulting.net