Any pearls of wisdom available so far?
Yes: why not integrate the actual retrieval into the site that people
use for their searches? Or is that physically separate from the
place where you store the documents?
Of course, if this document server is then a separate Apache
instance, all the application document links would have to be
rewritten as
http://this_other_server/getdoc/doc-id
and that is quite some work. (I also kind of dislike the idea of
the end-user browsers accessing the document-server directly.)
So, having also followed some other threads on this list, I am
wondering which other solution would be available, such as
mod_rewrite or mod_proxy and the like in the "front" server, and
the "document server" being located "behind" that one.
Any ideas or recommendations around this subject?
(Maybe also ideas about relative performance issues.)
Well, using mod_proxy to do some reverse proxying would work, but
users would still be able to more or less 'browse' the document tree
if they know where to look. No real way around that one ;)
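For reference, a minimal sketch of what the reverse-proxy setup could
look like on the "front" server. The backend address
"docserver.internal:8080" is a made-up placeholder, and the /getdoc/
prefix is taken from the URL form mentioned earlier in the thread; it
assumes mod_proxy and mod_proxy_http are loaded.

```apache
# Front server only: forward /getdoc/ requests to the document server.
# "docserver.internal:8080" is a hypothetical backend address.
ProxyRequests Off
ProxyPass        /getdoc/ http://docserver.internal:8080/getdoc/
ProxyPassReverse /getdoc/ http://docserver.internal:8080/getdoc/
```

With something like this, application links can stay relative to the
front server (/getdoc/doc-id), and end-user browsers never talk to the
document server directly.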
A third concern on the same subject:
One of the things that the document server must do in order to
decode a "document-id" into a real path on the disk, is to read a
couple of relatively large index files, parse them and store them
into memory for later lookup (at the moment, it's in a Perl hash).
I would of course like to avoid having to do that for each request.
Ideally, I would like to have these files read and parsed once into
some shareable table accessible by all Apache/mp2 children, and
usable read-only by all concurrent request handlers. But also
these files do change from time to time (as new documents are
added), so they must be re-read and re-parsed from time to time
(when their last mod-time changes).
This is of course easy in a single-threaded server, but I don't
quite see how to do that best in an Apache/mp2 context.
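One common way to approximate this, sketched below under stated
assumptions, is a per-process cache: each Apache child keeps its own
parsed copy of the index and re-reads the file only when its
modification time changes. This is not truly shared memory; with the
prefork MPM each child holds (and reloads) its own hash. The package
name My::DocIndex, the function names, and the index format (one
"doc-id<TAB>path" pair per line) are all hypothetical, since the real
index layout is not described here.

```perl
package My::DocIndex;    # hypothetical package name
use strict;
use warnings;

# Per-process cache: index file => { mtime => ..., table => { id => path } }
my %cache;

# Parse one index file into a hash reference.
# Assumed format (not from the original post): "doc-id<TAB>path" per line.
sub load_index {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %table;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/ || $line =~ /^\s*#/;   # skip blanks, comments
        my ($id, $path) = split /\t/, $line, 2;
        $table{$id} = $path if defined $path;
    }
    close $fh;
    return \%table;
}

# Resolve a document id, re-parsing the index only if its mtime changed.
sub lookup {
    my ($file, $doc_id) = @_;
    my $mtime = (stat($file))[9];
    die "Cannot stat $file: $!" unless defined $mtime;

    my $entry = $cache{$file};
    if (!$entry || $entry->{mtime} != $mtime) {
        # First use in this process, or the file changed: (re)load it.
        $entry = $cache{$file} = {
            mtime => $mtime,
            table => load_index($file),
        };
    }
    return $entry->{table}{$doc_id};    # undef if the id is unknown
}

1;
```

A request handler would then call My::DocIndex::lookup($index_file,
$doc_id) on each request; the cost per request is one stat() call plus
a hash lookup, and the expensive parse happens only after the file
changes.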
I suggest setting up MySQL and storing that information in there --
depending on the type of documents you search through, you could
potentially even put the documents in the database as well, although
that's not really a 'good' way of doing it. So in the end, if you
store those indexes in the database, you get the shared
accessibility, and you can always use a cronjob to update it.
Just my 2 cents :)