Any pearls of wisdom available so far?

Yes, why not integrate the actual retrieval into the site that people use to do the searches with? Or is that physically separate from the place you store documents at?


Of course, if this document server is then a separate Apache instance, all the application document links would have to be rewritten as
http://this_other_server/getdoc/doc-id
and that is quite some work. (I also kind of dislike the idea of the end-user browsers accessing the document-server directly.)

So, having also followed some other threads on this list, I am wondering which other solutions would be available, such as mod_rewrite or mod_proxy and the like in the "front" server, with the "document server" located "behind" that one.

Any ideas or recommendations around this subject?
(Maybe also ideas about relative performance issues.)

Well, using mod_proxy to do some reverse proxying would work, but users would still be able to more or less 'browse' the document tree if they know where to look. No real way around that one ;)



As a third concern on the same subject:
One of the things that the document server must do in order to decode a "document-id" into a real path on the disk, is to read a couple of relatively large index files, parse them and store them into memory for later referral (at the moment, it's in a perl hash).
I would of course like to avoid having to do that for each request.
Ideally, I would like to have these files read and parsed once into some shareable table accessible by all Apache/mp2 children, and usable read-only by all concurrent request handlers. But these files also change from time to time (as new documents are added), so they must be re-read and re-parsed whenever their last mod-time changes. This is of course easy in a single-threaded server, but I don't quite see how best to do that in an Apache/mp2 context.
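
For what it's worth, the "re-read only when the mod-time changes" part can be sketched roughly as below. This is Python purely for illustration; the same stat-and-compare logic would carry over to a Perl hash inside a handler. The class name ReloadingIndex, the tab-separated "doc-id<TAB>path" file layout and the single index file are all assumptions, not the real index format. Note also that this keeps one copy of the table per process, so it avoids re-parsing on every request but does not give a table that is truly shared between children.

```python
import os


class ReloadingIndex:
    """Per-process doc-id -> path table, re-parsed only when the file's mtime changes."""

    def __init__(self, index_path):
        self.index_path = index_path
        self._mtime = None   # mtime (ns) of the version currently held in memory
        self._table = {}

    def _parse(self):
        # Assumed layout: one "doc-id<TAB>path" pair per line, '#' starts a comment.
        table = {}
        with open(self.index_path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                doc_id, path = line.split("\t", 1)
                table[doc_id] = path
        return table

    def lookup(self, doc_id):
        # One cheap stat() per request; the expensive parse runs only after a change.
        mtime = os.stat(self.index_path).st_mtime_ns
        if mtime != self._mtime:
            self._table = self._parse()
            self._mtime = mtime
        return self._table.get(doc_id)
```

Each request then costs one stat() call plus a hash lookup, and the parse cost is paid once per child per change of the index file.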

I suggest setting up MySQL and storing that information in there -- depending on the type of documents you search through, you could potentially even put the documents in the database as well, although that's not really a 'good' way of doing it. So in the end, if you store those indexes in the database, you get the shared accessibility, and you can always use a cronjob to update it.
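
To make that a bit more concrete, here is a minimal sketch of the idea, assuming a single table that maps doc-id to path. It uses Python's built-in sqlite3 module as a stand-in so it runs without a server; with MySQL the SQL would look much the same, but the connection code and placeholder style would differ. The table name doc_index and the functions open_index, refresh_index and resolve are made up for illustration.

```python
import sqlite3


def open_index(db_path):
    """Open the index database, creating the (hypothetical) doc_index table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_index ("
        " doc_id TEXT PRIMARY KEY,"
        " path TEXT NOT NULL)"
    )
    return conn


def refresh_index(conn, entries):
    """Replace all rows in one transaction; this is what the cron job would run."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM doc_index")
        conn.executemany(
            "INSERT INTO doc_index (doc_id, path) VALUES (?, ?)", entries
        )


def resolve(conn, doc_id):
    """Return the on-disk path for a doc-id, or None if it is unknown."""
    row = conn.execute(
        "SELECT path FROM doc_index WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None
```

The request handler would only ever call resolve(), so every child sees the same data, and the refresh happens outside the web server entirely.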

Just my 2 cents :)
