Hi,
I'm currently trying to optimize the performance of ht://dig for multiple career web sites that share the following traits: a local URL (accessible only from the computer running ht://dig) that, for illustration, I'll refer to as http://localhost. http://localhost itself contains two or more subdirectories with integers as directory names -- for example, http://localhost/1 and http://localhost/2. The contents of http://localhost/1 are all HTML documents that change constantly. The contents of http://localhost/2 (as well as http://localhost/3 and beyond) are PDF, text, and RTF documents. Those documents themselves never change; new documents are occasionally added, but for all intents and purposes documents are never removed once they're added.
Up to this point, I've had ht://dig just index http://localhost, letting it grab everything in the subdirectories below it in one pass. That strategy worked for about a year, but lately it has come close to collapsing under its own weight now that there are thousands of documents to handle. From what I can tell, the two most resource-intensive phases are:
- parsing PDF documents while htdig is running
- sorting the wordlist while htmerge is running
I originally hoped the root of the problem was that the entire index was being rebuilt every time (by using -i on htdig), but omitting -i seemed to make VERY little difference. What really raised my suspicion was finding pdf2text running repeatedly during indexing, even though NONE of the PDF documents had changed.
It looks as though I've either overlooked an important parameter somewhere or found a rough spot in htdig itself. If htdig really IS re-parsing every PDF document every time it's run -- even when the old .work files are present and -i is left off -- does anybody know WHY that's happening, and does anyone have suggestions on how to prevent it? If I can solve this problem, it will save me a LOT of the work involved in implementing Plan B (my backup plan).
PLAN B - run htdig once per subdirectory (creating a separate wordlist and docdb for each directory, and generating a new config file for each on the fly, so the databases end up with names like 1db.docdb.work, 1db.wordlist.work, 2db.docdb.work, 2db.wordlist.work, etc.), then have htmerge combine them all into one big index. That way, the documents in http://localhost/1 can be reindexed frequently, the documents in http://localhost/2 (and beyond) can be reindexed only when new documents are added, and the combined index will still have everything.
Unfortunately, I've found virtually nothing in the documentation anywhere on the site that explains how to go about doing this properly, beyond the listing of the "-m" flag on the htmerge page.
My first experiment was to run htdig for each subdirectory, after creating a unique config file for each directory -- there doesn't seem to be a way to specify ${database_base} and ${database_dir} as command-line options (is there?) -- with each generated conf being just a thin wrapper along the lines of the example below. After doing that, I ended up with the following files:
1db.docdb
1db.wordlist
2db.docdb
2db.wordlist
3db.docdb
3db.wordlist
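
For reference, each generated per-directory conf is just a thin override of the database and URL attributes, roughly like this (all the paths here are specific to my setup, and the include: line assumes the version in use supports that directive -- otherwise the shared settings just get duplicated into each file):

# resconf-cns_2.conf -- generated on the fly for http://localhost/2
# Shared settings (parsers, htsearch options, etc.) live in a common conf.
include:        /www/htdig/conf/resconf-cns_common.conf
database_dir:   /www/htdig/db
database_base:  ${database_dir}/2db
start_url:      http://localhost/2
limit_urls_to:  http://localhost/2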
I made copies of the 1db.* files as alldb.* (because there's another conf file for the phantom combined database, with ${database_base}=all), then launched htmerge twice:
/www/htdig/bin/htmerge -c /www/htdig/conf/resconf-cns_all.conf -m /www/htdig/conf/resconf-cns_2.conf
/www/htdig/bin/htmerge -c /www/htdig/conf/resconf-cns_all.conf -m /www/htdig/conf/resconf-cns_3.conf
Finally, I moved all the all.* files to the names the conf file used by htsearch expects: db.docdb, db.wordlist, etc.
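
Scripted, the whole cycle looks roughly like this (the database directory and the exact seed/combined file names below are placeholders for my setup -- adjust them to whatever ${database_dir} and ${database_base} actually produce -- and the script assumes the -m conf's databases get merged into the -c conf's, which matches what I'm seeing):

#!/bin/sh
# Rough sketch of the per-directory index-and-merge cycle.
# Paths, conf names, and db file names are placeholders for my setup.
BIN=/www/htdig/bin
CONF=/www/htdig/conf
DB=/www/htdig/db            # wherever ${database_dir} points

# 1. Index each numbered subdirectory with its own conf file
#    (add -i for a from-scratch rebuild).
for n in 1 2 3; do
    $BIN/htdig -c $CONF/resconf-cns_$n.conf
done

# 2. Seed the combined database (base name "all" in my setup) from directory 1.
cp $DB/1db.docdb    $DB/all.docdb
cp $DB/1db.wordlist $DB/all.wordlist

# 3. Merge the remaining per-directory databases into the combined one.
for n in 2 3; do
    $BIN/htmerge -c $CONF/resconf-cns_all.conf -m $CONF/resconf-cns_$n.conf
done

# 4. Move the combined files to the names htsearch's conf expects.
mv $DB/all.docdb    $DB/db.docdb
mv $DB/all.wordlist $DB/db.wordlist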
It seems to work, but I have a few questions:
- Would I get better reindexing performance if I were to copy the all.* files rather than move them? The impression I've gotten from the site is that htmerge works from scratch every time it runs (hence the lack of its own -i option), so I might as well move them.
- Is ht://dig using its own sort program, or is there perhaps a better one to use in its place? Would compiling a new copy of the sort program with all the Pentium III optimizations available (vs. running the 386-compatible one that presumably came with Red Hat 7) make an appreciable difference?
- If multiple instances of htdig and/or htmerge are running simultaneously (indexing different sites, using different conf files with different values for ${database_dir}), does anything special need to be done to keep them from trampling each other's temp files, or does htdig make sure its temp file names are unique?
- Is there anything in particular I should be aware of when merging LOTS of databases with htmerge? Perhaps small bits of corruption that are known to occur and that could snowball into something significant if htmerge ends up getting called a hundred or more times (with the same -c value but a different -m value each time)? Would I be making things better, worse, or just needlessly complex if I handled the merges like a binary tree -- merging each "odd" database with its "even" neighbor, then repeating the process on the results until the final two databases (each containing 50 of the originals) were merged into a single one -- so that no single database gets merged into over and over? That way every chunk of data would pass through at most seven or so merges (if I've counted right), versus merging databases 2 through 100 into database 1, one at a time. (See the sketch below for what I mean.)
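
To make the tree idea concrete, here's the kind of loop I have in mind -- purely a sketch, with dblist standing in for a file that lists the per-directory conf files, and again assuming the -m conf's databases get merged into the -c conf's:

#!/bin/sh
# Tree-style merge sketch: pair the databases up and merge each pair,
# halving the number of live databases on every pass, so each database
# is merged into at most about log2(N) times instead of up to N-1 times.
set -- $(cat dblist)        # e.g. resconf-cns_1.conf ... resconf-cns_100.conf
while [ $# -gt 1 ]; do
    next=""
    while [ $# -ge 2 ]; do
        first=$1; second=$2; shift 2
        # merge the second conf's databases into the first conf's
        /www/htdig/bin/htmerge -c $first -m $second
        next="$next $first"
    done
    # an unpaired leftover just carries over to the next pass
    [ $# -eq 1 ] && { next="$next $1"; shift; }
    set -- $next
done
# $1 is now the conf whose databases contain everything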
Finally, in case you missed the questions higher up on the page...
- Is there a way to specify ${database_dir} and ${database_base} from the command line itself (to avoid potentially having to create hundreds of conf files for each site, each one differing only in those two values)?
- Is it normal for htdig to launch pdf2text for EVERY file EVERY time htdig runs -- even when the old .work files are present and -i isn't specified?
Thanks!
Jeff

