Hi,
I'm currently trying to optimize the performance of ht://dig for multiple career web sites that share the following traits: a local URL (accessible only from the computer running ht://dig) that, for illustration, I'll refer to as http://localhost. http://localhost itself contains two or more subdirectories with integers as directory names -- for example, http://localhost/1 and http://localhost/2. The contents of http://localhost/1 are all HTML documents that change constantly. The contents of http://localhost/2 (as well as http://localhost/3 and beyond) are PDF, text, and RTF documents. Those documents themselves never change; new documents are occasionally added, but for all intents and purposes documents are never removed once they're added.
Up to this point, I've had ht://dig just index http://localhost, letting it grab everything in the subdirectories below it in one pass. That strategy worked for about a year, but lately it has come close to collapsing under its own weight now that there are thousands of documents to handle. From what I can tell, the two most resource-intensive phases are:
- parsing PDF documents while htdig is running
- sorting the wordlist while htmerge is running
I originally hoped the root of the problem was that the entire index was being rebuilt every time (by using -i on htdig), but omitting -i seemed to make VERY little difference. What really raised my suspicion was finding pdf2text running repeatedly during indexing, even though NONE of the PDF documents had changed.
It looks as though I've either overlooked an important parameter somewhere or found a rough spot in htdig itself. If htdig really IS re-parsing every PDF document every time it's run -- even when the old .work files are present and -i is left off -- does anybody know WHY that's happening, and does anyone have suggestions on how to prevent it? If I can solve this problem, it will save me a LOT of the work involved in implementing Plan B (my backup plan).
PLAN B - run htdig once per subdirectory (creating a separate wordlist and docdb for each directory, and generating a new config file for each on the fly, so the databases end up with names like 1db.docdb.work, 1db.wordlist.work, 2db.docdb.work, 2db.wordlist.work, etc.), then have htmerge combine them all into one big index. That way, the documents in http://localhost/1 can be reindexed frequently, the documents in http://localhost/2 (and beyond) can be reindexed only when new documents are added, and the combined index will still have everything.
Unfortunately, I've found virtually nothing in the documentation anywhere on the site that explains how to go about doing this properly, beyond the listing of the "-m" flag on the htmerge page.
My first experiment was to run htdig for each subdirectory, after creating a unique config file for each directory -- there doesn't seem to be a way to specify ${database_base} and ${database_dir} as command-line options (is there?) -- with each generated conf being just a thin wrapper along the lines of the example below. After doing that, I ended up with the following files:
1db.docdb
1db.wordlist
2db.docdb
2db.wordlist
3db.docdb
3db.wordlist
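
For reference, each generated per-directory conf is just a thin override of the database and URL attributes, roughly like this (all the paths here are specific to my setup, and the include: line assumes the version in use supports that directive -- otherwise the shared settings just get duplicated into each file):

# resconf-cns_2.conf -- generated on the fly for http://localhost/2
# Shared settings (parsers, htsearch options, etc.) live in a common conf.
include:        /www/htdig/conf/resconf-cns_common.conf
database_dir:   /www/htdig/db
database_base:  ${database_dir}/2db
start_url:      http://localhost/2
limit_urls_to:  http://localhost/2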
I made copies of the 1db.* files as alldb.* (because there's another conf file for the phantom combined database, with ${database_base}=all), then launched htmerge twice:
/www/htdig/bin/htmerge -c /www/htdig/conf/resconf-cns_all.conf -m /www/htdig/conf/resconf-cns_2.conf
/www/htdig/bin/htmerge -c /www/htdig/conf/resconf-cns_all.conf -m /www/htdig/conf/resconf-cns_3.conf
Finally, I moved all the all.* files to the names the conf file used by htsearch expects: db.docdb, db.wordlist, etc.
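
Scripted, the whole cycle looks roughly like this (the database directory and the exact seed/combined file names below are placeholders for my setup -- adjust them to whatever ${database_dir} and ${database_base} actually produce -- and the script assumes the -m conf's databases get merged into the -c conf's, which matches what I'm seeing):

#!/bin/sh
# Rough sketch of the per-directory index-and-merge cycle.
# Paths, conf names, and db file names are placeholders for my setup.
BIN=/www/htdig/bin
CONF=/www/htdig/conf
DB=/www/htdig/db            # wherever ${database_dir} points

# 1. Index each numbered subdirectory with its own conf file
#    (add -i for a from-scratch rebuild).
for n in 1 2 3; do
    $BIN/htdig -c $CONF/resconf-cns_$n.conf
done

# 2. Seed the combined database (base name "all" in my setup) from directory 1.
cp $DB/1db.docdb    $DB/all.docdb
cp $DB/1db.wordlist $DB/all.wordlist

# 3. Merge the remaining per-directory databases into the combined one.
for n in 2 3; do
    $BIN/htmerge -c $CONF/resconf-cns_all.conf -m $CONF/resconf-cns_$n.conf
done

# 4. Move the combined files to the names htsearch's conf expects.
mv $DB/all.docdb    $DB/db.docdb
mv $DB/all.wordlist $DB/db.wordlist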
It seems to work, but I have a few questions:
- Would I get better reindexing performance if I were to copy the all.* files rather than move them? The impression I've gotten from the site is that htmerge works from scratch every time it runs (hence the lack of its own -i option), so I might as well move them.
- Is ht://dig using its own sort program, or is there perhaps a better one to use in its place? Would compiling a new copy of the sort program with all the Pentium III optimizations available (vs. running the 386-compatible one that presumably came with Red Hat 7) make an appreciable difference?
- If multiple instances of htdig and/or htmerge are running simultaneously (indexing different sites, using different conf files with different values for ${database_dir}), does anything special need to be done to keep them from trampling each other's temp files, or does htdig make sure its temp file names are unique?
- Is there anything in particular I should be aware of when merging LOTS of databases with htmerge? Perhaps small bits of corruption that are known to occur and that could snowball into something significant if htmerge ends up getting called a hundred or more times (with the same -c value but a different -m value each time)? Would I be making things better, worse, or just needlessly complex if I handled the merges like a binary tree -- merging each "odd" database with its "even" neighbor, then repeating the process on the results until the final two databases (each containing 50 of the originals) were merged into a single one -- so that no single database gets merged into over and over? That way every chunk of data would pass through at most seven or so merges (if I've counted right), versus merging databases 2 through 100 into database 1, one at a time. (See the sketch below for what I mean.)
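
To make the tree idea concrete, here's the kind of loop I have in mind -- purely a sketch, with dblist standing in for a file that lists the per-directory conf files, and again assuming the -m conf's databases get merged into the -c conf's:

#!/bin/sh
# Tree-style merge sketch: pair the databases up and merge each pair,
# halving the number of live databases on every pass, so each database
# is merged into at most about log2(N) times instead of up to N-1 times.
set -- $(cat dblist)        # e.g. resconf-cns_1.conf ... resconf-cns_100.conf
while [ $# -gt 1 ]; do
    next=""
    while [ $# -ge 2 ]; do
        first=$1; second=$2; shift 2
        # merge the second conf's databases into the first conf's
        /www/htdig/bin/htmerge -c $first -m $second
        next="$next $first"
    done
    # an unpaired leftover just carries over to the next pass
    [ $# -eq 1 ] && { next="$next $1"; shift; }
    set -- $next
done
# $1 is now the conf whose databases contain everything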
Finally, in case you missed the questions higher up on the page...
- Is there a way to specify ${database_dir} and ${database_base} from the command line itself (to avoid potentially having to create hundreds of conf files for each site, each one differing only in those two values)?
- Is it normal for htdig to launch pdf2text for EVERY file EVERY time htdig runs -- even when the old .work files are present and -i isn't specified?
Thanks!
Jeff

