On Tue, 21 Sep 2004, Aaron wrote:
I gave the -vvv flags, and ran the rundig against the first 100 documents and saw everything finish. It finished in roughly 1 minute. Based on this, I am guessing 500,000 documents / 100 documents per minute = 5000 minutes. So this will take roughly 3 and a half days to index? If that is the case, does that mean each time I update the index via cron it will take this long, or will it be differential?
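For reference, the arithmetic behind that estimate can be written out as a small sketch. It assumes indexing time scales linearly with document count, which is only a rough assumption, and the function name is made up for illustration.

```python
def estimate_index_minutes(total_docs, sample_docs, sample_minutes):
    """Linearly extrapolate indexing time from a timed sample run.

    Assumes the per-document cost stays constant as the index grows,
    which may not hold for a large collection.
    """
    docs_per_minute = sample_docs / sample_minutes
    return total_docs / docs_per_minute


# Figures from the message: 100 documents took about 1 minute.
minutes = estimate_index_minutes(500_000, 100, 1.0)
days = minutes / (60 * 24)
print(f"{minutes:.0f} minutes, about {days:.1f} days")
```

Treat the result as a lower bound at best; a sample of 100 documents says little about how the run behaves once the databases get large.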
That depends somewhat on the nature of the pages and how they are served up. If you run htdig without the -i (initial) option, it will try to index only pages that have changed since the last index was created. It decides whether a page has changed based on the date returned by the web server. If the page hasn't changed and the web server returns a correct modification date, then no attempt should be made to reindex that page.
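The decision described above can be illustrated with a minimal sketch. This is not ht://Dig's actual code; it only shows the general idea of comparing a server-supplied modification date against the time of the previous run. The function name and the fallback behaviour (refetch when the date is missing or unparseable) are assumptions made for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_reindex(last_modified_header, last_indexed):
    """Return True if the page should be fetched and indexed again.

    last_modified_header: value of the HTTP Last-Modified header, or None.
    last_indexed: timezone-aware datetime of the previous indexing run.
    """
    if not last_modified_header:
        # No date from the server, so the page cannot be shown unchanged.
        return True
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        # Unparseable date: be conservative and refetch.
        return True
    if modified.tzinfo is None:
        # Treat a date without zone information as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_indexed


last_run = datetime(2004, 9, 20, tzinfo=timezone.utc)
print(needs_reindex("Tue, 21 Sep 2004 10:00:00 GMT", last_run))
print(needs_reindex("Sun, 19 Sep 2004 10:00:00 GMT", last_run))
```

The practical point is that dynamically generated pages, or servers that always report the current time, will look changed on every run, so an update run can end up costing as much as a full one.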
Also, with the full 500,000 documents, with the -vvv flags, I haven't seen any progress after 10 hours.
I would take that as a bad sign. You might try running top or something similar so that you can keep an eye on processor and memory utilization.
Also be aware that there is a 2 GB limit on the size of some of the files involved with index creation. Exceeding that limit will kill your indexing run.
How can I guess how big the files will get?
You just need to watch them and see what happens. There are way too many factors involved to make even an educated guess based on the number of documents.
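If you would rather not watch the files by hand, a small script along these lines could do it for you. This is only a sketch: the function name, the 80% warning threshold, and the directory path in the usage comment are all assumptions, so point it at wherever your database files actually live.

```python
import os

LIMIT_BYTES = 2 * 1024**3  # the 2 GB ceiling mentioned above


def files_near_limit(db_dir, warn_fraction=0.8, limit_bytes=LIMIT_BYTES):
    """Return (name, size) pairs for files at or above warn_fraction of the limit."""
    threshold = limit_bytes * warn_fraction
    flagged = []
    for name in sorted(os.listdir(db_dir)):
        path = os.path.join(db_dir, name)
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size >= threshold:
                flagged.append((name, size))
    return flagged


# Example usage (hypothetical path, adjust to your installation):
# for name, size in files_near_limit("/path/to/htdig/db"):
#     print(f"{name}: {size / 1024**3:.2f} GB")
```

Running something like this from cron every few minutes during the indexing run would give you warning before a file reaches the limit.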
I suspect that you will just need to feel your way through some aspects of this project. At 500,000 documents, you are probably pushing the limits of what you can do with ht://Dig. The farthest I have taken it myself is about 300K documents in a single database set. I still had a little breathing room, but probably not enough to take it to 500K. However, the documents in that collection had an unusually large number of unique words.
Jim

