Hello,

I've been using htdig for a little while, and I've recently been alerted
to an indexing "issue" that I'm hoping someone might be able to help with.
We have a list of about 800 sites we need to index.  If I run a small
subset (10 or 20 sites), they index fine.  However, when I index the full
800, I find that htdig no longer stays on the site - that is, it seems to
crawl off-site links as well (which is definitely a problem for us).

I have "limit_urls_to: ${start_url}" set in both my htdig.conf and a
separate scitechdb.conf (science & technology database) file.  I'm
actually using a multidig configuration (we index a few other small sites
on the same server as different databases), which otherwise works well.
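
For reference, the relevant part of the configuration looks roughly like
the sketch below.  The two URLs are placeholders standing in for the real
list of about 800 sites; the limit_urls_to line is the setting described
above.

        # placeholder start URLs (the real list has about 800 entries)
        start_url:      http://www.example.org/ http://www.example.net/
        # restrict the crawl to URLs matching the start URLs
        limit_urls_to:  ${start_url}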

I'm wondering if there is an issue with indexing large amounts of data - a
small index of 10 sites is:
        db.docdb:        81 MB
        db.docs.index:  693 KB
        db.words.db:     89 MB
while the index of 800 sites is:
        db.docdb:       1.76 GB
        db.docs.index:    34 MB
        db.words.db:    1.57 GB
or perhaps some other bug?  I've tested it several times, and the system
does NOT go off site on any of the sites in the small subset.  Any help
would be appreciated.

BTW, I'm not subscribed to the list, so if you would CC me on replies, I'd
be very grateful.  Thanks so much.

-- 
Geoff Silver                                    <geoff at uslinux dot net>
"If Bill Gates had a nickel for every time Windows crashed...
        Oh wait, he does"


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
