Hi all.

I have been having a problem getting htdig to build a reasonable
database for a particular site. Specifically, the combined database
sizes were ending up on the order of 3 to 4 times larger than the entire
site. I believe I found the cause of this problem, and while not
technically a problem with htdig, I thought I would pass the information
on in the hope that it will save someone else a week of building broken
databases and reading debug output :)

While examining htdig's output using the -vvv option, I discovered that
htdig was creating a lot of broken GET requests. Toward the end, they
were looking something like...

GET /index.html/queries/fyi/hosts/hosts/fyi/queries/hosts/queries/qpost.htm
..
GET /index.html/queries/fyi/hosts/hosts/fyi/queries/fyi/queries/qpost.htm
..
GET /index.html/queries/fyi/hosts/hosts/fyi/queries/queries/queries/qpost.htm
..

From the server's point of view, everything in the URL after
index.html is garbage, so it returns the same page (index.html) over
and over. Each time, the relative links in that page resolve to new,
unique URLs, which lead htdig to fetch the same index.html file yet
again.

As far as I can tell, the meltdown originates with a small syntax error
in one user's page. This user had a link that looks like...

<a href="../../index.html/">

This resolved to a new, unique URL,
http://www.########.org/index.html/, and htdig went ahead and processed
it as such. When relative links were found in the index.html file, new
URLs were generated from them, such as
http://www.########.org/index.html/queries/qpost.html
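
For anyone who wants to reproduce the resolution step, here is a minimal
sketch using Python's standard urljoin. The host www.example.org and the
path of the user's page are placeholders I made up, since the real ones
are masked above. It only shows how standard relative-URL resolution
behaves, not htdig's own code.

```python
from urllib.parse import urljoin

# Placeholder values: the real host and the user's page path are not known.
user_page = "http://www.example.org/users/someone/page.html"
bad_link = "../../index.html/"          # the link with the stray trailing slash

# The trailing slash makes "index.html" look like a directory.
bogus_base = urljoin(user_page, bad_link)

# A relative link found in index.html is then resolved inside that "directory".
next_url = urljoin(bogus_base, "queries/qpost.html")

print(bogus_base)
print(next_url)
```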

When htdig did the GET on this specific URL, the server of course
returned index.html instead of qpost.html. htdig, however, treated the
relative links in index.html as if they were relative to
http://www.########.org/index.html/queries, which generated URLs like
http://www.########.org/index.html/queries/fyi/tngnfind.html. This
process continued, generating longer and longer bogus URLs. I am not
sure what finally broke the cycle.


I am in the process of trying to crawl the site again with .html/ and
.htm/ added to the exclude_urls attribute. On the off chance that this
doesn't work, does anyone have other ideas about how to avoid this
problem? Well, short of validating thousands of pages contributed by
dozens of people? ;)
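
For what it's worth, here is a small sketch of what I expect the
exclusion to do. It assumes exclude_urls rejects any URL that contains
one of the listed patterns as a plain substring, which is my reading of
the documentation and not something I have verified in the source. The
is_excluded helper and the example.org URLs are made up for
illustration.

```python
# Patterns I am adding to exclude_urls.
EXCLUDE_PATTERNS = [".html/", ".htm/"]


def is_excluded(url, patterns):
    """Assumed behaviour: reject a URL if it contains any pattern as a substring."""
    return any(pattern in url for pattern in patterns)


bogus = "http://www.example.org/index.html/queries/qpost.html"
legit = "http://www.example.org/queries/qpost.html"

print(is_excluded(bogus, EXCLUDE_PATTERNS))
print(is_excluded(legit, EXCLUDE_PATTERNS))
```

If that assumption holds, a normal URL that merely ends in .html is left
alone, because the patterns only match when a slash follows the
extension.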

Thanks.

Jim Cole