I was testing htdig-3.2.0b4-072201 on a client's web site that relies heavily on
query strings for page traversal and tracking. Every link with a query string
appeared to trigger the following rejection in my htdig (-vvv) output:
0:2:0:http://tsc.local.net/: Making HTTP request on http://tsc.local.net/
title: The Cycling Store
image: http://tsc.local.net/images/t11gr8d.jpg
href: http://tsc.local.net/cranks.htm?394 ()
Rejected: item in bad query list
url rejected: (level 1)http://tsc.local.net/cranks.htm?394
--
The start_url is http://tsc.local.net, and that URL is also in the limit_urls_to field.
--
The workaround? One might think to just blank out bad_querystr: in the config file,
but that didn't work. If I instead put arbitrary text that cannot match any of the
site's URLs in the bad_querystr field (e.g. bad_querystr: willywonka), it works
just fine.
I think the problem is in Retriever.cc. If bad_querystr is blank or not defined, the
check matches ANY URL with a query string. While studying that section of code, I also
noticed that the bad_querystr regex is matched against the entire URL rather than
just the CGI query string. So I guess this problem is twofold.
--
Summary:
bad_querystr -- when it's blank, the Retriever.cc regex matches ANY URL that has a
query string (any URL that contains a "?"), so those URLs are rejected.
The bad_querystr matching code in Retriever.cc compares bad_querystr with the
entire URL, not just the query string. It should be looking past the "?" only.
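To make the two points concrete, here is a rough sketch of the behaviour I would
expect. It is not taken from Retriever.cc and I have not tried it against the htdig
source: the function name and its parameters are made up for illustration, and it
uses std::regex rather than whatever regex classes htdig uses internally.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not htdig code): returns true when a URL should be
// rejected because its query string matches one of the bad_querystr patterns.
bool rejected_by_bad_querystr(const std::string &url,
                              const std::vector<std::string> &patterns)
{
    // Point 1: a blank or undefined bad_querystr must reject nothing.
    if (patterns.empty())
        return false;

    // Point 2: only the part after the "?" is tested, never the whole URL.
    const std::string::size_type q = url.find('?');
    if (q == std::string::npos)
        return false;               // no query string, nothing to test
    const std::string query = url.substr(q + 1);

    for (const std::string &pattern : patterns) {
        if (pattern.empty())
            continue;               // an empty regex would match everything
        if (std::regex_search(query, std::regex(pattern)))
            return true;
    }
    return false;
}
```

With an empty pattern list, http://tsc.local.net/cranks.htm?394 is accepted, and a
pattern only rejects a URL when it matches the text after the "?".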
--
[.kate]