I was testing htdig-3.2.0b4-072201 on a client's web site that relies heavily on 
query strings for page traversal and tracking.  Every link with a query string 
appeared to trigger the following rejection in my htdig (-vvv) output:

0:2:0:http://tsc.local.net/: Making HTTP request on http://tsc.local.net/
title: The Cycling Store
image: http://tsc.local.net/images/t11gr8d.jpg
href: http://tsc.local.net/cranks.htm?394 ()

   Rejected: item in bad query list
url rejected: (level 1)http://tsc.local.net/cranks.htm?394

--

The start_url is http://tsc.local.net, and the same URL is in the limit_urls_to field.

--

The workaround?  One might think to just blank out bad_querystr: in the config file, 
but that didn't work.  If I put arbitrary text that cannot match any URL on the site 
in the bad_querystr field (ex: bad_querystr: willywonka), the links with query 
strings are accepted and it works just fine. 
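
For reference, the relevant part of my config looks roughly like this with the workaround in place.  The three attribute names and values are the ones mentioned above; "willywonka" is just a placeholder that should never occur in a real URL on the site, and the rest of the config is omitted.

```
start_url:      http://tsc.local.net
limit_urls_to:  http://tsc.local.net
bad_querystr:   willywonka
```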

I think the problem is in Retriever.cc.  If bad_querystr is empty, it matches ANY 
URL with a query string.  And in fact, in studying that section of code, I noticed 
that the bad_querystr regex matching isn't applied just to the CGI query string, but 
rather to the entire URL.  So I guess this problem is twofold.

--

Summary:

  bad_querystr -- when it's blank, the Retriever.cc regex matches ANY URL with a query 
string (any URL that contains a "?"), so those URLs are rejected.

  The bad_querystr matching code in Retriever.cc compares bad_querystr with the 
entire URL, not just the query string.  It should only look at what follows the "?".
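
To make the two points above concrete, here is a minimal sketch of the behaviour I would expect.  This is NOT the actual Retriever.cc code: query_part() and has_bad_query() are hypothetical names I made up, and I use plain substring matching on std::string instead of htdig's own string and regex classes, purely for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from htdig): returns the text after the first '?'
// in a URL, or an empty string when the URL has no query string.
std::string query_part(const std::string& url) {
    const std::string::size_type pos = url.find('?');
    return pos == std::string::npos ? std::string() : url.substr(pos + 1);
}

// Hypothetical check (not from htdig): true only when a non-empty pattern
// occurs inside the query string. Substring matching stands in for the
// regex matching that the real code performs.
bool has_bad_query(const std::string& url,
                   const std::vector<std::string>& bad_patterns) {
    const std::string query = query_part(url);
    if (query.empty()) {
        return false;  // nothing after the '?', so nothing to reject
    }
    for (const std::string& pattern : bad_patterns) {
        if (pattern.empty()) {
            continue;  // a blank bad_querystr entry must not match everything
        }
        if (query.find(pattern) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

With this logic, a blank bad_querystr would let http://tsc.local.net/cranks.htm?394 through, and a pattern such as "cranks" would not reject it either, because "cranks" appears only in the path and not after the "?".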

--

[.kate]

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-dev
