According to Patrick:
> Could someone give me some insight as to where I can begin
> to write a patch that will allow the ability to "remove all
> query string (anything after the '?') variables"?
> 
> My initial guess is within Retriever.cc, in the Retriever::Initial
> function, immediately after:
> 
> url = u.get();
> 
> ..then, if a certain config setting is true, perform something 
> similar to the Perl equivalent of:
> 
> url =~ s/\?.*$//;
> 
> Any help is appreciated.

Retriever::Initial only handles the initial URLs, i.e. those in start_url
or URLs already in the database when htdig runs an update.  It won't handle
newly followed hrefs.  To get them all, maybe URL.cc is the best place
for this.  It already strips off the "#sectionname" portion of a URL,
in URL::URL() and URL::parse().
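
As a rough sketch of the stripping itself (the equivalent of the Perl
substitution you quoted), something like the following could be a starting
point.  It is standalone and uses std::string rather than htdig's own String
class, and strip_query_string is a hypothetical name, not an existing htdig
function, so it would need adapting before it could go into URL.cc:

```cpp
#include <string>

// Hypothetical helper, not part of ht://Dig: returns a copy of the URL
// with the first '?' and everything after it removed.  A URL without a
// '?' is returned unchanged.
std::string strip_query_string(const std::string &url)
{
    std::string::size_type pos = url.find('?');
    if (pos == std::string::npos)
        return url;                 // no query string present
    return url.substr(0, pos);      // keep only the part before '?'
}
```

In a real patch this would presumably be guarded by a new config attribute,
so that the default behaviour stays as it is now.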

You may want to take a step back, though, and ask yourself why you want to
do this.  If your goal is simply to avoid indexing any URL with a query
string, you can just add a ? to the exclude_urls attribute definition
in your htdig.conf.  Stripping off the query string is a pretty drastic
step, as you'll still end up indexing all your CGI scripts (unless
excluded by exclude_urls), but calling them all without a query string.
It will also prevent you from being able to index any "virtual tree"
of documents accessed by a query string, if you ever need to do this.
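
For the exclude_urls approach, the line in htdig.conf might look something
like this.  The first two patterns are what I'd expect from a stock
configuration, but that is an assumption, so check what your own htdig.conf
already has and just append the ? to it:

```
exclude_urls:	/cgi-bin/ .cgi ?
```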

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
