According to Gabriele Bartolini:
>     I was attempting to use the url_rewrite_rules attribute, because I need 
> it in a special case.
> 
>     While trying it, I noticed something, and if possible I would 
> like an explanation from you (particularly from Gilles and Geoff, I 
> guess).
> 
>     Is there a reason why URLs belonging to the start list are neither 
> normalized nor rewritten? Just wondering ... Otherwise we should add these 
> two lines to the Initial method of the Retriever class:
> 
>     u.normalize();
>     u.rewrite();
> 
> after the 'URL u(tokens[i]);' row.

I'm guessing it was just an oversight, or an assumption that the
URLs you feed it via start_url would already be in the form you want.
I don't see a problem with the modification you suggest, with one very
important condition: the rewriting must not be applied more than once to
a given URL.  So, if I'm not mistaken, the URLs from db.docdb and those
from db.log have already been normalized and rewritten, and only the
URLs from start_url still need to be processed.  I think you should be
safe if you do the rewrite only when from == 1.
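
To make the condition concrete, here is a rough, self-contained sketch of
the guarded call pattern.  It is not the actual Retriever code: the URL
class below is a hypothetical stand-in with toy normalize() and rewrite()
bodies, initial() is an illustrative free function rather than the real
Initial method, and the "/mirror" rule is an invented example of a rewrite
that is not safe to apply twice.  It also assumes, as above, that
from == 1 means the list came from start_url.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the URL class; only the call pattern matters.
class URL {
public:
    explicit URL(const std::string &s) : url_(s) {}

    // Toy normalization (assumption): lowercase the scheme and host.
    void normalize() {
        std::string::size_type scheme_end = url_.find("://");
        if (scheme_end == std::string::npos) return;
        std::string::size_type host_end = url_.find('/', scheme_end + 3);
        if (host_end == std::string::npos) host_end = url_.size();
        for (std::string::size_type i = 0; i < host_end; ++i)
            url_[i] = static_cast<char>(
                std::tolower(static_cast<unsigned char>(url_[i])));
    }

    // Toy rewrite rule (assumption): insert "/mirror" right after the host.
    // Applying it twice yields "/mirror/mirror", hence the guard below.
    void rewrite() {
        std::string::size_type scheme_end = url_.find("://");
        if (scheme_end == std::string::npos) return;
        std::string::size_type host_end = url_.find('/', scheme_end + 3);
        if (host_end == std::string::npos) host_end = url_.size();
        url_.insert(host_end, "/mirror");
    }

    const std::string &get() const { return url_; }

private:
    std::string url_;
};

// Illustrative version of the loop: normalize and rewrite only the URLs
// that come from start_url (from == 1); URLs from other sources are
// assumed to have been processed already and are passed through as-is.
std::vector<std::string> initial(const std::vector<std::string> &tokens,
                                 int from) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        URL u(tokens[i]);
        if (from == 1) {
            u.normalize();
            u.rewrite();
        }
        out.push_back(u.get());
    }
    return out;
}
```

With this toy rule, feeding an already rewritten URL back in with from
set to anything other than 1 leaves it unchanged, whereas feeding it in
again with from == 1 would rewrite it a second time.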

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
