It turns out this is not a scrapy bug at all (sorry about that). I should have read the <base> tag in the <head> of the page and used its href as the base for urljoin, instead of response.url.

Cheers!
Michele C
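For anyone who lands on this thread later, here is a minimal sketch of the fix, using only the standard library (Python 3, so urllib.parse rather than the urlparse module from my original post). The <base> value in the sample markup is an assumption on my part, not copied from the real page, and BaseHrefParser and effective_base_url are names I made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefParser(HTMLParser):
    """Records the href of the first <base> tag encountered (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base> counts, which matches what browsers do.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def effective_base_url(page_url, html):
    """Return the URL that relative links should be resolved against."""
    parser = BaseHrefParser()
    parser.feed(html)
    if parser.base_href:
        # The <base> href may itself be relative, so resolve it against the page URL.
        return urljoin(page_url, parser.base_href)
    return page_url


page_url = "http://www.gilacountyaz.gov/government/assessor/index.php"
# Assumed markup: the real page's <base> value is not quoted in this thread.
html = '<html><head><base href="http://www.gilacountyaz.gov/"></head><body></body></html>'
link = "government/assessor/address_change.php"

naive = urljoin(page_url, link)                               # what my spider was doing
resolved = urljoin(effective_base_url(page_url, html), link)  # what browsers do

print(naive)
print(resolved)
```

In a spider you probably do not need your own parser: as far as I know, scrapy.utils.response.get_base_url(response) does this lookup for you, and its result can be passed to urljoin in place of response.url.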
On Wednesday, November 5, 2014 at 14:41:10 UTC-5, Travis Leleu wrote:
>
> If you are still using a LinkExtractor, there is a process_links arg that
> allows you to process a link before it continues on the pipeline. Perhaps
> you could modify any URL w/o leading slash to include that.
>
> Otherwise, I'd look into the scrapy call stack to figure out where you
> need to add a middleware layer to do this. You may also be able to
> override canonicalize_url to handle this edge case (just don't forget to
> call parent's method as well to retain the canonicalization component)
>
> On Wed, Nov 5, 2014 at 11:25 AM, Michele Coscia <[email protected]> wrote:
>
>> Hi all,
>>
>> in this webpage http://www.gilacountyaz.gov/government/assessor/index.php
>> all links in the left sidebar are absolute paths, but the webmaster did not
>> include the leading slash. Therefore, scrapy does not detect duplicate
>> pages and goes on and on crawling the same pages forever. This is because
>> urljoin is not smart enough to detect this:
>>
>> >>> import urlparse
>> >>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
>> >>> b = "government/assessor/address_change.php"
>> >>> urlparse.urljoin(a, b)
>> 'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'
>>
>> However both Firefox and Chrome are able to build the url correctly.
>> How do I fix this?
>>
>> Thanks!
>> Michele C
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
