If you are still using a LinkExtractor, there is a process_links argument (on the CrawlSpider Rule that wraps the extractor) that lets you process the extracted links before they continue down the pipeline. Perhaps you could use it to rewrite any URL that is missing the leading slash.
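A rough sketch of the process_links idea, not tested against that site. I am assuming that by the time the hook sees the links they have already been joined against the page URL, so rather than adding a slash it collapses a directory prefix that shows up twice in a row. The names collapse_repeated_prefix and fix_links are made up for illustration, and this would wrongly rewrite a site that legitimately has paths like /a/a/page.

```python
from urllib.parse import urlsplit, urlunsplit


def collapse_repeated_prefix(url):
    """Turn /x/y/x/y/page into /x/y/page; leave every other URL unchanged."""
    parts = urlsplit(url)
    # Drop the empty string that precedes the leading slash of the path.
    segments = parts.path.split("/")[1:]
    # Try the longest candidate prefix first.
    for k in range(len(segments) // 2, 0, -1):
        prefix = segments[:k]
        if all(prefix) and prefix == segments[k:2 * k]:
            new_path = "/" + "/".join(segments[k:])
            return urlunsplit(parts._replace(path=new_path))
    return url


def fix_links(links):
    """Hypothetical process_links hook: expects objects with a .url attribute."""
    for link in links:
        link.url = collapse_repeated_prefix(link.url)
    return links
```

You would then pass process_links=fix_links to the Rule that wraps your LinkExtractor. Since the rewritten URLs match the pages already seen, the duplicate filter should be able to stop the loop.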
Otherwise, I'd look into the Scrapy call stack to figure out where you need to add a middleware layer to do this. You may also be able to override canonicalize_url to handle this edge case (just don't forget to call the parent's method as well, so that you keep the canonicalization step).

On Wed, Nov 5, 2014 at 11:25 AM, Michele Coscia <[email protected]> wrote:
> Hi all,
>
> in this webpage http://www.gilacountyaz.gov/government/assessor/index.php
> all links in the left sidebar are absolute paths, but the webmaster did not
> include the leading slash. Therefore, scrapy does not detect duplicate
> pages and goes on and on crawling the same pages forever. This is because
> urljoin is not smart enough to detect this:
>
> >>> import urlparse
> >>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
> >>> b = "government/assessor/address_change.php"
> >>> urlparse.urljoin(a, b)
> 'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'
>
> However both Firefox and Chrome are able to build the url correctly.
> How do I fix this?
>
> Thanks!
> Michele C
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
