If you are still using a LinkExtractor, the Rule that wraps it takes a
process_links argument that lets you process the extracted links before
they continue through the pipeline.  Perhaps you could use it to add the
leading slash to any URL that is missing one.
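
In case it helps, here is a minimal sketch of the leading-slash idea in
Python 3 (where the urlparse module became urllib.parse).
join_root_relative is a made-up helper, not a Scrapy API, and it assumes
every bare relative path on this site is really meant to be root-relative,
which would break genuinely relative links.  It also needs the raw href
before it has been joined to the page URL, so where exactly to hook it in
is left open.

```python
from urllib.parse import urljoin, urlsplit


def join_root_relative(base_url, href):
    """Join href to base_url, treating bare relative paths as root-relative.

    Hypothetical helper: only appropriate for a site known to omit the
    leading slash on what are really absolute paths.
    """
    parts = urlsplit(href)
    # Full URLs, protocol-relative URLs, root-relative paths, fragment-only
    # or query-only links, and explicit ./ or ../ paths follow normal rules.
    if (parts.scheme or parts.netloc or not parts.path
            or href.startswith(("/", "."))):
        return urljoin(base_url, href)
    # Assumption: a bare "dir/page.php" was meant to be "/dir/page.php".
    return urljoin(base_url, "/" + href)
```

With the page from the original message as the base, the sidebar link
"government/assessor/address_change.php" should then resolve to
http://www.gilacountyaz.gov/government/assessor/address_change.php.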

Otherwise, I'd look into the Scrapy call stack to figure out where you need
to add a middleware layer to do this.  You may also be able to override
canonicalize_url to handle this edge case (just don't forget to call the
parent's method as well, so you keep the normal canonicalization).
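
For the canonicalize_url route, something along these lines might work.
It is only a sketch: collapse_repeated_prefix and canonicalize_with_fix are
hypothetical names, the stock canonicalizer is passed in as
parent_canonicalize rather than imported, and the heuristic (drop one
immediately repeated run of leading directories) is my assumption about
what the broken URLs look like.  It expects absolute URLs and could misfire
on a site that legitimately has paths like /a/a/.

```python
from urllib.parse import urlsplit, urlunsplit


def collapse_repeated_prefix(url):
    """Collapse one immediately repeated run of leading directories.

    /a/b/a/b/page becomes /a/b/page; URLs without such a run are returned
    unchanged.  Assumes an absolute URL, so the path starts with "/".
    """
    parts = urlsplit(url)
    # Drop the empty string that precedes the first "/".
    segments = parts.path.split("/")[1:]
    # Try the longest candidate run first.
    for size in range(len(segments) // 2, 0, -1):
        if segments[:size] == segments[size:2 * size]:
            fixed_path = "/" + "/".join(segments[size:])
            return urlunsplit(parts._replace(path=fixed_path))
    return url


def canonicalize_with_fix(url, parent_canonicalize):
    """Apply the site-specific fix, then defer to the stock canonicalizer."""
    return parent_canonicalize(collapse_repeated_prefix(url))
```

You would call it as canonicalize_with_fix(url, canonicalize_url), passing
in whichever canonicalize_url your Scrapy version provides, so the usual
canonicalization still runs after the fix.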

On Wed, Nov 5, 2014 at 11:25 AM, Michele Coscia <[email protected]>
wrote:

> Hi all,
>
> in this webpage http://www.gilacountyaz.gov/government/assessor/index.php
> all links in the left sidebar are absolute paths, but the webmaster did not
> include the leading slash. Therefore, scrapy does not detect duplicate
> pages and goes on and on crawling the same pages forever. This is because
> urljoin is not smart enough to detect this:
>
> >>> import urlparse
> >>> a = "http://www.gilacountyaz.gov/government/assessor/index.php";
> >>> b = "government/assessor/address_change.php"
> >>> urlparse.urljoin(a, b)
> 'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'
>
> However both Firefox and Chrome are able to build the url correctly.
> How do I fix this?
>
> Thanks!
> Michele C
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
