Hi all, on this page, http://www.gilacountyaz.gov/government/assessor/index.php, all the links in the left sidebar are paths from the site root, but the webmaster did not include the leading slash. As a result, Scrapy never detects duplicate pages and keeps crawling the same pages forever. This is because urljoin resolves those links relative to the current page's directory rather than the site root:
>>> import urlparse
>>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
>>> b = "government/assessor/address_change.php"
>>> urlparse.urljoin(a, b)
'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'

However, both Firefox and Chrome build the URL correctly. How do I fix this?

Thanks!
Michele C
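
One possible explanation, not verified for this page, is that the HTML declares a `<base href="...">` tag. Browsers resolve relative links against that base instead of the page URL. The sketch below shows how the result changes under that assumption. `resolve_link` is a hypothetical helper, and the `base_href` value is assumed, not taken from the page source.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def resolve_link(page_url, href, base_href=None):
    """Resolve href against <base href> if one is given, else the page URL.

    base_href is whatever the page declares in its <base> tag (may be None).
    """
    # A <base href> may itself be relative, so resolve it against the page first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, href)


page = "http://www.gilacountyaz.gov/government/assessor/index.php"
link = "government/assessor/address_change.php"

# Without a base tag: same result as the plain urljoin call in the post.
print(resolve_link(page, link))

# With an ASSUMED <base href="http://www.gilacountyaz.gov/">.
print(resolve_link(page, link, base_href="http://www.gilacountyaz.gov/"))
```

If the page does declare a base tag, Scrapy's `scrapy.utils.response.get_base_url(response)` returns it, and joining extracted links against that value instead of `response.url` should give the same URLs as the browsers.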
