Absolute paths without leading slash

Michele Coscia Wed, 05 Nov 2014 11:25:18 -0800

Hi all,

On this webpage, http://www.gilacountyaz.gov/government/assessor/index.php,
all links in the left sidebar are meant as paths from the site root, but the
webmaster did not include the leading slash. As a result, Scrapy never sees a
duplicate URL: every joined link is a new, longer one, so it goes on crawling
the same pages forever. This is because urljoin resolves a path without a
leading slash relative to the current page's directory, so it cannot detect this:


>>> import urlparse
>>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
>>> b = "government/assessor/address_change.php"
>>> urlparse.urljoin(a, b)
'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'
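
For comparison, here is a minimal sketch of the same join with the leading slash present, and of joining against the site root instead of the page URL. The variable names are only illustrative, and the site-root join is a hypothetical workaround that assumes every slash-less link on the site is really meant as a root path; it would break links that are genuinely relative. The import fallback is there so the snippet runs on both Python 2 (urlparse) and Python 3 (urllib.parse).

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as in the session above

base = "http://www.gilacountyaz.gov/government/assessor/index.php"

# With a leading slash, the reference replaces the whole path of the base URL.
with_slash = urljoin(base, "/government/assessor/address_change.php")

# Without it, the reference is appended to the directory of the base URL.
no_slash = urljoin(base, "government/assessor/address_change.php")

# Hypothetical workaround: reduce the base to the site root before joining.
# This assumes slash-less links are always meant as root paths.
site_root = urljoin(base, "/")
from_root = urljoin(site_root, "government/assessor/address_change.php")

print(with_slash)
print(no_slash)
print(from_root)
```

The first and third results are the URL the browsers end up with; the second is the doubled path shown above.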

However, both Firefox and Chrome are able to build the URL correctly.
How do I fix this?

Thanks!
Michele C
