It turns out this is not a Scrapy bug at all (sorry for the noise).
I should have read the <base> tag in the <head> of the page and passed its
href to urljoin instead of response.url.
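
For anyone hitting the same problem, here is a minimal sketch of that approach
using only the Python 3 standard library (urllib.parse instead of the Python 2
urlparse module in the quoted session below). The <base> href in SAMPLE_HTML is
an assumption for illustration, not copied from the real page, and
BaseHrefParser and resolve are hypothetical helper names, not part of Scrapy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefParser(HTMLParser):
    """Record the href of the first <base> tag seen in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Only the first <base> element counts; later ones are ignored.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def resolve(page_url, html, link):
    """Resolve link the way a browser would, honouring <base href> if present."""
    parser = BaseHrefParser()
    parser.feed(html)
    # A <base href> may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return urljoin(base, link)


PAGE_URL = "http://www.gilacountyaz.gov/government/assessor/index.php"
# Assumed markup: the real page's <base> value is not shown in this thread.
SAMPLE_HTML = '<html><head><base href="http://www.gilacountyaz.gov/"></head></html>'
LINK = "government/assessor/address_change.php"

print(resolve(PAGE_URL, SAMPLE_HTML, LINK))
# http://www.gilacountyaz.gov/government/assessor/address_change.php
```

Without a <base> tag the function falls back to the page URL, which reproduces
the doubled path from my first message.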
Cheers!
Michele C


On Wednesday, November 5, 2014 at 14:41:10 UTC-5, Travis Leleu wrote:
>
> If you are still using a LinkExtractor, there is a process_links argument
> that lets you process the links before they continue down the pipeline.
> Perhaps you could modify any URL without a leading slash to include one.
>
> Otherwise, I'd look into the Scrapy call stack to figure out where you
> need to add a middleware layer to do this.  You may also be able to
> override canonicalize_url to handle this edge case (just don't forget to
> call the parent's method as well to retain the canonicalization step).
>
> On Wed, Nov 5, 2014 at 11:25 AM, Michele Coscia <[email protected]> wrote:
>
>> Hi all,
>>
>> in this webpage http://www.gilacountyaz.gov/government/assessor/index.php 
>> all links in the left sidebar are absolute paths, but the webmaster did not 
>> include the leading slash. Therefore, scrapy does not detect duplicate 
>> pages and goes on and on crawling the same pages forever. This is because 
>> urljoin is not smart enough to detect this:
>>
>> >>> import urlparse
>> >>> a = "http://www.gilacountyaz.gov/government/assessor/index.php"
>> >>> b = "government/assessor/address_change.php"
>> >>> urlparse.urljoin(a, b)
>> 'http://www.gilacountyaz.gov/government/assessor/government/assessor/address_change.php'
>>
>> However, both Firefox and Chrome are able to build the URL correctly.
>> How do I fix this?
>>
>> Thanks!
>> Michele C
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
