Hello,

1. I use Scrapy to create a *linkmap* table, i.e. one that lists not only all 
the links found on the crawled pages but also each link's status.

> url                     link                         anchor
> http://example.com/     http://example.com/link_A    Link A
> http://example.com/     http://example.com/link_B    Link B
> http://example.com/     http://example.com/link_C    Link C


To do that, I extract the links from the response, then create a request with 
a specific callback:

for href_element in response.xpath("//a[@href]"):
    link = urljoin(response.url, href_element.xpath("@href").extract_first())
    yield Request(link, callback=self.parse_item)
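I also yield one *linkmap* row per link from inside that same loop, roughly 
like this (the dict keys are illustrative, and how the pipeline tells linkmap 
rows apart from urls rows is omitted):

    # inside the for loop above, before the Request is yielded
    anchor = href_element.xpath("normalize-space(.)").extract_first()
    yield {'url': response.url, 'link': link, 'anchor': anchor}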

When the linked urls are crawled, their status is stored in a table named 
*urls*:

> url                                            status
> http://example.com/link_A       200
> http://example.com/link_B       200
> http://example.com/link_C       404
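parse_item itself is basically just this (simplified; the dict keys are 
illustrative, and the 404 only reaches the callback because non-2xx statuses 
are explicitly allowed, e.g. with handle_httpstatus_list = [404] on the 
spider):

    # callback for the requests yielded above: one *urls* row per crawled url
    def parse_item(self, response):
        yield {'url': response.url, 'status': response.status}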


Then I pull the status into the first table by joining each link against the 
crawled urls:
"""select linkmap.url, linkmap.link, linkmap.anchor, urls.status
   from linkmap, urls
   where linkmap.link=urls.url"""
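(For readability, the same query written with an explicit JOIN; `cursor` is 
whatever DB-API cursor the pipeline already uses:)

# equivalent to the comma join above, just written as an explicit JOIN
cursor.execute("""
    SELECT linkmap.url, linkmap.link, linkmap.anchor, urls.status
    FROM linkmap
    JOIN urls ON linkmap.link = urls.url
""")
rows = cursor.fetchall()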

And the result is the following: 

> url                     link                         anchor    status
> http://example.com/     http://example.com/link_A    Link A    200
> http://example.com/     http://example.com/link_B    Link B    200
> http://example.com/     http://example.com/link_C    Link C    404


2a. Problems arise when redirections get in the way... 
If http://example.com/link_D has a 301 pointing to http://example.com/link_D1, 
the link url and the final url are different, so the query based on the url 
returns nothing and I cannot connect the status.

Table *linkmap*

> url                     link                          anchor
> http://example.com/     http://example.com/*link_D*   Link D


Table *urls*

> url                            status
> http://example.com/*link_D1*   200
 
=> And the query result is the following: 

> url                     link                         anchor    status
> http://example.com/     http://example.com/link_D    Link D    *EMPTY*


2b. So, I add the redirection path to the item with 
*response.meta.get('redirect_urls')*. 
Then, in pipelines.py, link_final is added to the row based on the initial url:
links = item.get('redirect_urls', [])

if len(links) > 0 and url != '':
    # links[0] is the originally requested link, url is the page it was found on
    link_final = item.get('url', '')
    cursor.execute("""UPDATE linkmap SET link_final=%s WHERE link=%s AND url=%s""",
                   (link_final, links[0], url))
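For reference, on the spider side this just means adding the redirect chain to 
the item yielded by parse_item, roughly (dict keys illustrative):

    def parse_item(self, response):
        # RedirectMiddleware stores the originally requested url(s) in
        # response.meta['redirect_urls'] when a redirect was followed
        yield {
            'url': response.url,                                      # final url after the redirect
            'status': response.status,
            'redirect_urls': response.meta.get('redirect_urls', []),  # e.g. ['http://example.com/link_D']
        }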

And now we have:
Table *linkmap*

> url                     link                         link_final                      anchor
> http://example.com/     http://example.com/link_D    *http://example.com/link_D1*    Link D


Table *urls*

> url                                            status
> http://example.com/link_D1     200


=> The query is updated to: 
"""select linkmap.url, linkmap.link, linkmap.anchor, urls.status from 
linkemap, urls where linkmap.link=urls.url *or linkmap.link_final=urls.url*
"""

And the query result is now the following and life is beautiful: 

> url                     link                         link_final                      anchor    status
> http://example.com/     http://example.com/link_D    *http://example.com/link_D1*    Link D    *200*

3. However, life is tough, and more problems arise when more redirections 
get in the way... 
Let's say http://example.com/link_E also redirects to 
http://example.com/link_D1, and it is crawled after 
http://example.com/link_D ... 

Because of RFPDupeFilter, the redirected request to 
http://example.com/link_D1 is dropped as a duplicate, so no item ever reaches 
the pipeline for http://example.com/link_E and link_final cannot be updated 
for its linkmap row:
Table *linkmap*

> url                     link                         link_final                      anchor
> http://example.com/     http://example.com/link_D    *http://example.com/link_D1*    Link D
> http://example.com/     http://example.com/link_E    *EMPTY*                         Link E


Table *urls*

> url                            status
> http://example.com/link_D1     200

=> And the query result is the following: 

> url                     link                         link_final                      anchor    status
> http://example.com/     http://example.com/link_D    http://example.com/link_D1      Link D    *200*
> http://example.com/     http://example.com/link_E    *EMPTY*                         Link E    *EMPTY*


Short-term and super ugly solution: disable RFPDupeFilter. But no, I don't 
want to do that, I want to find another way!
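(By "disable" I mean something like this in settings.py, or dont_filter=True 
on every request:)

# settings.py -- what "disabling" would look like; not doing it!
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'   # never filters duplicates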

Conclusion: this approach failed again. Still, I have the feeling that there 
is a simpler way, like capturing an already-crawled url from RFPDupeFilter or 
something like that? 
I mean, Scrapy is crawling all these urls, so it must be possible to build the 
linkmap with the statuses; the question is what the right way to do it is. 
Maybe I am wrong to wait until the info gets to the pipeline?
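To make that last idea a bit more concrete, here is the kind of thing I have 
in mind, completely untested: a dupefilter subclass that, when it drops a 
redirected request as a duplicate, still remembers the original-link -> 
final-url mapping (class and attribute names are made up, and how that 
mapping would reach the pipeline/DB is exactly my open question):

from scrapy.dupefilters import RFPDupeFilter

class RedirectAwareDupeFilter(RFPDupeFilter):
    """Untested sketch: keep the redirect mapping even for filtered requests."""

    def __init__(self, path=None, debug=False):
        super().__init__(path, debug)
        self.resolved_redirects = {}  # original link -> final url

    def request_seen(self, request):
        seen = super().request_seen(request)
        redirect_urls = request.meta.get('redirect_urls')
        if seen and redirect_urls:
            # the dropped request is the *redirected* one: request.url is the
            # final url, redirect_urls[0] is the link that was originally requested
            self.resolved_redirects[redirect_urls[0]] = request.url
        return seen

# settings.py (module path is hypothetical):
# DUPEFILTER_CLASS = 'myproject.dupefilters.RedirectAwareDupeFilter'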

Thanks for your help, reflexions, critics and ideas!
Antoine.
