Hello,

1. I use Scrapy to create a *linkmap* table, i.e. listing not only all links found on crawled pages but also each link's status.
    url                  link                        anchor
    http://example.com/  http://example.com/link_A   Link A
    http://example.com/  http://example.com/link_B   Link B
    http://example.com/  http://example.com/link_C   Link C

To do that, I extract the links from the response, then create a request with a specific callback:

    for href_element in response.xpath("//a[@href]"):
        link = urljoin(response.url, href_element.xpath("@href").extract_first())
        yield Request(link, callback=self.parse_item)

When urls and their links are crawled, they are stored in a table named *urls*:

    url                         status
    http://example.com/link_A   200
    http://example.com/link_B   200
    http://example.com/link_C   404

Then I fetch the status for the first table by joining on link and url:

    """SELECT linkmap.url, linkmap.link, linkmap.anchor, urls.status
       FROM linkmap, urls
       WHERE linkmap.link = urls.url"""

And the result is the following:

    url                  link                        anchor   status
    http://example.com/  http://example.com/link_A   Link A   200
    http://example.com/  http://example.com/link_B   Link B   200
    http://example.com/  http://example.com/link_C   Link C   404

2a. Problems arise when redirections get in the way. If http://example.com/link_D returns a 301 pointing to http://example.com/link_D1, the link url and the final url are different, so the query based on the url returns nothing and I cannot connect the status.

Table *linkmap*:

    url                  link                        anchor
    http://example.com/  http://example.com/link_D   Link D

Table *urls*:

    url                          status
    http://example.com/link_D1   200

=> And the query result is the following:

    url                  link                        anchor   status
    http://example.com/  http://example.com/link_D   Link D   EMPTY

2b. So, I add the redirection path to the item with response.meta.get('redirect_urls').
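Concretely, when Scrapy follows the 301, its RedirectMiddleware records the chain in response.meta['redirect_urls']. Here is a minimal sketch of how I capture that in the callback; no real crawl happens, a stand-in response object just mimics what the middleware provides, and the field names are my own simplification:

```python
# Sketch only: FakeResponse stands in for a real scrapy.http.Response,
# carrying the redirect_urls meta that RedirectMiddleware fills in.
class FakeResponse:
    def __init__(self, url, status, meta):
        self.url, self.status, self.meta = url, status, meta

def parse_item(response):
    # In the real spider this is the Request callback.
    return {
        "url": response.url,  # final url after the redirect
        "status": response.status,
        # Urls visited before the final one; empty if no redirect happened.
        "redirect_urls": response.meta.get("redirect_urls", []),
    }

# link_D 301-redirects to link_D1, so the original url ends up in the chain.
item = parse_item(FakeResponse(
    "http://example.com/link_D1", 200,
    {"redirect_urls": ["http://example.com/link_D"]},
))
print(item)
```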
Then, in pipelines.py, link_final is added to the table based on the initial url:

    links = item.get('redirect_urls', [])
    if len(links) > 0 and url != '':
        link_final = item.get('url', '')
        cursor.execute("""UPDATE linkmap SET link_final=%s
                          WHERE link=%s AND url=%s""",
                       (link_final, links[0], url))

And now we have:

Table *linkmap*:

    url                  link                        link_final                   anchor
    http://example.com/  http://example.com/link_D   http://example.com/link_D1   Link D

Table *urls*:

    url                          status
    http://example.com/link_D1   200

=> The query is updated to:

    """SELECT linkmap.url, linkmap.link, linkmap.anchor, urls.status
       FROM linkmap, urls
       WHERE linkmap.link = urls.url OR linkmap.link_final = urls.url"""

And the query result is now the following, and life is beautiful:

    url                  link                        link_final                   anchor   status
    http://example.com/  http://example.com/link_D   http://example.com/link_D1   Link D   200

3. However, life is tough, and more problems arise when more redirections get in the way. Let's say http://example.com/link_E also redirects to http://example.com/link_D1, and it is crawled after http://example.com/link_D. Because of RFPDupeFilter, the redirected request is filtered out as a duplicate, so link_final is never set for link_E and its row cannot be updated:

Table *linkmap*:

    url                  link                        link_final                   anchor
    http://example.com/  http://example.com/link_D   http://example.com/link_D1   Link D
    http://example.com/  http://example.com/link_E   EMPTY                        Link E

Table *urls*:

    url                          status
    http://example.com/link_D1   200

=> And the query result is the following:

    url                  link                        link_final                   anchor   status
    http://example.com/  http://example.com/link_D   http://example.com/link_D1   Link D   200
    http://example.com/  http://example.com/link_E   EMPTY                        Link E   EMPTY

Short-term and super ugly solution: disable RFPDupeFilter. But no, I don't want to do that; I want to find another way!

Conclusion: this approach failed again.
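For reference, the dead end can be reproduced with a minimal in-memory SQLite sketch. SQLite and the sample rows are just for illustration (the real pipeline uses another database), and I wrote the query as a LEFT JOIN so the unmatched link_E row shows up with an empty (NULL) status, as in the result tables above:

```python
import sqlite3

# Illustrative in-memory reproduction of section 3.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE linkmap (url TEXT, link TEXT, link_final TEXT, anchor TEXT)")
cur.execute("CREATE TABLE urls (url TEXT, status INTEGER)")

cur.executemany("INSERT INTO linkmap VALUES (?, ?, ?, ?)", [
    # link_D was crawled first: its redirect chain was seen, link_final is set.
    ("http://example.com/", "http://example.com/link_D",
     "http://example.com/link_D1", "Link D"),
    # link_E redirects to the same target, but RFPDupeFilter dropped the
    # duplicate request, so link_final was never filled in.
    ("http://example.com/", "http://example.com/link_E", None, "Link E"),
])
cur.execute("INSERT INTO urls VALUES ('http://example.com/link_D1', 200)")

# LEFT JOIN so unmatched linkmap rows still appear (NULL status = EMPTY).
rows = cur.execute("""SELECT linkmap.url, linkmap.link, linkmap.link_final,
                             linkmap.anchor, urls.status
                      FROM linkmap LEFT JOIN urls
                        ON linkmap.link = urls.url
                        OR linkmap.link_final = urls.url""").fetchall()
for row in rows:
    print(row)
```

Running this shows link_D resolving to status 200 while link_E's status stays NULL, which is exactly the broken linkmap above.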
Still, I have the feeling that there is a simpler way, like capturing an already-crawled url from RFPDupeFilter or something like that. I mean, Scrapy is crawling all these urls, so it must be possible to create the linkmap with the statuses; the question is what the right way to do it is. Maybe I am wrong to wait until the info reaches the pipeline?

Thanks for your help, reflections, critiques and ideas!

Antoine.
