To catch all redirection paths, including the case where the final URL has
already been crawled, I wrote a custom duplicate filter:
from scrapy.dupefilters import RFPDupeFilter

from myscraper.items import RedirectionItem


class CustomURLFilter(RFPDupeFilter):

    def __init__(self, path=None, debug=False):
        super(CustomURLFilter, self).__init__(path, debug)

    def request_seen(self, request):
        seen = super(CustomURLFilter, self).request_seen(request)
        if seen:
            # the request is about to be dropped as a duplicate:
            # record the redirect chain that led here so it is not lost
            item = RedirectionItem()
            # default to an empty list (not u''), so we don't iterate
            # over the characters of a string when there was no redirect
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url
        return seen
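
The filter is enabled in settings.py via the DUPEFILTER_CLASS setting (the
module path below is just illustrative):

# settings.py
DUPEFILTER_CLASS = 'myscraper.dupefilters.CustomURLFilter'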
Now, how can I send the RedirectionItem directly to the pipeline?
Is there a way to get hold of the pipeline from inside the custom filter so
that I can send data directly? Or should I also create a custom scheduler
and reach the pipeline from there, and if so, how?
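
For illustration, here is the kind of wiring I have in mind (untested, and
it leans on Scrapy internals that may differ between versions): hand the
crawler to the dupefilter, then push items through the engine's item
pipeline manager:

from scrapy.utils.job import job_dir


class CustomURLFilter(RFPDupeFilter):

    # ... request_seen as above ...

    @classmethod
    def from_crawler(cls, crawler):
        # recent Scrapy versions build the dupefilter via from_crawler
        # when it exists; older ones only call from_settings, so this
        # part may need adapting to the version in use
        dupefilter = cls(job_dir(crawler.settings),
                         crawler.settings.getbool('DUPEFILTER_DEBUG'))
        dupefilter.crawler = crawler
        return dupefilter

    def send_to_pipeline(self, item):
        # itemproc is the engine's item pipeline manager; this is
        # internal, undocumented API and may change between releases
        self.crawler.engine.scraper.itemproc.process_item(
            item, self.crawler.spider)

request_seen could then call self.send_to_pipeline(item) instead of
dropping the item on the floor, but I would love to hear if there is a
supported way to do this.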
Thanks!
On Monday, August 29, 2016 at 12:48:12 PM UTC+2, Antoine Brunel wrote:
>
> Hello,
>
> 1. I use Scrapy to build a *linkmap* table, i.e. a listing of not only all
> links on crawled pages but also each link's status.
>
>> url                  link                       anchor
>> http://example.com/  http://example.com/link_A  Link A
>> http://example.com/  http://example.com/link_B  Link B
>> http://example.com/  http://example.com/link_C  Link C
>
>
> To do that, I extract the links from the response, then create a request
> with a specific callback:
>
>     from urlparse import urljoin  # urllib.parse on Python 3
>
>     for href_element in response.xpath("//a[@href]"):
>         link = urljoin(response.url,
>                        href_element.xpath("@href").extract_first())
>         yield Request(link, callback=self.parse_item)
>
> When urls & their links are crawled, they are stored in a table named
> *urls*:
>
>> url                        status
>> http://example.com/link_A  200
>> http://example.com/link_B  200
>> http://example.com/link_C  404
>
>
> Then I pull the status into the first table by joining on link and url:
> """select linkmap.url, linkmap.link, linkmap.anchor, urls.status
>    from linkmap, urls where linkmap.link = urls.url"""
>
> And the result is the following:
>
>> url                  link                       anchor  status
>> http://example.com/  http://example.com/link_A  Link A  200
>> http://example.com/  http://example.com/link_B  Link B  200
>> http://example.com/  http://example.com/link_C  Link C  404
>
>
> 2a. Problems arise when redirections get in the way...
> Say http://example.com/link_D has a 301 pointing to
> http://example.com/link_D1: since the link URL and the final URL differ,
> the join on the url column matches nothing, so I cannot connect the
> status.
>
> Table *linkmap*
>
>> url                  link                         anchor
>> http://example.com/  http://example.com/*link_D*  Link D
>
>
> Table *urls*
>
>> url                           status
>> http://example.com/*link_D1*  200
>
> => And the query result is the following:
>
>> url                  link                       anchor  status
>> http://example.com/  http://example.com/link_D  Link D  *EMPTY*
>
>
> 2b. So, I add the redirection path to the item with
> *response.meta.get('redirect_urls')*.
> Then, in pipelines.py, link_final is added to the table based on the
> initial url:
>
>     links = item.get('redirect_urls', [])
>     # url holds the page the link was found on (set earlier in the
>     # pipeline); links[0] is the original link before the redirect
>     if len(links) > 0 and url != '':
>         link_final = item.get('url', '')
>         cursor.execute("""UPDATE linkmap SET link_final=%s
>                           WHERE link=%s AND url=%s""",
>                        (link_final, links[0], url))
>
> And now we have:
> Table *linkmap*
>
>> url                  link                       link_final                    anchor
>> http://example.com/  http://example.com/link_D  *http://example.com/link_D1*  Link D
>
>
> Table *urls*
>
>> url                         status
>> http://example.com/link_D1  200
>
>
> => The query is updated to:
> """select linkmap.url, linkmap.link, linkmap.anchor, urls.status from
> linkemap, urls where linkmap.link=urls.url *or
> linkmap.link_final=urls.url*"""
>
> And the query result is now the following and life is beautiful:
>
>> url                  link                       link_final                    anchor  status
>> http://example.com/  http://example.com/link_D  *http://example.com/link_D1*  Link D  *200*
>
> 3. However, life is tough, and more problems arise when more redirections
> get in the way...
> Let's say http://example.com/link_E also redirects to
> http://example.com/link_D1, and it is crawled after
> http://example.com/link_D ...
>
> Because of RFPDupeFilter, the redirected request is dropped as a
> duplicate, so link_final is never set for http://example.com/link_E and
> its linkmap row cannot be updated (see the fingerprint sketch after the
> query result below):
> Table linkmap
>
>> url                  link                       link_final                    anchor
>> http://example.com/  http://example.com/link_D  *http://example.com/link_D1*  Link D
>> http://example.com/  http://example.com/link_E  *EMPTY*                       Link E
>
>
> Table urls
>> url                         status
>> http://example.com/link_D1  200
>
> => And the query result is the following:
>
>> url                  link                       link_final                  anchor  status
>> http://example.com/  http://example.com/link_D  http://example.com/link_D1  Link D  *200*
>> http://example.com/  http://example.com/link_E  *EMPTY*                     Link E  *EMPTY*
>
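> To make the cause concrete, here is a minimal sketch (illustrative, using
> the request_fingerprint helper from scrapy.utils.request): both redirects
> land on the same URL, so the two requests share a fingerprint and the
> second one is dropped before any response, and thus any redirect_urls
> meta, is ever produced:
>
>     from scrapy import Request
>     from scrapy.utils.request import request_fingerprint
>
>     r1 = Request('http://example.com/link_D1')  # reached via link_D
>     r2 = Request('http://example.com/link_D1')  # reached via link_E
>     # identical fingerprints => RFPDupeFilter drops the second request
>     assert request_fingerprint(r1) == request_fingerprint(r2)
>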
>
> Short-term and super ugly solution: disable RFPDupeFilter. But no, I don't
> want to do that, I want to find another way!
>
> Conclusion: this approach failed again. Still, I have the feeling that
> there is a simpler way, like capturing an already-crawled URL
> from RFPDupeFilter or something like that?
> I mean, Scrapy is crawling all these URLs, so it must be possible to
> build the linkmap with the status; the question is what the right way
> to do it is.
> Maybe I am wrong to wait until the info reaches the pipeline?
>
> Thanks for your help, reflections, critiques, and ideas!
> Antoine.
>