Re: Scrapy confusion about Item Pipeline and Middlewares

Pablo Hoffman Fri, 16 May 2014 09:19:37 -0700

On Thu, Dec 26, 2013 at 11:08 AM, Mrudul Tarwatkar <
[email protected]> wrote:


> Are *Downloader Middlewares processed before the downloader? Before the
> URL is scraped?*
>

Before and after: it "wraps" the downloader. process_request is called
before the URL is downloaded, and process_response is called afterwards
(with the HTTP response that was fetched). By "scraping" we typically mean
extracting data, which happens in the spider, outside the downloader.
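
To make the "wrapping" concrete, here is a minimal sketch of a downloader
middleware. The class name and the events list are made up for illustration;
only the two hook names and their arguments come from Scrapy. It is kept free
of Scrapy imports so it can be read (and run) on its own.

```python
class RecordingDownloaderMiddleware:
    """Hypothetical middleware that records what happens around a download."""

    def __init__(self):
        # Illustration only: a real middleware would more likely log or
        # modify the request/response instead of collecting events.
        self.events = []

    def process_request(self, request, spider):
        # Called BEFORE the request reaches the downloader.
        self.events.append(("before", request.url))
        return None  # None tells Scrapy to keep processing this request

    def process_response(self, request, response, spider):
        # Called AFTER the downloader has fetched the HTTP response.
        self.events.append(("after", request.url, response.status))
        return response  # hand the response on towards the spider
```

In a project it would be enabled through the DOWNLOADER_MIDDLEWARES setting.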

> Are *Pipelines processed after the URL is crawled (downloaded) and the
> spider items are set?*
>

Pipelines are called after the spider has scraped an item, once for each
item it returns.
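
A rough sketch of what that looks like, assuming dict-like items with a
"title" field (the class name and the field are hypothetical; process_item is
the hook Scrapy calls):

```python
class StripTitlePipeline:
    """Hypothetical pipeline that tidies up one field of each item."""

    def process_item(self, item, spider):
        # Called once for every item the spider returns, i.e. after the
        # page has been downloaded and the spider has extracted the data.
        item["title"] = item.get("title", "").strip()
        return item  # returning the item passes it to the next pipeline
```

It would be enabled through the ITEM_PIPELINES setting.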

> Now, let's say *I store the fingerprint of every response in a visit_id
> item*, using request_fingerprint in Scrapy.
>
> So if I want to write a *downloader middleware which avoids visiting
> already visited URLs in subsequent runs of a spider*, how would it be?
>

Like this one:
http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/

Note that it is a spider middleware, not a downloader middleware: it wraps
the spider, not the downloader.
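
In case the link goes away, here is a rough sketch of the general idea. It is
not the code from that snippet. It assumes the fingerprints from previous runs
have already been loaded into a set, and it uses a plain SHA-1 of the URL as a
stand-in for Scrapy's request_fingerprint so that it runs without Scrapy
installed. The class name and the .url check are also just for illustration;
in a real project you would test isinstance(entry, Request) instead.

```python
import hashlib


class SkipVisitedSpiderMiddleware:
    """Hypothetical spider middleware that drops already visited requests."""

    def __init__(self, visited=None):
        # Assumption: fingerprints saved by earlier runs are passed in here.
        self.visited = set(visited or [])

    @staticmethod
    def fingerprint(url):
        # Stand-in for request_fingerprint: hashes only the URL.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def process_spider_output(self, response, result, spider):
        # Called with everything the spider returned for this response
        # (new requests and scraped items mixed together).
        for entry in result:
            url = getattr(entry, "url", None)
            if url is None:
                yield entry  # no .url attribute: treat it as an item
            elif self.fingerprint(url) not in self.visited:
                yield entry  # a request we have not visited before
            # otherwise: already visited in an earlier run, so drop it
```

It would be enabled through the SPIDER_MIDDLEWARES setting.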

Pablo.

