Yeah, it's most likely because of the async deferred object that Twisted returns from the download_request function. Note, though, that each request carries a `download_latency` value in its meta field, which records the time from entering the Twisted reactor to the response being returned.
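For reference, a minimal sketch of reading that value from a callback. The helper name is mine; `download_latency` itself is set by Scrapy's HTTP11 download handler, so it reflects time spent in the reactor for that request rather than anything you have to measure yourself:

```python
# Minimal sketch (helper name is hypothetical): reading the
# 'download_latency' that Scrapy's HTTP11 handler stores in meta.

def report_latency(response):
    """Return the download latency recorded for a response, if any."""
    latency = response.meta.get('download_latency')
    if latency is not None:
        print("Downloaded %s in %.3fs" % (response.url, latency))
    return latency
```

You would call this (or inline the same lookup) inside a spider callback such as `parse`, since the response object there exposes the request's meta dict.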
https://github.com/scrapy/scrapy/blob/0.24/scrapy/core/downloader/handlers/http11.py#L219

On Friday, May 22, 2015 at 3:22:30 PM UTC-7, Philipp Bussche wrote:
>
> Thanks Daniel,
>
> I put my hooks into http11.py and see the calls to the websites now in my
> monitoring tool.
> It is just that the response time it shows seems too quick for some of the
> sites, which are known to be a bit slow.
> Could this be because of the async mechanism that's used all over the
> place in Twisted?
> Where exactly would you say I should put my hooks to capture the time it
> takes to download/crawl a website?
>
> Thanks
> Philipp
>
> On Friday, May 22, 2015 at 12:45:17 AM UTC+2, Daniel Fockler wrote:
>>
>> Yeah, you're right, the HTTP request is happening in the Twisted reactor:
>>
>> https://github.com/scrapy/scrapy/blob/0.24/scrapy/core/downloader/handlers/http10.py
>>
>> On Thursday, May 21, 2015 at 2:47:59 PM UTC-7, Philipp Bussche wrote:
>>>
>>> Thanks Daniel,
>>>
>>> that sounds like a good idea and I will have a look at that.
>>>
>>> But I would also be interested in instrumenting the call that crawls
>>> the actual URL so I can put some monitoring code before and after it.
>>> Do you know how the actual crawl is being done? Is it done via Twisted?
>>> It does not look like httplib is being used for that.
>>>
>>> Thanks
>>> Philipp
>>>
>>> On Thursday, May 21, 2015 at 10:29:44 PM UTC+2, Daniel Fockler wrote:
>>>>
>>>> Hey,
>>>>
>>>> Not sure exactly what you are looking for, but you can implement a
>>>> Scrapy downloader middleware with a process_request function; each
>>>> request will be passed into that function so you can examine it.
>>>> Here are the docs for that:
>>>>
>>>> http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
>>>>
>>>> On Thursday, May 21, 2015 at 7:03:35 AM UTC-7, Philipp Bussche wrote:
>>>>>
>>>>> Hi there,
>>>>> I am working on some monitoring for my Python/Scrapy deployment using
>>>>> one of the commercial APM tools.
>>>>> I was able to instrument the parsing of the response as well as the
>>>>> pipeline which pushes the items into an Elasticsearch instance.
>>>>> You can see in the attached screenshot how that is visualized in the
>>>>> tool.
>>>>> I would now also like to see the outgoing calls that Scrapy is making
>>>>> through the downloader to actually crawl the HTTP pages (which
>>>>> obviously happens before parsing and pipelining).
>>>>> But I can't figure out where in the code the actual HTTP call is
>>>>> made, so that I could put my monitoring hook around it.
>>>>> Could you please point me to the class that is actually doing the
>>>>> HTTP calls?
>>>>>
>>>>> Thanks
>>>>> Philipp
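To tie the thread together, here is a sketch of the downloader-middleware approach Daniel suggested (the class and meta-key names are mine, not from Scrapy): it timestamps each request on the way out and logs the wall-clock delta on the way back. As discussed above, this delta includes time the request spends queued in the Twisted reactor, which is why it can disagree with the per-request `download_latency` measured inside the HTTP11 handler:

```python
import time


class DownloadTimingMiddleware:
    """Hypothetical middleware: wall-clock timing around the downloader.

    Because Twisted schedules requests asynchronously, the delta measured
    here includes reactor queueing time, not just the HTTP round trip.
    """

    def process_request(self, request, spider):
        # Stamp the request on its way into the downloader.
        # 'monitor_start_time' is an arbitrary key chosen for this sketch.
        request.meta['monitor_start_time'] = time.time()
        return None  # let processing continue normally

    def process_response(self, request, response, spider):
        start = request.meta.get('monitor_start_time')
        if start is not None:
            elapsed = time.time() - start
            spider.logger.info(
                "%s took %.3fs (incl. reactor queueing)",
                request.url, elapsed)
        return response
```

Enable it via the `DOWNLOADER_MIDDLEWARES` setting; inside `process_response` you could instead hand `elapsed` to an APM agent's custom-metric API.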
