Yeah, it's most likely because of the async deferred object that Twisted returns from the download_request function. Note, though, that each request carries a `download_latency` value in its meta field, which records the time from entering the Twisted reactor to the response being returned.
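For reference, a minimal sketch of reading that value from a callback. The helper name is mine; `download_latency` itself is set by Scrapy's HTTP11 download handler, so it reflects time spent in the reactor for that request rather than anything you have to measure yourself:

```python
# Minimal sketch (helper name is hypothetical): reading the
# 'download_latency' that Scrapy's HTTP11 handler stores in meta.

def report_latency(response):
    """Return the download latency recorded for a response, if any."""
    latency = response.meta.get('download_latency')
    if latency is not None:
        print("Downloaded %s in %.3fs" % (response.url, latency))
    return latency
```

You would call this (or inline the same lookup) inside a spider callback such as `parse`, since the response object there exposes the request's meta dict.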
https://github.com/scrapy/scrapy/blob/0.24/scrapy/core/downloader/handlers/http11.py#L219

On Friday, May 22, 2015 at 3:22:30 PM UTC-7, Philipp Bussche wrote:
>
> Thanks Daniel,
>
> I put my hooks into http11.py and see the calls to the websites now in my
> monitoring tool.
> It is just that the response time it shows seems too quick for some of the
> sites, which are known to be a bit slow.
> Could this be because of the async mechanism that's used all over the
> place in Twisted?
> Where exactly would you say I should put my hooks to capture the time it
> takes to download/crawl a website?
>
> Thanks
> Philipp
>
> On Friday, May 22, 2015 at 12:45:17 AM UTC+2, Daniel Fockler wrote:
>>
>> Yeah, you're right, the HTTP request is happening in the Twisted reactor:
>>
>> https://github.com/scrapy/scrapy/blob/0.24/scrapy/core/downloader/handlers/http10.py
>>
>> On Thursday, May 21, 2015 at 2:47:59 PM UTC-7, Philipp Bussche wrote:
>>>
>>> Thanks Daniel,
>>>
>>> that sounds like a good idea and I will have a look at that.
>>>
>>> But I would also be interested in instrumenting the call that crawls
>>> the actual URL so I can put some monitoring code before and after it.
>>> Do you know how the actual crawl is being done? Is it done via Twisted?
>>> It does not look like httplib is being used for that.
>>>
>>> Thanks
>>> Philipp
>>>
>>> On Thursday, May 21, 2015 at 10:29:44 PM UTC+2, Daniel Fockler wrote:
>>>>
>>>> Hey,
>>>>
>>>> Not sure exactly what you are looking for, but you can implement a
>>>> Scrapy downloader middleware with a process_request function; each
>>>> request will be passed into that function so you can examine it.
>>>> Here are the docs for that:
>>>>
>>>> http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
>>>>
>>>> On Thursday, May 21, 2015 at 7:03:35 AM UTC-7, Philipp Bussche wrote:
>>>>>
>>>>> Hi there,
>>>>> I am working on some monitoring for my Python/Scrapy deployment using
>>>>> one of the commercial APM tools.
>>>>> I was able to instrument the parsing of the response as well as the
>>>>> pipeline which pushes the items into an Elasticsearch instance.
>>>>> You can see in the attached screenshot how that is visualized in the
>>>>> tool.
>>>>> I would now also like to see the outgoing calls that Scrapy is making
>>>>> through the downloader to actually crawl the HTTP pages (which
>>>>> obviously happens before parsing and pipelining).
>>>>> But I can't figure out where in the code the actual HTTP call is
>>>>> made, so that I could put my monitoring hook around it.
>>>>> Could you please point me to the class that is actually doing the
>>>>> HTTP calls?
>>>>>
>>>>> Thanks
>>>>> Philipp
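To tie the thread together, here is a sketch of the downloader-middleware approach Daniel suggested (the class and meta-key names are mine, not from Scrapy): it timestamps each request on the way out and logs the wall-clock delta on the way back. As discussed above, this delta includes time the request spends queued in the Twisted reactor, which is why it can disagree with the per-request `download_latency` measured inside the HTTP11 handler:

```python
import time


class DownloadTimingMiddleware:
    """Hypothetical middleware: wall-clock timing around the downloader.

    Because Twisted schedules requests asynchronously, the delta measured
    here includes reactor queueing time, not just the HTTP round trip.
    """

    def process_request(self, request, spider):
        # Stamp the request on its way into the downloader.
        # 'monitor_start_time' is an arbitrary key chosen for this sketch.
        request.meta['monitor_start_time'] = time.time()
        return None  # let processing continue normally

    def process_response(self, request, response, spider):
        start = request.meta.get('monitor_start_time')
        if start is not None:
            elapsed = time.time() - start
            spider.logger.info(
                "%s took %.3fs (incl. reactor queueing)",
                request.url, elapsed)
        return response
```

Enable it via the `DOWNLOADER_MIDDLEWARES` setting; inside `process_response` you could instead hand `elapsed` to an APM agent's custom-metric API.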
