This pull request (to limit response size) was merged into trunk today: https://github.com/scrapy/scrapy/pull/946
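For reference, a sketch of how those limits might be configured once that functionality is available. `DOWNLOAD_MAXSIZE` and `DOWNLOAD_WARNSIZE` are the setting names I understand that PR to introduce; check the documentation for your Scrapy version before relying on them:

```python
# settings.py sketch: cap response sizes globally.
# Setting names assumed from the PR above; availability depends on
# your Scrapy version.
DOWNLOAD_MAXSIZE = 1024 * 1024   # cancel downloads whose body exceeds 1 MB
DOWNLOAD_WARNSIZE = 256 * 1024   # log a warning for bodies over 256 kB
```

The same limits can reportedly also be overridden per spider, which fits the use case in this thread (tight limits for asset URLs, normal limits for HTML).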
Perhaps you can use that, or extend that functionality to truncate responses to certain sizes (patches welcome!).

On Fri, Mar 7, 2014 at 11:48 AM, Gheorghe Chirica <[email protected]> wrote:

> Now, my questions are:
>
> Is this approach OK? If not, what is the best way to achieve this?
>
> How can I send a custom *reason* to loseConnection(reason??)? I tried
> to send something like reason = failure.Failure(ConnectionAborted()) but I do
> not receive this in connectionLost
> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L145>.
>
> How can we change the chunk size when receiving data (this may be a
> question related to Twisted)?
>
> Thanks.
>
> On Friday, March 7, 2014 3:42:15 PM UTC+2, Gheorghe Chirica wrote:
>>
>> Hi.
>>
>> I'm currently working on a small crawler which checks a site and reports
>> the status (and other info) of all its URLs.
>>
>> My initial idea was to make a GET request for all HTML resources and a
>> HEAD request for resources other than HTML.
>>
>> The problem is that some servers do not implement the HEAD method
>> (I noticed this on URLs to Facebook and Twitter), so I get a
>> TimeoutError.
>>
>> Note that I can have the same issue not only with plain HTML pages, but
>> also with other assets.
>>
>> My next idea was to make a GET request instead of a HEAD. But in that
>> case I don't need the response body for assets (images, JS, CSS).
>>
>> So I need a way to make a GET request but fetch only a small chunk of
>> data, enough to include the headers, and then close the connection.
>> There is no need to download a 10 MB file if I only need its status
>> (200, 301).
>>
>> Now, from theory to code. I checked the Scrapy code related to
>> downloading requests.
>> So, ScrapyAgent
>> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L41>
>> is responsible for downloading pages via download_request
>> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L64>.
>>
>> The code responsible for receiving data from the socket is dataReceived
>> <https://github.com/scrapy/scrapy/blob/fb770852e87d97196e31f27c33ee8eee89aecc27/scrapy/core/downloader/handlers/http11.py#L142>.
>> Here I plugged in some logic which closes the connection after the first
>> received chunk:
>>
>> if allowed_mimetype:
>>     self._txresponse._transport._producer.loseConnection()

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
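On the custom-reason question in the thread above: Twisted's `loseConnection()` takes no arguments, so one common workaround is to record on the protocol itself *why* the reader closed the connection, and consult that flag in `connectionLost()` instead of trying to smuggle a reason through the transport. A minimal, framework-free sketch of that pattern (all names here are hypothetical, not Scrapy's or Twisted's actual API):

```python
class TruncatingReader:
    """Sketch of a body reader that stops after `maxsize` bytes and
    remembers that the close was deliberate (hypothetical names)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.body = b""
        self.truncated = False

    def data_received(self, data):
        self.body += data
        if len(self.body) >= self.maxsize:
            # In a real handler, this is where we would call
            # transport.loseConnection(); here we only record the intent.
            self.truncated = True

    def connection_lost(self):
        # A deliberate early close is a success, not an error, so we
        # distinguish it from an unexpected disconnect via the flag.
        return "truncated" if self.truncated else "done"


reader = TruncatingReader(maxsize=10)
reader.data_received(b"0123456789abcdef")  # 16 bytes exceeds the 10-byte cap
print(reader.connection_lost())  # -> truncated
```

This mirrors what the quoted `dataReceived` hook does, but makes the "we closed on purpose" signal explicit, so `connectionLost` does not have to guess whether a `ConnectionAborted` failure came from our own truncation or from a genuine network error.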
