Thanks for your reply. I was using a proxy middleware to route requests through Tor, and it was the site denying requests from Tor exit nodes that was causing the problem. I disabled the middleware and now requests go through fine. As I said before, there's nothing wrong with the spider itself, since it works fine outside the application.
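For the record, one way to disable a downloader middleware in Scrapy is to map it to None in DOWNLOADER_MIDDLEWARES. A rough sketch, with 'myproject.middlewares' standing in as a placeholder for whatever dotted path the proxy middlewares are actually registered under (the class names are the ones that show up in the "Enabled downloader middlewares" log line):

    # settings.py -- mapping a downloader middleware to None disables it.
    # 'myproject.middlewares' is a placeholder module path, not the real one.
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.ProxyMiddleware': None,
        'myproject.middlewares.RetryChangeProxyMiddleware': None,
    }

Commenting the entries out of your own DOWNLOADER_MIDDLEWARES dict works just as well; mapping to None is mainly useful when the entries come from a shared base settings module you don't want to edit.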
On Tuesday, December 2, 2014 4:46:07 PM UTC+2, Nicolás Alejandro Ramírez Quiros wrote:
>
> I suggest you try with a simple spider. I already tried one myself and it works fine, so it must be something in your environment:
>
>     import subprocess
>
>     if __name__ == '__main__':
>         subprocess.Popen(['scrapy', 'crawl', 'followall'])
>
> Also, this seems to be wrong:
>
>     BaseCrawler.__init__(self, ["dummy-unused"], spiderName, spiderID, **kwargs)
>     CrawlSpider.__init__(self)
>
> I don't know exactly how this works, but it could be what is causing your problems. Try with this spider:
> https://github.com/scrapinghub/testspiders/blob/master/testspiders/spiders/followall.py
>
> On Tuesday, December 2, 2014 00:25:48 UTC-2, Mohammed Hamdy wrote:
>>
>> The spider is part of my API. Here's its code:
>>
>>     class ScrapyPageListCrawler(BaseCrawler, CrawlSpider):
>>         """
>>         A crawler that crawls an arbitrary URL list, based on a URL generator,
>>         which is just a Python generator
>>         """
>>
>>         def __init__(self, urlGenerator, itemSelector, spiderID,
>>                      spiderName="ScrapyPageListCrawler", filterPredicate=None,
>>                      **kwargs):
>>             # get a url from the generator for BaseCrawler to be able to get URL_PARAMS
>>             BaseCrawler.__init__(self, ["dummy-unused"], spiderName, spiderID,
>>                                  **kwargs)
>>             CrawlSpider.__init__(self)
>>             self.start_urls = urlGenerator()
>>             self.item_extractor = FilteringItemExtractor(itemSelector,
>>                                                          self.item_loader,
>>                                                          SpiderTypes.TYPE_SCRAPY,
>>                                                          self.name, self._id,
>>                                                          filterPredicate=filterPredicate)
>>
>>         def parse(self, response):
>>             if self.favicon_required:
>>                 self.favicon_required = False
>>                 yield self.item_extractor.extract_favicon_item(response.url)
>>             yield self.item_extractor.extract_item(response)
>>
>> I managed to get it running from twisted using *subprocess.Popen()*.
>>
>> Now I have this funny log:
>>
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, RetryChangeProxyMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled item pipelines: ItemPostProcessor, FilterFieldsPipeline, StripFaxFieldPipeline, AddSuffixPipeline, PushToHandlerPipeline
>>     2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Spider opened
>>     2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:15:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:16:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:17:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:18:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:19:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:20:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:21:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:22:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:23:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:24:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:25:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>
>> Ideas?
>>
>> On Monday, December 1, 2014 7:58:56 PM UTC+2, Nicolás Alejandro Ramírez Quiros wrote:
>>>
>>> Can you share the code?
>>>
>>> On Friday, November 28, 2014 20:11:00 UTC-2, Mohammed Hamdy wrote:
>>>>
>>>> I tried launching the spider in another process. It's now worse and doesn't even log that it has finished.
>>>>
>>>> On Friday, November 28, 2014 12:06:08 PM UTC+2, Mohammed Hamdy wrote:
>>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm developing a distributed crawler using Scrapy and Twisted. There's a server that assigns crawling jobs to clients (*so the clients create scrapy spiders*), and so on. The clients are twisted *LineReceivers*. I have scrapy 0.24.4 and twisted 14.0.2.
>>>>>
>>>>> I've been stuck on this for a couple of days now. The spider works fine when run alone, outside of the twisted client. When it's run from the client, something strange happens: it's never closed and stays idle forever.
>>>>> If I look at the logs, I would say the spider was closed:
>>>>>
>>>>>     2014-11-27 14:55:15+0200 [WLWClientProtocol,client] WebService starting on 6080
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Web service listening on 127.0.0.1:6080
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Closing spider (finished)
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Dumping Scrapy stats:
>>>>>         {'finish_reason': 'finished',
>>>>>          'finish_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 240062),
>>>>>          'start_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 238374)}
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Spider closed (finished)
>>>>>     2014-11-27 14:55:15+0200 [-] (TCP Port 6023 Closed)
>>>>>     2014-11-27 14:55:15+0200 [-] (TCP Port 6080 Closed)
>>>>>
>>>>> But the *spider_closed* signal is never emitted (again, this spider works fine outside the client, so the signal is properly connected). I depend on this signal to send results back to the server, not to mention that the spider stays open, which counts as a leak.
>>>>>
>>>>> Using the debugger reveals some facts, from the *ExecutionEngine.spider_is_idle()* method:
>>>>> a- The scraper is always idle (*scraper_idle* is always True) and the spider's *parse()* method is never called.
>>>>> b- *downloading* is always True, and *Downloader.fetch()._deactivate()* is never called.
>>>>>
>>>>> Are there any hints at what I should be doing? Debugging deferred code is not that easy, and stacks come out of nowhere.
>>>>>
>>>>> Thanks
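P.S. for anyone who lands here with the original problem quoted above (a spider started from inside a Twisted client that never fires *spider_closed*): below is a minimal sketch of starting the crawl in-process, inside the already-running reactor, and hooking the close signal. It assumes Scrapy >= 1.0 (this thread was on 0.24, which exposes scrapy.crawler.Crawler instead of CrawlerRunner), and MySpider is a placeholder for your own spider class, not code from this thread:

    # Sketch only: run a crawl inside an already-running Twisted reactor and
    # get notified when the spider actually closes.
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    def on_spider_closed(spider, reason):
        # Fired by Scrapy once the engine has really shut the spider down;
        # a good place to push results back to the job server.
        print('spider %s closed (%s)' % (spider.name, reason))

    def start_crawl(spider_cls, *args, **kwargs):
        runner = CrawlerRunner(get_project_settings())
        crawler = runner.create_crawler(spider_cls)
        crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
        # crawl() returns a Deferred that fires when the crawl finishes, so the
        # client's existing reactor keeps running; no reactor.run()/stop() here.
        return runner.crawl(crawler, *args, **kwargs)

The same idea works on 0.24 by building a scrapy.crawler.Crawler directly and connecting the signal on it before starting the crawl.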
