Re: Seriously strange spider behavior

Mohammed Hamdy Mon, 01 Dec 2014 18:25:57 -0800

The spider is part of my API. Here's its code:

class ScrapyPageListCrawler(BaseCrawler, CrawlSpider):
  """
  A crawler that crawls an arbitrary URL list, based on a URL generator,
  which is just a Python generator
  """

  def __init__(self, urlGenerator, itemSelector, spiderID,
               spiderName="ScrapyPageListCrawler", filterPredicate=None,
               **kwargs):
    # get a url from the generator for BaseCrawler to be able to get URL_PARAMS
    BaseCrawler.__init__(self, ["dummy-unused"], spiderName, spiderID,
                         **kwargs)
    CrawlSpider.__init__(self)
    self.start_urls = urlGenerator()
    self.item_extractor = FilteringItemExtractor(
        itemSelector, self.item_loader, SpiderTypes.TYPE_SCRAPY, self.name,
        self._id, filterPredicate=filterPredicate)

  def parse(self, response):
    if self.favicon_required:
      self.favicon_required = False
      yield self.item_extractor.extract_favicon_item(response.url)
    yield self.item_extractor.extract_item(response)
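
A point possibly worth ruling out (an assumption, not a diagnosis): start_urls above holds a generator object, and a Python generator can be iterated only once. If anything walks it before Scrapy does, Scrapy would see no URLs at all. The sketch below shows that behavior with plain Python; url_generator and ReiterableURLs are hypothetical names used only for illustration and are not part of Scrapy or of the API above.

```python
def url_generator():
    # Hypothetical stand-in for the urlGenerator passed to the spider.
    for page in range(1, 4):
        yield "http://example.com/page/%d" % page


urls = url_generator()
first_pass = list(urls)   # consumes the generator
second_pass = list(urls)  # the generator is exhausted, so this is empty


class ReiterableURLs(object):
    """Wrap a generator function so every iteration starts from scratch."""

    def __init__(self, generator_func):
        self._generator_func = generator_func

    def __iter__(self):
        # A fresh generator is created each time iteration begins.
        return self._generator_func()


reiterable = ReiterableURLs(url_generator)
```

Logging the URLs as the generator yields them is another cheap way to confirm that requests are actually being produced.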


I managed to get it running from twisted using *subprocess.Popen()*. Now, I 
have this funny log:

2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, RetryChangeProxyMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled item pipelines: ItemPostProcessor, FilterFieldsPipeline, StripFaxFieldPipeline, AddSuffixPipeline, PushToHandlerPipeline
2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Spider opened
2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:15:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:16:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:17:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:18:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:19:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:20:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:21:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:22:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:23:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:24:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-02 04:25:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Ideas?
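
For readers following along, launching a child process with subprocess.Popen() and collecting its exit code and output can be sketched roughly as below. This is a minimal sketch under stated assumptions: it uses Python 3 (for the timeout argument), run_child and child_cmd are hypothetical names, and the child command is a trivial stand-in rather than the real spider launcher. Note that communicate() blocks, so inside a running Twisted reactor a non-blocking approach would be needed instead.

```python
import subprocess
import sys


def run_child(args, timeout=30):
    """Run a child process and return (exit_code, stdout, stderr) as text."""
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    try:
        # communicate() drains both pipes, so the child cannot stall
        # on a full pipe buffer while writing its log.
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child ran too long: kill it and collect what it wrote.
        proc.kill()
        out, err = proc.communicate()
    return proc.returncode, out, err


# Hypothetical stand-in for the command that starts the spider.
child_cmd = [sys.executable, "-c", "print('spider finished')"]
code, out, err = run_child(child_cmd)
```

A non-zero exit code or anything on stderr is usually the first thing to inspect when the child appears to do nothing.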


On Monday, December 1, 2014 7:58:56 PM UTC+2, Nicolás Alejandro Ramírez 
Quiros wrote:
>
> Can you share the code?
>
> On Friday, November 28, 2014 8:11:00 PM UTC-2, Mohammed Hamdy wrote:
>>
>> I tried launching the spider in another process. It's now worse and 
>> doesn't even log that it's finished.
>>
>> On Friday, November 28, 2014 12:06:08 PM UTC+2, Mohammed Hamdy wrote:
>>>
>>> Hi there,
>>>
>>> I'm developing a distributed crawler using Scrapy and Twisted. There's a 
>>> server that assigns crawling jobs to clients (*so clients create scrapy 
>>> spiders*) and so on. The clients are twisted *LineReceivers*. I have 
>>> scrapy 0.24.4 and twisted 14.0.2.
>>>
>>> I've been stuck on this for a couple of days now. The spider works fine 
>>> when run alone, outside of the twisted client. When it's run from the 
>>> client, something strange happens: it is never closed and stays idle 
>>> forever. Yet, judging by the logs, the spider was closed:
>>>
>>> 2014-11-27 14:55:15+0200 [WLWClientProtocol,client] WebService starting 
>>> on 6080
>>> 2014-11-27 14:55:15+0200 [scrapy] Web service listening on 127.0.0.1:
>>> 6080
>>> 2014-11-27 14:55:15+0200 [scrapy] Closing spider (finished)
>>> 2014-11-27 14:55:15+0200 [scrapy] Dumping Scrapy stats:
>>>  {'finish_reason': 'finished',
>>>  'finish_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 240062),
>>>  'start_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 238374)}
>>> 2014-11-27 14:55:15+0200 [scrapy] Spider closed (finished)
>>> 2014-11-27 14:55:15+0200 [-] (TCP Port 6023 Closed)
>>> 2014-11-27 14:55:15+0200 [-] (TCP Port 6080 Closed)
>>>
>>> But the *spider_closed *signal is never emitted (again, this spider 
>>> works fine outside the client, so the signal is properly connected). And I 
>>> depend on this signal for sending results back to server, not to mention 
>>> that the spider stays open, which counts as a leak.
>>>
>>> Using the debugger reveals some facts:
>>> From *ExecutionEngine.spider_is_idle() *method:
>>>    a- The scraper is always idle (*scraper_idle *is always True) and 
>>> the spider's *parse()* method is never called.
>>>    b- *downloading *is always True. And the 
>>> *Downloader.fetch()._deactivate()* is never called.
>>>
>>> Are there any hints as to what I should be doing? Debugging deferred 
>>> code is not that easy, and stacks come out of nowhere.
>>>
>>> Thanks
>>>
>>
