I suggest you try with a simple spider; I already tried it myself and it works
fine, so it must be something in your environment:
import subprocess

if __name__ == '__main__':
    subprocess.Popen(['scrapy', 'crawl', 'followall'])
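
If you need to launch the crawl from inside your Twisted client without
blocking the reactor, something along these lines should also work (just a
sketch: reactor.spawnProcess is Twisted's non-blocking process API, you may
need the full path to the scrapy executable, and env=None makes the child
inherit your environment):

from twisted.internet import protocol, reactor

class CrawlProcess(protocol.ProcessProtocol):
    # minimal sketch: echo the crawl's output and report when it exits
    def outReceived(self, data):
        print data
    def errReceived(self, data):
        print data
    def processEnded(self, reason):
        print 'crawl finished:', reason.value

reactor.spawnProcess(CrawlProcess(), 'scrapy',
                     ['scrapy', 'crawl', 'followall'], env=None)
reactor.run()  # not needed inside your client, where the reactor already runs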
Also, this looks wrong:

BaseCrawler.__init__(self, ["dummy-unused"], spiderName, spiderID,
                     **kwargs)
CrawlSpider.__init__(self)

I don't know exactly how your class hierarchy works, but calling both
base-class __init__ methods directly like that can initialize a shared base
class twice, and it could be what is causing your problems. Try with this
spider:
https://github.com/scrapinghub/testspiders/blob/master/testspiders/spiders/followall.py
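
If both BaseCrawler and CrawlSpider ultimately derive from the same Spider
base, the usual fix is cooperative super() calls, so every __init__ in the
hierarchy runs exactly once, in MRO order. A minimal sketch of the pattern
(toy class names, not your real ones):

class Base(object):
    def __init__(self, **kwargs):
        print 'Base.__init__'

class A(Base):
    def __init__(self, **kwargs):
        print 'A.__init__'
        super(A, self).__init__(**kwargs)

class B(Base):
    def __init__(self, **kwargs):
        print 'B.__init__'
        super(B, self).__init__(**kwargs)

class C(A, B):
    def __init__(self, **kwargs):
        print 'C.__init__'
        super(C, self).__init__(**kwargs)

# C() prints C, A, B, Base: each __init__ runs exactly once, whereas
# calling A.__init__(self) and B.__init__(self) directly would run
# Base.__init__ twice.
C()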
On Tuesday, December 2, 2014 at 00:25:48 UTC-2, Mohammed Hamdy wrote:
>
> The spider is part of my API. Here's its code:
>
> class ScrapyPageListCrawler(BaseCrawler, CrawlSpider):
>     """
>     A crawler that crawls an arbitrary URL list, based on a URL generator,
>     which is just a Python generator
>     """
>
>     def __init__(self, urlGenerator, itemSelector, spiderID,
>                  spiderName="ScrapyPageListCrawler", filterPredicate=None,
>                  **kwargs):
>         # get a url from the generator for BaseCrawler to be able to
>         # get URL_PARAMS
>         BaseCrawler.__init__(self, ["dummy-unused"], spiderName, spiderID,
>                              **kwargs)
>         CrawlSpider.__init__(self)
>         self.start_urls = urlGenerator()
>         self.item_extractor = FilteringItemExtractor(itemSelector,
>                                                      self.item_loader,
>                                                      SpiderTypes.TYPE_SCRAPY,
>                                                      self.name, self._id,
>                                                      filterPredicate=filterPredicate)
>
>     def parse(self, response):
>         if self.favicon_required:
>             self.favicon_required = False
>             yield self.item_extractor.extract_favicon_item(response.url)
>         yield self.item_extractor.extract_item(response)
>
>
> I managed to get it running from twisted using *subprocess.Popen()*. Now
> I'm getting this funny log:
>
> 2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
> 2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, RetryChangeProxyMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
> 2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
> 2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled item pipelines: ItemPostProcessor, FilterFieldsPipeline, StripFaxFieldPipeline, AddSuffixPipeline, PushToHandlerPipeline
> 2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Spider opened
> 2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:15:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:16:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:17:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:18:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:19:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:20:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:21:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:22:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:23:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:24:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2014-12-02 04:25:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>
> Ideas?
>
>
> On Monday, December 1, 2014 7:58:56 PM UTC+2, Nicolás Alejandro Ramírez
> Quiros wrote:
>>
>> Can you share the code?
>>
>> On Friday, November 28, 2014 at 20:11:00 UTC-2, Mohammed Hamdy
>> wrote:
>>>
>>> I tried launching the spider in another process. It's now worse and
>>> doesn't even log that it's finished.
>>>
>>> On Friday, November 28, 2014 12:06:08 PM UTC+2, Mohammed Hamdy wrote:
>>>>
>>>> Hi there,
>>>>
>>>> I'm developing a distributed crawler using Scrapy and Twisted. There's
>>>> a server that assigns crawling jobs to clients (*so the clients create
>>>> scrapy spiders*) and so on. The clients are twisted *LineReceivers*. I'm
>>>> using scrapy 0.24.4 and twisted 14.0.2.
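>>>>
>>>> To give an idea of the shape, the client looks roughly like this (a
>>>> heavily simplified sketch; the class names, the startCrawlJob helper
>>>> and the host/port are made up, not my real code):
>>>>
>>>> from twisted.internet import reactor
>>>> from twisted.internet.protocol import ClientFactory
>>>> from twisted.protocols.basic import LineReceiver
>>>>
>>>> class CrawlerClient(LineReceiver):
>>>>     # each line from the server describes a crawl job
>>>>     def lineReceived(self, line):
>>>>         self.startCrawlJob(line)
>>>>
>>>>     def startCrawlJob(self, job):
>>>>         # hypothetical: create and run a spider for this job, then
>>>>         # sendLine() the results back to the server when it closes
>>>>         pass
>>>>
>>>> class CrawlerClientFactory(ClientFactory):
>>>>     protocol = CrawlerClient
>>>>
>>>> reactor.connectTCP('crawl-server.example', 9000, CrawlerClientFactory())
>>>> reactor.run()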
>>>>
>>>> I've been stuck on this for a couple of days now. The spider works fine
>>>> when run alone, outside of the twisted client. When it's run from the
>>>> client, something strange happens: it's never closed and stays idle
>>>> forever. Yet looking at the logs, I would say the spider was closed:
>>>>
>>>> 2014-11-27 14:55:15+0200 [WLWClientProtocol,client] WebService starting on 6080
>>>> 2014-11-27 14:55:15+0200 [scrapy] Web service listening on 127.0.0.1:6080
>>>> 2014-11-27 14:55:15+0200 [scrapy] Closing spider (finished)
>>>> 2014-11-27 14:55:15+0200 [scrapy] Dumping Scrapy stats:
>>>> {'finish_reason': 'finished',
>>>>  'finish_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 240062),
>>>>  'start_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 238374)}
>>>> 2014-11-27 14:55:15+0200 [scrapy] Spider closed (finished)
>>>> 2014-11-27 14:55:15+0200 [-] (TCP Port 6023 Closed)
>>>> 2014-11-27 14:55:15+0200 [-] (TCP Port 6080 Closed)
>>>>
>>>> But the *spider_closed* signal is never emitted (again, this spider
>>>> works fine outside the client, so the signal is properly connected). And I
>>>> depend on this signal for sending results back to the server, not to
>>>> mention that the spider staying open counts as a leak.
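>>>>
>>>> (For reference, the handler is connected in the standard way for
>>>> scrapy 0.24 -- the spider and handler names below are stand-ins for
>>>> mine, but the signals API is real:)
>>>>
>>>> from scrapy import signals
>>>> from scrapy.contrib.spiders import CrawlSpider
>>>> from scrapy.xlib.pydispatch import dispatcher
>>>>
>>>> class MySpider(CrawlSpider):
>>>>     name = 'example'
>>>>
>>>>     def __init__(self, *args, **kwargs):
>>>>         CrawlSpider.__init__(self, *args, **kwargs)
>>>>         # handler is called with (spider, reason) when the spider finishes
>>>>         dispatcher.connect(self.onSpiderClosed, signals.spider_closed)
>>>>
>>>>     def onSpiderClosed(self, spider, reason):
>>>>         # this is where the results get sent back to the server
>>>>         pass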
>>>>
>>>> Using the debugger reveals some facts, from the
>>>> *ExecutionEngine.spider_is_idle()* method:
>>>> a- The scraper is always idle (*scraper_idle* is always True) and
>>>> the spider's *parse()* method is never called.
>>>> b- *downloading* is always True, and
>>>> *Downloader.fetch()._deactivate()* is never called.
>>>>
>>>> Are there any hints at what I should be doing? Debugging deferred code
>>>> is not that easy, and stacks come out of nowhere.
>>>>
>>>> Thanks
>>>>
>>>