Thanks for your reply. I was using a proxy middleware to route requests through Tor, and it was the site denying requests from Tor exit nodes that was causing the problem. I disabled the middleware and now requests go through fine. As I said before, there's nothing wrong with the spider itself, since it works fine outside the application.
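For the record, one way to disable a downloader middleware in Scrapy is to map it to None in DOWNLOADER_MIDDLEWARES. A rough sketch, with 'myproject.middlewares' standing in as a placeholder for whatever dotted path the proxy middlewares are actually registered under (the class names are the ones that show up in the "Enabled downloader middlewares" log line):

    # settings.py -- mapping a downloader middleware to None disables it.
    # 'myproject.middlewares' is a placeholder module path, not the real one.
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.ProxyMiddleware': None,
        'myproject.middlewares.RetryChangeProxyMiddleware': None,
    }

Commenting the entries out of your own DOWNLOADER_MIDDLEWARES dict works just as well; mapping to None is mainly useful when the entries come from a shared base settings module you don't want to edit.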
On Tuesday, December 2, 2014 4:46:07 PM UTC+2, Nicolás Alejandro Ramírez Quiros wrote:
>
> I suggest you try with a simple spider. I already tried one myself and it works fine, so it must be something in your environment:
>
>     import subprocess
>
>     if __name__ == '__main__':
>         subprocess.Popen(['scrapy', 'crawl', 'followall'])
>
> Also, this seems to be wrong:
>
>     BaseCrawler.__init__(self, ["dummy-unused"], spiderName, spiderID, **kwargs)
>     CrawlSpider.__init__(self)
>
> I don't know exactly how this works, but it could be what is causing your problems. Try with this spider:
> https://github.com/scrapinghub/testspiders/blob/master/testspiders/spiders/followall.py
>
> On Tuesday, December 2, 2014 00:25:48 UTC-2, Mohammed Hamdy wrote:
>>
>> The spider is part of my API. Here's its code:
>>
>>     class ScrapyPageListCrawler(BaseCrawler, CrawlSpider):
>>         """
>>         A crawler that crawls an arbitrary URL list, based on a URL generator,
>>         which is just a Python generator
>>         """
>>
>>         def __init__(self, urlGenerator, itemSelector, spiderID,
>>                      spiderName="ScrapyPageListCrawler", filterPredicate=None,
>>                      **kwargs):
>>             # get a url from the generator for BaseCrawler to be able to get URL_PARAMS
>>             BaseCrawler.__init__(self, ["dummy-unused"], spiderName, spiderID,
>>                                  **kwargs)
>>             CrawlSpider.__init__(self)
>>             self.start_urls = urlGenerator()
>>             self.item_extractor = FilteringItemExtractor(itemSelector,
>>                                                          self.item_loader,
>>                                                          SpiderTypes.TYPE_SCRAPY,
>>                                                          self.name, self._id,
>>                                                          filterPredicate=filterPredicate)
>>
>>         def parse(self, response):
>>             if self.favicon_required:
>>                 self.favicon_required = False
>>                 yield self.item_extractor.extract_favicon_item(response.url)
>>             yield self.item_extractor.extract_item(response)
>>
>> I managed to get it running from twisted using *subprocess.Popen()*.
>>
>> Now I have this funny log:
>>
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, RetryChangeProxyMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>     2014-12-02 04:14:12+0200 [scrapy] INFO: Enabled item pipelines: ItemPostProcessor, FilterFieldsPipeline, StripFaxFieldPipeline, AddSuffixPipeline, PushToHandlerPipeline
>>     2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Spider opened
>>     2014-12-02 04:14:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:15:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:16:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:17:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:18:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:19:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:20:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:21:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:22:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:23:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:24:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>     2014-12-02 04:25:12+0200 [WLWCrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>
>> Ideas?
>>
>> On Monday, December 1, 2014 7:58:56 PM UTC+2, Nicolás Alejandro Ramírez Quiros wrote:
>>>
>>> Can you share the code?
>>>
>>> On Friday, November 28, 2014 20:11:00 UTC-2, Mohammed Hamdy wrote:
>>>>
>>>> I tried launching the spider in another process. It's now worse and doesn't even log that it has finished.
>>>>
>>>> On Friday, November 28, 2014 12:06:08 PM UTC+2, Mohammed Hamdy wrote:
>>>>>
>>>>> Hi there,
>>>>>
>>>>> I'm developing a distributed crawler using Scrapy and Twisted. There's a server that assigns crawling jobs to clients (*so the clients create scrapy spiders*), and so on. The clients are twisted *LineReceivers*. I have scrapy 0.24.4 and twisted 14.0.2.
>>>>>
>>>>> I've been stuck on this for a couple of days now. The spider works fine when run alone, outside of the twisted client. When it's run from the client, something strange happens: it's never closed and stays idle forever.
>>>>> If I look at the logs, I would say the spider was closed:
>>>>>
>>>>>     2014-11-27 14:55:15+0200 [WLWClientProtocol,client] WebService starting on 6080
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Web service listening on 127.0.0.1:6080
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Closing spider (finished)
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Dumping Scrapy stats:
>>>>>         {'finish_reason': 'finished',
>>>>>          'finish_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 240062),
>>>>>          'start_time': datetime.datetime(2014, 11, 27, 12, 55, 15, 238374)}
>>>>>     2014-11-27 14:55:15+0200 [scrapy] Spider closed (finished)
>>>>>     2014-11-27 14:55:15+0200 [-] (TCP Port 6023 Closed)
>>>>>     2014-11-27 14:55:15+0200 [-] (TCP Port 6080 Closed)
>>>>>
>>>>> But the *spider_closed* signal is never emitted (again, this spider works fine outside the client, so the signal is properly connected). I depend on this signal to send results back to the server, not to mention that the spider stays open, which counts as a leak.
>>>>>
>>>>> Using the debugger reveals some facts, from the *ExecutionEngine.spider_is_idle()* method:
>>>>> a- The scraper is always idle (*scraper_idle* is always True) and the spider's *parse()* method is never called.
>>>>> b- *downloading* is always True, and *Downloader.fetch()._deactivate()* is never called.
>>>>>
>>>>> Are there any hints at what I should be doing? Debugging deferred code is not that easy, and stacks come out of nowhere.
>>>>>
>>>>> Thanks
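P.S. for anyone who lands here with the original problem quoted above (a spider started from inside a Twisted client that never fires *spider_closed*): below is a minimal sketch of starting the crawl in-process, inside the already-running reactor, and hooking the close signal. It assumes Scrapy >= 1.0 (this thread was on 0.24, which exposes scrapy.crawler.Crawler instead of CrawlerRunner), and MySpider is a placeholder for your own spider class, not code from this thread:

    # Sketch only: run a crawl inside an already-running Twisted reactor and
    # get notified when the spider actually closes.
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    def on_spider_closed(spider, reason):
        # Fired by Scrapy once the engine has really shut the spider down;
        # a good place to push results back to the job server.
        print('spider %s closed (%s)' % (spider.name, reason))

    def start_crawl(spider_cls, *args, **kwargs):
        runner = CrawlerRunner(get_project_settings())
        crawler = runner.create_crawler(spider_cls)
        crawler.signals.connect(on_spider_closed, signal=signals.spider_closed)
        # crawl() returns a Deferred that fires when the crawl finishes, so the
        # client's existing reactor keeps running; no reactor.run()/stop() here.
        return runner.crawl(crawler, *args, **kwargs)

The same idea works on 0.24 by building a scrapy.crawler.Crawler directly and connecting the signal on it before starting the crawl.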
