Hi Rocio,

unfortunately that does not work either. I tried using scrapy crawl govcrawl_main -a with my attributes, but the spider still doesn't follow the rule.

Best,
Michele C
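A note on the -a route, offered as a minimal sketch and not as a diagnosis of the spider above: values passed with -a reach the spider's constructor as plain strings, so anything meant for list attributes such as start_urls or allowed_domains has to be converted by hand. The helper names split_arg and domains_from_urls are hypothetical; the URL is the one from the quoted log.

```python
from urllib.parse import urlparse


def split_arg(value):
    """Split a comma-separated -a argument string into a list of non-empty items."""
    if not value:
        return []
    return [item.strip() for item in value.split(",") if item.strip()]


def domains_from_urls(urls):
    """Collect the distinct host names of the given URLs, keeping their order."""
    domains = []
    for url in urls:
        host = urlparse(url).netloc
        if host and host not in domains:
            domains.append(host)
    return domains


# What a spider constructor might do with a string received via -a (assumed usage).
start_urls = split_arg("http://www.mass.gov/eea/agencies/dfg/der/")
allowed_domains = domains_from_urls(start_urls)
```

In a CrawlSpider subclass, helpers like these would typically be called in __init__ before the base-class constructor runs, because CrawlSpider compiles its rules in that constructor and rules assigned afterwards are not picked up. Whether that is what happens in govcrawl_main cannot be told from the log alone.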
On Tuesday, November 4, 2014 at 15:58:35 UTC-5, Rocío Aramberri wrote:
>
> Hi Michele,
>
> I haven't dug into what exactly your problem is, but I think it may be
> better to use Scrapy arguments
> <http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments>
> instead of scheduling the job with a script.
>
> As an argument you can pass a single URL, a file name, or a URL that
> contains all the other URLs.
>
> Good luck!
> Rocio
>
> 2014-11-04 17:52 GMT-02:00 Michele Coscia <[email protected]>:
>
>> Note that if I just run
>>
>> scrapy crawl govcrawl_main
>>
>> hard-coding all of DomainSpider's attributes in the class, I get no "Filtered
>> offsite request", which is what usually happens in these questions
>> because the "allow" parameter is not correctly set. That appears not to be
>> the case here; see the log:
>>
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: govcrawl)
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Optional features available: ssl, http11, boto, django
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'govcrawl.spiders', 'DEPTH_LIMIT': 3, 'SPIDER_MODULES': ['govcrawl.spiders'], 'BOT_NAME': 'govcrawl', 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0', 'DOWNLOAD_DELAY': 1.5}
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2014-11-04 14:48:46-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>> 2014-11-04 14:48:46-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware,
>> UrlLengthMiddleware, DepthMiddleware
>> 2014-11-04 14:48:46-0500 [scrapy] INFO: Enabled item pipelines: DomainPipeline
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Spider opened
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2014-11-04 14:48:46-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>> 2014-11-04 14:48:46-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>> 2014-11-04 14:48:46-0500 [govcrawl_main] DEBUG: Crawled (200) <GET http://www.mass.gov/eea/agencies/dfg/der/> (referer: None)
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: URL: http://www.mass.gov/eea/agencies/dfg/der/ (0) Crawled 1 pages. To Crawl: 0
>> 2014-11-04 14:48:46-0500 [govcrawl_main] DEBUG: Scraped from <200 http://www.mass.gov/eea/agencies/dfg/der/>
>> None
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Closing spider (finished)
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Dumping Scrapy stats:
>> {'downloader/request_bytes': 274,
>> 'downloader/request_count': 1,
>> 'downloader/request_method_count/GET': 1,
>> 'downloader/response_bytes': 24320,
>> 'downloader/response_count': 1,
>> 'downloader/response_status_count/200': 1,
>> 'finish_reason': 'finished',
>> 'finish_time': datetime.datetime(2014, 11, 4, 19, 48, 46, 156057),
>> 'item_scraped_count': 1,
>> 'log_count/DEBUG': 4,
>> 'log_count/INFO': 8,
>> 'pages_crawled': 1,
>> 'response_received_count': 1,
>> 'scheduler/dequeued': 1,
>> 'scheduler/dequeued/memory': 1,
>> 'scheduler/enqueued': 1,
>> 'scheduler/enqueued/memory': 1,
>> 'start_time': datetime.datetime(2014, 11, 4, 19, 48, 46, 61865)}
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Spider closed (finished)
>>
>> Thanks!
>> Michele C
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> Rocío Aramberri
