Re: CrawlSpider fails to follow rule for some websites

Michele Coscia Tue, 04 Nov 2014 13:50:02 -0800

Hi Rocio,
unfortunately that does not work either. I tried running scrapy crawl 
govcrawl_main with my attributes passed via -a, but the spider still doesn't 
follow the rule.
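
For reference, a minimal sketch of how a spider argument could be turned into start URLs, along the lines suggested in the quoted message (a single URL or a file name). The helper names `resolve_start_urls` and `allowed_domains_for` are hypothetical and are not part of Scrapy or of the govcrawl project; Scrapy itself only hands each `-a name=value` pair to the spider's `__init__` as a keyword argument.

```python
import os
from urllib.parse import urlparse


def resolve_start_urls(arg):
    """Turn a spider argument into a list of start URLs.

    Hypothetical helper: accepts either a single URL or the path of a
    text file that lists one URL per line.
    """
    if os.path.isfile(arg):
        with open(arg) as handle:
            # Skip blank lines and strip surrounding whitespace.
            return [line.strip() for line in handle if line.strip()]
    # Anything that is not an existing file is treated as a single URL.
    return [arg]


def allowed_domains_for(urls):
    """Collect the host names of the given URLs, without duplicates."""
    domains = []
    for url in urls:
        host = urlparse(url).hostname
        if host and host not in domains:
            domains.append(host)
    return domains
```

A spider's `__init__` could call these helpers on a keyword argument (say a hypothetical `start`) and assign the results to `start_urls` and `allowed_domains`; the spider would then be run as `scrapy crawl govcrawl_main -a start=<url-or-file>`.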
Best,
Michele C


On Tuesday, 4 November 2014 15:58:35 UTC-5, Rocío Aramberri wrote:
>
> Hi Michele,
>
> I haven't dug into what exactly your problem is, but I think it would be 
> better to use Scrapy arguments 
> <http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments> 
> instead of scheduling the job with a script.
>
> As an argument you can pass a single URL, a file name, or a URL that 
> contains all the other URLs.
>
> Good luck!
> Rocio
>
> 2014-11-04 17:52 GMT-02:00 Michele Coscia <[email protected]>:
>
>> Note that if I just run
>>
>> scrapy crawl govcrawl_main
>>
>> hard-coding all of DomainSpider's attributes in the class, I get no 
>> "Filtered offsite request" message, which is what usually turns up in 
>> questions like this one because the "allow" parameter is not set correctly. 
>> That does not appear to be the case here; see the log:
>>
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: 
>> govcrawl)
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Optional features available: ssl, 
>> http11, boto, django
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Overridden settings: {
>> 'NEWSPIDER_MODULE': 'govcrawl.spiders', 'DEPTH_LIMIT': 3, 
>> 'SPIDER_MODULES': ['govcrawl.spiders'], 'BOT_NAME': 'govcrawl', 
>> 'DOWNLOAD_TIMEOUT': 60, 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux 
>> x86_64; rv:30.0) Gecko/20100101 Firefox/30.0', 'DOWNLOAD_DELAY': 1.5}
>> 2014-11-04 14:48:45-0500 [scrapy] INFO: Enabled extensions: LogStats, 
>> TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2014-11-04 14:48:46-0500 [scrapy] INFO: Enabled downloader middlewares: 
>> HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, 
>> RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, 
>> HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, 
>> ChunkedTransferMiddleware, DownloaderStats
>> 2014-11-04 14:48:46-0500 [scrapy] INFO: Enabled spider middlewares: 
>> HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, 
>> UrlLengthMiddleware, DepthMiddleware
>> 2014-11-04 14:48:46-0500 [scrapy] INFO: Enabled item pipelines: 
>> DomainPipeline
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Spider opened
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Crawled 0 pages (at 0 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2014-11-04 14:48:46-0500 [scrapy] DEBUG: Telnet console listening on 
>> 127.0.0.1:6023
>> 2014-11-04 14:48:46-0500 [scrapy] DEBUG: Web service listening on 
>> 127.0.0.1:6080
>> 2014-11-04 14:48:46-0500 [govcrawl_main] DEBUG: Crawled (200) <GET 
>> http://www.mass.gov/eea/agencies/dfg/der/> (referer: None)
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: URL: 
>> http://www.mass.gov/eea/agencies/dfg/der/ (0) Crawled 1 pages. To Crawl: 0
>> 2014-11-04 14:48:46-0500 [govcrawl_main] DEBUG: Scraped from <200 
>> http://www.mass.gov/eea/agencies/dfg/der/>
>>     None
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Closing spider (finished)
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Dumping Scrapy stats:
>>     {'downloader/request_bytes': 274,
>>      'downloader/request_count': 1,
>>      'downloader/request_method_count/GET': 1,
>>      'downloader/response_bytes': 24320,
>>      'downloader/response_count': 1,
>>      'downloader/response_status_count/200': 1,
>>      'finish_reason': 'finished',
>>      'finish_time': datetime.datetime(2014, 11, 4, 19, 48, 46, 156057),
>>      'item_scraped_count': 1,
>>      'log_count/DEBUG': 4,
>>      'log_count/INFO': 8,
>>      'pages_crawled': 1,
>>      'response_received_count': 1,
>>      'scheduler/dequeued': 1,
>>      'scheduler/dequeued/memory': 1,
>>      'scheduler/enqueued': 1,
>>      'scheduler/enqueued/memory': 1,
>>      'start_time': datetime.datetime(2014, 11, 4, 19, 48, 46, 61865)}
>> 2014-11-04 14:48:46-0500 [govcrawl_main] INFO: Spider closed (finished)
>>
>> Thanks!
>> Michele C
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Rocío Aramberri
>  

