Add dont_filter=True to your request:

    from scrapy import Request

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
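This is needed because the site answers with a 302 that points back to the
very same URL. RedirectMiddleware then schedules a second request for a URL
the scheduler has already seen, the duplicate filter silently drops it, and
parse() is never called. dont_filter=True lets the request bypass that
filter (and the redirected copy inherits the flag). The shell behaves
differently because it builds its initial request with dont_filter=True
already. A minimal self-contained sketch; the spider name and the parse
body are placeholders, not taken from your code:

    import scrapy
    from scrapy import Request

    class MhsSpider(scrapy.Spider):
        name = "mhs"  # placeholder name
        start_urls = ["http://mhs.mt.gov/"]

        def start_requests(self):
            for url in self.start_urls:
                # The 302 redirects back to the same URL; without
                # dont_filter=True the redirected request is discarded
                # as a duplicate and parse() never runs.
                yield Request(url, dont_filter=True)

        def parse(self, response):
            self.log("crawled %s" % response.url)

Note that dont_filter only affects this one request; everything else the
spider yields is still de-duplicated as usual.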


On Tuesday, November 18, 2014 at 15:20:16 UTC-2, Michele Coscia wrote:
>
> So, I am trying to crawl this website: http://mhs.mt.gov/
>
> If I launch the scrapy shell, I see that it 302 redirects to itself. The 
> shell handles it nicely and I get a proper response object:
>
> scrapy shell http://mhs.mt.gov/
> 2014-11-18 12:09:01-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: 
> govcrawl)
> 2014-11-18 12:09:01-0500 [scrapy] INFO: Optional features available: ssl, 
> http11, boto, django
> 2014-11-18 12:09:01-0500 [scrapy] INFO: Overridden settings:
> {'NEWSPIDER_MODULE': 'govcrawl.spiders', 'DEPTH_LIMIT': 3,
> 'SPIDER_MODULES': ['govcrawl.spiders'], 'BOT_NAME': 'govcrawl',
> 'DOWNLOAD_TIMEOUT': 35, 'LOGSTATS_INTERVAL': 0,
> 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0)
> Gecko/20100101 Firefox/30.0', 'DOWNLOAD_DELAY': 1.5}
> 2014-11-18 12:09:01-0500 [scrapy] INFO: Enabled extensions: TelnetConsole, 
> CloseSpider, WebService, CoreStats, SpiderState
> 2014-11-18 12:09:01-0500 [scrapy] INFO: Enabled downloader middlewares: 
> HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, 
> RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, 
> HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, 
> ChunkedTransferMiddleware, DownloaderStats, MimeFixerMiddleware
> 2014-11-18 12:09:01-0500 [scrapy] INFO: Enabled spider middlewares: 
> HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, 
> UrlLengthMiddleware, DepthMiddleware
> 2014-11-18 12:09:01-0500 [scrapy] INFO: Enabled item pipelines: 
> DomainPipeline
> 2014-11-18 12:09:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2014-11-18 12:09:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
> 2014-11-18 12:09:01-0500 [default] INFO: Spider opened
> 2014-11-18 12:09:01-0500 [default] DEBUG: Redirecting (302) to <GET http://mhs.mt.gov/> from <GET http://mhs.mt.gov/>
> 2014-11-18 12:09:03-0500 [default] DEBUG: Crawled (200) <GET http://mhs.mt.gov/> (referer: None)
> [s] Available Scrapy objects:
> [s]   crawler    <scrapy.crawler.Crawler object at 0x1fe1190>
> [s]   item       {}
> [s]   request    <GET http://mhs.mt.gov/>
> [s]   response   <200 http://mhs.mt.gov/>
> [s]   settings   <scrapy.settings.Settings object at 0x1a59e10>
> [s]   spider     <Spider 'default' at 0x332c790>
> [s] Useful shortcuts:
> [s]   shelp()           Shell help (print this help)
> [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
> [s]   view(response)    View response in a browser
>
> In [1]:
>
> However, in my crawler this does not happen. The spider never enters the
> parse method. I overrode the start_requests method as
>
>     def start_requests(self):
>         for url in self.start_urls:
>             yield Request(url)
> ...
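(For anyone who finds this later: when the duplicate filter is what is
eating your request, the spider log shows a DEBUG line containing
"Filtered duplicate request" for the dropped request. If I remember right,
setting DUPEFILTER_DEBUG = True in settings.py makes Scrapy log every
filtered duplicate instead of only the first one.)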
