Your recent debug output doesn't have that error, so you must have fixed it. (For anyone who hits the same IOError later, the sketch below shows where the empty filename came from.)
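In case it helps others who find this thread: that IOError happened because, for a bare URL with no path and no trailing slash, response.url.split("/")[-2] is the empty string, so the spider effectively called open('', 'wb'). A minimal sketch of the failure and one possible fix (Python 2 to match your logs; the example URL and the urlparse-based fallback are illustrations, not the only way to do it):

    from urlparse import urlparse  # Python 2 stdlib

    url = "http://www.example.com"   # placeholder; no path, no trailing slash
    print url.split("/")             # ['http:', '', 'www.example.com']
    print repr(url.split("/")[-2])   # '' -- hence IOError: No such file or directory: ''

    # One safer derivation: use the last path segment, falling back to the hostname.
    parsed = urlparse(url)
    filename = parsed.path.strip("/").split("/")[-1] or parsed.netloc
    print filename                   # 'www.example.com'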
The current error feels like it's either a JavaScript-loaded page, or you're getting blocked from scraping by the server. Google around for how to scrape a JavaScript page with Scrapy, and for using a proxy. Those guides will be your friend; two rough sketches follow.
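To make those pointers concrete, here is a rough sketch of the JavaScript route, using Selenium to render the page and then handing the HTML back to Scrapy's selectors. This assumes Selenium and a Firefox driver are installed; the spider name and URL are placeholders:

    import scrapy
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class RenderedSpider(scrapy.Spider):
        name = "rendered"                        # placeholder name
        start_urls = ["http://www.example.com"]  # placeholder URL

        def parse(self, response):
            # Re-fetch the page with a real browser so its JavaScript runs,
            # then wrap the rendered HTML so the usual selectors work on it.
            driver = webdriver.Firefox()
            try:
                driver.get(response.url)
                rendered = HtmlResponse(url=response.url,
                                        body=driver.page_source,
                                        encoding='utf-8')
            finally:
                driver.quit()
            # Extract from `rendered` with .xpath()/.css() from here.

Rendering inside the parse callback keeps the sketch short, but spinning up a browser per page is slow; moving it into a downloader middleware would be the next step if this approach works for you. And for the proxy route: Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta, so overriding start_requests is enough. The address below is a placeholder for whatever proxy you actually use:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'proxy': 'http://127.0.0.1:8080'})  # placeholder proxy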
On Thu, Apr 2, 2015 at 12:58 PM, Troy Perkins <[email protected]> wrote:

> Hi Travis, thanks for the response. Not sure why it's not able to find it,
> it's there, see below:
>
> pawnbahnimac:spiders pawnbahn$ pwd
> /Users/pawnbahn/tm/tm/spiders
> pawnbahnimac:spiders pawnbahn$ ls
> Books  Resources  __init__.py  __init__.pyc  items.json  tm_spider.py  tm_spider.pyc
> pawnbahnimac:spiders pawnbahn$
>
> It only behaves like this on this site for some reason. Running the dmoz
> example works fine.
>
> pawnbahnimac:spiders pawnbahn$ scrapy crawl tm
> :0: UserWarning: You do not have a working installation of the
> service_identity module: 'No module named service_identity'. Please
> install it from <https://pypi.python.org/pypi/service_identity> and make
> sure all of its dependencies are satisfied. Without the service_identity
> module and a recent enough pyOpenSSL to support it, Twisted can perform
> only rudimentary TLS client hostname verification. Many valid
> certificate/hostname mappings may be rejected.
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Optional features available: ssl, http11
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled item pipelines:
> 2015-04-02 14:56:01-0500 [tm] INFO: Spider opened
> 2015-04-02 14:56:01-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
> 2015-04-02 14:56:01-0500 [tm] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
> 2015-04-02 14:56:01-0500 [tm] INFO: Closing spider (finished)
> 2015-04-02 14:56:01-0500 [tm] INFO: Dumping Scrapy stats:
>     {'downloader/request_bytes': 260,
>      'downloader/request_count': 1,
>      'downloader/request_method_count/GET': 1,
>      'downloader/response_bytes': 6234,
>      'downloader/response_count': 1,
>      'downloader/response_status_count/200': 1,
>      'finish_reason': 'finished',
>      'finish_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 861714),
>      'log_count/DEBUG': 3,
>      'log_count/INFO': 7,
>      'response_received_count': 1,
>      'scheduler/dequeued': 1,
>      'scheduler/dequeued/memory': 1,
>      'scheduler/enqueued': 1,
>      'scheduler/enqueued/memory': 1,
>      'start_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 494696)}
> 2015-04-02 14:56:01-0500 [tm] INFO: Spider closed (finished)
>
> On Thursday, April 2, 2015 at 11:30:41 AM UTC-5, Travis Leleu wrote:
>>
>> Python can't find the file whose path is stored in filename, used on
>> line 13 of your spider. Read your Scrapy debug output for more
>> information.
>>
>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>     with open(filename, 'wb') as f:
>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>
>> On Wed, Apr 1, 2015 at 10:38 PM, Troy Perkins <[email protected]> wrote:
>>
>>> Greetings all:
>>>
>>> I'm new to Scrapy and managed to get everything installed and working.
>>> However, my simple test project has proven not so simple, at least for me.
>>>
>>> I simply want to request the home page of t 1 c k e t m a s t e r
>>> d o t c o m, click the red Just Announced tab down the middle of the
>>> page, and send the list of results out to an email address once a day
>>> via cron. I want to be able to keep up with the announcements because
>>> their mailing lists simply don't send them soon enough.
>>>
>>> Here is my starting spider, which I've tested with other sites, and it
>>> works fine. I believe the error is due to it being a JavaScript-rendered
>>> site. I've used Firebug to look for clues, but I'm too new at this to
>>> understand it, as well as to understand JavaScript. I'm hoping someone
>>> would be willing to point this noob in a direction. I've also tried
>>> removing middleware in the settings.py file, with the same results.
>>>
>>> I've purposely masked out the site address; though I don't mean any
>>> harm, I'm not quite sure of their ToS as of yet. I plan to poll once a
>>> day anyway, for personal use.
>>>
>>> import scrapy
>>>
>>> from tm.items import TmItem
>>>
>>> class TmSpider(scrapy.Spider):
>>>     name = "tm"
>>>     allowed_domains = ["www.************.com"]
>>>     start_urls = [
>>>         "http://www.***********.com"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         filename = response.url.split("/")[-2]
>>>         with open(filename, 'wb') as f:
>>>             f.write(response.body)
>>>
>>> scrapy crawl tm results in the following:
>>>
>>> :0: UserWarning: You do not have a working installation of the
>>> service_identity module: 'No module named service_identity'. Please
>>> install it from <https://pypi.python.org/pypi/service_identity> and
>>> make sure all of its dependencies are satisfied. Without the
>>> service_identity module and a recent enough pyOpenSSL to support it,
>>> Twisted can perform only rudimentary TLS client hostname verification.
>>> Many valid certificate/hostname mappings may be rejected.
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl, http11
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
>>> 2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
>>> 2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>>> 2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET http://www.****************com> (referer: None)
>>> 2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET http://www.****************.com>
>>> Traceback (most recent call last):
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
>>>     self.runUntilCurrent()
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>>>     call.func(*call.args, **call.kw)
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
>>>     self._startRunCallbacks(result)
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
>>>     self._runCallbacks()
>>> --- <exception caught here> ---
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
>>>     current.result = callback(current.result, *args, **kw)
>>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>     with open(filename, 'wb') as f:
>>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>> 2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
>>> 2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
>>>     {'downloader/request_bytes': 219,
>>>      'downloader/request_count': 1,
>>>      'downloader/request_method_count/GET': 1,
>>>      'downloader/response_bytes': 73266,
>>>      'downloader/response_count': 1,
>>>      'downloader/response_status_count/200': 1,
>>>      'finish_reason': 'finished',
>>>      'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
>>>      'log_count/DEBUG': 3,
>>>      'log_count/ERROR': 1,
>>>      'log_count/INFO': 7,
>>>      'response_received_count': 1,
>>>      'scheduler/dequeued': 1,
>>>      'scheduler/dequeued/memory': 1,
>>>      'scheduler/enqueued': 1,
>>>      'scheduler/enqueued/memory': 1,
>>>      'spider_exceptions/IOError': 1,
>>>      'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
>>> 2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
Groups "scrapy-users" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To post to this group, send email to [email protected]. >>> Visit this group at http://groups.google.com/group/scrapy-users. >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> -- > You received this message because you are subscribed to the Google Groups > "scrapy-users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at http://groups.google.com/group/scrapy-users. > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "scrapy-users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/scrapy-users. For more options, visit https://groups.google.com/d/optout.
