Hi Travis, thanks for the response. Not sure why it's not able to find it, it's there, see below:
pawnbahnimac:spiders pawnbahn$ pwd
/Users/pawnbahn/tm/tm/spiders
pawnbahnimac:spiders pawnbahn$ ls
Books        Resources    __init__.py    __init__.pyc    items.json    tm_spider.py    tm_spider.pyc
pawnbahnimac:spiders pawnbahn$

It only behaves like this on this site for some reason. Running the dmoz example works fine.

pawnbahnimac:spiders pawnbahn$ scrapy crawl tm
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
2015-04-02 14:56:01-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
2015-04-02 14:56:01-0500 [scrapy] INFO: Optional features available: ssl, http11
2015-04-02 14:56:01-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled item pipelines:
2015-04-02 14:56:01-0500 [tm] INFO: Spider opened
2015-04-02 14:56:01-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-04-02 14:56:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-04-02 14:56:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-04-02 14:56:01-0500 [tm] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2015-04-02 14:56:01-0500 [tm] INFO: Closing spider (finished)
2015-04-02 14:56:01-0500 [tm] INFO: Dumping Scrapy stats:
	{'downloader/request_bytes': 260,
	 'downloader/request_count': 1,
	 'downloader/request_method_count/GET': 1,
	 'downloader/response_bytes': 6234,
	 'downloader/response_count': 1,
	 'downloader/response_status_count/200': 1,
	 'finish_reason': 'finished',
	 'finish_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 861714),
	 'log_count/DEBUG': 3,
	 'log_count/INFO': 7,
	 'response_received_count': 1,
	 'scheduler/dequeued': 1,
	 'scheduler/dequeued/memory': 1,
	 'scheduler/enqueued': 1,
	 'scheduler/enqueued/memory': 1,
	 'start_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 494696)}
2015-04-02 14:56:01-0500 [tm] INFO: Spider closed (finished)

On Thursday, April 2, 2015 at 11:30:41 AM UTC-5, Travis Leleu wrote:
>
> Python can't find the file whose path is stored in filename, used on line
> 13 of your spider. Read your Scrapy debug output to find out more
> information.
>
>     File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>       with open(filename, 'wb') as f:
>     exceptions.IOError: [Errno 2] No such file or directory: ''
>
> On Wed, Apr 1, 2015 at 10:38 PM, Troy Perkins <[email protected]> wrote:
>
>> Greetings all:
>>
>> I'm new to Scrapy and managed to get everything installed and working.
>> However, my simple test project has proven not so simple, at least for me.
>>
>> I simply want to request the home page of t 1 c k e t m a s t e r  d o t  c o m,
>> click the red Just Announced tab down the middle of the page, and send the
>> list of results out to an email address once a day via cron. I want to be
>> able to keep up with the announcements because their mailing lists simply
>> don't send them soon enough.
>>
>> Here is my starting spider, which I've tested with other sites, and it
>> works fine there. I believe the error is due to it being a
>> JavaScript-rendered site. I've used Firebug to look for clues, but I'm too
>> new at this (and at JavaScript) to understand what I'm seeing. I'm hoping
>> someone would be willing to point this noob in a direction. I've also
>> tried removing middleware in the settings.py file, with the same results.
>>
>> I've purposely masked out the site address; though I don't mean any harm,
>> I'm not quite sure of their ToS as of yet. I plan to poll only once a day
>> anyway, for personal use.
>>
>> import scrapy
>>
>> from tm.items import TmItem
>>
>> class TmSpider(scrapy.Spider):
>>     name = "tm"
>>     allowed_domains = ["www.************.com"]
>>     start_urls = [
>>         "http://www.***********.com"
>>     ]
>>
>>     def parse(self, response):
>>         filename = response.url.split("/")[-2]
>>         with open(filename, 'wb') as f:
>>             f.write(response.body)
>>
>> scrapy crawl tm results in the following:
>>
>> :0: UserWarning: You do not have a working installation of the
>> service_identity module: 'No module named service_identity'. Please
>> install it from <https://pypi.python.org/pypi/service_identity> and make
>> sure all of its dependencies are satisfied. Without the service_identity
>> module and a recent enough pyOpenSSL to support it, Twisted can perform
>> only rudimentary TLS client hostname verification. Many valid
>> certificate/hostname mappings may be rejected.
>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl, http11
>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
>> 2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
>> 2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>> 2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET http://www.****************.com> (referer: None)
>> 2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET http://www.****************.com>
>> 	Traceback (most recent call last):
>> 	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
>> 	    self.runUntilCurrent()
>> 	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>> 	    call.func(*call.args, **call.kw)
>> 	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
>> 	    self._startRunCallbacks(result)
>> 	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
>> 	    self._runCallbacks()
>> 	--- <exception caught here> ---
>> 	  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
>> 	    current.result = callback(current.result, *args, **kw)
>> 	  File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>> 	    with open(filename, 'wb') as f:
>> 	exceptions.IOError: [Errno 2] No such file or directory: ''
>> 2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
>> 2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
>> 	{'downloader/request_bytes': 219,
>> 	 'downloader/request_count': 1,
>> 	 'downloader/request_method_count/GET': 1,
>> 	 'downloader/response_bytes': 73266,
>> 	 'downloader/response_count': 1,
>> 	 'downloader/response_status_count/200': 1,
>> 	 'finish_reason': 'finished',
>> 	 'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
>> 	 'log_count/DEBUG': 3,
>> 	 'log_count/ERROR': 1,
>> 	 'log_count/INFO': 7,
>> 	 'response_received_count': 1,
>> 	 'scheduler/dequeued': 1,
>> 	 'scheduler/dequeued/memory': 1,
>> 	 'scheduler/enqueued': 1,
>> 	 'scheduler/enqueued/memory': 1,
>> 	 'spider_exceptions/IOError': 1,
>> 	 'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
>> 2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
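For what it's worth, the empty filename looks like it comes from the URL split itself rather than from the site being JavaScript-rendered: splitting a bare domain URL (no path after the host) on "/" leaves an empty string between "http:" and the hostname, so [-2] picks up '' and open('', 'wb') raises exactly the IOError shown above. A quick check in a plain Python 2 shell, using a stand-in URL since the real one is masked in this thread, shows the difference from the dmoz tutorial URL:

    # Stand-in for the masked start URL (no path after the domain).
    url_bare = "http://www.example.com"
    # The dmoz tutorial URL, which does have a path.
    url_dmoz = "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"

    print url_bare.split("/")      # ['http:', '', 'www.example.com']
    print url_bare.split("/")[-2]  # '' -> the "filename" the spider tries to open
    print url_dmoz.split("/")[-2]  # 'Resources' -> why the dmoz example works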
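And a minimal sketch of how parse() could avoid the empty filename, assuming the goal for now is still just to dump the response body to disk. The domain is a placeholder for the masked site, and this only addresses the IOError, not extracting the JavaScript-rendered "Just Announced" content:

    import scrapy
    from urlparse import urlparse  # Python 2, matching the python2.7 paths in the traceback


    class TmSpider(scrapy.Spider):
        name = "tm"
        allowed_domains = ["www.example.com"]      # placeholder for the masked domain
        start_urls = ["http://www.example.com"]    # placeholder for the masked domain

        def parse(self, response):
            # Build the filename from the hostname, with a fixed fallback,
            # so a bare domain URL can never produce an empty string.
            filename = urlparse(response.url).netloc or "index.html"
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log("Saved %s" % filename)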
