Your recent debug output doesn't have that error, so you must have fixed it. (For anyone who hits the same IOError later, the sketch below shows where the empty filename came from.)
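In case it helps others who find this thread: that IOError happened because, for a bare URL with no path and no trailing slash, response.url.split("/")[-2] is the empty string, so the spider effectively called open('', 'wb'). A minimal sketch of the failure and one possible fix (Python 2 to match your logs; the example URL and the urlparse-based fallback are illustrations, not the only way to do it):

    from urlparse import urlparse  # Python 2 stdlib

    url = "http://www.example.com"   # placeholder; no path, no trailing slash
    print url.split("/")             # ['http:', '', 'www.example.com']
    print repr(url.split("/")[-2])   # '' -- hence IOError: No such file or directory: ''

    # One safer derivation: use the last path segment, falling back to the hostname.
    parsed = urlparse(url)
    filename = parsed.path.strip("/").split("/")[-1] or parsed.netloc
    print filename                   # 'www.example.com'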
The current error feels like it's either a JavaScript-loaded page, or you're getting blocked from scraping by the server. Google around for how to scrape a JavaScript page with Scrapy, and for using a proxy. Those guides will be your friend; two rough sketches follow.
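To make those pointers concrete, here is a rough sketch of the JavaScript route, using Selenium to render the page and then handing the HTML back to Scrapy's selectors. This assumes Selenium and a Firefox driver are installed; the spider name and URL are placeholders:

    import scrapy
    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class RenderedSpider(scrapy.Spider):
        name = "rendered"                        # placeholder name
        start_urls = ["http://www.example.com"]  # placeholder URL

        def parse(self, response):
            # Re-fetch the page with a real browser so its JavaScript runs,
            # then wrap the rendered HTML so the usual selectors work on it.
            driver = webdriver.Firefox()
            try:
                driver.get(response.url)
                rendered = HtmlResponse(url=response.url,
                                        body=driver.page_source,
                                        encoding='utf-8')
            finally:
                driver.quit()
            # Extract from `rendered` with .xpath()/.css() from here.

Rendering inside the parse callback keeps the sketch short, but spinning up a browser per page is slow; moving it into a downloader middleware would be the next step if this approach works for you. And for the proxy route: Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta, so overriding start_requests is enough. The address below is a placeholder for whatever proxy you actually use:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'proxy': 'http://127.0.0.1:8080'})  # placeholder proxy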
On Thu, Apr 2, 2015 at 12:58 PM, Troy Perkins <[email protected]> wrote:

> Hi Travis, thanks for the response. Not sure why it's not able to find it,
> it's there, see below:
>
> pawnbahnimac:spiders pawnbahn$ pwd
> /Users/pawnbahn/tm/tm/spiders
> pawnbahnimac:spiders pawnbahn$ ls
> Books  Resources  __init__.py  __init__.pyc  items.json  tm_spider.py  tm_spider.pyc
> pawnbahnimac:spiders pawnbahn$
>
> It only behaves like this on this site for some reason. Running the dmoz
> example works fine.
>
> pawnbahnimac:spiders pawnbahn$ scrapy crawl tm
> :0: UserWarning: You do not have a working installation of the
> service_identity module: 'No module named service_identity'. Please
> install it from <https://pypi.python.org/pypi/service_identity> and make
> sure all of its dependencies are satisfied. Without the service_identity
> module and a recent enough pyOpenSSL to support it, Twisted can perform
> only rudimentary TLS client hostname verification. Many valid
> certificate/hostname mappings may be rejected.
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Optional features available: ssl, http11
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
> 2015-04-02 14:56:01-0500 [scrapy] INFO: Enabled item pipelines:
> 2015-04-02 14:56:01-0500 [tm] INFO: Spider opened
> 2015-04-02 14:56:01-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
> 2015-04-02 14:56:01-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
> 2015-04-02 14:56:01-0500 [tm] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
> 2015-04-02 14:56:01-0500 [tm] INFO: Closing spider (finished)
> 2015-04-02 14:56:01-0500 [tm] INFO: Dumping Scrapy stats:
>     {'downloader/request_bytes': 260,
>      'downloader/request_count': 1,
>      'downloader/request_method_count/GET': 1,
>      'downloader/response_bytes': 6234,
>      'downloader/response_count': 1,
>      'downloader/response_status_count/200': 1,
>      'finish_reason': 'finished',
>      'finish_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 861714),
>      'log_count/DEBUG': 3,
>      'log_count/INFO': 7,
>      'response_received_count': 1,
>      'scheduler/dequeued': 1,
>      'scheduler/dequeued/memory': 1,
>      'scheduler/enqueued': 1,
>      'scheduler/enqueued/memory': 1,
>      'start_time': datetime.datetime(2015, 4, 2, 19, 56, 1, 494696)}
> 2015-04-02 14:56:01-0500 [tm] INFO: Spider closed (finished)
>
> On Thursday, April 2, 2015 at 11:30:41 AM UTC-5, Travis Leleu wrote:
>>
>> Python can't find the file whose path is stored in filename, used on
>> line 13 of your spider. Read your Scrapy debug output for more
>> information.
>>
>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>     with open(filename, 'wb') as f:
>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>
>> On Wed, Apr 1, 2015 at 10:38 PM, Troy Perkins <[email protected]> wrote:
>>
>>> Greetings all:
>>>
>>> I'm new to Scrapy and managed to get everything installed and working.
>>> However, my simple test project has proven not so simple, at least for me.
>>>
>>> I simply want to request the home page of t 1 c k e t m a s t e r
>>> d o t c o m, click the red Just Announced tab down the middle of the
>>> page, and send the list of results out to an email address once a day
>>> via cron. I want to be able to keep up with the announcements because
>>> their mailing lists simply don't send them soon enough.
>>>
>>> Here is my starting spider, which I've tested with other sites, and it
>>> works fine. I believe the error is due to it being a JavaScript-rendered
>>> site. I've used Firebug to look for clues, but I'm too new at this to
>>> understand it, as well as to understand JavaScript. I'm hoping someone
>>> would be willing to point this noob in a direction. I've also tried
>>> removing middleware in the settings.py file, with the same results.
>>>
>>> I've purposely masked out the site address; though I don't mean any
>>> harm, I'm not quite sure of their ToS as of yet. I plan to poll once a
>>> day anyway, for personal use.
>>>
>>> import scrapy
>>>
>>> from tm.items import TmItem
>>>
>>> class TmSpider(scrapy.Spider):
>>>     name = "tm"
>>>     allowed_domains = ["www.************.com"]
>>>     start_urls = [
>>>         "http://www.***********.com"
>>>     ]
>>>
>>>     def parse(self, response):
>>>         filename = response.url.split("/")[-2]
>>>         with open(filename, 'wb') as f:
>>>             f.write(response.body)
>>>
>>> scrapy crawl tm results in the following:
>>>
>>> :0: UserWarning: You do not have a working installation of the
>>> service_identity module: 'No module named service_identity'. Please
>>> install it from <https://pypi.python.org/pypi/service_identity> and
>>> make sure all of its dependencies are satisfied. Without the
>>> service_identity module and a recent enough pyOpenSSL to support it,
>>> Twisted can perform only rudimentary TLS client hostname verification.
>>> Many valid certificate/hostname mappings may be rejected.
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Scrapy 0.24.5 started (bot: tm)
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Optional features available: ssl, http11
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tm.spiders', 'SPIDER_MODULES': ['tm.spiders'], 'BOT_NAME': 'tm'}
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>> 2015-04-02 00:30:12-0500 [scrapy] INFO: Enabled item pipelines:
>>> 2015-04-02 00:30:12-0500 [tm] INFO: Spider opened
>>> 2015-04-02 00:30:12-0500 [tm] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
>>> 2015-04-02 00:30:12-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
>>> 2015-04-02 00:30:13-0500 [tm] DEBUG: Crawled (200) <GET http://www.****************com> (referer: None)
>>> 2015-04-02 00:30:13-0500 [tm] ERROR: Spider error processing <GET http://www.****************.com>
>>> Traceback (most recent call last):
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1201, in mainLoop
>>>     self.runUntilCurrent()
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
>>>     call.func(*call.args, **call.kw)
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 383, in callback
>>>     self._startRunCallbacks(result)
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 491, in _startRunCallbacks
>>>     self._runCallbacks()
>>> --- <exception caught here> ---
>>>   File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 578, in _runCallbacks
>>>     current.result = callback(current.result, *args, **kw)
>>>   File "/Users/pawnbahn/tm/tm/spiders/tm_spider.py", line 13, in parse
>>>     with open(filename, 'wb') as f:
>>> exceptions.IOError: [Errno 2] No such file or directory: ''
>>> 2015-04-02 00:30:13-0500 [tm] INFO: Closing spider (finished)
>>> 2015-04-02 00:30:13-0500 [tm] INFO: Dumping Scrapy stats:
>>>     {'downloader/request_bytes': 219,
>>>      'downloader/request_count': 1,
>>>      'downloader/request_method_count/GET': 1,
>>>      'downloader/response_bytes': 73266,
>>>      'downloader/response_count': 1,
>>>      'downloader/response_status_count/200': 1,
>>>      'finish_reason': 'finished',
>>>      'finish_time': datetime.datetime(2015, 4, 2, 5, 30, 13, 3001),
>>>      'log_count/DEBUG': 3,
>>>      'log_count/ERROR': 1,
>>>      'log_count/INFO': 7,
>>>      'response_received_count': 1,
>>>      'scheduler/dequeued': 1,
>>>      'scheduler/dequeued/memory': 1,
>>>      'scheduler/enqueued': 1,
>>>      'scheduler/enqueued/memory': 1,
>>>      'spider_exceptions/IOError': 1,
>>>      'start_time': datetime.datetime(2015, 4, 2, 5, 30, 12, 344868)}
>>> 2015-04-02 00:30:13-0500 [tm] INFO: Spider closed (finished)
Groups "scrapy-users" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To post to this group, send email to [email protected]. >>> Visit this group at http://groups.google.com/group/scrapy-users. >>> For more options, visit https://groups.google.com/d/optout. >>> >> >> -- > You received this message because you are subscribed to the Google Groups > "scrapy-users" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at http://groups.google.com/group/scrapy-users. > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "scrapy-users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/scrapy-users. For more options, visit https://groups.google.com/d/optout.
