Hi everybody,

following the instructions here:
http://scrapy.readthedocs.org/en/0.18/topics/practices.html

where "from testspiders.spiders.followall import FollowAllSpider" means:
import the class "FollowAllSpider" defined in the file followall.py, which is
located in the folder testspiders/spiders.

I'm trying to move my working scraper into a standalone script.

So this is my file:
#!/usr/bin/python

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from sole24ore.sole24ore.spiders.sole import SoleSpider

spider = SoleSpider(domain='sole24ore.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal is sent

But when I run the script, I get:
python soleLinksScrapy.py
Traceback (most recent call last):
  File "soleLinksScrapy.py", line 25, in <module>
    from sole24ore.sole24ore.spiders.sole import SoleSpider
  File "/home/ubuntu/ggc/prove/sole24ore/sole24ore/spiders/sole.py", line 
6, in <module>
    from sole24ore.items import Sole24OreItem
ImportError: No module named items
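
I suspect the problem is that the Scrapy project root is not on sys.path when the script runs from ~/ggc/prove, so "sole24ore.items" cannot be resolved as a top-level package. A minimal sketch of what I mean (the folder names are taken from my setup; whether this is the intended approach, I'm not sure):

```python
import os
import sys

# Assumption: this script sits in ~/ggc/prove, and the Scrapy project root
# (the folder containing scrapy.cfg) is the sibling folder ./sole24ore
project_root = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'sole24ore')
sys.path.insert(0, project_root)

# With the project root first on sys.path, the inner 'sole24ore' package
# is importable as a top-level package, so 'from sole24ore.items import
# Sole24OreItem' inside sole.py would resolve the same way it does when
# running 'scrapy crawl' from inside the project folder.
```

With that in place, the spider would presumably be imported as "from sole24ore.spiders.sole import SoleSpider" (one package level less than in my traceback), since the project root itself is no longer treated as a package.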

The scraper itself works fine when I run it from inside its own folder:
scrapy crawl sole
2014-02-08 17:00:52+0000 [scrapy] INFO: Scrapy 0.18.4 started (bot: 
sole24ore)
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Optional features available: ssl, 
http11, boto
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Overridden settings: 
{'NEWSPIDER_MODULE': 'sole24ore.spiders', 'SPIDER_MODULES': 
['sole24ore.spiders'], 'BOT_NAME': 'sole24ore'}
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Enabled extensions: LogStats, 
TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Enabled downloader middlewares: 
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, 
RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, 
HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, 
ChunkedTransferMiddleware, DownloaderStats
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Enabled spider middlewares: 
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, 
UrlLengthMiddleware, DepthMiddleware
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Enabled item pipelines:
2014-02-08 17:00:52+0000 [sole] INFO: Spider opened
2014-02-08 17:00:52+0000 [sole] INFO: Crawled 0 pages (at 0 pages/min), 
scraped 0 items (at 0 items/min)
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Telnet console listening on 
0.0.0.0:6023
2014-02-08 17:00:52+0000 [scrapy] DEBUG: Web service listening on 
0.0.0.0:6080
2014-02-08 17:00:53+0000 [sole] DEBUG: Redirecting (301) to <GET 
http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
2014-02-08 17:00:53+0000 [sole] DEBUG: Crawled (200) <GET 
http://www.ilsole24ore.com/> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html 
xmlns="http://www.w3.org/1999/xhtm'>
[s]   item       {}
[s]   request    <GET http://www.ilsole24ore.com/>
[s]   response   <200 http://www.ilsole24ore.com/>
[s]   settings   <CrawlerSettings module=<module 'sole24ore.settings' from 
'/home/ubuntu/ggc/prove/sole24ore/sole24ore/settings.pyc'>>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

In [1]:
Do you really want to exit ([y]/n)? y

2014-02-08 17:00:58+0000 [sole] DEBUG: Scraped from <200 
http://www.ilsole24ore.com/>
        {'url': [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
                 u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
                 u'http://www.ilsole24ore.com/cultura.shtml',
                 u'http://www.casa24.ilsole24ore.com/',
                 u'http://www.moda24.ilsole24ore.com/',
                 u'http://food24.ilsole24ore.com/',
                 u'http://www.motori24.ilsole24ore.com/',
                 u'http://job24.ilsole24ore.com/',
                 u'http://stream24.ilsole24ore.com/',
                 u'http://www.viaggi24.ilsole24ore.com/',
                 u'http://www.salute24.ilsole24ore.com/',
                 u'http://www.shopping24.ilsole24ore.com/',
                 u'http://www.radio24.ilsole24ore.com/',

The scraper project lives in the 'sole24ore' folder, which is in
~/ggc/prove/sole24ore, while the script I would like to get working is in
~/ggc/prove.
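
To make the layout concrete, this is my directory structure as I understand it (assuming the usual Scrapy project skeleton; the per-file comments are my reading of it):

```
~/ggc/prove/
+-- soleLinksScrapy.py      # the standalone script above
+-- sole24ore/              # Scrapy project root (contains scrapy.cfg)
    +-- scrapy.cfg
    +-- sole24ore/          # the Python package
        +-- items.py        # defines Sole24OreItem
        +-- settings.py
        +-- spiders/
            +-- sole.py     # defines SoleSpider
```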

Any hints?

Thanks for your help.
Kind regards.
Marco

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.