Hi everybody,
following the hint here:
http://stackoverflow.com/questions/9681114/how-to-give-url-to-scrapy-for-crawling/9682714#9682714
I modified my spider in this way:
from scrapy.spider import Spider
from scrapy.selector import HtmlXPathSelector
from urls_listing.items import UrlsListingItem

class UrlsGraspingSpider(Spider):
    name = 'urls_grasping'
    #allowed_domains = ['sole24ore.com']
    #start_urls = [
    #    'http://www.sole24ore.com/'
    #]

    def __init__(self, *args, **kwargs):
        super(UrlsGraspingSpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = UrlsListingItem()
        url = hxs.select('//a[contains(@href, "http")]/@href').extract()
        item['url'] = url
        return item

SPIDER = UrlsGraspingSpider()
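One detail worth guarding in this pattern: kwargs.get('start_url') returns None when the argument is omitted, so start_urls silently becomes [None]. A minimal sketch (a plain stand-in class, not the real spider, which inherits from scrapy.Spider) of a defensive version:

```python
# Sketch only: StartUrlSketch is a made-up stand-in for the real spider.
# It shows a guard for a missing start_url argument, so that running the
# spider without -a start_url=... leaves start_urls empty instead of [None].
class StartUrlSketch(object):
    def __init__(self, **kwargs):
        start_url = kwargs.get('start_url')
        self.start_urls = [start_url] if start_url else []
```

With -a start_url=... the list holds that URL; without it, the spider simply has no start URLs rather than a None entry.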
and it seems to be working:
urls_listing$ scrapy crawl urls_grasping -a start_url="http://www.sole24ore.com"
2015-02-27 18:14:42+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: urls_listing)
2015-02-27 18:14:42+0100 [scrapy] INFO: Optional features available: ssl, http11
2015-02-27 18:14:42+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'urls_listing.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['urls_listing.spiders'], 'FEED_URI': 'file://home/marco/crawlscrape/urls_listing/output.json', 'BOT_NAME': 'urls_listing'}
2015-02-27 18:14:42+0100 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-02-27 18:14:42+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-02-27 18:14:42+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-02-27 18:14:42+0100 [scrapy] INFO: Enabled item pipelines:
2015-02-27 18:14:42+0100 [urls_grasping] INFO: Spider opened
2015-02-27 18:14:42+0100 [urls_grasping] ERROR: Error caught on signal handler: <bound method ?.open_spider of <scrapy.contrib.feedexport.FeedExporter object at 0x7f6cb86c4dd0>>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/Twisted-14.0.2-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1099, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.4-py2.7.egg/scrapy/core/engine.py", line 232, in open_spider
    yield self.signals.send_catch_log_deferred(signals.spider_opened, spider=spider)
  File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.4-py2.7.egg/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
    return signal.send_catch_log_deferred(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.4-py2.7.egg/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
    *arguments, **named)
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/Twisted-14.0.2-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 139, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.4-py2.7.egg/scrapy/xlib/pydispatch/robustapply.py", line 54, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.4-py2.7.egg/scrapy/contrib/feedexport.py", line 171, in open_spider
    file = storage.open(spider)
  File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.4-py2.7.egg/scrapy/contrib/feedexport.py", line 76, in open
    return open(self.path, 'ab')
exceptions.IOError: [Errno 13] Permission denied: '/marco/crawlscrape/urls_listing/output.json'
2015-02-27 18:14:42+0100 [urls_grasping] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-02-27 18:14:42+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-02-27 18:14:42+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-02-27 18:14:42+0100 [urls_grasping] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com>
2015-02-27 18:14:42+0100 [urls_grasping] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
/home/marco/crawlscrape/urls_listing/urls_listing/spiders/urls_grasping.py:31: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
  hxs = HtmlXPathSelector(response)
/home/marco/crawlscrape/urls_listing/urls_listing/spiders/urls_grasping.py:39: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  url = hxs.select('//a[contains(@href, "http")]/@href').extract()
/usr/local/lib/python2.7/dist-packages/Scrapy-0.24.4-py2.7.egg/scrapy/selector/unified.py:106: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
  for x in result]
2015-02-27 18:14:42+0100 [urls_grasping] DEBUG: Scraped from <200 http://www.ilsole24ore.com/>
{'url':
[u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/rating_legalita/index.shtml',
 u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/rating_legalita/index.shtml',
 u'http://www.ilsole24ore.com/cultura.shtml',
 u'http://www.casa24.ilsole24ore.com/',
 u'http://www.moda24.ilsole24ore.com/',
 ...]}
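A side note on the IOError in the log above: it is presumably caused by the FEED_URI setting 'file://home/marco/...'. With only two slashes, 'home' is parsed as the network location, so the path that actually gets opened is '/marco/...', which explains the Errno 13 on a path missing its /home prefix. A quick sketch with the standard library's URL parser (my assumption about how the feed storage resolves the path):

```python
# Why 'file://home/...' loses its '/home' prefix: with two slashes,
# 'home' becomes the netloc, not part of the path.
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2, as in the logs above

bad = urlparse('file://home/marco/crawlscrape/urls_listing/output.json')
good = urlparse('file:///home/marco/crawlscrape/urls_listing/output.json')

# bad.netloc  == 'home', bad.path  == '/marco/crawlscrape/urls_listing/output.json'
# good.netloc == '',     good.path == '/home/marco/crawlscrape/urls_listing/output.json'
```

So FEED_URI = 'file:///home/marco/crawlscrape/urls_listing/output.json' (three slashes) would likely avoid the permission error.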
I also managed to start the crawling process from a Python script, but without passing the URL to Scrapy. So in my spider:
class UrlsGraspingSpider(Spider):
    name = 'urls_grasping'
    start_urls = [
        'http://www.sole24ore.com/'
    ]
and the external Python script that launches it:
import subprocess

class SpiderActivation:
    def __init__(self, project_name, spider_name):
        self.project_name = project_name
        self.spider_name = spider_name
        self.project_name_recall = "project=%s" % (self.project_name)
        self.spider_name_recall = "spider=%s" % (self.spider_name)

    def spider_activation_meth(self):
        subprocess.call(["scrapyd-deploy", "urls_listing", "-p", "urls_listing"])
        subprocess.call(["curl", "http://localhost:6800/schedule.json",
                         "-d", self.project_name_recall, "-d", self.spider_name_recall])

    def get_spider_activated(self):
        return self.spider_activation_meth()

if __name__ == '__main__':
    nome_progetto = 'urls_listing'
    nome_spider = 'urls_grasping'
    url_iniziale = 'http://www.ilsole24ore.com'
    spider_activation = SpiderActivation(nome_progetto, nome_spider)
    #spider_activation = SpiderActivation(nome_progetto, nome_spider, url_iniziale)
    spider_activation.get_spider_activated()
so when I run this script:
time ./spiderActivation_iniziale.py
Packing version 1425061010
Deploying to project "urls_listing" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "urls_listing", "version": "1425061010", "spiders": 1}
{"status": "ok", "jobid": "ccf04074beac11e4a18fc04a00090e80"}
/var/lib/scrapyd/items/urls_listing/urls_grasping$ sudo emacs -nw ccf04074beac11e4a18fc04a00090e80.jl
[{"url":
["http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/rating_legalita/index.shtml",
 "http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/rating_legalita/index.shtml",
 "http://www.ilsole24ore.com/cultura.shtml",
 "http://www.casa24.ilsole24ore.com/",
 "http://www.moda24.ilsole24ore.com/",
 "http://food24.ilsole24ore.com/",
 "http://www.motori24.ilsole24ore.com/",
 "http://job24.ilsole24ore.com/",
 "http://stream24.ilsole24ore.com/",
 "http://www.viaggi24.ilsole24ore.com/",
 "http://www.salute24.ilsole24ore.com/",
 "http://www.shopping24.ilsole24ore.com/",
 "http://www.radio24.ilsole24ore.com/",
 "http://america24.com/",
 "http://meteo24.ilsole24ore.com/",
 "https://24orecloud.ilsole24ore.com/",
 "http://www.ilsole24ore.com/feed/agora/agora.shtml",
 "http://www.bs.ilsole24ore.com/",
 "http://nova.ilsole24ore.com/",
 "http://www.eventi.ilsole24ore.com/",
 "http://www.applicazioni-mobile.ilsole24ore.com/",
 "https://my24areautente.ilsole24ore.com",
 "http://www.ilsole24ore.com/",
 ...
But when I try to combine the two, that is, pass the URL to Scrapy from a Python script, something strange happens. This is the spider:
class UrlsGraspingSpider(Spider):
    name = 'urls_grasping'
    #allowed_domains = ['sole24ore.com']
    #start_urls = [
    #    'http://www.sole24ore.com/'
    #]

    def __init__(self, *args, **kwargs):
        super(UrlsGraspingSpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]
and this is the calling Python script:
import subprocess

class SpiderActivation:
    #def __init__(self, project_name, spider_name):
    def __init__(self, project_name, spider_name, url_to_start_from):
        self.project_name = project_name
        self.spider_name = spider_name
        self.url_to_start_from = url_to_start_from
        self.project_name_recall = "project=%s" % (self.project_name)
        self.spider_name_recall = "spider=%s" % (self.spider_name)

    def spider_activation_meth(self):
        # scrapyd must already be running: sudo stop scrapyd / sudo start scrapyd
        subprocess.call(["scrapyd-deploy", "urls_listing", "-p", "urls_listing"])
        #subprocess.call(["curl", "http://localhost:6800/schedule.json",
        #                 "-d", self.project_name_recall, "-d", self.spider_name_recall])
        subprocess.call(["curl", "http://localhost:6800/schedule.json",
                         "-d", self.project_name_recall, "-d", self.spider_name_recall,
                         self.url_to_start_from])

    def get_spider_activated(self):
        return self.spider_activation_meth()

if __name__ == '__main__':
    nome_progetto = 'urls_listing'
    nome_spider = 'urls_grasping'
    url_iniziale = 'http://www.ilsole24ore.com'
    #spider_activation = SpiderActivation(nome_progetto, nome_spider)
    spider_activation = SpiderActivation(nome_progetto, nome_spider, url_iniziale)
    spider_activation.get_spider_activated()
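For comparison, a hedged sketch of how I would expect the curl argument list to be built so that the URL actually travels as a spider argument (the helper name below is made up for illustration). As far as I understand, scrapyd's schedule.json forwards extra POST fields to the spider as arguments, so start_url needs its own "-d" field; a bare trailing URL is instead treated by curl as a second target to fetch:

```python
# Sketch: build the argument list for the curl call above.
# Each field must be a separate form field ("-d key=value"); a bare
# trailing URL would be fetched by curl itself as a second target.
def build_schedule_cmd(project_name, spider_name, start_url=None):
    cmd = ["curl", "http://localhost:6800/schedule.json",
           "-d", "project=%s" % project_name,
           "-d", "spider=%s" % spider_name]
    if start_url is not None:
        # extra fields are forwarded to the spider as arguments,
        # landing in the spider's kwargs as kwargs['start_url']
        cmd += ["-d", "start_url=%s" % start_url]
    return cmd
```

Then subprocess.call(build_schedule_cmd('urls_listing', 'urls_grasping', 'http://www.ilsole24ore.com')) would post all three fields to schedule.json.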
When I run it, this is the strange output I get:
marco@pc:~/crawlscrape/urls_listing$ time ./spiderActivation_iniziale.py
</h5>
</div>
</li>
<li>
<a
href="http://video.ilsole24ore.com/SoleOnLine5/Video/Notizie/Italia/SOLDI-VOSTRI/2015/02/soldi-vostri-25-febbraio/Soldi_Vostri_lo_Conte_25_2_2015.php"
name="&lid=ABN99U0C&lpos=mediacenter"><img class="stream24_ico"
width="25" src="/img2013/ico-video.png"><img width="90"
src="http://i.res.24o.it/images2010/Editrice/ILSOLE24ORE/ILSOLE24ORE/Online/Immagini/MediaCenter/Video/2015/02/schermata-2015-02-25-alle-15.14-khib--153x...@ilsole24ore-web.jpg"></a>
<div style="float:left;">
<p class="rubrica">
<a href="http://stream24.ilsole24ore.com/programmi/soldi-vostri/">SOLDI
VOSTRI</a>
</p>
<h5>
<a
href="http://video.ilsole24ore.com/SoleOnLine5/Video/Notizie/Italia/SOLDI-VOSTRI/2015/02/soldi-vostri-25-febbraio/Soldi_Vostri_lo_Conte_25_2_2015.php">Come
il petrolio a buon mercato sta ...
</a>
</h5>
</div>
</li>
[... many more near-identical <li> blocks of the page's raw HTML, snipped ...]
Instead of getting the usual list of URLs, I get the entire scraped web pages.
Looking forward to your kind help.
Marco
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.