I've been working on downloading all PDF files from a site in one go using Scrapy.
I can't understand why it isn't downloading any files, even though I don't see
an error in the code.
Here's my code.
My spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from cs.items import CsItem
import scrapy

class CsSpider(CrawlSpider):
    name = "cs"
    allowed_domains = ["cs.org"]
    start_urls = [
        "http://cs.org/projects.html",
    ]
    # allow_domains expects domain names, not full URLs
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('cs.org',)),
             callback='parse_urls', follow=True),
    )

    def parse_urls(self, response):
        hxs = HtmlXPathSelector(response)
        item = CsItem()
        item['pdf_urls'] = hxs.select('//a/@href').extract()
        for url in item['pdf_urls']:
            yield scrapy.Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        # the item isn't available here, so derive the path from the URL
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)
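One problem I can see in hindsight: save_pdf calls self.get_path, which isn't defined anywhere, so the first downloaded response would raise an AttributeError. A minimal sketch of such a helper (Python 3 imports; FILES_STORE here is just a stand-in for the real download directory):

```python
import os
from urllib.parse import urlparse  # module is named 'urlparse' on Python 2

FILES_STORE = 'downloads'  # stand-in for the real download directory

def get_path(url):
    # Name the file after the last path segment of the URL,
    # falling back to a fixed name when the URL has no usable segment.
    name = os.path.basename(urlparse(url).path) or 'download.pdf'
    return os.path.join(FILES_STORE, name)

print(get_path('http://cs.org/papers/report.pdf'))
```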
items.py:
import scrapy
from scrapy.item import Item, Field

class CsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pdf_urls = Field()
    files = Field()
settings.py:
BOT_NAME = 'cs'
SPIDER_MODULES = ['cs.spiders']
NEWSPIDER_MODULE = 'cs.spiders'
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.files.FilesPipeline': 1,
}
FILES_STORE = '/home/amitoj/Projects/Scrapy/PDFScraper/cs/cs/downloads'
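One thing I'm not sure about: as far as I can tell, FilesPipeline reads URLs from a field named file_urls by default and stores its results in files, so an item that only populates pdf_urls may be silently ignored. If that's right, the settings would also need something like this (assuming these setting names are honored by the Scrapy version in use):

```python
# settings.py -- point FilesPipeline at the custom item fields
FILES_URLS_FIELD = 'pdf_urls'
FILES_RESULT_FIELD = 'files'
```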
I'd appreciate help finding any loopholes in the code, as well as other
(better) ways of downloading multiple PDF files.
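For what it's worth, here's the kind of link handling I suspect is missing: the hrefs from //a/@href are usually relative, so they need to be made absolute before being requested, and non-PDF links filtered out. A quick sketch (pdf_links is a name I made up for this example):

```python
from urllib.parse import urljoin  # module is named 'urlparse' on Python 2

def pdf_links(base_url, hrefs):
    # Keep only .pdf links and make them absolute against the page URL
    return [urljoin(base_url, h) for h in hrefs if h.lower().endswith('.pdf')]

print(pdf_links('http://cs.org/projects.html',
                ['a.pdf', 'about.html', '/docs/b.pdf']))
# → ['http://cs.org/a.pdf', 'http://cs.org/docs/b.pdf']
```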
Thanks,
Amitoj
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.