Hi all,
as the subject suggests I'm a complete noob at web scraping, I've done all
the usual googling, gone through the tutorial in the documentation, even
watched a few tutorials on youtube and have now come up against a wall.
what i'm trying to achieve:
essentially I am looking to crawl news sites, looking for a particular
search term, (or terms) and return; the link to the story, headline, first
paragraph of the actual article and the date the article was published, and
insert this into an mssql database. I've gotten it crawling a particular
site but can't even seem to get any output to look for search terms.
what I've got so far:
#--------import the required classes-----
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from news24.items import News24Item
class News24SpiderSpider(CrawlSpider):
name = 'news24_spider'
allowed_domains = ['news24.com']
start_urls = ['http://www.news24.com/']
#-------the news24.com doesn't seem to have many stories attached to it
directly
#-------so I haven't defined the "allow" parameter for the rule
rules = (Rule (SgmlLinkExtractor(allow=("news24.com/", ))
, callback="parse_items", follow= True),
)
#-------the below i've copied from
#-----http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VRUgBHkcQqg
#-----and changed appropriately
def parse_items(self, response):
hxs = HtmlXPathSelector(response)
headlines = hxs.xpath("/@html")
items = []
for headlines in headlines:
item = News24Item()
item ["Headline"] =
response.xpath('//*[@id="article_special"]//h1/text()').extract()
item ["Article"] =
response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
item ["Date"] =
response.xpath('//*[@id="spnDate"]/text()').extract()
item ["Link"] = headlines.select("a/@href").extract()
items.append(item)
return(items)
#-----end spider
What I get when I run the spider, (scrapy crawl news24_spider -o
test.json) shows that it is indeed recursively scraping pages to a depth of
2, (set in settings for testing purposes) and finding pages that SHOULD
meet the xpath requirements set out above. when I open test.json however
all I get is "[[[[[["
any help is appreciated.
Kind regards,
Grant
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.