Scrapy noob looking to make an (apparently not so) simple crawlspider to crawl news sites

Grant Basson Fri, 27 Mar 2015 02:46:19 -0700

Hi all,

as the subject suggests I'm a complete noob at web scraping, I've done all 
the usual googling, gone through the tutorial in the documentation, even 
watched a few tutorials on youtube and have now come up against a wall.


what i'm trying to achieve:

essentially I am looking to crawl news sites, looking for a particular 
search term, (or terms) and return; the link to the story, headline, first 
paragraph of the actual article and the date the article was published, and 
insert this into an mssql database. I've gotten it crawling a particular 
site but can't even seem to get any output to look for search terms.

what I've got so far:

#--------import the required classes-----

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from news24.items import News24Item


class News24SpiderSpider(CrawlSpider):
    name = 'news24_spider'
    allowed_domains = ['news24.com']
    start_urls = ['http://www.news24.com/']

#-------the news24.com doesn't seem to have many stories attached to it 
directly
#-------so I haven't defined the "allow" parameter for the rule

    rules = (Rule (SgmlLinkExtractor(allow=("news24.com/", ))
    , callback="parse_items", follow= True),
    )  

#-------the below i've copied from 
#-----http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VRUgBHkcQqg
#-----and changed appropriately

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        headlines = hxs.xpath("/@html")
        items = []
        for headlines in headlines:
            item = News24Item()
            item ["Headline"] = 
response.xpath('//*[@id="article_special"]//h1/text()').extract()
            item ["Article"] = 
response.xpath('//*[@id="article-body"]/p[1]/text()').extract()
            item ["Date"] = 
response.xpath('//*[@id="spnDate"]/text()').extract()
            item ["Link"] = headlines.select("a/@href").extract()
            items.append(item)
        return(items)


#-----end spider

What I get when I run the spider, (scrapy crawl news24_spider -o 
test.json) shows that it is indeed recursively scraping pages to a depth of 
2, (set in settings for testing purposes) and finding pages that SHOULD 
meet the xpath requirements set out above. when I open test.json however 
all I get is "[[[[[["

any help is appreciated.

Kind regards,
Grant

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

Scrapy noob looking to make an (apparently not so) simple crawlspider to crawl news sites

Reply via email to