I'm working on a project where I need to crawl an entire digital catalog and scrape only those responses relevant to my subject, based on whether they contain any member of a given set of keywords. The keyword set is also large enough that I'd prefer to keep it in a separate file and load it from within the spider. I'm having trouble making the transition from defining the keywords explicitly inside the crawler to pulling them in from another file. Simplified versions of my code for the two approaches follow below.
*Definition of keywords within spider (tested and working):*

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Rainbow_Crawler.items import RHPCatalogItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join

class RHPSpider(CrawlSpider):
    name = 'rainbow'
    allowed_domains = []
    start_urls = [
        'http://rainbowhistory.omeka.net/items/browse'
    ]

    rules = (
        Rule(LinkExtractor(allow=('items/show/.*',)),
             callback='parse_CatalogRecord',
             #follow=True  # commented out to produce a small test crawl
             ),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=RHPCatalogItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        if response.xpath('//*[text()[contains(.,"gay") or contains(.,"lesbian")]]'):
            CatalogRecord.add_xpath('title', './/div[@id="dublin-core-title"]/div[@class="element-text"]/text()')
            return CatalogRecord.load_item()

*External definition of keywords (runs without throwing errors, but reports 0 items crawled, 0 items scraped):*

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Rainbow_Crawler.items import RHPCatalogItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
import re

class RHPSpider(CrawlSpider):
    name = 'rainbow'
    allowed_domains = []
    start_urls = [
        'http://rainbowhistory.omeka.net/items/browse'
    ]

    rules = (
        Rule(LinkExtractor(allow=('items/show/.*',)),
             callback='parse_CatalogRecord',
             #follow=True  # commented out to produce a small test crawl
             ),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=RHPCatalogItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
        r = re.compile('.*(%s).*' % keywords, re.MULTILINE | re.UNICODE)
        if r.match(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/div[@id="dublin-core-title"]/div[@class="element-text"]/text()')
            return CatalogRecord.load_item()

I based the external-keywords approach on the answers I received to this Stack Overflow question: http://stackoverflow.com/questions/31899862/checking-text-for-the-presence-of-a-large-set-of-keywords/31905330#31905330. Implementing the regex solution given there verbatim raised errors, which I resolved by adding parentheses after word.strip and response.body_as_unicode so that they are called as methods. I'm stumped as to why Scrapy isn't finding any data to extract, though. I would really appreciate a pair of more experienced eyes. Thanks!
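
P.S. In case it helps anyone reproduce this, here is a minimal standalone sketch of just the matching step, runnable outside Scrapy. It assumes a keys.txt in the working directory with one keyword per line; the sample strings below are placeholders standing in for response.body_as_unicode():

import re

# Build one alternation pattern from the keyword file, exactly as in the
# spider above. keys.txt is assumed to hold one keyword per line, e.g.:
#   gay
#   lesbian
keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE | re.UNICODE)

# Single-line samples standing in for a response body
print(r.match(u'The Rainbow History Project documents lesbian community life.'))
# expected: a match object, assuming "lesbian" is one of the keywords
print(r.match(u'No relevant terms here.'))
# expected: None when no keyword appears in the line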
