I'm working on a project where I need to crawl an entire digital catalog and scrape only those responses relevant to my subject, based on whether they contain any member of a given set of keywords. The keyword set is also large enough that I'd prefer to keep it in a separate file and load it from within the spider. I'm having trouble making the transition from defining the keywords explicitly inside the crawler to pulling them in from another file. Simplified versions of my code for the two approaches follow below.
*Definition of keywords within spider (tested and working):*

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Rainbow_Crawler.items import RHPCatalogItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join

class RHPSpider(CrawlSpider):
    name = 'rainbow'
    allowed_domains = []
    start_urls = [
        'http://rainbowhistory.omeka.net/items/browse'
    ]

    rules = (
        Rule(LinkExtractor(allow=('items/show/.*',)),
             callback='parse_CatalogRecord',
             #follow=True  # commented out to produce a small test crawl
             ),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=RHPCatalogItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        if response.xpath('//*[text()[contains(.,"gay") or contains(.,"lesbian")]]'):
            CatalogRecord.add_xpath('title', './/div[@id="dublin-core-title"]/div[@class="element-text"]/text()')
            return CatalogRecord.load_item()

*External definition of keywords (runs without throwing errors, but reports 0 items crawled, 0 items scraped):*

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Rainbow_Crawler.items import RHPCatalogItem
from scrapy.loader.processors import TakeFirst
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join
import re

class RHPSpider(CrawlSpider):
    name = 'rainbow'
    allowed_domains = []
    start_urls = [
        'http://rainbowhistory.omeka.net/items/browse'
    ]

    rules = (
        Rule(LinkExtractor(allow=('items/show/.*',)),
             callback='parse_CatalogRecord',
             #follow=True  # commented out to produce a small test crawl
             ),
    )

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=RHPCatalogItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
        r = re.compile('.*(%s).*' % keywords, re.MULTILINE | re.UNICODE)
        if r.match(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/div[@id="dublin-core-title"]/div[@class="element-text"]/text()')
            return CatalogRecord.load_item()

I based the external-keywords approach on the answers I received to this Stack Overflow question: http://stackoverflow.com/questions/31899862/checking-text-for-the-presence-of-a-large-set-of-keywords/31905330#31905330. Implementing the regex solution given there verbatim raised errors, which I resolved by adding parentheses after word.strip and response.body_as_unicode so that they are called as methods. I'm stumped as to why Scrapy isn't finding any data to extract, though. I would really appreciate a pair of more experienced eyes. Thanks!
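
P.S. In case it helps anyone reproduce this, here is a minimal standalone sketch of just the matching step, runnable outside Scrapy. It assumes a keys.txt in the working directory with one keyword per line; the sample strings below are placeholders standing in for response.body_as_unicode():

import re

# Build one alternation pattern from the keyword file, exactly as in the
# spider above. keys.txt is assumed to hold one keyword per line, e.g.:
#   gay
#   lesbian
keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE | re.UNICODE)

# Single-line samples standing in for a response body
print(r.match(u'The Rainbow History Project documents lesbian community life.'))
# expected: a match object, assuming "lesbian" is one of the keywords
print(r.match(u'No relevant terms here.'))
# expected: None when no keyword appears in the line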
