Hey guys, I'm new to Scrapy and trying to implement a broad crawl. My goal is to visit all internal links of any given website, avoid duplicates, and save the body text. E.g. for a website example.com, I want to visit all static URLs of example.com, such as example.com/A, example.com/A/zxc, example.com/A/zxc/f,
and do the same for another domain, such as exampleB.com/A, exampleB.com/A/zxc, exampleB.com/A/zxc/f, and so on. The same rule should apply to every link I retrieve from my database: don't traverse links that were already visited, and end the crawl of a website once there are no more internal links left to visit. Can anyone guide me on how to achieve this?

What I tried:

import re

import MySQLdb
import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class PHscrapy(scrapy.Spider):
    name = "PHscrapy"

    def start_requests(self):
        # Seed the crawl with every website stored in the database.
        db = MySQLdb.connect("localhost", "****", "****", "***")
        cursor = db.cursor()
        cursor.execute("SELECT website FROM SHOPPING")
        links = cursor.fetchall()
        for url in links:
            yield scrapy.Request(url=url[0], meta={'base_url': url[0]}, callback=self.parse)

    def parse(self, response):
        base_url = response.meta['base_url']
        # Follow only links under the seed URL; re.escape stops the dots
        # in the domain from acting as regex wildcards in `allow`.
        extractor = LxmlLinkExtractor(allow=re.escape(base_url), unique=True, canonicalize=True)
        for link in extractor.extract_links(response):
            print(link.url)
            yield scrapy.Request(link.url, callback=self.parse, meta=response.meta)
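Here is a sketch of what I'm imagining the working version looks like, pieced together from the docs (the spider name and the item fields are placeholders I made up, and I'm assuming Python 3 with the same MySQL table as above), in case it helps clarify what I'm after:

import MySQLdb
import scrapy
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse


class BodyTextSpider(scrapy.Spider):
    name = "bodytext"  # placeholder name

    def start_requests(self):
        db = MySQLdb.connect("localhost", "****", "****", "***")
        cursor = db.cursor()
        cursor.execute("SELECT website FROM SHOPPING")
        for (url,) in cursor.fetchall():
            # Remember which domain this seed belongs to, so each
            # crawl stays inside its own site.
            domain = urlparse(url).netloc
            yield scrapy.Request(url, callback=self.parse, meta={'domain': domain})

    def parse(self, response):
        # Save the page's visible body text as an item.
        yield {
            'url': response.url,
            'body_text': ' '.join(response.xpath('//body//text()').extract()),
        }
        # Follow only links on the same domain as the seed URL.
        extractor = LinkExtractor(allow_domains=[response.meta['domain']])
        for link in extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse, meta=response.meta)

From what I've read, Scrapy's default dupefilter should already take care of the "don't visit the same link twice" part as long as I don't pass dont_filter=True, and the spider should close on its own once the scheduler runs out of requests. Is that understanding correct?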
