Hey Kevin,
You don't have to use the Spider.start_urls attribute if you want to do
more than provide a static list of URLs. Define Spider.start_requests()
instead:
# Within your spider class
def start_requests(self):
    BASEURL = "http://website.com/"
    for line in open('urls.txt', 'r'):
        yield self.make_requests_from_url(BASEURL + line.rstrip('\n'))
Two notes:
- Scrapy expects start_requests() to return scrapy.Request objects, not
bare URLs, so we call the make_requests_from_url() method to do the
conversion (see the sketch below for roughly what that amounts to)
- Python keeps the newline character ('\n') when reading lines from a
file, so we strip it off with rstrip()
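If you'd rather be explicit, you can also build the Request objects
yourself; this is roughly what make_requests_from_url() does for you (a
minimal sketch, assuming your parsing callback is the usual parse()
method):

from scrapy.http import Request

# Within your spider class
def start_requests(self):
    BASEURL = "http://website.com/"
    for line in open('urls.txt', 'r'):
        # Build each Request by hand, naming the callback explicitly
        yield Request(BASEURL + line.rstrip('\n'), callback=self.parse)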
The above code will keep the file open for as long as your spider is
running, which may or may not be an issue depending on the length of the
file and how long parsing takes. You can avoid that (at the cost of memory
efficiency) by reading the whole file at once:
def start_requests(self):
    BASEURL = "http://website.com/"
    urls = [BASEURL + line.rstrip('\n') for line in open('urls.txt', 'r')]
    for url in urls:
        yield self.make_requests_from_url(url)
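If the lingering file handle bothers you, a with block gives you the best
of both: the file is read once and closed as soon as the list is built,
instead of waiting for garbage collection (again, just a sketch):

def start_requests(self):
    BASEURL = "http://website.com/"
    # The file is closed as soon as the with block ends, i.e. right
    # after the list comprehension has consumed it
    with open('urls.txt', 'r') as f:
        urls = [BASEURL + line.rstrip('\n') for line in f]
    for url in urls:
        yield self.make_requests_from_url(url)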
Here's the documentation:
http://doc.scrapy.org/en/latest/topics/spiders.html
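And in case it helps to see it all in one place, here's a minimal complete
spider using this approach (the class name and spider name are made up,
and I'm assuming a recent enough Scrapy that scrapy.Spider and
scrapy.Request are importable like this):

import scrapy

class UrlFileSpider(scrapy.Spider):
    name = 'urlfile'  # placeholder name, pick your own

    def start_requests(self):
        BASEURL = "http://website.com/"
        with open('urls.txt', 'r') as f:
            urls = [BASEURL + line.rstrip('\n') for line in f]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Your parsing logic goes here
        pass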
Cheers,
-Jakob