Hey Kevin,
You don't have to use the Spider.start_urls attribute if you want to do
more than provide a static list of URLs. Define Spider.start_requests()
instead:
# Within your spider class
def start_requests(self):
    BASEURL = "http://website.com/"
    for line in open('urls.txt', 'r'):
        yield self.make_requests_from_url(BASEURL + line.rstrip('\n'))
Two notes:
- Scrapy expects start_requests() to return scrapy.Request objects, not
bare URLs, so we call the make_requests_from_url() method to do the
conversion (see the sketch below for roughly what that amounts to)
- Python keeps the newline character ('\n') when reading lines from a
file, so we strip it off with rstrip()
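If you'd rather be explicit, you can also build the Request objects
yourself; this is roughly what make_requests_from_url() does for you (a
minimal sketch, assuming your parsing callback is the usual parse()
method):

from scrapy.http import Request

# Within your spider class
def start_requests(self):
    BASEURL = "http://website.com/"
    for line in open('urls.txt', 'r'):
        # Build each Request by hand, naming the callback explicitly
        yield Request(BASEURL + line.rstrip('\n'), callback=self.parse)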
The above code will keep the file open for as long as your spider is
running, which may or may not be an issue depending on the length of the
file and how long parsing takes. You can avoid that (at the cost of memory
efficiency) by reading the whole file at once:
def start_requests(self):
    BASEURL = "http://website.com/"
    urls = [BASEURL + line.rstrip('\n') for line in open('urls.txt', 'r')]
    for url in urls:
        yield self.make_requests_from_url(url)
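If the lingering file handle bothers you, a with block gives you the best
of both: the file is read once and closed as soon as the list is built,
instead of waiting for garbage collection (again, just a sketch):

def start_requests(self):
    BASEURL = "http://website.com/"
    # The file is closed as soon as the with block ends, i.e. right
    # after the list comprehension has consumed it
    with open('urls.txt', 'r') as f:
        urls = [BASEURL + line.rstrip('\n') for line in f]
    for url in urls:
        yield self.make_requests_from_url(url)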
Here's the documentation:
http://doc.scrapy.org/en/latest/topics/spiders.html
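And in case it helps to see it all in one place, here's a minimal complete
spider using this approach (the class name and spider name are made up,
and I'm assuming a recent enough Scrapy that scrapy.Spider and
scrapy.Request are importable like this):

import scrapy

class UrlFileSpider(scrapy.Spider):
    name = 'urlfile'  # placeholder name, pick your own

    def start_requests(self):
        BASEURL = "http://website.com/"
        with open('urls.txt', 'r') as f:
            urls = [BASEURL + line.rstrip('\n') for line in f]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Your parsing logic goes here
        pass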
Cheers,
-Jakob