Hi all,

I came across Scrapy and I think it's ideal for my needs, but I'm not sure 
exactly how to go about designing my spider(s).

I need to crawl a number of specific websites (more than 30) and identify 
pages and links that contain specific keywords. The number of keywords 
will probably increase after an initial pass over the sites, but I want to 
avoid putting excess load on the websites while still getting as much 
relevant content as possible. 
I'm not currently planning on using Scrapy to extract specific entities 
during the crawl; I'd just be happy to get the pages and/or a list of URLs 
that I can then feed into other processes for text mining later on.

The websites 

   1. are all on a specific subject, 
   2. *don't share* the same platform, and 
   3. may link to external sites that could have useful resources 
   containing the keywords I need.

Is it possible to provide keywords to Scrapy so that it crawls pages 
containing them in any particular order?
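
To make the question concrete, here's roughly what I had in mind (the 
domain, keywords, and spider name below are just placeholders, not a 
working solution): a spider that follows links as usual but only records 
the URLs of pages whose text contains at least one of my keywords.

import scrapy
from scrapy.linkextractors import LinkExtractor


class KeywordSpider(scrapy.Spider):
    # Placeholder name, domain and keywords, for illustration only
    name = "keyword_crawler"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]
    custom_settings = {"DOWNLOAD_DELAY": 2}  # crawl slowly to limit load

    KEYWORDS = ["keyword one", "keyword two"]
    link_extractor = LinkExtractor()

    def parse(self, response):
        page_text = response.text.lower()
        # Record the page only if it mentions at least one keyword
        matched = [kw for kw in self.KEYWORDS if kw in page_text]
        if matched:
            yield {"url": response.url, "matched": matched}
        # Keep following links either way, so keyword pages deeper in
        # the site are still reached
        for link in self.link_extractor.extract_links(response):
            yield response.follow(link, callback=self.parse)

The DOWNLOAD_DELAY is there because I'd rather crawl slowly than put too 
much load on any one site.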

Do you recommend I build multiple spiders? I was thinking a sitemap 
spider might be a good starting point. Or is there a way to direct Scrapy 
to use the sitemap for each site as a starting point?
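
For the sitemap idea, this is roughly the shape I was picturing, with one 
spider pointed at each site's sitemap and the same keyword check reused 
(the name and sitemap URLs are placeholders):

from scrapy.spiders import SitemapSpider


class KeywordSitemapSpider(SitemapSpider):
    # Placeholder name and sitemap URLs
    name = "keyword_sitemap"
    sitemap_urls = [
        "https://example-site-1.com/sitemap.xml",
        "https://example-site-2.com/sitemap.xml",
    ]

    KEYWORDS = ["keyword one", "keyword two"]

    def parse(self, response):
        # Every URL found in the sitemaps ends up here by default
        page_text = response.text.lower()
        if any(kw in page_text for kw in self.KEYWORDS):
            yield {"url": response.url}

I don't know whether one spider covering 30+ sitemaps like this is 
sensible, or whether a separate spider per site would be cleaner.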

Michael
