Hi Travis, thanks for responding.
Does it matter that the seed URLs I have are subdomains in their own right? Is there a way for Scrapy to handle seed URLs that are a mix of subdomain.domain.com and domain.com/subdomain, and still get every page on those sites without straying? (A rough sketch of the sort of spider I was picturing follows the quoted thread below.)

On Tuesday, 3 May 2016 18:15:06 UTC+1, Travis Leleu wrote:
>
> Addressing just point 3, you'll need to define your business logic. Most
> likely, if you don't limit the domains to crawl, you'll crawl forever.
>
> Personally, I tend to approach these types of scrapes in passes. As you
> mentioned, your NLP will be done separately, so your first pass will use
> Scrapy to cache every page on your 30+ seed sites. Build that, and get your
> data (the websites) on disk locally. Then you can do multiple passes for a
> controlled discovery of other sites. One technique, in your post-scrape
> processing, would be to look at links to other domains that have your
> keywords in them. Take those links and push them into a queue for further
> scraping (you might do an exploratory scrape of these new sites, maybe 20
> or so pages, and evaluate keyword density to determine whether it's worth
> scraping more).
>
> Sounds like you have a good start. I would advise you to think of Scrapy as
> your acquisition process, which could have multiple spiders in it. Cache
> to disk, then you can run and rerun your extraction to your heart's desire.
>
> On Tue, May 3, 2016 at 7:04 AM, <[email protected]> wrote:
>
>> Hi all,
>>
>> I came across Scrapy and I think it's ideal for my needs, but I'm not sure
>> exactly how to go about designing my spider(s).
>>
>> I need to crawl a number of specific websites (more than 30) and identify
>> pages and links that contain specific keywords. The number of keywords
>> will probably increase after an initial pass over the sites, but I want to
>> avoid putting excess load on the websites while still getting as much
>> relevant content as possible.
>> I'm not currently planning on using Scrapy to extract specific entities
>> during its crawl; I'd be happy just to get the pages and/or a list of URLs
>> that I can then feed into other processes for text mining later on.
>>
>> The websites:
>>
>> 1. are all on a specific subject;
>> 2. *don't share* the same platform;
>> 3. may link to external sites that could have useful resources containing
>> the keywords I need.
>>
>> Is it possible to provide keywords to Scrapy so that it crawls pages
>> containing them, in any particular order?
>>
>> Do you recommend I build multiple spiders? I was thinking a sitemap spider
>> might be a good starting point. Or is there a way to direct Scrapy to use
>> each site's sitemap as a starting point?
>>
>> Michael
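
For reference, here is a minimal, untested sketch of the sort of spider I was picturing for a mix of subdomain and path-based seeds, with the disk-cache settings from your suggestion. blog.example.com and example.org/wiki/ are hypothetical stand-ins for the real seed sites, and the spider name, callback name, delay and cache path are just placeholders:

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SeedSitesSpider(CrawlSpider):
    name = "seed_sites"

    # Keeps the crawl on these hosts (and their subdomains); off-site links
    # are dropped by Scrapy's offsite filtering.
    allowed_domains = ["blog.example.com", "example.org"]

    start_urls = [
        "https://blog.example.com/",  # a seed that is a subdomain in its own right
        "https://example.org/wiki/",  # a seed that is a path on a parent domain
    ]

    rules = (
        # Follow everything on the stand-alone subdomain.
        Rule(LinkExtractor(allow_domains=["blog.example.com"]),
             callback="parse_page", follow=True),
        # On the parent domain, only follow links under /wiki/ so the crawl
        # does not stray into the rest of example.org.
        Rule(LinkExtractor(allow=r"^https?://(www\.)?example\.org/wiki/"),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # First pass: just record the page; keyword/NLP work happens later
        # on the cached copies.
        yield {"url": response.url, "status": response.status}


if __name__ == "__main__":
    process = CrawlerProcess(settings={
        # Cache every response to disk so extraction passes can be rerun
        # without re-hitting the sites.
        "HTTPCACHE_ENABLED": True,
        "HTTPCACHE_DIR": "httpcache",
        "HTTPCACHE_EXPIRATION_SECS": 0,  # never expire
        # Be polite to the 30+ sites.
        "DOWNLOAD_DELAY": 1.0,
        "ROBOTSTXT_OBEY": True,
    })
    process.crawl(SeedSitesSpider)
    process.start()

The idea is that allowed_domains keeps the crawl on the stand-alone subdomains, while a path-restricted LinkExtractor rule stops it straying outside a section like /wiki/ on a shared parent domain. Does that look roughly right?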

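As for the exploratory-pass idea in the quoted reply, this is roughly the post-processing step I had in mind for scoring newly discovered external domains by keyword density before deciding whether to scrape them further. The keyword list and the pages mapping are placeholders, not anything from the thread:

import re
from collections import Counter
from urllib.parse import urlparse

KEYWORDS = {"keyword-one", "keyword-two"}  # hypothetical keyword list


def keyword_density(text):
    """Return the fraction of words in `text` that match a keyword."""
    words = re.findall(r"[a-z0-9-]+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in KEYWORDS) / len(words)


def score_domains(pages):
    """
    `pages` maps URL -> extracted page text (e.g. the ~20 pages from an
    exploratory scrape of a new site). Returns the average keyword density
    per domain.
    """
    totals, counts = Counter(), Counter()
    for url, text in pages.items():
        domain = urlparse(url).netloc
        totals[domain] += keyword_density(text)
        counts[domain] += 1
    return {domain: totals[domain] / counts[domain] for domain in totals}

Domains scoring above some threshold would then be pushed back into the crawl queue for a fuller pass, as you described.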