Hi Travis,

Thanks for responding. 

Does it matter that the seed URLs I have are subdomains in their own right? 
Is there a way for scrapy to handle seed URLs that might be a mix of 
subdomain.domain.com and domain.com/subdomain and still get every page on 
those sites without straying off them?
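
To make the question concrete, here's roughly what I'm picturing (untested, 
and the example.com / example.org hosts and paths are placeholders for my 
real seeds):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SeedSpider(CrawlSpider):
    name = "seeds"

    # one seed is a subdomain in its own right, the other is a path on a
    # parent domain
    start_urls = [
        "http://news.example.com/",
        "http://example.org/research/",
    ]

    # allowed_domains keeps the crawl on these hosts
    allowed_domains = ["news.example.com", "example.org"]

    # but for the path-based seed I presumably also need something to stop
    # scrapy wandering into the rest of example.org, maybe an allow pattern
    # like this?
    rules = (
        Rule(
            LinkExtractor(allow=(
                r"news\.example\.com/",
                r"example\.org/research/",
            )),
            callback="parse_page",
            follow=True,
        ),
    )

    def parse_page(self, response):
        yield {"url": response.url}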



On Tuesday, 3 May 2016 18:15:06 UTC+1, Travis Leleu wrote:
>
> Addressing just point 3, you'll need to define your business logic.  Most 
> likely, if you don't limit the domains to crawl, you'll crawl forever.
>
> Personally, I tend to approach these types of scrapes in passes.  As you 
> mentioned, your NLP will be done separately, so your first pass will use 
> scrapy to cache every page on your 30+ seed sites.  Build that, get your 
> data (the websites) on disk locally.  Then you can do multiple passes to do 
> a controlled discovery of other sites.  One technique, in your post-scrape 
> processing, would be to look at links to other domains that have your 
> keywords in them.  Take those links and push them into a queue for further 
> scraping.  (You might do an exploratory scrape of these new sites, maybe 20 
> or so pages, and evaluate keyword density to determine whether it's worth 
> scraping more.)
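>
> A rough, untested sketch of that discovery pass (the keyword list and the 
> seed host set here are made up, not anything scrapy gives you out of the 
> box):
>
> from urllib.parse import urljoin, urlparse
>
> from parsel import Selector  # the selector library scrapy itself uses
>
> KEYWORDS = ["solar", "battery"]        # whatever your keyword list is
> SEED_HOSTS = {"news.example.com"}      # hosts already covered by the seed crawl
>
> def candidate_links(html, base_url):
>     """Yield off-site links whose URL mentions one of the keywords."""
>     sel = Selector(text=html)
>     for href in sel.css("a::attr(href)").extract():
>         url = urljoin(base_url, href)
>         host = urlparse(url).netloc
>         if host and host not in SEED_HOSTS and any(
>             kw in url.lower() for kw in KEYWORDS
>         ):
>             yield url
>
> # run candidate_links() over the cached pages and push the results into a
> # queue of new sites for the exploratory 20-page crawls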
>
> Sounds like you have a good start.  I would advise you think of scrapy as 
> your acquisition process, which could have multiple spiders in it.  Cache 
> to disk, then you can run and rerun your extraction to your heart's content.
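>
> For the cache-to-disk part, scrapy's built-in HTTP cache is usually enough; 
> something along these lines in settings.py (values just illustrative):
>
> HTTPCACHE_ENABLED = True
> HTTPCACHE_DIR = "httpcache"      # stored under the project's .scrapy directory
> HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire
> HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
>
> # and be polite to the 30+ sites during the acquisition pass
> DOWNLOAD_DELAY = 1.0
> AUTOTHROTTLE_ENABLED = True
>
> With that in place, rerunning the extraction hits the local cache instead 
> of the live sites.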
>
>
>
> On Tue, May 3, 2016 at 7:04 AM, <[email protected]> wrote:
>
>> Hi all,
>>
>> I came across scrapy and I think it's ideal for my needs, but I'm not sure 
>> exactly how to go about designing my spider(s).
>>
>> I need to crawl a number of specific websites (more than 30) and identify 
>> pages and links that contain specific keywords. The number of keywords will 
>> probably increase after an initial pass over the sites, but I want to avoid 
>> putting excess load on the websites while still getting as much relevant 
>> content as possible. 
>> I'm not currently planning on using scrapy to extract specific entities 
>> during its crawl; I'd just be happy to get the pages and/or a list of URLs 
>> I can then feed into other processes for text mining later on.
>>
>> The websites 
>>
>>    1. are all on a specific subject, 
>>    2. *don't share* the same platform, and 
>>    3. may link to external sites that could have useful resources 
>>    containing the keywords I need.
>>
>> Is it possible to provide keywords to scrapy so it crawls pages that 
>> contain them in any particular order?
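>>
>> What I have in mind is something like boosting request priority whenever a 
>> link's text or URL mentions a keyword. A made-up, untested sketch of that 
>> idea (keyword list and seed are placeholders):
>>
>> import scrapy
>>
>> KEYWORDS = ["solar", "battery"]
>>
>> class KeywordPrioritySpider(scrapy.Spider):
>>     name = "keyword_priority"
>>     start_urls = ["http://news.example.com/"]
>>
>>     def parse(self, response):
>>         yield {"url": response.url}
>>         for link in response.css("a"):
>>             href = link.xpath("@href").extract_first()
>>             if not href:
>>                 continue
>>             url = response.urljoin(href)
>>             text = (link.xpath("string(.)").extract_first() or "") + " " + url
>>             hit = any(kw in text.lower() for kw in KEYWORDS)
>>             # higher priority = scheduled sooner by scrapy's priority queue
>>             yield scrapy.Request(url, priority=10 if hit else 0)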
>>
>> Do you recommend I build multiple spiders? I was thinking a sitemap spider 
>> might be a good starting point, or is there a way to direct scrapy to use 
>> each site's sitemap as its starting point?
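>>
>> For the sitemap idea, I was imagining something along these lines (untested, 
>> placeholder URLs):
>>
>> from scrapy.spiders import SitemapSpider
>>
>> class SitemapSeedSpider(SitemapSpider):
>>     name = "sitemap_seed"
>>     # one entry per seed site; SitemapSpider follows each sitemap and sends
>>     # every listed page to parse() by default
>>     sitemap_urls = [
>>         "http://news.example.com/sitemap.xml",
>>         "http://example.org/sitemap.xml",
>>     ]
>>
>>     def parse(self, response):
>>         yield {
>>             "url": response.url,
>>             "title": response.css("title::text").extract_first(),
>>         }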
>>
>> Michael
>>
>
>
