Addressing just point 3, you'll need to define your business logic.  Most
likely, if you don't limit the domains to crawl, you'll crawl forever.
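To make that concrete, scrapy ships with a few settings for bounding a crawl.
A minimal sketch (the values are placeholders to tune):

    # settings.py -- hard limits so a discovery crawl can't run away
    DEPTH_LIMIT = 3                   # don't follow links more than 3 hops deep
    CLOSESPIDER_PAGECOUNT = 5000      # stop a spider after this many responses
    # allowed_domains on each spider keeps requests on the sites you chose;
    # the built-in offsite middleware drops everything else.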

Personally, I tend to approach these types of scrapes in passes.  As you
mentioned, your NLP will be done separately, so your first pass will use
scrapy to cache every page on your 30+ seed sites.  Build that and get your
data (the websites) onto disk locally.  Then you can do multiple passes for
a controlled discovery of other sites.  One technique, in your post-scrape
processing, is to look at links to other domains that have your keywords in
them.  Push those links into a queue for further scraping; you might do an
exploratory scrape of each new site, maybe 20 or so pages, and evaluate
keyword density to decide whether it's worth scraping more.  A rough sketch
of that idea is below.

Sounds like you have a good start.  I would advise you to think of scrapy as
your acquisition process, which could have multiple spiders in it.  Cache to
disk, and then you can run and rerun your extraction to your heart's content.
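For the cache-to-disk part, scrapy's built-in HTTP cache does most of the
work.  A minimal first-pass sketch (the domains and file layout are
assumptions, not your actual project):

    # settings.py -- persist every response so extraction can be rerun offline
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"
    HTTPCACHE_EXPIRATION_SECS = 0     # 0 = cached pages never expire
    AUTOTHROTTLE_ENABLED = True       # be polite to the 30+ seed sites

    # spiders/seed_sites.py -- first pass: fetch everything on the seed domains
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class SeedSitesSpider(CrawlSpider):
        name = "seed_sites"
        allowed_domains = ["site1.example", "site2.example"]  # your 30+ seeds
        start_urls = ["https://site1.example/", "https://site2.example/"]

        # Follow every internal link; the HTTP cache stores the responses,
        # so this spider barely needs to do anything itself.
        rules = [Rule(LinkExtractor(), callback="parse_page", follow=True)]

        def parse_page(self, response):
            yield {"url": response.url}  # just record what was fetched

Later extraction passes can then run against the cache without touching the
sites again.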



On Tue, May 3, 2016 at 7:04 AM, <[email protected]> wrote:

> Hi all,
>
> I came across scrapy and I think it's ideal for my needs, but I'm not sure
> exactly how to go about designing my spider(s).
>
> I need to crawl a number of specific websites (greater than 30) and
> identify pages and links that contain specific keywords. The number of
> keywords will probably increase after an initial pass over the sites, but
> I want to avoid putting excess load on the websites while still getting as
> much relevant content as possible.
> I'm not currently planning on using scrapy to extract specific entities
> during its crawl; I'd just be happy to get the pages and/or a list of URLs
> I can then feed into other processes for text mining later on.
>
> The websites are
>
>    1. are all on a specific subject
>    2. *don't share* the same platform
>    3. may link to external sites that could have useful resources
>    containing the keywords I need.
>
> Is it possible to provide keywords to scrapy so that it crawls pages
> containing them, in any particular order?
>
> Do you recommend I build multiple spiders? I was thinking a sitemap spider
> might be a good starting point. Or is there a way to direct scrapy to use
> the sitemap for each site as a starting point?
>
> Michael
