We're running a crawl using Nutch, and the latest crawl seemed to be taking
a long time. Looking at the output, it appears to have gone into AOL's
search and is actually crawling search results (it's also crawling a
cgi-bin search results page on another site). This seems like it
could go on forever.
Admittedly we haven't looked at this very deeply yet (I'm not sure why
it has so many AOL search pages to crawl), but if the crawler behaves this
way by default, it strikes me as likely to be a common occurrence. Is there
something we should be doing to prevent this situation?
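
To make the question concrete, the kind of filtering we have in mind looks
roughly like the sketch below. It is plain Python, not Nutch configuration
or any Nutch API. The function name should_crawl and the two patterns are
our own assumptions about what a "search results" URL looks like, so treat
it as an illustration of the idea only.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns: our guess at what marks a dynamically
# generated page. Real sites may need different rules.
SKIP_PATTERNS = [
    re.compile(r"/cgi-bin/"),   # CGI scripts, e.g. /cgi-bin/search.pl
    re.compile(r"[?*!@=]"),     # characters typical of query URLs
]

def should_crawl(url: str) -> bool:
    """Return False for URLs that look like search or CGI pages."""
    parts = urlsplit(url)
    # Only inspect the path and query string, not the host name.
    target = parts.path + ("?" + parts.query if parts.query else "")
    return not any(p.search(target) for p in SKIP_PATTERNS)

if __name__ == "__main__":
    for u in [
        "http://example.com/about.html",
        "http://search.aol.com/aol/search?q=nutch",
        "http://example.com/cgi-bin/search.pl",
    ]:
        print(u, should_crawl(u))
```

A blanket rule like this would also skip legitimate pages that happen to
use query strings, which is part of why we are asking what the recommended
approach is.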
Thanks.
- Crawling search engines and cgi scripts Insurance Squared Inc.