Hi everyone, I hope this finds you well. I am planning to crawl the entire web. I already know how to set up Nutch, Solr, and a database, but I need advice on crawling at web scale: How many nodes should I have? What instance types should I use on, say, AWS? What should my setup look like? This may sound like I am asking to be spoon-fed, but I am hoping someone has already done the same so that I can get good advice.

My goal is to get all active links on the web for research purposes, with weekly or monthly updates. Google's search APIs cap results at 100 links, and accessing zone files has been a Herculean task because no single registrar seems to have them all, at least not without prohibitive cost. I am open to all suggestions.
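For context on the "how many nodes" question, here is my rough back-of-envelope for a weekly re-crawl. Every number in it is an assumption (target page count, sustained per-node fetch rate), not a measurement, so I would appreciate corrections from anyone who has run Nutch at scale:

```python
import math

# Rough capacity planning for a weekly full re-crawl.
# All numbers below are assumptions, not measurements.
TOTAL_PAGES = 1_000_000_000          # assumed target: ~1 billion pages
PAGES_PER_SEC_PER_NODE = 50          # assumed sustained fetch+parse rate per node
REFRESH_WINDOW_SEC = 7 * 24 * 3600   # one week, for weekly updates

pages_per_node_per_week = PAGES_PER_SEC_PER_NODE * REFRESH_WINDOW_SEC
nodes_needed = math.ceil(TOTAL_PAGES / pages_per_node_per_week)
print(nodes_needed)  # 34 with these assumed numbers
```

If anyone has real per-node throughput figures for Nutch on AWS, I could plug those in instead of my guesses.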
Best,
Ridwan