Hi everyone, I hope this finds you well. I am planning to crawl the entire web. I already know how to set up Nutch, Solr, and a database, but I need advice on crawling at web scale: How many nodes should I have? What instance types should I use on, say, AWS? What should my setup look like? This may sound like I am asking to be spoon-fed, but I am hoping someone has already done the same so that I can get good advice.

My goal is to get all active links on the web for research purposes, with weekly or monthly updates. Google's search APIs cap results at 100 links, and accessing zone files has been a Herculean task because no single registrar seems to have them all, at least not without prohibitive cost. I am open to all suggestions.
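For context on the "how many nodes" question, here is my rough back-of-envelope for a weekly re-crawl. Every number in it is an assumption (target page count, sustained per-node fetch rate), not a measurement, so I would appreciate corrections from anyone who has run Nutch at scale:

```python
import math

# Rough capacity planning for a weekly full re-crawl.
# All numbers below are assumptions, not measurements.
TOTAL_PAGES = 1_000_000_000          # assumed target: ~1 billion pages
PAGES_PER_SEC_PER_NODE = 50          # assumed sustained fetch+parse rate per node
REFRESH_WINDOW_SEC = 7 * 24 * 3600   # one week, for weekly updates

pages_per_node_per_week = PAGES_PER_SEC_PER_NODE * REFRESH_WINDOW_SEC
nodes_needed = math.ceil(TOTAL_PAGES / pages_per_node_per_week)
print(nodes_needed)  # 34 with these assumed numbers
```

If anyone has real per-node throughput figures for Nutch on AWS, I could plug those in instead of my guesses.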
Best,
Ridwan