There is still the URL crawl db, which had over 1 billion URLs at last count, so it might be a good starting point for crawling the web. It was about 250 GB in size, though, so it's not practical to download unless you have a fast connection. It is available to anyone who wants it.

Dennis

Otis Gospodnetic wrote:
Paul,

There was talk of this in the past, at least between some other people here and me, 
possibly "off-line".  Your best bet may be going to what's left of Wikia Search 
and getting their old index.  But, you see, this is exactly the problem - the index may 
be quite outdated by now.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
From: Paul Jones <[email protected]>
To: [email protected]
Sent: Sunday, June 21, 2009 7:17:21 PM
Subject: adding pre-indexed DBs together

Hi

I'm a newbie to the world of Lucene, Nutch, and Mahout; I spent all weekend on Mahout and am now looking at Nutch. So I have a question: it seems (after reading the archives) that a lot of people are using Nutch to index the web, whether for vertical searches or the web as a whole. Rather than everyone starting again from scratch, and since very little (if any) "IP" would exist in the indexes, given that nothing clever has been done to them beyond being processed by Nutch, would it not be possible to "share" these indexes with each other? E.g. if someone has built an index of all blogs, or all car-related websites, or has just indexed 100 million webpages at random. Maybe there is some technical reason I am missing.

Paul
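
[Editor's note: the mechanics Paul asks about do exist. Lucene's IndexWriter.addIndexes can fold separately built indexes into one combined index, provided they were all built with compatible schemas and analyzers. A minimal sketch against a recent Lucene API; the index paths here are made up for illustration:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeSharedIndexes {
        public static void main(String[] args) throws Exception {
            // Destination index that will hold the combined documents.
            Directory target = FSDirectory.open(Paths.get("merged-index"));
            IndexWriter writer = new IndexWriter(target,
                    new IndexWriterConfig(new StandardAnalyzer()));

            // Hypothetical indexes contributed by different crawls.
            Directory blogs = FSDirectory.open(Paths.get("indexes/blogs"));
            Directory cars  = FSDirectory.open(Paths.get("indexes/cars"));

            // addIndexes copies the segments of the source indexes into
            // the target; the sources themselves are left untouched.
            writer.addIndexes(blogs, cars);
            writer.close();
        }
    }

The merge itself is cheap, since it copies index segments rather than re-analyzing documents. As Otis and Dennis point out above, the real obstacles are non-mechanical: freshness of the shared index and the sheer size of what would have to be shipped around.]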
