There is still the URL crawl db, which had over 1 billion URLs at last
count, so it might be a good starting point for crawling the web. It was
about 250 GB in size, though, so it is not practical to download unless you
have a fast connection. It is available to anyone who wants it.
Dennis
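To make the "starting point" idea concrete: a shared URL list like this would typically be injected into a fresh crawldb before running the usual generate/fetch/updatedb cycle. Below is a minimal sketch against the Nutch 1.x Injector API; the "urls" and "crawl/crawldb" paths are placeholders chosen for illustration, not anything mentioned above.

```java
// Minimal sketch: seeding a fresh Nutch crawldb from a shared URL list.
// Assumes the Nutch 1.x API; "urls" and "crawl/crawldb" are placeholder paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class InjectSeeds {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        Injector injector = new Injector(conf);

        // "urls" holds plain-text files with one seed URL per line;
        // inject() adds them to the crawldb, from which fetch lists are
        // later produced by the generate/fetch/updatedb cycle.
        injector.inject(new Path("crawl/crawldb"), new Path("urls"));
    }
}
```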
Otis Gospodnetic wrote:
Paul,
There was talk of this in the past, at least between some other people here and me,
possibly "off-line". Your best bet may be to go to what's left of Wikia Search
and get their old index. But, you see, this is exactly the problem: the index may
be quite outdated by now.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Paul Jones <[email protected]>
To: [email protected]
Sent: Sunday, June 21, 2009 7:17:21 PM
Subject: adding pre-indexed DB's together
Hi
I'm a newbie to the world of Lucene, Nutch, and Mahout; I spent all weekend on Mahout
and am now looking at Nutch. So I have a question: it seems (after reading the
archives) that a lot of people are using Nutch to index the web, whether for
vertical searches or for the web as a whole. Rather than everyone starting
again from scratch, and since very little (if any) "IP" would exist in the
indexes, given that nothing clever has been done to them beyond being processed by
Nutch, would it not be possible to "share" all these indexes with each other?
For example, someone may have built an index of all blogs, or of all car-related
websites, or simply indexed 100 million web pages at random. Maybe there is some
technical reason I am missing.
Paul
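On the mechanics of combining pre-built indexes: since the indexes Nutch produces are plain Lucene indexes, merging separately built ones is largely a matter of Lucene's IndexWriter.addIndexes. Here is a minimal sketch written against a recent Lucene API (5.x or later; method and constructor names differ in the 2.x versions current when this thread was written). The "blog-index", "car-index", and "merged-index" paths are placeholders, not indexes anyone in the thread has offered.

```java
// Minimal sketch: merging independently built Lucene indexes into one.
// Recent Lucene API assumed; all directory paths are placeholders.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // Target index that will receive the combined segments.
        Directory merged = FSDirectory.open(Paths.get("merged-index"));

        // Indexes built separately, e.g. one per vertical crawl.
        Directory blogs = FSDirectory.open(Paths.get("blog-index"));
        Directory cars  = FSDirectory.open(Paths.get("car-index"));

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(merged, config)) {
            // addIndexes copies the segments from the source directories
            // into the target index; the sources are left untouched.
            writer.addIndexes(blogs, cars);
            writer.commit();
        }
    }
}
```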