There is still the URL crawl db, which had over 1 billion URLs at last
count, so it might be a good starting point for crawling the web. It was
about 250 GB in size, though, so it is not practical to download unless you
have a fast connection. It is available to anyone who wants it.
Dennis
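To make the "starting point" idea concrete: a shared URL list like this would typically be injected into a fresh crawldb before running the usual generate/fetch/updatedb cycle. Below is a minimal sketch against the Nutch 1.x Injector API; the "urls" and "crawl/crawldb" paths are placeholders chosen for illustration, not anything mentioned above.

```java
// Minimal sketch: seeding a fresh Nutch crawldb from a shared URL list.
// Assumes the Nutch 1.x API; "urls" and "crawl/crawldb" are placeholder paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

public class InjectSeeds {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        Injector injector = new Injector(conf);

        // "urls" holds plain-text files with one seed URL per line;
        // inject() adds them to the crawldb, from which fetch lists are
        // later produced by the generate/fetch/updatedb cycle.
        injector.inject(new Path("crawl/crawldb"), new Path("urls"));
    }
}
```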
Otis Gospodnetic wrote:
Paul,
There was talk of this in the past, at least between some other people here and me,
possibly "off-line". Your best bet may be to go to what's left of Wikia Search
and get their old index. But, you see, this is exactly the problem: the index may
be quite outdated by now.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Paul Jones <[email protected]>
To: [email protected]
Sent: Sunday, June 21, 2009 7:17:21 PM
Subject: adding pre-indexed DB's together
Hi
I'm a newbie to the world of Lucene, Nutch, and Mahout; I spent all weekend on Mahout
and am now looking at Nutch. So I have a question: it seems (after reading the
archives) that a lot of people are using Nutch to index the web, whether for
vertical searches or for the web as a whole. Rather than everyone starting
again from scratch, and since very little (if any) "IP" would exist in the
indexes, given that nothing clever has been done to them beyond being processed by
Nutch, would it not be possible to "share" all these indexes with each other?
For example, someone may have built an index of all blogs, or of all car-related
websites, or simply indexed 100 million web pages at random. Maybe there is some
technical reason I am missing.
Paul
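On the mechanics of combining pre-built indexes: since the indexes Nutch produces are plain Lucene indexes, merging separately built ones is largely a matter of Lucene's IndexWriter.addIndexes. Here is a minimal sketch written against a recent Lucene API (5.x or later; method and constructor names differ in the 2.x versions current when this thread was written). The "blog-index", "car-index", and "merged-index" paths are placeholders, not indexes anyone in the thread has offered.

```java
// Minimal sketch: merging independently built Lucene indexes into one.
// Recent Lucene API assumed; all directory paths are placeholders.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // Target index that will receive the combined segments.
        Directory merged = FSDirectory.open(Paths.get("merged-index"));

        // Indexes built separately, e.g. one per vertical crawl.
        Directory blogs = FSDirectory.open(Paths.get("blog-index"));
        Directory cars  = FSDirectory.open(Paths.get("car-index"));

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(merged, config)) {
            // addIndexes copies the segments from the source directories
            // into the target index; the sources are left untouched.
            writer.addIndexes(blogs, cars);
            writer.commit();
        }
    }
}
```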