Yeah, you are right. You have to have a constrained set of domains to search, and to be honest, that works pretty well. The only problem is that I still get a lot of junk links: I would say about 30% are valid or interesting links, while the rest are pretty much worthless. I guess it is a matter of studying spam filters and filtering those out, but I have been kind of lazy about doing so.
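For what it's worth, the whitelisting part is just the stock Nutch URL filter. A rough sketch of what I mean (assuming the 0.9-style conf/crawl-urlfilter.txt; newer setups use regex-urlfilter.txt instead, and the domain names below are just placeholders, not my actual list):

    # skip file:, ftp: and mailto: urls
    -^(file|ftp|mailto):
    # accept only the hand-picked domains
    +^http://([a-z0-9]*\.)*example.com/
    +^http://([a-z0-9]*\.)*another-site.org/
    # reject everything else
    -.

The rules are applied top-down and the first match wins, so the final "-." line is what keeps the crawl from wandering off the whitelist. It doesn't help with the junk *within* those domains, though, which is the 70% I mentioned.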
http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled

I have already built a site like the one I am describing, based on a short
list of popular domains and using the very basic features of Nutch. You can
try the search above and see what you think. My last crawl collected about
100k links.

On 10/13/07, Pike <[EMAIL PROTECTED]> wrote:
> Hi
>
> > My question: have you built a general site to crawl the internet, and
> > how did you find links that people would be interested in as opposed
> > to capturing a lot of the junk out there?
>
> interesting question. are you planning to build a new google?
> if you are planning to crawl without any limit on, for example, a few
> domains, your indexes will go wild very quickly :-)
>
> we are using nutch now with an extensive list of
> 'interesting domains' - this list is an editorial effort.
> search results are limited to those domains.
> http://www.labforculture.org/opensearch/custom
>
> another application would be to use nutch to crawl
> certain pages, like 'interesting' search results from
> other sites, with a limited depth. this would yield
> 'interesting' indexes.
>
> yet another application would be to crawl 'interesting'
> rss feeds with a depth of 1. I haven't got that working
> yet (see the parse-rss discussion these days).
>
> nevertheless, I am interested in the question:
> anyone else having examples of "possible public
> applications with nutch"?
>
> $2c,
> *pike

--
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies