Yeah, you are right.  You have to have a constrained set of domains to
search, and to be honest, that works pretty well.  The only thing is, I
still get a lot of junk links.  I would say about 30% are valid or
interesting links while the other 70% are pretty much worthless.  I guess
it is a matter of studying spam filters and weeding that junk out, but I
have been kind of lazy about doing so.
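
When I do get around to the spam filtering, I am thinking of starting with
something really crude, roughly along these lines (just a sketch, not code
from the site -- the keyword list and the length cutoff are made up):

  import java.util.Arrays;
  import java.util.List;

  public class JunkLinkFilter {
      // hypothetical tokens that tend to show up in junk URLs
      private static final List<String> JUNK_TOKENS =
          Arrays.asList("casino", "ringtone", "warez", "viagra");

      public static boolean looksLikeJunk(String url, String anchorText) {
          String haystack = (url + " " + anchorText).toLowerCase();
          for (String token : JUNK_TOKENS) {
              if (haystack.contains(token)) {
                  return true;
              }
          }
          // very long query strings are usually machine-generated pages
          int q = url.indexOf('?');
          return q >= 0 && url.length() - q > 100;
      }
  }

Nothing clever, but something like that would be easy to drop in between
the crawl and the index step and then tune from there.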

http://botspiritcompany.com/botlist/spring/search/global_search.html?query=bush&querymode=enabled

I have already built the site I am describing, based on a short list of
popular domains and using only the very basic features of Nutch.  You can
try the search at the link above and see what you think.  I had about 100k
links with my last crawl.
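
In case it is useful, the domain restriction itself is nothing fancy: it is
basically the stock Nutch URL filter with a whitelist of hosts, something
like this in conf/crawl-urlfilter.txt (MY.DOMAIN.NAME stands in for each
domain on the list, repeated once per domain):

  # skip file:, ftp:, and mailto: urls
  -^(file|ftp|mailto):

  # skip URLs with characters that usually mean queries or session ids
  -[?*!@=]

  # accept hosts on the whitelist
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

  # skip everything else
  -.

That plus a fairly shallow -depth on the crawl command keeps the index from
going wild.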

On 10/13/07, Pike <[EMAIL PROTECTED]> wrote:
> Hi
>
> > My question; have you build a general site to crawl the internet and
> > how did you find links that people would be interested in as opposed
> > to capturing a lot of the junk out there.
>
> interesting question. are you planning to build a new google?
> if you are planning to crawl without any limit to, e.g., a few
> domains, your indexes will go wild very quickly :-)
>
> we are using nutch now with an extensive list of
> 'interesting domains' - this list is an editorial effort.
> search results are limited to those domains.
> http://www.labforculture.org/opensearch/custom
>
> another application would be to use nutch to crawl
> certain pages, like 'interesting' search results from
> other sites, with a limited depth. this would yield
> 'interesting' indexes.
>
> yet another application would be to crawl 'interesting'
> rss feeds with a depth of 1. I haven't got that working
> yet (see the recent parse-rss discussion).
>
> nevertheless, I am interested in the question:
> anyone else have examples of "possible public
> applications with nutch"?
>
> $2c,
> *pike


-- 
Berlin Brown
http://botspiritcompany.com/botlist/spring/help/about.html
newspirit technologies
