RE: [Nutch-general] Re: Assumptions About Internet Search

Dave Cohen Fri, 06 Feb 2004 19:08:14 -0800

Thanks for this clarification.

What I like best about the Nutch technology
is the ability to create and run topic-specific search
engines (for medicine, social science, or technology, say),
each with a comprehensive, specially constructed index
to support such searches. These indexes need not be that
large - certainly a few million pages would do the trick
in most cases.

I was under the false impression that Nutch wanted to be
a single catch-all interface designed to match or surpass
something like Google. Naturally, I felt that goal was not
realistic. Also, I feel it's worth mentioning that Matt
Wells is doing something similar at his site, which is
http://www.gigablast.com/. As far as I know, however, his
is not an open-source project. He's in business and who
can blame him. He has done some good work.

best,

Dave

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Doug Cutting
Sent: Thursday, February 05, 2004 10:57 AM
To: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] Re: Assumptions About Internet Search

Dave Cohen wrote:
> Maybe it could, that's a design issue. I'm not sure Nutch will ever
> achieve "lift off" if it's just like Google and requires the same
> resources and money to exist.

Nutch's core differentiator from Google, Inktomi, etc. is that it's open
source: anyone can run the software.  In the short term, we don't aim to
be better, but rather just to bring more-or-less equivalent technology to
the open source world.

This alone creates lots of opportunities.  For example: it enables folks 
to run their own search engines with unlimited API-based queries per 
day; folks can analyze query data; folks can experiment with ranking 
algorithms; etc.  So equivalent technology doesn't mean that nothing new 
can happen.  In many ways Linux is equivalent to XP, yet the fact that 
it is open source enables many things that are impossible with XP.

As for the costs and resources: on a per-query basis Google's operation
is actually quite inexpensive.  Things only get really expensive when
you have the quantity of traffic that Google has.  A Nutch implementation
with a few hundred million URLs that can handle a few queries per second 
can be maintained on a handful of inexpensive machines.

If even that is beyond your means, we're deploying a ~200M page 
installation (hosted by the Internet Archive) that we can give 
experimental access to.  So if you develop Nutch-based code that you'd 
like to run on a larger collection than you can afford to maintain, 
please contact us, and we can set you up with an account.

Cheers,

Doug

-------------------------------------------------------
The SF.Net email is sponsored by EclipseCon 2004
Premiere Conference on Open Tools Development and Integration
See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
http://www.eclipsecon.org/osdn
_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general

